Applied Data Science and Big Data Analytics

Overview
Business success in the information age is predicated on the ability of organizations to convert raw data coming from various sources into high-grade business information.
To stay competitive, organizations have started adopting new approaches to data processing and analysis. For example, data scientists are turning to Apache Spark for processing massive amounts of data using Spark’s distributed compute capability along with its built-in machine learning library, or switching from proprietary and costly solutions to the free R programming language.

Who Should Take This Course
AUDIENCE
Data Scientists, Software Developers, IT Architects, and Technical Managers.
PREREQUISITES
Participants should have a general knowledge of statistics and programming.

Why You Should Take This Course
This intensive training course covers the theoretical and technical aspects of Data Science and Business Analytics, including fundamental and advanced concepts and methods for deriving business insights from “big” and/or “small” data. The course is supplemented by hands-on labs that help attendees reinforce their theoretical knowledge of the material.

Course Outline
CHAPTER 1. APPLIED DATA SCIENCE
 What is Data Science?
 Data Science Ecosystem
 Data Mining vs. Data Science
 Business Analytics vs. Data Science
 Who is a Data Scientist?
 Data Science Skill Sets Venn Diagram
 Data Scientists at Work
 Examples of Data Science Projects
 An Example of a Data Product
 Applied Data Science at Google
 Data Science Gotchas
 Summary
CHAPTER 2. GETTING STARTED WITH R
 Introduction
 Positioning of R in the Data Science Arena
 R Integrated Development Environments
 Running R
 Running RStudio
 Ending the Current R Session
 Getting Help
 Getting System Information
 General Notes on R Commands and Statements
 R Data Structures
 R Objects and Workspace
 Assignment Operators
 Assignment Example
 Arithmetic Operators
 Logical Operators
 System Date and Time
 Operations
 User-defined Functions
 User-defined Function Example
 R Code Example
 Type Conversion (Coercion)
 Control Statements
 Conditional Execution
 Repetitive Execution
 Repetitive execution
 Built-in Functions
 Reading Data from Files into Vectors
 Example of Reading Data from a File
 Writing Data to a File
 Example of Writing Data to a File
 Logical Vectors
 Character Vectors
 Matrix Data Structure
 Creating Matrices
 Working with Data Frames
 Matrices vs Data Frames
 A Data Frame Sample
 Accessing Data Cells
 Getting Info About a Data Frame
 Selecting Columns in Data Frames
 Selecting Rows in Data Frames
 Getting a Subset of a Data Frame
 Sorting (ordering) Data in Data Frames by Attribute(s)
 Applying Functions to Matrices and Data Frames
 Using the apply() Function
 Example of Using apply()
 Executing External R commands
 Loading External Scripts in RStudio
 Listing Objects in Workspace
 Removing Objects in Workspace
 Saving Your Workspace in R
 Saving Your Workspace in RStudio
 Saving Your Workspace in R GUI
 Loading Your Workspace
 Hands-on Exercises
 Getting and Setting the Working Directory
 Getting the List of Files in a Directory
 Diverting Output to a File
 Batch (Unattended) Processing
 Importing Data into R
 Exporting Data from R
 Hands-on Exercise
 Standard R Packages
 Extending R
 Extending R in R GUI
 Extending R in RStudio
 CRAN Page
 Summary
CHAPTER 3. R STATISTICAL COMPUTING FEATURES
 Statistical Computing Features
 Descriptive Statistics
 Basic Statistical Functions
 Examples of Using Basic Statistical Functions
 Nonuniformity of a Probability Distribution
 Writing Your Own skew and kurtosis Functions
 Hands-on Exercise
 Generating Normally Distributed Random Numbers
 Generating Uniformly Distributed Random Numbers
 Using the summary() Function
 Math Functions Used in Data Analysis
 Examples of Using Math Functions
 Correlations
 Correlation Example
 Testing Correlation Coefficient for Significance
 The cor.test() Function
 The cor.test() Example
 Regression Analysis
 Types of Regression
 Simple Linear Regression Model
 Least-Squares Method (LSM)
 LSM Assumptions
 Fitting Linear Regression Models in R
 Example of Using lm()
 Confidence Intervals for Model Parameters
 Example of Using lm() with a Data Frame
 Regression Models in Excel
 Hands-on Exercise
 Multiple Regression Analysis
 Finding the Best-Fitting Regression Model
 Comparing Regression Models
 Hands-on Exercise
 Summary
CHAPTER 4. DATA ANALYTICS LIFECYCLE PHASES
 Big Data Analytics Pipeline
 Data Discovery Phase
 Data Harvesting Phase
 Data Priming Phase
 Exploratory Data Analysis
 Model Planning Phase
 Model Building Phase
 Communicating the Results
 Production Rollout
 Summary
CHAPTER 5. DATA SCIENCE ALGORITHMS AND ANALYTICAL METHODS
 Supervised vs Unsupervised Machine Learning
 Supervised Machine Learning Algorithms
 Unsupervised Machine Learning Algorithms
 Choose the Right Algorithm
 Lifecycles of Machine Learning Development
 Classifying with k-Nearest Neighbors (SL)
 k-Nearest Neighbors Algorithm
 k-Nearest Neighbors Algorithm
 The Error Rate
 Hands-on Exercise
 Decision Trees (SL)
 Decision Tree Terminology
 Decision Trees in Pictures
 Decision Tree Classification in Context of Information Theory
 Information Entropy Defined
 The Shannon Entropy Formula
 The Simplified Decision Tree Algorithm
 Using Decision Trees
 Random Forests
 Naive Bayes Classifier (SL)
 Naive Bayesian Probabilistic Model in a Nutshell
 Bayes Formula
 Classification of Documents with Naive Bayes
 Unsupervised Learning Type: Clustering
 K-Means Clustering (UL)
 K-Means Clustering in a Nutshell
 Regression Analysis
 Simple Linear Regression Model
 Linear vs Non-Linear Regression
 Linear Regression Illustration
 Major Underlying Assumptions for Regression Analysis
 Least-Squares Method (LSM)
 Locally Weighted Linear Regression
 Regression Models in Excel
 Multiple Regression Analysis
 Logistic Regression
 Regression vs Classification
 Time-Series Analysis
 Decomposing Time-Series
 Monte-Carlo Simulation (Method)
 Who Uses Monte-Carlo Simulation?
 Monte-Carlo Simulation in a Nutshell
 Monte-Carlo Simulation Example
 Monte-Carlo Simulation Example
 Hands-on Exercise
 Summary
CHAPTER 6. VISUALIZING AND REPORTING PROCESSED RESULTS
 Data Visualization
 Data Visualization in R
 The ggplot2 Data Visualization Package
 Creating Bar Plots in R
 Creating Horizontal Bar Plots
 Using barplot() with Matrices
 Using barplot() with Matrices Example
 Customizing Plots
 Histograms in R
 Building Histograms with hist()
 Example of using hist()
 Pie Charts in R
 Examples of using pie()
 Generic XY Plotting
 Examples of the plot() function
 Dot Plots in R
 Saving Your Work
 Supported Export Options
 Plots in RStudio
 Saving a Plot as an Image
 The BIRT Project
 JavaFX
 Data Visualization with JavaFX
 Visualization with D3 JavaScript Library
 Examples of D3 Visualization
 Google Charts
 Summary
CHAPTER 7. TEXT MINING
 What is Text Mining?
 The Common Text Mining Tasks
 What is Natural Language Processing (NLP)?
 Some of the NLP Use Cases
 Machine Learning in Text Mining and NLP
 Machine Learning in NLP
 TF-IDF
 The Feature Hashing Trick
 Stemming
 Example of Stemming
 Stop Words
 Popular Text Mining and NLP Libraries and Packages
 Summary
CHAPTER 8. INTRODUCTION TO FUNCTIONAL PROGRAMMING
 What is Functional Programming (FP)?
 Terminology: First-Class and Higher-Order Functions
 Terminology: Lambda vs Closure
 A Short List of Languages that Support FP
 FP with Java
 FP With JavaScript
 Imperative Programming in JavaScript
 The JavaScript map (FP) Example
 The JavaScript reduce (FP) Example
 Using reduce to Flatten an Array of Arrays (FP) Example
 The JavaScript filter (FP) Example
 Common High-Order Functions in Python
 Common High-Order Functions in Scala
 Elements of FP in R
 Summary
CHAPTER 9. BIG DATA BUSINESS INTELLIGENCE AND ANALYTICS
 Traditional Business Intelligence and Analytics
 OLAP Tasks
 Data Mining Tasks
 Big Data / NoSQL Solutions
 NoSQL Data Querying and Processing
 The UnQL Specification
 MapReduce Defined
 MapReduce Explained
 Hadoop
 Hadoop-based Systems for Data Analysis
 Hadoop’s Streaming MapReduce
 Streaming Use Cases
 Making things simpler with Hadoop Pig Latin
 Pig Latin Script Example
 SQL Equivalent
 What is Hive?
 Interfacing with Hive
 Hive Data Definition Language
 Business Analytics with Hive
 What is Spark?
 Amazon Elastic MapReduce
 Big Data with Google App Engine (GAE)
 GAE Dashboard
 Example of Google App Engine Java Datastore API
 Google Cloud Prediction API
 Summary
CHAPTER 10. INTRODUCTION TO APACHE SPARK
 What is Spark
 A Short History of Spark
 Where to Get Spark?
 The Spark Platform
 Spark Logo
 Common Spark Use Cases
 Languages Supported by Spark
 Running Spark on a Cluster
 The Driver Process
 Spark Applications
 Spark Shell
 The spark-submit Tool
 The spark-submit Tool Configuration
 The Executor and Worker Processes
 The Spark Application Architecture
 Interfaces with Data Storage Systems
 Limitations of Hadoop’s MapReduce
 Spark vs MapReduce
 Spark as an Alternative to Apache Tez
 The Resilient Distributed Dataset (RDD)
 Spark Streaming (Micro-batching)
 Spark SQL
 Example of Spark SQL
 Spark Machine Learning Library
 GraphX
 Spark vs R
 Summary
CHAPTER 11. THE SPARK SHELL
 The Spark Shell
 The Spark Shell UI
 Spark Shell Options
 Getting Help
 The Spark Context (sc) and SQL Context (sqlContext)
 The Shell Spark Context
 Loading Files
 Saving Files
 Basic Spark ETL Operations
 Summary
CHAPTER 12. SPARK RDDS
 The Resilient Distributed Dataset (RDD)
 Ways to Create an RDD
 Custom RDDs
 Supported Data Types
 RDD Operations
 RDDs are Immutable
 Spark Actions
 RDD Transformations
 Other RDD Operations
 Chaining RDD Operations
 RDD Lineage
 The Big Picture
 What May Go Wrong
 Checkpointing RDDs
 Local Checkpointing
 Parallelized Collections
 More on parallelize() Method
 The Pair RDD
 Where do I use Pair RDDs?
 Example of Creating a Pair RDD with Map
 Example of Creating a Pair RDD with keyBy
 Miscellaneous Pair RDD Operations
 RDD Caching
 RDD Persistence
 The Tachyon Storage
 Summary
CHAPTER 13. PARALLEL DATA PROCESSING WITH SPARK
 Running Spark on a Cluster
 Spark Standalone Option
 The High-Level Execution Flow in Standalone Spark Cluster
 Data Partitioning
 Data Partitioning Diagram
 Single Local File System RDD Partitioning
 Multiple File RDD Partitioning
 Special Cases for Small-sized Files
 Parallel Data Processing of Partitions
 Spark Application, Jobs, and Tasks
 Stages and Shuffles
 The “Big Picture”
 Summary
CHAPTER 14. INTRODUCTION TO SPARK SQL
 What is Spark SQL?
 Uniform Data Access with Spark SQL
 Hive Integration
 Hive Interface
 Integration with BI Tools
 Spark SQL is No Longer an Experimental Developer API!
 What is a DataFrame?
 The SQLContext Object
 The SQLContext API
 Changes from Spark SQL 1.3 to 1.4
 Example of Spark SQL (Scala Example)
 Example of Working with a JSON File
 Example of Working with a Parquet File
 Using JDBC Sources
 JDBC Connection Example
 Performance & Scalability of Spark SQL
 Summary
CHAPTER 15. GRAPH PROCESSING WITH GRAPHX
 What is GraphX?
 Supported Languages
 Vertices and Edges
 Graph Terminology
 Example of Property Graph
 The GraphX API
 The GraphX Views
 The Triplet View
 Graph Algorithms
 Graphs and RDDs
 Constructing Graphs
 Graph Operators
 Example of Using GraphX Operators
 GraphX Performance Optimization
 The PageRank Algorithm
 GraphX Support for PageRank
 Summary
CHAPTER 16. THE SPARK MACHINE LEARNING LIBRARY
 What is MLlib?
 Supported Languages
 MLlib Packages
 Dense and Sparse Vectors
 Labeled Point
 Python Example of Using the LabeledPoint Class
 LIBSVM format
 An Example of a LIBSVM File
 Loading LIBSVM Files
 Local Matrices
 Example of Creating Matrices in MLlib
 Distributed Matrices
 Example of Using a Distributed Matrix
 Classification and Regression Algorithm
 Clustering
 Summary
CHAPTER 17. MACHINE LEARNING WITH BIGML
 What is BigML?
 How BigML Service Works
 Data Files
 Data Sets
 Data Sets Example
 Models
 Predictions
 The Prediction UI Form
 Text Analysis in BigML
 REST API
 Summary
LAB EXERCISES
Lab 1. Learning the Lab Environment
Lab 2. Getting Started with R
Lab 3. Working with R
Lab 4. Data Import and Export in R
Lab 5. Creating Your Own Statistical Functions
Lab 6. Simple Linear Regression
Lab 7. Multiple Linear Regression
Lab 8. k-Nearest Neighbors Algorithm
Lab 9. Monte-Carlo Simulation (Method)
Lab 10. Using R Graphics Package
Lab 11. Using the D3 JavaScript Visualization Library
Lab 12. Common Text Mining Tasks with the tm Library
Lab 13. Elements of Functional Programming with Python
Lab 14. The Spark Shell
Lab 15. RDD Performance Improvement Techniques
Lab 16. Spark ETL and HDFS Interface
Lab 17. Common Map / Reduce Programs in Spark
Lab 18. Spark SQL
Lab 19. Getting Started with GraphX
Lab 20. PageRank with GraphX
Lab 21. Using k-means Algorithm from MLlib
Lab 22. Using Random Forests for Classification with Spark MLlib
Lab 23. Text Classification with Spark ML Pipeline 
FAQs
Is there a discount available for current students?
UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Online courses are excluded from this offer.
What is the cancellation and refund policy?
Students will receive a refund of paid registration fees only if UMBC Training Centers receives a notice of cancellation at least 10 business days prior to the class start date (for classes) or the exam date (for exams).