Data Engineering with PySpark

Overview

Data Engineering has become an important discipline in the Data Science space. To do productive work, Data Analysts need consistent datasets to analyze. A Data Engineer provides this consistency by accessing data in a variety of formats with a variety of tools. This class introduces programmers to tools for ETL applications and big data applications using Apache Spark. Participants will gain experience with PySpark, the Spark SQL module, and DataFrames.

Who Should Take This Course

Audience

This course is suitable for Software Developers, Data Scientists, and anyone who needs to manipulate large datasets.

Prerequisites

Students should have a general background in programming and/or data processing, and should be able to learn a new language (Python) through stepwise exercises.

Course Outline

Chapter 1. Defining Data Engineering

• What is Data Engineering?
• How is it different from Data Science?

Chapter 2. The Data Engineer Role

• The Scope of the Data Engineer Role
• Data Scientists, Machine Learning Specialists, and Data Engineers

Chapter 3. Data Processing Phases

• Data Ingestion
• Data Cleansing

Chapter 4. Distributed Computing Concepts

• Data Physics
• CAP Theorem
• Hadoop

Chapter 5. Apache Spark

• Supported Languages
• Distributed Data Processing with PySpark

Chapter 6. Apache Spark Dev Environments

• Spark Shells
• Jupyter Notebooks

Chapter 7. Introduction to Functional Programming

• Why Do I Need Functional Programming?
• Functional Programming with Python

Chapter 8. Functional Programming using Spark RDD API

• RDD Transformations and Actions
• Data Partitioning

Chapter 9. ETL Jobs with RDD

• Using map-reduce FP for Data Processing

Chapter 10. Spark SQL DataFrames

• What are DataFrames?
• Relationship with RDDs
• Ways to Create DataFrames
• Schema of Datasets
• Inferring the Schema

Chapter 11. SQL-centric Programming using DataFrames API

• Using the sql Method and the Native DataFrame API
• Data Aggregation

Chapter 12. ETL Jobs with DataFrames

• Using Spark SQL DataFrame API
• Contrasting with Spark RDD API

Chapter 13. Repairing and Normalizing Data

• What May Be Wrong With My Data?
• Detecting and Removing Bad Data

Chapter 14. Data Visualization with seaborn

• Exploratory Data Analysis
• Available Options for Producing Graphs

Chapter 15. Working with Various File Formats: CSV, Parquet, ORC, and JSON

• What Are Columnar Data Storage Formats?
• Comparing Various Formats
• Ways to Read and Store Data in Various Formats

FAQs

Is there a discount available for current students?

UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Asynchronous courses are excluded from this offer.

What is the cancellation and refund policy?

Students will receive a refund of paid registration fees only if UMBC Training Centers receives notice of cancellation at least 10 business days before the class start date (for classes) or the exam date (for exams).