Python for Data Engineers
This course focuses on Python tools for data science initiatives. A primary goal is to leverage the powerful Python data science package pandas. You will learn about data pipelines and build Extract, Transform, Load (ETL) processes using Python and pandas. Data engineers and data scientists often deploy ETL processes to ingest data for use in an application and to manipulate that data for use by analysts. Participants will build ETL pipelines and ingest JSON and CSV files using APIs, Python, and pandas. The focus is on using the pandas package to ingest data into DataFrames, filter and clean the data, and store the final dataset either locally or in the cloud. Finally, participants will gain experience building a full data pipeline application using Python.
Audience
This course is suitable for anyone who has a firm understanding of basic programming structures in Python and who wants to extend their applied knowledge of Python to data engineering tasks such as data ingest, data wrangling, and ETL processes.
Prerequisites
Students should have basic proficiency coding in Python, including an understanding of Python data types, Boolean logic, control flow, and looping constructs, as well as the basics of Python collections such as lists and dictionaries.
Important learning outcomes include:
● Understand the different components of modern data pipelines.
● Apply Python to data engineering tasks such as data ingest, data filtering, and data cleaning.
● Understand the use of Extract, Transform, and Load (ETL) processes on data.
● Use the Python Data Science library, pandas, to ingest CSV data into a data pipeline.
● Leverage pandas for the standard data engineering tasks of filtering and cleaning data in an ETL process.
● Work with data formats commonly used by data engineers and developers in data science, including JSON and CSV.
● Use HTTP and the Python requests module to access data made available by APIs.
● Gain basic data analytics skills by working with large datasets in the pandas library.
Getting Started with Jupyter Notebooks
● Running Python Notebooks from Google Colaboratory
The Data Pipeline and ETL (Extract, Transform, Load)
● What is a Data Pipeline?
● What is an ETL process?
Data Formats
● CSV and TSV File Formats
● Structure of JSON Data
● What is NDJSON?
● Using the csv and json Modules
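As a flavor of this module, here is a minimal sketch (not taken from the course materials) of converting CSV text into NDJSON, where each line is one standalone JSON object, using only the standard-library csv and json modules. The sample data and the function names csv_to_ndjson and ndjson_to_records are hypothetical and exist only for illustration.

```python
import csv
import io
import json

# Hypothetical sample data; in practice this would come from a file on disk.
CSV_TEXT = "id,name,city\n1,Ada,London\n2,Grace,New York\n"


def csv_to_ndjson(csv_text):
    """Convert CSV text into NDJSON: one JSON object per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


def ndjson_to_records(ndjson_text):
    """Parse NDJSON text back into a list of dictionaries."""
    return [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]


ndjson_text = csv_to_ndjson(CSV_TEXT)
records = ndjson_to_records(ndjson_text)
print(ndjson_text)
```

Note that csv.DictReader returns every field as a string, so numeric columns such as id would still need an explicit conversion step.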
Data Ingest via Application Programming Interfaces (APIs)
● Hypertext Transfer Protocol (HTTP)
● Application Programming Interfaces (APIs)
● Python requests Library
● Ingesting Data Sources via API Calls
Pandas Basics
● Why Pandas?
● Series
● DataFrames
● Populating DataFrames
● Importing CSV and Excel Data
● DataFrame Columns and Cells
● Manipulating Data in pandas DataFrames
Pandas and Data Wrangling
● Data Conversion
● Functions on DataFrames
● Sorting
● Statistics
● Data Cleaning
● Data Filtering
● Groupby
● Aggregate Functions
● Data Analysis
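To illustrate how several of these topics fit together, the following is a small sketch, assuming pandas is installed, that converts a text column to numbers, cleans inconsistent labels, filters out incomplete rows, and totals the values with groupby. The DataFrame contents and the function name clean_and_summarize are invented for this example and are not part of the course materials.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
raw = pd.DataFrame(
    {
        "city": ["Baltimore", "baltimore ", "Columbia", "Columbia", None],
        "amount": ["10.5", "4.5", "7", "not available", "3"],
    }
)


def clean_and_summarize(df):
    """Clean the sample columns, drop bad rows, and total amount per city."""
    cleaned = df.copy()
    # Data conversion: unparseable strings become NaN instead of raising.
    cleaned["amount"] = pd.to_numeric(cleaned["amount"], errors="coerce")
    # Data cleaning: normalize whitespace and capitalization in city names.
    cleaned["city"] = cleaned["city"].str.strip().str.title()
    # Data filtering: keep only rows with both a city and a numeric amount.
    cleaned = cleaned.dropna(subset=["city", "amount"])
    # Groupby with an aggregate function, then sort by the total.
    return (
        cleaned.groupby("city", as_index=False)["amount"]
        .sum()
        .sort_values("amount", ascending=False)
        .reset_index(drop=True)
    )


summary = clean_and_summarize(raw)
print(summary)
```

Passing errors="coerce" is one possible design choice: it keeps the pipeline running when a value cannot be parsed and leaves the decision about those rows to the filtering step.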
Full ETL Application
● Extracting the data that you need from Data Ingest
● Assembling NDJSON files
● Uploading files to Cloud Storage or writing files locally
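The three stages above can be sketched end to end as follows. This is an illustrative outline under stated assumptions rather than the course's actual project: API_RESPONSE is a hypothetical payload standing in for a parsed API response, the extract, transform, and load functions are made-up names, and the load step writes an NDJSON file to a local temporary directory (uploading to cloud storage is not shown).

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical payload standing in for a parsed API response.
API_RESPONSE = {
    "results": [
        {"id": 1, "name": "sensor-a", "reading": 21.5, "status": "ok"},
        {"id": 2, "name": "sensor-b", "reading": None, "status": "error"},
        {"id": 3, "name": "sensor-c", "reading": 19.0, "status": "ok"},
    ]
}


def extract(payload):
    """Extract: pull the list of records out of the payload into a DataFrame."""
    return pd.DataFrame(payload["results"])


def transform(df):
    """Transform: keep valid rows and only the columns needed downstream."""
    kept = df[df["status"] == "ok"].dropna(subset=["reading"])
    return kept[["id", "name", "reading"]].reset_index(drop=True)


def load(df, path):
    """Load: write the DataFrame as NDJSON, one JSON record per line."""
    df.to_json(path, orient="records", lines=True)


output_path = Path(tempfile.mkdtemp()) / "readings.ndjson"
load(transform(extract(API_RESPONSE)), output_path)

# Read the file back to confirm each line is a standalone JSON object.
loaded = [json.loads(line) for line in output_path.read_text().splitlines() if line]
print(loaded)
```

Keeping each stage in its own function makes it possible to swap the local write for a cloud upload later without touching the extract or transform steps.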
Is there a discount available for current students?
UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Online courses are excluded from this offer.
What is the cancellation and refund policy?
Students will receive a refund of paid registration fees only if UMBC Training Centers receives a notice of cancellation at least 10 business days prior to the class start date for classes or the exam date for exams.