Hadoop With Spark

Overview

Hadoop is a mature Big Data environment, and Hive is the de facto standard for its SQL interface. Today, computations in Hadoop are usually done with Spark, which offers an optimized compute engine that supports batch processing, real-time streaming, and machine learning.

This course covers Hadoop 3, Hive 3, and Spark 3.

Who Should Take This Course

AUDIENCE

This course is suitable for business analysts, software developers, and managers.

PREREQUISITES

  • Basics of SQL
  • Exposure to software design
  • Some experience with a programming language such as C, Python or Java (Python preferred)
  • No prior Hadoop knowledge is required
Course Outline

Why Hadoop?

  • The motivation for Hadoop
  • Use cases and case studies about Hadoop

The Hadoop platform

  • MapReduce, HDFS, YARN
  • New in Hadoop 3
  • Erasure Coding vs 3x replication

Hive Basics

  • Defining Hive Tables
  • SQL Queries over Structured Data
  • Filtering / Search
  • Aggregations / Ordering
  • Partitions
  • Joins
  • Text Analytics (Semi-Structured Data)

New in Hive 3

  • ACID tables
  • Hive Query Language (HQL)
  • How to run a good query?
  • How to troubleshoot queries?

HBase

  • Basics
  • HBase tables – design and use
  • Phoenix driver for HBase tables

Sqoop

  • Tool
  • Architecture
  • Use

The big picture

  • How Hadoop fits into your architecture
  • Hive vs HBase with Phoenix vs Excel

Spark Introduction

  • Big Data, Hadoop, Spark
  • Spark concepts and architecture
  • Spark components overview
  • Labs: Installing and running Spark

First Look at Spark

  • Spark shell
  • Spark web UIs
  • Analyzing dataset – part 1
  • Labs: Spark shell exploration

Spark Data structures

  • Partitions
  • Distributed execution
  • Operations: transformations and actions
  • Labs: Unstructured data analytics using RDDs

Caching

  • Caching overview
  • Various caching mechanisms available in Spark
  • In-memory file systems
  • Caching use cases and best practices
  • Labs: Benchmark of caching performance

DataFrames and Datasets

  • DataFrames Intro
  • Loading structured data (JSON, CSV) using DataFrames
  • Using schema
  • Specifying schema for DataFrames
  • Labs: DataFrames, Datasets, Schema

Spark SQL

  • Spark SQL concepts and overview
  • Defining tables and importing datasets
  • Querying data using SQL
  • Handling various storage formats: JSON / Parquet / ORC
  • Labs: querying structured data using SQL; evaluating data formats

Spark and Hadoop

  • Hadoop + Spark architecture
  • Running Spark on Hadoop YARN
  • Processing HDFS files using Spark
  • Spark & Hive

Spark API

  • Overview of Spark APIs in Scala / Python
  • Life cycle of a Spark application
  • Spark APIs
  • Deploying Spark applications on YARN
  • Labs: Developing and deploying a Spark application

Spark ML Overview

  • Machine Learning primer
  • Machine Learning in Spark: MLlib / ML
  • Spark ML overview (newer Spark 2 version)
  • Algorithms overview: Clustering, Classifications, Recommendations
  • Labs: Writing ML applications in Spark

GraphX

  • GraphX library overview
  • GraphX APIs
  • Creating a graph and navigating it
  • Shortest distance
  • Pregel API
  • Labs: Processing graph data using Spark

Spark Streaming

  • Streaming concepts
  • Evaluating Streaming platforms
  • Spark streaming library overview
  • Streaming operations
  • Sliding window operations
  • Structured Streaming
  • Continuous streaming
  • Spark & Kafka streaming
  • Labs: Writing Spark Streaming applications

Workshops (Time permitting)

  • These are group workshops
  • Attendees will work on solving real-world data analysis problems using Spark
FAQs
Is there a discount available for current students?

UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Asynchronous courses are excluded from this offer.

What is the cancellation and refund policy?

Students will receive a refund of paid registration fees only if UMBC Training Centers receives notice of cancellation at least 10 business days before the class start date (for classes) or the exam date (for exams).