
  


Big Data Analytics

Hadoop With Spark

    
                     
  • Overview

    Hadoop is a mature Big Data environment, and Hive is the de facto standard for its SQL interface. Today, computations in Hadoop are usually done with Spark, which offers an optimized compute engine that supports batch processing, real-time streaming, and machine learning.

    This course covers Hadoop 3, Hive 3, and Spark 3.

  • Who Should Take This Course

    AUDIENCE

    This course is suitable for business analysts, software developers, and managers.

    PREREQUISITES

    • Basics of SQL
    • Exposure to software design
    • Some experience with a programming language such as C, Python or Java (Python preferred)
    • No prior Hadoop knowledge is required
  • Course Outline

    Why Hadoop?

    • The motivation for Hadoop
    • Use cases and case studies about Hadoop

    The Hadoop platform

    • MapReduce, HDFS, YARN
    • New in Hadoop 3
    • Erasure Coding vs 3x replication

    Hive Basics

    • Defining Hive Tables
    • SQL Queries over Structured Data
    • Filtering / Search
    • Aggregations / Ordering
    • Partitions
    • Joins
    • Text Analytics (Semi-Structured Data)

    New in Hive 3

    • ACID tables
    • Hive Query Language (HQL)
    • How to run a good query?
    • How to troubleshoot queries?

    HBase

    • Basics
    • HBase tables – design and use
    • Phoenix driver for HBase tables

    Sqoop

    • Tool
    • Architecture
    • Use

    The big picture

    • How Hadoop fits into your architecture
    • Hive vs HBase with Phoenix vs Excel

    Spark Introduction

    • Big Data, Hadoop, Spark
    • Spark concepts and architecture
    • Spark components overview
    • Labs: Installing and running Spark

    First Look at Spark

    • Spark shell
    • Spark web UIs
    • Analyzing a dataset – part 1
    • Labs: Spark shell exploration

    Spark Data structures

    • Partitions
    • Distributed execution
    • Operations: transformations and actions
    • Labs: Unstructured data analytics using RDDs

    Caching

    • Caching overview
    • Various caching mechanisms available in Spark
    • In-memory file systems
    • Caching use cases and best practices
    • Labs: Benchmark of caching performance

    DataFrames and Datasets

    • DataFrames Intro
    • Loading structured data (JSON, CSV) using DataFrames
    • Using schema
    • Specifying schema for DataFrames
    • Labs: DataFrames, Datasets, Schema

    Spark SQL

    • Spark SQL concepts and overview
    • Defining tables and importing datasets
    • Querying data using SQL
    • Handling various storage formats: JSON / Parquet / ORC
    • Labs: querying structured data using SQL; evaluating data formats

    Spark and Hadoop

    • Hadoop + Spark architecture
    • Running Spark on Hadoop YARN
    • Processing HDFS files using Spark
    • Spark & Hive

    Spark API

    • Overview of Spark APIs in Scala / Python
    • Life cycle of a Spark application
    • Spark APIs
    • Deploying Spark applications on YARN
    • Labs: Developing and deploying a Spark application

    Spark ML Overview

    • Machine Learning primer
    • Machine Learning in Spark: MLlib / ML
    • Spark ML overview (newer Spark 2 version)
    • Algorithms overview: Clustering, Classifications, Recommendations
    • Labs: Writing ML applications in Spark

    GraphX

    • GraphX library overview
    • GraphX APIs
    • Creating a graph and navigating it
    • Shortest distance
    • Pregel API
    • Labs: Processing graph data using Spark

    Spark Streaming

    • Streaming concepts
    • Evaluating Streaming platforms
    • Spark streaming library overview
    • Streaming operations
    • Sliding window operations
    • Structured Streaming
    • Continuous streaming
    • Spark & Kafka streaming
    • Labs: Writing Spark Streaming applications

    Workshops (Time permitting)

    • These are group workshops
    • Attendees will work on solving real world data analysis problems using Spark
  • FAQs
    Is there a discount available for current students?

    UMBC students and alumni, as well as students who have previously taken a public training course with UMBC Training Centers, are eligible for a 10% discount, capped at $250. Please provide a copy of your UMBC student ID, an unofficial transcript, or the name of the UMBC Training Centers course you have completed. Online courses are excluded from this offer.

    What is the cancellation and refund policy?

    Students will receive a refund of paid registration fees only if UMBC Training Centers receives notice of cancellation at least 10 business days before the class start date (for classes) or the exam date (for exams).
