Mastering Apache Spark using Python

In this advanced course on Spark, learn how Spark helps in dealing with Big Data, how it works internally using RDDs, and its optimization techniques. Also learn how to use Spark APIs such as Spark SQL and Spark ML with Python.


About the course

Regardless of your job title, it is likely that the amount of data with which you are working is growing quickly. Your original solutions may need to be scaled, and your old techniques for solving new problems may need to be updated.

We hope this course will help you leverage Apache Spark to tackle new problems easily and old problems efficiently.

In this course, we will learn about Big Data, its applications, and its challenges, and how Spark helps in dealing with it. We will cover the architecture of Spark, its internal workings using RDDs, and its optimization techniques. We will also learn how to use Spark APIs such as Spark SQL and Spark ML with Python.

The market for Big Data analytics is growing rapidly across the world, and this strong growth, backed by market demand, is a great opportunity for IT professionals. The following groups in particular can benefit from moving into the Big Data domain:

Developers and Architects
BI/ETL/DW Professionals
Senior IT Professionals
Testing Professionals
Mainframe Professionals
Freshers
Big Data Enthusiasts
Software Architects, Engineers, and Developers
Data Scientists and Analytics Professionals

Pre-requisites

Good to have knowledge of any SQL platform, such as MySQL, PostgreSQL, or Oracle.
Good to have knowledge of any programming language, such as Python, Java, or Scala.
You should be familiar with object-oriented programming concepts such as classes, objects, and inheritance.
You should be familiar with lambda functions and higher-order functions.
Good to have knowledge of machine learning concepts.
Good to have knowledge of any cloud platform, such as AWS or Azure.

Key Takeaways from Spark Course

Understanding Big Data challenges and applications
Understanding the architecture of Apache Spark
Familiarity with Spark’s basic abstractions like RDDs and DataFrames
Familiarity with Spark APIs like Spark SQL, Spark ML
Exploratory Data Analysis of any data set using PySpark
Building Machine Learning pipelines in PySpark

What do I need to start with the Apache Spark course?

A working laptop/desktop with 8 GB RAM
A working Internet connection
Basic knowledge of Python, SQL and Linux.

Course curriculum

1. Introduction to the Course
- Course Overview
- Pre-requisites
- Instructor Introduction
- Course Handouts

2. What is Big Data?
- What is Big Data?
- Challenges with Big Data
- Applications of Big Data
- Quiz: Big Data
- Distributed Systems
- Quiz: Distributed Systems

3. Introduction to Apache Hadoop
- Introduction to Apache Hadoop
- Components of Apache Hadoop
- Hadoop Ecosystem
- Quiz: Introduction to Hadoop

4. Introduction to Apache Spark
- What is Spark?
- Spark Ecosystem
- Quiz: Introduction to Apache Spark

5. Deep Dive into Spark
- Spark Architecture
- Quiz: Spark Architecture
- Spark Cluster Managers
- Running Spark Applications on YARN
- Spark Context and Spark Session
- Quiz: Spark Cluster Managers

6. Introduction to Itversity Labs
- Itversity Credentials
- Introduction to Itversity
- Uploading data to Itversity
- HDFS common commands

7. RDDs in Spark
- What Are RDDs?
- How to create RDDs?
- Implementation: How to create RDDs?
- RDD Operations
- Implementation: RDD Operations (Part 1)
- Implementation: RDD Operations (Part 2)
- Quiz: RDD
- Pair RDDs
- Pair RDD Operations
- Implementation: Pair RDD Operations
- Implementation: groupByKey vs reduceByKey
- Quiz: Pair RDD
- Caching and Persistence in Spark
- Implementation: Persistence
- Storage Levels in Spark
- Implementation: Storage Levels
- Quiz: Caching & Persistence
- Assignment: RDD Operations

8. DataFrames in Spark
- What are Spark DataFrames?
- Implementation: Creating Spark DataFrames
- Implementation: Basic Operations on DataFrames
- Implementation: Creating Columns in DataFrames
- Implementation: Manipulating Records in DataFrames
- RDDs vs DataFrames: When to Use Each
- Quiz: DataFrames in Spark
- Assignment: Spark DataFrames

9. Understanding Spark Execution
- Jobs, Stages and Tasks
- Implementation: Jobs, Stages and Tasks
- Lineage
- Implementation: Lineage
- DAG
- Implementation: DAG
- Quiz: Spark Execution

10. Advanced Programming in Spark
- Shared Variables
- Implementation: Shared Variables
- Shuffling
- Partitioning
- Coalesce vs Repartition
- Implementation: Coalesce vs Repartition
- Quiz: Advanced Programming in Spark

11. Spark SQL
- What is Spark SQL?
- Catalyst Optimizer
- Spark SQL Queries
- Implementation: Spark SQL Queries
- Why do we need Spark SQL?
- Quiz: Spark SQL
- Assignment: Spark SQL

12. Spark ML
- Scope of ML in this Course
- Introduction to Machine Learning
- Types of Machine Learning Problems
- Machine Learning in Spark
- Life Cycle of an ML Project
- Quiz: Machine Learning in Spark
- Understanding the Problem Statement
- Implementation: Introduction to the Data
- Implementation: Univariate Analysis
- Implementation: Bivariate Analysis
- Quiz: Analysis using Spark
- Encoding Categorical Variables
- Implementation: Preprocessing Data
- Quiz: Preprocessing Data using Spark ML
- Vector Assembler
- Implementation: Model Building
- Quiz: Model Building
- Model Improvement
- Implementation: Fine Tuning ML Models
- Quiz: Fine Tuning ML Models
- Understanding ML Pipelines in Spark
- Implementation: Sample Pipelines in Spark
- Implementation: ML Pipeline for Click Prediction
- Quiz: ML Pipelines
- Assignment: Spark ML
