About the course

Regardless of your job title, the amount of data you work with is likely growing quickly. Your existing solutions may need to scale, and your old techniques may need updating to solve new problems.

We hope this course will help you leverage Apache Spark to tackle new problems easily and old problems efficiently.

In this course, we will learn what Big Data is, its applications and challenges, and how Spark helps us deal with it. We will cover Spark's architecture, its internal workings with RDDs, and its optimization techniques, and we will learn how to use Spark APIs such as Spark SQL and Spark ML from Python.

The market for Big Data analytics is growing rapidly worldwide, and this strong demand is a great opportunity for IT professionals. Here are a few professional groups that are benefiting from moving into the Big Data domain:

  • Developers and Architects

  • BI /ETL/DW Professionals

  • Senior IT Professionals

  • Testing Professionals

  • Mainframe Professionals

  • Freshers

  • Big Data Enthusiasts

  • Software Architects, Engineers, and Developers

  • Data Scientists and Analytics Professionals


Pre-requisites

  • Good to have: knowledge of a SQL database such as MySQL, PostgreSQL, or Oracle
  • Good to have: knowledge of a programming language such as Python, Java, or Scala
  • You should be familiar with object-oriented programming concepts such as classes, objects, and inheritance
  • You should be familiar with lambda functions and higher-order functions
  • Good to have: knowledge of machine learning concepts
  • Good to have: knowledge of a cloud platform such as AWS or Azure
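Spark's RDD API leans heavily on lambda functions passed to higher-order functions such as `map`, `filter`, and `reduce`. If you want a quick refresher, here is what those concepts look like in plain Python (no Spark required):

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# map applies a lambda to every element
squares = list(map(lambda x: x * x, numbers))

# filter keeps elements for which the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))

# reduce folds the list down to a single value
total = reduce(lambda a, b: a + b, numbers)

print(squares)  # [1, 4, 9, 16, 25]
print(evens)    # [2, 4]
print(total)    # 15
```

Spark's `rdd.map`, `rdd.filter`, and `rdd.reduce` follow the same pattern, only distributed across a cluster.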

Key Takeaways from Spark Course

  • Understanding Big Data challenges and applications

  • Understanding the architecture of Apache Spark

  • Familiarity with Spark’s basic abstractions like RDDs and DataFrames

  • Familiarity with Spark APIs like Spark SQL, Spark ML

  • Exploratory Data Analysis of any data set using PySpark

  • Building Machine Learning pipelines in PySpark

What do I need to start with the Apache Spark course?

  • A working laptop/desktop with 8 GB RAM

  • A working Internet connection

  • Basic knowledge of Python, SQL and Linux.

Course curriculum

  • 1
    Introduction to the Course
    • Course Overview
    • Pre-requisites
    • Instructor Introduction
    • Course Handouts
  • 2
    What is Big Data?
    • What is Big Data?
    • Challenges with Big Data
    • Applications of Big Data
    • Quiz: Big Data
    • Distributed Systems
    • Quiz: Distributed Systems
  • 3
    Introduction to Apache Hadoop
    • Introduction to Apache Hadoop
    • Components of Apache Hadoop
    • Hadoop Ecosystem
    • Quiz: Introduction to Hadoop
  • 4
    Introduction to Apache Spark
    • What is Spark?
    • Spark Ecosystem
    • Quiz: Introduction to Apache Spark
  • 5
    Deep Dive into Spark
    • Spark Architecture
    • Quiz: Spark Architecture
    • Spark Cluster Managers
    • Running Spark Applications on YARN
    • Spark Context and Spark Session
    • Quiz: Spark Cluster Managers
  • 6
    Introduction to Itversity Labs
    • Itversity Credentials
    • Introduction to Itversity
    • Uploading data to Itversity
    • HDFS common commands
  • 7
    RDDs in Spark
    • What Are RDDs?
    • How to create RDDs?
    • Implementation: How to create RDDs?
    • RDD Operations
    • Implementation: RDD Operations (Part 1)
    • Implementation: RDD Operations (Part 2)
    • Quiz: RDD
    • Pair RDDs
    • Pair RDD Operations
    • Implementation: Pair RDD Operations
    • Implementation: GroupByKey vs ReduceByKey
    • Quiz: Pair RDD
    • Caching and Persistence in Spark
    • Implementation: Persistence
    • Storage Levels in Spark
    • Implementation: Storage Levels
    • Quiz: Caching & Persistence
    • Assignment: RDD Operations
  • 8
    DataFrames in Spark
    • What are Spark DataFrames?
    • Implementation: Creating Spark DataFrames
    • Implementation: Basic Operations on DataFrames
    • Implementation: Creating Columns in DataFrames
    • Implementation: Manipulating Records in DataFrames
    • RDDs vs DataFrames - When to Use?
    • Quiz: DataFrames in Spark
    • Assignment: Spark DataFrames
  • 9
    Understanding Spark Execution
    • Jobs, Stages and Tasks
    • Implementation: Jobs, Stages and Tasks
    • Lineage
    • Implementation: Lineage
    • DAG
    • Implementation: DAG
    • Quiz: Spark Execution
  • 10
    Advanced Programming in Spark
    • Shared Variables
    • Implementation: Shared Variables
    • Shuffling
    • Partitioning
    • Coalesce vs Repartition
    • Implementation: Coalesce vs Repartition
    • Quiz: Advanced Programming in Spark
  • 11
    Spark SQL
    • What is Spark SQL?
    • Catalyst Optimizer
    • Spark SQL Queries
    • Implementation: Spark SQL Queries
    • Why do we need Spark SQL?
    • Quiz: Spark SQL
    • Assignment: Spark SQL
  • 12
    Spark ML
    • Scope of ML in this Course
    • Introduction to Machine Learning
    • Types of Machine Learning Problems
    • Machine Learning in Spark
    • Life Cycle of an ML Project
    • Quiz: Machine Learning in Spark
    • Understanding the Problem Statement
    • Implementation: Introduction to the Data
    • Implementation: Univariate Analysis
    • Implementation: Bivariate Analysis
    • Quiz: Analysis using Spark
    • Encoding Categorical Variables
    • Implementation: Preprocessing Data
    • Quiz: Preprocessing Data using Spark ML
    • Vector Assembler
    • Implementation: Model Building
    • Quiz: Model Building
    • Model Improvement
    • Implementation: Fine Tuning ML Models
    • Quiz: Fine Tune ML Models
    • Understanding ML Pipelines in Spark
    • Implementation: Sample Pipelines in Spark
    • Implementation: ML Pipeline for Click Prediction
    • Quiz: ML Pipelines
    • Assignment: Spark ML