Course curriculum

  • 1
    Big Data Introduction
    • Overview
    • Data Variety
    • Distributed systems
    • What is Big Data?
    • Why do we need Big Data now?
    • Big Data applications - recommendations
    • Big Data applications - A/B testing
    • Big Data Customers
    • Big Data solutions
    • What is Apache Hadoop?
    • Overview of Apache Hadoop ecosystem
    • Apache Spark ecosystem walkthrough
    • Big Data Introduction - Resources
  • 2
    Lab Overview
    • Introduction
    • Getting Started
    • Ambari
    • Hue
  • 3
    Linux Basics
    • Linux Introduction
    • Linux Operating System
    • Linux Files & Processes
    • The Directory Structure
    • Seeing Inside File - cat, tail, head
    • Use find command
    • Use grep command
    • Use wc command
    • Permissions - Overview
    • Permissions - Using chmod To Change
    • Permissions - Numeric
    • Permissions - Advanced
    • Process
    • Find information about processes
    • Background Processes
    • More - Interacting with processes
    • Process hierarchy
    • Program return values and arguments
    • Networking: Sockets and ports
    • Chaining Unix Commands
    • Filters
    • Improved Word Count Using Unix Commands
    • Permissions of Processes - setuid
    • Special System commands
    • Where is my program?
    • Environment variables
    • Setting Environment variables
  • 4
    Java Essentials
    • Java Essentials - Introduction
    • Java Essentials - Session 1
    • Java Essentials - Session 2
    • Java Essentials - Session 3
    • Java Essentials - Session 4
    • Java Essentials - Resources
  • 5
    Apache Zookeeper
    • ZooKeeper - Race Condition
    • ZooKeeper - Deadlock
    • ZooKeeper - Exercise
    • ZooKeeper - Coordination
    • Exercise
    • ZooKeeper - Introduction
    • Exercise
    • ZooKeeper Hands-on - Getting Started
    • ZooKeeper - Data Model
    • ZooKeeper - Data Model - Znode
    • Exercise
    • ZooKeeper - Hands-on - Znodes
    • Exercise
    • ZooKeeper - Architecture
    • ZooKeeper - Election & Majority
    • Exercise
    • ZooKeeper - Sessions
    • ZooKeeper - Application
    • ZooKeeper - Guarantees
    • Exercise
    • ZooKeeper - Operations
    • Exercise
    • ZooKeeper - APIs
    • Exercise
    • ZooKeeper - Watches
    • ZooKeeper - ACL
    • Exercise
    • ZooKeeper - Use Cases
    • ZooKeeper - Resources
  • 6
    HDFS
    • Why HDFS?
    • Exercise
    • HDFS - NameNode & DataNodes
    • Exercise
    • HDFS - Design & Limitations
    • HDFS - Replication
    • Exercise
    • HDFS - File Reading - Writing
    • HDFS - NameNode Backup & Failover
    • Exercise
    • HDFS - Hands-On with Ambari
    • HDFS - Hands-On with Hue
    • HDFS - Hands-On with Console
    • HDFS - Hands-On - More Commands
    • Exercise
    • HDFS - Summary
    • HDFS - Resources
  • 7
    YARN
    • YARN - Why?
    • YARN - Evolution from MapReduce 1.0
    • YARN - Architecture
    • YARN - More On Architecture
    • YARN - Resources
  • 8
    MapReduce
    • MapReduce - Understanding Sorting
    • MapReduce - Overview
    • MapReduce - Thinking in MR - Programmatic & SQL
    • MapReduce - Thinking in MR - Unix Pipeline
    • MapReduce - Thinking in MR - External Sort
    • MapReduce - Understanding The Paradigm
    • MapReduce - Examples
    • MapReduce - Multiple Reducers
    • MapReduce Basics - Resources
  • 9
    MapReduce Programming
    • Writing MapReduce Code Using Java
    • Building MapReduce project using Apache Ant
    • Writing MapReduce code using Eclipse
    • Writing MapReduce code using Eclipse (Windows)
    • Run MapReduce jobs using Hadoop Streaming
    • Exercise
    • MapReduce Programming - Resources
  • 10
    Pig
    • Pig - Introduction
    • Pig - Modes
    • Pig - Data Types
    • Pig - Relational Operators - Load, Store and Dump
    • Pig - Lazy Evaluation
    • Exercise
    • Pig - Relational Operators - FOREACH
    • Pig - More Operators
    • Pig - Calculate Average Dividend - Hands-on
    • Pig - Summary
    • Exercise
    • Pig - Resources
  • 11
    Hive
    • Hive - Introduction
    • Hive - Data Types
    • Hive - Getting Started - Hands-on
    • Hive - Tables
    • Hive - Managed Tables - Hands-on
    • Hive - External Tables - Hands-on
    • Hive - Select and Aggregation Queries
    • Hive - Saving Data
    • Hive - DDL - Alter Table
    • Hive - Partitions
    • Hive - Views
    • Hive - Load JSON Data
    • Hive - Sorting & Bucketing
    • Hive - ORC File Format
    • Hive - Quick Recap
    • Connect to Apache Hive using Tableau
    • Hive - Resources
  • 12
    Sqoop
    • Sqoop - Introduction
    • Sqoop Import - MySQL to HDFS
    • Sqoop Import - MySQL to Hive
    • Sqoop Import - MySQL to HBase
    • Sqoop Export - Hive to MySQL
    • Sqoop - Summary
    • Sqoop - Resources
  • 13
    Flume
    • Flume - Introduction
    • Flume - Agents
    • Flume - Sources & Delivery Reliability
    • Flume - Hands-on Demo on CloudxLab
    • Flume - Summary
    • Flume - Resources
  • 14
    Oozie
    • Oozie - Introduction
    • Running Oozie Workflow From Command Line
    • Running Oozie Workflow From Hue
    • Oozie workflow for Hive
    • Run Sqoop Action Using Oozie
    • Execute shell script using Oozie
    • Oozie - Summary
    • Oozie - Resources
  • 15
    NoSQL
    • NoSQL - Scaling Out / Up
    • NoSQL - ACID Properties and RDBMS Story
    • NoSQL - Types of NoSQL Stores
    • NoSQL - CAP Theorem
    • NoSQL - Column Oriented Databases
    • NoSQL - Resources
  • 16
    HBase
    • HBase Introduction
    • HBase Architecture
    • HBase Architecture - Regions
    • HBase - Data Model
    • HBase Design Guidelines
    • HBase - Data Model Example
    • HBase Hands On with Hue
    • HBase Hands-On with console
    • HBase - Data Location
    • HBase - Bloom Filter
    • HBase REST
    • HBase - Resources
  • 17
    Scala
    • Scala - Quick Introduction - Variables and Methods
    • Scala - Features
    • Scala - Installation on your own machine
    • Scala - Assessment - Hello, World! Program
    • Scala - Program Structure
    • Scala - Data Types
    • Scala - Operators
    • Scala - Variables and Type Inference
    • Scala - Strings
    • Scala - String Formatting and Interpolation
    • Scala - String Methods
    • Scala - Assessment - Write function to remove vowels
    • Scala - Quick Introduction - Conditions and Loops
    • Scala - Conditional Statements Examples
    • Scala - Loop Statements Examples
    • Scala - Classes and Objects
    • Scala - Class examples
    • Scala - Classes with multiple constructors
    • Scala - Assessment - Declare a class
    • Scala - Assessment - Declare a class with multiple constructors
    • Scala - Function Representations
    • Scala - Function Examples
    • Scala - Assessment - Function to calculate sum of consecutive numbers
    • Scala - Closures
    • Scala - Collections Overview
    • Scala - Sequences and Sets
    • Scala - Collections - Tuples and Maps
    • Scala - File IO
    • Scala - Higher Order Functions
    • Scala - Interaction with Java
    • Scala - Build Tool - SBT
    • Scala - Case Classes and Pattern Matching
    • Scala - Variable Examples
    • Scala - Resources
  • 18
    Apache Spark Basics
    • Apache Spark ecosystem walkthrough
    • Spark Introduction - Why Spark?
    • Getting Started with Spark using CloudxLab
    • Getting Started with Spark - Cluster Installation (optional)
    • Spark Introduction - What is RDD
    • Apache Spark - Creating RDD
    • Apache Spark - Counting Word Frequencies
    • Apache Spark - Transformations - map & filter
    • Apache Spark - Actions - take & saveAsTextFile
    • Apache Spark - Lazy Evaluation & Lineage Graph
    • Apache Spark - More Operations - Transformations & Actions
    • Apache Spark - Reduce, Commutative & Associative
    • Apache Spark - Problem Solving - Compute Average
    • Apache Spark - Slides
    • Apache Spark - More RDD Operations
    • Apache Spark - More RDD Operations - Slides
  • 19
    Apache Spark - Key-Value or Pair RDD
    • Apache Spark - Key-Value or Pair RDD - Getting Started
    • Apache Spark - Key-Value or Pair RDD - reduceByKey
    • Apache Spark - Key-Value or Pair RDD - Slides
  • 20
    Loading and Saving Data
    • Loading and Saving Data - Reading from Common Data Sources
    • Loading and Saving Data - Common File Formats
    • Loading and Saving Data - Handling Sequence and Object Files
    • Loading and Saving Data - Handling Hadoop Formats
    • Loading and Saving Data - Protocol Buffers
    • Loading and Saving Data - Understanding Compression
    • Loading and Saving Data - Handling Various File Systems
  • 21
    Spark - Streaming
    • Apache Spark - Streaming - Introduction
    • Apache Spark - Streaming - DStream
    • Apache Spark - Streaming - Use Cases
    • Apache Spark - Streaming - Wordcount Hands-On
    • Apache Spark - Streaming - Understanding the Master URL
    • Apache Kafka Introduction
    • Apache Kafka Hands-on
    • Integrating Apache Spark Streaming & Apache Kafka - Hands-on
    • Apache Spark - Streaming - updateStateByKey Operation
    • Apache Spark - Streaming - Transform and Window Operations
    • Apache Spark - Streaming - Join and Output Operations
    • Apache Spark - Streaming - Slides
  • 22
    Spark On Cluster
    • Apache Spark - Running On Cluster - Architecture
    • Apache Spark - Running On Cluster - Launching
    • Apache Spark - Running On Cluster - Local Mode
    • Apache Spark - Running On Cluster - Cluster Mode - Standalone
    • Apache Spark - Running On Cluster - Cluster Mode - YARN
    • Apache Spark - Running On Cluster - Cluster Mode - Mesos+AWS
    • Apache Spark - Running On Cluster - Deployment Modes
    • Apache Spark - Running On Cluster - Slides
  • 23
    Writing Spark Applications
    • Preface - Different Kinds of Applications
    • General Workflow
    • Definitions - IDE
    • Definitions - Build Tools
    • Definitions - Source Code management tools
    • Definitions - Testing - Unit & Integration
    • Example Objective
    • Approach 1 - Using Spark Shell
    • Process for Large Spark Projects
    • Tutorial - Fork The Repository
    • Tutorial - Understand Code
    • Tutorial - Browsing Through The Code
    • Tutorial - Build In The Console (aka Shell, Terminal)
    • Tutorial - Unit Test cases
    • Tutorial - Setup Dev Machine
    • Tutorial - Setting up Dev Machine and fixing code on Windows
  • 24
    Dataframe, Spark SQL, R
    • Spark SQL - Introduction
    • Spark SQL - Dataframe Introduction
    • Spark SQL - Getting Started
    • Spark SQL - Create Dataframe from JSON
    • Spark SQL - Dataframe Operations
    • Spark SQL - SQL Queries On Dataframes
    • Spark SQL - Understanding Datasets
    • Spark SQL - RDD And Dataframe Interoperability
    • Spark SQL - Infer Schema Using Reflection
    • Spark SQL - Converting RDD to Dataframe Using Programmatic Schema
    • Spark SQL - Handling Avro
    • Spark SQL - Loading XML
    • Spark SQL - Handling Various Data Sources
    • Spark SQL - Using Hive tables
    • Spark SQL - Working With JDBC
    • Spark SQL - Resources
    • Overview of SparkR
    • SparkR - Resources
  • 25
    Machine Learning With Spark
    • Machine Learning Introduction
    • Applications Of Machine Learning
    • Types Of Machine Learning
    • MLlib Overview
    • Collaborative Filtering or Recommender using MLlib
    • MLlib Data Types And Libraries
    • MLlib - Resources
  • 26
    Graph Processing With Spark
    • GraphX Quick Walkthrough
    • GraphX - Resources
  • 27
    Spark Project - Log Parsing
    • Spark - Project - Apache log parsing - Introduction
    • Spark - Project - Apache log parsing - Top 10 requested URLs
    • Spark - Project - Apache log parsing - Top 5 time frames for high traffic
    • Spark - Project - Apache log parsing - Top 5 time frames for least traffic
    • Spark - Project - Apache log parsing - Find HTTP codes
  • 28
    Adv Spark Programming
    • Adv Spark Programming - Understanding Persistence
    • Adv Spark Programming - Persistence StorageLevel
    • Adv Spark Programming - Data Partitioning
    • Adv Spark Programming - Partitioning Hands-On
    • Adv Spark Programming - Data Partitioning Example
    • Adv Spark Programming - Custom Partitioner
    • Exercise
    • Adv Spark Programming - Shared Variables
    • Exercise
    • Adv Spark Programming - Accumulators
    • Exercise
    • Adv Spark Programming - Custom Accumulators
    • Adv Spark Programming - Broadcast Variables
    • Exercise
    • Adv Spark Programming - Broadcast Variables Example
    • Adv Spark Programming - Key Performance Considerations - Parallelism
    • Exercise
    • Adv Spark Programming - Key Performance Considerations - Partitions
    • Exercise
    • Adv Spark Programming - Serialization Format
    • Exercise
    • Adv Spark Programming - Memory Management
    • Exercise
    • Adv Spark Programming - Hardware Provisioning
    • Exercise
    • Adv Spark Programming - Slides
  • 29
    Hive Project
    • Hive - Project - Sentiment Analysis
    • Exercise
    • Hive - Project - Sentiment Analysis - Visualization
    • Exercise