Analytics Vidhya - Big Data with Hadoop and Spark

1

Big Data Introduction
- Overview
- Data Variety
- Distributed systems
- What is Big Data?
- Why do we need Big Data now?
- Big Data applications - recommendations
- Big Data applications - A/B testing
- Big Data Customers
- Big Data solutions
- What is Apache Hadoop?
- Overview of Apache Hadoop ecosystem
- Apache Spark ecosystem walkthrough
- Big Data Introduction - Resources
2

Lab Overview
- Introduction
- Getting Started
- Ambari
- Hue
3

Linux Basics
- Linux Introduction
- Linux Operating System
- Linux Files & Processes
- The Directory Structure
- Seeing Inside File - cat, tail, head
- Use find command
- Use grep command
- Use wc command
- Permissions - Overview
- Permissions - Using chmod To Change
- Permissions - Numeric
- Permissions - Advanced
- Process
- Find information about processes
- Background Processes
- More - Interacting with processes
- Process hierarchy
- Program returns value and arguments
- Networking: Sockets and ports
- Chaining Unix Commands
- Filters
- Improved Word Count Using Unix Commands
- Permissions of Processes - setuid
- Special System commands
- Where is my program?
- Environment variables
- Setting Environment variables
4

Java Essentials
- Java Essentials - Introduction
- Java Essentials - Session 1
- Java Essentials - Session 2
- Java Essentials - Session 3
- Java Essentials - Session 4
- Java Essentials - Resources
5

Apache Zookeeper
- ZooKeeper - Race Condition
- ZooKeeper - Deadlock
- ZooKeeper - Exercise
- ZooKeeper - Coordination
- Exercise
- ZooKeeper - Introduction
- Exercise
- ZooKeeper Hands-on - Getting Started
- ZooKeeper - Data Model
- ZooKeeper - Data Model - Znode
- Exercise
- ZooKeeper - Hands-on - Znodes
- Exercise
- ZooKeeper - Architecture
- ZooKeeper - Election & Majority
- Exercise
- ZooKeeper - Sessions
- ZooKeeper - Application
- ZooKeeper - Guarantees
- Exercise
- ZooKeeper - Operations
- Exercise
- ZooKeeper - APIs
- Exercise
- ZooKeeper - Watches
- ZooKeeper - ACL
- Exercise
- ZooKeeper - Use Cases
- ZooKeeper - Resources
6

HDFS
- Why HDFS?
- Exercise
- HDFS - NameNode & DataNodes
- Exercise
- HDFS - Design & Limitations
- HDFS - Replication
- Exercise
- HDFS - File Reading - Writing
- HDFS - NameNode Backup & Failover
- Exercise
- HDFS - Hands-On with Ambari
- HDFS - Hands-On with Hue
- HDFS - Hands-On with Console
- HDFS - Hands-On - More Commands
- Exercise
- HDFS - Summary
- HDFS - Resources
7

YARN
- YARN - Why?
- YARN - Evolution from MapReduce 1.0
- YARN - Architecture
- YARN - More on Architecture
- YARN - Resources
8

MapReduce
- MapReduce - Understanding Sorting
- MapReduce - Overview
- MapReduce - Thinking in MR - Programmatic & SQL
- MapReduce - Thinking in MR - Unix Pipeline
- MapReduce - Thinking in MR - External Sort
- MapReduce - Understanding The Paradigm
- MapReduce - Examples
- MapReduce - Multiple Reducers
- MapReduce Basics - Resources
9

MapReduce Programming
- Writing MapReduce Code Using Java
- Building MapReduce project using Apache Ant
- Writing MapReduce code using Eclipse
- Writing MapReduce code using Eclipse (Windows)
- Run MapReduce jobs using Hadoop Streaming
- Exercise
- MapReduce Programming - Resources
10

Pig
- Pig - Introduction
- Pig - Modes
- Pig - Data Types
- Pig - Relational Operators - Load, Store and Dump
- Pig - Lazy Evaluation
- Exercise
- Pig - Relational Operators - FOREACH
- Pig - More Operators
- Pig - Calculate Average Dividend - Hands-on
- Pig - Summary
- Exercise
- Pig - Resources
11

Hive
- Hive - Introduction
- Hive - Data Types
- Hive - Getting Started - Hands-on
- Hive - Tables
- Hive - Managed Tables - Hands-on
- Hive - External Tables - Hands-on
- Hive - Select and Aggregation Queries
- Hive - Saving Data
- Hive - DDL - Alter Table
- Hive - Partitions
- Hive - Views
- Hive - Load JSON Data
- Hive - Sorting & Bucketing
- Hive - ORC File Format
- Hive - Quick Recap
- Connect to Apache Hive using Tableau
- Hive - Resources
12

Sqoop
- Sqoop - Introduction
- Sqoop Import - MySQL to HDFS
- Sqoop Import - MySQL to Hive
- Sqoop Import - MySQL to HBase
- Sqoop Export - Hive to MySQL
- Sqoop - Summary
- Sqoop - Resources
13

Flume
- Flume - Introduction
- Flume - Agents
- Flume - Sources & Delivery Reliability
- Flume - Hands-on Demo on CloudxLab
- Flume - Summary
- Flume - Resources
14

Oozie
- Oozie - Introduction
- Running Oozie Workflow From Command Line
- Running Oozie Workflow From Hue
- Oozie workflow for Hive
- Run Sqoop Action Using Oozie
- Execute shell script using Oozie
- Oozie - Summary
- Oozie - Resources
15

NoSQL
- NoSQL - Scaling Out / Up
- NoSQL - ACID Properties and RDBMS Story
- NoSQL - Types of NoSQL Stores
- NoSQL - CAP Theorem
- NoSQL - Column Oriented Databases
- NoSQL - Resources
16

HBase
- HBase Introduction
- HBase Architecture
- HBase Architecture - Regions
- HBase - Data Model
- HBase Design Guidelines
- HBase - Data Model Example
- HBase Hands-On with Hue
- HBase Hands-On with Console
- HBase - Data Location
- HBase - Bloom Filter
- HBase REST
- HBase - Resources
17

Scala
- Scala - Quick Introduction - Variables and Methods
- Scala - Features
- Scala - Installation on your own machine
- Scala - Assessment - Hello, World! Program
- Scala - Program Structure
- Scala - Data Types
- Scala - Operators
- Scala - Variables and Type Inference
- Scala - Strings
- Scala - String Formatting and Interpolation
- Scala - String Methods
- Scala - Assessment - Write function to remove vowels
- Scala - Quick Introduction - Conditions and Loops
- Scala - Conditional Statements Examples
- Scala - Loop Statements Examples
- Scala - Classes and Objects
- Scala - Class examples
- Scala - Classes with multiple constructors
- Scala - Assessment - Declare a class
- Scala - Assessment - Declare a class with multiple constructors
- Scala - Function Representations
- Scala - Function Examples
- Scala - Assessment - Function to calculate sum of consecutive numbers
- Scala - Closures
- Scala - Collections Overview
- Scala - Sequences and Sets
- Scala - Collections - Tuples and Maps
- Scala - File IO
- Scala - Higher Order Functions
- Scala - Interaction with Java
- Scala - Build Tool - SBT
- Scala - Case Classes and Pattern Matching
- Scala - Variable Examples
- Scala - Resources
18

Apache Spark Basics
- Apache Spark ecosystem walkthrough
- Spark Introduction - Why Spark?
- Getting Started with Spark using CloudxLab
- Getting Started with Spark - Cluster Installation (optional)
- Spark Introduction - What is RDD
- Apache Spark - Creating RDD
- Apache Spark - Counting Word Frequencies
- Apache Spark - Transformations - map & filter
- Apache Spark - Actions - take & saveAsTextFile
- Apache Spark - Lazy Evaluation & Lineage Graph
- Apache Spark - More Operations - Transformations & Actions
- Apache Spark - Reduce, Commutative & Associative
- Apache Spark - Problem Solving - Compute Average
- Apache Spark - Slides
- Apache Spark - More RDD Operations
- Apache Spark - More RDD Operations - Slides
19

Apache Spark - Key-Value or Pair RDD
- Apache Spark - Key-Value or Pair RDD - Getting Started
- Apache Spark - Key-Value RDD - reduceByKey
- Apache Spark - Key-Value or Pair RDD - Slides
20

Loading and Saving Data
- Loading and Saving Data - Reading from Common Data Sources
- Loading and Saving Data - Common File Formats
- Loading and Saving Data - Handling Sequence and Object Files
- Loading and Saving Data - Handling Hadoop Formats
- Loading and Saving Data - Protocol Buffers
- Loading and Saving Data - Understanding Compression
- Loading and Saving Data - Handling Various File Systems
21

Spark - Streaming
- Apache Spark - Streaming - Introduction
- Apache Spark - Streaming - DStream
- Apache Spark - Streaming - Use Cases
- Apache Spark - Streaming - Wordcount Hands-On
- Apache Spark - Streaming - Understanding the Master URL
- Apache Kafka Introduction
- Apache Kafka Hands-on
- Integrating Apache Spark Streaming & Apache Kafka - Hands-on
- Apache Spark - Streaming - updateStateByKey Operation
- Apache Spark - Streaming - Transform and Window Operations
- Apache Spark - Streaming - Join and Output Operations
- Apache Spark - Streaming - Slides
22

Spark On Cluster
- Apache Spark - Running On Cluster - Architecture
- Apache Spark - Running On Cluster - Launching
- Apache Spark - Running On Cluster - Local Mode
- Apache Spark - Running On Cluster - Cluster Mode - Standalone
- Apache Spark - Running On Cluster - Cluster Mode - YARN
- Apache Spark - Running On Cluster - Cluster Mode - Mesos+AWS
- Apache Spark - Running On Cluster - Deployment Modes
- Apache Spark - Running On Cluster - Slides
23

Writing Spark Applications
- Preface - Different Kinds of Applications
- General Workflow
- Definitions - IDE
- Definitions - Build Tools
- Definitions - Source Code management tools
- Definitions - Testing - Unit & Integration
- Example Objective
- Approach 1 - Using Spark Shell
- Process for Large Spark Projects
- Tutorial - Fork The Repository
- Tutorial - Understand Code
- Tutorial - Browsing Through The Code
- Tutorial - Build In The Console (aka Shell, Terminal)
- Tutorial - Unit Test cases
- Tutorial - Setup Dev Machine
- Tutorial - Setting up Dev Machine and fixing code on Windows
24

DataFrame, Spark SQL, R
- Spark SQL - Introduction
- Spark SQL - Dataframe Introduction
- Spark SQL - Getting Started
- Spark SQL - Create DataFrame from JSON
- Spark SQL - Dataframe Operations
- Spark SQL - SQL Queries On Dataframes
- Spark SQL - Understanding Datasets
- Spark SQL - RDD and DataFrame Interoperability
- Spark SQL - Infer Schema Using Reflection
- Spark SQL - Converting RDD to Dataframe Using Programmatic Schema
- Spark SQL - Handling Avro
- Spark SQL - Loading XML
- Spark SQL - Handling Various Data Sources
- Spark SQL - Using Hive tables
- Spark SQL - Working With JDBC
- Spark SQL - Resources
- Overview of SparkR
- SparkR - Resources
25

Machine Learning With Spark
- Machine Learning Introduction
- Applications Of Machine Learning
- Types Of Machine Learning
- MLlib Overview
- Collaborative Filtering or Recommender using MLlib
- MLlib Data Types And Libraries
- MLlib - Resources
26

Graph Processing With Spark
- GraphX Quick Walkthrough
- GraphX - Resources
27

Spark Project - Log Parsing
- Spark Project - Apache log parsing - Introduction
- Spark - Project - Apache log parsing - Top 10 requested URLs
- Spark - Project - Apache log parsing - Top 5 time frames for high traffic
- Spark - Project - Apache log parsing - Top 5 time frames for least traffic
- Spark - Project - Apache log parsing - Find HTTP codes
28

Adv Spark Programming
- Adv Spark Programming - Understanding Persistence
- Adv Spark Programming - Persistence StorageLevel
- Adv Spark Programming - Data Partitioning
- Adv Spark Programming - Partitioning Hands-On
- Adv Spark Programming - Data Partitioning Example
- Adv Spark Programming - Custom Partitioner
- Exercise
- Adv Spark Programming - Shared Variables
- Exercise
- Adv Spark Programming - Accumulators
- Exercise
- Adv Spark Programming - Custom Accumulators
- Adv Spark Programming - Broadcast Variables
- Exercise
- Adv Spark Programming - Broadcast Variables Example
- Adv Spark Programming - Key Performance Considerations - Parallelism
- Exercise
- Adv Spark Programming - Key Performance Considerations - Partitions
- Exercise
- Adv Spark Programming - Serialization Format
- Exercise
- Adv Spark Programming - Memory Management
- Exercise
- Adv Spark Programming - Hardware Provisioning
- Exercise
- Adv Spark Programming - Slides
29

Hive Project
- Hive - Project - Sentiment Analysis
- Exercise
- Hive - Project - Sentiment Analysis - Visualization
- Exercise