Training Information
BIG DATA HADOOP
We are pleased to offer a comprehensive suite of training solutions tailored to meet your needs. Our services encompass both online and offline corporate training options, ensuring flexibility and accessibility for your team's professional development.
Course Content
Syllabus:
BIG DATA HADOOP
I: INTRODUCTION
What is Big Data?
What is Hadoop?
Need for Hadoop
Sources and Types of Data
Comparison with Other Technologies
Challenges with Big Data
i. Storage
ii. Processing
RDBMS vs Hadoop
Advantages of Hadoop
Hadoop Ecosystem Components
II: HDFS (Hadoop Distributed File System)
Features of HDFS
Name Node, Data Node, Blocks
Configuring Block Size
HDFS Architecture (5 Daemons)
i. Name Node
ii. Data Node
iii. Secondary Name node
iv. Job Tracker
v. Task Tracker
Metadata management
Storage and processing
Replication in Hadoop
Configuring Custom Replication
Fault Tolerance in Hadoop
HDFS Commands
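To ground the HDFS commands topic above, here is a minimal sketch that shells out to the `hdfs dfs` client from Python (this assumes a working Hadoop installation with `hdfs` on the PATH; the paths are illustrative, not part of the course material):

```python
import subprocess

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

print(hdfs("-mkdir", "-p", "/user/training/input"))      # create a directory
print(hdfs("-put", "data.txt", "/user/training/input"))  # copy a local file in
print(hdfs("-ls", "/user/training/input"))               # list the directory
print(hdfs("-cat", "/user/training/input/data.txt"))     # print file contents
```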
III: MAP REDUCE
Map Reduce Architecture
Processing Daemons of Hadoop
Job Tracker (Roles and Responsibilities)
Task Tracker (Roles and Responsibilities)
Phases of Map Reduce
i) Mapper phase
ii) Reducer phase
Input split
Input split vs Block size
Partitioner in Map Reduce
Groupings and Aggregations
Data Types in Map Reduce
Map Reduce Programming Model
Driver Code
Mapper Code
Reducer Code
Programming examples (see the word-count sketch at the end of this module)
File input formats
File output formats
Merging in Map Reduce
Speculative Execution Model
Speculative Job
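As referenced under Programming examples above, here is a minimal word-count sketch in the Hadoop Streaming style, with the mapper and reducer phases simulated locally. A real job would submit two such scripts through the hadoop-streaming jar; this is an illustration of the phases, not the Java driver/mapper/reducer code covered in class:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Mapper phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reducer phase: sum the counts per word (input arrives sorted by key)."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # sorted() stands in for Hadoop's shuffle-and-sort between the phases.
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print(f"{word}\t{total}")
```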
IV: SQOOP (SQL + HADOOP)
Introduction to Sqoop
SQOOP Import
SQOOP Export
Importing Data From RDBMS to HDFS
Importing Data From RDBMS to HIVE
Importing Data From RDBMS to HBASE
Exporting From HBASE to RDBMS
Exporting From HIVE to RDBMS
Exporting From HDFS to RDBMS
Transformations While Importing / Exporting
Filtering data while importing
Vertical and horizontal merging while importing
Working with delimiters while importing
Groupings and Aggregations while importing
Incremental import
Examples and operations
Defining SQOOP Jobs
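A sketch of a typical Sqoop import invocation combining several of the options above (filtering, delimiters, incremental import). The connection URL, table, and credential path are placeholders, and Sqoop is assumed to be installed with the matching JDBC driver:

```python
import subprocess

# Placeholder connection details; adjust for your RDBMS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/shop",   # hypothetical database
    "--username", "training",
    "--password-file", "/user/training/.sqoop_pwd",
    "--table", "orders",
    "--target-dir", "/user/training/orders",
    "--fields-terminated-by", ",",                  # working with delimiters
    "--where", "order_date >= '2024-01-01'",        # filtering while importing
    "--incremental", "append",                      # incremental import
    "--check-column", "order_id",
    "--last-value", "0",
], check=True)
```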
V: YARN
Introduction
Speculative Execution, Speculative Job, and Speculative Task
Comparison of Hadoop 1.x with Hadoop 2.x
Comparison with previous versions
YARN Architecture Components
i. Resource Manager
ii. Application Master
iii. Node Manager
iv. Application Manager
v. Resource Scheduler
vi. Job History Server
vii. Container
VI: NOSQL
What is “Not Only SQL”?
NOSQL Advantages
What is the problem with RDBMS for large-scale data systems?
Types of NOSQL & Purposes
Key Value Store
Columnar Store
Document Store
Graph Store
Introduction to Cassandra – NOSQL Database
Introduction to MongoDB and CouchDB Database
Integration of NOSQL Databases with Hadoop
VII: HBASE
Introduction to Google Bigtable
What is a NOSQL and columnar store database?
HBASE Introduction
HBase use cases
HBase basics
Column families
Scans
HBase Architecture
Map Reduce over HBase
HBase data modeling
HBase schema design
HBase CRUD operations
Hive & HBase integration
HBase storage handlers
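A small illustration of HBase CRUD operations from Python. This uses the third-party happybase client purely as an example; the library choice, Thrift host, table name, and column family are all assumptions, not part of the course tooling:

```python
import happybase  # third-party Thrift client; requires the HBase Thrift server

connection = happybase.Connection("localhost")  # assumed Thrift host
table = connection.table("users")               # hypothetical table

# Create/Update: put a row keyed by user id into the 'info' column family.
table.put(b"user1", {b"info:name": b"Asha", b"info:city": b"Pune"})

# Read: fetch a single row, then scan a key range by prefix.
print(table.row(b"user1"))
for key, data in table.scan(row_prefix=b"user"):
    print(key, data)

# Delete: remove the row.
table.delete(b"user1")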
VIII: HIVE
Introduction
Hive Architecture
Hive Metastore
Hive Query Language
Difference between HQL and SQL
Hive Built-in Functions
Loading Data From Local Files To Hive Tables
Loading Data From HDFS Files To Hive Tables
Table Types
Inner Tables
External Tables
Hive Working with unstructured data
Hive Working With XML Data
Hive Working With JSON Data
Hive Working With URLs And Weblog Data
Hive Unions
Hive Joins
Multi Table / File Inserts
Inserting Into Local Files
Inserting Into HDFS Files
Hive UDF (user defined functions)
Hive UDAF (user defined aggregate functions)
Hive UDTF (user defined table generating functions)
Partitioned Tables
Non-Partitioned Tables
Multi-column Partitioning
Dynamic Partitions In Hive
Performance Tuning mechanism
Bucketing in Hive
Indexing in Hive
Hive Examples
Hive & HBase Integration
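A brief HiveQL sketch touching inner vs. external tables and dynamic partitioning, run through a Hive-enabled SparkSession for convenience. Table names, the HDFS location, and the staging_sales source table are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()      # needs a reachable Hive metastore
         .getOrCreate())

# Managed (inner) table: dropping it deletes both metadata and data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (country STRING)
""")

# External table: only metadata is managed; the HDFS files survive a DROP.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (line STRING)
    LOCATION '/user/training/weblogs'
""")

# Dynamic partitioning: rows are routed to partitions by the country value.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT INTO sales PARTITION (country)
    SELECT id, amount, country FROM staging_sales
""")
```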
PYSPARK
I) PYSPARK INTRODUCTION
What is Apache Spark?
Why Pyspark?
Need for PySpark
Spark: Python vs Scala
PySpark features
Real-life usage of PySpark
PySpark Web/Application
PySpark - SparkSession
PySpark – SparkContext
PySpark – RDD
PySpark – Parallelize
PySpark – repartition() vs coalesce()
PySpark – Broadcast Variables
PySpark – Accumulator
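A compact sketch of the entry points and shared variables listed above, runnable in local mode (the app name and sample data are illustrative):

```python
from pyspark.sql import SparkSession

# SparkSession is the unified entry point; SparkContext hangs off it.
spark = (SparkSession.builder.appName("intro-demo")
         .master("local[*]").getOrCreate())
sc = spark.sparkContext

# parallelize: distribute a local Python collection as an RDD.
rdd = sc.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())                   # 4

# repartition() shuffles to a new count; coalesce() narrows without a shuffle.
print(rdd.repartition(8).getNumPartitions())    # 8
print(rdd.coalesce(2).getNumPartitions())       # 2

# Broadcast variable: a read-only lookup shipped once to every executor.
lookup = sc.broadcast({0: "even", 1: "odd"})
print(rdd.map(lambda n: lookup.value[n % 2]).take(3))

# Accumulator: executors add to it; only the driver reads the result.
acc = sc.accumulator(0)
rdd.foreach(lambda n: acc.add(n))
print(acc.value)                                # 45
```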
II) PYSPARK - RDD COMPUTATION
Operations on an RDD
Directed Acyclic Graph (DAG)
RDD Actions and Transformations
RDD computation
Steps in RDD computation
RDD persistence
Persistence features
Persistence Options:
1) MEMORY_ONLY
2) MEMORY_ONLY_SER
3) MEMORY_AND_DISK
4) MEMORY_AND_DISK_SER
5) DISK_ONLY
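A short persistence sketch using one of the storage levels above via pyspark.StorageLevel (local-mode example; the data is synthetic):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

expensive = sc.parallelize(range(1_000_000)).map(lambda n: n * n)

# Persist the computed partitions so later actions reuse them instead of
# recomputing the whole lineage; spill to disk if memory runs short.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

print(expensive.count())   # first action materializes and caches the RDD
print(expensive.sum())     # second action reads from the cache
expensive.unpersist()      # release the storage when finished
```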
III) PYSPARK - CORE COMPUTING
Fault tolerance model in Spark
Different ways of creating an RDD
Word Count Example
Creating Spark objects (RDDs) from Python objects (lists)
Increasing the number of partitions
Aggregations Over Structured Data:
reduceByKey()
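The classic word-count example over an RDD, using reduceByKey() for the aggregation (the input path is hypothetical):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

lines = sc.textFile("file:///tmp/input.txt")    # hypothetical input path

counts = (lines
          .flatMap(lambda line: line.split())   # one record per word
          .map(lambda word: (word, 1))          # pair RDD of (word, 1)
          .reduceByKey(lambda a, b: a + b))     # aggregate counts per key

for word, n in counts.take(10):
    print(word, n)
```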
IV) GROUPINGS AND AGGREGATIONS
i) Single Grouping and Single Aggregation
ii) Single Grouping and multiple Aggregation
iii) multi Grouping and Single Aggregation
iv) Multi Grouping and Multi Aggregation
Differences between reduceByKey() and groupByKey() (see the sketch after this list)
Process of groupByKey
Process of reduceByKey
reduce() function
Various Transformations
Various Built-in Functions
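The sketch referenced above, contrasting reduceByKey() and groupByKey() on the same pair RDD: reduceByKey() combines values on each partition before the shuffle, while groupByKey() ships every raw value across the network first (the sample data is made up):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

sales = sc.parallelize([("tv", 300), ("tv", 250), ("phone", 700), ("phone", 650)])

# reduceByKey: partial sums happen on each partition before the shuffle.
totals = sales.reduceByKey(lambda a, b: a + b)

# groupByKey: every raw value crosses the network, then is summed per key.
totals_slow = sales.groupByKey().mapValues(sum)

print(sorted(totals.collect()))       # [('phone', 1350), ('tv', 550)]
print(sorted(totals_slow.collect()))  # same result, more shuffle traffic
```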
V) Various Actions and Transformations:
countByKey()
countByValue()
sortByKey()
zip()
union()
distinct()
Various count aggregations
Joins
-inner join
-outer join
cartesian()
cogroup()
Other actions and transformations
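A quick illustration of the join-family operations above on pair RDDs (the employee/department data is invented for the example):

```python
from pyspark.sql import SparkSession

sc = SparkSession.builder.master("local[*]").getOrCreate().sparkContext

emps = sc.parallelize([(1, "Asha"), (2, "Ravi"), (3, "Meena")])
depts = sc.parallelize([(1, "Sales"), (2, "HR")])

print(sorted(emps.join(depts).collect()))           # inner join: keys 1 and 2
print(sorted(emps.leftOuterJoin(depts).collect()))  # keeps key 3, pads with None

# cogroup: per key, a pair of iterables (one from each RDD).
print(sorted(emps.cogroup(depts)
                 .mapValues(lambda vs: (list(vs[0]), list(vs[1])))
                 .collect()))
```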
VI) PySpark SQL - DataFrame
Introduction
Making data Structured
Case Classes
ways to extract case class objects
1) using function
2) using map with multiple expressions
3) using map with single expression
SQLContext
Data Frames API
DataSet API
RDD vs DataFrame vs DataSet
PySpark – Create a DataFrame
PySpark – Create an empty DataFrame
PySpark – Convert RDD to DataFrame
PySpark – Convert DataFrame to Pandas
PySpark – show()
PySpark – StructType & StructField
PySpark – Row Class
PySpark – Column Class
PySpark – select()
PySpark – collect()
PySpark – withColumn()
PySpark – withColumnRenamed()
PySpark – where() & filter()
PySpark – drop() & dropDuplicates()
PySpark – orderBy() and sort()
PySpark – groupBy()
PySpark – join()
PySpark – union() & unionAll()
PySpark – unionByName()
PySpark – UDF (User Defined Function)
PySpark – map()
PySpark – flatMap()
PySpark – foreach()
PySpark – sample() vs sampleBy()
PySpark – fillna() & fill()
PySpark – pivot() (Row to Column)
PySpark – partitionBy()
PySpark – ArrayType Column (Array)
PySpark – MapType (Map/Dict)
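A sketch exercising several of the DataFrame operations above: creation from Row objects, select(), where(), withColumn(), and groupBy() (the sample records are invented):

```python
from pyspark.sql import SparkSession, Row, functions as F

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

# Create a DataFrame from Row objects; the schema is inferred.
df = spark.createDataFrame([
    Row(name="Asha", dept="Sales", salary=50000),
    Row(name="Ravi", dept="HR", salary=40000),
    Row(name="Meena", dept="Sales", salary=60000),
])
df.show()

# select / where / withColumn, chained in the DataFrame API.
(df.select("name", "dept", "salary")
   .where(F.col("salary") > 45000)
   .withColumn("bonus", F.col("salary") * 0.1)
   .show())

# groupBy with an aggregate and a column alias.
df.groupBy("dept").agg(F.sum("salary").alias("total_salary")).show()
```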
VII) PySpark SQL Functions
PySpark – Aggregate Functions
PySpark – Window Functions
PySpark – Date and Timestamp Functions
PySpark – JSON Functions
PySpark – Read & Write JSON file
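An example of window functions over a department partition, combining row_number(), rank(), and dense_rank() (sample data is invented):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [("Sales", "Asha", 60000), ("Sales", "Meena", 50000), ("HR", "Ravi", 40000)],
    ["dept", "name", "salary"])

# Rank rows within each department, highest salary first.
w = Window.partitionBy("dept").orderBy(F.desc("salary"))

(df.withColumn("row_number", F.row_number().over(w))
   .withColumn("rank", F.rank().over(w))
   .withColumn("dense_rank", F.dense_rank().over(w))
   .show())
```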
VIII) PySpark Built-In Functions
PySpark – when()
PySpark – expr()
PySpark – lit()
PySpark – split()
PySpark – concat_ws()
PySpark – substring()
PySpark – translate()
PySpark – regexp_replace()
PySpark – overlay()
PySpark – to_timestamp()
PySpark – to_date()
PySpark – date_format()
PySpark – datediff()
PySpark – months_between()
PySpark – explode()
PySpark – array_contains()
PySpark – array()
PySpark – collect_list()
PySpark – collect_set()
PySpark – create_map()
PySpark – map_keys()
PySpark – map_values()
PySpark – struct()
PySpark – countDistinct()
PySpark – sum(), avg()
PySpark – row_number()
PySpark – rank()
PySpark – dense_rank()
PySpark – percent_rank()
PySpark – typedLit()
PySpark – from_json()
PySpark – to_json()
PySpark – json_tuple()
PySpark – get_json_object()
PySpark – schema_of_json()
Working Examples
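One small working example combining several of the built-in functions above: split(), explode(), when()/otherwise() with lit() literals, and collect_list() (the sample skills data is invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame([("Asha", "python,sql"), ("Ravi", "java")],
                           ["name", "skills"])

# split() the CSV string into an array, then explode() one row per element.
exploded = df.withColumn("skill", F.explode(F.split("skills", ",")))

# when()/otherwise() for conditional values, lit() for constants.
exploded.withColumn(
    "level",
    F.when(F.col("skill") == "python", F.lit("core")).otherwise(F.lit("extra"))
).show()

# collect_list() gathers the exploded values back into an array per name.
exploded.groupBy("name").agg(
    F.collect_list("skill").alias("skill_list")
).show(truncate=False)
```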
IX) PySpark External Sources
Working with SQL statements
Spark and Hive Integration
Spark and MySQL Integration
Working with CSV
Working with JSON
Transformations and actions on dataframes
Narrow, wide transformations
Addition of new columns, dropping columns, renaming columns
Addition of new rows, dropping rows
Handling nulls
Joins
Window function
Writing data back to External sources
Creation of tables from DataFrames (internal tables, temporary tables)
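A sketch of reading from and writing to external sources: CSV and JSON files, a JDBC read from MySQL, and SQL over a temporary view created from a DataFrame (all paths, hosts, credentials, and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# CSV with a header row and inferred schema (path is a placeholder).
sales = (spark.read.option("header", True)
                   .option("inferSchema", True)
                   .csv("file:///tmp/sales.csv"))

# JSON file into a DataFrame (path is a placeholder).
events = spark.read.json("file:///tmp/events.json")

# JDBC read from MySQL (driver jar on the classpath; details are placeholders).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/shop")
          .option("dbtable", "orders")
          .option("user", "training")
          .option("password", "secret")
          .load())

# Temporary table from a DataFrame, then plain SQL against it.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales "
          "GROUP BY country").show()   # assumes country/amount columns

# Write back out, partitioned, overwriting any previous run.
sales.write.mode("overwrite").partitionBy("country").parquet("/tmp/sales_parquet")
```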
X) DEPLOYMENT MODES
Local Mode
Cluster Modes (Standalone, YARN)
XI) PYSPARK APPLICATION
Stages and Tasks
Driver and Executor
Building spark applications/pipelines
Deploying spark apps to cluster and tuning
Performance tuning
PySpark Streaming Concepts
Integration with Kafka
PySpark – MLlib
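Finally, a minimal structured-streaming sketch for the Kafka integration topic: a streaming word count over a Kafka topic. It requires the spark-sql-kafka package on the classpath, and the broker address and topic name are placeholders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

# Subscribe to a Kafka topic (broker and topic are placeholders).
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clicks")
          .load())

# Kafka delivers key/value as binary; cast the payload to a string.
lines = events.select(F.col("value").cast("string").alias("line"))

# Running word count over the stream.
counts = (lines.select(F.explode(F.split("line", " ")).alias("word"))
               .groupBy("word").count())

query = (counts.writeStream.outputMode("complete")
               .format("console").start())
query.awaitTermination()
```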