Training Information
Hadoop with PySpark, Linux
We are pleased to offer a comprehensive suite of training solutions tailored to meet your needs. Our services encompass both online and offline corporate training options, ensuring flexibility and accessibility for your team's professional development.
Course Content
Syllabus:
BIG DATA HADOOP
I: INTRODUCTION
What is Big Data?
What is Hadoop?
Need of Hadoop
Sources and Types of Data
Comparison with Other Technologies
Challenges with Big Data
i. Storage
ii. Processing
RDBMS vs Hadoop
Advantages of Hadoop
Hadoop Ecosystem Components
II: HDFS (Hadoop Distributed File System)
Features of HDFS
Name Node, Data Node, Blocks
Configuring Block Size
HDFS Architecture (5 Daemons)
i. Name Node
ii. Data Node
iii. Secondary Name node
iv. Job Tracker
v. Task Tracker
Metadata management
Storage and processing
Replication in Hadoop
Configuring Custom Replication
Fault Tolerance in Hadoop
HDFS Commands
III: MAP REDUCE
Map Reduce Architecture
Processing Daemons of Hadoop
Job Tracker (Roles and Responsibilities)
Task Tracker (Roles and Responsibilities)
Phases of Map Reduce
i) Mapper phase
ii) Reducer phase
Input split
Input split vs Block size
Partitioner in Map Reduce
Groupings and Aggregations
Data Types in Map Reduce
Map Reduce Programming Model
Driver Code
Mapper Code
Reducer Code
Programming examples
File input formats
File output formats
Merging in Map Reduce
Speculative Execution Model
Speculative Job
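The mapper and reducer phases listed in this module can be tried out without a cluster. The following is a minimal sketch in plain Python, assuming an in-memory list of lines stands in for the input splits. The function names (`mapper`, `shuffle`, `reducer`, `word_count`) are hypothetical and are not part of the Hadoop API; the sketch only mirrors the flow of a word count job.

```python
from collections import defaultdict


def mapper(line):
    """Mapper phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reducer phase: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Driver: run the map, shuffle and reduce steps in sequence."""
    mapped = [pair for line in lines for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Assumed sample input; each string plays the role of one record.
sample_lines = ["big data hadoop", "hadoop map reduce", "big data"]
counts = word_count(sample_lines)
print(counts)
```

In a real job the framework distributes the mapper calls across nodes and performs the shuffle itself; here all three steps run in one process purely for illustration.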
IV: SQOOP (SQL + HADOOP)
Introduction to Sqoop
SQOOP Import
SQOOP Export
Importing Data From RDBMS to HDFS
Importing Data From RDBMS to HIVE
Importing Data From RDBMS to HBASE
Exporting From HBASE to RDBMS
Exporting From HIVE to RDBMS
Exporting From HDFS to RDBMS
Transformations While Importing / Exporting
Filtering data while importing
Vertical and Horizontal merging while import
Working with delimiters while importing
Groupings and Aggregations while import
Incremental import
Examples and operations
Defining SQOOP Jobs
V: YARN
Introduction
Speculative Execution, Speculative Job and Speculative Task
Comparison of Hadoop 1.xx with Hadoop 2.xx
Comparison with previous versions
YARN Architecture Components
i. Resource Manager
ii. Application Master
iii. Node Manager
iv. Application Manager
v. Resource Scheduler
vi. Job History Server
vii. Container
VI: NOSQL
What is “Not only SQL”
NOSQL Advantages
What is the problem with RDBMS for Large Data Scaling Systems
Types of NOSQL & Purposes
Key Value Store
Columnar Store
Document Store
Graph Store
Introduction to Cassandra – NOSQL Database
Introduction to MongoDB and CouchDB Database
Integration of NOSQL Databases with Hadoop
VII: HBASE
Introduction to big table
What is NOSQL and Columnar Store Database
HBASE Introduction
Hbase use cases
Hbase basics
Column families
Scans
Hbase Architecture
Map Reduce Over Hbase
Hbase data Modeling
Hbase Schema design
Hbase CRUD operations
Hive & Hbase Integration
Hbase storage handlers
VIII: HIVE
Introduction
Hive Architecture
Hive Metastore
Hive Query Language
Difference between HQL and SQL
Hive Built in Functions
Loading Data From Local Files To Hive Tables
Loading Data From Hdfs Files To Hive Tables
Table Types
Inner Tables
External Tables
Hive Working with unstructured data
Hive Working With Xml Data
Hive Working With Json Data
Hive Working With Urls And Weblog Data
Hive Unions
Hive Joins
Multi Table / File Inserts
Inserting Into Local Files
Inserting Into Hdfs Files
Hive UDF (user defined functions)
Hive UDAF (user defined aggregate functions)
Hive UDTF (user defined table generating functions)
Partitioned Tables
Non-Partitioned Tables
Multi-column Partitioning
Dynamic Partitions In Hive
Performance Tuning mechanism
Bucketing in hive
Indexing in Hive
Hive Examples
Hive & Hbase Integration
PYSPARK
I) PYSPARK INTRODUCTION
What is Apache Spark?
Why PySpark?
Need for PySpark
Spark: Python vs Scala
PySpark features
Real-life usage of PySpark
PySpark Web/Application
PySpark – SparkSession
PySpark – SparkContext
PySpark – RDD
PySpark – Parallelize
PySpark – repartition() vs coalesce()
PySpark – Broadcast Variables
PySpark – Accumulator
II) PYSPARK - RDD COMPUTATION
Operations on an RDD
Directed Acyclic Graph (DAG)
RDD Actions and Transformations
RDD computation
Steps in RDD computation
RDD persistence
Persistence features
Persistence Options:
1) MEMORY_ONLY
2) MEMORY_ONLY_SER
3) DISK_ONLY
4) MEMORY_AND_DISK_SER
5) MEMORY_AND_DISK
III) PYSPARK - CORE COMPUTING
Fault Tolerance model in Spark
Different ways of creating an RDD
Word Count Example
Creating Spark objects (RDDs) from Scala objects (lists)
Increasing the number of partitions
Aggregations Over Structured Data:
reduceByKey()
IV) GROUPINGS AND AGGREGATIONS
i) Single Grouping and Single Aggregation
ii) Single Grouping and Multiple Aggregation
iii) Multi Grouping and Single Aggregation
iv) Multi Grouping and Multi Aggregation
Differences between reduceByKey() and groupByKey()
Process of groupByKey
Process of reduceByKey
reduce() function
Various Transformations
Various Built-in Functions
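The difference between groupByKey() and reduceByKey() covered in this module can be previewed with ordinary Python collections. This is only a plain-Python simulation of the two semantics, not PySpark code: `group_by_key` and `reduce_by_key` are hypothetical helper names, and the `sales` data is made up for illustration. In Spark itself, reduceByKey() combines values for each key within a partition before the shuffle, while groupByKey() ships every value across the network first.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect every value for each key; aggregation happens afterwards."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_by_key(pairs, func):
    """Fold values into one running result per key as they arrive."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Assumed sample (region, amount) records.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1)]

grouped = group_by_key(sales)
totals_from_groups = {key: sum(values) for key, values in grouped.items()}
totals_direct = reduce_by_key(sales, lambda a, b: a + b)

print(grouped)
print(totals_direct)
```

Both routes give the same totals; the reduce-style version never holds the full list of values for a key, which is the reason it is usually preferred for aggregations.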
V) Various Actions and Transformations:
countByKey()
countByValue()
sortByKey()
zip()
union()
distinct()
Various count aggregation
Joins
-inner join
-outer join
cartesian()
cogroup()
Other actions and transformations
VI) PySpark SQL - DataFrame
Introduction
Making data Structured
Case Classes
ways to extract case class objects
1) using function
2) using map with multiple expressions
3) using map with single expression
Sql Context
Data Frames API
DataSet API
RDD vs DataFrame vs DataSet
PySpark – Create a DataFrame
PySpark – Create an empty DataFrame
PySpark – Convert RDD to DataFrame
PySpark – Convert DataFrame to Pandas
PySpark – show()
PySpark – StructType & StructField
PySpark – Row Class
PySpark – Column Class
PySpark – select()
PySpark – collect()
PySpark – withColumn()
PySpark – withColumnRenamed()
PySpark – where() & filter()
PySpark – drop() & dropDuplicates()
PySpark – orderBy() and sort()
PySpark – groupBy()
PySpark – join()
PySpark – union() & unionAll()
PySpark – unionByName()
PySpark – UDF (User Defined Function)
PySpark – map()
PySpark – flatMap()
PySpark – foreach()
PySpark – sample() vs sampleBy()
PySpark – fillna() & fill()
PySpark – pivot() (Row to Column)
PySpark – partitionBy()
PySpark – ArrayType Column (Array)
PySpark – MapType (Map/Dict)
VII) PySpark SQL Functions
PySpark – Aggregate Functions
PySpark – Window Functions
PySpark – Date and Timestamp Functions
PySpark – JSON Functions
PySpark – Read & Write JSON file
VIII) PySpark Built-In Functions
PySpark – when()
PySpark – expr()
PySpark – lit()
PySpark – split()
PySpark – concat_ws()
PySpark – substring()
PySpark – translate()
PySpark – regexp_replace()
PySpark – overlay()
PySpark – to_timestamp()
PySpark – to_date()
PySpark – date_format()
PySpark – datediff()
PySpark – months_between()
PySpark – explode()
PySpark – array_contains()
PySpark – array()
PySpark – collect_list()
PySpark – collect_set()
PySpark – create_map()
PySpark – map_keys()
PySpark – map_values()
PySpark – struct()
PySpark – countDistinct()
PySpark – sum(), avg()
PySpark – row_number()
PySpark – rank()
PySpark – dense_rank()
PySpark – percent_rank()
PySpark – typedLit()
PySpark – from_json()
PySpark – to_json()
PySpark – json_tuple()
PySpark – get_json_object()
PySpark – schema_of_json()
Working Examples
IX) PySpark External Sources
Working with SQL statements
Spark and Hive Integration
Spark and MySQL Integration
Working with CSV
Working with JSON
Transformations and actions on dataframes
Narrow, wide transformations
Addition of new columns, dropping of columns, renaming columns
Addition of new rows, dropping rows
Handling nulls
Joins
Window function
Writing data back to External sources
Creation of tables from DataFrames (Internal tables, Temporary tables)
X) DEPLOYMENT MODES
Local Mode
Cluster Modes (Standalone, YARN)
XI) PYSPARK APPLICATION
Stages and Tasks
Driver and Executor
Building spark applications/pipelines
Deploying spark apps to cluster and tuning
Performance tuning
PySpark Streaming Concepts
Integration with Kafka
PySpark-mllib
PYTHON
1. Python Basics
What is Python
Why Python?
History of python
Applications of Python
Features of Python
Advantages of Python
Versions of Python
Installation of Python
Flavors of Python
Comparison between various programming languages: C, Java and Python
2. Python Operations
Python Modes of Execution
Interactive mode of Execution
Batch mode of Execution
Python Editors and IDEs
Python Data Types
Python Constants
Python Variables
Comments in python
Output: print() function
input() function: accepting input
Type Conversion
type(), id() functions
Escape Sequences in Python
Strings in Python
String indices and slicing
3. Operators in Python
Arithmetic Operators
Comparison Operators
Logical Operators
Assignment Operators
Short Hand Assignment Operators
Bitwise Operators
Membership Operators
Identity Operators
4. Python IDEs
Pycharm IDE Installation
Working with Pycharm
Pycharm components
Installing Anaconda
What is Conda?
Anaconda Prompt
Anaconda Navigator
Jupyter Notebook
Jupyter Features
Spyder IDE
Spyder Features
Conda and PIP
5. Flow Control statements
Block/clause
Indentation in Python
Conditional Statements
if statement
if…else statement
if…elif…else statement
6. Looping Statements
while loop
while … else
for loop
range() in for loop
Nested for loop
Break statement
Continue statement
Pass statement
7. Strings in Python
Creating Strings
String indexing
String slicing
String Concatenation
String Comparison
String splitting and joining
Finding Sub Strings
String Case Change
Split strings
String methods
8. Collections in Python
Introduction
Lists
Tuples
Sets
Dictionaries
Operations on collections
Functions for collections
Methods of collection
Nested collections
Differences between list, tuple, set and dictionary
9. Python Lists
List properties
List Creation
List indexing and slicing
List Operations
List addresses
List functions
Different ways of creating lists
Nested Lists
List modification
List insertion and deletion
List Methods
10. Python Tuples
Tuple properties
Tuple Creation
Tuple indexing and slicing
Different ways of creating tuples
Tuple Operations
Tuple Addresses
Tuple Functions
Nested Tuples
Tuple Methods
Differences between List and Tuple
11. Python Sets
Set properties
Set Creation
Set Operations
Set Functions
Set Addresses
Set Mathematical Operations
Set Methods
Insertion and Deletion operation
12. Python Dictionary
Dictionary properties
Dictionary Creation
Dictionary Operations
Dictionary Addresses
Nested Dictionaries
Dictionary Methods
Insertion and Deletion of elements
Differences between list, tuple, set and dictionary
13. Functions in Python
Defining a function
Calling a function
Properties of Function
Examples of Functions
Categories of Functions
Argument types
default arguments
non-default arguments
keyword arguments
non keyword arguments
Variable Length Arguments
Variables scope
Call by value and Call by Reference
Passing collections to function
Local and Global variables
Recursive Function
Boolean Function
Passing functions to function
Anonymous or Lambda function
filter() and map() functions
reduce() function
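As a preview of the lambda, filter(), map() and reduce() topics that close this module, here is a short sketch using only the standard library. The `numbers` list is an assumed sample; note that in Python 3 `reduce` lives in `functools` rather than being a built-in.

```python
from functools import reduce

# Assumed sample data.
numbers = [1, 2, 3, 4, 5, 6]

# filter() keeps the items for which the lambda returns True.
evens = list(filter(lambda n: n % 2 == 0, numbers))

# map() applies the lambda to every item.
squares = list(map(lambda n: n * n, evens))

# reduce() folds the sequence into one value, starting from 0.
total = reduce(lambda acc, n: acc + n, squares, 0)

print(evens, squares, total)
```

The same pipeline can be written as a generator expression, `sum(n * n for n in numbers if n % 2 == 0)`, which many Python programmers find easier to read.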
14. Modules in Python
What is a module?
Different types of module
Creating user defined module
Setting path
The import statement
Normal Import
From … Import
Module Aliases
Reloading a module
Dir function
Working with standard modules: math, random, datetime and os
15. Packages
Introduction to packages
Defining packages
Importing from packages
__init__.py file
Defining sub packages
Importing from sub packages
16. Errors and Exception Handling
Types of errors
Compile-Time Errors
Run-Time Errors
What is Exception?
Need of Exception handling
Predefined Exceptions
try, except, finally blocks
Nested blocks
Handling Multiple Exceptions
User defined Exceptions
Raise statement
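The try, except and finally blocks, multiple exception handling, user-defined exceptions and the raise statement from this module fit into one small example. This is a minimal sketch: `InsufficientFundsError`, `withdraw` and `safe_withdraw` are hypothetical names chosen for illustration.

```python
class InsufficientFundsError(Exception):
    """User-defined exception (hypothetical name, for illustration only)."""


def withdraw(balance, amount):
    """Return the new balance, raising if the amount exceeds the balance."""
    if amount > balance:
        raise InsufficientFundsError(f"cannot withdraw {amount} from {balance}")
    return balance - amount


def safe_withdraw(balance, amount):
    """Handle two different exception types and always record completion."""
    log = []
    try:
        balance = withdraw(balance, amount)
    except InsufficientFundsError as exc:
        log.append(f"error: {exc}")
    except TypeError:
        # Raised by the comparison in withdraw() when amount is not a number.
        log.append("error: amount must be a number")
    finally:
        # The finally block runs whether or not an exception occurred.
        log.append("done")
    return balance, log


print(safe_withdraw(100, 30))
print(safe_withdraw(100, 500))
print(safe_withdraw(100, "x"))
```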
17. File Handling
Introduction
Types of Files in Python
Opening a file
Closing a file
Writing data to files
tell() and seek() methods
Reading data from files
Appending data to files
with open statement
Various functions
18. OOPs Concepts
OOPS Features
Encapsulation
Abstraction
Class
Object
Static and non static variables
Defining methods
Difference between functions & methods
Constructors
Parameterized Constructors
Built-in attributes
Object Reference count
Destructor
Garbage Collection
Inheritance
Types of Inheritances
Object class
Polymorphism
Overriding
super() function
19. Regular Expressions
What is regular expression?
Special characters
Forming regular expression
Compiling regular expressions
Grouping
findall() function
finditer() function
sub() function
match() function
search() function
Matching vs searching
Splitting a string
Replacing text
validations
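The matching vs searching distinction and the other `re` functions listed in this module can be seen side by side in a short sketch. The pattern and the `text` string are assumed samples; only standard-library `re` calls are used.

```python
import re

# Compile once, reuse many times; the parentheses define three groups.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Assumed sample text.
text = "Batch starts 2024-01-15 and ends 2024-03-30"

# match() only succeeds at the very start of the string, so this is None.
match_result = date_pattern.match(text)

# search() scans the whole string and returns the first hit.
search_result = date_pattern.search(text)

# findall() returns one tuple of groups per hit.
all_dates = date_pattern.findall(text)

# sub() replaces every hit; split() cuts the string on a pattern.
masked = date_pattern.sub("YYYY-MM-DD", text)
parts = re.split(r"\s+and\s+", text)

print(match_result, search_result.group(0), all_dates)
```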
20. Database Access
Introduction
Installing MySQL database
Creating database users
Installing Oracle Python modules
Establishing connection with MySQL
Closing database connections
Connection object
Cursor object
Executing SQL queries
Retrieving data from Database.
Executing SQL queries using bind variables
Transaction Management
Handling errors
21. Python Date and Time
How to Use Date & DateTime Class
Time and date Objects
Calendar in Python
The Time Module
Python Calendar Module
22. Operating System Module
Introduction
getcwd
listdir
chdir
mkdir
rename file/dir
remove file/dir
rmtree()
Os help
Os operations
23. Advanced concepts
Python Iterator
Python Generator
Python closure
Python Decorators
Web Scraping
PIP
Working with CSV files
Working with XML files
Working with JSON files
Debugging
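The generator, closure and decorator topics at the start of this module are closely related, so a single sketch can show all three. The names `countdown`, `make_multiplier`, `count_calls` and `add` are hypothetical and exist only for this illustration.

```python
import functools


def countdown(n):
    """Generator: yields n, n-1, ..., 1 one value at a time."""
    while n > 0:
        yield n
        n -= 1


def make_multiplier(factor):
    """Closure: the inner function remembers `factor` after this call returns."""
    def multiply(x):
        return x * factor
    return multiply


def count_calls(func):
    """Decorator: wraps func and counts how many times it is called."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def add(a, b):
    return a + b


values = list(countdown(3))
double = make_multiplier(2)
results = [add(1, 2), add(3, 4)]

print(values, double(21), results, add.calls)
```

A decorator is itself a closure: `wrapper` keeps a reference to `func` in the same way `multiply` keeps a reference to `factor`.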
24. GUI Programming (tkinter)
Introduction
Components and events
Root window
Labels
Fonts and colors
Buttons, checkbox
Label widget
Message widget
Text widget
Radio button
image
25. Excel Workbook
Installing and working with XlsxWriter
Creating Excel Workbook
Inserting into Excel sheet
Inserting data into multiple Excel sheets
Creating headers
Installing and working with xlrd module
Reading a specific cell or row or column
Reading specific rows and columns
26. Data Analytics
Introduction
pandas module
Numpy module
Matplotlib module
Working Examples
27. Introduction to Data Science
Machine Learning Introduction
Datasets
Supervised / Unsupervised Learning
Statistical Analysis
Data Analysis
Univariate / multivariate analysis
Correlation Analysis
Algorithm types
Applications
28. Python Pandas
Introduction to Pandas
Creating Pandas Series
Creating Data Frames
Pandas Data Frames from dictionaries
Pandas Data Frames from list
Pandas Data Frames from series
Pandas Data Frames from CSV, Excel
Pandas Data Frames from JSON
Pandas Data Frames from Databases
Pandas Data Functionality
Pandas Timedelta
Creating Data Frames from Timedelta
Pandas Groupings and Aggregations
Converting Data Frames from list
Creating Functions
Converting Different Formats
Pandas and Matplotlib
Pandas use cases
29. Python Numpy
Introduction to Numpy
Numpy Arrays
Numpy Array Indexing
2-D and 3-D Arrays
Numpy Mathematical operations
Numpy Flattening and reshaping
Numpy Horizontal and Vertical Stack
Numpy linspace and arange
Numpy asarray and Random numbers
Numpy iterations and Transpose
Numpy Array Manipulation
Numpy and matplotlib
Numpy Linear Algebra
Numpy String Functions
Numpy operations and use cases
Numpy Working Examples
30. Python Matplotlib
Introduction to matplotlib
Installing matplotlib
Generating graphs
Normal plots
Generating Bar graphs
Histograms
Scatter plots
Stack plots
Pie plots
Matplotlib working examples