Big Data Masters


Start Date: 26th June 2021
Class Timings: 10:30 AM - 12:30 PM IST Saturday and Sunday
Doubt Clearing Session: 8 PM IST Every Tuesday
4.20 (35 Reviews)
Language: English

Course Overview

Data is an essential asset for any organization, and every organization generates massive amounts of data in real time or in batches. This is where Big Data plays a vital role, irrespective of domain and industry. This course is designed to equip you to work with huge volumes of data: you will learn to build a Big Data Engine for your organization by implementing the various big data stacks used across the industry.

What you'll learn
  • 30+ Big Data Technologies
  • Big Data Engine Creation
  • Streaming and Batch Processing of Data
  • Various SQL Databases
  • Various NoSQL Databases
  • Real-Time Implementation
  • Spark
  • Hive
  • Talend
  • Informatica
  • Hadoop Distributions
  • Deployment
  • Databricks Implementation
Requirements
  • Minimum system requirement: Intel Core i3 processor or higher
  • Dedication

Course Curriculum

  • Why Is Data So Important?
  • Pre-Requisite – Data Scale
  • What Is Big Data?
  • Big Bank: Big Challenge
  • Common Problems
  • 3 Vs Of Big Data
  • Defining Big Data
  • Sources Of Data Flood
  • Exploding Data Problem
  • Redefining The Challenges Of Big Data
  • Possible Solutions: Scaling Up Vs. Scaling Out
  • Challenges Of Scaling Out
  • Solution For Data Explosion-Hadoop
  • Hadoop: Introduction
  • Hadoop In Layman's Terms
  • Hadoop Ecosystem
  • Evolutionary Features Of Hadoop
  • Hadoop Timeline
  • Why Learn Big Data Technologies?
  • Who Is Using Big Data?
  • HDFS: Introduction
  • Design Of HDFS
  • Why Hadoop Cluster?
  • HDFS Blocks
  • Components Of Hadoop 3
  • NameNode And Hadoop Cluster
  • Arrangement Of Racks
  • Arrangement Of Machines And Racks
  • Local FS And HDFS
  • NameNode
  • Checkpointing
  • Replica Placement
  • Benefits-Replica Placement And Rack Awareness
  • URI
  • URL And URN
  • HDFS Commands
  • Problems With HDFS In Hadoop 1.X
  • HDFS Federation
  • High Availability
  • Anatomy Of File Read From HDFS
  • Data Read Steps
  • Important Java Classes To Write Data To HDFS
  • Anatomy Of File Write To HDFS
  • Writing File To HDFS: Steps
  • Building Principles
  • InputSplit
  • InputSplit And Data Blocks – Difference
  • Why Is The Block Size 128 MB?
  • RecordReader
  • InputFormat
  • Default InputFormat: TextInputFormat
  • OutputFormat
  • Using A Different OutputFormat
  • Important Points
  • Partitioner
  • Using Partitioner
  • Map Only Job
  • Flow Of Operations In MapReduce
  • Serialization In MapReduce
  • Schedulers In YARN
  • FIFO Scheduler
  • Capacity Scheduler
  • Fair Scheduler
  • Differences Between Hadoop 1.X, Hadoop 2.X And Hadoop 3.X
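To make the MapReduce flow above concrete, here is a minimal word-count sketch in the Hadoop Streaming style: a mapper and a reducer that read stdin and emit tab-separated key/value pairs. The file name and the local test invocation are illustrative only, not part of the course material.

```python
# wordcount_streaming.py -- illustrative Hadoop Streaming style mapper/reducer.
# Local test: cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce
import sys

def mapper():
    # Emit one (word, 1) pair per token, tab-separated, as Hadoop Streaming expects.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so counts for the same word are adjacent.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```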
  • Introduction
  • Hive DDL
  • Demo: Databases.Ddl
  • Demo: Tables.Ddl
  • Hive Views
  • Demo: Views.Ddl
  • Architecture
  • Primary Data Types
  • Data Load
  • Demo: ImportExport.Dml
  • Demo: HiveQueries.Dml
  • Demo: Explain.Hql
  • Table Types
  • Demo: ExternalTable.Ddl
  • Complex Data Types
  • Demo: Working With Complex Datatypes
  • Hive Variables
  • Demo: Working With Hive Variables
  • Hive Variables And Execution Customisation
  • Working With Arrays
  • Sort By And Order By
  • Distribute By And Cluster By
  • Partitioning
  • Static And Dynamic Partitioning
  • Bucketing Vs Partitioning
  • Joins And Types
  • Bucket-Map Join
  • Sort-Merge-Bucket-Map Join
  • Left Semi Join
  • Demo: Join Optimisations
  • Input Formats In Hive
  • Sequence Files In Hive
  • RC File In Hive
  • File Formats In Hive
  • ORC Files In Hive
  • Inline Index In ORC Files
  • ORC File Configurations In Hive
  • SerDe In Hive
  • Demo: CSVSerDe
  • JSONSerDe
  • RegexSerDe
  • Analytic And Windowing In Hive
  • Demo: Analytics.Hql
  • HCatalog In Hive
  • Demo: Using_HCatalog
  • Accessing Hive With JDBC
  • Demo: HiveQueries.Java
  • HiveServer2 And Beeline
  • Demo: Beeline
  • UDF In Hive
  • Demo: ToUpper.Java And Working_with_UDF
  • Optimizations In Hive
  • Demo: Optimizations
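As a taste of the Hive module above, here is a minimal sketch of querying Hive from Python. It assumes a running HiveServer2 and the PyHive package; the course demos themselves use HiveQL scripts, Beeline, and JDBC, and the host, port, and table names here are placeholders.

```python
# hive_query_sketch.py -- illustrative only; assumes HiveServer2 at localhost:10000
# and the PyHive client installed (pip install "pyhive[hive]").
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, database="default")
cur = conn.cursor()

# Create a partitioned ORC table and run an aggregate query (names are placeholders).
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
    PARTITIONED BY (sale_date STRING)
    STORED AS ORC
""")
cur.execute("SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```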
  • Challenges With Traditional RDBMS
  • Features Of NoSQL Databases
  • NoSQL Database Types
  • CAP Theorem
  • What Are HBase Regions?
  • HBase HMaster And ZooKeeper
  • HBase First Read
  • HBase Meta Table
  • Region Split
  • Apache HBase Architecture Benefits
  • HBase Vs. RDBMS
  • Shell Commands
  • Sqoop Architecture
  • Sqoop Features
  • Sqoop Hands On
  • Python Core
  • Introduction to Python and comparison with other programming languages
  • Installation of the Anaconda Distribution and other Python IDEs
  • Python Objects, Numbers & Booleans, Strings
  • Container objects, Mutability of objects
  • Operators: Arithmetic, Bitwise, Comparison and Assignment operators; Operator Precedence and Associativity
  • Conditions (if-else, if-elif-else) and Loops (while, for)
  • Break and Continue statements and the Range function
  • String Objects And Collections
  • String object basics
  • String methods
  • Splitting and Joining Strings
  • String format functions
  • List object basics
  • List as stack and Queues
  • List comprehensions
  • Tuples, Sets, Dictionaries and Functions
  • Tuple, Set and Dictionary object basics, Dictionary object methods, Dictionary view objects
  • Function basics, Parameter passing, Iterators and Generator functions
  • Lambda functions
  • Map, Reduce, Filter functions
  • OOPs Concepts and Working With Files
  • OOPS basic concepts
  • Creating classes and objects; Inheritance
  • Multiple Inheritance
  • Working with files
  • Reading and writing files
  • Buffered read and write
  • Other File methods
  • Exception Handling and Database Programming
  • Using Standard Module
  • Creating new modules
  • Exception handling with try/except
  • Creating, inserting into and retrieving from tables
  • Updating and deleting the data
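The following short snippet illustrates the kind of exercises in this Python module: comprehensions, lambdas with map/filter, file handling, and exception handling around basic database calls. It uses the standard-library sqlite3 module purely to keep the sketch self-contained; the course's database programming is done against MySQL.

```python
# python_core_sketch.py -- illustrative exercises from the Python module.
import sqlite3

# List comprehension and lambda with map/filter over a container object.
nums = [1, 2, 3, 4, 5]
squares = [n * n for n in nums]                   # list comprehension
evens = list(filter(lambda n: n % 2 == 0, nums))  # filter with a lambda
doubled = list(map(lambda n: n * 2, nums))        # map with a lambda
print(squares, evens, doubled)

# File handling: write a small text file and read it back.
with open("demo.txt", "w") as f:
    f.write("hello big data\n")
with open("demo.txt") as f:
    print(f.read().strip())

# Exception handling around simple database programming (sqlite3 keeps this
# self-contained; the course itself uses MySQL).
conn = None
try:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", (1, "Asha"))
    for row in conn.execute("SELECT id, name FROM users"):
        print(row)
except sqlite3.Error as exc:
    print(f"database error: {exc}")
finally:
    if conn is not None:
        conn.close()
```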
  • Installing and configuring MySQL
  • Install and Configure MySQL Client
  • DDL - CREATE DATABASE/TABLE, DROP, ALTER, etc.
  • DML - INSERT, DELETE, UPDATE, MERGE, etc.
  • DQL - SELECT, etc.
  • JOINS - One-to-Many, Many-to-Many
  • DISTINCT
  • ORDER BY
  • LIMIT
  • WILD CARDS
  • LOGICAL OPERATORS - LIKE, EQUAL, AND, OR, etc.
  • STRING Functions
  • DATE Functions
  • MATH Functions
  • COUNT, MIN and MAX
  • SUM
  • AVG
  • LAG and LEAD function Examples
  • Top N Analysis
  • ROW_NUMBER
  • RANK AND DENSE_RANK
  • CASE WHEN
  • PIVOT
  • LISTAGG
  • UNION
  • Sub-Queries
  • EXISTS
  • NOT EXISTS
  • WITH CLAUSE
  • Recursive WITH & CTE
  • Regular Expressions in SQL
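To illustrate the analytic-SQL topics above (ROW_NUMBER, RANK, Top N analysis), here is a self-contained sketch. It runs on Python's built-in sqlite3 (which supports window functions from SQLite 3.25 onward) only so it works anywhere; the course exercises themselves run against MySQL, and the table and data are made up for the example.

```python
# topn_window_sketch.py -- Top-N analysis with ROW_NUMBER and RANK.
# Uses sqlite3 for a self-contained demo; the course exercises run on MySQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, rep TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', 'Asha', 120), ('North', 'Ravi', 200),
        ('North', 'Meena', 90), ('South', 'John', 150),
        ('South', 'Lata', 150), ('South', 'Arun', 80);
""")

# Top 2 reps per region by amount; RANK is included to show how ties behave.
query = """
    SELECT region, rep, amount, rn, rnk FROM (
        SELECT region, rep, amount,
               ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn,
               RANK()       OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
        FROM sales
    ) WHERE rn <= 2
    ORDER BY region, rn
"""
for row in conn.execute(query):
    print(row)
conn.close()
```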
  • Cassandra Introduction
  • Cassandra Installation in local system
  • DATASTAX Cassandra setup
  • Cassandra Architecture
  • Cassandra Queries
  • MongoDB Introduction
  • MongoDB Compass Setup
  • MongoDB Atlas Setup
  • MongoDB Architecture
  • MongoDB Queries
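A brief sketch of the kind of MongoDB queries covered in this module, using the pymongo driver. It assumes a MongoDB server on localhost (or an Atlas connection string in its place), and the database and collection names are placeholders.

```python
# mongo_sketch.py -- illustrative MongoDB CRUD; assumes pymongo is installed
# and a MongoDB server at localhost:27017 (swap in an Atlas URI as needed).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]
orders = db["orders"]

# Insert a few documents and query them back with a filter and a projection.
orders.insert_many([
    {"customer": "Asha", "total": 250, "status": "shipped"},
    {"customer": "Ravi", "total": 90,  "status": "pending"},
])
for doc in orders.find({"status": "shipped"}, {"_id": 0}):
    print(doc)

# Update and delete, mirroring the query exercises in this module.
orders.update_one({"customer": "Ravi"}, {"$set": {"status": "shipped"}})
orders.delete_many({"total": {"$lt": 100}})
client.close()
```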
  • Introduction To Apache Spark
  • Map Reduce Limitations
  • RDD's
  • Spark Context - SQLContext And HiveContext
  • Programming With RDD's
  • Creating RDD's From Text-Files
  • Transformations And Actions
  • How Does Spark Execution Work
  • RDD API's - Filter
  • FlatMap
  • Fold
  • Foreach
  • Glom
  • GroupBy
  • Map
  • ReduceByKey
  • Zip
  • Persist
  • Unpersist
  • Read/Write From Storage
  • RDD Examples
  • RDD API's - Aggregate
  • Cartesian
  • Checkpoint
  • Coalesce
  • Repartition
  • Cogroup
  • CollectAsMap
  • CombineByKey
  • Count And CountApprox Functions
  • More RDD Examples
  • Schema - StructType
  • StructFields
  • DataType
  • DataFrame API's And Examples
  • Create Temporary Tables
  • SparkSQL
  • Spark Dataset
  • Parquet Vs Avro
  • Examples And Problem Solving On Real Data Using RDD And Converting The Same To Dataframe
  • Create A Spark Project
  • SBT / Maven
  • How Do Maven Repos Work
  • Accumulators
  • Broadcast Variables
  • Query Execution Plan
  • Internal Workings Of Spark
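As a preview of the Spark module, here is a compact PySpark sketch: a word count on the RDD API followed by converting the result to a DataFrame and querying it with Spark SQL. Scala with SBT/Maven is covered in the course as well; the input data here is inline just to keep the example runnable.

```python
# spark_rdd_df_sketch.py -- run with spark-submit; assumes a local Spark install.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-df-sketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# RDD API: flatMap, map, reduceByKey are lazy transformations; collect/show trigger execution.
lines = sc.parallelize(["spark makes big data simple", "big data big value"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Convert the RDD to a DataFrame and query it with Spark SQL.
df = counts.toDF(["word", "count"])
df.createOrReplaceTempView("word_counts")
spark.sql("SELECT word, count FROM word_counts ORDER BY count DESC").show()

spark.stop()
```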
  • Databricks Introduction
  • Databricks Setup
  • Databricks Integration with cloud
  • Databricks OPS Pipeline
  • Databricks in Production
  • Introduction To Kafka
  • Kafka Architecture
  • Kafka Key Concepts/Fundamentals
  • Overview Of ZooKeeper And Its Role In A Kafka Cluster
  • Cluster, Nodes, Brokers, Topics, Consumers, Producers, Logs, Partitions
  • Concept Of Consumer Groups
  • Leader & Follower Partition
  • Installing A One-Node Kafka Cluster Locally
  • Installing A Multi-Node Kafka Cluster Locally
  • Command Line Producer And Consumer
  • Replication Concept For Fault Tolerance
  • How Data Is Stored In Brokers
  • Log Segments, Message Offsets, Message Index
  • ISR List / Minimum ISR
  • Committed Vs Uncommitted Messages
  • Writing A Kafka Producer In Java
  • Writing A Kafka Consumer In Java
  • Scaling Up The Kafka Cluster
  • Achieving Exactly Once Semantics
  • Integrating Kafka With Spark Structured Streaming.
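The producer and consumer exercises in this module are written in Java; purely to keep the examples on this page in one language, here is an equivalent sketch with the kafka-python client. It assumes a broker at localhost:9092 and a topic named demo-topic, both of which are placeholders.

```python
# kafka_sketch.py -- minimal producer/consumer; assumes kafka-python is installed
# and a Kafka broker at localhost:9092 with a topic named "demo-topic".
from kafka import KafkaProducer, KafkaConsumer

# Produce a few messages, then flush so they are actually sent to the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-topic", value=f"event-{i}".encode("utf-8"))
producer.flush()
producer.close()

# Consume from the beginning of the topic as part of a consumer group.
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.partition, message.offset, message.value.decode("utf-8"))
consumer.close()
```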
  • Introduction To Airflow And Its Usage
  • What Is A Workflow?
  • Cron Job Creation Example
  • Airflow Additional Features
  • Airflow Architecture And Components
  • Airflow Installation Demo
  • DAGs - Creating A Simple HelloWorld DAG
  • Introduction To Tasks And Operators
  • Viewing The DAG In The UI - Graph View, Tree View, Log Viewing
  • Example Showcasing Bash Operator Usage
  • Setting Precedence Among Various Tasks
  • Lifecycle Of A Task - Understanding The Various Stages
  • Trigger Rules And Understanding Them With An Example
  • Airflow Artifacts - More On Operators
  • Writing Our Own Custom Operators
  • Walkthrough Of The Airflow UI
  • Connections To Various Datastores & Variables
  • Working With Connections, Understanding Sensors — Demo
  • Building an end-to-end customer-360 pipeline using Airflow: collecting data from various sources, processing it in Spark, loading the processed data into Hive, uploading the same to HBase, and notifying downstream applications that the pipeline succeeded (a simple DAG sketch follows below)
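Below is a minimal sketch of the kind of DAG written in this module, using the Airflow 2.x BashOperator. The DAG id, schedule, and bash commands are placeholders standing in for the real pipeline steps, not the course's actual customer-360 code.

```python
# hello_airflow.py -- a minimal DAG sketch (Airflow 2.x style); drop it into the
# dags/ folder of an Airflow installation. Task commands are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_world_pipeline",
    start_date=datetime(2021, 6, 26),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'collect data'")
    process = BashOperator(task_id="process", bash_command="echo 'run spark job'")
    load = BashOperator(task_id="load", bash_command="echo 'load into hive'")
    notify = BashOperator(task_id="notify", bash_command="echo 'pipeline succeeded'")

    # Set precedence among the tasks: extract -> process -> load -> notify.
    extract >> process >> load >> notify
```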
  • Kinds Of Processing
  • What is Real-time Processing
  • The Importance of Real-time Processing
  • Batch Processing vs Real-time Stream Processing
  • Spark Streaming Data
  • Spark Discretized Streams (DStreams)
  • Batch & Batch Interval
  • Is Spark A Real-time Streaming Engine?
  • Stream Processing In Spark
  • Transformed DStreams
  • Understanding Producer & Consumer
  • Practical On Real-time Processing
  • Stream Transformations
  • Stateless Transformations
  • Stateful Transformations
  • Window Operations
  • Batch Interval, Window Size, Sliding Interval
  • Practical On Stateless Transformations
  • Practical On Stateful Transformations
  • reduceByKey vs updateStateByKey
  • Working With Sliding Windows
  • reduceByKeyAndWindow Transformation
  • reduceByWindow Transformation
  • countByWindow Transformation
  • What Is Structured Streaming?
  • Requirement For Structured Streaming
  • Limitations Of Spark Streaming
  • Benefits Of Spark Structured Streaming
  • Practical: Word Count Example On Structured Streaming
  • Dynamically Setting The Shuffle Partitions
  • DataStreamWriter Output Modes
  • Datastream Output Modes - append, update & complete
  • Spark Streaming Graceful Shutdown
  • How Does Spark Streaming Code Execute Internally?
  • How A Job Is Converted To Micro-Batches
  • Trigger Point For Micro Batches
  • Types Of Triggers - Unspecified, Time Interval, One Time, Continuous
  • Types Of Data Sources - Socket Source, Rate Source, File Source, Kafka Source
  • Limitations Of The Socket Source
  • Practical On The File Data Source
  • Types Of Spark Streaming Output Data Options
  • Fault Tolerance And Exactly Once Guarantee
  • Understanding The Checkpoint Location
  • Stateful vs Stateless Transformations
  • Managed Stateful Operations vs UnManaged Stateful Operations
  • Types of Aggregations - Continuous Aggregations vs Time Bound Aggregations
  • Window Transformations
  • UpdateStateByKey, reduceByKeyAndWindow, reduceByWindow, countByWindow
  • Types of windows - Tumbling Time Window, Sliding Time Window
  • Dealing With Late Coming Records Using Watermark
  • State Store Cleanup
  • Calculating The Watermark Boundary
  • Streaming Joins
  • Streaming Dataframe to static dataframe
  • Streaming Dataframe With Another Streaming Dataframes
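Here is a minimal PySpark Structured Streaming sketch of the word-count example referenced above: it reads from a socket source and writes running counts to the console in complete output mode. The host and port are placeholders (locally you could pair it with `nc -lk 9999`).

```python
# structured_streaming_sketch.py -- run with spark-submit; assumes a text stream
# on localhost:9999 (e.g. started with `nc -lk 9999`).
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, col

spark = SparkSession.builder.appName("streaming-wordcount").master("local[*]").getOrCreate()

# Socket source: each line arriving on the socket becomes a row in `lines`.
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split lines into words and keep a running count (a stateful aggregation).
words = lines.select(explode(split(col("value"), " ")).alias("word"))
counts = words.groupBy("word").count()

# Console sink in complete mode: the full updated result table is printed each micro-batch.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")
         .start())
query.awaitTermination()
```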
  • AWS EMR (Elastic MapReduce):
  • What Is A VM (Virtual Machine)?
  • On-Premise vs Cloud Setup
  • Major Vendors Of Hadoop Distributions
  • Why Cloud & Big Data On The Cloud
  • Major Cloud Providers For Big Data
  • What Is EMR?
  • HDFS vs S3
  • What Is S3?
  • Important Instances In AWS
  • Kinds Of Nodes In A Cluster
  • Transient vs Long-Running Clusters
  • Running Spark Code On EMR
  • How To Track Your Job
  • Copying Files From S3 To Local
  • Zeppelin Notebook
  • Types Of EC2 Instances
  • How To Create A VM
  • What Is A Keypair?
  • Elastic IP
  • AWS Storage, Networking & CLI
  • Instance Store
  • S3 & EBS
  • Public IP vs Private IP
  • Network Switches
  • Security Groups
  • AWS Command Line Interface
  • Launching An EMR Cluster Using Advanced Options
  • AWS Athena
  • What is Athena?
  • When Do We Require Athena?
  • What Problems Does Athena Solve?
  • How Athena Works
  • Athena Pricing
  • Athena Practical Demonstration
  • How To Create A Table Manually On CSV Data Residing In S3
  • How To Minimize Data Scanning In Athena
  • How To Create A Partitioned Table On Parquet Files
  • Inferring Schemas Automatically Using AWS Glue
  • AWS Glue
  • What Is AWS Glue? An Introduction To Glue
  • Features Of Glue
  • AWS Glue Benefits
  • AWS Glue Terminology
  • Pointing To Specific Data Stores And Endpoints
  • Glue Data Catalog
  • Crawlers
  • Connecting To Your Data Store
  • Using Crawlers For Catalog Tables
  • Overview And Working Of Glue Jobs
  • Adding New Jobs In Glue
  • Triggering Jobs and Their Scheduling
  • AWS Redshift
  • Database vs Data Warehouse vs Data Lake
  • Introduction To Amazon Redshift
  • Benefits Of Amazon Redshift
  • Use Cases Of Amazon Redshift
  • Redshift Master-Slave Architecture
  • Types Of Nodes
  • Redshift Spectrum
  • Redshift Fault Tolerance
  • Redshift Sort Keys
  • Redshift Distribution Styles
  • Practical Demonstration
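To give a feel for driving these AWS services from code, here is a hedged boto3 sketch that uploads a file to S3 and starts an Athena query. The bucket, database, table, region, and output location are placeholders, and AWS credentials are assumed to be configured already (for example via `aws configure`).

```python
# aws_sketch.py -- illustrative boto3 usage; assumes AWS credentials are already
# configured. Bucket, database and paths are placeholders.
import boto3

# Upload a local CSV to S3, where Athena/Glue can see it.
s3 = boto3.client("s3")
s3.upload_file("sales.csv", "my-demo-bucket", "raw/sales/sales.csv")

# Kick off an Athena query against a table assumed to exist in the Glue catalog.
athena = boto3.client("athena", region_name="us-east-1")
response = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM sales GROUP BY region",
    QueryExecutionContext={"Database": "demo_db"},
    ResultConfiguration={"OutputLocation": "s3://my-demo-bucket/athena-results/"},
)
print("query execution id:", response["QueryExecutionId"])
```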
  • Basic statistics
  • Data sources
  • Pipelines
  • Extracting, transforming and selecting features
  • Classification and Regression
  • Clustering
  • Collaborative filtering
  • Frequent Pattern Mining
  • Model selection and tuning
  • Advanced topics
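These topics follow Spark MLlib. As a preview, here is a minimal pipeline sketch (feature assembly plus logistic regression) on a tiny inline dataset; the column names and data are chosen only for illustration.

```python
# mllib_pipeline_sketch.py -- minimal Spark ML pipeline; run with spark-submit.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-pipeline-sketch").master("local[*]").getOrCreate()

# A tiny inline dataset: two numeric features and a binary label.
train = spark.createDataFrame(
    [(1.0, 0.5, 1.0), (0.2, 0.1, 0.0), (0.9, 0.8, 1.0), (0.1, 0.4, 0.0)],
    ["f1", "f2", "label"],
)

# Pipeline: assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("features", "label", "prediction").show()
spark.stop()
```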
  • Introduction to ETL from Talend Studio- Integration with HDFS, Hive, Sqoop, Spark etc
  • Introduction to ETL from Informatica BDM- Integration with HDFS, Hive, Sqoop, Spark etc
  • End-To-End Big Data Pipeline Engine PROJECT
  • Involving all major components: Sqoop, HDFS, Hive, HBase, Spark, etc.
  • Interview Preparation Tips
  • Sample Resume
  • 300+ Mock Interview Recordings
  • Mock Interview QA
  • Interview Questions
  • How To Handle Questions In The Various Interview Rounds
  • Career Guidance
  • One to One Resume Discussion
  • Certification
4.20 out of 5.0
1 Star 14.3%
2 Star 5.7%
3 Star 2.9%
4 Star 0.0%
5 Star 77.1%
Sudhanshu Kumar

Sudhanshu has 7+ years of experience in Big Data, Data Science, and Analytics, including product architecture design and delivery. He has worked at several product- and service-based companies and has 5+ years of experience educating people and helping them make career transitions.

Join Thousands of Happy Students!
