Data Engineering Free Courses

Linked Data Engineering
Video Lessons
Rating: 5 out of 5
Students: 9,973
Duration: 8 weeks
Source: openHPI
Course Link

Data Engineering
Credits: 15
Duration: 4 hours
Self-paced
Source: Google Cloud
Course Link

Data Engineering Essentials using Spark, Python and SQL
402 video lessons
Self-paced
Teacher: itversity
Source: YouTube
Course Link

Data engineering with Azure Databricks
Modules: 5
Duration: 4-5 hours of material
Self-paced
Source: Microsoft Ignite
Course Link

Perform data engineering with Azure Synapse Apache Spark Pools
Modules: 5
Duration: 2-3 hours of material
Self-paced
Source: Microsoft Learn
Course Link

Books
Data Engineering
The Data Engineer's Guide to Apache Spark

All the best!
Mastering Spark: 20 Interview Questions Demystified!

1. MapReduce vs. Spark: Learn how Spark can be up to 100x faster than MapReduce for in-memory workloads.
2. RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3. DataFrame vs. Dataset: Delve into the distinctions between DataFrame and Dataset in Spark.
4. RDD Operations: Explore the various RDD operations that power Spark.
5. Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6. Shared Variables: Discover the shared variables (broadcast variables and accumulators) that facilitate distributed computing in Spark.
7. Persist vs. Cache: Differentiate between the persist and cache functionalities in Spark.
8. Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9. SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
10. spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
11. Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
12. Deploy Modes: Learn about the deploy modes in Spark and their significance.
13. Executor vs. Executor Core: Distinguish between an executor and an executor core in the Spark ecosystem.
14. Shuffling Concept: Gain insights into the shuffling concept in Spark and its importance.
15. Number of Stages in a Spark Job: Understand how the number of stages in a Spark job is determined.
16. Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
17. Direct Output Storage: Explore the possibility of writing output directly to storage without sending it back to the driver.
18. Coalesce and Repartition: Learn about the applications of coalesce and repartition in Spark (a short sketch follows this list).
19. Physical and Logical Plan Optimization: Uncover the optimization techniques employed in Spark's physical and logical plans.
20. treeReduce and treeAggregate: Discover why treeReduce and treeAggregate are preferred over reduce and aggregate in certain scenarios.
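To make item 18 concrete, here is a minimal sketch of coalesce vs. repartition. The DataFrame, paths, column name, and partition counts are hypothetical placeholders chosen only for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CoalesceVsRepartition").getOrCreate()

# Hypothetical input: a read that produces many small partitions.
df = spark.read.parquet("path/to/events")  # illustrative path

# repartition() performs a full shuffle and can increase or decrease the
# partition count; useful for rebalancing data before a wide join or aggregation.
df_rebalanced = df.repartition(200, "event_date")

# coalesce() only merges existing partitions (no full shuffle), so it can
# only reduce the partition count; useful before writing a small output.
df_compact = df_rebalanced.coalesce(10)

df_compact.write.mode("overwrite").parquet("path/to/output")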
Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
We are now on WhatsApp as well
Follow for more data engineering resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
Data Engineer Interview Questions for Entry-Level Data Engineers
1. What are the core responsibilities of a data engineer?
2. Explain the ETL process
3. How do you handle large datasets in a data pipeline?
4. What is the difference between a relational & a non-relational database?
5. Describe how data partitioning improves performance in distributed systems
6. What is a data warehouse & how is it different from a database?
7. How would you design a data pipeline for real-time data processing?
8. Explain the concept of normalization & denormalization in database design
9. What tools do you commonly use for data ingestion, transformation & storage?
10. How do you optimize SQL queries for better performance in data processing?
11. What is the role of Apache Hadoop in big data?
12. How do you implement data security & privacy in data engineering?
13. Explain the concept of data lakes & their importance in modern data architectures
14. What is the difference between batch processing & stream processing? (A short sketch follows this list.)
15. How do you manage & monitor data quality in your pipelines?
16. What are your preferred cloud platforms for data engineering & why?
17. How do you handle schema changes in a production data pipeline?
18. Describe how you would build a scalable & fault-tolerant data pipeline
19. What is Apache Kafka & how is it used in data engineering?
20. What techniques do you use for data compression & storage optimization?
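As an illustration for question 14, here is a minimal, hedged sketch contrasting batch and stream processing with PySpark Structured Streaming. The schema, paths, and in-memory sink are hypothetical placeholders, not a prescribed setup.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("BatchVsStream").getOrCreate()

# Explicit schema (streaming file sources require one).
schema = StructType([
    StructField("event_type", StringType()),
    StructField("ts", LongType()),
])

# Batch: the input is bounded, so the job reads it once, writes once, and finishes.
batch_counts = spark.read.schema(schema).json("path/to/events/").groupBy("event_type").count()
batch_counts.write.mode("overwrite").parquet("path/to/batch_counts")

# Stream: the same logical query over an unbounded source; Spark keeps running
# and updates the result as new files land in the directory.
stream_counts = spark.readStream.schema(schema).json("path/to/events/").groupBy("event_type").count()
query = (
    stream_counts.writeStream
    .outputMode("complete")
    .format("memory")
    .queryName("event_counts")
    .start()
)
# query.awaitTermination()  # keep the stream alive in a real job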
Here are three PySpark questions:
Scenario 1: Data Aggregation
Interviewer: "How would you aggregate data by category and calculate the sum of sales, handling missing values and grouping by multiple columns?"
Candidate:
from pyspark.sql.functions import sum, col

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Handle missing values (fillna(0) replaces nulls in numeric columns)
df_filled = df.fillna(0)

# Aggregate data: group by multiple columns and sum the sales
df_aggregated = df_filled.groupBy("category", "region").agg(
    sum(col("sales")).alias("total_sales")
)

# Sort the results
df_aggregated_sorted = df_aggregated.orderBy("total_sales", ascending=False)

# Save the aggregated DataFrame
df_aggregated_sorted.write.csv("path/to/aggregated/data.csv", header=True)
Scenario 2: Data Transformation
Interviewer: "How would you transform a DataFrame by converting a column to timestamp, handling invalid dates and extracting specific date components?"
Candidate:
from pyspark.sql.functions import to_timestamp, col, year, month, dayofmonth

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Convert column to timestamp (values that don't match the pattern become null)
df_transformed = df.withColumn("date_column", to_timestamp(col("date_column"), "yyyy-MM-dd"))

# Handle invalid dates by dropping rows that failed to parse
df_transformed_filtered = df_transformed.filter(col("date_column").isNotNull())

# Extract date components
df_transformed_extracted = (
    df_transformed_filtered
    .withColumn("year", year(col("date_column")))
    .withColumn("month", month(col("date_column")))
    .withColumn("day", dayofmonth(col("date_column")))
)

# Save the transformed DataFrame
df_transformed_extracted.write.csv("path/to/transformed/data.csv", header=True)
Scenario 3: Data Partitioning
Interviewer: "How would you partition a large DataFrame by date and save it to parquet format, handling data skewness and optimizing storage?"
Candidate:
# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Range-partition by date so skewed dates are spread more evenly across partitions
df_partitioned = df.repartitionByRange("date_column")

# Save to Parquet in a single write, partitioned by date and compressed with Snappy
df_partitioned.write \
    .partitionBy("date_column") \
    .option("compression", "snappy") \
    .mode("overwrite") \
    .parquet("path/to/partitioned/data.parquet")
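As a quick follow-up (the path, column name, and date value are the same illustrative placeholders used above), reading the partitioned output back with a filter on the partition column lets Spark prune partitions instead of scanning the whole dataset:

from pyspark.sql.functions import col

# Read only the partitions for one (illustrative) date; Spark prunes the rest.
recent = spark.read.parquet("path/to/partitioned/data.parquet").filter(col("date_column") == "2024-01-01")
recent.show()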
Here, you can find Data Engineering Resources:
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
All the best!
fundamentals-of-data-engineering.pdf (7.6 MB)

A good book to start learning Data Engineering. You can download it for free here.

"With this practical #book, you'll learn how to plan and build systems to serve the needs of your organization and your customers by evaluating the best technologies available through the framework of the #data #engineering lifecycle."
Life of a Data Engineer...

Business user: Can we add a filter on this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.

Next day:
I opened the dashboard to find the column in the existing data sources -- column not found.
Spent a couple of hours identifying the data source and working out how to bring the column into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).
Then came the pipeline changes, data model changes, dashboard changes, and validation/testing.
Finally, deploying to production and a simple email to the user that the filter has been added.

A small change in the front end, but a lot of work in the backend to bring that column to life.
Never underestimate data engineers and data pipelines!
Don't aim for this:
SQL - 100%
Python - 0%
PySpark - 0%
Cloud - 0%
Aim for this:
SQL - 25%
Python - 25%
PySpark - 25%
Cloud - 25%
You don't need to know everything straight away.
Here, you can find Data Engineering Resources:
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
All the best!
ETL vs ELT: What's the Difference?

When it comes to data processing, two key approaches stand out: ETL and ELT. Both involve transforming data, but the processes differ significantly!

ETL (Extract, Transform, Load)
- Extract data from various sources (databases, APIs, etc.)
- Transform data before loading it into storage (cleaning, aggregating, formatting)
- Load the transformed data into the data warehouse (DWH)
Key point: Data is transformed before being loaded into storage.

ELT (Extract, Load, Transform)
- Extract data from sources
- Load raw data into the data warehouse
- Transform the data after it's loaded, using the data warehouse's computational resources
Key point: Data is loaded into storage first, and transformation happens afterward.

When to use which?
- ETL is ideal for structured data and traditional systems where pre-processing is crucial.
- ELT is better suited for handling large volumes of data in modern cloud-based architectures.

Which one works best for your project? (A minimal PySpark sketch of both patterns follows.)
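To make the contrast concrete, here is a minimal sketch of both patterns in PySpark. The paths and column names are hypothetical, and a Parquet directory queried with Spark SQL stands in for a real warehouse.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("EtlVsElt").getOrCreate()

# --- ETL: transform in Spark first, then load only the curated result ---
raw = spark.read.csv("path/to/orders.csv", header=True, inferSchema=True)   # Extract
curated = (
    raw.dropna(subset=["order_id"])                                          # Transform: clean
       .groupBy("customer_id")
       .agg(sum_("amount").alias("total_amount"))                            # Transform: aggregate
)
curated.write.mode("overwrite").parquet("warehouse/curated_orders")          # Load

# --- ELT: load the raw data as-is, then transform inside the "warehouse" ---
raw.write.mode("overwrite").parquet("warehouse/raw_orders")                  # Load raw
spark.read.parquet("warehouse/raw_orders").createOrReplaceTempView("raw_orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM raw_orders
    WHERE order_id IS NOT NULL
    GROUP BY customer_id
""").write.mode("overwrite").parquet("warehouse/curated_orders_elt")         # Transform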
Join our WhatsApp channel for more data engineering resources:
https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
Working with PySpark Aggregations
What are Aggregations?
Aggregations in PySpark allow you to transform large datasets by computing statistics across specified groups. PySpark offers built-in functions for common aggregations, such as sum, avg, min, max, count, and more.
Common Aggregation Methods in PySpark
1. groupBy(): Groups data by one or more columns and allows applying aggregation functions on each group.
2. agg(): Lets you apply multiple aggregation functions simultaneously.
3. count(): Counts the number of non-null entries.
4. sum(): Adds up the values in a column.
5. avg(): Computes the average of a column.
Example: Using groupBy() and Aggregations
Let's say you have a DataFrame with sales data and want to calculate the total and average sales per salesperson.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg
# Create Spark session
spark = SparkSession.builder.appName("AggregationExample").getOrCreate()
# Sample data
data = [("Alice", 100), ("Alice", 150), ("Bob", 200), ("Bob", 300)]
df = spark.createDataFrame(data, ["Salesperson", "Sales_Amount"])
# Aggregating data
agg_df = df.groupBy("Salesperson").agg(
    sum("Sales_Amount").alias("Total_Sales"),
    avg("Sales_Amount").alias("Avg_Sales")
)
agg_df.show()
In this example, we used groupBy("Salesperson") to group the data by each salesperson, and agg() to calculate the total and average sales for each.
Real-World Example: Aggregating Product Sales Data
Imagine you're analyzing sales data for a retail store. You might want to know the total sales per product category, the highest and lowest sales amounts, or the average sales per transaction. Aggregations allow you to gain these insights quickly:
# Group by product category and calculate total and average sales
sales_df.groupBy("Product_Category").agg(
    sum("Sales_Amount").alias("Total_Sales"),
    avg("Sales_Amount").alias("Avg_Sales")
).show()
Advanced Aggregation Functions
countDistinct(): Counts unique values in a column (this snippet assumes the DataFrame has a Product_ID column).
from pyspark.sql.functions import countDistinct
df.groupBy("Salesperson").agg(countDistinct("Product_ID").alias("Unique_Products_Sold")).show()
approx_count_distinct(): Uses an approximate algorithm to count distinct values, useful for very large datasets.
from pyspark.sql.functions import approx_count_distinct
df.agg(approx_count_distinct("Product_ID")).show()
Windowed Aggregations
Sometimes, aggregations are performed over a "window" rather than over the entire dataset or specific groups. We've covered window functions, but it's useful to know they can be combined with aggregations for tasks like rolling averages.
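As an illustration of that idea, here is a minimal sketch of a rolling average computed over a window, reusing the small salesperson dataset from above; the Day column is added only so there is something to order by.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("WindowedAggregationExample").getOrCreate()

# Same salesperson data as above, plus a Day column to order by.
data = [("Alice", 1, 100), ("Alice", 2, 150), ("Bob", 1, 200), ("Bob", 2, 300)]
df = spark.createDataFrame(data, ["Salesperson", "Day", "Sales_Amount"])

# A window per salesperson, ordered by day, covering the previous row and the current row.
rolling_window = (
    Window.partitionBy("Salesperson")
          .orderBy("Day")
          .rowsBetween(-1, 0)
)

# Rolling (2-row) average of sales per salesperson.
df_with_rolling = df.withColumn(
    "Rolling_Avg_Sales", avg(col("Sales_Amount")).over(rolling_window)
)
df_with_rolling.show()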
Interview Questions
1. What are some common aggregation functions in PySpark, and how are they used?
2. Explain the difference between groupBy() and agg() in PySpark.
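For question 2, a minimal sketch reusing the salesperson DataFrame from the post above: groupBy() only defines the grouping and returns a GroupedData object, while agg() (or a shortcut such as sum()) is what actually computes the aggregations.

from pyspark.sql.functions import sum as sum_, avg

# groupBy() alone returns a GroupedData object -- nothing is computed yet.
grouped = df.groupBy("Salesperson")

# A shortcut aggregation: one function applied to a numeric column.
grouped.sum("Sales_Amount").show()

# agg() lets you apply several named aggregations in a single pass.
grouped.agg(
    sum_("Sales_Amount").alias("Total_Sales"),
    avg("Sales_Amount").alias("Avg_Sales"),
).show()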