Data Engineers
Free Data Engineering Ebooks & Courses
HR: "What's your salary expectation?"
Candidate: "$8,000 to $10,000 a month."

HR: "You are the best fit for the role, but we can only offer $7,000."
Candidate: "Okay. $7,000 would be fine."

HR: "How soon can you start?"

Meanwhile, the budget for that role is $15,000. HR feels they did a great job in the salary negotiation, and management will be happy that they cut costs for the organization.

The new employee starts and notices the pay disparity. Guess what happens? Dissatisfaction. Disengagement. Disloyalty.

Two months later, the employee leaves the organization for a better job. The recruitment process starts all over again, leading to further costs and performance gaps within the team and the organization.

In order to attract and retain top talent, please pay people what they are worth.
๐Ÿ‘22๐Ÿ‘1
- SQL + SELECT = Querying Data
- SQL + JOIN = Data Integration
- SQL + WHERE = Data Filtering
- SQL + GROUP BY = Data Aggregation
- SQL + ORDER BY = Data Sorting
- SQL + UNION = Combining Queries
- SQL + INSERT = Data Insertion
- SQL + UPDATE = Data Modification
- SQL + DELETE = Data Removal
- SQL + CREATE TABLE = Database Design
- SQL + ALTER TABLE = Schema Modification
- SQL + DROP TABLE = Table Removal
- SQL + INDEX = Query Optimization
- SQL + VIEW = Virtual Tables
- SQL + Subqueries = Nested Queries
- SQL + Stored Procedures = Task Automation
- SQL + Triggers = Automated Responses
- SQL + CTE = Reusable and Recursive Queries
- SQL + Window Functions = Advanced Analytics
- SQL + Transactions = Data Integrity
- SQL + ACID Compliance = Reliable Operations
- SQL + Data Warehousing = Large Data Management
- SQL + ETL = Data Transformation
- SQL + Partitioning = Big Data Management
- SQL + Replication = High Availability
- SQL + Sharding = Database Scaling
- SQL + JSON = Semi-Structured Data
- SQL + XML = Structured Data
- SQL + Data Security = Data Protection
- SQL + Performance Tuning = Query Efficiency
- SQL + Data Governance = Data Quality
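
A few of the pairings above can be tried without installing a database server. The following is a minimal sketch using Python's built-in sqlite3 module; the `customers` and `orders` tables and their rows are made up for illustration.

```python
import sqlite3

# In-memory database with two small hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 40.0), (4, 3, 5.0);
""")

# JOIN integrates the tables, WHERE filters rows, GROUP BY aggregates,
# and ORDER BY sorts the aggregated result.
rows = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount >= 10
    GROUP BY c.region
    ORDER BY total DESC
""").fetchall()
print(rows)
```

The order with amount 5.0 is filtered out before aggregation, so it does not contribute to the EU total.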
๐Ÿ‘15โค6๐Ÿฅฐ1
SQL is composed of five key components:

๐ƒ๐ƒ๐‹ (๐ƒ๐š๐ญ๐š ๐ƒ๐ž๐Ÿ๐ข๐ง๐ข๐ญ๐ข๐จ๐ง ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like CREATE, ALTER, DROP for defining and modifying database structures.
๐ƒ๐๐‹ (๐ƒ๐š๐ญ๐š ๐๐ฎ๐ž๐ซ๐ฒ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like SELECT for querying and retrieving data.
๐ƒ๐Œ๐‹ (๐ƒ๐š๐ญ๐š ๐Œ๐š๐ง๐ข๐ฉ๐ฎ๐ฅ๐š๐ญ๐ข๐จ๐ง ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like INSERT, UPDATE, DELETE for modifying data.
๐ƒ๐‚๐‹ (๐ƒ๐š๐ญ๐š ๐‚๐จ๐ง๐ญ๐ซ๐จ๐ฅ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like GRANT, REVOKE for managing access permissions.
๐“๐‚๐‹ (๐“๐ซ๐š๐ง๐ฌ๐š๐œ๐ญ๐ข๐จ๐ง ๐‚๐จ๐ง๐ญ๐ซ๐จ๐ฅ ๐‹๐š๐ง๐ ๐ฎ๐š๐ ๐ž): Commands like COMMIT, ROLLBACK for managing transactions.

If you're an engineer, you'll likely need a solid understanding of all these components. If you're a data analyst, focusing on DQL will be more relevant. Tailor your learning to the topics that best fit your role.
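
As a rough illustration of how these components fit together, here is a small sketch with Python's built-in sqlite3 module. The `accounts` table is hypothetical, and DCL is left out because SQLite does not implement GRANT or REVOKE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure.
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")

# DML: modify data. sqlite3 opens a transaction implicitly before the INSERT.
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 0)")
conn.commit()  # TCL: make the inserts permanent

# DML inside a transaction that is then abandoned.
conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")
conn.rollback()  # TCL: undo the uncommitted UPDATE

# DQL: read the data back.
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(balances)
```

Because the UPDATE is rolled back, the query returns the balances as they were at the last COMMIT.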
๐Ÿ‘5๐Ÿ”ฅ4
โค8๐Ÿ”ฅ1
Data Engineering free courses

Linked Data Engineering
🎬 Video Lessons
Rating ⭐️: 5 out of 5
Students 👨‍🎓: 9,973
Duration ⏰: 8 weeks long
Source: openHPI
🔗 Course Link

Data Engineering
Credits ⏳: 15
Duration ⏰: 4 hours
🏃‍♂️ Self paced
Source: Google Cloud
🔗 Course Link

Data Engineering Essentials using Spark, Python and SQL
🎬 402 video lessons
🏃‍♂️ Self paced
Teacher: itversity
Resource: YouTube
🔗 Course Link

Data engineering with Azure Databricks
Modules ⏳: 5
Duration ⏰: 4-5 hours worth of material
🏃‍♂️ Self paced
Source: Microsoft Ignite
🔗 Course Link

Perform data engineering with Azure Synapse Apache Spark Pools
Modules ⏳: 5
Duration ⏰: 2-3 hours worth of material
🏃‍♂️ Self paced
Source: Microsoft Learn
🔗 Course Link

Books
Data Engineering
The Data Engineer's Guide to Apache Spark

All the best
๐Ÿ” Mastering Spark: 20 Interview Questions Demystified!

1๏ธโƒฃ MapReduce vs. Spark: Learn how Spark achieves 100x faster performance compared to MapReduce.
2๏ธโƒฃ RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3๏ธโƒฃ DataFrame vs. Datasets: Delve into the distinctions between DataFrame and Datasets in Spark.
4๏ธโƒฃ RDD Operations: Explore the various RDD operations that power Spark.
5๏ธโƒฃ Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6๏ธโƒฃ Shared Variables: Discover the shared variables that facilitate distributed computing in Spark.
7๏ธโƒฃ Persist vs. Cache: Differentiate between the persist and cache functionalities in Spark.
8๏ธโƒฃ Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9๏ธโƒฃ SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
๐Ÿ”Ÿ spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
1๏ธโƒฃ1๏ธโƒฃ Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
1๏ธโƒฃ2๏ธโƒฃ Deploy Modes: Learn about the deploy modes in Spark and their significance.
1๏ธโƒฃ3๏ธโƒฃ Executor vs. Executor Core: Distinguish between executor and executor core in the Spark ecosystem.
1๏ธโƒฃ4๏ธโƒฃ Shuffling Concept: Gain insights into the shuffling concept in Spark and its importance.
1๏ธโƒฃ5๏ธโƒฃ Number of Stages in Spark Job: Understand how to decide the number of stages created in a Spark job.
1๏ธโƒฃ6๏ธโƒฃ Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
1๏ธโƒฃ7๏ธโƒฃ Direct Output Storage: Explore the possibility of directly storing output without sending it back to the driver.
1๏ธโƒฃ8๏ธโƒฃ Coalesce and Repartition: Learn about the applications of coalesce and repartition in Spark.
1๏ธโƒฃ9๏ธโƒฃ Physical and Logical Plan Optimization: Uncover the optimization techniques employed in Spark's physical and logical plans.
2๏ธโƒฃ0๏ธโƒฃ Treereduce and Treeaggregate: Discover why treereduce and treeaggregate are preferred over reduceByKey and aggregateByKey in certain scenarios.

Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
๐Ÿ‘7
The four V's of big data
Pandas Data Cleaning.pdf
14.9 MB
Data Pipeline Overview
โค7๐Ÿ‘2
Interview Questions for Entry-Level Data Engineers 🔥


1. What are the core responsibilities of a data engineer?

2. Explain the ETL process

3. How do you handle large datasets in a data pipeline?

4. What is the difference between a relational & a non-relational database?

5. Describe how data partitioning improves performance in distributed systems

6. What is a data warehouse & how is it different from a database?

7. How would you design a data pipeline for real-time data processing?

8. Explain the concept of normalization & denormalization in database design

9. What tools do you commonly use for data ingestion, transformation & storage?

10. How do you optimize SQL queries for better performance in data processing?

11. What is the role of Apache Hadoop in big data?

12. How do you implement data security & privacy in data engineering?

13. Explain the concept of data lakes & their importance in modern data architectures

14. What is the difference between batch processing & stream processing?

15. How do you manage & monitor data quality in your pipelines?

16. What are your preferred cloud platforms for data engineering & why?

17. How do you handle schema changes in a production data pipeline?

18. Describe how you would build a scalable & fault-tolerant data pipeline

19. What is Apache Kafka & how is it used in data engineering?

20. What techniques do you use for data compression & storage optimization?
โค4
Here are three PySpark questions:


Scenario 1: Data Aggregation


Interviewer: "How would you aggregate data by category and calculate the sum of sales, handling missing values and grouping by multiple columns?"


Candidate:


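# Assumes an existing SparkSession named `spark`
from pyspark.sql import functions as F

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Handle missing values: treat missing sales as 0 and label missing group keys
df_filled = df.fillna({"sales": 0, "category": "unknown", "region": "unknown"})

# Aggregate by multiple columns (F.sum avoids shadowing Python's built-in sum)
df_aggregated = df_filled.groupBy("category", "region").agg(F.sum("sales").alias("total_sales"))

# Sort the results
df_aggregated_sorted = df_aggregated.orderBy("total_sales", ascending=False)

# Save the aggregated DataFrame (Spark writes a directory of part files at this path)
df_aggregated_sorted.write.csv("path/to/aggregated/data.csv", header=True)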


Scenario 2: Data Transformation


Interviewer: "How would you transform a DataFrame by converting a column to timestamp, handling invalid dates and extracting specific date components?"


Candidate:


# Assumes an existing SparkSession named `spark`
from pyspark.sql import functions as F

# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Convert column to timestamp; values that do not match the pattern become null
# (if ANSI mode is enabled, to_timestamp raises an error on invalid input instead)
df_transformed = df.withColumn("date_column", F.to_timestamp(F.col("date_column"), "yyyy-MM-dd"))

# Handle invalid dates by dropping the rows that failed to parse
df_transformed_filtered = df_transformed.filter(F.col("date_column").isNotNull())

# Extract date components
df_transformed_extracted = (
    df_transformed_filtered
    .withColumn("year", F.year("date_column"))
    .withColumn("month", F.month("date_column"))
    .withColumn("day", F.dayofmonth("date_column"))
)

# Save the transformed DataFrame (Spark writes a directory of part files at this path)
df_transformed_extracted.write.csv("path/to/transformed/data.csv", header=True)
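
The parse, filter, and extract rule above can also be prototyped in plain Python before running it on a cluster. This is only a sketch: `parse_date` and the sample values are hypothetical, and Python's `strptime` is not guaranteed to accept exactly the same inputs as Spark's parser.

```python
from datetime import datetime
from typing import Optional

def parse_date(raw: Optional[str]) -> Optional[datetime]:
    """Return a datetime for 'YYYY-MM-DD' strings, or None when parsing fails."""
    if raw is None:
        return None
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d")
    except ValueError:
        return None

raw_values = ["2024-03-15", "2024-02-30", "not a date", None]  # made-up sample input
parsed = [parse_date(v) for v in raw_values]

# Keep only valid rows, then extract year, month, and day.
components = [(d.year, d.month, d.day) for d in parsed if d is not None]
print(components)
```

Only the first value survives: February 30 does not exist, the third value is not a date, and the fourth is missing.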

Scenario 3: Data Partitioning


Interviewer: "How would you partition a large DataFrame by date and save it to parquet format, handling data skewness and optimizing storage?"


Candidate:


# Assumes an existing SparkSession named `spark`
# Load the DataFrame
df = spark.read.csv("path/to/data.csv", header=True, inferSchema=True)

# Range-partition by date so rows with nearby dates are grouped before the write.
# If a few dates dominate, this alone does not remove skew; adding a salt
# column to the repartition keys is one common mitigation.
df_partitioned = df.repartitionByRange("date_column")

# Save to Parquet in a single write: one directory per date, snappy-compressed
(
    df_partitioned.write
    .option("compression", "snappy")
    .partitionBy("date_column")
    .parquet("path/to/partitioned/data.parquet")
)
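
The scenario mentions data skewness. One common mitigation is salting: appending a small rotating suffix to a hot key so its rows spread over several partitions. The sketch below simulates the idea in plain Python. The row counts are made up, and `toy_partition` is a hypothetical stand-in, not the hash function Spark uses.

```python
from collections import Counter

NUM_PARTITIONS = 8
NUM_SALTS = 8

def toy_partition(key: str) -> int:
    """Toy stand-in for a hash partitioner (Spark uses its own hash function)."""
    return sum(key.encode()) % NUM_PARTITIONS

# Made-up skewed input: one date holds 800 of the 1,000 rows.
rows = ["2024-01-01"] * 800 + ["2024-01-02"] * 100 + ["2024-01-03"] * 100

# Without salting, every row for a date lands in the same partition.
unsalted = Counter(toy_partition(date) for date in rows)

# With salting, a rotating suffix spreads each date over up to NUM_SALTS partitions.
salted = Counter(
    toy_partition(f"{date}#{i % NUM_SALTS}") for i, date in enumerate(rows)
)

print(max(unsalted.values()), max(salted.values()))
```

In a partitioned Parquet write, salted rows for one date still end up in the same date directory. The difference is that several tasks share the work of writing them.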

fundamentals-of-data-engineering.pdf
7.6 MB
🚀 A good book to start learning Data Engineering.

You can download it for free here

⚙ With this practical #book, you'll learn how to plan and build systems to serve the needs of your organization and your customers by evaluating the best technologies available through the framework of the #data #engineering lifecycle.
Life of a Data Engineer...


Business user: Can we add a filter to this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.

Next day:

I quickly opened the dashboard to find the column in its existing data sources. Column not found.

I spent a couple of hours identifying the data source and working out how to bring the column into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).

Then came the pipeline changes, data model changes, dashboard changes, and validation/testing.

Finally, I deployed to production and sent a simple email telling the user that the filter had been added.

A small change in the front end, but a lot of work in the back end to bring that column to life.

Never underestimate data engineers and data pipelines 💪
Don't aim for this:

SQL - 100%
Python - 0%
PySpark - 0%
Cloud - 0%

Aim for this:

SQL - 25%
Python - 25%
PySpark - 25%
Cloud - 25%

You don't need to know everything straight away.

🔥 ETL vs ELT: What's the Difference?

When it comes to data processing, two key approaches stand out: ETL and ELT. Both transform data, but they differ in when and where the transformation happens.

🔹 ETL (Extract, Transform, Load)
- Extract data from various sources (databases, APIs, etc.)
- Transform data before loading it into the storage (cleaning, aggregating, formatting)
- Load the transformed data into the data warehouse (DWH)

Key point: Data is transformed before being loaded into the storage.

🔹 ELT (Extract, Load, Transform)
- Extract data from sources
- Load raw data into the data warehouse
- Transform the data after it's loaded, using the power of the data warehouse's computational resources

Key point: Data is loaded into the storage first, and transformation happens afterward.

🎯 When to use which?
- ETL is ideal for structured data and traditional systems where pre-processing is crucial.
- ELT is better suited for handling large volumes of data in modern cloud-based architectures.

Which one works best for your project? 🤔
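
The difference in ordering can be sketched with Python's built-in sqlite3 module standing in for the warehouse. The `source` records and table names are made up. In the ETL path the cleaning runs in Python before the load; in the ELT path the raw strings are loaded first and cleaned with SQL inside the database.

```python
import sqlite3

# Made-up source records; amounts arrive as untidy strings.
source = [("alice", " 10.5 "), ("bob", "4"), ("alice", "2.5")]

# ETL: transform in application code first, then load the clean rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
clean = [(name, float(amount.strip())) for name, amount in source]
etl_db.executemany("INSERT INTO sales VALUES (?, ?)", clean)

# ELT: load the raw rows as-is, then transform with the warehouse's own SQL engine.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?)", source)
elt_db.execute("""
    CREATE TABLE sales AS
    SELECT customer, CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_sales
""")

query = "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
etl_totals = etl_db.execute(query).fetchall()
elt_totals = elt_db.execute(query).fetchall()
print(etl_totals, elt_totals)
```

Both paths produce the same totals here. What changes is where the transformation cost is paid, and whether the raw data stays available in the warehouse for later reprocessing.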
๐Ÿ‘4๐Ÿ”ฅ4๐Ÿฅฐ1