Google is looking for a Data Engineer Intern:
https://www.linkedin.com/posts/sql-analysts_google-intern-googleanalytics-activity-7144931636453847041-OgA_?utm_source=share&utm_medium=member_android
Kavitha's Journey to Becoming a Data Engineer
1. Startup to Dream Job Journey:
- Started at a startup in India, moved to Infosys, then took up an opportunity in the UK.
- Shifted from legacy mainframe work to AWS Cloud, pursued a Master's at Illinois State University, and secured a dream job at State Farm.
2. Learn Fundamentals:
- Assess skills, understand role.
- Gain proficiency in Python, SQL.
- Learn data technologies.
3. Database and Modeling Skills:
- Understand databases, gain proficiency.
- Learn data modeling principles.
4. Master ETL, Warehousing, and Visualization:
- Understand ETL, data warehousing.
- Gain experience in building warehouses.
- Familiarize with visualization tools.
- Got certified as an AWS Solutions Architect.
5. Utilize LinkedIn for Job Search:
- Network and connect with professionals.
- Showcase skills and achievements.
- Use the job search feature, which led to the dream job at State Farm.
Data Engineering Interview Preparation Resources: https://whatsapp.com/channel/0029Vaovs0ZKbYMKXvKRYi3C
Mastering Spark: 20 Interview Questions Demystified!
1. MapReduce vs. Spark: Learn how Spark can run up to 100x faster than MapReduce on in-memory workloads.
2. RDD vs. DataFrame: Unravel the key differences between RDD and DataFrame, and discover what makes DataFrame unique.
3. DataFrame vs. Dataset: Delve into the distinctions between DataFrames and Datasets in Spark.
4. RDD Operations: Explore the various RDD operations that power Spark.
5. Narrow vs. Wide Transformations: Understand the differences between narrow and wide transformations in Spark.
6. Shared Variables: Discover the shared variables that facilitate distributed computing in Spark.
7. Persist vs. Cache: Differentiate between persist and cache in Spark.
8. Spark Checkpointing: Learn about Spark checkpointing and how it differs from persisting to disk.
9. SparkSession vs. SparkContext: Understand the roles of SparkSession and SparkContext in Spark applications.
10. spark-submit Parameters: Explore the parameters to specify in the spark-submit command.
11. Cluster Managers in Spark: Familiarize yourself with the different types of cluster managers available in Spark.
12. Deploy Modes: Learn about the deploy modes in Spark and their significance.
13. Executor vs. Executor Core: Distinguish between an executor and an executor core in Spark.
14. Shuffling Concept: Gain insights into shuffling in Spark and why it matters.
15. Number of Stages in a Spark Job: Understand what determines the number of stages created in a Spark job.
16. Spark Job Execution Internals: Get a peek into how Spark internally executes a program.
17. Direct Output Storage: Explore how to store output directly without sending it back to the driver.
18. Coalesce and Repartition: Learn when to use coalesce and when to use repartition in Spark.
19. Physical and Logical Plan Optimization: Uncover the optimization techniques applied to Spark's logical and physical plans.
20. treeReduce and treeAggregate: Discover why treeReduce and treeAggregate can be preferred over reduce and aggregate in certain scenarios.
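For the last question, the idea behind a tree-style reduce can be sketched in plain Python. This is only an illustration, not Spark's API: the function names `flat_reduce` and `tree_reduce` and the `fan_in` parameter are hypothetical, and each list element stands in for one partition's partial result. The point it tries to show is that a tree merges partial results in intermediate rounds, so the final step only has to combine a few values instead of one per partition.

```python
from functools import reduce
from operator import add


def flat_reduce(partials, op):
    """Combine every per-partition result in a single final step."""
    return reduce(op, partials), len(partials)


def tree_reduce(partials, op, fan_in=2):
    """Combine per-partition results level by level.

    Returns the final value and how many values the last step had to merge.
    """
    level = list(partials)
    while len(level) > fan_in:
        # Merge neighbouring groups of `fan_in` values into one value each.
        level = [reduce(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return reduce(op, level), len(level)


# Eight hypothetical partitions, each already reduced to a partial sum.
partition_sums = [sum(range(i * 10, (i + 1) * 10)) for i in range(8)]

flat_total, flat_final_inputs = flat_reduce(partition_sums, add)
tree_total, tree_final_inputs = tree_reduce(partition_sums, add)
```

Both versions give the same total; the difference is that the flat version merges all eight values in its final step, while the tree version merges only two there.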
Here's what the average data engineering interview looks like in 2024:
- 1 hour algorithms in Python
Here you will be asked irrelevant questions about dynamic programming, linked lists, and inverting trees
- 1 hour SQL
Here you will be asked niche questions about recursive CTEs that you've used once in your ten-year career
- 1 hour data architecture
Here you will be asked about CAP theorem, lambda vs kappa, and a bunch of other things that ChatGPT probably could answer in a heartbeat
- 1 hour behavioral
Here you will be asked about how to play nicely with your coworkers. This is the most relevant interview in my opinion
- 1 hour project deep dive
Here you will be asked to make up a story about something you did or did not do in the past that was a technical marvel
- 4-hour take-home assignment
Here you will be asked to build their entire data engineering stack from scratch over a weekend, because why hire data engineers when you can subject them to tests?
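For anyone who has not met the recursive CTEs mentioned in the SQL round, here is a minimal sketch using Python's built-in `sqlite3` module. The `employees` table, its columns, and the names in it are made up for illustration; the query walks a manager hierarchy from the top down and records each person's depth.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Asha', NULL),
        (2, 'Ben', 1),
        (3, 'Chen', 2),
        (4, 'Dana', 1);
""")

rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        -- Anchor member: people with no manager.
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        -- Recursive member: direct reports of anyone already in the chain.
        SELECT e.id, e.name, c.depth + 1
        FROM employees AS e
        JOIN chain AS c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()
conn.close()
```

Here `rows` holds one `(name, depth)` pair per employee, ordered from the top of the hierarchy downwards.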
Hands-on Guide to Apache Spark 3 (2024).pdf
11.2 MB
Hands-on Guide to Apache Spark 3
Alfonso Antolínez García, 2023
Frequently asked SQL interview questions for Data Analyst/Data Engineer roles:
1 - What is SQL and what are its main features?
2 - In what order are the clauses of a SQL query written?
3 - In what order are the clauses of a SQL query executed?
4 - What are some of the most common SQL commands?
5 - What are primary keys and foreign keys?
6 - What are the types of joins, and what output does each produce?
7 - Explain the window functions and the differences between them.
8 - What is a stored procedure?
9 - What is the difference between stored procedures and functions in SQL?
10 - What is a trigger in SQL?
11 - What is the difference between WHERE and HAVING?
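The last question is easy to try out. The sketch below uses Python's built-in `sqlite3` module with a made-up `orders` table (the table, columns, and values are assumptions for illustration). It runs the same aggregation twice, once filtering rows with WHERE and once filtering groups with HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ann', 50), ('ann', 70), ('bob', 20), ('bob', 200), ('cat', 5);
""")

# WHERE filters individual rows before they are grouped.
where_rows = conn.execute("""
    SELECT customer, SUM(amount) FROM orders
    WHERE amount >= 50
    GROUP BY customer
    ORDER BY customer
""").fetchall()

# HAVING filters whole groups after the aggregate is computed.
having_rows = conn.execute("""
    SELECT customer, SUM(amount) FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 50
    ORDER BY customer
""").fetchall()
conn.close()
```

With WHERE, bob's 20 order is dropped before summing, so his total is 200. With HAVING, all of bob's orders are summed first (220) and only then is the group checked against the threshold.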
For all Data Engineers out there, here is The State of Data Engineering 2024.
Some of the highlights:
- Data observability tools are increasingly used to monitor not just data sources but also the infrastructure, pipelines, and systems that handle data after it is collected.
- Companies now see data observability as essential for their AI projects. Gartner has called it a must-have for AI-ready data.
- As in 2023, Monte Carlo leads in this area, with G2 naming it the #1 Data Observability Platform. Large organizations such as Cisco, American Airlines, and NASDAQ use Monte Carlo to make their AI systems more reliable.
Data Engineer Interview Questions.pdf
2.4 MB
Data Engineering Interview Questions
React ❤️ if you want more content like this
Learning SQL is actually a really good skill. It's not just learning SQL the language, but learning the concepts of relational algebra and how to think about data sets, designing schemas, and organizing data.
...
It is about learning file formats and the basics of data storage, data partitioning, and how these relate to the execution engines. All of these things will make you a better dbt user, a better Snowflake user, or a better Databricks user.
The number one thing to do as a data engineer? Create high-quality data that people can trust.
Life of a Data Engineer...
Business user: Can we add a filter on this dashboard? It will help us track a critical metric.
Me: Sure, this should be a quick one.
Next day:
I quickly opened the dashboard to find the column in its existing data sources. Column not found.
Spent a couple of hours identifying the data source and working out how to bring the column into the existing data pipeline that feeds the dashboard (table granularity, join conditions, etc.).
Then came the pipeline changes, data model changes, dashboard changes, and validation/testing.
Finally, deployment to production and a simple email to the user saying the filter has been added.
A small change in the front end, but a lot of work in the back end to bring that column to life.
Never underestimate data engineers and data pipelines.
Data Engineering is not Excel. Not writing ML models. Not "please can you do this quick? I need it asap".
Complete Python topics required for the Data Engineer role:
➤ Basics of Python:
- Python Syntax
- Data Types
-> Lists
-> Tuples
-> Dictionaries
-> Sets
- Variables
- Operators
- Control Structures:
-> if-elif-else
-> Loops
-> Break & Continue
-> try-except block
- Functions
- Modules & Packages
➤ Pandas:
- What is Pandas & imports?
- Pandas Data Structures (Series, DataFrame, Index)
- Working with DataFrames:
-> Creating DFs
-> Accessing Data in DFs
-> Filtering & Selecting Data
-> Adding & Removing Columns
-> Merging & Joining in DFs
-> Grouping and Aggregating Data
-> Pivot Tables
- Input/Output Operations with Pandas:
-> Reading & Writing CSV Files
-> Reading & Writing Excel Files
-> Reading & Writing SQL Databases
-> Reading & Writing JSON Files
-> Reading & Writing Text & Binary Files
➤ NumPy:
- What is NumPy & imports?
- NumPy Arrays
- NumPy Array Operations:
-> Creating Arrays
-> Accessing Array Elements
-> Slicing & Indexing
-> Reshaping & Combining Arrays
-> Arithmetic Operations
-> Broadcasting
-> Mathematical Functions
-> Statistical Functions
➤ Basics of Python, Pandas, and NumPy are more than enough for the Data Engineer role.
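As a small taste of the "Basics of Python" items above, here is a sketch that uses several of them together: lists, tuples, a dictionary, a set, a loop with break and continue, and a try-except block. The function name `clean_records`, the `STOP` sentinel, and the sample rows are all made up for illustration.

```python
def clean_records(raw_rows):
    """Sum amounts per user, skipping duplicate rows and unparsable amounts."""
    seen_rows = set()   # set: fast membership checks for duplicates
    totals = {}         # dict: user_id -> running total
    rejected = []       # list: rows whose amount could not be parsed

    for user_id, amount_text in raw_rows:   # tuple unpacking in a loop
        if user_id == "STOP":
            break                           # stop at the sentinel row
        if (user_id, amount_text) in seen_rows:
            continue                        # skip exact duplicates
        seen_rows.add((user_id, amount_text))
        try:
            amount = float(amount_text)
        except ValueError:
            rejected.append((user_id, amount_text))
            continue
        totals[user_id] = totals.get(user_id, 0.0) + amount
    return totals, rejected


rows = [("u1", "10.5"), ("u2", "abc"), ("u1", "10.5"),
        ("u1", "4.5"), ("STOP", "0"), ("u3", "1")]
totals, rejected = clean_records(rows)
```

Here the duplicate `("u1", "10.5")` row is counted once, the `"abc"` amount is rejected, and nothing after the `STOP` row is processed.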
All the best
Preparing for a Spark Interview? Here are 20 Key Differences You Should Know!
1. Repartition vs. Coalesce: Repartition can increase or decrease the number of partitions and always performs a full shuffle, while coalesce only reduces the number of partitions and avoids a full shuffle.
2. Sort By vs. Order By: Sort By sorts data within each partition and may result in partially ordered final results if multiple reducers are used. Order By guarantees total order across all partitions in the final output.
3. RDD vs. Datasets vs. DataFrames: RDDs are the low-level abstraction, DataFrames organize structured data into named columns and benefit from the Catalyst optimizer, and Datasets add compile-time type safety (Scala and Java only).
4. Broadcast Join vs. Shuffle Join vs. Sort Merge Join: Broadcast Join ships a small table to every executor, Shuffle (hash) Join redistributes both sides by join key, and Sort Merge Join shuffles and sorts both sides before merging them.
5. Spark Session vs. Spark Context: Spark Session is the unified entry point since Spark 2.0, combining the functionality of Spark Context and SQL Context.
6. Executor vs. Executor Core: An executor is a process that runs tasks and holds cached data, while an executor core is a task slot inside it; the number of cores sets how many tasks the executor can run concurrently.
7. DAG vs. Lineage: DAG (Directed Acyclic Graph) is the execution plan, while Lineage records how each RDD was derived so lost partitions can be recomputed for fault tolerance.
8. Transformation vs. Action: A transformation lazily defines a new RDD/Dataset/DataFrame, while an action triggers execution and returns results to the driver or writes them to storage.
9. Narrow Transformation vs. Wide Transformation: In a narrow transformation each output partition depends on a single input partition, while a wide transformation requires shuffling data across partitions.
10. Lazy Evaluation vs. Eager Evaluation: Spark delays execution until an action is called (lazy), which lets it optimize the whole plan; eager evaluation would run each step immediately.
11. Window Functions vs. Group By: Window Functions compute a value for every row over a related set of rows, while Group By collapses each group into a single summary row.
12. Partitioning vs. Bucketing: Partitioning splits data into directories by column value, while Bucketing hashes a column into a fixed number of buckets.
13. Avro vs. Parquet vs. ORC: Avro is row-based with an embedded schema, while Parquet and ORC are columnar formats optimized for query speed.
14. Client Mode vs. Cluster Mode: Client mode runs the driver in the process that submits the application, while Cluster mode runs the driver inside the cluster.
15. Serialization vs. Deserialization: Serialization converts data to a byte stream, while Deserialization reconstructs data from a byte stream.
16. DAG Scheduler vs. Task Scheduler: DAG Scheduler divides a job into stages, while Task Scheduler assigns the tasks of each stage to executors.
17. Accumulators vs. Broadcast Variables: Accumulators aggregate values from executors back to the driver, while Broadcast Variables efficiently distribute read-only values to executors.
18. Cache vs. Persist: Cache stores an RDD/Dataset/DataFrame at the default storage level (MEMORY_ONLY for RDDs, MEMORY_AND_DISK for Datasets/DataFrames), while Persist lets you choose the storage level (memory, disk, etc.).
19. Internal Table vs. External Table: For an internal (managed) table Spark manages both the metadata and the data, so dropping the table deletes the data; for an external table Spark manages only the metadata and the data stays at its external location.
20. Executor vs. Driver: Executors run tasks on worker nodes, while the Driver plans and coordinates job execution.
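The first difference in the list can be illustrated with a toy model in plain Python. This is a simulation rather than Spark code: partitions are modelled as lists, and the functions `repartition` and `coalesce` below are hypothetical stand-ins that only mimic the general behaviour (Spark's real implementations differ in detail).

```python
def repartition(partitions, n):
    """Full shuffle: any row may move; rows are spread round-robin over n partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


def coalesce(partitions, n):
    """Merge whole neighbouring partitions; no partition is ever split."""
    if n >= len(partitions):
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions]
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i * n // len(partitions)].extend(part)
    return out


# Four skewed partitions holding the rows 1..10.
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9, 10]]
balanced = repartition(parts, 2)
merged = coalesce(parts, 2)
```

In this toy run, `repartition` produces two evenly sized partitions at the cost of moving rows around, while `coalesce` keeps rows with their original neighbours and ends up with uneven sizes (7 and 3 rows). That trade-off, less data movement versus better balance, is usually what the interview question is after.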
All the best