If you want to secure a long-lasting and successful career in data engineering, it's not enough to merely acquire proficiency in using a particular tool. Instead, you should strive to understand its inner workings at a fundamental level, as well as the first principles that underlie it.
"But how?" you may ask. "There are so many tools out there: Spark, Trino, BigQuery, Snowflake, etc."
That's certainly true. However, it might surprise you to discover that all of these systems share strikingly similar foundations.
For instance, they all depend on some variation of the MapReduce model to process data. They all require data shuffling between nodes for tasks like joining or grouping. They all rely on column-oriented data formats. They are all susceptible to issues such as skewed keys, the small object problem, uneven partitioning, and so on.
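To make the shuffle and skewed-key points concrete, here is a toy Python sketch of my own (not taken from any of these engines) of the hash-based routing step they all perform, and of how one hot key overloads a single partition:

```python
from collections import defaultdict

def shuffle_by_key(records, num_partitions=4):
    """Toy shuffle step: route each record to a partition by the hash
    of its key, the way distributed engines route rows to worker
    nodes before a join or group-by."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

# A skewed dataset: one "hot" key accounts for 90% of the rows.
records = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
sizes = {p: len(rows) for p, rows in shuffle_by_key(records).items()}
print(sizes)
```

Whichever partition receives the "hot" key ends up holding at least 900 of the 1,000 rows, so one worker does almost all the work. That, in miniature, is the skewed-key problem every one of these engines has to contend with.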
The key, naturally, lies in the details. What sets each tool apart is a unique set of trade-offs that its developers have chosen to make it particularly suited to address specific use cases. For instance, Trino terminates queries that exceed memory limits to prevent costly disk spills, as one of its primary objectives is low latency. Snowflake automatically handles data partitioning for a more user-friendly experience but relinquishes fine-grained control from end users. Spark offers maximum user control but may come across as a more complex tool, and so forth.
Nonetheless, if you dig into how these tools move data around, you'll discover that they are not so different after all. Plus, their feature sets continue to overlap and converge over time.
Therefore, my advice is to run an 'EXPLAIN' or equivalent command for every query you write and invest time in understanding the resulting output. Ensure you grasp how each part of your query maps to a specific stage within a physical plan. Use this knowledge to debug your queries. I can assure you that the expertise and experience acquired this way will be transferable to other similar tools or data warehouse vendors.
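This habit transfers all the way down to embedded engines. Here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, cust TEXT, amt REAL)")
conn.execute("CREATE TABLE custs (cust TEXT PRIMARY KEY, country TEXT)")

# EXPLAIN QUERY PLAN reports how SQLite will execute the statement:
# which table is scanned, which is probed via an index, whether a
# temporary B-tree is built for the GROUP BY, and so on.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT c.country, SUM(o.amt) "
    "FROM orders o JOIN custs c ON o.cust = c.cust "
    "GROUP BY c.country"
).fetchall()
for _, _, _, detail in plan:
    print(detail)
```

In Spark the equivalent is `df.explain()` or `EXPLAIN` in Spark SQL; Trino, Snowflake, and BigQuery all expose `EXPLAIN` as well. The plan shapes differ, but the reading skill is the same.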
Individual tools may come and go at a rapid pace, but fundamental principles endure and change far less frequently.
Source https://www.linkedin.com/posts/izeigerman_dataengineering-activity-7110648980732080128-DKmn
"But how?" you may ask. "There are so many tools out there: Spark, Trino, BigQuery, Snowflake, etc."
That's certainly true. However, it might surprise you to discover that all of these systems share strikingly similar foundations.
For instance, they all depend on some variation of the MapReduce model to process data. They all require data shuffling between nodes for tasks like joining or grouping. They all rely on column-oriented data formats. They are all susceptible to issues such as skewed keys, the small object problem, uneven partitioning, and so on.
The key, naturally, lies in the details. What sets each tool apart is a unique set of trade-offs that its developers have chosen to make it particularly suited to address specific use cases. For instance, Trino terminates queries that exceed memory limits to prevent costly disk spills, as one of its primary objectives is low latency. Snowflake automatically handles data partitioning for a more user-friendly experience but relinquishes fine-grained control from end users. Spark offers maximum user control but may come across as a more complex tool, and so forth.
Nonetheless, if you dig into how these tools move data around you'll discover that they are not that different after all. Plus their functionalities continue to overlap and converge over time.
Therefore, my advice is to run an 'EXPLAIN' or equivalent command for every query you write and invest time in understanding the resulting output. Ensure you grasp how each part of your query maps to a specific stage within a physical plan. Use this knowledge to debug your queries. I can assure you that the expertise and experience acquired this way will be transferable to other similar tools or data warehouse vendors.
Individual tools may come and go at a rapid pace, but fundamental principles endure and change far less frequently.
Source https://www.linkedin.com/posts/izeigerman_dataengineering-activity-7110648980732080128-DKmn
🐳1
In analytics, you should always know what to measure and why: at least 2-3 key metrics for your business domain.
P.S. Avoid vanity metrics 🤨
Is a university degree important for data jobs? Not at all. No one cares what degree you have; skills are more important.
Today I talked with a colleague who paid $50k for one year of a Master's in Business Analytics at a third-tier US university, plus a year of living costs: roughly $80k overall, largely wasted. Yes, she got the job and some skills, but at what cost? With the right focus and content, she could have "faked it and made it" in 4-5 months. Now imagine a two-year degree at a first- or second-tier university, plus the cost of living 🫨
🌟 Parquet:
Advantages: Columnar, compressed, schema evolution support!
Disadvantages: Not for write-heavy workloads.
Use Cases: Analytical querying & data warehousing.
🌟 Avro:
Advantages: Row-based, schema evolution, efficient serialization.
Disadvantages: Slower for analytical queries.
Use Cases: Data serialization & data interchange.
🌟 JSON:
Advantages: Human-readable & schema flexible.
Disadvantages: Inefficient storage.
Use Cases: Web data interchange & configuration.
🌟 Delta Lake:
Advantages: ACID transactions, schema enforcement.
Disadvantages: Closely tied to the Spark/Databricks ecosystem.
Use Cases: ACID transactions & schema enforcement in Data Lakes.
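To make the row-vs-columnar storage trade-off concrete, here is a toy pure-Python comparison. It is only a sketch of the idea behind Parquet's layout (contiguous, compressible columns), not a real Parquet implementation:

```python
import json
import struct
import zlib

# 10,000 rows of (id, price), stored two different ways.
rows = [{"id": i, "price": float(i % 100)} for i in range(10_000)]

# Row-oriented and human-readable: JSON text.
row_oriented = json.dumps(rows).encode()

# Column-oriented: pack each column into a contiguous binary array,
# then compress. Similar values sit next to each other, which is
# why columnar formats compress so much better.
ids = struct.pack(f"<{len(rows)}i", *(r["id"] for r in rows))
prices = struct.pack(f"<{len(rows)}d", *(r["price"] for r in rows))
columnar = zlib.compress(ids) + zlib.compress(prices)

print(f"JSON:            {len(row_oriented):,} bytes")
print(f"JSON + zlib:     {len(zlib.compress(row_oriented)):,} bytes")
print(f"Columnar + zlib: {len(columnar):,} bytes")
```

Even after compressing the JSON, the columnar encoding comes out far smaller, because each compressed stream contains only one kind of value.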
🚀 Tips for Maximizing Benefits in #Spark:
- Choosing Format: Select data format based on read-write patterns, query performance, and storage efficiency.
- Partitioning: Properly partition data to optimize read performance, especially for large datasets.
- Compression: Choose an appropriate compression codec considering the trade-off between storage space and CPU usage.
- Caching: Leverage Spark’s caching features for frequently accessed datasets.
- Schema Evolution: Design schemas thoughtfully to allow for evolution over time without causing data inconsistency or requiring expensive migrations.
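The partitioning tip can be sketched in plain Python. A Hive-style directory layout (the convention Spark follows when you call `partitionBy`) lets readers that filter on the partition key skip whole directories; the file names and paths below are illustrative:

```python
import json
import os
import tempfile

def write_partitioned(rows, key, out_dir):
    """Write rows as newline-delimited JSON in a Hive-style layout:
    out_dir/<key>=<value>/part-0000.json."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    for value, grp in groups.items():
        part_dir = os.path.join(out_dir, f"{key}={value}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0000.json"), "w") as f:
            for row in grp:
                f.write(json.dumps(row) + "\n")

out = tempfile.mkdtemp()
write_partitioned(
    [{"country": "US", "amt": 10},
     {"country": "CA", "amt": 7},
     {"country": "US", "amt": 3}],
    "country",
    out,
)
print(sorted(os.listdir(out)))  # ['country=CA', 'country=US']
```

A query that only needs `country = 'US'` can now open a single directory instead of scanning everything, which is exactly what partition pruning buys you at scale.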
From rockyourdata.cloud: "We have awesome news! We've launched education programs for Data Engineer, Data Analyst, and BI Engineer positions. We are going to distill years of experience into our curriculum and help people move into the data industry and land their first job."
Rock Your Data is a North American consulting company focused on Cloud Analytics.
Please share https://www.linkedin.com/posts/rock-your-data_dataengineer-dataanalyst-biengineer-activity-7118664122300321792-mXWV
Are you planning to move from an analyst role to a data engineering role and don't know where to start? I'd bet that for almost any question out there, an ideal book exists, and this question is no exception.
The "The Missing README" is the best book for anyone looking for the foundational software engineering knowledge. Even you don't plan to work as a data engineer right now, you can still learn basic concepts and communicate effectively with backend engineer team.
Personally, this book has helped me tremendously in my career, and I highly recommend it to anyone lacking a Computer Science degree 🤗
Book link: https://lnkd.in/dQxNe3dm
The "The Missing README" is the best book for anyone looking for the foundational software engineering knowledge. Even you don't plan to work as a data engineer right now, you can still learn basic concepts and communicate effectively with backend engineer team.
Personally, this book has helped me tremendously in my career and I highly recommend it to anyone who are lacking Computer Science degree 🤗
Book link: https://lnkd.in/dQxNe3dm
⚡4🔥2❤🔥1
This media is not supported in your browser
VIEW IN TELEGRAM
Club 500 has started: https://surfalytics.com/pages/club500/
Hello! I've started a Discord server for Surfalytics. It's the best place for collaboration, resume and job-hunting progress updates, and sharing your successes with the rest of the community. Join us here: https://discord.gg/yEQkFerr
The success of your career often depends on your impact. While you might find yourself building numerous things, such as reports and pipelines, ingesting additional data sources, merging pull requests, or producing many lines of code, you may realize that these activities alone aren't propelling your career forward.
You're producing outputs and gauging your work by these outputs. However, output isn't synonymous with outcome. Your outputs might have limited business value and negligible impact. In essence, you're caught in the "building trap."
This is why I highly recommend the book "Escaping the Build Trap: How Effective Product Management Creates Real Value." This book will introduce you to the fundamentals of product management and the outcome-driven approach. It aims to help you avoid the building trap, create value for businesses, and focus on meaningful impacts that can genuinely advance your career.
P.S. Naturally, this will shine brightest when paired with a team and leadership that truly have their eyes on outcomes and can deftly distinguish between "output" and "outcome." Wouldn't that be refreshing?
Link to book: https://www.goodreads.com/book/show/42611483-escaping-the-build-trap