Data Memes
All best data memes in one place!

https://surfalytics.com 🏄‍♀️
https://github.com/will-stone/browserosaurus#readme - helps you control multiple browsers
The most straightforward yet profound question for newcomers in data engineering is: What is the difference between ETL and ELT?

You can work with tools like dbt and data warehouses without actually considering the difference, but understanding it is crucial as it leads to the right tool choice depending on the use case and requirements.

You might think of ETL as an older concept, from the time when data warehouses were used primarily for storing the results of the Transformation step. This required a powerful ETL server capable of processing the full volume of data, reading every record and every row in a table or file. With large volumes of data, this could be expensive.

However, with the rise of Cloud and Cloud Data Warehousing, the need for powerful ETL compute has diminished. Now, we can simply COPY data into cloud storage and then into the cloud data warehouse. After this, we can leverage the powerful compute capabilities of distributed cloud data warehouses or SQL engines.

The advent of cloud computing wasn't the only pivotal moment. Even before the cloud, ETL tools like Informatica employed a 'push down' approach, pushing all data into MPP data warehouses like Teradata, and then orchestrating SQL transformations.

Let's consider a simple example:

In the case of ETL:
1. Extract Orders and Products data.
2. Transform the data (join, clean, aggregate).
3. Load the data into the data warehouse, often using INSERT (a slower, row-by-row process).

In the case of ELT:
1. Extract Orders and Products data.
2. Load the data into the data warehouse, often using COPY for storage accounts and data warehouses (a faster, bulk load process).
3. Transform in the warehouse with SQL or tools like dbt or dataframes (see the sketch below).
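To make the ELT flow concrete, here is a minimal Python sketch, assuming a Snowflake-style warehouse and an already-configured external stage; the stage, schema, table, and column names are hypothetical, and in practice a tool like dbt would own the transform step:

```python
# A minimal ELT sketch, assuming a Snowflake-style warehouse and an existing
# external stage; stage, schema, table, and column names are hypothetical.
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()

# Load: bulk-copy raw files from cloud storage straight into the warehouse (fast, set-based).
cur.execute("COPY INTO raw.orders   FROM @my_stage/orders/   FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
cur.execute("COPY INTO raw.products FROM @my_stage/products/ FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")

# Transform: push the join/clean/aggregate work down to the warehouse compute.
cur.execute("""
    CREATE OR REPLACE TABLE analytics.order_summary AS
    SELECT p.category,
           DATE_TRUNC('day', o.order_date) AS order_day,
           SUM(o.amount)                   AS revenue
    FROM raw.orders o
    JOIN raw.products p ON o.product_id = p.product_id
    GROUP BY 1, 2
""")
conn.close()
```

The heavy lifting (the join and the aggregation) runs inside the warehouse engine, not on a separate ETL server.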

Reflecting on the role of Spark, it becomes clear that Spark is an actual ETL tool since it reads the data when performing transformations.

Link to like/share: https://www.linkedin.com/posts/dmitryanoshin_etl-elt-dataengineering-activity-7137583690678747136-7imR
Say No to "Season of Yes".
I know some of you have challenges due to a lack of knowledge of cloud computing - Azure, AWS, GCP.

Starting in January, I will run a 6-week online program at the University of Victoria: Tuesday/Thursday, 6 pm PST, for 2 hours per session.

The price is 715 CAD. The money goes to the university; I am not getting much. But it is an extremely good opportunity to close the gap in cloud computing (Azure/AWS) and do lots of hands-on work.

Your employer may pay for this course.

Highly recommend it. I think there are about 10 seats left.

https://continuingstudies.uvic.ca/data-computing-and-technology/courses/cloud-computing-for-business/
Last weekend, we worked on a traditional data engineering project at Surfalytics, which involved using Snowflake as a Data Warehouse, dbt for transformations, and Fivetran for data ingestion from Google Drive.

For BI and data exploration, we utilized Hex. We hosted dbt in a container and ran it via GitHub Actions.

The project was prepared and executed by Nikita Volynets, and Tsebek Badmaev did an amazing job documenting the code in GitHub. Now, anyone can reproduce it and learn from it.

I bet everyone learned something new and will use this newfound knowledge at work or in interviews.

Link to the repo: https://lnkd.in/g4PXNV_W

Link for like: https://www.linkedin.com/posts/dmitryanoshin_dataengineering-dbt-snowflake-activity-7137879439664693248-V1wzp
Before you start chasing analytics and data engineering jobs, I usually suggest becoming more or less fluent with a few things:

- CLI: popular commands, navigation, vim and nano text editors, permissions, environment variables
- GitHub (or any similar platform) with focus on Code Reviews, PRs, development lifecycle, basic pre-commit, CI/CD
- Containers: Dockerfile, image, compose
- IDE of your choice: don't know where to start? Take Visual Studio Code.

You don't need to be a pro in any of these, but it will make a difference and pay off in the long term, i.e. #engineeringexcellence

Last week at Surfalytics I ran the CLI and GitHub sessions, and next Saturday I'm planning to wrap up containers.

All of this will be wrapped into three simple free courses - "Just enough <TERM> for a data professional"

Link for like: https://www.linkedin.com/posts/dmitryanoshin_engineeringexcellence-dataengineering-analyticsengineering-activity-7138073576309489667-_H2h
AI competition is weird. But this goes deeper than "Which LLM is best?" or companies trying to back the winning horse. Behind the scenes is a cloud war and a chip war - that's where the money is. Let’s take a look:


1. This is a cloud war.

Let’s take Anthropic, for example. They’re committing to use AWS as their primary cloud provider. That could translate into billions in revenue for AWS as Anthropic scales up.

By investing in Anthropic and its large language model Claude, Amazon is positioning itself to reap the benefits of the growing AI market.

As Claude gains popularity and drives more businesses to adopt AI solutions, it funnels money back to Amazon through increased usage of AWS services.

This strategic investment not only strengthens Amazon's position in the AI space but also creates a virtuous cycle of growth for its cloud business.

Guys - everyone is doing this. Investing huge amounts and getting it back in cloud services. That should command our attention.

The war between MS Azure, Google Cloud and AWS is worth billions and it’s only going to get bigger.


2. This is a chip war.

Chips are everything - they’re the engines. And up till now Nvidia has ruled the world.

But let’s just look at the last few weeks:

Nvidia:
The company announced the H200 GPU on November 13. This new chip is designed for AI work and upgrades the H100 with 1.4x more memory bandwidth and 1.8x more memory capacity. The first H200 chips are expected to be released in the 2nd quarter of 2024.

Microsoft:
Microsoft unveiled the Maia 100 artificial intelligence chip on November 15. The chip is designed for AI tasks and generative AI. The company hasn’t provided a specific timeline for the release of the Maia 100, but it is expected to arrive in early 2024.

Amazon:
Amazon Web Services (AWS) announced the next generation of two AWS-designed chip families—AWS Graviton4 and AWS Trainium2—on November 28. These chips are designed for a broad range of customer workloads, including ML and AI applications - that was at their big show in Vegas.

And Google has jumped into this race as well.
Cloud Computing.pdf
1.5 MB
Cloud Computing is one of the core skills in the data analytics domain. Why is this so?

95% of modern organizations host their Data Analytics solutions on Public Cloud platforms such as AWS, Azure, or GCP. Sometimes, they may utilize multiple cloud vendors.

The bottom line is that knowledge of Cloud is a must-have skill for anyone in the modern data workforce.

For the 3rd year in a row, I am running a Cloud Computing Fundamentals course with the University of Victoria.

Here’s a 6-week program overview:
Week 1: Cloud Computing overview and essential cloud technologies.
Week 2: Cloud concepts and business benefits.
Week 3: Cloud security fundamentals.
Week 4: Cloud Architectures and Cloud Migration.
Week 5: Modern Analytics - BI, Big Data, and AI.
Week 6: Cloud Career paths and professional certifications.

There are still some spots available. You can register at the following link: Cloud Computing for Business

The course starts on Jan 09, 2024. We are planning to have 2 classes per week in the evening.

This is a great option to close cloud computing gaps and understand how modern organizations are using cloud computing to run their businesses, analytics, and AI workloads. As a bonus, we will talk about hiring and interviewing in North America.

There will be lots of hands-on experience with Azure and optionally with AWS and GCP.

Link to like/share https://www.linkedin.com/posts/dmitryanoshin_cloud-computing-fundamentals-activity-7140383754274942977-Faso
The primary source of data for gaming is #telemetry. What is telemetry, and how do we collect and analyze it?

The word Telemetry is derived from the Greek roots tele, "remote", and metron, "measure".

Games are state machines: the player creates a continual loop of actions and responses that keeps the game state changing. These loops often keep the user engaged over a period of time.

Telemetry helps discover #who is performing #what action, #when, and where in the game. It cannot tell you #why.

Back in the day, I built a Delta Lake using Azure Databricks for one of the largest Xbox game studios. It was a new concept for me, especially coming from Amazon, where we faced numerous privacy-related issues with the traditional data lake on S3 and EMR. Delta Lake emerged as a promising technology to address these problems, and it succeeded.

The Delta Lake approach worked brilliantly for game analytics. It enabled us to stream telemetry data into the Bronze and Silver layers in near real-time using Delta Streaming, and to compute metrics in the Gold layer every hour.

Databricks allowed us to unify data engineering, analytics, and machine learning use cases in the same workspace.
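To illustrate the idea (this is a hedged sketch, not our actual pipeline), here is what the Bronze-to-Silver hop looks like with Spark Structured Streaming over Delta tables, assuming a Delta-enabled cluster such as Databricks; the paths, columns, and checkpoint location are hypothetical:

```python
# A hedged sketch of the Bronze -> Silver streaming hop with Spark Structured
# Streaming over Delta tables, assuming a Delta-enabled cluster (e.g. Databricks);
# paths, columns, and the checkpoint location are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telemetry-stream-sketch").getOrCreate()

# Bronze: raw telemetry events land here and are read as an unbounded stream.
bronze = spark.readStream.format("delta").load("/mnt/lake/bronze/telemetry")

# Silver: light cleaning and typing applied in near real-time.
silver = (
    bronze
    .filter(F.col("event_name").isNotNull())
    .withColumn("event_time", F.to_timestamp("event_ts"))
)

# Append the cleaned stream to the Silver table; the checkpoint makes the stream
# restartable without reprocessing already-handled data.
(
    silver.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/lake/_checkpoints/silver_telemetry")
    .outputMode("append")
    .start("/mnt/lake/silver/telemetry")
)
```

A scheduled job can then aggregate the Silver table into Gold metrics every hour, as described above.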

Using Azure Bicep and Azure DevOps, we maintained our infrastructure as code.

The bottom line is that Apache Spark and the Lakehouse architecture remain significant today, enabling teams to tackle real business problems effectively.

A close alternative to Databricks would be Snowflake, if you're willing to incur the associated costs. If not, there are numerous open-source technologies available that can serve your needs, assuming you have a competent engineering team.

Link: https://www.linkedin.com/posts/dmitryanoshin_telemetry-who-what-activity-7140747632137674752-ssRL
What exactly is Apache Spark, especially for those who have never worked with it?

Let's simplify the concept. Spark is a powerful tool for in-memory processing, and it is one of the best data engineering tools - it might cover 80% of data engineering use cases.

It reads raw data from a source, and optionally, transforms and manipulates it before writing it to a target. When writing data to a destination, we can define a TABLE that points to the data files in storage. This table is not for storing data; it's just metadata, containing information about the data's path, partitions, indexes, etc.

An important thing to note about Apache Spark is that it's neither a data warehouse nor a data lake. It's a distributed computing engine that runs Spark software.

The choice of where and in what format to store data is up to us.

In terms of how Spark handles data, it uses an abstraction called a dataframe, similar to a traditional database table with columns and rows. The key difference is that a dataframe is an abstraction of the data, whereas a table usually contains the data. This distinction might not always hold true for systems like Trino, Presto, Athena, and various serverless offerings from vendors.

So, in a nutshell, remember that Spark is a computing engine that facilitates data READ, WRITE, and TRANSFORMATION operations. It's a widely used solution for building Data Lakes or Lakehouses, often utilizing the Delta format.
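As a minimal PySpark sketch of that READ -> TRANSFORM -> WRITE pattern (the paths, columns, and table name are hypothetical):

```python
# A minimal PySpark sketch of READ -> TRANSFORM -> WRITE; paths, columns,
# and the table name are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-etl-sketch").getOrCreate()

# READ: pull raw files from storage into a dataframe (an abstraction over the
# data, not data stored inside Spark).
orders = spark.read.parquet("s3://my-bucket/raw/orders/")

# TRANSFORM: lazy, distributed operations expressed against the dataframe.
daily_revenue = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy(F.to_date("order_ts").alias("order_day"))
    .agg(F.sum("amount").alias("revenue"))
)

# WRITE: persist the result to storage of our choice; the table below is only
# metadata pointing at those files.
daily_revenue.write.mode("overwrite").parquet("s3://my-bucket/curated/daily_revenue/")
spark.sql(
    "CREATE TABLE IF NOT EXISTS daily_revenue "
    "USING PARQUET LOCATION 's3://my-bucket/curated/daily_revenue/'"
)
```

The data files live in whatever storage and format we picked; the table registration at the end is just metadata on top of them.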

Furthermore, at Surfalytics, we're gearing up to focus on Apache Spark, Databricks, and Delta Lake starting in January 2024. We'll start with basics like running Spark on a local laptop and using Docker. After gaining a thorough understanding of Spark, we'll advance to Databricks and Delta Lake, exploring more complex topics like Unity Catalog, Structured Streaming, MLOps, and LLMs.

Are you interested in learning in a cohort-based setting and advancing your career? Join us for the cost of a Netflix subscription -> https://surfalytics.com/
What is Delta Lake?

Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the data lakehouse.

It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

If you want to learn more - check the paper "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores" https://lnkd.in/d7uvbhw5

What is ACID and why do we need it?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These four properties are essential for ensuring reliable transaction processing in database systems, and they play a crucial role in data management and data integrity.

In other words, Delta builds the bridge between data warehouse and data lake.
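As a small illustration, here is a hedged PySpark sketch of an ACID upsert on a Delta table, assuming a Spark session configured with the delta-spark package; the path and columns are hypothetical:

```python
# A hedged sketch of an ACID upsert (MERGE) on a Delta table, assuming a Spark
# session configured with the delta-spark package; path and columns are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-acid-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Initial load: Delta writes Parquet data files plus a _delta_log transaction log.
initial = spark.createDataFrame([(1, "created"), (2, "created")], ["order_id", "status"])
initial.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# Upsert: the MERGE runs as a single ACID transaction, so readers never observe
# a half-applied update.
updates = spark.createDataFrame([(1, "shipped"), (3, "created")], ["order_id", "status"])
orders = DeltaTable.forPath(spark, "/tmp/delta/orders")
(
    orders.alias("t")
    .merge(updates.alias("u"), "t.order_id = u.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

The MERGE either commits fully or not at all, and concurrent readers keep seeing the previous table snapshot until it does.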

There are two alternatives available:

Apache Hudi: Developed by Uber Engineering, it offers capabilities for managing large-scale data storage with features like record-level insert, update, and delete operations.

Apache Iceberg: Created by Netflix, this platform provides a table format that improves data accessibility and reliability for large analytic datasets.

Which one to choose?

As you may guess, overall they all work fine; the choice depends on your team's preference and the stack you have now.

Post to like: https://www.linkedin.com/posts/dmitryanoshin_dataengineering-surfalytics-deltalake-activity-7142977678537662464-75Iq