Data Memes
All the best data memes in one place!

https://surfalytics.com πŸ„β€β™€οΈ
The primary source of data in gaming is #telemetry. What is telemetry, and how do we collect and analyze it?



The word Telemetry is derived from the Greek roots tele, "remote", and metron, "measure".



Games are state machines: the player creates a continual loop of actions and responses that keeps the game state changing. These loops often keep the user engaged over a period of time.



Telemetry helps us discover #who is performing #what action, #when, and where in the game. It cannot tell us #why.
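
For illustration, a single telemetry event might look something like this (the field names are hypothetical, not from any particular game):

```python
# A hypothetical gameplay telemetry event: it captures who, what, when and
# where, but nothing about the player's motivation (the "why").
telemetry_event = {
    "player_id": "player_42",             # who
    "action": "level_completed",          # what
    "timestamp": "2023-12-01T18:42:07Z",  # when
    "map": "frozen_lake",                 # where (in-game location)
    "position": {"x": 128.5, "y": 64.0},
    "session_id": "a1b2c3d4",
}
```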



Back in the day, I built a Delta Lake using Azure Databricks for one of the largest Xbox game studios. It was a new concept for me, especially coming from Amazon, where we had faced numerous privacy-related issues with a traditional data lake on S3 and EMR. Delta Lake emerged as a promising technology to address these problems, and it succeeded.



The Delta Lake approach worked brilliantly for game analytics. It enabled us to stream telemetry data into the Bronze and Silver layers in near real-time using Delta Streaming, and to compute metrics in the Gold layer every hour.
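
As a rough PySpark Structured Streaming sketch of that flow (the paths, schema, and job cadence below are illustrative, not the actual pipeline):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("telemetry-medallion").getOrCreate()

# Bronze: land raw telemetry JSON into a Delta table in near real-time.
bronze = (
    spark.readStream.format("json")
    .schema("player_id STRING, action STRING, ts TIMESTAMP, map STRING")
    .load("/mnt/raw/telemetry/")
)
(bronze.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze")
    .start("/mnt/bronze/telemetry"))

# Silver: cleaned and deduplicated events, still streaming from Bronze.
silver = (
    spark.readStream.format("delta").load("/mnt/bronze/telemetry")
    .dropDuplicates(["player_id", "action", "ts"])
    .withColumn("event_date", F.to_date("ts"))
)
(silver.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver")
    .start("/mnt/silver/telemetry"))

# Gold: metrics computed in batch from Silver, scheduled to run hourly
# (for example, as a Databricks job).
gold = (
    spark.read.format("delta").load("/mnt/silver/telemetry")
    .groupBy("event_date", "map", "action")
    .count()
)
gold.write.format("delta").mode("overwrite").save("/mnt/gold/action_counts")
```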

Databricks allowed us to unify data engineering, analytics, and machine learning cases in the same workspace.



Using Azure Bicep and Azure DevOps, we maintained our infrastructure as code.



The bottom line is that Apache Spark and the Lakehouse architecture remain significant concepts today, enabling teams to tackle real business problems effectively.



A close alternative to Databricks would be Snowflake, if you're willing to incur the associated costs. If not, there are numerous open-source technologies available that can serve your needs, assuming you have a competent engineering team.



Link: https://www.linkedin.com/posts/dmitryanoshin_telemetry-who-what-activity-7140747632137674752-ssRL
What exactly is Apache Spark, especially for those who have never worked with it?

Let's simplify the concept. Spark is a powerful tool for in-memory data processing, and it is one of the best data engineering tools available, capable of covering perhaps 80% of data engineering use cases.

It reads raw data from a source and, optionally, transforms and manipulates it before writing it to a target. When writing data to a destination, we can define a TABLE that points to the data files in storage. This table does not store data itself; it's just metadata, containing information about the data's path, partitions, indexes, and so on.
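
A minimal PySpark sketch of that read β†’ transform β†’ write flow, with a table registered as pure metadata on top of the output files (paths and column names are made up for the example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("read-transform-write").getOrCreate()

# READ: raw data from a source.
orders = spark.read.option("header", "true").csv("/data/raw/orders.csv")

# TRANSFORM: clean up types and aggregate.
daily_revenue = (
    orders.withColumn("amount", F.col("amount").cast("double"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# WRITE: store the result as Parquet files in the target location...
daily_revenue.write.mode("overwrite").parquet("/data/curated/daily_revenue")

# ...and register a TABLE that is just metadata pointing at those files.
spark.sql("""
    CREATE TABLE IF NOT EXISTS daily_revenue
    USING PARQUET
    LOCATION '/data/curated/daily_revenue'
""")
```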

An important thing to note about Apache Spark is that it's neither a data warehouse nor a data lake. It's a distributed computing engine that runs Spark software.

The choice of where and in what format to store data is up to us.
In terms of how Spark handles data, it uses an abstraction called a dataframe, similar to a traditional database table with columns and rows. The key difference is that a dataframe is an abstraction of the data, whereas a table usually contains the data. This distinction might not always hold true for systems like Trino, Presto, Athena, and various serverless offerings from vendors.
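
A quick way to see that a dataframe is an abstraction rather than stored data: transformations only build a query plan, and Spark touches the files only when an action runs (the path below is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Defining a dataframe and a transformation builds a query plan; no rows
# are processed yet.
df = spark.read.parquet("/data/curated/daily_revenue")
high_revenue = df.filter("revenue > 1000")

high_revenue.explain()  # prints the plan without moving any data
high_revenue.show(5)    # an action: only now does Spark actually read the files
```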

So, in a nutshell, remember that Spark is a computing engine that facilitates data READ, WRITE, and TRANSFORMATION operations. It's a widely used solution for building Data Lakes or Lakehouses, often utilizing the Delta format.

Furthermore, at Surfalytics, we're gearing up to focus on Apache Spark, Databricks, and Delta Lake starting in January 2024. We'll start with basics like running Spark on a local laptop and using Docker. After gaining a thorough understanding of Spark, we'll advance to Databricks and Delta Lake, exploring more complex topics like Unity Catalog, Structured Streaming, MLOps, and LLMs.

Are you interested in learning in a cohort-based setting and advancing your career? Join us for the cost of a Netflix subscription -> https://surfalytics.com/
What is Delta Lake?

Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the data lakehouse.

It extends Parquet data files with a file-based transaction log for ACID transactions and scalable metadata handling.

If you want to learn more, check out the paper "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores": https://lnkd.in/d7uvbhw5
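
A minimal sketch of Delta in practice, assuming a Databricks runtime or a local Spark session configured with the delta-spark package (paths are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Writing in Delta format produces Parquet data files plus a _delta_log/
# directory holding the JSON transaction log.
events = spark.createDataFrame(
    [("player_42", "level_completed"), ("player_7", "item_purchased")],
    ["player_id", "action"],
)
events.write.format("delta").mode("overwrite").save("/data/delta/events")

# Every write is an ACID transaction: an append either commits fully to the
# log or is never visible to readers.
events.write.format("delta").mode("append").save("/data/delta/events")

# The transaction log also enables time travel: read an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/delta/events")
v0.show()
```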

What is ACID and why do we need it?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These four properties are essential for ensuring reliable transaction processing in database systems, and they play a crucial role in data management and data integrity.

In other words, Delta bridges the gap between the data warehouse and the data lake.

There are two alternatives available:

Apache Hudi: Developed by Uber Engineering, it offers capabilities for managing large-scale data storage with features like record-level insert, update, and delete operations.

Apache Iceberg: Created at Netflix, it provides an open table format that improves data accessibility and reliability for large analytic datasets.

Which one to choose?

As you may guess, overall they all work well; the choice comes down to your team's preferences and the stack you already have.

Post to like: https://www.linkedin.com/posts/dmitryanoshin_dataengineering-surfalytics-deltalake-activity-7142977678537662464-75Iq
If you're a Data Engineer working on Azure, you'll likely encounter Synapse or Databricks.



I acknowledge that there are other options like HDInsight (which is somewhat outdated), Microsoft Fabric (still quite new and lacking a clear data engineering focus), or Azure Data Explorer, which represents a different paradigm.



Let's focus on Synapse and Databricks.



So, which one should you choose?



Based on my experience, engineers tend to prefer Databricks, whereas organizations often opt for Synapse and expect engineers to manage it. There are numerous arguments in favor of each product. Both are accessible through the same Azure portal and have considerable overlap.



This post aims to highlight the differences between the two, and I've summarized these in the accompanying picture.



Which one do you prefer?



#dataengineering #surfalytics #synapse #databricks #azure #azuredataengineer

https://www.linkedin.com/posts/dmitryanoshin_dataengineering-surfalytics-synapse-activity-7143313309646094336-oigm
What is the most cost-effective solution for data engineering?

The answer doesn't lie in a specific vendor or technology, but in understanding the use cases and setting up optimal infrastructure. This involves leveraging software best practices and appropriately sizing hardware. In essence, cost effectiveness correlates strongly with the team's skill set for a particular stack and with having cost-tracking initiatives in place.

Often, open source solutions can be more expensive than commercial offerings. Consider a scenario where a team of 10 data engineers struggles with infrastructure configuration and daily on-call duties for an open-source solution to keep pipelines operational, versus just 2 data engineers efficiently managing the same with a commercial offering.

For fun, I tried to find evidence that Snowflake is cheaper than Databricks, but I couldn't. This seems like a great niche for Snowflake's SEO optimization.
I suspect Snowflake should cost roughly the same for the same compute capacity, but with a Data Warehouse as a Service, there's often less focus on proper optimization and query patterns.

Have you seen any evidence that Snowflake is cheaper than Databricks? How would you choose the right solution?

https://www.linkedin.com/posts/dmitryanoshin_dataengineering-dataanalytics-surfalytics-activity-7143663492347117568-JrB1