Data Memes
All best data memes in one place!

https://surfalytics.com πŸ„β€β™€οΈ
We had another Saturday of #dataengineering and #analytics projects at Surfalytics. First of all, we tried the new Microsoft #Fabric and built an end-to-end solution with a #lakehouse, a #powerbi data model, a data pipeline, and a dashboard.

Next, we learned about #synapse Analytics, compared it to Fabric, and weighed the pros and cons. For Synapse Analytics we dived deep into the Dedicated SQL Pool, Serverless SQL, and the Spark Pool, and got hands-on with #sql and #pyspark. We also discussed how Azure Synapse differs from the AWS/GCP offerings.

Apart from Azure analytics products, we had another project at the same time with classic tools such as #snowflake, #fivetran and #dbt. The goal was to ingest data with Fivetran and build dbt models.

As usual, we had a productive discussion about job markets, salaries, and hiring managers' expectations, as well as best practices for a killer CV and the overall interview process across North America, Europe, and APAC.

Next Saturday, we will practice something new. We are planning to add #docker, #airflow and #looker to the Snowflake/dbt/Fivetran project, as well as start a new project with #meltano and #duckdb.

The bottom line is that you can spend the weekend relaxing, or you can gain new skills and knowledge and become more competitive in the current job market. As soon as you stop learning new stuff, you start losing your market value. The primary competitors for #surfalytics are #netflix, #xbox, #playstation and so on. You decide how you spend your time.
Let's take a high-level look at the history of #dataengineering and #analytics. We can examine at least the past 40 years to identify several key moments that shaped the industry.

In the past, organizations heavily relied on relational databases. For any kind of data job, they needed programmers to write custom software for data integration or visualization. A simple report with a few charts required significant effort. The first release of Microsoft Excel, therefore, was a game-changer.

Smart individuals saw an opportunity to develop products focusing on these pain points. They introduced tools like BusinessObjects, Cognos, and Crystal Reports in the field of BI. Similar developments happened in data integration and mining (yes, before the 'sexiest job' of data scientist emerged).

Later, these companies were acquired by major vendors like SAP, Oracle, and IBM. This era was marked by everything "Enterprise" grade, such as BI, ETL, DW. At the same time, Informatica was launched, targeting the "enterprise" market.

The volume of relational data grew, and companies like Teradata leveraged horizontal scaling and "shared nothing" architecture, popularizing the MPP approach. Others introduced their MPPs like Exadata, Netezza, and Vertica.

However, this was quite expensive, and only large "enterprise" companies could afford these technologies. Many didn't realize they were falling into the trap of vendor lock-in, and many S&P 500 companies are still using Teradata.

After the 2000s, the rise of Hadoop introduced the key idea of decoupling storage and compute, allowing companies to process and store various data types on a larger scale and more affordably than with "enterprise" DWs. These tools often complemented each other. I almost learned Java to write MapReduce jobs – I'm glad I didn't. Due to Hadoop's imperfections, we saw the emergence of many great products, including Apache Spark.

Simultaneously, the field of Data Science grew, and there was confusion about the term 'data model', which could refer to a data science model or an ER diagram. There was also a surge in Python and R, with numerous free courses on Coursera about data science and R. I almost learned R – again, I'm glad I didn't, as I still see some data engineering projects struggling with R in their pipelines.

Finally, with the rise of Cloud Computing, a new way of building data solutions emerged, starting small and cheap. There are three major players in the public cloud – AWS, Azure, GCP – each with their own set of analytics tools. The first notable one was Redshift. I would erect a monument to the Redshift data warehouse for changing our perception of traditional data warehouses and leading us to consider cloud data warehouses.

We have witnessed many new concepts and products that have transformed the industry, encouraging vendors and companies to move forward with a faster pace and a better engineering experience, exemplified by Snowflake and Databricks.

PS please like here: https://www.linkedin.com/posts/dmitryanoshin_dataengineering-analytics-activity-7134984496461844481-vTGI

PPS It is very important to understand this history to see that every new tool is 50% an old tool; thanks to new technologies like Hadoop and the cloud, it adds new features and closes some old pain points.
I know that everyone is deep into #engineering excellence and crushing technical courses, books, tutorials, pet projects and so on. But we should always think about the #soft part of the projects we are working on; at the end of the day we still work with people. I've attached the non-technical books I picked for 2024.
For next Saturday at the Surfalytics community, we decided to work on two entry-level projects!

The first project will cover foundational knowledge of the Command Line Interface, or simply #cli, and basic shell scripting in the context of data analysts' and data engineers' day-to-day work. We can call it simply "Just Enough CLI for Data Engineers and Analysts". PS bring your kids and parents, CLI is fun!

The second project will be about getting started with Snowflake and Hex. We will learn the #lakehouse architecture, virtual warehouses, time travel, #snowsql and #snowpipe, as well as use #hex as a UI to explore data in the #datawarehouse.

PS Did you know that I was in the first batch of #snowflake #datahero?! β›„

Thinking about learning new stuff and staying on top of everything in the data space? Join us! πŸ˜‹

Unlimited learning opportunities, data engineering and analytics projects, interviewing skills, resume feedback, knowledge sharing and much more. This bus will bring you towards your desired destination and help you land a six-figure offer (am I right about the figures? πŸ˜ƒ ). Anyway, I bet you'll reach your goals way faster and avoid wasting energy and money.

#surfalytics #saturdayforgrow #snowflake #mpp #sharednothing

PS Like welcome https://www.linkedin.com/posts/dmitryanoshin_cli-lakehouse-snowsql-activity-7135442775465984000-Z3ch
Any data engineer should know the terms MPP and SMP.

Let me share a story about laundromats that will help you to remember this forever.

But first the theory.

SMP - Symmetric Multi-Processing
● Traditionally single-server systems
● Data stored locally
● Processors share a single OS, memory, and I/O devices
● Scale-up only - physical limits on scaling to accommodate the workload

MPP - Massively Parallel Processing
● Multi-node (server) systems
● Data stored externally
● Scale-out - add more compute nodes, each with dedicated CPU, memory, and I/O subsystems
● No single point of contention

Examples of SMP are SQL Server, MySQL, Postgres.
Examples of MPP are Redshift, Synapse Dedicated Pool, BigQuery, Snowflake.

The bottom line is that for data engineering projects we usually utilize MPP to handle large volumes of data in a distributed way. At the end of the day, it depends on the requirements, and it is the DE's role to make the right call.

The concept of using washing machines as an analogy isn't originally mine; it's borrowed from an age-old Teradata study guide. However, I find it quite appealing. This metaphor effectively illuminates the function and method of each approach. Plus, it's a memorable way to understand these concepts.

Let's imagine that two friends have plans for Friday evening, and they both have a mountain of laundry to do.

Sam decides to wait until Saturday to do their laundry. They believe using one large machine at the laundromat will be sufficient. On Saturday, Sam heads to the laundromat, only to find that the single large machine takes much longer to complete the task. Their entire day gets consumed by laundry, eating into their relaxation time.

Max, on the other hand, goes to the laundromat on Friday evening, before the party. They use multiple smaller machines simultaneously, dividing their laundry among them. This parallel approach allows Max to finish the laundry quickly, saving them enough time to enjoy the party and have a free weekend.

Sam, seeing how Max managed to save time and still enjoy the weekend, realizes the efficiency of parallel processing. While the single machine is powerful, it's not always the most time-efficient choice for large tasks.
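If you want to make the analogy concrete, here is a toy Python sketch (the loads and the one-second wash cycle are invented) that compares one big machine working through every load sequentially with several smaller machines running in parallel:

```python
import time
from concurrent.futures import ProcessPoolExecutor

LOADS = ["towels", "shirts", "jeans", "bedding"]  # four loads of laundry

def wash(load: str) -> str:
    """One 'washing machine' cycle; sleep stands in for the real work."""
    time.sleep(1)
    return f"{load} done"

def smp_style() -> float:
    """Sam: one big machine, loads processed one after another (scale-up)."""
    start = time.perf_counter()
    for load in LOADS:
        wash(load)
    return time.perf_counter() - start

def mpp_style() -> float:
    """Max: several small machines working in parallel (scale-out)."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=len(LOADS)) as pool:
        list(pool.map(wash, LOADS))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"SMP-style (sequential): {smp_style():.1f}s")  # ~4s
    print(f"MPP-style (parallel):   {mpp_style():.1f}s")  # ~1s plus worker startup overhead
```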

When it comes to scalability, SMP is known for vertical scaling, or scale-up. MPP is known for horizontal scaling, or scale-out.

Another important term for MPP is the "shared nothing" architecture - it means each node has its own set of CPUs, RAM, and drives.

All this leads us to common data engineering problems like data skew, data distribution, I/O, network traffic, shuffling and so on.
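To make data skew tangible, here is a tiny, purely illustrative Python sketch (the customer IDs and row counts are invented): rows are hash-distributed across nodes, and one "whale" customer ends up overloading a single node.

```python
from collections import Counter

NODES = 4

def node_for(key: str) -> int:
    """Hash-distribute a row to one of the compute nodes (MPP-style)."""
    return hash(key) % NODES

# 10,000 orders, but one big customer accounts for 70% of them.
orders = ["customer_1"] * 7_000 + [f"customer_{i}" for i in range(2, 3_002)]

rows_per_node = Counter(node_for(customer_id) for customer_id in orders)
print(rows_per_node)  # one node holds ~70% of the rows -> it becomes the bottleneck
```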

I hope you learned something and will impress your next hiring manager!

PS Despite the fact that Teradata missed the cloud analytics opportunity, we can learn another interesting term from it - "vendor lock-in". You may ask the S&P 500 companies that are still Teradata customers.

PS Likes are welcome -> https://www.linkedin.com/posts/dmitryanoshin_mpp-teradata-snowflake-activity-7135674781571436544-y8cS
What is Surfalytics?
Shared Nothing Architecture is a key term in data engineering and distributed computing. It was coined several decades ago by Michael Stonebraker and was used in Teradata's 1983 database system.

Shared Nothing Architecture (SN) is a computing setup where each task is handled independently by separate units in a computer network. This method avoids delays and conflicts common in "shared everything" systems, where multiple units may need the same resources simultaneously.

SN systems are reliable; if one unit fails, others continue unaffected. They're easily scalable by adding more units. In databases, SN often involves 'sharding', splitting a database into smaller sections stored separately.
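As a rough illustration (the shard count, routing rule, and records are invented), sharding can be thought of as routing each key to one of several fully independent stores, so losing one shard leaves the others untouched:

```python
# Three independent "nodes": no shared memory, disk, or locks between them.
SHARDS = [{}, {}, {}]

def shard_for(user_id):
    """Route a key to exactly one shard (simple modulo sharding)."""
    return SHARDS[user_id % len(SHARDS)]

def put(user_id, record):
    shard_for(user_id)[user_id] = record

def get(user_id):
    return shard_for(user_id).get(user_id)

for uid in range(9):
    put(uid, {"name": f"user_{uid}"})

SHARDS[1].clear()  # simulate one node failing and losing its slice of the data
print(get(3))      # lives on shard 0, unaffected -> {'name': 'user_3'}
print(get(4))      # lived on shard 1 -> None; only that slice is impacted
```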

PS post for likes: https://www.linkedin.com/posts/dmitryanoshin_dataengienering-activity-7136413923381059585-rvlW
API stands for Application Programming Interface: a software intermediary provided by an application to other applications that allows the two to talk to each other. REST stands for REpresentational State Transfer, which is an architectural style. REST defines a set of principles and standard conventions around which APIs can be built, and it is the widely accepted architectural style for building APIs.
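For instance, here is a minimal Python sketch of calling a public REST resource over HTTP; it uses the third-party requests library and GitHub's REST API purely as a familiar example:

```python
import requests  # third-party HTTP client: pip install requests

# REST treats everything as a resource addressed by a URL.
# Example: GitHub's public REST API exposes a repository as a resource.
url = "https://api.github.com/repos/dbt-labs/dbt-core"

# GET = read. The other HTTP verbs map to the remaining operations:
# POST = create, PUT/PATCH = update, DELETE = delete.
response = requests.get(
    url,
    headers={"Accept": "application/vnd.github+json"},  # ask for a JSON representation
    timeout=10,
)
response.raise_for_status()  # fail loudly on 4xx/5xx status codes
repo = response.json()       # the JSON body, parsed into a dict
print(repo["full_name"], repo["stargazers_count"])
```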
The primary question for every data professional out there is: How will Generative AI and LLMs reshape the industry, and what are the expectations for future data professionals?

The answer depends on two opposing options:
1. AI will replace roles like Data Engineer, BI Analyst, Data Scientist, and so on.
2. AI will complement these roles, enabling people to work more efficiently, with higher quality and significant impact.

Whichever option you choose, you’ll agree that a growth mindset and constant learning are key to staying competitive and being ready to pivot your career and pick up the right skills.

Our careers remind me of an underground subway escalator. While it’s going down, you’re moving up, step by step. You may falsely assume that you’ve reached the top, but forget that the escalator is constantly going down.
The bottom line is, as soon as you stop learning and growing, you de facto degrade and lose market value.

At the Surfalytics community, my primary objective is to stay up-to-date with modern directions in the industry, talk with people globally, and move in the same direction.

I feel a wave of power, energy, and momentum that will bring everyone to the right destination, saving them from wasting money and time. On the same note, I feel blessed to see how people are changing their lives forever.

PS In the picture: Tofino, BC! Every summer we run a Surf + Data bootcamp out there!

Link for likes;) https://www.linkedin.com/posts/dmitryanoshin_dataengineering-analytics-dataanalyst-activity-7137165815346315264-iKWk
https://github.com/will-stone/browserosaurus#readme - helps control multiple browsers
The most straightforward yet profound question for newcomers in data engineering is: What is the difference between ETL and ELT?

You can work with tools like dbt and data warehouses without actually considering the difference, but understanding it is crucial as it leads to the right tool choice depending on the use case and requirements.

You might think of ETL as an older concept, from the time when data warehouses were used primarily for storing the results of the Transformation step. This required a powerful ETL server capable of processing the same volume of data, ready to read each record and every row in a table or file. With large volumes of data, this could be expensive.

However, with the rise of Cloud and Cloud Data Warehousing, the need for powerful ETL compute has diminished. Now, we can simply COPY data into cloud storage and then into the cloud data warehouse. After this, we can leverage the powerful compute capabilities of distributed cloud data warehouses or SQL engines.

The advent of cloud computing wasn't the only pivotal moment. Even before the cloud, ETL tools like Informatica employed a 'push down' approach, pushing all data into MPP data warehouses like Teradata, and then orchestrating SQL transformations.

Let's consider a simple example:

In the case of ETL:
1. Extract Orders and Products data.
2. Transform the data (join, clean, aggregate).
3. Load the data into the data warehouse, often using INSERT (a slower, row-by-row process).

In the case of ELT:
1. Extract Orders and Products data.
2. Load the data into the data warehouse, often using COPY for storage accounts and data warehouses (a faster, bulk load process).
3. Transform with SQL, tools like dbt, or DataFrames.

Reflecting on the role of Spark, it becomes clear that Spark is an actual ETL tool since it reads the data when performing transformations.
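Here is a minimal Python sketch of the two patterns; the table names, stage path, and the run_sql helper are hypothetical, and the COPY/CREATE TABLE statements are Snowflake-flavoured, so the exact syntax will differ per warehouse:

```python
import pandas as pd

def run_sql(statement):
    """Hypothetical helper that would submit SQL to the warehouse
    (via a Snowflake/Redshift/BigQuery client); stubbed out to just print here."""
    print(statement.strip())

# Toy source data standing in for the Orders and Products extracts.
orders = pd.DataFrame({"product_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})
products = pd.DataFrame({"product_id": [1, 2], "category": ["books", "toys"]})

# --- ETL: transform on separate compute (pandas here), then load the result ---
revenue = (
    orders.merge(products, on="product_id")                       # Transform outside the DW
          .groupby("category", as_index=False)["amount"].sum()
)
for _, row in revenue.iterrows():                                 # Load, often row-by-row INSERTs
    run_sql(f"INSERT INTO dw.revenue_by_category VALUES ('{row.category}', {row.amount})")

# --- ELT: bulk-load the raw data first, then transform with the warehouse's compute ---
run_sql("COPY INTO raw.orders FROM @stage/orders.csv")            # fast bulk load
run_sql("COPY INTO raw.products FROM @stage/products.csv")
run_sql("""
    CREATE OR REPLACE TABLE dw.revenue_by_category AS             -- Transform inside the DW (SQL/dbt)
    SELECT p.category, SUM(o.amount) AS amount
    FROM raw.orders AS o
    JOIN raw.products AS p ON o.product_id = p.product_id
    GROUP BY p.category
""")
```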

Link to share to like: https://www.linkedin.com/posts/dmitryanoshin_etl-elt-dataengineering-activity-7137583690678747136-7imR
Say No to "Season of Yes".
I know some of you have challenges due to a lack of knowledge of cloud computing - Azure, AWS, GCP.

Starting in January, I will run a 6-week online program at the University of Victoria: Tuesday/Thursday, 6pm PST, for 2 hours.

The price is 715 CAD. The money goes to the university; I am not getting much. But it is a great opportunity to close the gap in cloud computing and Azure/AWS, and to do lots of hands-on work.

Your employer may pay for this course.

Highly recommended. I think there are 10 seats left.

https://continuingstudies.uvic.ca/data-computing-and-technology/courses/cloud-computing-for-business/