Data Memes

Any data engineer should know the terms MPP and SMP.

Let me share a story about laundromats that will help you to remember this forever.

But first the theory.

SMP - Symmetric Multi-Processing
● Traditionally one server systems
● Data stored locally
● Processors share single OS, memory, I/O devices
● Scale-up only - physical limitations to scaling to accommodate workload

MPP - Massively Parallel Processing
● Multi-node(server) systems
● Data stored externally
● Scale-out - add more Compute nodes, each with
dedicated CPU, memory & I/O subsystems
● No single point of contention

Examples of SMP are SQL Server, MySQL, Postgres.
Examples of MPP are Redshift, Synapse Dedicated Pool, BigQuery, Snowflake.

The bottom line, that usually for data engineering projects we are going to utilize the MPP to handle big volume of data in distributed method. By the end of the day, it depends on requirements and it is the role of DE to make the right call.

The concept of using washing machines as an analogy isn't originally mine; it's borrowed from an age-old Teradata study guide. However, I find it quite appealing. This metaphor effectively illuminates the function and method of each approach. Plus, it's a memorable way to understand these concepts.

Let's imagine that two friends have plans for the Friday evening and they both have a mountain of laundry to do.

Sam decides to wait until Saturday to do their laundry. They believe using one large machine at the laundromat will be sufficient. On Saturday, Sam heads to the laundromat, only to find that the single large machine takes much longer to complete the task. Their entire day gets consumed by laundry, eating into their relaxation time.

Max, on the other hand, goes to the laundromat on Friday evening, before the party. They use multiple smaller machines simultaneously, dividing their laundry among them. This parallel approach allows Max to finish the laundry quickly, saving them enough time to enjoy the party and have a free weekend.

Sam, seeing how Max managed to save time and still enjoy the weekend, realizes the efficiency of parallel processing. While the single machine is powerful, it's not always the most time-efficient choice for large tasks.

When it comes to scalability, SMP is known for vertical scale or scale up. MPP is known for horizontal scale and scale up.

Another important term for MPP is "shared nothing" architecture - means each node has own set of CPUs, RAM, drive.

All this lead us to common data engineering problems like data skew, data distribution, I/O, network traffic, shuffling and so on.

I hope you learn a thing and will impress your next hiring manager!

PS Despite the fact that Teradata missed the opportunity for cloud analytics we can learn about another interesting term - "vendor lock", you may ask S&P500 companies who are still Teradata customers.

PS Likes are welcome -> https://www.linkedin.com/posts/dmitryanoshin_mpp-teradata-snowflake-activity-7135674781571436544-y8cS

#mpp #teradata #snowflake #redshift #bigquery #dataengineering #skew | 🏄‍♂️ Dmitry Anoshin

❤11✍1🔥1

323 views17:15