Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts
---
### 1. What is PySpark?
PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.
PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:
• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations
---
### 2. PySpark Ecosystem Components
• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation
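All of these components are reached from the same Python package. A minimal sketch of the entry points (the app name is arbitrary, and the streaming line is only a comment):
# Illustrative entry points for the main components
from pyspark.sql import SparkSession     # Spark SQL / DataFrame API
from pyspark.sql import functions as F   # built-in column functions
from pyspark.ml import Pipeline          # MLlib (DataFrame-based machine learning)
spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()
# Structured Streaming is reached through the same session, e.g. spark.readStream.format("rate").load()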
---
### 3. Why PySpark over Pandas?
| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Execution      | Pure Python           | Python API on the JVM (via Py4J) |
| Learning Curve | Easier | Medium (Big Data focus) |
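The two libraries also interoperate. A small sketch (assuming pandas is installed and `spark` is an active SparkSession):
import pandas as pd
pandas_df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
spark_df = spark.createDataFrame(pandas_df)   # pandas -> Spark
back_to_pandas = spark_df.toPandas()          # Spark -> pandas (collects data to the driver)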
---
### 4. PySpark Setup in Local Machine
#### Install PySpark via pip:
pip install pyspark
#### Start PySpark Shell:
pyspark
#### Sample Code to Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
---
### 5. RDD vs DataFrame
| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
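A brief sketch contrasting the two APIs on the same data (assuming `spark` is an active SparkSession):
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])   # low-level RDD
doubled = rdd.map(lambda row: (row[0], row[1] * 2)).collect()        # manual transformation
df = rdd.toDF(["Name", "Age"])                                       # same data as a DataFrame
df.selectExpr("Name", "Age * 2 AS DoubledAge").show()                # planned by the Catalyst optimizer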
---
### 6. Creating DataFrames
#### From Python List:
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
#### From CSV File:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
---
### 7. Inspecting DataFrames
df.printSchema() # Schema info
df.columns # List column names
df.describe().show() # Summary stats
df.head(5) # First 5 rows
---
### 8. Basic Transformations
df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()
---
### 9. Working with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()
---
### 10. Writing Data
df.write.csv("output.csv", header=True)   # note: Spark writes a directory of part files, not a single file
df.write.parquet("output_parquet/")
---
### 11. Summary of Concepts Covered
• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage
---
### Exercise
1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file
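A possible solution sketch for the exercise above (the file name people.csv and its columns are assumptions):
from pyspark.sql.functions import col
df = spark.read.csv("people.csv", header=True, inferSchema=True)   # hypothetical input file
df.printSchema()                                                   # 1. display the schema
df = df.withColumn("AgePlus10", col("Age") + 10)                   # 2. calculated column
filtered = df.filter(col("Age") > 25)                              # 3. filter rows
filtered.write.mode("overwrite").parquet("people_filtered/")       # 4. save as Parquet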
---
#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations
---
### 1. Column Operations
PySpark supports various column-wise operations using expressions.
#### Select Specific Columns:
df.select("Name", "Age").show()
#### Create/Modify Column:
from pyspark.sql.functions import col
df.withColumn("AgePlus5", col("Age") + 5).show()
#### Rename a Column:
df.withColumnRenamed("Age", "UserAge").show()
#### Drop Column:
df.drop("Age").show()
---
### 2. Filtering and Conditional Logic
#### Filter Rows:
df.filter(col("Age") > 25).show()
#### Multiple Conditions:
df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()
#### Using `when` for Conditional Columns:
from pyspark.sql.functions import when
df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()
---
### 3. Aggregations and Grouping
#### GroupBy + Aggregations:
df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()
#### Using Aggregate Functions:
from pyspark.sql.functions import avg, max, min, count
df.groupBy("Department").agg(
    avg("Salary").alias("AvgSalary"),
    max("Salary").alias("MaxSalary")
).show()
---
### 4. Sorting and Ordering
#### Sort by One or More Columns:
df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()
---
### 5. Dropping Duplicates & Handling Missing Data
#### Drop Duplicates:
df.dropDuplicates(["Name", "Age"]).show()
#### Drop Rows with Nulls:
df.dropna().show()
#### Fill Null Values:
df.fillna({"Salary": 0}).show()
---
### 6. Joins in PySpark
PySpark supports the standard SQL join types.
#### Types of Joins:
• inner
• left
• right
• outer
• left_semi
• left_anti
#### Example – Inner Join:
df1.join(df2, on="id", how="inner").show()
#### Left Join Example:
df1.join(df2, on="id", how="left").show()
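The less common join types are used the same way; a small sketch with the same df1 and df2:
df1.join(df2, on="id", how="left_semi").show()   # rows of df1 that have a match in df2
df1.join(df2, on="id", how="left_anti").show()   # rows of df1 with no match in df2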
---
### 7. Working with Dates and Timestamps
from pyspark.sql.functions import current_date, current_timestamp
df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()
#### Date Formatting:
from pyspark.sql.functions import date_format
df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()
---
### 8. Window Functions (Advanced Aggregations)
Used for operations like ranking, cumulative sum, and moving average.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()
---
### 9. Caching and Persistence
Use caching for performance when reusing data:
df.cache()
df.show()
Or use:
df.persist()
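persist() also accepts an explicit storage level; a short sketch:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if the data does not fit in memory
df.count()                                 # trigger an action so the data is materialized
df.unpersist()                             # release the cached data when done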
---
### 10. Summary of Concepts Covered
• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance
---
### Exercise
1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance
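A possible sketch for items 1-3 (file names, column names, and the join key are assumptions):
from pyspark.sql.functions import col, when, avg, row_number
from pyspark.sql.window import Window
emp = spark.read.csv("employees.csv", header=True, inferSchema=True)     # hypothetical input files
dept = spark.read.csv("departments.csv", header=True, inferSchema=True)
joined = emp.join(dept, on="dept_id", how="inner")                       # 1. join the datasets
labeled = joined.withColumn("Level", when(col("Salary") > 50000, "Senior").otherwise("Junior"))  # 2. conditional label
w = Window.partitionBy("Department").orderBy(col("Salary").desc())
labeled.withColumn("rank", row_number().over(w)).filter(col("rank") == 1).show()  # 3. top-paid per department
labeled.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()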
---
#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
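When the logic is simple, a vectorized pandas_udf is usually faster than a plain UDF. A sketch (assuming pandas and PyArrow are installed):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
@pandas_udf(StringType())
def label_age_vec(age: pd.Series) -> pd.Series:
    return age.apply(lambda a: "Senior" if a > 50 else "Adult")
df.withColumn("AgeGroup", label_age_vec(df["Age"])).show()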
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
---
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("Area under ROC:", evaluator.evaluate(predictions))   # default metric is areaUnderROC, not accuracy
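For accuracy specifically, MulticlassClassificationEvaluator can be used instead; a sketch:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Accuracy:", acc_eval.evaluate(predictions))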
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
---
### 7. PySpark with Pandas API on Spark
For small-to-medium, pandas-style workloads, use the pyspark.pandas module:
import pyspark.pandas as ps
pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• On Apache Mesos (deprecated since Spark 3.2)
• On Kubernetes
• On managed platforms such as Databricks, AWS EMR, and Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
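Common options can be passed on the command line; an illustrative example (the resource values are placeholders):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_script.py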
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid calling collect() on large datasets
---
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as "Yes"/"No"
5. Deploy your pipeline using spark-submit
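A sketch for item 4, labeling the numeric predictions (building on the predictions DataFrame from Section 4):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
yes_no = udf(lambda p: "Yes" if p == 1.0 else "No", StringType())
predictions.withColumn("PredictedLabel", yes_no(predictions["prediction"])).show()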
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://t.iss.one/DataScienceM
🔥 Trending Repository: data-engineer-handbook
📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering
🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook
📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme
📊 Statistics:
🌟 Stars: 36.3K stars
👀 Watchers: 429
🍴 Forks: 7K forks
💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell
🏷️ Related Topics:
#data #awesome #sql #bigdata #dataengineering #apachespark
==================================
🧠 By: https://t.iss.one/DataScienceM