Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
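As a comparison, the same labeling can usually be expressed with built-in column functions, which Spark can optimize; a minimal sketch using the same df and Age column as above:
from pyspark.sql.functions import when, col
# Same AgeGroup logic with built-in expressions instead of a UDF
df.withColumn("AgeGroup", when(col("Age") > 50, "Senior").otherwise("Adult")).show()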
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
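Parquet writes often need a save mode and partitioning; a minimal sketch of those DataFrameWriter options (the "year" partition column is hypothetical):
# Overwrite any existing output and partition the files by a column
df_parquet.write.mode("overwrite").partitionBy("year").parquet("output_folder/")
---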
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
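In code, these stages chain together with pyspark.ml.Pipeline so that fit and transform run them in order; a minimal sketch, assuming a training DataFrame df with feature columns f1, f2, f3 and a label column (hypothetical here; section 4 builds one step by step):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])   # feature engineering + model training
pipeline_model = pipeline.fit(df)             # fits every stage in order
predictions = pipeline_model.transform(df)    # adds features and predictions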
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # default metric: areaUnderROC
print("Area under ROC:", evaluator.evaluate(predictions))
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
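Saving to a path that already exists raises an error; for reruns, the overwrite writer is the usual approach (a minimal sketch):
# Overwrite a previously saved model at the same path
model.write().overwrite().save("models/logistic_model")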
---
### 7. PySpark with Pandas API on Spark
For a pandas-compatible API on top of Spark DataFrames, use pyspark.pandas:
import pyspark.pandas as ps
pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
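Pandas-on-Spark and regular Spark DataFrames convert back and forth; a minimal sketch, assuming Spark 3.2+ where pyspark.pandas ships with PySpark:
sdf = pdf.to_spark()      # pandas-on-Spark DataFrame -> Spark DataFrame
pdf2 = sdf.pandas_api()   # Spark DataFrame -> pandas-on-Spark DataFrame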
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• On Mesos
• On Kubernetes
• On managed platforms like Databricks, AWS EMR, and Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
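spark-submit also takes cluster options on the command line; a sketch of a YARN cluster submission (the resource numbers are placeholders, not recommendations):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_script.py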
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets
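A minimal sketch of the first three tips, assuming a reused DataFrame named df and a skew-prone key column "key" (both hypothetical):
df.cache()                              # keep a reused DataFrame in memory across actions
df_even = df.repartition(200, "key")    # spread a skewed key over more partitions
df_even.show(5)                         # inspect a few rows instead of collect()
---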
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://t.iss.one/DataScienceM