Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
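As a comparison, the same labeling can usually be expressed with built-in column functions, which Spark can optimize; a minimal sketch using the same df and Age column as above:
from pyspark.sql.functions import when, col
# Same AgeGroup logic with built-in expressions instead of a UDF
df.withColumn("AgeGroup", when(col("Age") > 50, "Senior").otherwise("Adult")).show()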
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
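Parquet writes often need a save mode and partitioning; a minimal sketch of those DataFrameWriter options (the "year" partition column is hypothetical):
# Overwrite any existing output and partition the files by a column
df_parquet.write.mode("overwrite").partitionBy("year").parquet("output_folder/")
---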
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
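In code, these stages chain together with pyspark.ml.Pipeline so that fit and transform run them in order; a minimal sketch, assuming a training DataFrame df with feature columns f1, f2, f3 and a label column (hypothetical here; section 4 builds one step by step):
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])   # feature engineering + model training
pipeline_model = pipeline.fit(df)             # fits every stage in order
predictions = pipeline_model.transform(df)    # adds features and predictions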
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()  # default metric: areaUnderROC
print("Area under ROC:", evaluator.evaluate(predictions))
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
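Saving to a path that already exists raises an error; for reruns, the overwrite writer is the usual approach (a minimal sketch):
# Overwrite a previously saved model at the same path
model.write().overwrite().save("models/logistic_model")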
---
### 7. PySpark with Pandas API on Spark
For a pandas-compatible API on top of Spark DataFrames, use pyspark.pandas:
import pyspark.pandas as ps
pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
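Pandas-on-Spark and regular Spark DataFrames convert back and forth; a minimal sketch, assuming Spark 3.2+ where pyspark.pandas ships with PySpark:
sdf = pdf.to_spark()      # pandas-on-Spark DataFrame -> Spark DataFrame
pdf2 = sdf.pandas_api()   # Spark DataFrame -> pandas-on-Spark DataFrame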
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• On Mesos
• On Kubernetes
• On managed platforms like Databricks, AWS EMR, and Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
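spark-submit also takes cluster options on the command line; a sketch of a YARN cluster submission (the resource numbers are placeholders, not recommendations):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_script.py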
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets
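A minimal sketch of the first three tips, assuming a reused DataFrame named df and a skew-prone key column "key" (both hypothetical):
df.cache()                              # keep a reused DataFrame in memory across actions
df_even = df.repartition(200, "key")    # spread a skewed key over more partitions
df_even.show(5)                         # inspect a few rows instead of collect()
---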
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://t.iss.one/DataScienceM