Machine Learning
Machine learning insights, practical tutorials, and clear explanations for beginners and aspiring data scientists. Follow the channel for models, algorithms, coding guides, and real-world ML applications.

Admin: @HusseinSheikho || @Hussein_Sheikho
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment

---

### 1. Working with UDFs (User Defined Functions)

UDFs allow custom Python functions to be used in PySpark transformations.

#### Define and Use a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def label_age(age):
    return "Senior" if age > 50 else "Adult"

label_udf = udf(label_age, StringType())

df.withColumn("AgeGroup", label_udf(df["Age"])).show()


> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.

---

### 2. Working with JSON and Parquet Files

#### Read JSON File:

df_json = spark.read.json("data.json")
df_json.show()


#### Read & Write Parquet File:

df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")


---

### 3. Using PySpark MLlib (Machine Learning Library)

MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.

---

#### Steps in a Typical ML Pipeline:

• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction

---

### 4. Example: Logistic Regression in PySpark

#### Step 1: Prepare Data

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Sample DataFrame
data = spark.createDataFrame([
(1.0, 2.0, 3.0, 1.0),
(2.0, 3.0, 4.0, 0.0),
(1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])

# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)


#### Step 2: Train Model

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)


#### Step 3: Make Predictions

predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()


---

### 5. Model Evaluation

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Default metric is area under the ROC curve (areaUnderROC), not accuracy
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("AUC:", evaluator.evaluate(predictions))


---

### 6. Save and Load Models

# Save
model.save("models/logistic_model")

# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")


---

### 7. PySpark with Pandas API on Spark

To write pandas-style code that runs on a Spark backend, use the pyspark.pandas module (Spark 3.2+):

import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")
pdf.head()


> Works like Pandas, but with Spark backend.

---

### 8. Scheduling & Cluster Deployment

PySpark can run:

• Locally
• On YARN (Hadoop)
• On Mesos
• On Kubernetes
• In Databricks, AWS EMR, Google Cloud Dataproc

Use spark-submit for production scripts:

spark-submit my_script.py


---

### 9. Tuning and Optimization Tips

• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid using collect() on large datasets

---

### 10. Summary of Part 3

• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques

---

### Exercise

1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit

---

#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark

https://t.iss.one/DataScienceM