Machine Learning
Machine learning insights, practical tutorials, and clear explanations for beginners and aspiring data scientists. Follow the channel for models, algorithms, coding guides, and real-world ML applications.

Admin: @HusseinSheikho || @Hussein_Sheikho
πŸ“š Distributed Machine Learning with PySpark (2023)

1⃣ Join Channel Download:
https://t.iss.one/+MhmkscCzIYQ2MmM8

2⃣ Download Book: https://t.iss.one/c/1854405158/916

πŸ’¬ Tags: #PySpark #ML

πŸ‘‰ BEST DATA SCIENCE CHANNELS ON TELEGRAM πŸ‘ˆ
πŸ‘6
πŸ“š Distributed Machine Learning with PySpark (2023)

1⃣ Join Channel Download:
https://t.iss.one/+MhmkscCzIYQ2MmM8

2⃣ Download Book: https://t.iss.one/c/1854405158/957

πŸ’¬ Tags: #ML #PySpark

πŸ‘‰ BEST DATA SCIENCE CHANNELS ON TELEGRAM πŸ‘ˆ
πŸ‘9
Pandas ➑️ Polars ➑️ SQL ➑️ PySpark translations:

Is it useful to you❓

πŸ“‚ Tags: #pandas #Polars #sql #Pyspark

https://t.iss.one/codeprogrammer ⭐️
Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts

---

### 1. What is PySpark?

PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.

PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:

β€’ Handle massive datasets
β€’ Perform distributed computing
β€’ Run parallel data transformations
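
A minimal sketch of the idea (assuming a local installation, set up as in Section 4 below): the same transformation code runs unchanged whether the data sits on one machine or on a cluster.

from pyspark.sql import SparkSession

# Local session for illustration; on a cluster only the master URL changes
spark = SparkSession.builder.master("local[*]").appName("Demo").getOrCreate()

# One transformation, executed in parallel across partitions
rdd = spark.sparkContext.parallelize(range(1_000_000))
squares = rdd.map(lambda x: x * x)
print(squares.take(5))   # [0, 1, 4, 9, 16]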

---

### 2. PySpark Ecosystem Components

β€’ Spark SQL – Structured data queries with DataFrame and SQL APIs
β€’ Spark Core – Fundamental engine for task scheduling and memory management
β€’ Spark Streaming – Real-time data processing
β€’ MLlib – Machine learning at scale
β€’ GraphX – Graph computation

---

### 3. Why PySpark over Pandas?

| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |

---

### 4. PySpark Setup in Local Machine

#### Install PySpark via pip:

pip install pyspark


#### Start PySpark Shell:

pyspark


#### Sample Code to Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()


---

### 5. RDD vs DataFrame

| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
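
A short sketch contrasting the two APIs on the same hypothetical name/age pairs; the DataFrame version is the one the Catalyst optimizer can rewrite automatically.

pairs = [("Alice", 25), ("Bob", 30)]

# RDD: low-level, you spell out *how* to compute
rdd = spark.sparkContext.parallelize(pairs)
print(rdd.filter(lambda row: row[1] > 25).collect())

# DataFrame: high-level, you declare *what* you want
df = spark.createDataFrame(pairs, ["Name", "Age"])
df.filter(df["Age"] > 25).show()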

---

### 6. Creating DataFrames

#### From Python List:

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()


#### From CSV File:

df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()


---

### 7. Inspecting DataFrames

df.printSchema()       # Schema info
df.columns             # List column names
df.describe().show()   # Summary stats
df.head(5)             # First 5 rows


---

### 8. Basic Transformations

df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()


---

### 9. Working with SQL

df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()


---

### 10. Writing Data

df.write.csv("output.csv", header=True)   # creates a directory named output.csv containing part files
df.write.parquet("output_parquet/")


---

### 11. Summary of Concepts Covered

β€’ Spark architecture & PySpark setup
β€’ Core components of PySpark
β€’ Differences between RDD and DataFrames
β€’ How to create, inspect, and manipulate DataFrames
β€’ SQL support in Spark
β€’ Reading/writing to/from storage

---

### Exercise

1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file
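
One possible solution sketch, assuming a hypothetical people.csv with Name and Age columns:

# 1. Load the CSV and display the schema
df = spark.read.csv("people.csv", header=True, inferSchema=True)
df.printSchema()

# 2. Add a calculated column
df = df.withColumn("AgeInMonths", df["Age"] * 12)

# 3. Filter rows on a condition
adults = df.filter(df["Age"] >= 18)

# 4. Save the result as Parquet
adults.write.parquet("adults_parquet/")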

---

#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL

https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations

---

### 1. Column Operations

PySpark supports various column-wise operations using expressions.

#### Select Specific Columns:

df.select("Name", "Age").show()


#### Create/Modify Column:

from pyspark.sql.functions import col

df.withColumn("AgePlus5", col("Age") + 5).show()


#### Rename a Column:

df.withColumnRenamed("Age", "UserAge").show()


#### Drop Column:

df.drop("Age").show()


---

### 2. Filtering and Conditional Logic

#### Filter Rows:

df.filter(col("Age") > 25).show()


#### Multiple Conditions:

df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()


#### Using `when` for Conditional Columns:

from pyspark.sql.functions import when

df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()


---

### 3. Aggregations and Grouping

#### GroupBy + Aggregations:

df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()


#### Using Aggregate Functions:

from pyspark.sql.functions import avg, max, min, count

df.groupBy("Department").agg(
avg("Salary").alias("AvgSalary"),
max("Salary").alias("MaxSalary")
).show()


---

### 4. Sorting and Ordering

#### Sort by One or More Columns:

df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()


---

### 5. Dropping Duplicates & Handling Missing Data

#### Drop Duplicates:

df.dropDuplicates(["Name", "Age"]).show()


#### Drop Rows with Nulls:

df.dropna().show()


#### Fill Null Values:

df.fillna({"Salary": 0}).show()


---

### 6. Joins in PySpark

PySpark supports the same join types as SQL.

#### Types of Joins:

β€’ inner
β€’ left
β€’ right
β€’ outer
β€’ left_semi
β€’ left_anti

#### Example – Inner Join:

df1.join(df2, on="id", how="inner").show()


#### Left Join Example:

df1.join(df2, on="id", how="left").show()


---

### 7. Working with Dates and Timestamps

from pyspark.sql.functions import current_date, current_timestamp

df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()


#### Date Formatting:

from pyspark.sql.functions import date_format

df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()


---

### 8. Window Functions (Advanced Aggregations)

Used for operations like ranking, cumulative sum, and moving average.

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()


---

### 9. Caching and Persistence

Cache a DataFrame that will be reused, so Spark does not recompute it on every action:

df.cache()
df.show()


Or persist it, optionally with an explicit storage level (sketched below):

df.persist()
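
persist() also accepts an explicit storage level; a small sketch using the pyspark.StorageLevel constants:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory
df.count()                                  # an action that materializes the cache
df.unpersist()                              # release it when no longer needed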


---

### 10. Summary of Concepts Covered

β€’ Column transformations and renaming
β€’ Filtering and conditional logic
β€’ Grouping, aggregating, and sorting
β€’ Handling nulls and duplicates
β€’ All types of joins
β€’ Working with dates and window functions
β€’ Caching for performance

---

### Exercise

1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance
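
A possible solution sketch, assuming hypothetical employees.csv and departments.csv files that share a dept_id column and carry Department and Salary columns:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, when, avg, row_number

emp = spark.read.csv("employees.csv", header=True, inferSchema=True)
dept = spark.read.csv("departments.csv", header=True, inferSchema=True)

# 1. Join the two datasets (try "inner", "left", "left_anti", ...)
joined = emp.join(dept, on="dept_id", how="left").cache()   # 4. cache the reused result

# 2. Conditional label column
joined = joined.withColumn("Band", when(col("Salary") > 80000, "High").otherwise("Standard"))

# 3. Average salary per department, then top earner per department via a window
joined.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()
w = Window.partitionBy("Department").orderBy(col("Salary").desc())
joined.withColumn("rank", row_number().over(w)).filter(col("rank") == 1).show()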

---

#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark

https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment

---

### 1. Working with UDFs (User Defined Functions)

UDFs allow custom Python functions to be used in PySpark transformations.

#### Define and Use a UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def label_age(age):
    return "Senior" if age > 50 else "Adult"

label_udf = udf(label_age, StringType())

df.withColumn("AgeGroup", label_udf(df["Age"])).show()


> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
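
For comparison, the same labelling written with built-in functions, which Catalyst can optimize:

from pyspark.sql.functions import when, col

df.withColumn("AgeGroup", when(col("Age") > 50, "Senior").otherwise("Adult")).show()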

---

### 2. Working with JSON and Parquet Files

#### Read JSON File:

df_json = spark.read.json("data.json")
df_json.show()


#### Read & Write Parquet File:

df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")


---

### 3. Using PySpark MLlib (Machine Learning Library)

MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.

---

#### Steps in a Typical ML Pipeline:

β€’ Load and prepare data
β€’ Feature engineering
β€’ Model training
β€’ Evaluation
β€’ Prediction
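
These stages can also be chained into a single pyspark.ml.Pipeline object; a minimal sketch (the stages mirror the example in the next section, and train_df is a hypothetical DataFrame with f1, f2, f3 and label columns):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[assembler, lr])   # feature engineering + training in one object
model = pipeline.fit(train_df)                # train_df: hypothetical training DataFrame
predictions = model.transform(train_df)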

---

### 4. Example: Logistic Regression in PySpark

#### Step 1: Prepare Data

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])

# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)


#### Step 2: Train Model

lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)


#### Step 3: Make Predictions

predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()


---

### 5. Model Evaluation

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()   # default metric: areaUnderROC
print("Area under ROC:", evaluator.evaluate(predictions))


---

### 6. Save and Load Models

# Save
model.save("models/logistic_model")

# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")


---

### 7. PySpark with Pandas API on Spark

For a pandas-compatible API backed by Spark (handy when migrating pandas code), use pyspark.pandas:

import pyspark.pandas as ps

pdf = ps.read_csv("data.csv")
pdf.head()


> Works like pandas, but with a Spark backend.

---

### 8. Scheduling & Cluster Deployment

PySpark can run:

β€’ Locally
β€’ On YARN (Hadoop)
β€’ On Mesos
β€’ On Kubernetes
β€’ On managed platforms such as Databricks, AWS EMR, and Google Cloud Dataproc

Use spark-submit for production scripts:

spark-submit my_script.py
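
Common flags choose the cluster manager and resources; a sketch with illustrative values:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_script.py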


---

### 9. Tuning and Optimization Tips

β€’ Cache reused DataFrames
β€’ Use built-in functions instead of UDFs
β€’ Repartition if data is skewed
β€’ Avoid using collect() on large datasets
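
A short sketch of the last two tips (column names are illustrative): repartition on a skewed key, and pull only a bounded sample to the driver instead of collecting everything.

# Spread a skewed key across more partitions before a heavy aggregation
df = df.repartition(200, "customer_id")     # 200 partitions and customer_id are illustrative

# Avoid collect() on large data: inspect a few rows or a bounded sample instead
df.show(20)
sample_pdf = df.limit(1000).toPandas()      # small, bounded pull to the driver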

---

### 10. Summary of Part 3

β€’ Custom logic with UDFs
β€’ Working with JSON, Parquet, and other formats
β€’ Machine Learning with MLlib (Logistic Regression)
β€’ Model evaluation and saving
β€’ Integration with Pandas
β€’ Deployment and optimization techniques

---

### Exercise

1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as β€œYes/No”
5. Deploy your pipeline using spark-submit

---

#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark

https://t.iss.one/DataScienceM