Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts
---
### 1. What is PySpark?
PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.
PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:
• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations
---
### 2. PySpark Ecosystem Components
• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation
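All of these components are reached from the same Python package. A minimal sketch of the entry points (the app name is arbitrary, and the streaming line is only a comment):
# Illustrative entry points for the main components
from pyspark.sql import SparkSession     # Spark SQL / DataFrame API
from pyspark.sql import functions as F   # built-in column functions
from pyspark.ml import Pipeline          # MLlib (DataFrame-based machine learning)
spark = SparkSession.builder.appName("ComponentsDemo").getOrCreate()
# Structured Streaming is reached through the same session, e.g. spark.readStream.format("rate").load()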
---
### 3. Why PySpark over Pandas?
| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Execution      | Pure Python           | Python API on the JVM (via Py4J) |
| Learning Curve | Easier | Medium (Big Data focus) |
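The two libraries also interoperate. A small sketch (assuming pandas is installed and `spark` is an active SparkSession):
import pandas as pd
pandas_df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
spark_df = spark.createDataFrame(pandas_df)   # pandas -> Spark
back_to_pandas = spark_df.toPandas()          # Spark -> pandas (collects data to the driver)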
---
### 4. PySpark Setup in Local Machine
#### Install PySpark via pip:
pip install pyspark
#### Start PySpark Shell:
pyspark
#### Sample Code to Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
---
### 5. RDD vs DataFrame
| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
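A brief sketch contrasting the two APIs on the same data (assuming `spark` is an active SparkSession):
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])   # low-level RDD
doubled = rdd.map(lambda row: (row[0], row[1] * 2)).collect()        # manual transformation
df = rdd.toDF(["Name", "Age"])                                       # same data as a DataFrame
df.selectExpr("Name", "Age * 2 AS DoubledAge").show()                # planned by the Catalyst optimizer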
---
### 6. Creating DataFrames
#### From Python List:
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
#### From CSV File:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
---
### 7. Inspecting DataFrames
df.printSchema() # Schema info
df.columns # List column names
df.describe().show() # Summary stats
df.head(5) # First 5 rows
---
### 8. Basic Transformations
df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()
---
### 9. Working with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()
---
### 10. Writing Data
df.write.csv("output.csv", header=True)   # note: Spark writes a directory of part files, not a single file
df.write.parquet("output_parquet/")
---
### 11. Summary of Concepts Covered
• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage
---
### Exercise
1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file
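A possible solution sketch for the exercise above (the file name people.csv and its columns are assumptions):
from pyspark.sql.functions import col
df = spark.read.csv("people.csv", header=True, inferSchema=True)   # hypothetical input file
df.printSchema()                                                   # 1. display the schema
df = df.withColumn("AgePlus10", col("Age") + 10)                   # 2. calculated column
filtered = df.filter(col("Age") > 25)                              # 3. filter rows
filtered.write.mode("overwrite").parquet("people_filtered/")       # 4. save as Parquet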
---
#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations
---
### 1. Column Operations
PySpark supports various column-wise operations using expressions.
#### Select Specific Columns:
df.select("Name", "Age").show()
#### Create/Modify Column:
from pyspark.sql.functions import col
df.withColumn("AgePlus5", col("Age") + 5).show()
#### Rename a Column:
df.withColumnRenamed("Age", "UserAge").show()
#### Drop Column:
df.drop("Age").show()
---
### 2. Filtering and Conditional Logic
#### Filter Rows:
df.filter(col("Age") > 25).show()
#### Multiple Conditions:
df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()
#### Using `when` for Conditional Columns:
from pyspark.sql.functions import when
df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()
---
### 3. Aggregations and Grouping
#### GroupBy + Aggregations:
df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()
#### Using Aggregate Functions:
from pyspark.sql.functions import avg, max, min, count
df.groupBy("Department").agg(
    avg("Salary").alias("AvgSalary"),
    max("Salary").alias("MaxSalary")
).show()
---
### 4. Sorting and Ordering
#### Sort by One or More Columns:
df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()
---
### 5. Dropping Duplicates & Handling Missing Data
#### Drop Duplicates:
df.dropDuplicates(["Name", "Age"]).show()
#### Drop Rows with Nulls:
df.dropna().show()
#### Fill Null Values:
df.fillna({"Salary": 0}).show()
---
### 6. Joins in PySpark
PySpark supports the standard SQL join types.
#### Types of Joins:
• inner
• left
• right
• outer
• left_semi
• left_anti
#### Example – Inner Join:
df1.join(df2, on="id", how="inner").show()
#### Left Join Example:
df1.join(df2, on="id", how="left").show()
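The less common join types are used the same way; a small sketch with the same df1 and df2:
df1.join(df2, on="id", how="left_semi").show()   # rows of df1 that have a match in df2
df1.join(df2, on="id", how="left_anti").show()   # rows of df1 with no match in df2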
---
### 7. Working with Dates and Timestamps
from pyspark.sql.functions import current_date, current_timestamp
df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()
#### Date Formatting:
from pyspark.sql.functions import date_format
df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()
---
### 8. Window Functions (Advanced Aggregations)
Used for operations like ranking, cumulative sum, and moving average.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()
---
### 9. Caching and Persistence
Use caching for performance when reusing data:
df.cache()
df.show()
Or use:
df.persist()
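persist() also accepts an explicit storage level; a short sketch:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if the data does not fit in memory
df.count()                                 # trigger an action so the data is materialized
df.unpersist()                             # release the cached data when done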
---
### 10. Summary of Concepts Covered
• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance
---
### Exercise
1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance
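A possible sketch for items 1-3 (file names, column names, and the join key are assumptions):
from pyspark.sql.functions import col, when, avg, row_number
from pyspark.sql.window import Window
emp = spark.read.csv("employees.csv", header=True, inferSchema=True)     # hypothetical input files
dept = spark.read.csv("departments.csv", header=True, inferSchema=True)
joined = emp.join(dept, on="dept_id", how="inner")                       # 1. join the datasets
labeled = joined.withColumn("Level", when(col("Salary") > 50000, "Senior").otherwise("Junior"))  # 2. conditional label
w = Window.partitionBy("Department").orderBy(col("Salary").desc())
labeled.withColumn("rank", row_number().over(w)).filter(col("rank") == 1).show()  # 3. top-paid per department
labeled.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()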
---
#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
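When the logic is simple, a vectorized pandas_udf is usually faster than a plain UDF. A sketch (assuming pandas and PyArrow are installed):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
@pandas_udf(StringType())
def label_age_vec(age: pd.Series) -> pd.Series:
    return age.apply(lambda a: "Senior" if a > 50 else "Adult")
df.withColumn("AgeGroup", label_age_vec(df["Age"])).show()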
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
---
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(labelCol="label")
print("Area under ROC:", evaluator.evaluate(predictions))   # default metric is areaUnderROC, not accuracy
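For accuracy specifically, MulticlassClassificationEvaluator can be used instead; a sketch:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
acc_eval = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Accuracy:", acc_eval.evaluate(predictions))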
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
---
### 7. PySpark with Pandas API on Spark
For small-to-medium, pandas-style workloads, use the pyspark.pandas module:
import pyspark.pandas as ps
pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• On Apache Mesos (deprecated since Spark 3.2)
• On Kubernetes
• On managed platforms such as Databricks, AWS EMR, and Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
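Common options can be passed on the command line; an illustrative example (the resource values are placeholders):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_script.py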
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid calling collect() on large datasets
---
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as "Yes"/"No"
5. Deploy your pipeline using spark-submit
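A sketch for item 4, labeling the numeric predictions (building on the predictions DataFrame from Section 4):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
yes_no = udf(lambda p: "Yes" if p == 1.0 else "No", StringType())
predictions.withColumn("PredictedLabel", yes_no(predictions["prediction"])).show()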
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://t.iss.one/DataScienceM
🔥 Trending Repository: data-engineer-handbook
📝 Description: This is a repo with links to everything you'd ever want to learn about data engineering
🔗 Repository URL: https://github.com/DataExpert-io/data-engineer-handbook
📖 Readme: https://github.com/DataExpert-io/data-engineer-handbook#readme
📊 Statistics:
🌟 Stars: 36.3K stars
👀 Watchers: 429
🍴 Forks: 7K forks
💻 Programming Languages: Jupyter Notebook - Python - Makefile - Dockerfile - Shell
🏷️ Related Topics:
#data #awesome #sql #bigdata #dataengineering #apachespark
==================================
🧠 By: https://t.iss.one/DataScienceM