Topic: Python PySpark Data Sheet – Part 1 of 3: Introduction, Setup, and Core Concepts
---
### 1. What is PySpark?
PySpark is the Python API for Apache Spark, a powerful distributed computing engine for big data processing.
PySpark allows you to leverage the full power of Apache Spark using Python, making it easier to:
• Handle massive datasets
• Perform distributed computing
• Run parallel data transformations
---
### 2. PySpark Ecosystem Components
• Spark SQL – Structured data queries with DataFrame and SQL APIs
• Spark Core – Fundamental engine for task scheduling and memory management
• Spark Streaming – Real-time data processing
• MLlib – Machine learning at scale
• GraphX – Graph computation
---
### 3. Why PySpark over Pandas?
| Feature | Pandas | PySpark |
| -------------- | --------------------- | ----------------------- |
| Scale | Single machine | Distributed (Cluster) |
| Speed | Slower for large data | Optimized execution |
| Language | Python | Python on JVM via Py4J |
| Learning Curve | Easier | Medium (Big Data focus) |
---
### 4. PySpark Setup on a Local Machine
#### Install PySpark via pip:
pip install pyspark
#### Start PySpark Shell:
pyspark
#### Sample Code to Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
---
### 5. RDD vs DataFrame
| Feature | RDD | DataFrame |
| ------------ | ----------------------- | ------------------------------ |
| Type | Low-level API (objects) | High-level API (structured) |
| Optimization | Manual | Catalyst Optimizer (automatic) |
| Usage | Complex transformations | SQL-like operations |
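To make the difference concrete, here is a minimal sketch (it assumes the spark session from section 4; the sample data is made up):
rdd = spark.sparkContext.parallelize([("Alice", 25), ("Bob", 30)])
rdd_doubled = rdd.map(lambda row: (row[0], row[1] * 2))  # low-level: you spell out each step
df = rdd_doubled.toDF(["Name", "Age"])                   # the same data as a DataFrame
df.filter(df["Age"] > 50).show()                         # high-level: Catalyst optimizes this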
---
### 6. Creating DataFrames
#### From Python List:
data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
#### From CSV File:
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
---
### 7. Inspecting DataFrames
df.printSchema() # Schema info
df.columns # List column names
df.describe().show() # Summary stats
df.head(5) # First 5 rows
---
### 8. Basic Transformations
df.select("Name").show()
df.filter(df["Age"] > 25).show()
df.withColumn("AgePlus10", df["Age"] + 10).show()
df.drop("Age").show()
---
### 9. Working with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT * FROM people WHERE Age > 25").show()
---
### 10. Writing Data
df.write.csv("output.csv", header=True)
df.write.parquet("output_parquet/")
---
### 11. Summary of Concepts Covered
• Spark architecture & PySpark setup
• Core components of PySpark
• Differences between RDD and DataFrames
• How to create, inspect, and manipulate DataFrames
• SQL support in Spark
• Reading/writing to/from storage
---
### Exercise
1. Load a sample CSV file and display the schema
2. Add a new column with a calculated value
3. Filter the rows based on a condition
4. Save the result as a new CSV or Parquet file
---
#Python #PySpark #BigData #ApacheSpark #DataEngineering #ETL
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 1 of 6: Introduction and Basic Plotting
---
### 1. What is Matplotlib?
• Matplotlib is the most widely used Python library for data visualization.
• It provides an object-oriented API for embedding plots into applications and supports a wide variety of graphs: line charts, bar charts, scatter plots, histograms, etc.
---
### 2. Installing and Importing Matplotlib
Install Matplotlib if you haven't:
pip install matplotlib
Import the main module and pyplot interface:
import matplotlib.pyplot as plt
import numpy as np
---
### 3. Plotting a Basic Line Chart
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
plt.plot(x, y)
plt.title("Simple Line Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.show()
---
### 4. Customizing Line Style, Color, and Markers
plt.plot(x, y, color='green', linestyle='--', marker='o', label='Data')
plt.title("Styled Line Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.legend()
plt.show()
---
### 5. Adding Multiple Lines to a Plot
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.plot(x, y1, label="sin(x)", color='blue')
plt.plot(x, y2, label="cos(x)", color='red')
plt.title("Multiple Line Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.legend()
plt.grid(True)
plt.show()
---
### 6. Scatter Plot
Used to show relationships between two variables.
x = np.random.rand(100)
y = np.random.rand(100)
plt.scatter(x, y, color='purple', alpha=0.6)
plt.title("Scatter Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.show()
---
### 7. Bar Chart
categories = ['A', 'B', 'C', 'D']
values = [4, 7, 2, 5]
plt.bar(categories, values, color='skyblue')
plt.title("Bar Chart Example")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()
---
### 8. Histogram
data = np.random.randn(1000)
plt.hist(data, bins=30, color='orange', edgecolor='black')
plt.title("Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()
---
### 9. Saving the Plot to a File
plt.plot([1, 2, 3], [4, 5, 6])
plt.savefig("plot.png")
---
### 10. Summary
• matplotlib.pyplot is the key module for creating all kinds of plots.
• You can customize styles, add labels, titles, and legends.
• Understanding basic plots is the foundation for creating advanced visualizations.
---
### Exercise
• Plot y = x^2 and y = x^3 on the same figure.
• Create a scatter plot of 100 random points.
• Create and save a histogram from a normal distribution sample of 500 points.
---
#Python #Matplotlib #DataVisualization #Plots #Charts
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 2 of 6: Subplots, Figures, and Layout Management
---
### 1. Introduction to Figures and Axes
• In Matplotlib, a Figure is the entire image or window on which everything is drawn.
• An Axes is a part of the figure where data is plotted — it contains titles, labels, ticks, lines, etc.
Basic hierarchy:
* Figure ➝ contains one or more Axes
* Axes ➝ the area where the data is actually plotted
* Axis ➝ x-axis and y-axis inside an Axes
import matplotlib.pyplot as plt
import numpy as np
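To make the hierarchy concrete, a minimal example with one Figure containing one Axes:
fig = plt.figure()             # Figure: the whole canvas
ax = fig.add_subplot(1, 1, 1)  # Axes: one plotting area inside it
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_xlabel("x")             # each Axes owns its own x- and y-Axis
plt.show()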
---
### 2. Creating Multiple Subplots using `plt.subplot()`
x = np.linspace(0, 2*np.pi, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.subplot(2, 1, 1)
plt.plot(x, y1, label="sin(x)")
plt.title("First Subplot")
plt.subplot(2, 1, 2)
plt.plot(x, y2, label="cos(x)", color='green')
plt.title("Second Subplot")
plt.tight_layout()
plt.show()
Explanation:
* subplot(2, 1, 1) means 2 rows, 1 column, and selects the first plot.
* tight_layout() prevents overlap between plots.
---
### 3. Creating Subplots with `plt.subplots()` (Recommended)
fig, axs = plt.subplots(2, 2, figsize=(8, 6))
x = np.linspace(0, 10, 100)
axs[0, 0].plot(x, np.sin(x))
axs[0, 0].set_title("sin(x)")
axs[0, 1].plot(x, np.cos(x))
axs[0, 1].set_title("cos(x)")
axs[1, 0].plot(x, np.tan(x))
axs[1, 0].set_title("tan(x)")
axs[1, 0].set_ylim(-10, 10)
axs[1, 1].plot(x, np.exp(-x))
axs[1, 1].set_title("exp(-x)")
plt.tight_layout()
plt.show()
---
### 4. Sharing Axes Between Subplots
fig, axs = plt.subplots(1, 2, sharey=True)
x = np.linspace(0, 10, 100)
axs[0].plot(x, np.sin(x))
axs[0].set_title("sin(x)")
axs[1].plot(x, np.cos(x), color='red')
axs[1].set_title("cos(x)")
plt.show()
---
### 5. Adjusting Spacing with `subplots_adjust()`
fig, axs = plt.subplots(2, 2)
fig.subplots_adjust(hspace=0.4, wspace=0.3)
---
### 6. Nested Plots Using `inset_axes`
You can add a small plot inside another:
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
y = np.sin(x)
ax.plot(x, y)
ax.set_title("Main Plot")
inset_ax = inset_axes(ax, width="30%", height="30%", loc=1)
inset_ax.plot(x, np.cos(x), color='orange')
inset_ax.set_title("Inset", fontsize=8)
plt.show()
---
### 7. Advanced Layout: Gridspec
import matplotlib.gridspec as gridspec
fig = plt.figure(figsize=(8, 6))
gs = gridspec.GridSpec(3, 3)
ax1 = fig.add_subplot(gs[0, :])
ax2 = fig.add_subplot(gs[1, :-1])
ax3 = fig.add_subplot(gs[1:, -1])
ax4 = fig.add_subplot(gs[2, 0])
ax5 = fig.add_subplot(gs[2, 1])
ax1.set_title("Top")
ax2.set_title("Left")
ax3.set_title("Right")
ax4.set_title("Bottom Left")
ax5.set_title("Bottom Center")
plt.tight_layout()
plt.show()
---
### 8. Summary
• Use subplot() for quick layouts and subplots() for flexibility.
• Share axes to align multiple plots.
• Use inset_axes and gridspec for custom and complex layouts.
• Always use tight_layout() or subplots_adjust() to clean up spacing.
---
### Exercise
• Create a 2x2 grid of subplots showing different trigonometric functions.
• Add an inset plot inside a sine wave chart.
• Use Gridspec to create an asymmetric layout with at least 5 different plots.
---
#Python #Matplotlib #Subplots #DataVisualization #Gridspec #LayoutManagement
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 3 of 6: Plot Customization and Styling
---
### 1. Why Customize Plots?
• Customization improves readability and presentation.
• You can control everything from fonts and colors to axis ticks and legend placement.
---
### 2. Customizing Titles, Labels, and Ticks
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title("Sine Wave", fontsize=16, color='navy')
plt.xlabel("Time (s)", fontsize=12)
plt.ylabel("Amplitude", fontsize=12)
plt.xticks(np.arange(0, 11, 1))
plt.yticks(np.linspace(-1, 1, 5))
plt.grid(True)
plt.show()
---
### 3. Changing Line Styles and Markers
plt.plot(x, y, color='red', linestyle='--', linewidth=2, marker='o', markersize=5, label='sin(x)')
plt.title("Styled Sine Curve")
plt.legend()
plt.grid(True)
plt.show()
Common styles:
• Line styles: '-', '--', ':', '-.'
• Markers: 'o', '^', 's', '*', 'D', etc.
• Colors: 'r', 'g', 'b', 'c', 'm', 'y', 'k', etc. (combined shorthand shown below)
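The three can be combined into a single format-string shorthand, e.g. red, dashed, circle markers:
plt.plot(x, y, 'r--o')
plt.show()
---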
### 4. Adding Legends
plt.plot(x, np.sin(x), label="Sine")
plt.plot(x, np.cos(x), label="Cosine")
plt.legend(loc='upper right', fontsize=10)
plt.title("Legend Example")
plt.show()
---
### 5. Using Annotations
Annotations help highlight specific points:
plt.plot(x, y)
plt.annotate('Peak', xy=(np.pi/2, 1), xytext=(2, 1.2),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.title("Annotated Peak")
plt.show()
---
### 6. Customizing Axes Appearance
fig, ax = plt.subplots()
ax.plot(x, y)
# Remove top and right border
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Customize axis colors and widths
ax.spines['left'].set_color('blue')
ax.spines['left'].set_linewidth(2)
plt.title("Customized Axes")
plt.show()
---
### 7. Setting Plot Limits
plt.plot(x, y)
plt.xlim(0, 10)
plt.ylim(-1.5, 1.5)
plt.title("Limit Axes")
plt.show()
---
### 8. Using Style Sheets
Matplotlib has built-in style sheets for quick beautification.
plt.style.use('ggplot')
plt.plot(x, np.sin(x))
plt.title("ggplot Style")
plt.show()
Popular styles:
seaborn, fivethirtyeight, bmh, dark_background, etc.
---
### 9. Creating Grids and Minor Ticks
plt.plot(x, y)
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.minorticks_on()
plt.title("Grid with Minor Ticks")
plt.show()
---
### 10. Summary
• Customize everything: lines, axes, colors, labels, and grid.
• Use legends and annotations for clarity.
• Apply styles and themes for professional looks.
• Small changes improve the quality of your plots significantly.
---
### Exercise
• Plot sin(x) with red dashed lines and circle markers.
• Add a title, custom x/y labels, and set axis ranges manually.
• Apply the 'seaborn-darkgrid' style and highlight the peak with an annotation.
---
#Python #Matplotlib #Customization #DataVisualization #PlotStyling
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 2 of 3: DataFrame Transformations, Joins, and Group Operations
---
### 1. Column Operations
PySpark supports various column-wise operations using expressions.
#### Select Specific Columns:
df.select("Name", "Age").show()
#### Create/Modify Column:
from pyspark.sql.functions import col
df.withColumn("AgePlus5", col("Age") + 5).show()
#### Rename a Column:
df.withColumnRenamed("Age", "UserAge").show()
#### Drop Column:
df.drop("Age").show()
---
### 2. Filtering and Conditional Logic
#### Filter Rows:
df.filter(col("Age") > 25).show()
#### Multiple Conditions:
df.filter((col("Age") > 25) & (col("Name") != "Alice")).show()
#### Using `when` for Conditional Columns:
from pyspark.sql.functions import when
df.withColumn("Category", when(col("Age") < 30, "Young").otherwise("Adult")).show()
---
### 3. Aggregations and Grouping
#### GroupBy + Aggregations:
df.groupBy("Department").count().show()
df.groupBy("Department").agg({"Salary": "avg"}).show()
#### Using Aggregate Functions:
from pyspark.sql.functions import avg, max, min, count
df.groupBy("Department").agg(
avg("Salary").alias("AvgSalary"),
max("Salary").alias("MaxSalary")
).show()
---
### 4. Sorting and Ordering
#### Sort by One or More Columns:
df.orderBy("Age").show()
df.orderBy(col("Salary").desc()).show()
---
### 5. Dropping Duplicates & Handling Missing Data
#### Drop Duplicates:
df.dropDuplicates(["Name", "Age"]).show()
#### Drop Rows with Nulls:
df.dropna().show()
#### Fill Null Values:
df.fillna({"Salary": 0}).show()
---
### 6. Joins in PySpark
PySpark supports various join types like SQL.
#### Types of Joins:
• inner
• left
• right
• outer
• left_semi
• left_anti
#### Example – Inner Join:
df1.join(df2, on="id", how="inner").show()
#### Left Join Example:
df1.join(df2, on="id", how="left").show()
---
### 7. Working with Dates and Timestamps
from pyspark.sql.functions import current_date, current_timestamp
df.withColumn("today", current_date()).show()
df.withColumn("now", current_timestamp()).show()
#### Date Formatting:
from pyspark.sql.functions import date_format
df.withColumn("formatted", date_format(col("Date"), "yyyy-MM-dd")).show()
---
### 8. Window Functions (Advanced Aggregations)
Used for operations like ranking, cumulative sum, and moving average.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window_spec = Window.partitionBy("Department").orderBy("Salary")
df.withColumn("rank", row_number().over(window_spec)).show()
---
### 9. Caching and Persistence
Use caching for performance when reusing data:
df.cache()
df.show()
Or use:
df.persist()
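persist() also accepts an explicit storage level; a short sketch that spills to disk when memory is tight:
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
df.unpersist()  # release the cached data when it is no longer needed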
---
### 10. Summary of Concepts Covered
• Column transformations and renaming
• Filtering and conditional logic
• Grouping, aggregating, and sorting
• Handling nulls and duplicates
• All types of joins
• Working with dates and window functions
• Caching for performance
---
### Exercise
1. Load two CSV datasets and perform different types of joins
2. Add a new column with a custom label based on a condition
3. Aggregate salary data by department and show top-paid employees per department using window functions
4. Practice caching and observe performance
---
#Python #PySpark #DataEngineering #BigData #ETL #ApacheSpark
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 4 of 6: Advanced Charts – Histograms, Pie, Box, Area, and Error Bars
---
### 1. Histogram: Visualizing Data Distribution
Histograms show frequency distribution of numerical data.
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randn(1000)
plt.hist(data, bins=30, color='skyblue', edgecolor='black')
plt.title("Normal Distribution Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
Customizations:
• bins=30 – controls granularity
• density=True – normalizes the histogram
• alpha=0.7 – transparency (combined in the sketch below)
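A short sketch combining these options by overlaying two normalized, semi-transparent histograms:
a = np.random.randn(1000)
b = np.random.randn(1000) + 2
plt.hist(a, bins=30, density=True, alpha=0.7, label='a')
plt.hist(b, bins=30, density=True, alpha=0.7, label='b')
plt.legend()
plt.show()
---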
### 2. Pie Chart: Showing Proportions
labels = ['Python', 'JavaScript', 'C++', 'Java']
sizes = [45, 30, 15, 10]
colors = ['gold', 'lightgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0) # explode the 1st slice
plt.pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%',
        startangle=140, explode=explode, shadow=True)
plt.title("Programming Language Popularity")
plt.axis('equal') # Equal aspect ratio ensures pie is circular
plt.show()
---
### 3. Box Plot: Summarizing Distribution Stats
Box plots show min, Q1, median, Q3, max, and outliers.
data = [np.random.normal(0, std, 100) for std in range(1, 4)]
plt.boxplot(data, patch_artist=True, labels=['std=1', 'std=2', 'std=3'])
plt.title("Box Plot Example")
plt.grid(True)
plt.show()
Tip: Use vert=False to make a horizontal boxplot.
---
### 4. Area Chart: Cumulative Trends
x = np.arange(1, 6)
y1 = np.array([1, 3, 4, 5, 7])
y2 = np.array([1, 2, 4, 6, 8])
plt.fill_between(x, y1, color="skyblue", alpha=0.5, label="Y1")
plt.fill_between(x, y2, color="orange", alpha=0.5, label="Y2")
plt.title("Area Chart")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.legend()
plt.show()
---
### 5. Error Bar Plot: Showing Uncertainty
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)
error = 0.1 + 0.2 * x
plt.errorbar(x, y, yerr=error, fmt='-o', color='teal', ecolor='red', capsize=5)
plt.title("Error Bar Plot")
plt.xlabel("X Axis")
plt.ylabel("Y Axis")
plt.grid(True)
plt.show()
---
### 6. Horizontal Bar Chart
langs = ['Python', 'Java', 'C++', 'JavaScript']
popularity = [50, 40, 30, 45]
plt.barh(langs, popularity, color='plum')
plt.title("Programming Language Popularity")
plt.xlabel("Popularity")
plt.show()
---
### 7. Stacked Bar Chart
labels = ['2019', '2020', '2021']
men = [20, 35, 30]
women = [25, 32, 34]
x = np.arange(len(labels))
width = 0.5
plt.bar(x, men, width, label='Men')
plt.bar(x, women, width, bottom=men, label='Women')
plt.ylabel('Scores')
plt.title('Scores by Year and Gender')
plt.xticks(x, labels)
plt.legend()
plt.show()
---
### 8. Summary
• Histograms show frequency distribution
• Pie charts are good for proportions
• Box plots summarize spread and outliers
• Area charts visualize trends over time
• Error bars indicate uncertainty in measurements
• Stacked and horizontal bars enhance categorical data clarity
---
### Exercise
• Create a pie chart showing budget allocation of 5 departments.
• Plot 3 histograms on the same figure with different distributions.
• Build a stacked bar chart for monthly expenses across 3 categories.
• Add error bars to a decaying function and annotate the max point.
---
#Python #Matplotlib #DataVisualization #AdvancedCharts #Histograms #PieCharts #BoxPlots
https://t.iss.one/DataScienceM
Topic: Python PySpark Data Sheet – Part 3 of 3: Advanced Operations, MLlib, and Deployment
---
### 1. Working with UDFs (User Defined Functions)
UDFs allow custom Python functions to be used in PySpark transformations.
#### Define and Use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def label_age(age):
    return "Senior" if age > 50 else "Adult"
label_udf = udf(label_age, StringType())
df.withColumn("AgeGroup", label_udf(df["Age"])).show()
> ⚠️ Note: UDFs are less optimized than built-in functions. Use built-ins when possible.
---
### 2. Working with JSON and Parquet Files
#### Read JSON File:
df_json = spark.read.json("data.json")
df_json.show()
#### Read & Write Parquet File:
df_parquet = spark.read.parquet("data.parquet")
df_parquet.write.parquet("output_folder/")
---
### 3. Using PySpark MLlib (Machine Learning Library)
MLlib is Spark's scalable ML library with tools for classification, regression, clustering, and more.
---
#### Steps in a Typical ML Pipeline:
• Load and prepare data
• Feature engineering
• Model training
• Evaluation
• Prediction
---
### 4. Example: Logistic Regression in PySpark
#### Step 1: Prepare Data
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
# Sample DataFrame
data = spark.createDataFrame([
    (1.0, 2.0, 3.0, 1.0),
    (2.0, 3.0, 4.0, 0.0),
    (1.5, 2.5, 3.5, 1.0)
], ["f1", "f2", "f3", "label"])
# Combine features into a single vector
vec = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = vec.transform(data)
#### Step 2: Train Model
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = lr.fit(data)
#### Step 3: Make Predictions
predictions = model.transform(data)
predictions.select("features", "label", "prediction").show()
---
### 5. Model Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
print("Accuracy:", evaluator.evaluate(predictions))
---
### 6. Save and Load Models
# Save
model.save("models/logistic_model")
# Load
from pyspark.ml.classification import LogisticRegressionModel
loaded_model = LogisticRegressionModel.load("models/logistic_model")
---
### 7. PySpark with Pandas API on Spark
For small-to-medium, pandas-compatible workloads, use pyspark.pandas:
import pyspark.pandas as ps
pdf = ps.read_csv("data.csv")
pdf.head()
> Works like Pandas, but with Spark backend.
---
### 8. Scheduling & Cluster Deployment
PySpark can run:
• Locally
• On YARN (Hadoop)
• Mesos
• Kubernetes
• In Databricks, AWS EMR, Google Cloud Dataproc
Use spark-submit for production scripts:
spark-submit my_script.py
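Common flags, shown here with placeholder values (the cluster settings are illustrative, not prescriptive):
spark-submit --master local[4] my_script.py
spark-submit --master yarn --deploy-mode cluster --num-executors 4 my_script.py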
---
### 9. Tuning and Optimization Tips
• Cache reused DataFrames
• Use built-in functions instead of UDFs
• Repartition if data is skewed
• Avoid calling collect() on large datasets (see the sketch below)
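A minimal sketch of these tips (the column name and partition count are hypothetical):
df = df.repartition(200, "customer_id")  # spread a skewed key over more partitions
df.cache()                               # reuse the result without recomputation
preview = df.limit(20).collect()         # collect only a small, bounded sample
---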
### 10. Summary of Part 3
• Custom logic with UDFs
• Working with JSON, Parquet, and other formats
• Machine Learning with MLlib (Logistic Regression)
• Model evaluation and saving
• Integration with Pandas
• Deployment and optimization techniques
---
### Exercise
1. Load a dataset and train a logistic regression model
2. Add feature engineering using VectorAssembler
3. Save and reload the model
4. Use UDFs to label predictions as “Yes/No”
5. Deploy your pipeline using spark-submit
---
#Python #PySpark #MLlib #BigData #MachineLearning #ETL #ApacheSpark
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 5 of 6: Images, Heatmaps, and Colorbars
---
### 1. Introduction
Matplotlib can handle images, heatmaps, and color mapping effectively, making it a great tool for visualizing:
• Image data (grayscale or color)
• Matrix-like data with heatmaps
• Any data that needs a gradient of colors
---
### 2. Displaying Images with `imshow()`
import matplotlib.pyplot as plt
import numpy as np
# Create a random grayscale image
img = np.random.rand(10, 10)
plt.imshow(img, cmap='gray')
plt.title("Grayscale Image")
plt.colorbar()
plt.show()
Key parameters:
• cmap – color map (gray, hot, viridis, coolwarm, etc.)
• interpolation – smoothing of pixelation (nearest, bilinear, bicubic)
---
### 3. Displaying Color Images
import matplotlib.image as mpimg
img = mpimg.imread('example.png') # image must be in your directory
plt.imshow(img)
plt.title("Color Image")
plt.axis('off') # Hide axes
plt.show()
Note: Image should be PNG or JPG. For real projects, use PIL or OpenCV for more control.
---
### 4. Creating a Heatmap from a 2D Matrix
matrix = np.random.rand(6, 6)
plt.imshow(matrix, cmap='viridis', interpolation='nearest')
plt.title("Heatmap Example")
plt.colorbar(label="Intensity")
plt.xticks(range(6), ['A', 'B', 'C', 'D', 'E', 'F'])
plt.yticks(range(6), ['P', 'Q', 'R', 'S', 'T', 'U'])
plt.show()
---
### 5. Customizing Color Maps
You can reverse or customize color maps:
plt.imshow(matrix, cmap='coolwarm_r') # Reversed coolwarm
You can also clamp the color range using vmin and vmax:
plt.imshow(matrix, cmap='hot', vmin=0.2, vmax=0.8)
---
### 6. Using `matshow()` for Matrix-Like Data
matshow() is optimized for visualizing 2D arrays:
plt.matshow(matrix)
plt.title("Matrix View with matshow()")
plt.colorbar()
plt.show()
---
### 7. Annotating Heatmaps
fig, ax = plt.subplots()
cax = ax.imshow(matrix, cmap='plasma')
# Add text annotations
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        ax.text(j, i, f'{matrix[i, j]:.2f}', ha='center', va='center', color='white')
plt.title("Annotated Heatmap")
plt.colorbar(cax)
plt.show()
---
### 8. Displaying Multiple Images in Subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
axs[0].imshow(matrix, cmap='Blues')
axs[0].set_title("Blues")
axs[1].imshow(matrix, cmap='Greens')
axs[1].set_title("Greens")
plt.tight_layout()
plt.show()
---
### 9. Saving Heatmaps and Figures
plt.imshow(matrix, cmap='magma')
plt.title("Save This Heatmap")
plt.colorbar()
plt.savefig("heatmap.png", dpi=300)
plt.close()
---
### 10. Summary
• imshow() and matshow() visualize 2D data or images
• Heatmaps are great for matrix or correlation data
• Use colorbars and annotations to add context
• Customize colormaps with cmap, vmin, and vmax
• Save your visualizations easily using savefig()
---
### Exercise
• Load a grayscale image using NumPy and display it.
• Create a 10×10 heatmap with annotations.
• Display 3 subplots of the same matrix using 3 different colormaps.
• Save one of the heatmaps with high resolution.
---
#Python #Matplotlib #Heatmaps #DataVisualization #Images #ColorMapping
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – From Easy to Top: Part 6 of 6: 3D Plotting, Animation, and Interactive Visuals
---
### 1. Introduction
Matplotlib supports advanced visualizations including:
• 3D plots using mpl_toolkits.mplot3d
• Animations with FuncAnimation
• Interactive plots using widgets and event handling
---
### 2. Creating 3D Plots
You need to import the 3D toolkit:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
---
### 3. 3D Line Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
z = np.linspace(0, 15, 100)
x = np.sin(z)
y = np.cos(z)
ax.plot3D(x, y, z, 'purple')
ax.set_title("3D Line Plot")
plt.show()
---
### 4. 3D Surface Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X = np.linspace(-5, 5, 50)
Y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))
surf = ax.plot_surface(X, Y, Z, cmap='viridis')
fig.colorbar(surf)
ax.set_title("3D Surface Plot")
plt.show()
---
### 5. 3D Scatter Plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = np.random.rand(100)
y = np.random.rand(100)
z = np.random.rand(100)
ax.scatter(x, y, z, c=z, cmap='plasma')
ax.set_title("3D Scatter Plot")
plt.show()
---
### 6. Creating Animations
Use FuncAnimation for animated plots.
import matplotlib.animation as animation
fig, ax = plt.subplots()
x = np.linspace(0, 2*np.pi, 128)
line, = ax.plot(x, np.sin(x))
def update(frame):
    line.set_ydata(np.sin(x + frame / 10))
    return line,
ani = animation.FuncAnimation(fig, update, frames=100, interval=50)
plt.title("Sine Wave Animation")
plt.show()
---
### 7. Save Animation as a File
ani.save("sine_wave.gif", writer='pillow')
Make sure to install pillow first:
pip install pillow
---
### 8. Adding Interactivity with Widgets
import matplotlib.widgets as widgets
fig, ax = plt.subplots()
plt.subplots_adjust(left=0.1, bottom=0.25)
x = np.linspace(0, 2*np.pi, 100)
freq = 1
line, = ax.plot(x, np.sin(freq * x))
ax_slider = plt.axes([0.25, 0.1, 0.65, 0.03])
slider = widgets.Slider(ax_slider, 'Frequency', 0.1, 5.0, valinit=freq)
def update(val):
    line.set_ydata(np.sin(slider.val * x))
    fig.canvas.draw_idle()
slider.on_changed(update)
plt.title("Interactive Sine Wave")
plt.show()
---
### 9. Mouse Interaction with Events
def onclick(event):
    print(f'You clicked at x={event.xdata:.2f}, y={event.ydata:.2f}')
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])
fig.canvas.mpl_connect('button_press_event', onclick)
plt.title("Click to Print Coordinates")
plt.show()
---
### 10. Summary
• 3D plots are ideal for visualizing spatial data and surfaces
• Animations help convey dynamic changes in data
• Widgets and events add interactivity for data exploration
• Mastering these tools enables the creation of interactive dashboards and visual storytelling
---
### Exercise
• Plot a 3D surface of z = cos(sqrt(x² + y²)).
• Create a slider to change the frequency of a sine wave in real time.
• Animate a circle that rotates along time.
• Build a 3D scatter plot of 3 correlated variables.
---
#Python #Matplotlib #3DPlots #Animations #InteractiveVisuals #DataVisualization
https://t.iss.one/DataScienceM
Topic: Python Matplotlib – Important 20 Interview Questions with Answers
---
### 1. What is Matplotlib in Python?
Answer:
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It is highly customizable and works well with NumPy and pandas.
---
### 2. What is the difference between `plt.plot()` and `plt.scatter()`?
Answer:
• plt.plot() is used for line plots.
• plt.scatter() is used for scatter (dot) plots.
---
### 3. How do you add a title and axis labels to a plot?
Answer:
plt.title("My Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
---
### 4. How can you create multiple subplots in one figure?
Answer:
Use plt.subplots() to create a grid layout of subplots.
fig, axs = plt.subplots(2, 2)
---
### 5. How do you save a plot to a file?
Answer:
plt.savefig("myplot.png", dpi=300)
---
### 6. What is the role of `plt.show()`?
Answer:
It displays the figure window containing the plot. Required for interactive sessions or scripts.
---
### 7. What is a histogram in Matplotlib?
Answer:
A histogram visualizes the frequency distribution of numeric data using plt.hist().
---
### 8. What does `plt.figure(figsize=(8,6))` do?
Answer:
It creates a new figure with a specified width and height (in inches).
---
### 9. How do you add a legend to your plot?
Answer:
plt.legend()
You must specify label='something' in your plot function.
---
### 10. What are some common `cmap` (color map) options?
Answer:
'viridis', 'plasma', 'hot', 'coolwarm', 'gray', 'jet', etc.
---
### 11. How do you create a bar chart?
Answer:
plt.bar(categories, values)
---
### 12. How can you rotate x-axis tick labels?
Answer:
plt.xticks(rotation=45)
---
### 13. How do you add a grid to the plot?
Answer:
plt.grid(True)
---
### 14. What is the difference between `imshow()` and `matshow()`?
Answer:
• imshow() is general-purpose for image data.
• matshow() is optimized for 2D matrices and auto-configures the axes.
---
### 15. How do you change the style of a plot globally?
Answer:
plt.style.use('ggplot')
---
### 16. How can you add annotations to specific data points?
Answer:
plt.annotate('label', xy=(x, y), xytext=(x+1, y+1), arrowprops=dict(arrowstyle='->'))
---
### 17. How do you create a pie chart in Matplotlib?
Answer:
plt.pie(data, labels=labels, autopct='%1.1f%%')
---
### 18. How do you plot a heatmap in Matplotlib?
Answer:
plt.imshow(matrix, cmap='hot')
plt.colorbar()
---
### 19. Can Matplotlib create 3D plots?
Answer:
Yes. Use:
from mpl_toolkits.mplot3d import Axes3D
Then:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
---
### 20. How do you add error bars to your data?
Answer:
plt.errorbar(x, y, yerr=errors, fmt='o')
---
### Exercise
Choose 5 of the above functions and implement a mini-dashboard with line, bar, and pie plots in one figure layout.
---
#Python #Matplotlib #InterviewQuestions #DataVisualization #TechInterview
https://t.iss.one/DataScienceM
# 📚 PyTorch Tutorial for Beginners - Part 1/6: Fundamentals & Tensors
#PyTorch #DeepLearning #MachineLearning #NeuralNetworks #Tensors
Welcome to Part 1 of our comprehensive PyTorch series! This beginner-friendly lesson covers core concepts, tensor operations, and your first neural network.
---
## 🔹 What is PyTorch?
PyTorch is an open-source deep learning framework developed by Facebook's AI Research Lab (FAIR). Key features:
✔️ Dynamic computation graphs (define-by-run)
✔️ GPU acceleration with CUDA
✔️ Pythonic syntax for intuitive coding
✔️ Automatic differentiation (autograd)
✔️ Rich ecosystem (TorchVision, TorchText, etc.)
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
---
## 🔹 Tensors: The Building Blocks
Tensors are PyTorch's multi-dimensional arrays (like NumPy but with GPU support).
### 1. Creating Tensors
# From Python list
a = torch.tensor([1, 2, 3]) # 1D tensor (vector)
# 2D tensor (matrix)
b = torch.tensor([[1., 2.], [3., 4.]])
# Special tensors
zeros = torch.zeros(2, 3) # 2x3 matrix of zeros
ones = torch.ones_like(zeros) # Same shape as zeros, filled with 1s
rand = torch.rand(3, 3) # 3x3 matrix with uniform random values (0-1)
### 2. Tensor Attributes
x = torch.rand(2, 3)
print(f"Shape: {x.shape}") # torch.Size([2, 3])
print(f"Data type: {x.dtype}") # torch.float32
print(f"Device: {x.device}") # cpu/cuda:0
### 3. Moving Tensors to GPU
if torch.cuda.is_available():
    x = x.to('cuda')  # Move to GPU
    print(f"Now on: {x.device}")  # cuda:0
---
## 🔹 Tensor Operations
### 1. Basic Math
x = torch.tensor([1., 2., 3.])
y = torch.tensor([4., 5., 6.])
# Element-wise operations
add = x + y # or torch.add(x, y)
sub = x - y
mul = x * y
div = x / y
# Matrix multiplication
mat1 = torch.rand(2, 3)
mat2 = torch.rand(3, 2)
matmul = torch.mm(mat1, mat2) # or mat1 @ mat2
### 2. Reshaping Tensors
x = torch.arange(6) # [0, 1, 2, 3, 4, 5]
x_reshaped = x.view(2, 3) # [[0, 1, 2], [3, 4, 5]]
x_flattened = x.flatten() # Back to 1D
### 3. Indexing & Slicing
x = torch.tensor([[1, 2], [3, 4], [5, 6]])
print(x[0, 1]) # 2 (first row, second column)
print(x[:, 0]) # [1, 3, 5] (all rows, first column)
---
## 🔹 Autograd: Automatic Differentiation
PyTorch automatically computes gradients for tensors with requires_grad=True.
### 1. Basic Example
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1
y.backward() # Compute gradients
print(x.grad) # dy/dx = 2x + 3 → 7.0
### 2. Neural Network Context
# Simple linear regression
w = torch.randn(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# Forward pass
inputs = torch.tensor([[1.0], [2.0], [3.0]])
targets = torch.tensor([[2.0], [4.0], [6.0]])
predictions = inputs * w + b
# Loss and backward pass
loss = torch.mean((predictions - targets)**2)
loss.backward() # Computes dloss/dw, dloss/db
print(f"Gradient of w: {w.grad}")
print(f"Gradient of b: {b.grad}")
---
## 🔹 Your First Neural Network
Let's build a single-layer perceptron for binary classification.
### 1. Define the Model
import torch.nn as nn
class Perceptron(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.linear = nn.Linear(input_dim, 1)  # 1 output neuron

    def forward(self, x):
        return torch.sigmoid(self.linear(x))  # Sigmoid for probability
model = Perceptron(input_dim=2)
print(model)
### 2. Synthetic Dataset
# XOR-like dataset (note: a single linear layer cannot separate XOR; this demo illustrates the training mechanics)
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)
### 3. Training Loop
criterion = nn.BCELoss() # Binary Cross Entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for epoch in range(1000):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    # Backward pass
    optimizer.zero_grad()  # Clear old gradients
    loss.backward()        # Compute gradients
    optimizer.step()       # Update weights
    if (epoch+1) % 100 == 0:
        print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')
# Test
with torch.no_grad():
    predictions = model(X).round()
    print(f"Final predictions: {predictions.squeeze()}")
---
## 🔹 Best Practices for Beginners
1. Always clear gradients with `optimizer.zero_grad()` before `backward()`
2. Use `with torch.no_grad():` for inference (disables gradient tracking)
3. Normalize input data (e.g., scale to [0, 1] or standardize; see the sketch below)
4. Start simple before using complex architectures
5. Leverage GPU for larger models/datasets
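A short sketch of practices 2-3, reusing the Perceptron `model` trained above (the input data is made up):
X_new = torch.rand(8, 2) * 10
X_norm = (X_new - X_new.mean(dim=0)) / X_new.std(dim=0)  # standardize each feature
with torch.no_grad():                                    # inference without gradient tracking
    preds = model(X_norm).round()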
---
### 📌 What's Next?
In Part 2, we'll cover:
➡️ Deep Neural Networks (DNNs)
➡️ Activation Functions
➡️ Batch Normalization
➡️ Handling Real Datasets
#PyTorch #DeepLearning #MachineLearning 🚀
Practice Exercise:
1. Create a tensor of shape (3, 4) with random values (0-1)
2. Compute the mean of each column
3. Build a perceptron for OR gate (modify the XOR example)
4. Plot the loss curve during training
# Solution for exercise 1-2
x = torch.rand(3, 4)
col_means = x.mean(dim=0)  # dim=0 collapses the rows → one mean per column
# 📚 PyTorch Tutorial for Beginners - Part 2/6: Deep Neural Networks & Training Techniques
#PyTorch #DeepLearning #MachineLearning #NeuralNetworks #Training
Welcome to Part 2 of our comprehensive PyTorch series! This lesson dives deep into building and training neural networks, covering architectures, activation functions, optimization, and more.
---
## 🔹 Recap & Setup
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset
# Check GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
---
## 🔹 Deep Neural Network (DNN) Architecture
### 1. Key Components
| Component | Purpose | PyTorch Implementation |
|--------------------|-------------------------------------------------------------------------|------------------------------|
| Input Layer | Receives raw features | nn.Linear(input_dim, hidden_dim) |
| Hidden Layers | Learn hierarchical representations | Multiple nn.Linear + Activation |
| Output Layer | Produces final predictions | nn.Linear(hidden_dim, output_dim) |
| Activation | Introduces non-linearity | nn.ReLU(), nn.Sigmoid(), etc. |
| Loss Function | Measures prediction error | nn.MSELoss(), nn.CrossEntropyLoss() |
| Optimizer | Updates weights to minimize loss | optim.SGD(), optim.Adam() |
### 2. Building a DNN
class DNN(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        layers = []
        # Hidden layers
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size
        # Output layer (no activation for regression)
        layers.append(nn.Linear(prev_size, output_size))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
# Example: 3-layer network (input=10, hidden=[64,32], output=1)
model = DNN(10, [64, 32], 1).to(device)
print(model)
---
## 🔹 Activation Functions
### 1. Common Choices
| Activation | Formula | Range | Use Case | PyTorch |
|-----------------|----------------------|------------|------------------------------|------------------|
| ReLU | max(0, x) | [0, ∞) | Hidden layers | nn.ReLU() |
| Leaky ReLU | max(0.01x, x) | (-∞, ∞) | Avoid dead neurons | nn.LeakyReLU() |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | Binary classification | nn.Sigmoid() |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | RNNs, some hidden layers | nn.Tanh() |
| Softmax | e^x / sum(e^x) | (0, 1) | Multi-class classification | nn.Softmax() |
### 2. Visual Comparison
x = torch.linspace(-5, 5, 100)
activations = {
    "ReLU": nn.ReLU()(x),
    "LeakyReLU": nn.LeakyReLU(0.1)(x),
    "Sigmoid": nn.Sigmoid()(x),
    "Tanh": nn.Tanh()(x)
}
plt.figure(figsize=(12, 4))
for i, (name, y) in enumerate(activations.items()):
    plt.subplot(1, 4, i + 1)
    plt.plot(x.numpy(), y.numpy())
    plt.title(name)
plt.title(name)
plt.tight_layout()
plt.show()
---
## 🔹 Loss Functions
### 1. Common Loss Functions
| Task | Loss Function | PyTorch Implementation |
|------------------------|----------------------------|------------------------------------------------------|
| Regression | Mean Squared Error (MSE) | nn.MSELoss() |
| Binary Classification | Binary Cross Entropy (BCE) | nn.BCELoss() (nn.BCEWithLogitsLoss() for raw logits) |
| Multi-class | Cross Entropy | nn.CrossEntropyLoss() |
| Imbalanced Data | Focal Loss | Custom implementation |
### 2. Custom Loss Example
class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha  # weight for the positive class
        self.gamma = gamma  # focusing parameter: down-weights easy examples

    def forward(self, inputs, targets):
        # Per-element BCE on raw logits (numerically stable)
        BCE_loss = nn.BCEWithLogitsLoss(reduction='none')(inputs, targets)
        pt = torch.exp(-BCE_loss)  # model's probability for the true class
        loss = self.alpha * (1 - pt) ** self.gamma * BCE_loss
        return loss.mean()
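A hedged usage sketch (the logits and binary targets below are made up for illustration):
criterion = FocalLoss(alpha=0.25, gamma=2)
logits = torch.randn(8, 1)                     # raw model outputs (no sigmoid applied)
targets = torch.randint(0, 2, (8, 1)).float()  # hypothetical binary labels
print(criterion(logits, targets))              # scalar loss tensor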
---
## **🔹 Optimization Techniques**
### 1. Optimizers Comparison
| Optimizer | Key Features | Use Case |
|-----------------|-------------------------------------------|------------------------------|
| SGD | Simple, can get stuck in local minima | Basic models |
| SGD+Momentum| Accumulates velocity for smoother updates | Most scenarios |
| Adam | Adaptive learning rates | Default for many problems |
| RMSprop | Adapts learning rates per parameter | RNNs, some CNN architectures |
# Example optimizers
optimizer_SGD = optim.SGD(model.parameters(), lr=0.01)
optimizer_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer_Adam = optim.Adam(model.parameters(), lr=0.001)
### 2. Learning Rate Scheduling
# Step LR scheduler (reusing one of the optimizers defined above)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# Cosine annealing
scheduler_cosine = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# Usage in training loop
for epoch in range(100):
    # Training steps...
    scheduler.step()  # update the learning rate once per epoch
---
## 🔹 Batch Normalization & Dropout
### 1. Batch Normalization
Normalizes layer inputs to reduce internal covariate shift.
class DNNWithBN(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size):
        super().__init__()
        layers = []
        prev_size = input_size
        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.BatchNorm1d(hidden_size),  # normalize pre-activations
                nn.ReLU()
            ])
            prev_size = hidden_size
        layers.append(nn.Linear(prev_size, output_size))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
### 2. Dropout
Randomly deactivates neurons to prevent overfitting.
self.net = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(0.5),  # 50% dropout
    nn.Linear(256, 10)
)
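Both layers behave differently at inference time, so switching modes matters:
model.train()  # BatchNorm uses batch statistics, Dropout is active
model.eval()   # BatchNorm uses running statistics, Dropout is disabled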
---
## 🔹 Data Loading & Preprocessing
### 1. Using DataLoader
from torchvision import datasets, transforms
# Transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
# Load MNIST
train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_data = datasets.MNIST('./data', train=False, transform=transform)
# Create DataLoaders
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)
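A quick look at one batch confirms the shapes the model will receive:
images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([64, 1, 28, 28]) torch.Size([64])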
### 2. Custom Dataset
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, transform=None):
        self.X = X
        self.y = y
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        sample = self.X[idx], self.y[idx]
        if self.transform:
            sample = self.transform(sample)
        return sample
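A minimal usage sketch with random tensors (the data here is hypothetical):
X = torch.randn(100, 10)
y = torch.randint(0, 2, (100,))
loader = DataLoader(CustomDataset(X, y), batch_size=16, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([16, 10]) torch.Size([16])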
---
## 🔹 Complete Training Pipeline
### 1. Training Loop
def train(model, train_loader, criterion, optimizer, epochs=10):
    model.train()
    losses = []
    for epoch in range(epochs):
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward pass
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        epoch_loss = running_loss / len(train_loader)
        losses.append(epoch_loss)
        print(f'Epoch {epoch+1}/{epochs}, Loss: {epoch_loss:.4f}')
    return losses
### 2. Evaluation Function
def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    print(f'Test Accuracy: {accuracy:.2f}%')
    return accuracy
### 3. Full Execution
# Hyperparameters
input_size = 784 # MNIST images (28x28)
hidden_sizes = [128, 64]
output_size = 10 # Digits 0-9
lr = 0.001
epochs = 10
# Initialize
model = DNN(input_size, hidden_sizes, output_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
# Flatten MNIST images (applied to both train and test sets,
# since the DNN expects 784-dim vectors)
flatten_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten 28x28 -> 784
])
train_loader.dataset.transform = flatten_transform
test_loader.dataset.transform = flatten_transform
# Train and evaluate
losses = train(model, train_loader, criterion, optimizer, epochs)
evaluate(model, test_loader)
# Plot training curve
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss Curve')
plt.show()
---
## 🔹 Debugging & Visualization
### 1. Gradient Checking
# After loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name} gradient mean: {param.grad.mean().item():.6f}")
### 2. Weight Histograms
def plot_weights(model):
    for name, param in model.named_parameters():
        if 'weight' in name:
            plt.figure()
            plt.hist(param.detach().cpu().numpy().flatten(), bins=50)
            plt.title(name)
            plt.show()
---
## 🔹 Advanced Techniques
### 1. Weight Initialization
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model.apply(init_weights)
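For ReLU networks, Kaiming (He) initialization is the usual alternative; a minimal variant of the function above:
def init_weights_kaiming(m):
    if isinstance(m, nn.Linear):
        nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)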
### 2. Early Stopping
best_loss = float('inf')
patience = 3
trigger_times = 0

for epoch in range(100):
    # Training...
    val_loss = validate(model, val_loader, criterion)  # see the sketch below
    if val_loss < best_loss:
        best_loss = val_loss
        trigger_times = 0
        torch.save(model.state_dict(), 'best_model.pth')
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print("Early stopping!")
            break
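The validate helper is not defined in this lesson; a minimal sketch consistent with the train/evaluate functions above (val_loader is assumed to exist):
def validate(model, val_loader, criterion):
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            total_loss += criterion(model(inputs), labels).item()
    return total_loss / len(val_loader)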
---
## 🔹 Best Practices
1. Always normalize input data (e.g., scale to [0,1] or standardize)
2. Use batch normalization for deeper networks
3. Start with Adam optimizer (lr=0.001) as default
4. Monitor training with validation set to detect overfitting
5. Visualize weight distributions periodically
6. Use GPU for training (model.to(device)) and move each input batch to the same device
---
### 📌 What's Next?
In Part 3, we'll cover:
➡️ Convolutional Neural Networks (CNNs)
➡️ Transfer Learning
➡️ Image Augmentation Techniques
➡️ Visualizing CNNs
#PyTorch #DeepLearning #MachineLearning 🚀
---
### Practice Exercise
1. Modify the DNN to have 4 hidden layers [256, 128, 64, 32]
2. Try different activation functions (LeakyReLU, Tanh)
3. Implement learning rate scheduling
4. Add dropout and compare results
5. Plot accuracy vs. epoch during training
# Sample solution for exercise 1
model = DNN(784, [256, 128, 64, 32], 10).to(device)
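A hedged sketch for exercise 5, reusing the train and evaluate functions defined earlier to record test accuracy after each epoch:
# Sample sketch for exercise 5
accuracies = []
for epoch in range(epochs):
    train(model, train_loader, criterion, optimizer, epochs=1)
    accuracies.append(evaluate(model, test_loader))
plt.plot(accuracies)
plt.xlabel('Epoch')
plt.ylabel('Test Accuracy (%)')
plt.show()
---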
# 📚 PyTorch Tutorial for Beginners - Part 3/6: Convolutional Neural Networks (CNNs) & Computer Vision
#PyTorch #DeepLearning #ComputerVision #CNNs #TransferLearning
Welcome to Part 3 of our PyTorch series! This comprehensive lesson dives deep into Convolutional Neural Networks (CNNs), the powerhouse behind modern computer vision applications. We'll cover architecture design, implementation tricks, transfer learning, and visualization techniques.
---
## 🔹 Introduction to CNNs
### Why CNNs for Images?
Traditional fully-connected networks (DNNs) fail for images because:
- Parameter explosion: a 256x256 RGB image already means 196,608 input features (see the quick count below)
- No spatial awareness: DNNs treat pixels as independent features
- Translation variance: objects in different positions must be re-learned
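To make the parameter-explosion point concrete, here is a rough count (the layer widths are illustrative):
import torch.nn as nn
fc = nn.Linear(256 * 256 * 3, 1000)      # dense layer on a flattened 256x256 RGB image
conv = nn.Conv2d(3, 64, kernel_size=3)   # conv layer with 64 filters
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 196,609,000 parameters
print(count(conv))  # 1,792 parameters (3*3*3*64 weights + 64 biases)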
### CNN Key Innovations
| Concept | Purpose |
|------------------------|----------------------------------------------------------------------|
| Local Receptive Fields | Processes small regions at a time (e.g., 3x3 windows) |
| Weight Sharing | Same filters applied across entire image (reduces parameters) |
| Hierarchical Features | Early layers detect edges → textures → object parts → whole objects |
---
## 🔹 Core CNN Components
### 1. Convolutional Layers
import torch.nn as nn
# 2D convolution (for images)
conv = nn.Conv2d(
    in_channels=3,    # Input channels (RGB=3, grayscale=1)
    out_channels=16,  # Number of filters
    kernel_size=3,    # 3x3 filter
    stride=1,         # Filter movement step
    padding=1         # Preserves spatial dimensions (with stride=1)
)
# Shape transformation: (batch, channels, height, width)
x = torch.randn(32, 3, 64, 64) # 32 RGB images of 64x64
print(conv(x).shape) # → torch.Size([32, 16, 64, 64])
### 2. Pooling Layers
# Max pooling (common for downsampling)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(conv(x)).shape) # → torch.Size([32, 16, 32, 32])
# Adaptive pooling (useful for varying input sizes)
adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))
print(adaptive_pool(x).shape) # → torch.Size([32, 3, 7, 7])
### 3. Normalization Layers
# Batch Normalization
bn = nn.BatchNorm2d(16) # num_features = out_channels
x = conv(x)
x = bn(x)
# Layer Normalization (more common in NLP/sequence models; shown here on a feature-map shape)
ln = nn.LayerNorm([16, 64, 64])
### 4. Dropout
# Spatial dropout (drops entire channels)
dropout = nn.Dropout2d(p=0.25)
---
## 🔹 Building a CNN from Scratch
### Complete Architecture
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 4 * 4, 512),  # assumes 32x32 inputs: 32 -> 16 -> 8 -> 4 after three poolings
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # Flatten all dimensions except batch
        x = self.classifier(x)
        return x
# Usage
model = CNN().to(device)
print(model)
### Shape Calculation Formula
For a layer with:
- Input size: (Hᵢₙ, Wᵢₙ)
- Kernel: K
- Padding: P
- Stride: S
Output dimensions:
Hₒᵤₜ = ⌊(Hᵢₙ + 2P - K)/S⌋ + 1
Wₒᵤₜ = ⌊(Wᵢₙ + 2P - K)/S⌋ + 1
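Plugging the first block of the CNN above into these formulas (assuming 32x32 inputs): the conv layer (K=3, P=1, S=1) gives ⌊(32 + 2 - 3)/1⌋ + 1 = 32, and the 2x2 max pool (K=2, P=0, S=2) gives ⌊(32 - 2)/2⌋ + 1 = 16. A quick check in code:
x = torch.randn(1, 3, 32, 32)
block = nn.Sequential(nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.MaxPool2d(2))
print(block(x).shape)  # torch.Size([1, 32, 16, 16])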
---