Python Data Science Jobs & Interviews
Your go-to hub for Python and Data Science—featuring questions, answers, quizzes, and interview tips to sharpen your skills and boost your career in the data-driven world.

Admin: @Hussein_Sheikho
Question 2 (Intermediate):
What is a common use case for the PCA (Principal Component Analysis) algorithm in machine learning?

A) Hyperparameter tuning
B) Data visualization and dimensionality reduction
C) Gradient descent optimization
D) Model ensembling

#MachineLearning #PCA #DimensionalityReduction #MLQuiz #DataScience
How can I implement Principal Component Analysis (PCA) for dimensionality reduction using scikit-learn? Provide a Python example, explain the concept of variance maximization, and discuss how to choose the number of principal components.

Answer:
PCA reduces the dimensionality of data while preserving as much variance as possible. It transforms features into new uncorrelated variables (principal components) ordered by explained variance.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Print explained variance ratio
print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Total Explained Variance:", sum(pca.explained_variance_ratio_))

# Plot results
plt.figure(figsize=(8, 6))
colors = ['red', 'green', 'blue']
for i in range(3):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1],
                c=colors[i], label=data.target_names[i])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.legend()
plt.grid(True)
plt.show()

# Determine optimal number of components
pca_full = PCA()
pca_full.fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

plt.figure(figsize=(8, 6))
plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% Variance Threshold')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Choosing Number of Components')
plt.legend()
plt.grid(True)
plt.show()


Explanation:
- Standardization: Essential because PCA is sensitive to scale.
- PCA transformation: Finds the directions (components) that maximize variance in the data (see the sketch below).
- Components: The first component captures the most variance, the second the next highest, etc.
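
As a minimal sketch of the variance-maximization idea (reusing X_scaled from the example above), the principal components are the eigenvectors of the feature covariance matrix, and the eigenvalues give the variance each component captures:

# Sketch: PCA by eigendecomposition of the covariance matrix
cov_matrix = np.cov(X_scaled, rowvar=False)            # feature-by-feature covariance
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort eigenpairs by decreasing eigenvalue (variance captured)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Matches pca.explained_variance_ratio_ from scikit-learn
print("Manual explained variance ratio:", eigenvalues / eigenvalues.sum())

# Projecting onto the top-2 eigenvectors reproduces X_pca (up to sign flips)
X_manual = X_scaled @ eigenvectors[:, :2]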

Choosing Number of Components:
Use the "elbow method" or set a threshold (e.g., 95% total variance). In the example, n_components=2 retains about 96% of the variance, showing an effective reduction from 4D to 2D.
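
Alternatively, scikit-learn can choose the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A short sketch, reusing X_scaled from above:

# Keep enough components to explain at least 95% of the variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X_scaled)
print("Components kept:", pca_95.n_components_)
print("Variance explained:", pca_95.explained_variance_ratio_.sum())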

Time Complexity: O(nm² + m³), where n is the number of samples and m is the number of features.
Use Case: PCA is ideal for visualization, noise reduction, and improving model performance on high-dimensional data.
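
As one illustration of the model-performance use case, PCA is often used as a preprocessing step inside a pipeline before a classifier. A minimal sketch, using the Iris data from above and a logistic regression purely as an illustrative choice of model:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Sketch: scale -> PCA -> classifier, evaluated with 5-fold cross-validation
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("Mean CV accuracy with 2 PCA components:", scores.mean())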

By: @DataScienceQ 🚀