Python Data Science Jobs & Interviews

#How can I use scikit-learn to build a machine learning pipeline for classification? Provide a Python example, explain the steps involved in preprocessing, model training, and evaluation, and demonstrate how to use cross-validation.

Answer:
Scikit-learn is a powerful Python library for machine learning that provides simple and efficient tools for data mining and data analysis. It supports various algorithms, preprocessing techniques, and evaluation metrics.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Load dataset
data = datasets.load_iris()
X = data.data
y = data.target
feature_names = data.feature_names
target_names = data.target_names

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC(kernel='rbf', random_state=42))
])

# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.show()

# Cross-validation
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__gamma': ['scale', 'auto', 0.1, 1]
}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

# Final model with best parameters
best_model = grid_search.best_estimator_
final_predictions = best_model.predict(X_test)
final_accuracy = accuracy_score(y_test, final_predictions)
print(f"Final Accuracy with tuned model: {final_accuracy:.2f}")

Explanation:
- Pipeline: Combines preprocessing (StandardScaler) and model (SVC) into one unit for clean workflow and avoiding data leakage.
- StandardScaler: Normalizes features to have zero mean and unit variance.
- SVC: Support Vector Classifier for classification; RBF kernel handles non-linear data.
- Cross-validation: Evaluates model performance on multiple folds to reduce overfitting.
- GridSearchCV: Automates hyperparameter tuning by testing combinations of parameters.

Key Features of scikit-learn:
- Consistent API across models and utilities.
- Built-in support for preprocessing, feature selection, model evaluation, and ensemble methods.
- Extensive documentation and community support.

Use Case: Ideal for beginners and professionals alike to quickly prototype, evaluate, and optimize machine learning models.

#MachineLearning #ScikitLearn #Python #DataScience #MLPipeline #Classification #CrossValidation #HyperparameterTuning #SVM #GridSearchCV #DataPreprocessing

By: @DataScienceQ 🚀

168 viewsedited 10:45

About

Blog

Apps

Platform