How do you validate your models?
One of the most common approaches is splitting the data into train, validation, and test parts.
Models are trained on the train data, hyperparameters (for example, the early-stopping point) are selected based on the validation data, and the final measurement is done on the test dataset.
Another approach is cross-validation: split the dataset into K folds, and each time train the model on the training folds and measure performance on the held-out validation fold.
You can also combine these approaches: set aside a test/holdout dataset and do cross-validation on the rest of the data. The final quality is then measured on the test dataset.
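Both approaches above can be sketched in plain Python. This is a minimal illustration (the function names and split fractions are my own, not from any particular library): a shuffled three-way split, and a K-fold index generator.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle a dataset and split it into train/validation/test parts."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def kfold_indices(n, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        # Training indices are everything outside the current validation fold
        train_idx = [j for f in folds if f is not folds[i] for j in f]
        yield train_idx, val_idx

data = list(range(100))
train, val, test = train_val_test_split(data)
```

In practice you would typically use `train_test_split` and `KFold` from scikit-learn, which handle stratification and edge cases for you.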
How do you typically validate a machine learning model?
1. Train-test split
2. Cross-validation
3. Holdout validation
4. Bootstrap methods
5. Other (please specify in the comments below!)
Is accuracy always a good metric?
Accuracy is not a good performance metric when the dataset is imbalanced. For example, in binary classification with 95% class A and 5% class B, a constant prediction of class A would have an accuracy of 95%. For an imbalanced dataset, we need to choose precision, recall, or F1-score, depending on the problem we are trying to solve.
What are precision, recall, and F1-score?
Precision and recall are classification evaluation metrics:
P = TP / (TP + FP) and R = TP / (TP + FN).
Here TP is true positives, FP is false positives, and FN is false negatives.
In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives.
F1 is a combination of both precision and recall in one score (harmonic mean):
F1 = 2 * P * R / (P + R).
F1 ranges from 0 to 1, with 1 being the best.
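The three formulas above translate directly into code. Here is a small sketch (the function name and example counts are illustrative, not from the original post):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts,
    guarding against division by zero when a denominator is empty."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 80 true positives, 20 false positives, 40 false negatives
p, r, f1 = precision_recall_f1(80, 20, 40)
# p = 0.8, r = 80/120 ≈ 0.667, f1 ≈ 0.727 (the harmonic mean of p and r)
```

Note how F1, being a harmonic mean, is pulled toward the lower of the two scores: a model cannot score well on F1 by excelling at only one of precision or recall.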
What is your go-to tool or library for data visualization?
1. Matplotlib
2. Seaborn
3. Plotly
4. ggplot (in R)
5. Tableau
If you prefer a different tool, share it in the comments below!
Which of the following is NOT a supervised learning algorithm?
A. Decision Trees
B. K-Means Clustering
C. Support Vector Machines
D. Linear Regression
Comment your answer!
The correct answer is:
B. K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm, whereas Decision Trees, Support Vector Machines, and Linear Regression are all supervised learning algorithms.
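What makes K-Means unsupervised is visible in the code: it is trained on features alone, with no labels anywhere. Here is a minimal 1-D sketch of the algorithm (my own toy implementation, not a production one; scikit-learn's `KMeans` is what you would actually use):

```python
def kmeans_1d(points, k=2, iters=20):
    """Minimal 1-D k-means: clusters points using only the feature values,
    never any labels -- which is what makes it unsupervised."""
    # Initialise centroids spread evenly across the data range
    lo, hi = min(points), max(points)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two obvious groups around 1.0 and 10.0 -- no labels supplied
centroids, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.9, 10.1, 10.0], k=2)
```

A supervised algorithm like a decision tree would instead need a `(features, labels)` pair at training time; here the "labels" (cluster assignments) are an output, not an input.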
How do you typically evaluate the performance of your machine learning models?
1. Accuracy
2. Precision and recall
3. F1-score
4. ROC-AUC curve
5. Mean Squared Error (MSE)
Share your preferred metrics or methods in the comments below!
What is your favorite machine learning algorithm and why?
Share your thoughts below!
Which evaluation metric is most appropriate for imbalanced classification tasks where detecting positive cases is crucial?
A. Accuracy
B. Precision
C. F1-score
D. ROC-AUC score
Choose the correct answer!
The last question was a little tricky!
The correct answer is B. Precision. Congrats to everyone who answered correctly!
In imbalanced classification tasks, where one class (usually the minority class) is significantly less frequent than the other, accuracy can be misleading because it tends to favor the majority class. Precision, on the other hand, measures the proportion of true positive predictions among all positive predictions made by the model. It is particularly important in scenarios where correctly identifying positive cases (such as detecting fraud or diseases) is crucial, and false positives need to be minimized.
It focuses on the accuracy of positive predictions, making it a more suitable metric than accuracy for imbalanced datasets where the positive class is of interest.
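The contrast described above is easy to demonstrate in a few lines. This sketch (illustrative function names, toy data) shows a constant majority-class predictor scoring high accuracy while its precision on the positive class collapses:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are actually positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0

# 95 negatives, 5 positives; a constant "predict negative" model
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(accuracy(y_true, y_pred))   # 0.95 -- looks great
print(precision(y_true, y_pred))  # 0.0  -- it never finds a positive case
```

The 95% accuracy hides the fact that the model detects zero fraud cases, zero diseases, zero of whatever the rare positive class represents.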
What is the most exciting application of artificial intelligence in your opinion?
Share your thoughts below!
Choosing the right chart type can make or break your data story. Today's tip: use bar charts for comparisons, and use line charts for WoW, MoM, and YoY analysis. What's your go-to chart?
9 Distance Metrics used in Data Science & Machine Learning.
In data science, distance measures are crucial for various tasks such as clustering, classification, and regression. Below are nine commonly used distance methods:
1. Euclidean Distance:
This measures the straight-line distance between two points in space, similar to measuring with a ruler.
2. Manhattan Distance (L1 Norm):
This distance is calculated by summing the absolute differences between the coordinates of the points, similar to navigating a grid-like city layout.
3. Minkowski Distance:
A general form of distance measurement that includes both Euclidean and Manhattan distances as special cases, depending on a parameter.
4. Chebyshev Distance:
This measures the maximum absolute difference between coordinates of the points, akin to the greatest difference along any dimension.
5. Cosine Similarity:
This assesses how similar two vectors are based on the angle between them; it measures similarity rather than distance. To get a distance, it is usually converted as cosine distance = 1 - cosine similarity.
6. Hamming Distance:
This counts the number of positions at which corresponding symbols differ, commonly used for comparing strings or binary data.
7. Jaccard Distance:
This measures the dissimilarity between two sets by comparing the size of their intersection relative to their union.
8. Mahalanobis Distance:
This measures the distance between a point and a distribution, accounting for correlations among variables, making it useful for multivariate data.
9. Bray-Curtis Distance:
This measures dissimilarity between two samples based on the differences in counts or proportions, often used in ecological and environmental studies.
These distance measures are essential tools in data science for tasks such as clustering, classification, and pattern recognition.
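Most of the metrics above fit in a line or two of Python. Here is a sketch of several of them (plain-Python versions for clarity; in practice `scipy.spatial.distance` provides optimized implementations of all of these):

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Maximum absolute difference along any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """General L_p distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions at which corresponding symbols differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(set_a, set_b):
    """1 minus (size of intersection / size of union)."""
    return 1 - len(set_a & set_b) / len(set_a | set_b)

a, b = (0, 0), (3, 4)
# euclidean(a, b) -> 5.0, manhattan(a, b) -> 7, chebyshev(a, b) -> 4
```

Mahalanobis and Bray-Curtis are omitted here since they need a covariance matrix and count data respectively, but they follow the same pattern.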
What is your preferred method for handling imbalanced datasets in machine learning?
1. Resampling techniques (oversampling/undersampling)
2. Synthetic data generation (SMOTE, ADASYN)
3. Algorithm-specific techniques (class weights, cost-sensitive learning)
4. Ensemble methods (bagging, boosting)
5. Other (share your approach in the comments below!)
In today's world, it's crucial to focus on leading technologies like full-stack development or AI/ML.
However, many students just copy projects instead of learning. To succeed, it's important to work on real, hands-on projects and truly understand the concepts.
Has anyone gone through an interview for data science-related roles recently? Feel free to share your experience!
Here is a list of a few projects (found on Kaggle). They cover the basics of Python, advanced statistics, supervised learning (regression and classification problems), and data science.
After you have tried them yourself, also check the discussions and notebook submissions for different approaches and solutions.
1. Basic Python and statistics
Pima Indians: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Cardio Goodness Fit: https://www.kaggle.com/saurav9786/cardiogoodfitness
Automobile: https://www.kaggle.com/toramky/automobile-dataset
2. Advanced statistics
Game of Thrones: https://www.kaggle.com/mylesoneill/game-of-thrones
World University Rankings: https://www.kaggle.com/mylesoneill/world-university-rankings
IMDB Movie Dataset: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
3. Supervised learning
a) Regression problems
How Much Did It Rain: https://www.kaggle.com/c/how-much-did-it-rain-ii/overview
Inventory Demand: https://www.kaggle.com/c/grupo-bimbo-inventory-demand
Property Inspection Prediction: https://www.kaggle.com/c/liberty-mutual-group-property-inspection-prediction
Restaurant Revenue Prediction: https://www.kaggle.com/c/restaurant-revenue-prediction/data
TMDB Box Office Prediction: https://www.kaggle.com/c/tmdb-box-office-prediction/overview
b) Classification problems
Employee Access Challenge: https://www.kaggle.com/c/amazon-employee-access-challenge/overview
Titanic: https://www.kaggle.com/c/titanic
San Francisco Crime: https://www.kaggle.com/c/sf-crime
Customer Satisfaction: https://www.kaggle.com/c/santander-customer-satisfaction
Trip Type Classification: https://www.kaggle.com/c/walmart-recruiting-trip-type-classification
Categorize Cuisine: https://www.kaggle.com/c/whats-cooking
4. Some helpful data science projects for beginners
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
https://www.kaggle.com/c/digit-recognizer
https://www.kaggle.com/c/titanic
5. Intermediate-level data science projects
Black Friday Data: https://www.kaggle.com/sdolezel/black-friday
Human Activity Recognition Data: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
Trip History Data: https://www.kaggle.com/pronto/cycle-share-dataset
Million Song Data: https://www.kaggle.com/c/msdchallenge
Census Income Data: https://www.kaggle.com/c/census-income/data
MovieLens Data: https://www.kaggle.com/grouplens/movielens-20m-dataset
Twitter Classification Data: https://www.kaggle.com/c/twitter-sentiment-analysis2
Share with credits: https://t.iss.one/sqlproject
ENJOY LEARNING!
Here are some of the most popular Python project ideas:
Simple Calculator
Text-Based Adventure Game
Number Guessing Game
Password Generator
Dice Rolling Simulator
Mad Libs Generator
Currency Converter
Leap Year Checker
Word Counter
Quiz Program
Email Slicer
Rock-Paper-Scissors Game
Web Scraper (Simple)
Text Analyzer
Interest Calculator
Unit Converter
Simple Drawing Program
File Organizer
BMI Calculator
Tic-Tac-Toe Game
To-Do List Application
Inspirational Quote Generator
Task Automation Script
Simple Weather App
Automate data cleaning and analysis (EDA)
Sales analysis
Sentiment analysis
Price prediction
Customer Segmentation
Time series forecasting
Image classification
Spam email detection
Credit card fraud detection
Market basket analysis
NLP, and more
These are just starting points. Feel free to explore, combine ideas, and personalize your projects based on your interests and skills.
What is your favorite machine learning project that you've worked on, and what made it memorable?
Share your experience below!
This is a simple example of an ML project, with the steps involved:
https://t.iss.one/datasciencefun/1800