Data Science Projects
Hi guys, this post is for all those who are confused about which path to take, and whether AI will take over what they're learning. One thing is for sure: life is random. AI is trending now, and in the future it might be something else. So better be prepared with…
One of the most important and underrated skills while learning data science, machine learning, or any other new skill is patience.
Everything takes time, but patience helps you stay calm and focused. Learn from your mistakes, keep practicing, and steadily improve.
These early struggles will slowly turn into success.
What is your preferred programming language for data manipulation?
1. Python
2. R
3. Julia
4. MATLAB
5. SAS
Feel free to mention any other language you prefer in the comments!
How do we evaluate classification models?
Depending on the classification problem, we can use the following evaluation metrics:
Accuracy
Precision
Recall
F1 Score
Logistic loss (also known as Cross-entropy loss)
Jaccard similarity coefficient score
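To make a few of these metrics concrete, here is a minimal sketch in plain Python. The labels and probabilities are made-up toy values, and the helper names (`accuracy`, `log_loss`, `jaccard`) are just illustrative, not any particular library's API.

```python
import math

# Hypothetical toy data: true labels, predicted probabilities for class 1,
# and hard predictions obtained with a 0.5 threshold.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy: mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip so we never take log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def jaccard(y_true, y_pred):
    """Overlap of true and predicted positives: TP / (TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    union = tp + fp + fn
    return tp / union if union else 0.0  # 0.0 when there are no positives: an assumed convention


print(accuracy(y_true, y_pred), log_loss(y_true, y_prob), jaccard(y_true, y_pred))
```

Note that log loss works on the predicted probabilities, while accuracy and Jaccard only see the thresholded labels.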
Which machine learning framework do you find most effective?
1. TensorFlow
2. PyTorch
3. Scikit-learn
4. Keras
5. XGBoost
If you have a different favorite, share it in the comments below!
Where to get data for your next machine learning project?
An overview of 5 amazing resources to accelerate your next project with data!
Google Datasets
Searching for datasets on the Google Dataset Search engine is as easy as searching for anything on Google Search. You just enter the topic you need a dataset for.
Kaggle Dataset
Explore, analyze, and share quality data.
Open Data on AWS
This registry exists to help people discover and share datasets that are available via AWS resources.
Awesome Public Datasets
A topic-centric list of high-quality open datasets.
Azure public data sets
Public data sets for testing and prototyping.
Can you write a program to print "Hello World" in Python?
Without using print.
Many of you already guessed it correctly. Brilliant people!
Here is the correct solution
import sys
sys.stdout.write("Hello World\n")
What is your preferred method for handling missing data in datasets?
1. Imputation techniques (mean, median, mode)
2. Deleting rows/columns with missing data
3. Using predictive models for imputation
4. Handling missing data as a separate category
5. Other (please specify in comments)
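For anyone unsure what options 1, 2 and 4 look like in practice, here is a rough sketch using only the Python standard library. The `ages` and `cities` columns and the `impute` helper are hypothetical, made up purely for illustration.

```python
from statistics import mean, median, mode

# Hypothetical columns; None marks a missing value.
ages = [25, None, 31, 40, None, 31]
cities = ["Pune", None, "Delhi", "Pune"]


def impute(values, strategy):
    """Replace each None with a statistic computed from the observed values."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]


ages_mean = impute(ages, mean)        # option 1: mean imputation
ages_median = impute(ages, median)    # option 1: median imputation
cities_mode = impute(cities, mode)    # option 1: mode imputation

# Option 2: drop the entries with missing data.
ages_dropped = [v for v in ages if v is not None]

# Option 4: treat missing data as its own category.
cities_flagged = ["Missing" if v is None else v for v in cities]

print(ages_mean, ages_median, cities_mode, ages_dropped, cities_flagged)
```

Option 3 (predictive imputation) would replace `strategy` with a model trained on the other columns, which is beyond this small sketch.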
Young people,
Go to the gym,
Even if you're tired.
Start that business,
Even if you're poor.
Invest in education,
Even if you're broke.
Approach that boy or girl,
Even if you're shy.
Do that work,
Even if you're unmotivated.
You are not weak.
Find a way to get things done.
TrueMinds
How to validate your models?
One of the most common approaches is splitting data into train, validation and test parts.
Models are trained on the training data, hyperparameters (for example, when to stop early) are selected based on the validation data, and the final measurement is done on the test dataset.
Another approach is cross-validation: split the dataset into K folds, and each time train the model on K-1 training folds and measure the performance on the remaining validation fold.
You can also combine these approaches: set aside a test/holdout dataset and do cross-validation on the rest of the data. The final quality is measured on the test dataset.
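The combined approach can be sketched with index lists in plain Python. The function names `holdout_split` and `k_fold`, the 20% test fraction, and K=5 are assumptions chosen for illustration only.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and set aside a test/holdout part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (rest, test)


def k_fold(indices, k=5):
    """Yield (train, validation) index lists; each fold is the validation fold once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Hold out a test set first, then cross-validate on the rest.
rest, test = holdout_split(n_samples=100)
splits = list(k_fold(rest, k=5))

for train, valid in splits:
    # A model would be fitted on `train` and scored on `valid` here.
    print(len(train), len(valid))

# The `test` indices stay untouched until the final measurement.
```

Keeping the test indices out of every fold is the point of the design: nothing used to pick hyperparameters ever sees the final evaluation data.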
How do you typically validate a machine learning model?
1. Train-test split
2. Cross-validation
3. Holdout validation
4. Bootstrap methods
5. Other (please specify in comments)
Is accuracy always a good metric?
Accuracy is not a good performance metric when the dataset is imbalanced. For example, in binary classification with 95% of class A and 5% of class B, a constant prediction of class A would have an accuracy of 95%. For an imbalanced dataset, we need to choose precision, recall, or F1 score depending on the problem we are trying to solve.
What are precision, recall, and F1-score?
Precision and recall are classification evaluation metrics:
P = TP / (TP + FP) and R = TP / (TP + FN).
Here TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
In both cases the score of 1 is the best: we get no false positives or false negatives and only true positives.
F1 is a combination of both precision and recall in one score (harmonic mean):
F1 = 2 * PR / (P + R).
Max F score is 1 and min is 0, with 1 being the best.
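Here is a small sketch of these formulas applied to the 95%/5% example, in plain Python. Class B is treated as the positive class (label 1). The second model and its predictions are invented for illustration, and returning 0 when a denominator is zero is an assumed convention, not part of the formulas.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute P = TP/(TP+FP), R = TP/(TP+FN) and F1 = 2*P*R/(P+R)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Assumed convention: an undefined ratio (zero denominator) counts as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 95 samples of class A (label 0) and 5 samples of class B (label 1).
y_true = [0] * 95 + [1] * 5

# Constant prediction of class A: high accuracy, but B is never found.
always_a = [0] * 100
acc_const = accuracy(y_true, always_a)
p_const, r_const, f1_const = precision_recall_f1(y_true, always_a)

# Hypothetical model: finds 4 of the 5 B samples, with 6 false alarms.
model_pred = [1] * 6 + [0] * 89 + [1] * 4 + [0] * 1
acc_model = accuracy(y_true, model_pred)
p_model, r_model, f1_model = precision_recall_f1(y_true, model_pred)

print(acc_const, p_const, r_const, f1_const)
print(acc_model, p_model, r_model, f1_model)
```

The constant predictor wins on accuracy, yet its precision, recall, and F1 for class B are all zero, while the second model scores lower on accuracy but actually detects class B.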
What is your go-to tool or library for data visualization?
1. Matplotlib
2. Seaborn
3. Plotly
4. ggplot (in R)
5. Tableau
If you prefer a different tool, share it in the comments below!
Which of the following is NOT a supervised learning algorithm?
A. Decision Trees
B. K-Means Clustering
C. Support Vector Machines
D. Linear Regression
Comment your answer!
The correct answer is:
B. K-Means Clustering
K-Means Clustering is an unsupervised learning algorithm, whereas Decision Trees, Support Vector Machines, and Linear Regression are all supervised learning algorithms.
How do you typically evaluate the performance of your machine learning models?
1. Accuracy
2. Precision and recall
3. F1-score
4. ROC-AUC curve
5. Mean Squared Error (MSE)
Share your preferred metrics or methods in the comments below!
What is your favorite machine learning algorithm and why?
Share your thoughts below!
Which evaluation metric is most appropriate for imbalanced classification tasks where detecting positive cases is crucial?
A. Accuracy
B. Precision
C. F1-score
D. ROC-AUC score
Choose the correct answer!
The last question was a little tricky!
The correct answer is B. Precision. Congrats to all those who answered correctly!
In imbalanced classification tasks, where one class (usually the minority class) is much less frequent than the other, accuracy can be misleading because it favors the majority class. Precision, on the other hand, measures the proportion of true positives among all positive predictions made by the model. It matters most when a positive prediction has to be trustworthy (such as flagging fraud or diseases) and false positives need to be minimized.
Because it focuses on the correctness of positive predictions, precision is a more suitable metric than accuracy for imbalanced datasets where the positive class is of interest. Keep in mind that precision does not count missed positives (false negatives), so when missing a positive case is the bigger risk, check recall or F1-score as well.