Which free resources do you want more of?
Anonymous Poll
Python - 18%
Artificial Intelligence - 22%
Machine Learning - 16%
Data Science - 20%
Data Engineering - 6%
Programming languages - 4%
Projects - 15%
Important Topics to become a data scientist
[Advanced Level]
1. Mathematics
Linear Algebra
Analytic Geometry
Matrix
Vector Calculus
Optimization
Regression
Dimensionality Reduction
Density Estimation
Classification
2. Probability
Introduction to Probability
1D Random Variable
Functions of One Random Variable
Joint Probability Distribution
Discrete Distribution
Normal Distribution
3. Statistics
Introduction to Statistics
Data Description
Random Samples
Sampling Distribution
Parameter Estimation
Hypothesis Testing
Regression
4. Programming
Python:
Python Basics
List
Set
Tuples
Dictionary
Function
NumPy
Pandas
Matplotlib/Seaborn
R Programming:
R Basics
Vector
List
Data Frame
Matrix
Array
Function
dplyr
ggplot2
tidyr
Shiny
Databases:
SQL
MongoDB
Data Structures
Web scraping
Linux
Git
5. Machine Learning
How Models Work
Basic Data Exploration
First ML Model
Model Validation
Underfitting & Overfitting
Random Forest
Handling Missing Values
Handling Categorical Variables
Pipelines
Cross-Validation (R)
XGBoost (Python | R)
Data Leakage
6. Deep Learning
Artificial Neural Network
Convolutional Neural Network
Recurrent Neural Network
TensorFlow
Keras
PyTorch
A Single Neuron
Deep Neural Network
Stochastic Gradient Descent
Overfitting and Underfitting
Dropout and Batch Normalization
Binary Classification
7. Feature Engineering
Baseline Model
Categorical Encodings
Feature Generation
Feature Selection
8. Natural Language Processing
Text Classification
Word Vectors
9. Data Visualization Tools
BI (Business Intelligence):
Tableau
Power BI
QlikView
Qlik Sense
10. Deployment
Microsoft Azure
Heroku
Google Cloud Platform
Flask
Django
Join @datasciencefun to learn important data science and machine learning concepts
ENJOY LEARNING
Machine Learning with Decision Trees and Random Forest.pdf (1.8 MB)
Machine Learning Algorithm
Data Scientist Interview Questions
1. How would you test whether a given dataset follows a normal distribution?
2. Explain the difference between Type I and Type II errors. How do they impact hypothesis testing?
3. You roll two dice. What is the probability that the sum is at least 8?
4. Given a biased coin that lands on heads with probability p, how can you generate a fair coin flip using it?
5. How would you detect and handle outliers in a dataset?
6. How do you deal with an imbalanced dataset in classification problems?
7. Explain how the Gradient Boosting Algorithm works. How is it different from Random Forest?
8. You are given a trained model with poor performance on new data. How would you debug the issue?
9. What is the curse of dimensionality? How do you mitigate its effects?
10. How do you choose the best number of clusters in K-means clustering?
11. Given a table of transactions, write an SQL query to find the top 3 customers with the highest total purchase amount.
12. How would you optimize a slow SQL query that joins multiple large tables?
13. Write an SQL query to calculate the rolling average of sales over the past 7 days.
14. How would you handle NULL values in an SQL dataset when performing aggregations?
15. How would you design a real-time recommendation system for an e-commerce website?
Answering these questions requires in-depth knowledge of data science concepts.
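As an illustration of the kind of answer question 1 expects, here is a minimal sketch using SciPy's Shapiro-Wilk test on a synthetic sample. The data and the 0.05 threshold are illustrative assumptions; in practice you would pair such a test with a visual check like a Q-Q plot.

```python
import numpy as np
from scipy import stats

# Hypothetical sample standing in for a real dataset column
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Shapiro-Wilk test: H0 = the data come from a normal distribution
stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic={stat:.4f}, p-value={p_value:.4f}")

# Common (but arbitrary) convention: reject normality if p < 0.05
if p_value < 0.05:
    print("Reject H0: data do not look normally distributed")
else:
    print("Fail to reject H0: data are consistent with a normal distribution")
```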
5 data science questions you should be able to answer for a data scientist role.
Medium Level
1. Name ML algorithms that do not use Gradient Descent for optimization.
2. Explain how you construct an ROC-AUC curve.
3. Give examples of business cases where precision is more important than recall, and vice versa.
4. What's the difference between bagging and boosting, and when would you use one over the other?
5. How do MLE and MAP differ?
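For question 2, here is a hedged sketch of how an ROC curve is built in practice: scikit-learn's roc_curve sweeps the classification threshold and records the false-positive and true-positive rates at each cut-off. The synthetic data and logistic model are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# roc_curve sweeps the decision threshold and records (FPR, TPR) pairs
fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", roc_auc_score(y_test, scores))
```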
Many data scientists don't know how to push ML models to production. Here's the recipe:
Key Ingredients
- Train / Test Dataset - Ensure the test set is representative of online data
- Feature Engineering Pipeline - Generate features in real time
- Model Object - Trained scikit-learn or TensorFlow model
- Project Code Repo - Save the model project code to GitHub
- API Framework - Use FastAPI or Flask to build a model API
- Docker - Containerize the ML model API
- Remote Server - Choose a cloud service, e.g. AWS SageMaker
- Unit Tests - Test inputs and outputs of functions and APIs
- Model Monitoring - Evidently AI, a simple open-source tool for ML monitoring
Procedure
Step 1 - Data Preparation & Feature Engineering
Don't promote a model just because it hits 90% accuracy on the training set. Judge it on the test set, and only if the test set is representative of the online data. Use a scikit-learn Pipeline to chain preprocessing steps such as null handling, as sketched below.
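A minimal sketch of such a preprocessing pipeline, assuming hypothetical numeric and categorical column names; adapt the columns and the final estimator to your own data.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; replace with your own schema
numeric_cols = ["age", "income"]
categorical_cols = ["country", "device"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

# The full pipeline handles nulls, encoding and scaling before the model
model = Pipeline([("preprocess", preprocess),
                  ("clf", RandomForestClassifier(n_estimators=200, random_state=0))])
# Usage: model.fit(X_train, y_train); model.score(X_test, y_test)
```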
Step 2 - Model Development
Train your model with a framework such as scikit-learn or TensorFlow, and push the model code, including the preprocessing, training and validation scripts, to GitHub for reproducibility.
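A short, illustrative training script of the kind you would version in the repo; the synthetic dataset and the model.joblib filename are assumptions, not part of the original recipe.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data; in a real project this script loads your own dataset
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))

# Persist the trained model so the API serves the exact same artifact
joblib.dump(model, "model.joblib")
```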
Step 3 - API Development & Containerization
Your model needs a "/predict" endpoint that receives a JSON object in the request and returns a JSON object with the model score in the response. You can use a framework like FastAPI or Flask. Containerize the API so that it is agnostic to the server environment.
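A minimal FastAPI sketch of such a /predict endpoint, assuming the model.joblib artifact from the earlier sketch and a purely illustrative request schema:

```python
# app.py - a minimal /predict endpoint; field names are illustrative
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # artifact produced in Step 2

class Features(BaseModel):
    values: List[float]  # one row of numeric features

@app.post("/predict")
def predict(features: Features):
    score = model.predict_proba([features.values])[0][1]
    return {"score": float(score)}

# Run locally with: uvicorn app:app --reload
```

A Dockerfile for this service only needs to install the dependencies, copy app.py and the model artifact, and run the same uvicorn command as its entrypoint.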
Step 4 - Testing & Deployment
Write tests that validate the inputs and outputs of the API functions to prevent errors, then push the code to a remote service such as AWS SageMaker.
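A hedged example of such a test using FastAPI's TestClient with pytest; it assumes the app.py and 20-feature model from the earlier sketches.

```python
# test_app.py - unit test for the /predict endpoint (run with: pytest)
from fastapi.testclient import TestClient

from app import app  # the FastAPI app from Step 3

client = TestClient(app)

def test_predict_returns_score():
    payload = {"values": [0.1] * 20}  # must match the model's feature count
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    body = response.json()
    assert "score" in body
    assert 0.0 <= body["score"] <= 1.0
```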
Step 5 - Monitoring
Set up a monitoring tool like Evidently AI, or use the built-in monitoring in AWS SageMaker. Use such tools to track performance metrics and data drift on online data.
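Evidently ships ready-made drift reports; as a library-agnostic illustration of the same idea (not the Evidently API), here is a minimal drift check that compares reference and live feature distributions with a two-sample Kolmogorov-Smirnov test. The column names, data and 0.05 threshold are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drift_check(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05):
    """Flag numeric columns whose live distribution differs from the reference."""
    drifted = []
    for col in reference.select_dtypes(include="number").columns:
        stat, p_value = stats.ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < alpha:
            drifted.append((col, p_value))
    return drifted

# Illustrative usage with synthetic "training-time" vs "online" data
rng = np.random.default_rng(0)
ref = pd.DataFrame({"age": rng.normal(40, 10, 1000)})
live = pd.DataFrame({"age": rng.normal(48, 10, 1000)})  # shifted distribution
print(drift_check(ref, live))  # a small p-value here signals drift on "age"
```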