Q. What do you understand by Recall and Precision?
A. Precision is defined as the fraction of retrieved instances that are relevant. Recall, sometimes referred to as sensitivity, is the fraction of relevant instances that are retrieved. A perfect classifier has precision and recall both equal to 1.
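A minimal sketch with scikit-learn (the label vectors below are illustrative, not from the original):

from sklearn.metrics import precision_score, recall_score

# Hypothetical binary labels: 1 = relevant, 0 = not relevant.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Precision: of the 4 instances retrieved (predicted 1), 3 are relevant.
print(precision_score(y_true, y_pred))  # 0.75
# Recall: of the 4 relevant instances, 3 were retrieved.
print(recall_score(y_true, y_pred))     # 0.75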
DATA SCIENCE INTERVIEW QUESTIONS WITH ANSWERS
1. What are the assumptions required for linear regression? What if some of these assumptions are violated?
Ans: The assumptions are as follows:
The sample data used to fit the model is representative of the population
The relationship between X and the mean of Y is linear
The variance of the residual is the same for any value of X (homoscedasticity)
Observations are independent of each other
For any value of X, Y is normally distributed.
Extreme violations of these assumptions make the results unreliable. Small violations result in greater bias or variance in the estimates.
2. What is multicollinearity and how to remove it?
Ans: Multicollinearity exists when an independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable.
You could use the Variance Inflation Factor (VIF) to determine whether there is any multicollinearity between independent variables; a standard benchmark is that if the VIF is greater than 5, then multicollinearity exists.
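A minimal sketch using statsmodels' variance_inflation_factor (the feature matrix is a made-up example in which x3 is nearly a linear combination of x1 and x2):

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
X["x3"] = X["x1"] + X["x2"] + rng.normal(scale=0.01, size=100)

Xc = add_constant(X)  # include an intercept so the VIFs are meaningful
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, variance_inflation_factor(Xc.values, i))
# x1, x2 and x3 all show VIFs far above the benchmark of 5

Removing or combining one of the offending variables (or using a penalized model such as ridge regression) brings the VIFs back down.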
3. What is overfitting and how to prevent it?
Ans: Overfitting is an error where the model "fits" the data too well, resulting in a model with high variance and low bias. As a consequence, an overfit model will inaccurately predict new data points even though it has a high accuracy on the training data.
A few approaches to prevent overfitting (a short sketch combining two of them follows this list):
- Cross-Validation: Cross-validation is a powerful preventative measure against overfitting. Here we use our initial training data to generate multiple mini train-test splits, and then use these splits to tune our model.
- Train with more data: It won't work every time, but training with more data can help algorithms detect the signal better and help the model learn general trends.
- Remove noise: We can remove irrelevant information or noise from our dataset.
- Early Stopping: When you're training a learning algorithm iteratively, you can measure how well each iteration of the model performs.
Up until a certain number of iterations, new iterations improve the model. After that point, however, the model's ability to generalize can weaken as it begins to overfit the training data.
Early stopping refers to stopping the training process before the learner passes that point.
- Regularization: It refers to a broad range of techniques for artificially forcing your model to be simpler. There are mainly 3 types of regularization techniques: L1 (Lasso), L2 (Ridge), and Elastic Net.
- Ensembling: Here we combine a number of weak learners to get a strong model. Ensembles come in two main types: bagging and boosting.
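A minimal sketch of cross-validation plus L2 regularization in scikit-learn (the synthetic dataset is purely illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Ridge applies an L2 penalty that shrinks coefficients toward zero;
# 5-fold cross-validation estimates how well the model generalizes.
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())

A large gap between training accuracy and the cross-validated score is a classic sign of overfitting.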
4. Given two fair dice, what are the probabilities of getting scores that sum to 4 and to 8?
Ans: There are 3 combinations that roll a sum of 4 (1+3, 3+1, 2+2):
P(rolling a 4) = 3/36 = 1/12
There are 5 combinations of rolling an 8 (2+6, 6+2, 3+5, 5+3, 4+4):
P(rolling an 8) = 5/36
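A quick brute-force check (a sketch enumerating all 36 equally likely outcomes):

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 rolls of two dice
p4 = sum(a + b == 4 for a, b in outcomes) / 36
p8 = sum(a + b == 8 for a, b in outcomes) / 36
print(p4, p8)  # 3/36 = 0.0833..., 5/36 = 0.1388...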
ENJOY LEARNING 👍👍
Which models do you know for solving time series problems?
Simple Exponential Smoothing: approximates the time series with an exponential function
Trend-Corrected Exponential Smoothing (Holt's Method): exponential smoothing that also models the trend
Trend- and Seasonality-Corrected Exponential Smoothing (Holt-Winters' Method): exponential smoothing that also models trend and seasonality
Time Series Decomposition: decomposes a time series into the four components trend, seasonal variation, cyclical variation and irregular component
Autoregressive models: similar to multiple linear regression, except that the dependent variable y_t depends on its own previous values rather than other independent variables
Deep learning approaches (RNN, LSTM, etc.)
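For example, a hedged Holt-Winters sketch with statsmodels (the monthly series below is synthetic):

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic monthly data with an upward trend and yearly seasonality.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
y = pd.Series(50 + np.arange(48) + 10 * np.sin(np.arange(48) * 2 * np.pi / 12), index=idx)

# Additive trend and additive seasonality with a 12-month period.
fit = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(fit.forecast(6))  # forecast for the next six months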
How is kNN different from k-means clustering?
kNN, or k-nearest neighbors, is a classification algorithm, where k is an integer describing the number of neighboring data points that influence the classification of a given observation. K-means is a clustering algorithm, where k is an integer describing the number of clusters to be created from the given data. The key difference: kNN is supervised (it needs labeled data), while k-means is unsupervised. Both accomplish different tasks.
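A minimal side-by-side sketch in scikit-learn (the blob data is illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# kNN is supervised: it needs the labels y to classify new points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))

# K-means is unsupervised: it derives cluster ids from X alone.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])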
DATA SCIENCE INTERVIEW QUESTIONS WITH ANSWERS
1. What is a logistic function? What is the range of values of a logistic function?
f(z) = 1 / (1 + e^(-z))
The values of a logistic function will range from 0 to 1. The values of Z will vary from -infinity to +infinity.
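A tiny sketch of the function and its range (the sample inputs are illustrative):

import numpy as np

def logistic(z):
    # f(z) = 1 / (1 + e^(-z)); the output always lies strictly between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(logistic(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# -> [~0.00005, 0.2689, 0.5, 0.7311, ~0.99995]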
2. What is the difference between R square and adjusted R square?
R square and adjusted R square values are used for model validation in case of linear regression. R square indicates the variation of all the independent variables on the dependent variable. i.e. it considers all the independent variable to explain the variation. In the case of Adjusted R squared, it considers only significant variables(P values less than 0.05) to indicate the percentage of variation in the model.
Thus Adjusted R2 is always lesser then R2.
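A one-function sketch of that formula (the example numbers are made up):

def adjusted_r2(r2, n, p):
    # n = number of observations, p = number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.85, n=100, p=5))  # ~0.842, slightly below R^2 = 0.85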
3. What is stratify in train_test_split?
Stratification means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset. So if the input data has 60% 0's and 40% 1's as the class label, then the train and test datasets will have the same proportions.
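A minimal sketch (the 60/40 labels are the example from the answer; the feature column is a dummy):

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 60 + [1] * 40)   # 60% zeros, 40% ones
X = np.arange(100).reshape(-1, 1)   # dummy feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(y_tr.mean(), y_te.mean())  # both ~0.40: the class ratio is preserved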
4. What is Backpropagation in an Artificial Neural Network?
Backpropagation is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the model reliable by increasing its generalization.
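A minimal sketch of the idea for a single sigmoid neuron on toy AND data (one of many possible formulations, not the only one):

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)  # AND of the two inputs

rng = np.random.default_rng(0)
w, b, lr = rng.normal(size=2), 0.0, 0.5
for epoch in range(2000):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # forward pass: sigmoid activation
    grad_z = (p - y) / len(y)            # error signal (cross-entropy + sigmoid)
    w -= lr * (X.T @ grad_z)             # backward pass: update weights...
    b -= lr * grad_z.sum()               # ...and bias against the gradient
print(np.round(p, 2))  # approaches [0, 0, 0, 1] as the error is driven down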
ENJOY LEARNING 👍👍
Machine learning.pdf (5.3 MB)
Core machine learning concepts explained through memes and simple charts created by Mihail Eric.
Python for Machine Learning & Data Science Masterclass
44 Hours, 170 Lessons
Learn about Data Science and Machine Learning with Python! Including NumPy, Pandas, Matplotlib, Scikit-Learn and more!
Taught by: Jose Portilla
Download Full Course: https://t.iss.one/datasciencefree/69
Download All Courses: https://t.iss.one/datasciencefree/2
You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?
Answer: This question has enough hints for you to start thinking! Since the data is spread around the median, let's assume it follows a normal distribution. In a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which equals the median and mode), so that ~68% is affected and the remaining ~32% is not. Therefore, ~32% of the data would remain unaffected by missing values.
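A quick check of the 68% figure with scipy (a sketch):

from scipy.stats import norm

# Mass of a normal distribution within one standard deviation of its center.
within = norm.cdf(1) - norm.cdf(-1)
print(within, 1 - within)  # ~0.6827 affected, ~0.3173 unaffected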
Pattern Recognition and Machine Learning [Information Science and Statistics]
Christopher M. Bishop
#python #machinelearning #statistics #information #ai #ml
Introduction to Machine Learning
by Alex Smola and S.V.N. Vishwanathan
Cambridge University Press
#numpy
NumPy
Smart use of ':' to extract the right shape
Sometimes you encounter a 3-dimensional array of shape (N, T, D) while your function requires a shape of (N, D). At a time like this, reshape() will do more harm than good, so you are left with one simple solution:
Example:
for t in range(T):
    x[:, t, :] = ...  # operate on the (N, D) slice at time t
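A runnable version of the idea (row_means is a hypothetical stand-in for a function that expects shape (N, D)):

import numpy as np

def row_means(a):
    return a.mean(axis=1)  # expects (N, D), returns (N,)

N, T, D = 4, 3, 5
x = np.arange(N * T * D, dtype=float).reshape(N, T, D)

out = np.empty((N, T))
for t in range(T):
    out[:, t] = row_means(x[:, t, :])  # x[:, t, :] has shape (N, D)
print(out.shape)  # (4, 3)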
To become a Machine Learning Engineer:
• Python
• numpy, pandas, matplotlib, Scikit-Learn
• TensorFlow or PyTorch
• Jupyter, Colab
• Analysis > Code
• 99%: Foundational algorithms
• 1%: Other algorithms
• Solve problems (this is key)
• Teaching = 2 × Learning
• Have fun!
A LITTLE GUIDE TO HANDLING MISSING DATA
Is any feature missing more than 5-10% of its values? You should consider it missing data, i.e. a feature with a high absence rate.
How can you handle these missing values while ensuring you don't lose an important part of your data?
Not a problem. Here are important facts you must know (a short imputation sketch follows the list):
✔️ Instances with missing values for all features should be eliminated
✔️ Features with a high absence rate should either be eliminated or filled with values
✔️ Missing values can be replaced using mean imputation or regression imputation
✔️ Be careful with mean imputation, for it may introduce bias as it evens out all instances
✔️ Regression imputation might overfit your model
✔️ Mean and regression imputation can't be applied to text features with missing values
✔️ Text features with missing values can be eliminated if not needed in the data
✔️ Important text features with missing values can be replaced with a new class or category labelled as 'uncategorized'
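A minimal pandas/scikit-learn sketch of two of these tactics (the tiny DataFrame is made up):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],             # numeric feature with a gap
    "city": ["Lagos", None, "Accra", None],  # text feature with gaps
})

# Mean imputation for the numeric feature.
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
# A new 'uncategorized' class for the important text feature.
df["city"] = df["city"].fillna("uncategorized")
print(df)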
Top 8 GitHub Repos to Learn Data Science and Python
1. All Algorithms Implemented in Python
By: The Algorithms
Stars ⭐: 135K
Forks: 35.3K
Repo: https://github.com/TheAlgorithms/Python
2. DataScienceResources
By: Jonathan Bower
Stars ⭐: 3K
Forks: 1.3K
Repo: https://github.com/jonathan-bower/DataScienceResources
3. Playground and Cheatsheet for Learning Python
By: Oleksii Trekhleb
Stars ⭐: 12.5K
Forks: 2K
Repo: https://github.com/trekhleb/learn-python
4. Learn Python 3
By: Jerry Pussinen
Stars ⭐: 4.8K
Forks: 1.4K
Repo: https://github.com/jerry-git/learn-python3
5. Awesome Data Science
By: Fatih Aktürk, Hüseyin Mert, Osman Ungur & Recep Erol
Stars ⭐: 18.4K
Forks: 5K
Repo: https://github.com/academic/awesome-datascience
6. data-scientist-roadmap
By: MrMimic
Stars ⭐: 5K
Forks: 1.5K
Repo: https://github.com/MrMimic/data-scientist-roadmap
7. Data Science Best Resources
By: Tirthajyoti Sarkar
Stars ⭐: 1.8K
Forks: 717
Repo: https://github.com/tirthajyoti/Data-science-best-resources/blob/master/README.md
8. Ds-cheatsheets
By: Favio André Vázquez
Stars ⭐: 10.4K
Forks: 3.1K
Repo: https://github.com/FavioVazquez/ds-cheatsheets
Deep Learning with PyTorch by Prof. Yann LeCun (pioneer of CNNs)
This course concerns the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, convolutional and recurrent nets, with applications to computer vision, natural language understanding, and speech recognition.
GitHub Link: https://atcold.github.io/pytorch-Deep-Learning/
YouTube Playlist: https://www.youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyolxOsz6mq