What is the curse of dimensionality? Why do we care about it?
Data in only one dimension is relatively tightly packed. Adding a dimension stretches the points across that dimension, pushing them further apart, and each additional dimension spreads the data even further, making high-dimensional data extremely sparse. We care about it because it is difficult to use machine learning in sparse spaces: with a fixed number of samples, the data covers less and less of the space as dimensions are added, so models need far more data to generalize.
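To make this concrete, here is a small sketch (hypothetical setup, using NumPy) showing that the average nearest-neighbour distance among a fixed number of random points grows as dimensions are added:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(n_points, n_dims):
    """Average distance from each point to its nearest neighbour."""
    pts = rng.random((n_points, n_dims))  # uniform points in the unit hypercube
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore each point's zero distance to itself
    return d.min(axis=1).mean()

# The same 200 points become much more spread out as dimensions are added.
for dims in (1, 10, 100):
    print(dims, round(mean_nn_distance(200, dims), 3))
```

The exact numbers depend on the random seed, but the trend (sparser with every added dimension) is the point.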
K-means vs DBSCAN
DBSCAN is more robust to noise.
DBSCAN is better when the number of clusters is difficult to guess.
K-means has lower complexity, i.e. it will be much faster, especially with a larger number of points.
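These differences can be sketched on scikit-learn's two-moons toy data (the `eps` value here is a hand-picked assumption, not a tuned parameter):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-circles: a shape K-means struggles with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-means needs the number of clusters up front and assumes round blobs.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; points it cannot
# reach from any dense region are labelled -1 (noise).
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print("DBSCAN found", len(set(dbscan_labels) - {-1}), "clusters")
```

On this shape DBSCAN recovers the two moons from density alone, while K-means splits the plane into two halves regardless of the moons' curvature.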
Data Science & Machine Learning
Fake_News_Detection_Machine_learning_project.rar
Start working on a project if you are a beginner and want to grow your career as a data scientist.
You will learn much more as you practice and work on projects by yourself.
You can find datasets in this channel, or go to Kaggle, pick any random dataset, and just work on it.
Learning concepts is fine, but most of the learning comes from projects.
It might feel boring at first, but as you move forward it becomes interesting.
The Ultimate Guide to the Pandas Library for Data Science in Python
Link: https://www.freecodecamp.org/news/the-ultimate-guide-to-the-pandas-library-for-data-science-in-python/amp/
A Visual Intro to NumPy and Data Representation
Link: https://jalammar.github.io/visual-numpy/
Matplotlib Cheatsheet
Link: https://github.com/rougier/matplotlib-cheatsheet
SQL Cheatsheet
Link: https://websitesetup.org/sql-cheat-sheet/
Seeing Theory: A visual introduction to probability and statistics
Link: https://seeing-theory.brown.edu/
"The Projects You Should Do to Get a Data Science Job" by Ken Jee
Link: https://link.medium.com/Q2DnxSGRO6
Type-2 error is?
Anonymous Quiz
18%
True Positive
32%
True Negative
15%
False Positive
35%
False Negative
Precision is one indicator of a machine learning model's performance โ the quality of a positive prediction made by the model. Its formula would be?
Anonymous Quiz
43%
True Positive divided by actual yes
10%
True Positive divided by actual no
43%
True Positive divided by predicted yes
4%
True Positive divided by predicted no
Scatter plot is used to?
Anonymous Quiz
27%
Find Correlation between two variables
12%
Detect Outliers
14%
Compare large numbers of data points without regard to time
47%
All of the above
A handy notebook on handling missing values
Link: https://www.kaggle.com/parulpandey/a-guide-to-handling-missing-values-in-python
A list of NLP Tutorials
Link: https://github.com/lyeoni/nlp-tutorial
"An Implementation and Explanation of the Random Forest in Python" by Will Koehrsen
Link: https://link.medium.com/GCWFv81v95
"How to analyse 100s of GBs of data on your laptop with Python" by Jovan Veljanoski
Link: https://link.medium.com/V8xS82Cax6
Skills Required For A Data Analyst in 2021
Basic Excel-
https://www.youtube.com/playlist?list=PLmQAMKHKeLZ_ADx6nJcoTM5t2S1bmsMdm
Advanced Excel-
https://www.youtube.com/playlist?list=PLmQAMKHKeLZ_e9xmZNPACsLdgie3Tkaxf
SQL-
https://www.youtube.com/watch?v=5JCyiutyu_o&list=PLmQAMKHKeLZ-kD9VN0prfKCByr9pa4jw6
SQL- (Khan Academy)-
https://www.khanacademy.org/computing/computer-programming/sql
Python Programming Language-
https://www.youtube.com/watch?v=bPrmA1SEN2k&list=PLZoTAELRMXVNUL99R4bDlVYsncUNvwUBB
Stats Lectures-
https://www.youtube.com/watch?v=zRUliXuwJCQ&list=PLZoTAELRMXVMhVyr3Ri9IQ-t5QPBtxzJO
Stats Lectures (Khan Academy)-
https://www.khanacademy.org/math/statistics-probability
Python EDA-
https://www.youtube.com/playlist?list=PLZoTAELRMXVPQyArDHyQVjQxjj_YmEuO9
Python Feature Engineering-
https://www.youtube.com/playlist?list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN
Tableau-
https://www.tableau.com/academic/student-Iron-Viz
Here is the explanation for the quiz
The F-beta score is the weighted harmonic mean of precision and recall, reaching its optimal value at 1 and its worst value at 0. The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero. The F-score is commonly used for evaluating information retrieval systems such as search engines, and also for many kinds of machine learning models, in particular in natural language processing.
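The quiz answers above can be verified with a small worked example (the confusion-matrix counts below are made up): a Type-2 error is a false negative, and precision divides true positives by everything predicted positive.

```python
# Made-up confusion-matrix counts for a binary classifier.
TP, FP = 40, 10   # predicted yes: correct hits and false alarms
FN, TN = 20, 30   # predicted no: Type-2 errors (FN) and correct rejections

precision = TP / (TP + FP)  # TP over everything predicted yes
recall = TP / (TP + FN)     # TP over everything actually yes
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, round(f1, 3))
```

Note how the harmonic mean pulls F1 toward the weaker of the two scores; if either precision or recall were 0, F1 would be 0 as well.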
What is the full form of LSTM?
Hint- LSTM algorithm is used for processing and making predictions based on time series data
Anonymous Quiz
7%
Long story total memory
72%
Long short-term memory
16%
Long short-term machine
4%
None of the three
Which of the following projects seems most attractive to you?
Anonymous Poll
30%
Fake News Detection
26%
Speech Emotion Recognition
28%
Chatbot Project in Python
27%
Movie Recommendation System
21%
Customer Segmentation
31%
Sentiment Analysis
21%
Handwritten Character Recognition
12%
Digit Recognition System
35%
Face Detection System
5%
None of these
Amazing response from you guys in this poll
Let's start with project #1
Fake News Detection
This is an example of text classification, since we need to classify whether a news article is real or fake.
You can use a dataset from Kaggle to work on this project:
https://bit.ly/3FGcyoJ
Or
https://www.kaggle.com/c/fake-news/data
Before you work on this project, you should have a fair understanding of the topics below.
Concepts: Stopwords, Porter Stemmer, Tokenisation, TF-IDF Vectorizer, LSTM, NLP
Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, re, nltk, TensorFlow
Steps:
1. Go through the dataset
2. Import the libraries
3. Exploratory Data Analysis (EDA)
4. Data Visualization
5. Data preparation using tokenisation and padding
6. Reduce unnecessary words using stop-word removal and the Porter Stemmer. Convert text to vectors using CountVectorizer or TfidfVectorizer.
7. Split the dataset into training and testing sets
8. Build and train the model using ML algorithms
9. Evaluate the model using accuracy, recall, precision, the confusion matrix and other metrics
Algorithms you can apply:
Logistic Regression, Support Vector Machine, Multilayer Perceptron, KNN, Random Forest, Linear SVM, etc.
ENJOY LEARNING
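Steps 5 to 9 above can be sketched end-to-end on a tiny invented corpus (the texts and labels below are placeholders; the real project would load the Kaggle dataset instead):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 1 = fake, 0 = real.
texts = ["aliens landed in my garden", "stock markets closed higher today",
         "miracle cure discovered overnight", "parliament passed the budget bill",
         "celebrity spotted riding a dragon", "rain expected across the region"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

# Step 7: train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0)

# Steps 5-6: stop-word removal + TF-IDF vectorisation.
vec = TfidfVectorizer(stop_words="english")
X_train_vec = vec.fit_transform(X_train)

# Step 8: train a simple classifier (logistic regression as a baseline).
clf = LogisticRegression().fit(X_train_vec, y_train)

# Step 9: evaluate on held-out data.
acc = accuracy_score(y_test, clf.predict(vec.transform(X_test)))
print("accuracy:", acc)
```

The toy corpus repeats the same six sentences, so the score is unrealistically high; the structure of the pipeline is what carries over to the real dataset.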
Overview of some important concepts:
Natural Language Processing, or NLP, is a subfield of Artificial Intelligence that enables machines to understand human language. Its goal is to build systems that can make sense of text and automatically perform tasks like translation, spell check, or text classification.
NLP analyzes the grammatical structure of sentences and the individual meaning of words, then uses algorithms to extract meaning and deliver outputs. In other words, it makes sense of human language so that it can automatically perform different tasks.
Tokenization is a part of syntactic analysis; it breaks a text up into smaller parts called tokens (which can be sentences or words) to make the text easier to handle.
Stop-word removal removes frequently occurring words that don't add any semantic value, such as I, they, have, like, yours, etc.
The Porter stemming algorithm (or "Porter stemmer") is a process for removing the commoner morphological and inflexional endings from words in English. For example, words such as "likes", "liked", "likely" and "liking" will be reduced to "like" after stemming.
TfidfVectorizer is used to transform text into feature vectors that can be used as input to an estimator.
LSTM (Long Short-Term Memory) networks are well-suited for classifying, processing and making predictions based on time series data.
ENJOY LEARNING
Some important questions to crack a data science interview
Q. Describe how Gradient Boosting works.
A. Gradient boosting is a type of machine learning boosting. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. If a small change in the prediction for a case causes no change in error, then the next target outcome for that case is zero. Gradient boosting produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
Q. Describe the decision tree model.
A. Decision Trees are a type of Supervised Machine Learning where the data is continuously split according to a certain parameter. The leaves are the decisions or the final outcomes. A decision tree is a machine learning algorithm that partitions the data into subsets.
Q. What is a neural network?
A. Neural networks, also known as Artificial Neural Networks, are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. Networks with many layers form the basis of Deep Learning.
Q. Explain the Bias-Variance Tradeoff
A. The bias-variance tradeoff is the property of a model whereby the variance of the parameter estimates across samples can be reduced by increasing the bias in the estimated parameters.
Q. What's the difference between L1 and L2 regularization?
A. The main intuitive difference is that L1 regularization tries to estimate the median of the data, while L2 regularization tries to estimate the mean of the data, in order to avoid overfitting. In practice, L1 (Lasso) tends to drive some coefficients exactly to zero, producing sparse models, whereas L2 (Ridge) only shrinks coefficients toward zero without eliminating them.
ENJOY LEARNING
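The L1 vs L2 difference shows up clearly as sparsity. Here is a sketch with synthetic data (the coefficients and `alpha` are arbitrary choices) comparing scikit-learn's Lasso (L1) and Ridge (L2):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] + 0.5 * rng.normal(size=200)  # only feature 0 is informative

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)  # L2 penalty

# L1 tends to zero out irrelevant coefficients; L2 only shrinks them.
print("Lasso zero coefficients:", int((lasso.coef_ == 0).sum()))
print("Ridge zero coefficients:", int((ridge.coef_ == 0).sum()))
```

With nine irrelevant features, Lasso typically eliminates most of them outright, while Ridge keeps all ten coefficients small but non-zero.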
Which of the following is not a python library?
Anonymous Quiz
1%
Pandas
2%
Numpy
4%
Matplotlib
80%
Dictionary
13%
Seaborn
Which of the following is used specifically for applying machine learning algorithms?
Anonymous Quiz
14%
Matplotlib
71%
Scikit-learn
7%
Seaborn
8%
Scipy
Some important questions to crack a data science interview, Part 2
Q1. p-value?
Ans. The p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance of the observed difference. The p-value can be used as an alternative to, or in addition to, pre-selected confidence levels for hypothesis testing.
Q2. Interpolation and Extrapolation?
Ans. Interpolation is the process of calculating an unknown value from known given values, whereas extrapolation is the process of calculating unknown values beyond the given data points.
Q3. Uniform distribution & normal distribution?
Ans. The normal distribution is bell-shaped, which means values near the center of the distribution are more likely to occur than values on the tails of the distribution. The uniform distribution is rectangular-shaped, which means every value in the distribution is equally likely to occur.
Q4. Recommender Systems?
Ans. A recommender system mainly deals with the likes and dislikes of users. Its major objective is to recommend to a user an item that the user has a high chance of liking, based on their previous purchases. It is like having a personalized team that understands our likes and dislikes and helps us make decisions about a particular item, without bias, by making use of the large amounts of data generated day by day.
Q5. JOIN function in SQL?
Ans. The SQL JOIN clause is used to combine records from two or more tables in a database.
Q6. Squared Error and Absolute Error?
Ans. Mean squared error (MSE) and mean absolute error (MAE) are used to evaluate a regression model's accuracy. The squared error is everywhere differentiable, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to the techniques of mathematical optimization.
ENJOY LEARNING
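A quick numeric check of Q6 (the residuals below are invented): squaring penalises the one large error much more heavily than taking absolute values does.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
residuals = y_true - y_pred  # [0.5, 0.0, -2.0, -1.0]

mse = np.mean(residuals ** 2)     # (0.25 + 0 + 4 + 1) / 4 = 1.3125
mae = np.mean(np.abs(residuals))  # (0.5 + 0 + 2 + 1) / 4 = 0.875
print(mse, mae)
```

The -2.0 residual contributes 4 to the MSE sum but only 2 to the MAE sum, which is why MSE-trained models are more sensitive to outliers.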