A LITTLE GUIDE TO HANDLING MISSING DATA
Does any feature have more than 5-10% of its values missing? Then treat it as a feature with a high absence rate.
How can you handle these missing values without losing an important part of your data?
Not a problem. Here are the key points you should know (a short pandas sketch follows after the list):
✔️ Instances with missing values for all features should be eliminated.
✔️ Features with a high absence rate should either be eliminated or filled with values.
✔️ Missing values can be replaced using mean imputation or regression imputation.
✔️ Be careful with mean imputation: it may introduce bias because it evens out all instances.
✔️ Regression imputation might overfit your model.
✔️ Mean and regression imputation can't be applied to text features with missing values.
✔️ Text features with missing values can be eliminated if they are not needed.
✔️ Important text features with missing values can be filled with a new class or category labelled "uncategorized".
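A minimal sketch of these ideas with pandas and scikit-learn, on a hypothetical toy DataFrame (all column names and values below are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with missing values (column names are made up)
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31, np.nan],
    "income": [30_000, 42_000, 58_000, 45_000, 39_000],
    "city": ["Lagos", None, "Nairobi", None, "Accra"],
})

# Drop instances where every feature is missing
df = df.dropna(how="all")

# Mean imputation for a numeric feature (beware: it flattens variance and can bias results)
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Regression imputation: predict the missing numeric feature from another feature
known = df[df["age"].notna()]
unknown = df[df["age"].isna()]
reg = LinearRegression().fit(known[["income"]], known["age"])
df.loc[df["age"].isna(), "age"] = reg.predict(unknown[["income"]])

# Text feature: either drop it if unneeded, or fill with a new "uncategorized" class
df["city"] = df["city"].fillna("uncategorized")
print(df)
```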
Top 8 GitHub Repos to Learn Data Science and Python
1. All Algorithms Implemented in Python
By: The Algorithms
Stars ⭐: 135K
Forks: 35.3K
Repo: https://github.com/TheAlgorithms/Python
2. DataScienceResources
By: Jonathan Bower
Stars ⭐: 3K
Forks: 1.3K
Repo: https://github.com/jonathan-bower/DataScienceResources
3. Playground and Cheatsheet for Learning Python
By: Oleksii Trekhleb
Stars ⭐: 12.5K
Forks: 2K
Repo: https://github.com/trekhleb/learn-python
4. Learn Python 3
By: Jerry Pussinen
Stars ⭐: 4.8K
Forks: 1.4K
Repo: https://github.com/jerry-git/learn-python3
5. Awesome Data Science
By: Fatih Aktürk, Hüseyin Mert, Osman Ungur & Recep Erol
Stars ⭐: 18.4K
Forks: 5K
Repo: https://github.com/academic/awesome-datascience
6. data-scientist-roadmap
By: MrMimic
Stars ⭐: 5K
Forks: 1.5K
Repo: https://github.com/MrMimic/data-scientist-roadmap
7. Data Science Best Resources
By: Tirthajyoti Sarkar
Stars ⭐: 1.8K
Forks: 717
Repo: https://github.com/tirthajyoti/Data-science-best-resources/blob/master/README.md
8. Ds-cheatsheets
By: Favio André Vázquez
Stars ⭐: 10.4K
Forks: 3.1K
Repo: https://github.com/FavioVazquez/ds-cheatsheets
Deep Learning with PyTorch by Prof. Yann LeCun (a pioneer of convolutional neural networks)
This course covers the latest techniques in deep learning and representation learning, focusing on supervised and unsupervised deep learning, embedding methods, metric learning, and convolutional and recurrent nets, with applications to computer vision, natural language understanding, and speech recognition.
GitHub Link: https://atcold.github.io/pytorch-Deep-Learning/
YouTube Playlist: https://www.youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyolxOsz6mq
NYU Deep Learning SP20 course website: https://bit.ly/DLSP20-web
New data scientists - when you learn, it's easy to get distracted by machine learning and deep learning terms like "XGBoost", "Neural Networks", "RNN", "LSTM" or advanced technologies like "Spark", "Julia", "Scala", "Go", etc.
Don't get bogged down trying to learn every new term and technology you come across.
Instead, focus on foundations:
- data wrangling
- visualizing
- exploring
- modeling
- understanding the results
The best tools are often the basic ones. Build yourself up and you'll advance much faster. Keep learning! (A minimal sketch of that basic workflow follows below.)
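Here is one way that basic workflow can look with pandas, Matplotlib and scikit-learn. The file name data.csv and the column name target are assumptions for illustration only:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Data wrangling: load and clean (file name and column names are assumptions)
df = pd.read_csv("data.csv").dropna()

# Exploring / visualizing
print(df.describe())
df.hist(figsize=(8, 6))
plt.show()

# Modeling
X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Understanding the results
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```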
Which of the following tools can be used for data visualization?
Anonymous Quiz
21%
Matplotlib
17%
Tableau
2%
Seaborn
61%
All of the above
Data Analysis Interview Questions and Answers
1. How do you create filters in Power BI?
Filters are an integral part of Power BI reports. They are used to slice and dice the data along the dimensions we want. Filters can be created in a couple of ways.
Using slicers: a slicer is a visual under the Visualization pane. It can be added to the design view to filter the report. When a slicer is added to the design view, it requires a field to be added to it. For example, a slicer can be added for a Country field, and the data can then be filtered by country.
Using the Filter pane: the Power BI team has added a filter pane to reports, a single space where different fields can be added as filters. These fields can be added depending on whether you want to filter only one visual (visual-level filter), all the visuals on the report page (page-level filter), or all the pages of the report (report-level filter).
2. How do you sort data in Power BI?
Sorting is available in multiple forms. In the data view, the common option of sorting in alphabetical order is available. Apart from that, there is the Sort by Column option, where one column can be sorted based on another column. Sorting is available in visuals as well: you can sort ascending or descending by the fields and measures present in the visual.
3. How do you convert a PDF to Excel?
Open the PDF document you want to convert to XLSX format in Acrobat DC.
Go to the right pane and click the "Export PDF" option.
Choose Spreadsheet as the export format.
Select "Microsoft Excel Workbook."
Now click "Export."
Download the converted file or share it.
4. How do you enable macros in Excel?
Click the File tab and then click "Options."
A dialog box will appear. In the "Excel Options" dialog box, click "Trust Center" and then "Trust Center Settings."
Go to "Macro Settings" and select "Enable all macros."
Click OK to apply the macro settings.
--------------------
ENJOY LEARNING!
While certificates have their own place in proving your skills, completing a course just for the sake of the certificate is not going to help you at all. So whatever courses you take up, please make sure that you learn, practice, and actually acquire the skill.
Some helpful data science projects for beginners (a Titanic starter sketch follows after the list):
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
https://www.kaggle.com/c/digit-recognizer
https://www.kaggle.com/c/titanic
Intermediate-Level Data Science Projects
Black Friday Data : https://www.kaggle.com/sdolezel/black-friday
Human Activity Recognition Data : https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
Trip History Data : https://www.kaggle.com/pronto/cycle-share-dataset
Million Song Data : https://www.kaggle.com/c/msdchallenge
Census Income Data : https://www.kaggle.com/c/census-income/data
Movie Lens Data : https://www.kaggle.com/grouplens/movielens-20m-dataset
Twitter Classification Data : https://www.kaggle.com/c/twitter-sentiment-analysis2
Text mining : https://www.kaggle.com/kanncaa1/applying-text-mining
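If you want a concrete starting point, here is a minimal baseline sketch for the Titanic competition, assuming you have downloaded its train.csv into your working directory:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Kaggle Titanic training file (assumed to be downloaded locally)
train = pd.read_csv("train.csv")

# Very simple feature preparation
features = ["Pclass", "Sex", "Age", "Fare"]
X = pd.get_dummies(train[features], columns=["Sex"])
X["Age"] = X["Age"].fillna(X["Age"].median())
y = train["Survived"]

# Hold out a validation split and fit a baseline model
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```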
Three different learning styles in machine learning algorithms (a short scikit-learn illustration follows below, after the three styles):
1. Supervised Learning
Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.
A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.
Example problems are classification and regression.
Example algorithms include: Logistic Regression and the Back Propagation Neural Network.
2. Unsupervised Learning
Input data is not labeled and does not have a known result.
A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.
Example problems are clustering, dimensionality reduction and association rule learning.
Example algorithms include: the Apriori algorithm and K-Means.
3. Semi-Supervised Learning
Input data is a mixture of labeled and unlabeled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.
Example problems are classification and regression.
Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.
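A minimal scikit-learn sketch of the three styles on a small synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic data: X are features, y are known labels
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Supervised: train on labelled data, then predict labels for new instances
clf = LogisticRegression().fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised: no labels, the model deduces structure (here, clusters)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print("Cluster assignments:", km.labels_[:5])

# Semi-supervised: mark most labels as unknown (-1) and learn from the mixture
y_partial = y.copy()
y_partial[50:] = -1  # only the first 50 samples keep their labels
semi = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
print("Semi-supervised predictions:", semi.predict(X[:5]))
```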
Interview Q&As for ML Engineers
1. What are the various steps involved in a data analytics project?
The steps involved in a data analytics project are listed below (a short scikit-learn sketch of the modelling steps follows after the list):
Data collection
Data cleansing
Data pre-processing
EDA
Creation of train test and validation sets
Model creation
Hyperparameter tuning
Model deployment
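A minimal scikit-learn sketch of the split, model creation and hyperparameter tuning steps, using a synthetic dataset in place of a cleaned, pre-processed one:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for a cleaned, pre-processed dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train / validation / test split (60 / 20 / 20)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Model creation + hyperparameter tuning via cross-validated grid search
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 5]},
    cv=3,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Validation accuracy:", grid.score(X_val, y_val))
print("Test accuracy:", grid.score(X_test, y_test))
```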
2. Explain Star Schema.
A star schema is a data warehousing design in which multiple dimension tables are connected to a central fact table, so the diagram resembles a star.
3. What is root cause analysis?
Root cause analysis is the process of tracing an event back to the factors that led to it. It is generally done when software malfunctions. In data science, root cause analysis helps businesses understand the reasons behind certain outcomes.
4. Define Confounding Variables.
A confounding variable is an external influence in an experiment. In simple words, it is a variable that affects both the independent and the dependent variable, distorting their apparent relationship. A variable should satisfy the conditions below to be a confounding variable:
It should be correlated with the independent variable.
It should be causally related to the dependent variable.
For example, if you are studying whether a lack of exercise has an effect on weight gain, then the lack of exercise is the independent variable and weight gain is the dependent variable. A confounding variable can be any other factor that has an effect on weight gain: the amount of food consumed, weather conditions, etc.
Managing Machine Learning Projects
Simon Thompson, 2022
https://t.iss.one/Programming_experts/121
Which of the following tools can't be used for data visualization?
Anonymous Quiz
6%
Tableau
10%
Power BI
9%
Matplotlib
75%
Javascript
To become a Machine Learning Engineer:
• Python
• NumPy, pandas, Matplotlib, scikit-learn
• TensorFlow or PyTorch
• Jupyter, Colab
• Analysis > Code
• 99%: foundational algorithms
• 1%: other algorithms
• Solve problems → this is key
• Teaching = 2 × Learning
• Have fun!
Useful pandas methods you should definitely know (a quick demo follows below):
✔️ head()
✔️ info()
✔️ fillna()
✔️ melt()
✔️ pivot()
✔️ query()
✔️ merge()
✔️ assign()
✔️ groupby()
✔️ describe()
✔️ sample()
✔️ replace()
✔️ rename()
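A quick demonstration of several of these on a hypothetical toy DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "city": ["Lagos", "Lagos", "Accra", "Accra"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [10, 12, np.nan, 9],
})

print(df.head())        # first rows
df.info()               # dtypes and non-null counts
print(df.describe())    # summary statistics

df = df.fillna(0)                                # replace missing values
df = df.rename(columns={"sales": "units"})       # rename a column
print(df.groupby("city")["units"].sum())         # aggregate per group
print(df.pivot(index="city", columns="year", values="units"))  # long -> wide layout
print(df.query("units > 9"))                     # filter rows with an expression
```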
Data Analyst Interview Questions
[Python, SQL, Power BI]
1. Is indentation required in Python?
Ans: Indentation is necessary in Python. It specifies a block of code: all code within loops, classes, functions, etc. is placed within an indented block, usually indented with four spaces. If your code is not indented correctly, it will not run as intended and will throw errors (see the example below).
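A minimal illustration of how indentation defines blocks in Python:

```python
def classify(score):
    # everything indented under "def" belongs to the function body
    if score >= 50:
        return "pass"   # indented again: belongs to the if-block
    return "fail"       # dedented: runs only when the condition is False

print(classify(72))  # pass
print(classify(30))  # fail
```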
2. What are Entities and Relationships?
Ans:
Entity: an entity is a real-world object that can be easily identified. For example, in a college database, students, professors, workers, departments, and projects can be referred to as entities.
Relationship: a relation or link between entities that have something to do with each other. For example, the employees table in a company's database can be associated with the salary table in the same database.
3. What are Aggregate and Scalar functions?
Ans: An aggregate function performs an operation on a collection of values and returns a single scalar value. Aggregate functions are often used with the GROUP BY and HAVING clauses of the SELECT statement. A scalar function returns a single value based on the input value.
4. What are Custom Visuals in Power BI?
Ans: Custom visuals are like any other visualizations generated in Power BI, the only difference being that they are developed with a custom SDK, using languages and libraries such as JavaScript and jQuery.
ENJOY LEARNING!