Here are 7 FREE courses that will make you smarter:
1. Negotiating Salary:
Learn how to get the pay you deserve by mastering the art of negotiation.
https://pll.harvard.edu/course/negotiating-salary
Share this telegram channel with your friends: https://t.iss.one/udacityfreecourse
2. Entrepreneurship:
Learn how to build a successful business.
https://pll.harvard.edu/course/technology-entrepreneurship-lab-market
3. Intro to AI:
A beginner's guide to artificial intelligence and its applications in the real world.
https://pll.harvard.edu/course/cs50s-introduction-artificial-intelligence-python
4. Managing Happiness:
Did you know you can learn how to be happier?
Learn how!
https://pll.harvard.edu/course/managing-happiness
5. Mobile App Development:
Learn how to create your own mobile app and reach a wider audience.
https://cs50.harvard.edu/mobile/2018/
6. Entrepreneurship in Emerging Economies:
Learn how to start a successful business in countries where the economy is growing fast.
https://pll.harvard.edu/course/entrepreneurship-in-emerging-economies
7. Web Programming:
Learn how to build your own website.
https://pll.harvard.edu/course/cs50s-web-programming-python-and-javascript
The 4 Projects That Can Land You a Data Science Job (Even Without Experience)
Recruiters don't want to see more certificates; they want proof you can solve real-world problems. That's where the right projects come in. Not toy datasets, but projects that demonstrate storytelling, problem-solving, and impact.
Here are 4 killer projects that'll make your portfolio stand out:
🔹 1. Exploratory Data Analysis (EDA) on a Real-World Dataset
Pick a messy dataset from Kaggle or public sources. Show your thought process.
✅ Clean data using Pandas
✅ Visualize trends with Seaborn/Matplotlib
✅ Share actionable insights with graphs and markdown
Bonus: Turn it into a Jupyter Notebook with detailed storytelling
🔹 2. Predictive Modeling with ML
Solve a real problem using machine learning. For example:
✅ Predict customer churn using Logistic Regression
✅ Predict housing prices with Random Forest or XGBoost
✅ Use scikit-learn for training + evaluation
Bonus: Add SHAP or feature importance to explain predictions
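A minimal sketch of the churn example, assuming scikit-learn and NumPy are installed. The data is synthetic, and the feature names (`tenure_months`, `monthly_charges`, `support_calls`) are hypothetical stand-ins for a real churn table. For a linear model, the coefficients on standardized features serve as a simple form of feature importance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a churn table (hypothetical features, not real data).
rng = np.random.default_rng(0)
n_customers = 1000
tenure_months = rng.integers(1, 72, size=n_customers)
monthly_charges = rng.uniform(20, 120, size=n_customers)
support_calls = rng.poisson(2, size=n_customers)

# Assumed ground truth: short tenure, high charges and many support calls
# raise the probability of churn.
logit = -0.05 * tenure_months + 0.02 * monthly_charges + 0.4 * support_calls - 1.0
churned = rng.random(n_customers) < 1 / (1 + np.exp(-logit))

feature_names = ["tenure_months", "monthly_charges", "support_calls"]
X = np.column_stack([tenure_months, monthly_charges, support_calls])
y = churned.astype(int)

# Hold out 25% of the customers for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, model.predict(X_test))
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Coefficients on standardized features: sign gives direction, size gives strength.
coefficients = dict(
    zip(feature_names, model.named_steps["logisticregression"].coef_[0])
)

print(f"accuracy={test_accuracy:.3f}, ROC AUC={test_auc:.3f}")
for name, value in coefficients.items():
    print(f"{name}: {value:+.2f}")
```

On a real dataset, SHAP values or a tree model's feature importances would play the role the coefficients play here.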
🔹 3. SQL-Powered Business Dashboard
Use real sales or ecommerce data to build a dashboard.
✅ Write complex SQL queries for KPIs
✅ Visualize with Power BI or Tableau
✅ Show trends: Revenue by Region, Product Performance, etc.
Bonus: Add filters & slicers to make it interactive
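As a rough illustration of the KPI queries such a dashboard sits on, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table, its columns, and the rows are made up for the example; a real project would point the same kind of query at actual sales data and feed the result to Power BI or Tableau.

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, product TEXT, "
    "quantity INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, "North", "Laptop", 2, 900.0),
        (2, "North", "Mouse", 10, 20.0),
        (3, "South", "Laptop", 1, 900.0),
        (4, "South", "Monitor", 3, 200.0),
        (5, "West", "Mouse", 5, 20.0),
    ],
)

# KPI 1: revenue and order count by region, highest revenue first.
revenue_by_region = conn.execute(
    """
    SELECT region,
           SUM(quantity * unit_price) AS revenue,
           COUNT(*) AS order_count
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
    """
).fetchall()

# KPI 2: product performance, ranked by revenue.
revenue_by_product = conn.execute(
    """
    SELECT product, SUM(quantity * unit_price) AS revenue
    FROM orders
    GROUP BY product
    ORDER BY revenue DESC
    """
).fetchall()

print(revenue_by_region)
print(revenue_by_product)
conn.close()
```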
🔹 4. End-to-End Data Science Pipeline Project
Build a complete pipeline from scratch.
✅ Collect data via web scraping (e.g., IMDb, LinkedIn Jobs)
✅ Clean + Analyze + Model + Deploy
✅ Deploy with Streamlit/Flask + GitHub + Render
Bonus: Add a blog post or LinkedIn write-up explaining your approach
One solid project > 10 certificates.
Make it visible. Make it valuable. Share it confidently.
I have curated the best interview resources to crack Data Science Interviews:
https://whatsapp.com/channel/0029Va8v3eo1NCrQfGMseL2D
Like if you need similar content
Statistics Interview Questions
Topics to Cover:
• Descriptive statistics
• Probability
• Hypothesis testing
• Regression analysis
Questions and Answers:
1 Q: What is the difference between descriptive and inferential statistics?
A: Descriptive statistics summarize the main features of a dataset (e.g., mean, median, mode), while inferential statistics use samples to make inferences about a larger population.
2 Q: Define p-value in hypothesis testing.
A: The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. A low p-value (conventionally below 0.05) is treated as evidence against the null hypothesis; it is not the probability that the null hypothesis is true.
3 Q: What is the central limit theorem?
A: The central limit theorem states that the distribution of the sample mean of independent, identically distributed observations approaches a normal distribution as the sample size becomes large, regardless of the population's distribution, provided the population has finite variance.
4 Q: Explain the concept of correlation.
A: Correlation measures the strength and direction of the linear relationship between two variables. It ranges from -1 (perfect negative) to +1 (perfect positive), with 0 indicating no linear relationship.
5 Q: What is linear regression?
A: Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.
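A few of the answers above can be checked numerically. The sketch below assumes NumPy is available and uses simulated data only: it draws sample means from a skewed (exponential) population to illustrate the central limit theorem, then computes a correlation coefficient and a least-squares line on data generated with a known slope of 3 and intercept of 2.

```python
import numpy as np

rng = np.random.default_rng(42)

# Central limit theorem: means of samples drawn from a skewed population.
# An exponential distribution with scale=1 has mean 1 and standard deviation 1.
sample_size = 50
sample_means = rng.exponential(scale=1.0, size=(10_000, sample_size)).mean(axis=1)

# Theory predicts the sample means cluster around 1 with
# standard deviation 1 / sqrt(sample_size).
expected_std = 1.0 / np.sqrt(sample_size)
print(f"mean of sample means: {sample_means.mean():.3f}")
print(f"std of sample means:  {sample_means.std():.3f} (theory: {expected_std:.3f})")

# Correlation and linear regression on data with a known linear relationship.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation coefficient
slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line

print(f"correlation r = {r:.3f}")
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```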
I have curated 80+ top-notch Data Analytics resources:
https://whatsapp.com/channel/0029VaGgzAk72WTmQFERKh02
Like if it helps :)
Forwarded from Python Projects & Resources
6 Free AI Certification Courses To Upskill In 2025
Whether you're a student, aspiring data analyst, software enthusiast, or just curious about AI, now's the perfect time to dive in.
These 6 beginner-friendly, completely free AI courses come from top institutions like Google, IBM, Harvard, and more.
Link:
https://pdlink.in/4d0SrTG
Enroll for FREE & Get Certified
Essential statistics topics for data science
1. Descriptive statistics: Measures of central tendency, measures of dispersion, and graphical representations of data.
2. Inferential statistics: Hypothesis testing, confidence intervals, and regression analysis.
3. Probability theory: Concepts of probability, random variables, and probability distributions.
4. Sampling techniques: Simple random sampling, stratified sampling, and cluster sampling.
5. Statistical modeling: Linear regression, logistic regression, and time series analysis.
6. Machine learning algorithms: Supervised learning, unsupervised learning, and reinforcement learning.
7. Bayesian statistics: Bayesian inference, Bayesian networks, and Markov chain Monte Carlo methods.
8. Data visualization: Techniques for visualizing data and communicating insights effectively.
9. Experimental design: Designing experiments, analyzing experimental data, and interpreting results.
10. Big data analytics: Handling large volumes of data using tools like Hadoop, Spark, and SQL.
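To make item 4 concrete, here is a small sketch of the three sampling techniques, assuming pandas 1.1 or newer (for `DataFrameGroupBy.sample`). The customer table, the `segment` labels, and the `store` clusters are invented for illustration.

```python
import pandas as pd

# Hypothetical population: 100 customers in two segments of unequal size.
df = pd.DataFrame({
    "customer_id": range(1, 101),
    "segment": ["retail"] * 80 + ["enterprise"] * 20,
})
# Assign each customer to one of 10 stores (the clusters), 10 customers each.
df["store"] = df["customer_id"] % 10

# Simple random sampling: every row has the same chance of being picked.
simple_sample = df.sample(n=20, random_state=0)

# Stratified sampling: take the same fraction from each segment so the
# sample keeps the population's 80/20 proportions.
stratified_sample = df.groupby("segment").sample(frac=0.2, random_state=0)
segment_counts = stratified_sample["segment"].value_counts().to_dict()

# Cluster sampling: pick whole stores at random and keep all of their customers.
chosen_stores = pd.Series(df["store"].unique()).sample(n=3, random_state=0)
cluster_sample = df[df["store"].isin(chosen_stores)]

print(len(simple_sample), segment_counts, len(cluster_sample))
```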
Best Data Science & Machine Learning Resources: https://topmate.io/coding/914624
Credits: https://t.iss.one/datasciencefun
Like if you need similar content
Exploratory Data Analysis (EDA)
EDA is the process of analyzing datasets to summarize key patterns, detect anomalies, and gain insights before applying machine learning or reporting.
1️⃣ Descriptive Statistics
Descriptive statistics help summarize and understand data distributions.
In SQL:
Calculate Mean (Average):
SELECT AVG(salary) AS average_salary FROM employees;
Find Median (Using Window Functions)
SELECT AVG(salary) AS median_salary
FROM (
  SELECT salary,
         ROW_NUMBER() OVER (ORDER BY salary) AS row_num,
         COUNT(*) OVER () AS total_rows
  FROM employees
) subquery
-- Middle row for an odd row count, average of the two middle rows for an even count
WHERE row_num IN (FLOOR((total_rows + 1) / 2.0), CEIL((total_rows + 1) / 2.0));
Find Mode (Most Frequent Value)
SELECT department, COUNT(*) AS count FROM employees GROUP BY department ORDER BY count DESC LIMIT 1;
Calculate Variance & Standard Deviation
SELECT VARIANCE(salary) AS salary_variance, STDDEV(salary) AS salary_std_dev FROM employees;
In Python (Pandas):
Mean, Median, Mode
df['salary'].mean()
df['salary'].median()
df['salary'].mode()[0]
Variance & Standard Deviation
df['salary'].var()
df['salary'].std()
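2️⃣ Data Visualization
Visualizing data helps identify trends, outliers, and patterns.
In SQL (For Basic Visualization in Some Databases Like PostgreSQL):
Create Histogram (Approximate in SQL)
-- Group salaries into bins; the bin width of 10000 is an arbitrary choice
SELECT FLOOR(salary / 10000) * 10000 AS salary_bin, COUNT(*) AS employee_count FROM employees GROUP BY FLOOR(salary / 10000) * 10000 ORDER BY salary_bin;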
In Python (Matplotlib & Seaborn):
Bar Chart (Category-Wise Sales)
import matplotlib.pyplot as plt
import seaborn as sns
df.groupby('category')['sales'].sum().plot(kind='bar')
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.show()
Histogram (Salary Distribution)
sns.histplot(df['salary'], bins=10, kde=True)
plt.title('Salary Distribution')
plt.show()
Box Plot (Outliers in Sales Data)
sns.boxplot(y=df['sales'])
plt.title('Sales Data Outliers')
plt.show()
Heatmap (Correlation Between Variables)
# numeric_only=True skips text columns such as 'category' (pandas 1.5+)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()
3️⃣ Detecting Anomalies & Outliers
Outliers can skew results and should be identified.
In SQL:
Find records with unusually high salaries
SELECT * FROM employees WHERE salary > (SELECT AVG(salary) + 2 * STDDEV(salary) FROM employees);
In Python (Pandas & NumPy):
Using Z-Score (Values Beyond 3 Standard Deviations)
from scipy import stats
df['z_score'] = stats.zscore(df['salary'])
df_outliers = df[df['z_score'].abs() > 3]
Using IQR (Interquartile Range)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
df_outliers = df[(df['salary'] < (Q1 - 1.5 * IQR)) | (df['salary'] > (Q3 + 1.5 * IQR))]
4️⃣ Key EDA Steps
Understand the Data: Check missing values, duplicates, and column types
Summarize Statistics: Mean, Median, Standard Deviation, etc.
Visualize Trends: Histograms, Box Plots, Heatmaps
Detect Outliers & Anomalies: Z-Score, IQR
Feature Engineering: Transform variables if needed
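The first step, understanding the data, can be sketched in Pandas as follows. The small `df` built here is a made-up example with one missing department, one missing salary, and one duplicated row, so the checks have something to find.

```python
import pandas as pd

# Hypothetical employee data with deliberate quality problems.
df = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "department": ["Sales", "IT", "IT", None, "Sales"],
    "salary": [50000.0, 62000.0, 62000.0, float("nan"), 58000.0],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Fully duplicated rows (the first occurrence is not counted).
duplicate_rows = int(df.duplicated().sum())

# Column types, to spot numbers stored as text and similar issues.
column_types = df.dtypes

# Summary statistics for a numeric column (missing values are ignored).
summary = df["salary"].describe()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(column_types)
print(summary)
```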
Mini Task for You: Write an SQL query to find employees whose salaries are more than two standard deviations above the mean salary.
Here you can find the roadmap for data analyst: https://t.iss.one/sqlspecialist/1159
Like this post if you want me to continue covering all the topics! ❤️
Share with credits: https://t.iss.one/sqlspecialist
Hope it helps :)
#sql
Forwarded from Python Projects & Resources
Deloitte Virtual FREE Data Analytics Certification
If you're eager to build real skills in data analytics before landing your first role, Deloitte is giving you a golden opportunity, completely free!
• No prior experience required
• Ideal for students, freshers, and aspiring data analysts
• Self-paced: complete at your convenience
Apply Here (Free):
https://pdlink.in/4iKcgA4
Enroll for FREE & Get Certified
Data Cleaning Checklist:
If you're just starting out in the world of data analytics, hopefully this checklist helps demystify the concept of "data cleaning"...
✅ Missing data - Decide whether you're going to omit the data point, estimate the missing value using statistical methods (imputation), or use an external source to fill it in.
✅ Duplicate data - Identify duplicate data and what it means in context. Is the duplicate an error that needs to be deleted? Or is it possible that two identical data points legitimately exist?
✅ Formatting errors - Ensure all data is rounded to the correct decimal place, all data is aligned correctly, and the data format is consistent within columns.
✅ Incorrect data types - Ensure all of your data is pulled as the correct data type (e.g., making sure that money values are not read as integers, which would drop the cents).
✅ Outliers - Identify data points that are more than 2 standard deviations from the mean, and double-check that these values are correct. If they are correct, they may require further investigation.
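The checklist can be walked through in Pandas roughly as below. This is only a sketch: the `raw` table is invented, median imputation is one choice among the options listed for missing data, and the outlier step flags values rather than deleting them.

```python
import pandas as pd

# Hypothetical raw export with one duplicate row, one missing amount,
# amounts stored as text, and inconsistent country formatting.
raw = pd.DataFrame({
    "order_id": [101, 102, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111],
    "amount": ["19.99", "5.00", "5.00", None, "12.50", "980.00",
               "22.00", "8.75", "15.00", "30.00", "18.25", "11.00"],
    "country": ["us", "US", "US", "de", "DE ", "us",
                "US", "de", "DE", "us", "US", "de"],
})

# Duplicate data: drop rows that are exact copies of an earlier row.
cleaned = raw.drop_duplicates().copy()

# Incorrect data types: amounts arrive as text, so convert them to numbers.
cleaned["amount"] = pd.to_numeric(cleaned["amount"], errors="coerce")

# Formatting errors: trim whitespace and use one casing for country codes.
cleaned["country"] = cleaned["country"].str.strip().str.upper()

# Missing data: estimate the missing amount with the median of known amounts.
median_amount = cleaned["amount"].median()
cleaned["amount"] = cleaned["amount"].fillna(median_amount)

# Outliers: flag amounts more than 2 standard deviations from the mean
# so they can be double-checked.
z_scores = (cleaned["amount"] - cleaned["amount"].mean()) / cleaned["amount"].std()
cleaned["is_outlier"] = z_scores.abs() > 2

print(cleaned)
```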
Why is it required to split our data into three parts: train, validation, and test?
• The training set is used to fit the model, i.e. to train the model with the data.
• The validation set is then used to evaluate the model on data it was not fitted on while fine-tuning hyperparameters. Choosing the settings that score best on held-out data, rather than on the training data, improves the generalization of the model.
• Finally, a test data set which the model has never "seen" before should be used for the final evaluation of the model. Because the hyperparameters were tuned against the validation set, only the test set allows for an unbiased evaluation of the model. The evaluation should never be performed on the same data that is used for training. Otherwise the model performance would not be representative.
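One common way to get the three parts is to call scikit-learn's `train_test_split` twice. The sketch below assumes scikit-learn and NumPy are installed; the 60/20/20 proportions and the toy arrays are illustrative choices, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features and a balanced binary label.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

# First split: hold out 20% as the test set, untouched until the very end.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which gives 60% train, 20% validation, 20% test overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))
```

Hyperparameters are then compared on `X_val`, and `X_test` is scored once, after all choices are final.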