Python | Machine Learning | Coding | R
67.3K subscribers
1.25K photos
89 videos
153 files
907 links
Help and ads: @hussein_sheikho

Discover powerful insights with Python, Machine Learning, Coding, and R—your essential toolkit for data-driven solutions, smart alg

List of our channels:
https://t.iss.one/addlist/8_rRW2scgfRhOTc0

https://telega.io/?r=nikapsOH
Download Telegram
Pandas Data Cleaning (Guide)

🔑 Tags: #Pandas #DataCleaning #ML

https://t.iss.one/DataScienceM
Please open Telegram to view this post
VIEW IN TELEGRAM
👍6
Please open Telegram to view this post
VIEW IN TELEGRAM
👍8
#NLP #Lesson #SentimentAnalysis #MachineLearning

Building an NLP Model from Scratch: Sentiment Analysis

This lesson will guide you through creating a complete Natural Language Processing (NLP) project. We will build a sentiment analysis classifier that can determine if a piece of text is positive or negative.

---

Step 1: Setup and Data Preparation

First, we need to import the necessary libraries and prepare our dataset. For simplicity, we'll use a small, hard-coded list of sentences. In a real-world project, you would load this data from a file (e.g., a CSV).

#Python #DataPreparation

# Imports and Data
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords

# You may need to download stopwords for the first time
# nltk.download('stopwords')

# Sample Data (In a real project, load this from a file)
texts = [
"I love this movie, it's fantastic!",
"This was a terrible film.",
"The acting was superb and the plot was great.",
"I would not recommend this to anyone.",
"It was an okay movie, not the best but enjoyable.",
"Absolutely brilliant, a must-see!",
"A complete waste of time and money.",
"The story was compelling and engaging."
]
# Labels: 1 for Positive, 0 for Negative
labels = [1, 0, 1, 0, 1, 1, 0, 1]


---

Step 2: Text Preprocessing

Computers don't understand words, so we must clean and process our text data first. This involves making text lowercase, removing punctuation, and filtering out common "stop words" (like 'the', 'a', 'is') that don't add much meaning.

#TextPreprocessing #DataCleaning

# Text Preprocessing Function
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
# Make text lowercase
text = text.lower()
# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)
# Tokenize and remove stopwords
tokens = text.split()
filtered_tokens = [word for word in tokens if word not in stop_words]
return " ".join(filtered_tokens)

# Apply preprocessing to our dataset
processed_texts = [preprocess_text(text) for text in texts]
print("--- Original vs. Processed ---")
for i in range(3):
print(f"Original: {texts[i]}")
print(f"Processed: {processed_texts[i]}\n")


---

Step 3: Splitting the Data

We must split our data into a training set (to teach the model) and a testing set (to evaluate its performance on unseen data).

#MachineLearning #TrainTestSplit

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
processed_texts,
labels,
test_size=0.25, # Use 25% of data for testing
random_state=42 # for reproducibility
)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")


---

Step 4: Feature Extraction (Vectorization)

We need to convert our cleaned text into a numerical format. We'll use TF-IDF (Term Frequency-Inverse Document Frequency). This technique converts text into vectors of numbers, giving more weight to words that are important to a document but not common across all documents.

#FeatureEngineering #TFIDF #Vectorization
2
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]}, index=['x', 'y', 'z'])
print(df.at['y', 'A'])

2


#24. df.iat[]
Accesses a single value by row/column integer position. Faster than .iloc.

import pandas as pd
df = pd.DataFrame({'A': [10, 20, 30]})
print(df.iat[1, 0])

20


#25. df.sample()
Returns a random sample of items from an axis of object.

import pandas as pd
df = pd.DataFrame({'A': range(10)})
print(df.sample(n=3))

A
8 8
2 2
5 5
(Note: Output rows will be random)

---
#DataAnalysis #Pandas #DataCleaning #Manipulation

Part 3: Pandas - Data Cleaning & Manipulation

#26. df.dropna()
Removes missing values.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3]})
print(df.dropna())

A
0 1.0
2 3.0


#27. df.fillna()
Fills missing (NA/NaN) values using a specified method.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3]})
print(df.fillna(0))

A
0 1.0
1 0.0
2 3.0


#28. df.astype()
Casts a pandas object to a specified dtype.

import pandas as pd
df = pd.DataFrame({'A': [1.1, 2.7, 3.5]})
df['A'] = df['A'].astype(int)
print(df)

A
0 1
1 2
2 3


#29. df.rename()
Alters axes labels.

import pandas as pd
df = pd.DataFrame({'a': [1], 'b': [2]})
df_renamed = df.rename(columns={'a': 'A', 'b': 'B'})
print(df_renamed)

A  B
0 1 2


#30. df.drop()
Drops specified labels from rows or columns.

import pandas as pd
df = pd.DataFrame({'A': [1], 'B': [2], 'C': [3]})
df_dropped = df.drop(columns=['B'])
print(df_dropped)

A  C
0 1 3


#31. pd.to_datetime()
Converts argument to datetime.

import pandas as pd
s = pd.Series(['2023-01-01', '2023-01-02'])
dt_s = pd.to_datetime(s)
print(dt_s)

0   2023-01-01
1 2023-01-02
dtype: datetime64[ns]


#32. df.apply()
Applies a function along an axis of the DataFrame.

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df['B'] = df['A'].apply(lambda x: x * 2)
print(df)

A  B
0 1 2
1 2 4
2 3 6


#33. df['col'].map()
Maps values of a Series according to an input mapping or function.

import pandas as pd
df = pd.DataFrame({'Gender': ['M', 'F', 'M']})
df['Gender_Full'] = df['Gender'].map({'M': 'Male', 'F': 'Female'})
print(df)

Gender Gender_Full
0 M Male
1 F Female
2 M Male


#34. df.replace()
Replaces values given in to_replace with value.

import pandas as pd
df = pd.DataFrame({'Score': [10, -99, 15, -99]})
df_replaced = df.replace(-99, 0)
print(df_replaced)

Score
0 10
1 0
2 15
3 0


#35. df.duplicated()
Returns a boolean Series denoting duplicate rows.

import pandas as pd
df = pd.DataFrame({'A': [1, 2, 1], 'B': ['a', 'b', 'a']})
print(df.duplicated())

0    False
1 False
2 True
dtype: bool


#36. df.drop_duplicates()
Returns a DataFrame with duplicate rows removed.