Code With Python
This channel delivers clear, practical content for developers, covering Python, Django, Data Structures, Algorithms, and DSA – perfect for learning, coding, and mastering key programming skills.
Admin: @HusseinSheikho || @Hussein_Sheikho
from nltk.corpus import stopwords
# nltk.download('stopwords') # Run once
stop_words = set(stopwords.words('english'))
# `words` is assumed to be the token list built in an earlier step
filtered = [w for w in words if w.lower() not in stop_words]


VII. Word Normalization (Stemming & Lemmatization)

• Stemming (strip suffixes to get a stem, which is not always a dictionary word).
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed = ps.stem("running") # 'run'

• Lemmatization (reduce words to their dictionary form).
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') # Run once
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("better", pos="a") # 'good'
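
The difference between the two approaches above can be sketched without NLTK. This is a toy illustration only: `naive_stem`, `toy_lemmatize`, `SUFFIXES`, and `LEMMA_TABLE` are hypothetical names, and the code is neither the Porter algorithm nor WordNet. It only shows that stemming applies suffix rules while lemmatization depends on a lookup.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, tiny rule set
LEMMA_TABLE = {"better": "good", "ran": "run", "mice": "mouse"}  # hypothetical

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in a dictionary; unknown words pass through."""
    return LEMMA_TABLE.get(word, word)

for w in ("jumped", "cats", "running", "better"):
    print(w, naive_stem(w), toy_lemmatize(w))
```

The rule-based function turns "running" into "runn" and cannot relate "better" to "good", which is why real stemmers carry many more rules and real lemmatizers need a dictionary plus a part-of-speech hint.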


VIII. Advanced NLP Analysis
(Requires pip install spacy and python -m spacy download en_core_web_sm)

• Part-of-Speech (POS) Tagging.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc:
    print(token.text, token.pos_)

• Named Entity Recognition (NER).
for ent in doc.ents:
    print(ent.text, ent.label_) # Apple ORG, U.K. GPE

• Get word frequency distribution.
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize # tokenizer data must be downloaded once
fdist = FreqDist(word_tokenize("this is a test this is only a test"))
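
For a quick check without NLTK, the same kind of count can be sketched with the standard library. This assumes plain whitespace splitting is good enough for the text, which is a simplification compared with `word_tokenize`.

```python
from collections import Counter

text = "this is a test this is only a test"

# Whitespace split is an assumption; it does not handle punctuation.
counts = Counter(text.split())

print(counts["this"])        # occurrences of a single word
print(counts.most_common())  # all words, most frequent first
```

`FreqDist` builds on `Counter`, so methods such as `most_common()` behave the same way on both.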


IX. Text Formatting & Encoding

• Format strings with f-strings.
name = "Alice"
age = 30
message = f"Name: {name}, Age: {age}"

• Pad a string with leading zeros.
number = "42".zfill(5) # '00042'

• Encode a string to bytes.
byte_string = "hello".encode('utf-8')

• Decode bytes to a string.
original_string = byte_string.decode('utf-8')
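
The encode and decode steps above matter most with non-ASCII text. A minimal sketch, using a sample word chosen only for illustration, shows that character count and byte count can differ and what happens when the wrong codec is used:

```python
text = "caf\u00e9"  # 'café', written with an escape so the code point is explicit

data = text.encode("utf-8")
print(len(text), len(data))  # characters vs. bytes

# Decoding with the codec used for encoding restores the original text.
round_trip = data.decode("utf-8")

# Decoding with a different codec yields mojibake rather than an error here.
wrong = data.decode("latin-1")

# Encoding to a codec that lacks the character needs an error policy.
ascii_safe = text.encode("ascii", errors="replace")

print(round_trip, wrong, ascii_safe)
```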


X. Text Vectorization
(Requires pip install scikit-learn)

• Create a Bag-of-Words (BoW) model.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is the first document.", "This is the second document."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

• Get feature names (the vocabulary).
print(vectorizer.get_feature_names_out())

• Create a TF-IDF model.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
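
To see what these vectorizers compute, here is a rough pure-Python sketch of the counting step and of the IDF weight. It assumes scikit-learn's default settings (lowercasing, tokens of two or more word characters, and smoothed IDF); the helper `tokenize` is a hypothetical stand-in, not part of scikit-learn, and the final L2 normalization that `TfidfVectorizer` applies is left out.

```python
import math
import re
from collections import Counter

corpus = ["This is the first document.", "This is the second document."]

def tokenize(text):
    # Assumed to mirror the default token pattern: 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in corpus]

# Vocabulary: sorted unique tokens; each becomes one column.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Bag-of-Words: one row per document, one count per vocabulary term.
bow = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(corpus)
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocabulary}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

print(vocabulary)
print(bow)
print(idf)
```

Terms that appear in every document, such as "document", get the minimum weight, while terms unique to one document, such as "first", are weighted higher.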


XI. More String Utilities

• Center a string within a width.
centered = "Hello".center(20, '-') # '-------Hello--------'

• Check if a string is in title case.
"This Is A Title".istitle() # True

• Find the highest index of a substring.
"test test".rfind("test") # Returns 5

• Split from the right.
"path/to/file.txt".rsplit('/', 1) # ['path/to', 'file.txt']

• Create a character translation table.
table = str.maketrans('aeiou', '12345')
vowels_to_num = "hello".translate(table) # 'h2ll4'

• Remove a specific prefix (Python 3.9+).
"TestCase".removeprefix("Test") # 'Case'

• Remove a specific suffix (Python 3.9+).
"filename.txt".removesuffix(".txt") # 'filename'

• Check for unicode decimal characters.
"½".isdecimal() # False
"123".isdecimal() # True

• Check for unicode numeric characters.
"½".isnumeric() # True
"²".isnumeric() # True

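The two checks above are easier to tell apart next to `isdigit()`. A small sketch, where the sample characters and the helper `safe_int` are chosen only for illustration, compares all three and shows which strings `int()` accepts:

```python
samples = {
    "5": "ASCII digit",
    "\u00b2": "superscript two",
    "\u00bd": "vulgar fraction one half",
    "\u0663": "Arabic-Indic digit three",
}

# (isdecimal, isdigit, isnumeric) for each sample character
checks = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}

def safe_int(text):
    """Return int(text), or None when int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None

for ch, label in samples.items():
    print(label, checks[ch], safe_int(ch))
```

Only strings that pass `isdecimal()` convert with `int()`, so it is the safer check before parsing user input.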

#Python #TextProcessing #NLP #RegEx #NLTK

━━━━━━━━━━━━━━━
By: @DataScience4