Code With Python

👩‍💻 Complete FastAPI Course in Python!

🔗 Link: https://www.youtube.com/playlist?list=PL8VzFQ8k4U1L5QpSapVEzoSfob-4CR8zM

⭐️ Tags:
#PythonCheatSheet #PythonProgramming #DataScience #CodingTips #Python3 #LearnPython #ProgrammingGuide #PythonSyntax #CodeSnippets #DataStructures #OOP #Regex #ErrorHandling #PythonLibraries #CodingReference #PythonTricks #TechResources #DeveloperTools #PythonForBeginners #AdvancedPython



import nltk
from nltk.corpus import stopwords
# nltk.download('stopwords')  # Run once
stop_words = set(stopwords.words('english'))
# `words` is a list of tokens defined earlier
filtered = [w for w in words if w.lower() not in stop_words]
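
The same filter can be tried without downloading any NLTK data. A minimal sketch, assuming a tiny hand-picked stop word set and a whitespace-split sentence (both made up for illustration; NLTK's English list is much longer):

```python
# Hypothetical mini stop word list; NLTK's English list is far larger.
stop_words = {"the", "is", "a", "of", "and", "over"}

# Assumed input: tokens from a simple whitespace split.
words = "The quick brown fox jumps over the lazy dog".split()

# Lowercase each token before the membership test so "The" matches "the".
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

Keeping the stop words in a set makes each membership test a constant-time lookup.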

VII. Word Normalization (Stemming & Lemmatization)

• Stemming (strip suffixes to reach a word stem, which may not be a dictionary word).

from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed = ps.stem("running") # 'run'

• Lemmatization (reduce words to their dictionary form; pos defaults to noun, so pass it explicitly for other parts of speech).

from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') # Run once
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("better", pos="a") # 'good'

VIII. Advanced NLP Analysis
(Requires pip install spacy and python -m spacy download en_core_web_sm)

• Part-of-Speech (POS) Tagging.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc:
    print(token.text, token.pos_)

• Named Entity Recognition (NER).

for ent in doc.ents:
    print(ent.text, ent.label_) # e.g. Apple ORG, U.K. GPE (output depends on the model)

• Get word frequency distribution.

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize # needs NLTK's punkt tokenizer data
fdist = FreqDist(word_tokenize("this is a test this is only a test"))
fdist["test"] # 2
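
FreqDist is a subclass of collections.Counter, so the counting itself can be sketched with the standard library alone. This version assumes a plain whitespace split in place of word_tokenize:

```python
from collections import Counter

# Whitespace split stands in for a real tokenizer here (an assumption).
tokens = "this is a test this is only a test".split()
counts = Counter(tokens)

print(counts["test"])  # 2
print(counts["only"])  # 1
print(len(counts))     # 5 distinct tokens
```

Counter methods such as most_common() are available on a FreqDist as well.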

IX. Text Formatting & Encoding

• Format strings with f-strings.

name = "Alice"
age = 30
message = f"Name: {name}, Age: {age}"

• Pad a string with leading zeros.

number = "42".zfill(5) # '00042'
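
Zero padding can also be written as an f-string format spec, which ties this to the f-string entry above. A short sketch (variable names are illustrative):

```python
padded = "42".zfill(5)      # '00042'
signed = "-42".zfill(5)     # '-0042' (the sign stays in front of the zeros)
formatted = f"{42:05d}"     # '00042', same padding applied to an int
rounded = f"{3.14159:.2f}"  # '3.14', a format spec can also round

print(padded, signed, formatted, rounded)
```

zfill works on strings; the format spec works directly on numbers, so no str() call is needed.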

• Encode a string to bytes.

byte_string = "hello".encode('utf-8')

• Decode bytes to a string.

original_string = byte_string.decode('utf-8')
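
Encoding matters most once the text contains non-ASCII characters. A small round-trip sketch, using a made-up sample string, that also shows what happens when the wrong codec is used to decode:

```python
text = "héllo"
data = text.encode("utf-8")      # b'h\xc3\xa9llo'
print(len(text), len(data))      # 5 6 (é takes two bytes in UTF-8)

round_trip = data.decode("utf-8")  # back to 'héllo'

# Decoding with the wrong codec fails unless an error handler is given.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# errors="replace" swaps each undecodable byte for U+FFFD.
lenient = data.decode("ascii", errors="replace")
```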

X. Text Vectorization
(Requires pip install scikit-learn)

• Create a Bag-of-Words (BoW) model.

from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is the first document.", "This is the second document."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

• Get feature names (the vocabulary).

print(vectorizer.get_feature_names_out())
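
To see what the Bag-of-Words matrix contains, the counting can be sketched in pure Python. This is a simplified illustration, not scikit-learn's implementation; it assumes CountVectorizer's documented defaults (lowercasing, tokens of two or more word characters, vocabulary in sorted order):

```python
import re
from collections import Counter

corpus = ["This is the first document.", "This is the second document."]

def tokenize(text):
    # Lowercase, then keep runs of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# One column per distinct token, in sorted order.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document: how often each vocabulary term occurs in it.
matrix = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['document', 'first', 'is', 'second', 'the', 'this']
print(matrix)      # [[1, 1, 1, 0, 1, 1], [1, 0, 1, 1, 1, 1]]
```

scikit-learn returns the same counts as a sparse matrix; X.toarray() converts it to a dense array for inspection.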

• Create a TF-IDF model.

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
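
The TF-IDF weights can be reproduced by hand on the same two documents. A rough sketch, assuming TfidfVectorizer's documented defaults (raw counts as term frequency, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization per row); it is an illustration, not the library's code:

```python
import math
import re
from collections import Counter

corpus = ["This is the first document.", "This is the second document."]
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
vocabulary = sorted({tok for doc in docs for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf = []
for doc in docs:
    counts = Counter(doc)
    row = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
    tfidf.append([v / norm for v in row])

for term, weight in zip(vocabulary, tfidf[0]):
    print(f"{term:10s}{weight:.3f}")
```

In the first row, "first" gets the largest weight because it appears in only one document, while terms shared by both documents are weighted lower.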

XI. More String Utilities

• Center a string within a width.

centered = "Hello".center(20, '-') # '-------Hello--------'

• Check if a string is in title case.

"This Is A Title".istitle() # True

• Find the highest index of a substring.

"test test".rfind("test") # Returns 5

• Split from the right.

"path/to/file.txt".rsplit('/', 1) # ['path/to', 'file.txt']

• Create a character translation table.

table = str.maketrans('aeiou', '12345')
vowels_to_num = "hello".translate(table) # 'h2ll4'
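
A common text-cleaning use of the same mechanism is deleting characters: the third argument of str.maketrans lists characters to remove. A short sketch with a made-up sentence:

```python
import string

# Empty 'from' and 'to' strings; the third argument lists characters to delete.
strip_punct = str.maketrans("", "", string.punctuation)

cleaned = "Hello, world! (It's 5 o'clock.)".translate(strip_punct)
print(cleaned)  # Hello world Its 5 oclock
```

string.punctuation covers ASCII punctuation only, so typographic quotes and similar characters are left in place.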

• Remove a specific prefix (Python 3.9+).

"TestCase".removeprefix("Test") # 'Case'

• Remove a specific suffix (Python 3.9+).

"filename.txt".removesuffix(".txt") # 'filename'

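removeprefix and removesuffix exist because strip, lstrip, and rstrip treat their argument as a set of characters rather than a substring. A sketch of the pitfall, plus a hypothetical helper (strip_suffix is not a built-in) for interpreters older than Python 3.9:

```python
name = "report.txt"

# rstrip removes any trailing '.', 't', or 'x', so it eats into the name.
wrong = name.rstrip(".txt")
print(wrong)  # repor

def strip_suffix(text, suffix):
    """Drop suffix only if text really ends with it (pre-3.9 fallback)."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

right = strip_suffix(name, ".txt")
print(right)  # report
```
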
• Check for Unicode decimal characters.

"½".isdecimal() # False
"123".isdecimal() # True

• Check for Unicode numeric characters.

"½".isnumeric() # True
"²".isnumeric() # True

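The three numeric checks nest: every decimal character is a digit, and every digit is numeric, but not the other way round. A quick side-by-side sketch over a few sample strings chosen for illustration:

```python
samples = ["123", "٣", "²", "½", "12.5"]

# Tuple order: (isdecimal, isdigit, isnumeric)
results = {s: (s.isdecimal(), s.isdigit(), s.isnumeric()) for s in samples}
for s, flags in results.items():
    print(repr(s), flags)

# '123'  -> (True, True, True)
# '٣'    -> (True, True, True)     Arabic-Indic digit three
# '²'    -> (False, True, True)    superscript two
# '½'    -> (False, False, True)   vulgar fraction one half
# '12.5' -> (False, False, False)  the dot is not numeric
```

isdecimal is the strictest of the three, which makes it the safest pre-check before calling int() on a string.
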
#Python #TextProcessing #NLP #RegEx #NLTK

━━━━━━━━━━━━━━━
By: @DataScience4 ✨
