Python Cheat Sheet
• Remove stop words from a token list.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# nltk.download('stopwords') # Run once
stop_words = set(stopwords.words('english'))
words = word_tokenize("This is a simple example sentence")
filtered = [w for w in words if w.lower() not in stop_words]
VII. Word Normalization (Stemming & Lemmatization)
• Stemming (reduce words to their root form).
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed = ps.stem("running") # 'run'
• Lemmatization (reduce words to their dictionary form).
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet') # Run once
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize("better", pos="a") # 'good'
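Since the two approaches behave differently, a minimal comparison sketch may help (assuming the NLTK downloads above have been run):
from nltk.stem import PorterStemmer, WordNetLemmatizer
ps = PorterStemmer()
lem = WordNetLemmatizer()
for word in ["studies", "running", "mice"]:
    # Stemming chops suffixes by rule; lemmatization looks up dictionary forms.
    print(word, ps.stem(word), lem.lemmatize(word, pos="n"))
# studies studi study | running run running | mice mice mouse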
VIII. Advanced NLP Analysis
(Requires pip install spacy and python -m spacy download en_core_web_sm)
• Part-of-Speech (POS) Tagging.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup.")
for token in doc: print(token.text, token.pos_)
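The same token attributes can drive simple filtering; a short sketch extracting coarse noun tokens (exact tags can vary by model version):
nouns = [t.text for t in doc if t.pos_ == "NOUN"]  # e.g. ['startup']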
• Named Entity Recognition (NER).
for ent in doc.ents:
print(ent.text, ent.label_) # Apple ORG, U.K. GPE
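spaCy also bundles the displacy visualizer, which can render these entities as highlighted markup; a minimal sketch:
from spacy import displacy
html = displacy.render(doc, style="ent")  # returns HTML markup as a string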
• Get a word frequency distribution.
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
fdist = FreqDist(word_tokenize("this is a test this is only a test"))
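FreqDist subclasses collections.Counter, so the familiar lookups apply; a short usage sketch:
print(fdist.most_common(2))  # [('this', 2), ('is', 2)]
print(fdist['test'])         # 2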
IX. Text Formatting & Encoding
• Format strings with f-strings.
name = "Alice"
age = 30
message = f"Name: {name}, Age: {age}"
• Pad a string with leading zeros.
number = "42".zfill(5) # '00042'
• Encode a string to bytes.
byte_string = "hello".encode('utf-8')• Decode bytes to a string.
original_string = byte_string.decode('utf-8')X. Text Vectorization
(Requires
pip install scikit-learn)• Create a Bag-of-Words (BoW) model.
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["This is the first document.", "This is the second document."]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
• Get feature names (the vocabulary).
print(vectorizer.get_feature_names_out())
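fit_transform returns a sparse matrix; for a toy corpus like this it is fine to densify and read the counts directly (columns follow the sorted vocabulary above):
print(X.toarray())
# [[1 1 1 0 1 1]
#  [1 0 1 1 1 1]]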
• Create a TF-IDF model.
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
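One way to inspect the per-term weights for a document is to zip the vocabulary with a densified row (a sketch, suitable for small corpora only):
weights = tfidf_matrix.toarray()[0]  # TF-IDF row for the first document
for term, w in zip(tfidf_vectorizer.get_feature_names_out(), weights):
    print(f"{term}: {w:.3f}")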
XI. More String Utilities
• Center a string within a width.
centered = "Hello".center(20, '-') # '-------Hello--------'
• Check if a string is in title case.
"This Is A Title".istitle() # True
• Find the highest index of a substring.
"test test".rfind("test") # Returns 5• Split from the right.
"path/to/file.txt".rsplit('/', 1) # ['path/to', 'file.txt']• Create a character translation table.
table = str.maketrans('aeiou', '12345')
vowels_to_num = "hello".translate(table) # 'h2ll4'• Remove a specific prefix.
"TestCase".removeprefix("Test") # 'Case'• Remove a specific suffix.
"filename.txt".removesuffix(".txt") # 'filename'• Check for unicode decimal characters.
"½".isdecimal() # False
"123".isdecimal() # True
• Check for Unicode numeric characters.
"½".isnumeric() # True
"²".isnumeric() # True
#Python #TextProcessing #NLP #RegEx #NLTK
━━━━━━━━━━━━━━━
By: @DataScience4 ✨