Corpus
corpus = ["John likes to watch movies.",
          "Mary likes movies too.",
          "John also likes to watch football games."]
tgt_doc = corpus[2]
Term Frequency:
# Count how often each whitespace-separated token occurs in the target document
term_freq = {}
for word in tgt_doc.split():
    term_freq[word] = term_freq.get(word, 0) + 1
Inverse Document Frequency:
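import math

# Document frequency: number of documents that contain each token
df = {}
for doc in corpus:
    for word in set(doc.split()):
        df[word] = df.get(word, 0) + 1

# Inverse document frequency: log(N / df), which is 0 for tokens found in every document
idf = {}
for word in df:
    idf[word] = math.log(len(corpus) / df[word])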
TF-IDF:
# Weight each token's count in the target document by its IDF
tfidf = {}
for word in tgt_doc.split():
    tfidf[word] = term_freq[word] * idf[word]
print(tfidf)
{'John': 0.4054651081081644,
'also': 1.0986122886681098,
'likes': 0.0,
'to': 0.4054651081081644,
'watch': 0.4054651081081644,
'football': 1.0986122886681098,
'games.': 1.0986122886681098}
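Note that split() keeps punctuation attached to words, which is why the key above is 'games.' rather than 'games', and why 'movies.' and 'movies' would be counted as different terms. A minimal sketch of the same computation with simple normalization, assuming a lowercase, regex-based tokenizer is acceptable here (the tokenize and tfidf_for helpers are illustrative names, not part of the original code):

```python
import math
import re

corpus = ["John likes to watch movies.",
          "Mary likes movies too.",
          "John also likes to watch football games."]

def tokenize(doc):
    # Assumption: lowercase the text and keep only runs of word characters
    return re.findall(r"\w+", doc.lower())

def tfidf_for(doc, corpus):
    # Assumes doc is one of the documents in corpus, so every token has df >= 1
    tokens = tokenize(doc)

    # Term frequency: raw counts in the target document
    tf = {}
    for word in tokens:
        tf[word] = tf.get(word, 0) + 1

    # Document frequency: number of documents containing each token
    doc_freq = {}
    for other in corpus:
        for word in set(tokenize(other)):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    # TF-IDF with the same unsmoothed IDF as above: log(N / df)
    n_docs = len(corpus)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()}

scores = tfidf_for(corpus[2], corpus)
print(scores)
```

With this tokenizer the trailing period no longer creates a separate term, while words that occur in every document (such as 'likes') still get a weight of 0.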
Problem
Word2Vec:
Let's consider the stats from our Pokemon example.
df.set_index("Name", inplace=True)
# Keep only the six stat columns (reassign so the later lookups show just these)
df = df[["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]]
df.loc["Bulbasaur"]
HP         45
Attack     49
Defense    49
Sp. Atk    65
Sp. Def    65
Speed      45
Name: Bulbasaur, dtype: int64
df.loc[['Bulbasaur', 'Vanillite']]
           HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
Name
Bulbasaur  45      49       49       65       65     45
Vanillite  36      50       50       65       60     44
df.loc[['Pikachu', 'Diglett']]
         HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
Name
Pikachu  35      55       40       50       50     90
Diglett  10      55       25       35       45     95
df.loc[['Nidoqueen', 'Poliwrath']]
           HP  Attack  Defense  Sp. Atk  Sp. Def  Speed
Name
Nidoqueen  90      92       87       75       85     76
Poliwrath  90      95       95       70       90     70
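Each Pokemon can be treated as a point in six-dimensional space, so "similar stats" can be made concrete as a small distance between two points. A minimal sketch using rows shown above, with Euclidean distance as an assumed choice of metric (the stats dict and euclidean helper are illustrative names, not part of the original notebook):

```python
import math

# Stat vectors copied from the tables above:
# HP, Attack, Defense, Sp. Atk, Sp. Def, Speed
stats = {
    "Bulbasaur": [45, 49, 49, 65, 65, 45],
    "Vanillite": [36, 50, 50, 65, 60, 44],
    "Pikachu":   [35, 55, 40, 50, 50, 90],
    "Diglett":   [10, 55, 25, 35, 45, 95],
}

def euclidean(a, b):
    # Square root of the summed squared differences, one term per stat
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

near = euclidean(stats["Bulbasaur"], stats["Vanillite"])
far = euclidean(stats["Bulbasaur"], stats["Diglett"])
print(f"Bulbasaur to Vanillite: {near:.2f}")
print(f"Bulbasaur to Diglett: {far:.2f}")
```

Pairs with similar rows end up close together, while Pokemon with very different stat profiles end up far apart. Word vectors apply the same idea to words.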
import spacy
import numpy as np
nlp = spacy.load("en_core_web_lg") # Load spaCy model
cat = nlp("cat")
dog = nlp("dog")
ham = nlp("ham")
print(f"Distance from 'cat' to 'dog' is {np.linalg.norm(cat.vector - dog.vector)}")
print(f"Distance from 'cat' to 'ham' is {np.linalg.norm(cat.vector - ham.vector)}")
Distance from 'cat' to 'dog' is 42.8679084777832
Distance from 'cat' to 'ham' is 77.25950622558594
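Euclidean distance depends on the length of the vectors as well as their direction. Cosine similarity, which compares direction only, is a commonly used alternative for word vectors. A minimal sketch on made-up three-dimensional vectors (toy values chosen for illustration, not real spaCy embeddings; cosine_similarity is an illustrative helper):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any model
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]   # same direction as v1, twice as long
v3 = [0.0, 0.0, 3.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Here v1 and v2 point the same way, so their similarity is at the maximum even though their Euclidean distance is not zero. v1 and v3 share no direction at all.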
Definition:
Use Cases:
Lexicon-Based Approach:
import spacy
from textblob import TextBlob
nlp = spacy.load("en_core_web_sm")
sentence = "I am very unhappy with this product."
doc = nlp(sentence)
blob = TextBlob(doc.text)
print(f"Sentiment Polarity: {blob.sentiment.polarity}")
print(f"Sentiment Subjectivity: {blob.sentiment.subjectivity}")
print(f"Assessments: {blob.sentiment_assessments.assessments}")
Sentiment Polarity: -0.78
Sentiment Subjectivity: 1.0
Assessments: [(['very', 'unhappy'], -0.78, 1.0, None)]
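The assessment above pairs an intensifier ('very') with a sentiment word ('unhappy'). A minimal sketch of how a lexicon-based scorer can combine the two, assuming a tiny hand-made lexicon (the POLARITY and INTENSIFIERS values are hypothetical and chosen for illustration; they are not TextBlob's actual data, and lexicon_polarity is not a TextBlob function):

```python
# Toy lexicon: hypothetical values, not TextBlob's real lexicon
POLARITY = {"unhappy": -0.6, "happy": 0.8, "great": 0.8, "terrible": -1.0}
INTENSIFIERS = {"very": 1.3, "slightly": 0.5}

def lexicon_polarity(sentence):
    # Naive tokenization: lowercase, drop sentence-final punctuation, split on spaces
    words = sentence.lower().strip(".!?").split()
    scores = []
    for i, word in enumerate(words):
        if word in POLARITY:
            score = POLARITY[word]
            # Scale by an intensifier that appears immediately before the word
            if i > 0 and words[i - 1] in INTENSIFIERS:
                score *= INTENSIFIERS[words[i - 1]]
            # Keep each score inside [-1, 1]
            scores.append(max(-1.0, min(1.0, score)))
    # Average over all sentiment words found; 0.0 means nothing matched
    return sum(scores) / len(scores) if scores else 0.0

print(lexicon_polarity("I am very unhappy with this product."))
```

Sentences with no words in the lexicon score 0.0, which illustrates a known limitation of this approach: the result depends on what the lexicon covers.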