TFIDF Vectorization using GENSIM
[1] Create TFIDF Vectors:
from gensim import corpora
from gensim.models import TfidfModel
from gensim import models
from gensim import similarities
import pprint
# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
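# Tokenize the documents: lowercase, then split on whitespace.
# Punctuation stays attached, so "document", "document." and "document?" are three different tokens.
texts1 = [document.lower().split() for document in documents]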
# Create a dictionary representation of the documents.
dictionary1 = corpora.Dictionary(texts1)
# Transform the documents to a vectorized form
corpus1 = [dictionary1.doc2bow(text) for text in texts1]
# Create a TF-IDF model
tfidf1 = models.TfidfModel(corpus1)
# Calculate the TF-IDF scores for the corpus
corpus_tfidf1 = tfidf1[corpus1]
# Number of documents in the TF-IDF corpus
print(len(corpus_tfidf1))
# Print the TF-IDF scores
for doc in corpus_tfidf1[0:3]:
    pprint.pprint(doc)
Output:
[2] Print term frequencies:
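# Each bag-of-words entry is a (token id, count) pair; look up the token text in the dictionary
for doc in corpus1[0:3]:
    print([[dictionary1[token_id], freq] for token_id, freq in doc])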
Output:
[3] Print Term TFIDF:
import numpy as np
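# Each TF-IDF entry is a (token id, weight) pair; round the weight to 3 decimals for display
for doc in corpus_tfidf1[0:3]:
    print([[dictionary1[token_id], np.around(weight, 3)] for token_id, weight in doc])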
Output:
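
The weights printed in [3] can be cross-checked by hand. The sketch below does not call GENSIM; it assumes what I understand to be the TfidfModel defaults (raw term count as tf, idf = log2(N / df), then L2 normalization of each document vector). The names doc_freq, tfidf_vector and manual_tfidf are made up for this illustration.

```python
import math
from collections import Counter

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
texts = [doc.lower().split() for doc in documents]

# Document frequency: the number of documents each token occurs in.
doc_freq = Counter(token for text in texts for token in set(text))
num_docs = len(texts)

def tfidf_vector(tokens):
    """Return {token: weight} using tf * log2(N / df), L2-normalized (assumed defaults)."""
    term_freq = Counter(tokens)
    weights = {
        token: tf * math.log2(num_docs / doc_freq[token])
        for token, tf in term_freq.items()
    }
    # Tokens that occur in every document have idf = 0 and are dropped.
    weights = {token: w for token, w in weights.items() if w > 0.0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {token: round(w / norm, 3) for token, w in weights.items()}

manual_tfidf = [tfidf_vector(text) for text in texts]
for vector in manual_tfidf:
    print(vector)
```

Under these assumptions "this", "is" and "the" get a weight of zero because they occur in all four documents, which would explain why they are missing from the TFIDF vectors.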
Note:
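GENSIM does not preserve word order in the bag-of-words or TFIDF corpus: each document is stored as a list of (token id, value) pairs sorted by token id. The terms look alphabetical because the dictionary appears to assign ids to the new tokens of each document in sorted order.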
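
The ordering described in the note can be reproduced with a few lines of plain Python. This is only a sketch of how I understand the id assignment to work, not GENSIM's actual code; doc2bow_sketch and token2id are hypothetical names used for the illustration.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Hypothetical stand-in for a bag-of-words conversion that also grows the vocabulary."""
    counts = Counter(tokens)
    # Assumption: unseen tokens receive the next free ids in alphabetical order.
    for token in sorted(t for t in counts if t not in token2id):
        token2id[token] = len(token2id)
    # The result is sorted by token id, so the original word order is lost.
    return sorted((token2id[t], n) for t, n in counts.items())

token2id = {}
tokens = "this is the first document.".split()
bow = doc2bow_sketch(tokens, token2id)

# Map ids back to tokens to see the order in which the terms come out.
id2token = {token_id: token for token, token_id in token2id.items()}
ordered_terms = [id2token[token_id] for token_id, _ in bow]
print(ordered_terms)  # ['document.', 'first', 'is', 'the', 'this']
```

If the original word order matters, keep the tokenized texts (texts1 above) alongside the corpus, since the order cannot be recovered from the vectors.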
To convert the TFIDF corpus into a pandas dataframe:
import pandas as pd
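# Convert corpus_tfidf1 to a dataframe: one row per document, one [term, weight] pair per cell.
# The columns are not trigrams. The three column names assume that no document keeps more than
# three weighted terms, which holds for the sample documents above; shorter rows are padded
# with missing values. np is the numpy import from [3].
df_tfidf1 = pd.DataFrame(
    [[[dictionary1[term_id], np.around(weight, 3)] for term_id, weight in doc] for doc in corpus_tfidf1],
    columns=['gram1', 'gram2', 'gram3'],
)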
print(len(df_tfidf1))
# Print the dataframe
df_tfidf1
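
The fixed column names above stop working as soon as a document keeps more than three terms. A possible alternative is a document-term matrix with one column per term. The sketch below uses two hand-written sparse vectors in place of the real corpus; sparse_docs and df_matrix are hypothetical names, and the weights are illustrative values.

```python
import pandas as pd

# Hypothetical sparse TF-IDF vectors, one {term: weight} dict per document.
# With the objects from this section, the same structure could be built as:
#   [{dictionary1[term_id]: weight for term_id, weight in doc} for doc in corpus_tfidf1]
sparse_docs = [
    {"first": 0.707, "document.": 0.707},
    {"document": 0.667, "second": 0.667, "document.": 0.333},
]

# pandas aligns the dicts by key; terms missing from a document become NaN, then 0.0.
df_matrix = pd.DataFrame(sparse_docs).fillna(0.0)
print(df_matrix)
```

This layout has one row per document and one column per distinct term, so it grows with the vocabulary instead of depending on a maximum number of terms per document.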