TF-IDF Vectorization Using Gensim

[1] Create TFIDF Vectors:

from gensim import corpora
from gensim.models import TfidfModel
from gensim import models
from gensim import similarities
import pprint

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

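# Tokenize the documents: lowercase and split on whitespace.
# Note: punctuation stays attached, so "document." and "document?" are distinct tokens.
texts1 = [document.lower().split() for document in documents]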

# Create a dictionary representation of the documents.
dictionary1 = corpora.Dictionary(texts1)

# Convert each document to a bag-of-words vector: a list of (token_id, count) pairs
corpus1 = [dictionary1.doc2bow(text) for text in texts1]

# Create a TF-IDF model
tfidf1 = models.TfidfModel(corpus1)

# Calculate the TF-IDF scores for the corpus
corpus_tfidf1 = tfidf1[corpus1]

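# Number of documents in the transformed corpus
print(len(corpus_tfidf1))

# Print the TF-IDF vectors of the first three documents;
# each vector is a list of (token_id, weight) pairs
for doc in corpus_tfidf1[0:3]:
    pprint.pprint(doc)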

Output:

[2] Print term frequencies:

# Map each token ID back to its token (avoid shadowing the built-in id)
for doc in corpus1[0:3]:
    print([[dictionary1[token_id], freq] for token_id, freq in doc])

Output:

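[3] Print term TF-IDF weights:

import numpy as np
# The second element of each pair is now a TF-IDF weight, not a raw count.
# float() keeps the printed list free of NumPy scalar wrappers.
for doc in corpus_tfidf1[0:3]:
    print([[dictionary1[token_id], float(np.around(weight, 3))] for token_id, weight in doc])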

Output:

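Note:

Gensim's bag-of-words representation does not preserve word order. doc2bow returns its (token_id, count) pairs sorted by token ID, and the Dictionary assigns IDs to the new tokens of each document in alphabetical order, which is why the printed terms appear alphabetical rather than in sentence order. Also note that tokens occurring in every document (here "this", "is", and "the") get an IDF of zero and are dropped from the TF-IDF vectors.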
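
The weights printed in [3] can be reproduced by hand. The sketch below assumes Gensim's default weighting as I understand it: raw term count times log2(N / document frequency), followed by L2 normalization, with zero-weight terms dropped. It uses only the standard library; the names tfidf_vector and manual_tfidf are illustrative and not part of Gensim.

```python
import math
from collections import Counter

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# Same tokenization as the tutorial: lowercase, split on whitespace
texts = [doc.lower().split() for doc in documents]
n_docs = len(texts)

# Document frequency: in how many documents each token occurs
doc_freq = Counter(token for text in texts for token in set(text))


def tfidf_vector(tokens):
    """Return {token: weight} using tf * log2(N / df), then L2 normalization."""
    counts = Counter(tokens)
    weights = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    # Tokens found in every document have idf = log2(1) = 0 and are dropped
    weights = {t: w for t, w in weights.items() if w > 0}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


manual_tfidf = [tfidf_vector(text) for text in texts]
for vec in manual_tfidf:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

For the first document only "first" and "document." survive, each with weight 1/sqrt(2), or about 0.707, which can be compared against the Gensim output from [3].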

To convert the TF-IDF corpus into a DataFrame:

import pandas as pd

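# Convert corpus_tfidf1 to a DataFrame.
# Each row is a document and each cell holds a [term, weight] pair. The cells are
# single terms, not n-grams, and documents can have different numbers of nonzero
# terms, so let pandas pad the shorter rows with missing values and name the
# columns after the widest row instead of hard-coding three columns.
rows = [[[dictionary1[token_id], float(np.around(weight, 3))] for token_id, weight in doc]
        for doc in corpus_tfidf1]
df_tfidf1 = pd.DataFrame(rows)
df_tfidf1.columns = [f'term{i + 1}' for i in range(df_tfidf1.shape[1])]
print(len(df_tfidf1))
# Print the dataframe
df_tfidf1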
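
A layout that is often easier to work with is a wide table: one row per document, one column per term, and 0.0 where a term is absent. The sketch below shows the idea using pandas only. The id2token mapping and sparse_docs list are small hypothetical stand-ins with the same shape as dictionary1 and corpus_tfidf1, and df_wide is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-ins shaped like Gensim's output:
# a token_id -> token mapping and one list of (token_id, weight) pairs per document
id2token = {0: "document.", 1: "first", 2: "document", 3: "second"}
sparse_docs = [
    [(0, 0.707), (1, 0.707)],
    [(0, 0.333), (2, 0.667), (3, 0.667)],
]

# One dict per document; pandas aligns the keys into columns
rows = [{id2token[token_id]: weight for token_id, weight in doc} for doc in sparse_docs]

# Terms missing from a document come out as NaN, so fill them with 0.0
df_wide = pd.DataFrame(rows).fillna(0.0)
print(df_wide)
```

To apply this to the tutorial's data, substitute dictionary1 for id2token and corpus_tfidf1 for sparse_docs.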