Contextual word embeddings are advanced language representations that capture the meaning of words based on their context. Unlike traditional static word embeddings, which assign a single vector to each word, contextual embeddings generate dynamic representations that change according to the surrounding words in a sentence (ActiveLoop.ai).
The following Python code was published at the source: medium.com/mlearning-ai/getting-contextuali..
[1] Install Required Libraries
import pandas as pd
import numpy as np
import torch
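Note that this step only imports the libraries; in a fresh environment (for example, a new Colab runtime) the Hugging Face transformers library and PyTorch would also need to be installed. A minimal sketch, assuming pip is available:
# install the libraries used below (versions are not pinned here)
pip install transformers torch pandas numpy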
[2] Load the Pre-trained BERT Model
from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
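Although not shown in the original snippet, it is common to switch the model to evaluation mode before extracting embeddings so that dropout layers are disabled (an optional addition):
# put the model in evaluation mode; we only run inference, no training
model.eval()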
[3] Prepare BERT's Tokenized Text
def bert_text_preparation(text, tokenizer):
    """
    Preprocesses text input in a way that BERT can interpret.
    """
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(indexed_tokens)

    # convert inputs to tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segments_ids])

    return tokenized_text, tokens_tensor, segments_tensor
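A quick sanity check of the helper above (a sketch; the printed tokens are what bert-base-uncased is expected to produce for this input):
tokens, tok_tensor, seg_tensor = bert_text_preparation("the river bank", tokenizer)
print(tokens)            # expected: ['[CLS]', 'the', 'river', 'bank', '[SEP]']
print(tok_tensor.shape)  # expected: torch.Size([1, 5]) -- a batch of one sequence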
[4] Define BERT Embedding Task
def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens.
    """
    # gradient calculation is disabled
    with torch.no_grad():
        # obtain hidden states
        outputs = model(tokens_tensor, segments_tensor)
        hidden_states = outputs[2]

    # stack the hidden states for all layers along a new dimension
    token_embeddings = torch.stack(hidden_states, dim=0)
    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1, 0, 2)

    # initialize list to store embeddings
    token_vecs_sum = []

    # "token_embeddings" is a [Y x 13 x 768] tensor,
    # where Y is the number of tokens in the sentence
    # and 13 = the embedding layer plus 12 encoder layers
    # loop over tokens in the sentence
    for token in token_embeddings:
        # "token" is a [13 x 768] tensor
        # sum the vectors from the last four layers
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs_sum.append(sum_vec)

    return token_vecs_sum
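Summing the last four hidden layers is only one common pooling choice; concatenating them instead yields a 4 × 768 = 3072-dimensional vector per token. A minimal sketch of that variant (the function name get_bert_embeddings_concat is mine, not from the source):
def get_bert_embeddings_concat(tokens_tensor, segments_tensor, model):
    """Variant that concatenates the last four hidden layers per token."""
    with torch.no_grad():
        hidden_states = model(tokens_tensor, segments_tensor)[2]
    token_embeddings = torch.stack(hidden_states, dim=0).squeeze(1).permute(1, 0, 2)
    # each resulting token vector is 4 * 768 = 3072-dimensional
    return [torch.cat(tuple(token[-4:]), dim=0) for token in token_embeddings]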
[5] Create the BERT Contextual Embeddings
sentences = ["bank",
"he eventually sold the shares back to the bank at a premium.",
"the bank strongly resisted cutting interest rates.",
"the bank will supply and buy back foreign currency.",
"the bank is pressing us for repayment of the loan.",
"the bank left its lending rates unchanged.",
"the river flowed over the bank.",
"tall, luxuriant plants grew along the river bank.",
"his soldiers were arrayed along the river bank.",
"wild flowers adorned the river bank.",
"two fox cubs romped playfully on the river bank.",
"the jewels were kept in a bank vault.",
"you can stow your jewellery away in the bank.",
"most of the money was in storage in bank vaults.",
"the diamonds are shut away in a bank vault somewhere.",
"thieves broke into the bank vault.",
"can I bank on your support?",
"you can bank on him to hand you a reasonable bill for your services.",
"don't bank on your friends to help you out of trouble.",
"you can bank on me when you need money.",
"i bank on your help."
]
from collections import OrderedDict

context_embeddings = []
context_tokens = []

for sentence in sentences:
    tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
    list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)

    # make an ordered dictionary to keep track of the position of each word
    tokens = OrderedDict()

    # loop over tokens in the sentence (skipping [CLS] and [SEP])
    for token in tokenized_text[1:-1]:
        # keep track of the position of the word and whether it occurs multiple times
        if token in tokens:
            tokens[token] += 1
        else:
            tokens[token] = 1

        # compute the position of the current token
        token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
        current_index = token_indices[tokens[token] - 1]

        # get the corresponding embedding
        token_vec = list_token_embeddings[current_index]

        # save values
        context_tokens.append(token)
        context_embeddings.append(token_vec)
Notice that the tokens are stored based on their actual positions in the sentence, so repeated occurrences of a word keep separate embeddings.
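Before visualizing the embeddings, the contextual effect can already be checked numerically by comparing the stored vectors for different occurrences of "bank" with cosine similarity. A short sketch (the indices simply follow the order of the sentences list above; index 1 comes from a financial sentence and index 6 from a river sentence):
import torch.nn.functional as F

# collect every stored embedding of the token "bank", in sentence order
bank_vectors = [emb for tok, emb in zip(context_tokens, context_embeddings) if tok == "bank"]

# similarity between "bank" in a financial context and "bank" in a river context
sim = F.cosine_similarity(bank_vectors[1].unsqueeze(0), bank_vectors[6].unsqueeze(0))
print(sim.item())  # expected to be noticeably lower than for two financial contexts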
[6] Visualize the BERT Contextual Word Embedding Model
import os

filepath = os.path.join('/content/')

name = 'metadata_small.tsv'
with open(os.path.join(filepath, name), 'w+') as file_metadata:
    for i, token in enumerate(context_tokens):
        file_metadata.write(token + '\n')

import csv

name = 'embeddings_small.tsv'
with open(os.path.join(filepath, name), 'w+') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())
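The two TSV files match the format used by the TensorFlow Embedding Projector (projector.tensorflow.org): embeddings_small.tsv holds the vectors and metadata_small.tsv the token labels. A quick optional check that the files line up, one row per token:
# both files should contain exactly one line per stored token
with open(os.path.join(filepath, 'embeddings_small.tsv')) as f:
    n_vectors = sum(1 for _ in f)
with open(os.path.join(filepath, 'metadata_small.tsv')) as f:
    n_labels = sum(1 for _ in f)
print(n_vectors, n_labels, len(context_tokens))  # all three counts should match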
When the exported embeddings are visualized, notice that the word "bank" appears in two different clusters based on its context of use.
Read the original source for further explanation.
Also read:
discuss.huggingface.co/t/generate-raw-word-..
Colab Notebook:
https://colab.research.google.com/drive/1IfkGa9cyLXCzqcXzehfPR_n4NP-Dm8xy?usp=sharing