Creating a BERT Contextual Word Embedding Model

Contextual word embeddings are advanced language representations that capture the meaning of words based on their context. Unlike traditional static word embeddings, which assign a single vector to each word, contextual embeddings generate dynamic representations that change according to the surrounding words in a sentence (ActiveLoop.ai).

The following Python code was originally published at the source: medium.com/mlearning-ai/getting-contextuali..

[1] Install and Import the Required Libraries

import pandas as pd
import numpy as np
import torch
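
The imports above assume the packages are already available (Colab ships with most of them). If transformers is missing from the environment, it can be installed first; a minimal sketch for a notebook cell, not part of the original post:

!pip install transformers torch pandas numpy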

[2] Load the Pre-Trained BERT Model

from transformers import BertModel, BertTokenizer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
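
As a quick, illustrative check (not part of the original post), the tokenizer can be applied to a short phrase to see how BERT's WordPiece vocabulary splits it:

# words outside BERT's vocabulary are split into sub-word pieces prefixed with "##"
print(tokenizer.tokenize("the contextualized embeddings"))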

[3] Prepare BERT's Tokenized Text

def bert_text_preparation(text, tokenizer):
  """
  Preprocesses text input in a way that BERT can interpret.
  """
  marked_text = "[CLS] " + text + " [SEP]"
  tokenized_text = tokenizer.tokenize(marked_text)
  indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
  segments_ids = [1]*len(indexed_tokens)
  # convert inputs to tensors
  tokens_tensor = torch.tensor([indexed_tokens])
  segments_tensor = torch.tensor([segments_ids])
  return tokenized_text, tokens_tensor, segments_tensor
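
For example, calling the helper on a short sentence (an illustrative check with hypothetical variable names, not from the original post) returns the token list wrapped in the [CLS]/[SEP] markers plus the two tensors BERT expects:

example_tokens, example_tokens_tensor, example_segments_tensor = bert_text_preparation("the bank is open", tokenizer)
print(example_tokens)               # ['[CLS]', 'the', 'bank', 'is', 'open', '[SEP]']
print(example_tokens_tensor.shape)  # torch.Size([1, 6])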

[4] Define BERT Embedding Task

def get_bert_embeddings(tokens_tensor, segments_tensor, model):
    """
    Obtains BERT embeddings for tokens.
    """
    # gradient calculation is disabled
    with torch.no_grad():
      # obtain hidden states; pass the segment ids explicitly as token_type_ids
      outputs = model(tokens_tensor, token_type_ids=segments_tensor)
      # third element of the output: the hidden states of all layers
      hidden_states = outputs[2]
    # concatenate the tensors for all layers
    # use "stack" to create a new dimension in the tensor
    token_embeddings = torch.stack(hidden_states, dim=0)
    # remove dimension 1, the "batches"
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    # swap dimensions 0 and 1 so we can loop over tokens
    token_embeddings = token_embeddings.permute(1,0,2)
    # initialize list to store embeddings
    token_vecs_sum = []
    # "token_embeddings" is a [Y x 13 x 768] tensor
    # where Y is the number of tokens in the sentence
    # (13 = the initial embedding layer plus 12 encoder layers)
    # loop over tokens in sentence
    for token in token_embeddings:
      # "token" is a [13 x 768] tensor
      # sum the vectors from the last four layers
      sum_vec = torch.sum(token[-4:], dim=0)
      token_vecs_sum.append(sum_vec)
    return token_vecs_sum
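
Combining the two helpers (again an illustrative sketch, not from the original post), every token in a sentence, including [CLS] and [SEP], receives one 768-dimensional vector:

example_text = "the bank is open"
ex_tokens, ex_tokens_tensor, ex_segments_tensor = bert_text_preparation(example_text, tokenizer)
ex_embeddings = get_bert_embeddings(ex_tokens_tensor, ex_segments_tensor, model)
print(len(ex_tokens), len(ex_embeddings))  # equal lengths: one embedding per token
print(ex_embeddings[0].shape)              # torch.Size([768])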

[5] Create BERT Embedding Model

sentences = [
    "bank",
    "he eventually sold the shares back to the bank at a premium.",
    "the bank strongly resisted cutting interest rates.",
    "the bank will supply and buy back foreign currency.",
    "the bank is pressing us for repayment of the loan.",
    "the bank left its lending rates unchanged.",
    "the river flowed over the bank.",
    "tall, luxuriant plants grew along the river bank.",
    "his soldiers were arrayed along the river bank.",
    "wild flowers adorned the river bank.",
    "two fox cubs romped playfully on the river bank.",
    "the jewels were kept in a bank vault.",
    "you can stow your jewellery away in the bank.",
    "most of the money was in storage in bank vaults.",
    "the diamonds are shut away in a bank vault somewhere.",
    "thieves broke into the bank vault.",
    "can I bank on your support?",
    "you can bank on him to hand you a reasonable bill for your services.",
    "don't bank on your friends to help you out of trouble.",
    "you can bank on me when you need money.",
    "i bank on your help.",
    ]
from collections import OrderedDict
context_embeddings = []
context_tokens = []
for sentence in sentences:
  tokenized_text, tokens_tensor, segments_tensors = bert_text_preparation(sentence, tokenizer)
  list_token_embeddings = get_bert_embeddings(tokens_tensor, segments_tensors, model)
  # make ordered dictionary to keep track of the position of each word
  tokens = OrderedDict()
  # loop over tokens in the sentence
  for token in tokenized_text[1:-1]:
    # keep track of position of word and whether it occurs multiple times
    if token in tokens:
      tokens[token] += 1
    else:
      tokens[token] = 1
    # compute the position of the current token
    token_indices = [i for i, t in enumerate(tokenized_text) if t == token]
    current_index = token_indices[tokens[token]-1]
    # get the corresponding embedding
    token_vec = list_token_embeddings[current_index]

    # save values
    context_tokens.append(token)
    context_embeddings.append(token_vec)

Notice that the tokens are stored based on their actual positions in the sentence, so a word that occurs more than once gets a separate, context-specific embedding for each occurrence.
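
To confirm that the embeddings really are contextual, the vectors collected for "bank" can be compared across sentences. The sketch below is not part of the original code; it assumes, as is the case for the sentence list above, that each sentence contains the word "bank" exactly once, so the i-th occurrence of "bank" in context_tokens corresponds to sentences[i]:

import torch.nn.functional as F

# positions of every occurrence of "bank" in the collected tokens
bank_positions = [i for i, t in enumerate(context_tokens) if t == 'bank']

bank_finance_1 = context_embeddings[bank_positions[1]]  # "...sold the shares back to the bank..."
bank_finance_2 = context_embeddings[bank_positions[2]]  # "the bank strongly resisted cutting interest rates."
bank_river = context_embeddings[bank_positions[6]]      # "the river flowed over the bank."

# cosine similarity between the summed layer vectors (1-D tensors of size 768)
print(F.cosine_similarity(bank_finance_1, bank_finance_2, dim=0))  # same (financial) sense: expected to be higher
print(F.cosine_similarity(bank_finance_1, bank_river, dim=0))      # different (river) sense: expected to be lower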

[6] Visualize the BERT Contextual Word Embedding Model

import os
filepath = '/content/'   # Colab's default working directory
# write one token per line: the metadata file for the visualization
name = 'metadata_small.tsv'
with open(os.path.join(filepath, name), 'w+') as file_metadata:
  for i, token in enumerate(context_tokens):
    file_metadata.write(token + '\n')
# write the corresponding embedding vectors as tab-separated values
import csv
name = 'embeddings_small.tsv'
with open(os.path.join(filepath, name), 'w+') as tsvfile:
    writer = csv.writer(tsvfile, delimiter='\t')
    for embedding in context_embeddings:
        writer.writerow(embedding.numpy())

When the two TSV files are loaded into a visualization tool such as the TensorFlow Embedding Projector, the word "bank" appears in separate clusters based on its context of use.

Read the original source for further explanation.

Also read:

discuss.huggingface.co/t/generate-raw-word-..

Colab Notebook:

https://colab.research.google.com/drive/1IfkGa9cyLXCzqcXzehfPR_n4NP-Dm8xy?usp=sharing