Siamese Network Using SBERT-like Implementation

Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He is studying at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).
A Siamese network is a type of neural network architecture that uses two or more identical subnetworks that share the same weights and parameters. This architecture is commonly used for tasks like similarity learning, where the goal is to determine how similar two inputs are. Sentence-BERT (SBERT) is a modification of the BERT model that uses a Siamese network architecture to generate sentence embeddings that are useful for tasks like semantic textual similarity, clustering, and information retrieval.
Key Components of the Siamese Network Using SBERT-like Implementation
Shared BERT Model:
Use a pre-trained BERT model (or any transformer-based model like RoBERTa, DistilBERT, etc.) as the base encoder.
The same BERT model is used for both input sentences (weight sharing).
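To make the weight-sharing idea concrete, here is a minimal sketch. It assumes PyTorch and uses a tiny stand-in encoder (the hypothetical `ToyEncoder`, an embedding layer followed by averaging) in place of BERT; the class names and sizes are illustrative only and do not come from the experiments below.

```python
import torch
from torch import nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: embeds token ids and averages them."""
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)  # (batch, dim)

class ToySiamese(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance: both branches use the same parameters
        self.encoder = ToyEncoder()

    def forward(self, ids_a, ids_b):
        # The same module encodes both inputs (weight sharing)
        return self.encoder(ids_a), self.encoder(ids_b)

torch.manual_seed(0)
model = ToySiamese()
ids = torch.randint(0, 100, (2, 5))  # 2 "sentences" of 5 token ids each
emb_a, emb_b = model(ids, ids)
```

Because there is only one encoder, identical inputs always produce identical embeddings, and the parameter count is that of a single encoder rather than two.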
Pooling Layer:
After passing the input sentences through BERT, you need to pool the output embeddings to get fixed-size sentence embeddings.
Common pooling strategies include:
Mean Pooling: Take the mean of all token embeddings (ideally ignoring padding tokens).
Max Pooling: Take the element-wise maximum across token embeddings.
CLS Token: Use the embedding of the [CLS] token as the sentence representation.
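The three pooling strategies listed above can be sketched as follows. This is an illustrative sketch assuming PyTorch, with `hidden` standing in for the encoder's `last_hidden_state` and `mask` for the tokenizer's `attention_mask`; the function names are hypothetical. The mean and max variants use the mask so that padding positions do not influence the result.

```python
import torch

def mean_pool(hidden, mask):
    """Average token embeddings, counting only non-padding positions."""
    m = mask.unsqueeze(-1).to(hidden.dtype)           # (batch, seq, 1)
    return (hidden * m).sum(dim=1) / m.sum(dim=1).clamp(min=1e-9)

def max_pool(hidden, mask):
    """Element-wise maximum over non-padding positions."""
    m = mask.unsqueeze(-1).bool()
    return hidden.masked_fill(~m, float("-inf")).max(dim=1).values

def cls_pool(hidden):
    """Embedding of the first token ([CLS] for BERT-style tokenizers)."""
    return hidden[:, 0, :]

# Toy batch: one sentence, three positions, the last of which is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])

mean_vec = mean_pool(hidden, mask)  # padding row is excluded from the average
max_vec = max_pool(hidden, mask)    # padding row is excluded from the maximum
cls_vec = cls_pool(hidden)
```

A plain `torch.mean(hidden, dim=1)` would instead include the padding row, which matters when sentences in a batch differ in length.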
Similarity Metric:
Compute the similarity between the two sentence embeddings.
Common similarity metrics include:
Cosine Similarity: Measures the cosine of the angle between the two vectors.
Euclidean Distance: Measures the straight-line distance between the two vectors.
Dot Product: Computes the dot product of the two vectors.
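As a small illustration of the three metrics above, the sketch below computes each of them row by row for two batches of toy 2-dimensional vectors. It assumes PyTorch; the vectors are made up for the example.

```python
import torch
import torch.nn.functional as F

# Two batches of embeddings; row i of u is compared with row i of v
u = torch.tensor([[1.0, 0.0], [1.0, 1.0]])
v = torch.tensor([[0.0, 1.0], [2.0, 2.0]])

# Cosine similarity: depends only on direction, not on vector length
cosine = F.cosine_similarity(u, v, dim=1)

# Euclidean distance: straight-line distance (smaller means more similar)
euclidean = (u - v).pow(2).sum(dim=1).sqrt()

# Dot product: sensitive to both direction and length
dot = (u * v).sum(dim=1)
```

The second pair points in the same direction, so its cosine similarity is 1 even though the vectors differ in length, while the Euclidean distance and dot product still reflect that difference.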
Loss Function:
Use a loss function that encourages similar sentences to have high similarity scores and dissimilar sentences to have low similarity scores.
Common loss functions include:
Contrastive Loss: Encourages similar pairs to be close and dissimilar pairs to be far apart.
Triplet Loss: Uses an anchor, positive, and negative example to learn embeddings.
Cross-Entropy Loss: Used when similarity is treated as a classification problem (e.g., similar/dissimilar).
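The contrastive and triplet losses described above can be written in a few lines. The sketch below assumes PyTorch and uses one common formulation of each (squared distance for similar pairs and a squared hinge on the margin for dissimilar pairs; a hinge on the distance gap for triplets). The function names and the margin values are illustrative choices, not taken from the experiments in this article, which use binary cross-entropy instead.

```python
import torch

def euclidean(a, b, eps=1e-12):
    # eps keeps the gradient of sqrt finite when two embeddings coincide
    return ((a - b).pow(2).sum(dim=1) + eps).sqrt()

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """label = 1 for similar pairs, 0 for dissimilar pairs."""
    d = euclidean(emb_a, emb_b)
    pull = label * d.pow(2)                                     # similar: shrink distance
    push = (1 - label) * torch.clamp(margin - d, min=0).pow(2)  # dissimilar: push past margin
    return (pull + push).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """The anchor should be closer to the positive than to the negative by `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy embeddings: distances from the origin are 5 and 1
zeros = torch.zeros(2, 2)
others = torch.tensor([[3.0, 4.0], [0.6, 0.8]])
labels = torch.tensor([1.0, 0.0])
c_loss = contrastive_loss(zeros, others, labels, margin=2.0)

# Negative is already far away, so the triplet constraint is satisfied
t_loss = triplet_loss(torch.zeros(1, 2), torch.tensor([[0.0, 1.0]]), torch.tensor([[3.0, 4.0]]))
```

For a binary similar/dissimilar label, `nn.BCEWithLogitsLoss` applied to a classifier on top of the embeddings is the cross-entropy option.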
Implementation Steps
Install Required Libraries:
Ensure you have the necessary libraries installed (the experiments below also use pandas and scikit-learn):
pip install transformers torch pandas scikit-learn
Load Pre-trained BERT Model:
Use the transformers library to load a pre-trained BERT model and tokenizer.
Define the Siamese Network:
Create a PyTorch model that uses the shared BERT model and a pooling layer.
Training Loop:
Train the model using pairs of sentences and their similarity labels.
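For the training-loop step, one detail worth sketching is how sentence pairs and labels can be batched while staying aligned. The helper below is a hypothetical, standard-library-only sketch (the name `iter_batches` and the `seed` parameter are assumptions for illustration); it also reshuffles the pair order, which is a common choice when training for several epochs.

```python
import random

def iter_batches(sent_a, sent_b, labels, batch_size, seed=None):
    """Yield shuffled (batch_a, batch_b, batch_labels), keeping each pair aligned."""
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)  # one shared permutation for all three lists
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        yield (
            [sent_a[i] for i in idx],
            [sent_b[i] for i in idx],
            [labels[i] for i in idx],
        )

# Toy data: pair i is ("a<i>", "b<i>") with label i % 2
sent_a = [f"a{i}" for i in range(5)]
sent_b = [f"b{i}" for i in range(5)]
labels = [i % 2 for i in range(5)]

batches = list(iter_batches(sent_a, sent_b, labels, batch_size=2, seed=0))
```

Each yielded batch can then be tokenized and passed to the model; the last batch may be smaller than `batch_size`.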
Experiment 1
import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained tokenizer (the DistilBERT encoder itself is loaded inside the model class)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize a list of sentences into padded, truncated tensors on the target device
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance, shared by both sentences (Siamese weight sharing)
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Add a classification head for binary classification
        self.classifier = nn.Linear(2304, 1)  # Input is (u, v, |u - v|): 768 * 3 = 2304

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get token embeddings for both sentences from the shared encoder
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)
        # Mean pooling over token embeddings
        # (note: this plain mean also averages over padding positions)
        embedding_a = torch.mean(outputs_a.last_hidden_state, dim=1)
        embedding_b = torch.mean(outputs_b.last_hidden_state, dim=1)
        # Concatenate (u, v, |u - v|) and pass through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # squeeze(-1) drops only the last dimension, so a batch of one keeps its batch dimension
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)
# Binary cross-entropy loss, applied directly to the logits
criterion = nn.BCEWithLogitsLoss()
# Initialize optimizer
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=2e-5)

# Prepare the data: Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()
# Select the first 10,000 question pairs
df = df.head(10000).copy()
df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)
sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()
# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)
        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)
        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )
        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / (len(X_train_a) / batch_size):.4f}')

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)
    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )
    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5
# Compute metrics
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')
Output:
Epoch 1/3, Loss: 0.5850
Epoch 2/3, Loss: 0.4584
Epoch 3/3, Loss: 0.3231
Test Accuracy: 0.7455
Test F1 Score: 0.5776
Observation:
The loss decreases steadily over the three epochs, so the model is learning. However, the Test Accuracy (0.7455) and Test F1 Score (0.5776) indicate only moderate performance. The F1 score, which scikit-learn computes for the positive class (duplicate pairs) by default, is considerably lower than the accuracy. This suggests that the model struggles to balance precision and recall on the positive class, which is typically the minority class in the Quora Question Pairs data.
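The gap between accuracy and F1 is easier to see on a small example. The following standard-library sketch uses made-up labels and predictions (not the experiment's data) and a hypothetical helper `binary_prf` to compute precision, recall, and F1 for the positive class next to accuracy.

```python
def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy, imbalanced labels: 3 positives out of 8 (illustrative only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = binary_prf(y_true, y_pred)
positive_rate = sum(y_true) / len(y_true)
```

Here the accuracy is 0.625 while the F1 score is only 0.4, because two of the three positives are missed. Checking the positive rate of the training labels (for example `sum(y_train) / len(y_train)`) is a quick way to confirm whether the classes are imbalanced.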
Experiment 2
import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained tokenizer (the DistilBERT encoder itself is loaded inside the model class)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize a list of sentences into padded, truncated tensors on the target device
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model with CLS pooling and a two-layer classifier
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance, shared by both sentences (Siamese weight sharing)
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Use CLS token for pooling
        self.pooling = 'cls'
        # Add a more complex classification head
        self.classifier = nn.Sequential(
            nn.Linear(768 * 3, 512),  # Input size is 2304 (768 * 3)
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get token embeddings for both sentences from the shared encoder
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)
        # Pooling strategy
        if self.pooling == 'cls':
            embedding_a = outputs_a.last_hidden_state[:, 0, :]  # CLS token
            embedding_b = outputs_b.last_hidden_state[:, 0, :]  # CLS token
        else:
            embedding_a = torch.mean(outputs_a.last_hidden_state, dim=1)  # Mean pooling
            embedding_b = torch.mean(outputs_b.last_hidden_state, dim=1)  # Mean pooling
        # Concatenate (u, v, |u - v|) and pass through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # squeeze(-1) drops only the last dimension, so a batch of one keeps its batch dimension
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)
# Binary cross-entropy loss with extra weight on the positive (duplicate) class
pos_weight = torch.tensor([2.0]).to(device)  # Adjust based on class imbalance
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# Initialize optimizer (lower learning rate than in Experiment 1)
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=1e-5)

# Prepare the data: Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()
# Select the first 10,000 question pairs
df = df.head(10000).copy()
df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)
sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()
# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)
        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)
        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )
        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / (len(X_train_a) / batch_size):.4f}')

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)
    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )
    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5
# Compute metrics
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')
Output:
Epoch 1/3, Loss: 0.8469
Epoch 2/3, Loss: 0.6645
Epoch 3/3, Loss: 0.5295
Test Accuracy: 0.7850
Test F1 Score: 0.7434
Observation:
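The results improve on Experiment 1: Test Accuracy increased from 0.7455 to 0.7850, and Test F1 Score increased substantially from 0.5776 to 0.7434. Several things changed at once (CLS pooling instead of mean pooling, a positive-class weight of 2.0, a two-layer classifier, and a lower learning rate of 1e-5), so the gain cannot be attributed to any single change, although the class weighting is a likely contributor to the higher F1 score. Note that the training losses are not directly comparable with Experiment 1, because pos_weight scales up the loss on positive examples.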
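To illustrate what the positive-class weight does, here is a standard-library sketch of the per-example weighted binary cross-entropy, following the formula documented for `BCEWithLogitsLoss` with `pos_weight`. The helper names are hypothetical, and the ratio of negatives to positives is shown as one common heuristic for choosing the weight; the experiments in this article use a fixed value of 2.0 instead.

```python
import math

def pos_weight_from_labels(labels):
    """Heuristic: number of negative examples divided by number of positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos

def weighted_bce(logit, label, pos_weight=1.0):
    """Per-example loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * label * math.log(p) + (1 - label) * math.log(1 - p))

# Toy labels: one positive for every two negatives
w = pos_weight_from_labels([1, 0, 0])

# At logit 0 the model is undecided (probability 0.5)
loss_pos = weighted_bce(0.0, 1, pos_weight=w)  # positive example, weighted
loss_neg = weighted_bce(0.0, 0, pos_weight=w)  # negative example, unweighted
```

With a weight of 2.0, a misjudged positive costs twice as much as an equally misjudged negative, which pushes the model toward higher recall on the positive class and also raises the reported loss values.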
Experiment 3
import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained tokenizer (the DistilBERT encoder itself is loaded inside the model class)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize a list of sentences into padded, truncated tensors on the target device
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model with dropout and attention-based pooling
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance, shared by both sentences (Siamese weight sharing)
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Attention-based pooling: one learned score per token
        self.attention = nn.Linear(768, 1)
        # Add a more complex classification head with dropout
        self.classifier = nn.Sequential(
            nn.Linear(768 * 3, 512),
            nn.ReLU(),
            nn.Dropout(0.1),  # Add dropout
            nn.Linear(512, 1)
        )

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get token embeddings for both sentences from the shared encoder
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)
        # Attention-based pooling: softmax over the sequence dimension, then a weighted sum
        # (note: the softmax here also assigns weight to padding positions)
        attention_weights_a = torch.softmax(self.attention(outputs_a.last_hidden_state), dim=1)
        embedding_a = torch.sum(attention_weights_a * outputs_a.last_hidden_state, dim=1)
        attention_weights_b = torch.softmax(self.attention(outputs_b.last_hidden_state), dim=1)
        embedding_b = torch.sum(attention_weights_b * outputs_b.last_hidden_state, dim=1)
        # Concatenate (u, v, |u - v|) and pass through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # squeeze(-1) drops only the last dimension, so a batch of one keeps its batch dimension
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)
# Binary cross-entropy loss with extra weight on the positive (duplicate) class
pos_weight = torch.tensor([2.0]).to(device)  # Adjust based on class imbalance
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
# Initialize optimizer with smaller learning rate
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=5e-6)

# Prepare the data: Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()
# Select the first 10,000 question pairs
df = df.head(10000).copy()
df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)
sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()
# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)
        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)
        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )
        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / (len(X_train_a) / batch_size):.4f}')

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)
    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )
    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5
# Compute metrics
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')
Output:
Epoch 1/3, Loss: 0.8693
Epoch 2/3, Loss: 0.7211
Epoch 3/3, Loss: 0.6196
Test Accuracy: 0.7840
Test F1 Score: 0.7497
Observation:
The results are essentially unchanged from Experiment 2: Test Accuracy is marginally lower (0.7840 vs. 0.7850), while Test F1 Score is marginally higher (0.7497 vs. 0.7434).
Differences this small may be within run-to-run variation, which suggests that the model might be reaching a performance plateau with the current architecture and training setup.
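One inexpensive knob that the experiments leave fixed is the 0.5 decision threshold. As a hedged sketch, the standard-library code below sweeps a few candidate thresholds and keeps the one with the best F1 score. The helper names and the toy probabilities are made up for illustration, and in practice the threshold should be chosen on a held-out validation split rather than on the test set.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 for the positive class when predicting 1 if prob > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, probs, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, probs, t))

# Toy validation data (illustrative only)
y_val = [1, 1, 0, 0, 1]
val_probs = [0.9, 0.45, 0.35, 0.1, 0.42]

chosen = best_threshold(y_val, val_probs, candidates=[0.3, 0.4, 0.5])
f1_default = f1_at_threshold(y_val, val_probs, 0.5)
f1_chosen = f1_at_threshold(y_val, val_probs, chosen)
```

In this toy case, two positives sit just below 0.5, so lowering the threshold to 0.4 recovers them without admitting the negative at 0.35.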
Colab Notebook:
https://drive.google.com/file/d/1AWadl30-08dV50OC6wjZjIfY9id59QrW/view?usp=sharing