A Siamese network is a neural network architecture built from two or more identical subnetworks that share the same weights. It is commonly used for similarity learning, where the goal is to determine how similar two inputs are. Sentence-BERT (SBERT) is a modification of the BERT model that uses a Siamese network architecture to generate fixed-size sentence embeddings, which are useful for tasks like semantic textual similarity, clustering, and information retrieval.

Key Components of the Siamese Network Using SBERT-like Implementation

Shared BERT Model:
- Use a pre-trained BERT model (or any transformer-based model like RoBERTa, DistilBERT, etc.) as the base encoder.
- The same BERT model is used for both input sentences (weight sharing).
Pooling Layer:
- After passing the input sentences through BERT, you need to pool the output embeddings to get fixed-size sentence embeddings.
- Common pooling strategies include:
  - Mean Pooling: Take the mean of all token embeddings.
  - Max Pooling: Take the element-wise maximum over all token embeddings.
  - CLS Token: Use the embedding of the [CLS] token as the sentence representation.
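
The three pooling strategies can be sketched in PyTorch as follows. This is a minimal illustration on a toy tensor rather than real BERT output; the function names (mean_pool, max_pool, cls_pool) are made up for this example, and the mean and max versions assume that an attention mask is available so that padding positions can be ignored.

```python
import torch

def mean_pool(hidden, mask):
    """Average token embeddings, ignoring padding positions."""
    mask = mask.unsqueeze(-1).to(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(dim=1)             # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # (batch, 1), avoids division by zero
    return summed / counts

def max_pool(hidden, mask):
    """Element-wise maximum over non-padding token embeddings."""
    masked = hidden.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return masked.max(dim=1).values

def cls_pool(hidden):
    """Embedding of the first token ([CLS] for BERT-style models)."""
    return hidden[:, 0, :]

# Toy batch: 2 sentences, 3 token positions, 2-dimensional embeddings.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
                       [[5.0, 6.0], [7.0, 8.0], [1.0, 0.0]]])
mask = torch.tensor([[1, 1, 0],    # last position of sentence 0 is padding
                     [1, 1, 1]])

mean_emb = mean_pool(hidden, mask)
max_emb = max_pool(hidden, mask)
cls_emb = cls_pool(hidden)
print(mean_emb, max_emb, cls_emb, sep="\n")
```

With a real encoder, hidden would be the last_hidden_state of the model output and mask the attention_mask returned by the tokenizer.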
Similarity Metric:
- Compute the similarity between the two sentence embeddings.
- Common similarity metrics include:
  - Cosine Similarity: Measures the cosine of the angle between the two vectors.
  - Euclidean Distance: Measures the straight-line distance between the two vectors (a distance, so smaller values mean more similar).
  - Dot Product: Computes the dot product of the two vectors.
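
As a rough sketch, the three metrics can be computed for a batch of embedding pairs like this. The vectors are toy values chosen so the results are easy to check by hand, not real sentence embeddings.

```python
import torch
import torch.nn.functional as F

# Two batches of toy "sentence embeddings" (2 pairs, 2 dimensions each).
emb_a = torch.tensor([[1.0, 0.0], [3.0, 4.0]])
emb_b = torch.tensor([[0.0, 1.0], [6.0, 8.0]])

cosine = F.cosine_similarity(emb_a, emb_b, dim=1)   # in [-1, 1], ignores vector length
euclidean = (emb_a - emb_b).norm(dim=1)             # >= 0, smaller means closer
dot = (emb_a * emb_b).sum(dim=1)                    # unbounded, sensitive to vector length

print(cosine, euclidean, dot, sep="\n")
```

The second pair points in the same direction, so its cosine similarity is 1 even though the Euclidean distance between the two vectors is 5. This is one reason cosine similarity is a common choice when embedding length is not meaningful.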
Loss Function:
- Use a loss function that encourages similar sentences to have high similarity scores and dissimilar sentences to have low similarity scores.
- Common loss functions include:
  - Contrastive Loss: Encourages similar pairs to be close and dissimilar pairs to be far apart.
  - Triplet Loss: Uses an anchor, positive, and negative example to learn embeddings.
  - Cross-Entropy Loss: Use this when similarity is framed as classification (e.g., similar/dissimilar).
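
To make the first two losses concrete, here is a minimal sketch written by hand in PyTorch. It assumes Euclidean distance and a margin of 1.0, and it uses one common formulation of the contrastive loss (with a factor of 0.5); other variants exist, so treat the exact form as an assumption rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """label is 1.0 for similar pairs and 0.0 for dissimilar pairs."""
    dist = (emb_a - emb_b).norm(dim=1)
    pos_term = label * dist.pow(2)                          # pull similar pairs together
    neg_term = (1 - label) * F.relu(margin - dist).pow(2)   # push dissimilar pairs past the margin
    return 0.5 * (pos_term + neg_term).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """The anchor should be closer to the positive than to the negative by at least margin."""
    d_pos = (anchor - positive).norm(dim=1)
    d_neg = (anchor - negative).norm(dim=1)
    return F.relu(d_pos - d_neg + margin).mean()

# Toy embeddings with distances of exactly 1.0 and 0.5 from the origin.
origin = torch.zeros(2, 2)
others = torch.tensor([[0.6, 0.8], [0.3, 0.4]])
labels = torch.tensor([1.0, 0.0])

c_loss = contrastive_loss(origin, others, labels)
t_loss = triplet_loss(anchor=torch.zeros(1, 2),
                      positive=torch.tensor([[0.3, 0.4]]),
                      negative=torch.tensor([[0.6, 0.8]]))
print(c_loss.item(), t_loss.item())
```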

Implementation Steps

Install Required Libraries:
Ensure you have the necessary libraries installed:
```
pip install transformers torch pandas scikit-learn
```
Load Pre-trained BERT Model:
Use the transformers library to load a pre-trained BERT model and tokenizer.
Define the Siamese Network:
Create a PyTorch model that uses the shared BERT model and a pooling layer.
Training Loop:
Train the model using pairs of sentences and their similarity labels.
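
The "Define the Siamese Network" and "Training Loop" steps can be sketched without downloading any model. In the example below, TinyEncoder is a hypothetical stand-in for BERT (an embedding layer followed by mean pooling) and the token ids are random; the point is only to show one encoder being reused for both inputs and a single training step on the [u, v, |u - v|] features.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: embeds token ids and mean-pools them."""
    def __init__(self, vocab_size=100, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids):
        return self.embed(input_ids).mean(dim=1)

class TinySiamese(nn.Module):
    def __init__(self, dim=8):
        super().__init__()
        self.encoder = TinyEncoder(dim=dim)        # one encoder, used for both inputs
        self.classifier = nn.Linear(dim * 3, 1)    # features: [u, v, |u - v|]

    def forward(self, ids_a, ids_b):
        u = self.encoder(ids_a)                    # same weights ...
        v = self.encoder(ids_b)                    # ... applied to the second input
        features = torch.cat([u, v, torch.abs(u - v)], dim=1)
        return self.classifier(features).squeeze(-1)

torch.manual_seed(0)
model = TinySiamese()
ids_a = torch.randint(0, 100, (4, 5))              # 4 pairs, 5 random token ids each
ids_b = torch.randint(0, 100, (4, 5))
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.BCEWithLogitsLoss()

# One training step: forward pass, loss, backward pass, parameter update.
logits = model(ids_a, ids_b)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```

Because both inputs pass through the same encoder object, gradients from both branches accumulate into a single set of encoder weights, which is what weight sharing means in practice.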

Experiment 1

import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer; the DistilBERT encoder itself is loaded inside SBERTLikeModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to tokenize sentences
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super(SBERTLikeModel, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Add a classification head for binary classification
        self.classifier = nn.Linear(2304, 1)  # Input size is 2304 (768 * 3)

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get embeddings for both sentences
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)

        # Mean pooling over all token positions
        # (padding positions are included; the attention mask is not applied here)
        embedding_a = torch.mean(outputs_a.last_hidden_state, dim=1)
        embedding_b = torch.mean(outputs_b.last_hidden_state, dim=1)

        # Concatenate [a, b, |a - b|] and pass the result through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # Squeeze only the last dimension so a batch of size 1 keeps shape (1,)
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)

# Define Cross-Entropy Loss for binary classification
criterion = nn.BCEWithLogitsLoss()

# Initialize optimizer
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=2e-5)

# Prepare the data
# Load the Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()

# Keep only the first 10,000 question pairs
df = df.head(10000).copy()

df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)

sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()

# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)

        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)

        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )

        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

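    # Average the loss over the number of batches (rounded up to count a final partial batch)
    num_batches = (len(X_train_a) + batch_size - 1) // batch_size
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / num_batches:.4f}')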

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)

    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )

    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5

    # Compute metrics
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')

Output:

Epoch 1/3, Loss: 0.5850
Epoch 2/3, Loss: 0.4584
Epoch 3/3, Loss: 0.3231
Test Accuracy: 0.7455
Test F1 Score: 0.5776

Observation:

The loss decreases over the epochs, so the model is learning. However, the Test Accuracy (0.7455) and Test F1 Score (0.5776) show only moderate performance. By default, f1_score reports F1 for the positive class (label 1, duplicate pairs), so the relatively low F1 indicates that the model struggles to balance precision and recall on duplicate pairs, which are typically the minority class in the Quora dataset.
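
To see how accuracy can look acceptable while F1 stays low on imbalanced data, the following sketch computes both from a confusion matrix. The counts are made up for illustration and are not taken from the experiment above; binary_metrics is a hypothetical helper name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Made-up counts for illustration only:
# 75 positive (duplicate) pairs and 125 negative pairs.
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=15, fn=35, tn=110)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the many correctly classified negatives keep accuracy at 0.75, while the missed positives pull recall, and with it F1, down.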

Experiment 2

import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer; the DistilBERT encoder itself is loaded inside SBERTLikeModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to tokenize sentences
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model with improved pooling and classifier
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super(SBERTLikeModel, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Use CLS token for pooling
        self.pooling = 'cls'
        # Add a more complex classification head
        self.classifier = nn.Sequential(
            nn.Linear(768 * 3, 512),  # Input size is 2304 (768 * 3)
            nn.ReLU(),
            nn.Linear(512, 1)
        )

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get embeddings for both sentences
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)

        # Pooling strategy
        if self.pooling == 'cls':
            embedding_a = outputs_a.last_hidden_state[:, 0, :]  # CLS token
            embedding_b = outputs_b.last_hidden_state[:, 0, :]  # CLS token
        else:
            embedding_a = torch.mean(outputs_a.last_hidden_state, dim=1)  # Mean pooling
            embedding_b = torch.mean(outputs_b.last_hidden_state, dim=1)  # Mean pooling

        # Concatenate [a, b, |a - b|] and pass the result through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # Squeeze only the last dimension so a batch of size 1 keeps shape (1,)
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)

# Define Cross-Entropy Loss with class weighting
pos_weight = torch.tensor([2.0]).to(device)  # Adjust based on class imbalance
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Initialize optimizer
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=1e-5)

# Prepare the data
# Load the Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()

# Keep only the first 10,000 question pairs
df = df.head(10000).copy()

df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)

sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()

# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)

        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)

        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )

        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

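    # Average the loss over the number of batches (rounded up to count a final partial batch)
    num_batches = (len(X_train_a) + batch_size - 1) // batch_size
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / num_batches:.4f}')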

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)

    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )

    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5

    # Compute metrics
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')

Output:

Epoch 1/3, Loss: 0.8469
Epoch 2/3, Loss: 0.6645
Epoch 3/3, Loss: 0.5295
Test Accuracy: 0.7850
Test F1 Score: 0.7434

Observation:

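The results improve on Experiment 1: Test Accuracy increased from 0.7455 to 0.7850, and Test F1 Score increased substantially from 0.5776 to 0.7434. Several things changed at once (CLS pooling instead of mean pooling, a weight of 2.0 on the positive class, a two-layer classifier, and a lower learning rate of 1e-5), so the gain cannot be attributed to any single change. The training loss values are also not directly comparable with Experiment 1, because the positive-class weight scales up the loss on duplicate pairs.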
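
The pos_weight value of 2.0 in the code above is fixed by hand. One common heuristic, shown here as a sketch on a made-up label vector, is to set it to the ratio of negative to positive examples in the training labels. The example also shows the effect of the weight on the per-example loss.

```python
import torch
from torch import nn

# Hypothetical training labels: 2 positives and 4 negatives.
labels = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])

num_pos = labels.sum()
num_neg = labels.numel() - num_pos
pos_weight = (num_neg / num_pos).reshape(1)        # 4 / 2 = 2.0

# reduction='none' keeps one loss value per example so the effect is visible.
plain = nn.BCEWithLogitsLoss(reduction='none')
weighted = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction='none')

logits = torch.zeros(6)                            # an untrained model predicting 0.5 everywhere
plain_loss = plain(logits, labels)
weighted_loss = weighted(logits, labels)
print(plain_loss)
print(weighted_loss)
```

Only the loss terms of positive examples are multiplied by pos_weight; negative examples are unchanged. This is also why the average training loss rises when a weight above 1 is introduced, even if the model itself is no worse.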

Experiment 3

import pandas as pd
import torch
from torch import nn
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics import accuracy_score, f1_score

# Use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer; the DistilBERT encoder itself is loaded inside SBERTLikeModel
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Function to tokenize sentences
def tokenize_sentences(sentences, tokenizer, max_length=32):
    return tokenizer(
        sentences,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length
    ).to(device)

# Define the SBERT-like Siamese Network Model with dropout and attention-based pooling
class SBERTLikeModel(nn.Module):
    def __init__(self):
        super(SBERTLikeModel, self).__init__()
        self.bert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # Use attention-based pooling
        self.attention = nn.Linear(768, 1)
        # Add a more complex classification head with dropout
        self.classifier = nn.Sequential(
            nn.Linear(768 * 3, 512),
            nn.ReLU(),
            nn.Dropout(0.1),  # Add dropout
            nn.Linear(512, 1)
        )

    def forward(self, input_ids_a, attention_mask_a, input_ids_b, attention_mask_b):
        # Get embeddings for both sentences
        outputs_a = self.bert(input_ids=input_ids_a, attention_mask=attention_mask_a)
        outputs_b = self.bert(input_ids=input_ids_b, attention_mask=attention_mask_b)

        # Attention-based pooling: score each token, softmax over the sequence dimension,
        # then take the weighted sum (padding positions are not masked out here)
        attention_weights_a = torch.softmax(self.attention(outputs_a.last_hidden_state), dim=1)
        embedding_a = torch.sum(attention_weights_a * outputs_a.last_hidden_state, dim=1)

        attention_weights_b = torch.softmax(self.attention(outputs_b.last_hidden_state), dim=1)
        embedding_b = torch.sum(attention_weights_b * outputs_b.last_hidden_state, dim=1)

        # Concatenate [a, b, |a - b|] and pass the result through the classifier
        combined_embedding = torch.cat([embedding_a, embedding_b, torch.abs(embedding_a - embedding_b)], dim=1)
        logits = self.classifier(combined_embedding)
        # Squeeze only the last dimension so a batch of size 1 keeps shape (1,)
        return logits.squeeze(-1)

# Initialize model
sbert_model = SBERTLikeModel().to(device)

# Define Cross-Entropy Loss with class weighting
pos_weight = torch.tensor([2.0]).to(device)  # Adjust based on class imbalance
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Initialize optimizer with smaller learning rate
optimizer = torch.optim.Adam(sbert_model.parameters(), lr=5e-6)

# Prepare the data
# Load the Quora Question Pairs dataset
df = pd.read_csv('quora_duplicate_questions.csv')  # Replace with your file path
df = df[['question1', 'question2', 'is_duplicate']].dropna()

# Keep only the first 10,000 question pairs
df = df.head(10000).copy()

df.columns = ['Sentence A', 'Sentence B', 'Label']
df['Label'] = df['Label'].astype(int)

sentence_a = df['Sentence A'].tolist()
sentence_b = df['Sentence B'].tolist()
labels = df['Label'].tolist()

# Split the data into training and testing sets
X_train_a, X_test_a, X_train_b, X_test_b, y_train, y_test = train_test_split(
    sentence_a, sentence_b, labels, test_size=0.2, random_state=42
)

# Training loop
epochs = 3  # SBERT typically uses fewer epochs due to fine-tuning
batch_size = 32
for epoch in range(epochs):
    sbert_model.train()
    epoch_loss = 0
    for i in range(0, len(X_train_a), batch_size):
        # Get batch
        batch_a = X_train_a[i:i + batch_size]
        batch_b = X_train_b[i:i + batch_size]
        batch_labels = torch.tensor(y_train[i:i + batch_size], dtype=torch.float32).to(device)

        # Tokenize sentences
        inputs_a = tokenize_sentences(batch_a, tokenizer)
        inputs_b = tokenize_sentences(batch_b, tokenizer)

        # Forward pass
        optimizer.zero_grad()
        logits = sbert_model(
            inputs_a['input_ids'], inputs_a['attention_mask'],
            inputs_b['input_ids'], inputs_b['attention_mask']
        )

        # Compute loss
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

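    # Average the loss over the number of batches (rounded up to count a final partial batch)
    num_batches = (len(X_train_a) + batch_size - 1) // batch_size
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / num_batches:.4f}')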

# Evaluation
sbert_model.eval()
with torch.no_grad():
    # Tokenize test sentences
    inputs_a = tokenize_sentences(X_test_a, tokenizer)
    inputs_b = tokenize_sentences(X_test_b, tokenizer)

    # Get logits
    logits = sbert_model(
        inputs_a['input_ids'], inputs_a['attention_mask'],
        inputs_b['input_ids'], inputs_b['attention_mask']
    )

    # Apply sigmoid to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()
    predictions = (probs > 0.5).astype(int)  # Threshold at 0.5

    # Compute metrics
    accuracy = accuracy_score(y_test, predictions)
    f1 = f1_score(y_test, predictions)

print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test F1 Score: {f1:.4f}')

Output:

Epoch 1/3, Loss: 0.8693
Epoch 2/3, Loss: 0.7211
Epoch 3/3, Loss: 0.6196
Test Accuracy: 0.7840
Test F1 Score: 0.7497

Observation:

The model's performance has stabilized: Test Accuracy (0.7840) is marginally lower than in Experiment 2 (0.7850), while Test F1 Score (0.7497) is slightly higher than before (0.7434).

These differences are marginal, which suggests that the model might be reaching a performance plateau with the current architecture and training setup. Because no random seed is set for model initialization, differences of this size may also fall within run-to-run variation.
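
One detail of Experiment 3 that may be worth testing is that its attention pooling applies the softmax over every token position, padding included. The sketch below shows a possible mask-aware variant; MaskedAttentionPooling is a hypothetical name, the input is random toy data, and whether this changes the reported scores has not been measured here.

```python
import torch
from torch import nn

class MaskedAttentionPooling(nn.Module):
    """Attention pooling that gives padding positions zero weight."""
    def __init__(self, hidden_size):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden, mask):
        scores = self.score(hidden).squeeze(-1)                 # (batch, seq)
        # -inf before the softmax turns into a weight of exactly 0 afterwards.
        # Assumes every sentence has at least one real token (e.g. [CLS]).
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)                  # (batch, seq)
        pooled = torch.sum(weights.unsqueeze(-1) * hidden, dim=1)
        return pooled, weights

torch.manual_seed(0)
pool = MaskedAttentionPooling(hidden_size=4)
hidden = torch.randn(2, 3, 4)                  # toy stand-in for last_hidden_state
mask = torch.tensor([[1, 1, 0],                # last position of sentence 0 is padding
                     [1, 1, 1]])
pooled, weights = pool(hidden, mask)
print(weights)
```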

Colab Notebook:

https://drive.google.com/file/d/1AWadl30-08dV50OC6wjZjIfY9id59QrW/view?usp=sharing
