Python: Removing stop words


Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He is studying at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).

1. Use Set Lookup for Stopwords

Make sure set_CustomStopWord is a set rather than a list: membership tests on a set are O(1) on average, while on a list they are O(n) in the number of stop words. With more than a handful of stop words, this change alone speeds up the filtering considerably.

set_CustomStopWord = set(set_CustomStopWord)  # Ensure it's a set
df_dset['CleanToken'] = [[token for token in strg.split() if token not in set_CustomStopWord] for strg in df_dset['Clean']]
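To see the effect of the container type on its own, here is a minimal sketch that times the same filter against a list and a set. The stop words, the sentence, and the helper name filter_tokens are made up for illustration, and the timings will vary by machine.

```python
import timeit

# Hypothetical sample data; the stop words and sentence are illustrative only.
stopword_list = [f"word{i}" for i in range(5000)] + ["the", "is", "a"]
stopword_set = set(stopword_list)
tokens = "the cat is on a mat".split() * 50


def filter_tokens(tokens, stopwords):
    """Keep the tokens that are not in the stop-word container."""
    return [t for t in tokens if t not in stopwords]


# Same function, same tokens; only the container type differs.
list_time = timeit.timeit(lambda: filter_tokens(tokens, stopword_list), number=5)
set_time = timeit.timeit(lambda: filter_tokens(tokens, stopword_set), number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

Both calls return the same tokens; only the lookup cost changes.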

2. Use str.split() with expand=True and stack()

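You can also let pandas split the strings and filter the stopwords with column-wise operations. Two caveats: expand=True creates one column per token position of the longest row, which can use a lot of memory, and rows whose tokens are all stopwords disappear from the grouped result, so they have to be restored as empty lists.

import pandas as pd

# Split the 'Clean' column into one column per token position (short rows are padded with missing values)
split_tokens = df_dset['Clean'].str.split(expand=True)

# Stack the tokens into a single column, drop the padding, and filter stopwords
stacked_tokens = split_tokens.stack().dropna()
filtered_tokens = stacked_tokens[~stacked_tokens.isin(set_CustomStopWord)]

# Group the filtered tokens back into lists; rows that lost every token get an empty list
token_lists = filtered_tokens.groupby(level=0).apply(list).reindex(df_dset.index)
df_dset['CleanToken'] = token_lists.apply(lambda tokens: tokens if isinstance(tokens, list) else [])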
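As an alternative to expand=True, the same idea can be sketched with Series.explode, which produces one row per token without building a wide, padded frame. This is a swapped-in variant, not the method above; the sample frame and the stopwords name are hypothetical, and the middle row shows what happens when every token is a stop word.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"Clean": ["the cat sat", "the a", "dogs bark"]})
stopwords = {"the", "a"}

# One row per token; the original row label is repeated in the index.
tokens = df["Clean"].str.split().explode().dropna()

# Drop stop words, then collect the surviving tokens per original row.
kept = tokens[~tokens.isin(stopwords)]
token_lists = kept.groupby(level=0).agg(list)

# Rows that lost every token are missing after the groupby; restore them as [].
token_lists = token_lists.reindex(df.index)
df["CleanToken"] = token_lists.apply(lambda v: v if isinstance(v, list) else [])

print(df["CleanToken"].tolist())
```

Whether this beats the plain list comprehension depends on the data, so it is worth timing on a sample first.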

3. Use numpy for Vectorized Operations

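You can also hand the column to numpy and store each row's tokens as a numpy array. Note that the function below still tokenizes and filters inside a Python list comprehension, so numpy does not vectorize the string work here; expect performance close to method 1 rather than a speed-up.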

import numpy as np

# Convert the 'Clean' column to a numpy array
clean_texts = df_dset['Clean'].to_numpy()

# Split and filter in a Python loop; each row's surviving tokens are wrapped in a numpy array
def filter_stopwords(texts, stopwords):
    return [np.array([token for token in text.split() if token not in stopwords]) for text in texts]

df_dset['CleanToken'] = filter_stopwords(clean_texts, set_CustomStopWord)
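The membership test itself can be pushed into numpy with np.isin by flattening all tokens into one array and remembering which row each token came from. The sketch below assumes made-up sample texts and variable names; splitting the strings is still a Python loop, and whether the overall result is faster than a set lookup has to be measured.

```python
import numpy as np

# Hypothetical sample data, for illustration only.
texts = ["the cat sat", "the a", "dogs bark"]
stopwords = np.array(["the", "a"])

# Tokenizing is still done in Python; numpy has no whitespace tokenizer here.
token_lists = [t.split() for t in texts]
row_ids = np.repeat(np.arange(len(texts)), [len(toks) for toks in token_lists])
flat = np.array([tok for toks in token_lists for tok in toks])

# Vectorized membership test over every token at once.
keep = ~np.isin(flat, stopwords)
kept_tokens, kept_rows = flat[keep], row_ids[keep]

# Rebuild one list per original row, including rows that end up empty.
counts = np.bincount(kept_rows, minlength=len(texts))
result = [chunk.tolist() for chunk in np.split(kept_tokens, np.cumsum(counts)[:-1])]

print(result)
```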

4. Use multiprocessing for Parallel Processing

If you have a very large dataset and a multi-core CPU, you can parallelize the filtering process using Python's multiprocessing module.

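from multiprocessing import Pool, cpu_count

# Worker function: it and set_CustomStopWord must be defined at module level so worker processes can find them
def filter_stopwords(text):
    return [token for token in text.split() if token not in set_CustomStopWord]

# The __main__ guard is required on platforms that spawn worker processes (e.g. Windows, macOS)
if __name__ == '__main__':
    # Parallelize the filtering; chunksize batches rows per task (1000 is an arbitrary starting value)
    with Pool(cpu_count()) as pool:
        df_dset['CleanToken'] = pool.map(filter_stopwords, df_dset['Clean'], chunksize=1000)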

5. Use swifter for Pandas Apply

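The swifter library wraps pandas apply and chooses an execution strategy for you: it tries a vectorized call first and otherwise decides between a Dask-based parallel apply and a plain pandas apply based on a timing estimate. It needs almost no code changes, but for string columns like this one it may fall back to a plain apply, so a speed-up is not guaranteed.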

pip install swifter


import swifter

df_dset['CleanToken'] = df_dset['Clean'].swifter.apply(lambda x: [token for token in x.split() if token not in set_CustomStopWord])

6. Precompile the Stopwords into a Regex Pattern

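If your stopwords are fixed, you can precompile them into a single regex pattern and test each token against it. Use fullmatch rather than match: match only anchors at the start of the token, so with the stopword "the" it would also drop a token such as "the-end". For exact-token matching this is usually no faster than a set lookup; it pays off mainly when stop words are better described as patterns.

import re

# Create a regex pattern for stopwords (re.escape guards against regex metacharacters)
stopword_pattern = re.compile('|'.join(map(re.escape, set_CustomStopWord)))

# Remove tokens that are exactly a stopword
df_dset['CleanToken'] = df_dset['Clean'].apply(lambda x: [token for token in x.split() if not stopword_pattern.fullmatch(token)])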
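A small, self-contained check of how Pattern.match and Pattern.fullmatch behave on hyphenated tokens when the pattern is wrapped in word boundaries. The stop words and tokens are made up for illustration.

```python
import re

# Hypothetical sample data, for illustration only.
stopwords = {"the", "a", "is"}
tokens = ["the", "the-end", "a-list", "is", "cat"]

pattern = re.compile(r"\b(?:{})\b".format("|".join(map(re.escape, stopwords))))

# match() succeeds if a stop word is found at the START of the token,
# so "the-end" and "a-list" are treated as stop words too.
with_match = [t for t in tokens if not pattern.match(t)]

# fullmatch() succeeds only if the WHOLE token is a stop word.
with_fullmatch = [t for t in tokens if not pattern.fullmatch(t)]

print(with_match)      # ['cat']
print(with_fullmatch)  # ['the-end', 'a-list', 'cat']
```

If the intent is to drop only tokens that are exactly stop words, fullmatch gives the same result as a set lookup.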

Performance Comparison

Method | Speed (relative, rough estimate) | Best use case
List comprehension with set | Fast | Small to medium datasets
Pandas str.split + stack() | Faster | Medium to large datasets
Numpy arrays | About the same as the list comprehension (still a Python loop) | When downstream code needs numpy arrays
Multiprocessing | Fastest (CPU-bound) | Extremely large datasets with multi-core CPUs
Swifter | Fast | Easy parallelization for pandas apply
Regex pattern | Fast | Fixed stopwords and large texts
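Relative speed depends heavily on row length, stop-word count, and hardware, so it is worth measuring on your own data. A minimal timing harness using only the standard library might look like this; the sample texts and function names are hypothetical, and only two of the methods are compared to keep the sketch short.

```python
import re
import timeit

# Hypothetical sample data, for illustration only.
stopwords = {"the", "a", "is", "on", "of"}
texts = ["the cat is on a mat", "a tale of two cities", "dogs bark"] * 1000

pattern = re.compile("|".join(map(re.escape, stopwords)))


def with_set(rows):
    """Method 1: list comprehension with a set lookup."""
    return [[t for t in row.split() if t not in stopwords] for row in rows]


def with_regex(rows):
    """Method 6: precompiled regex, whole-token match."""
    return [[t for t in row.split() if not pattern.fullmatch(t)] for row in rows]


# Both strategies must agree before their timings are worth comparing.
assert with_set(texts) == with_regex(texts)

for name, func in [("set lookup", with_set), ("regex fullmatch", with_regex)]:
    seconds = min(timeit.repeat(lambda: func(texts), number=5, repeat=3))
    print(f"{name}: {seconds:.4f}s for 5 passes")
```

Taking the minimum over several repeats reduces noise from other processes on the machine.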

Recommendation

  • For small to medium datasets, the list comprehension with a set is sufficient and easy to implement.

  • For large datasets, try pandas str.split + stack() and time it against the list comprehension; the numpy variant still loops in Python and is unlikely to be faster.

  • For extremely large datasets, consider multiprocessing or swifter.