Python: Removing stop words
Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He is studying at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).
1. Use Set Lookup for Stopwords
Ensure that set_CustomStopWord is a set (not a list) because lookups in sets are O(1) on average, while lookups in lists are O(n). This will significantly speed up the filtering process.
set_CustomStopWord = set(set_CustomStopWord) # Ensure it's a set
df_dset['CleanToken'] = [[token for token in strg.split() if token not in set_CustomStopWord] for strg in df_dset['Clean']]
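The difference between the two containers can be checked with a small timing sketch. The sample data and the helper name `remove_stopwords` are made up for illustration; absolute timings will vary by machine, so no specific numbers are claimed here.

```python
import timeit

# Hypothetical sample data, for illustration only.
stopword_list = [f"word{i}" for i in range(1000)]
stopword_set = set(stopword_list)
tokens = ["keep", "word999", "me", "word0"] * 100


def remove_stopwords(tokens, stopwords):
    """Return the tokens that are not in `stopwords`, preserving order."""
    return [token for token in tokens if token not in stopwords]


# Both containers give the same result; only the lookup cost differs.
assert remove_stopwords(tokens, stopword_list) == remove_stopwords(tokens, stopword_set)

list_time = timeit.timeit(lambda: remove_stopwords(tokens, stopword_list), number=5)
set_time = timeit.timeit(lambda: remove_stopwords(tokens, stopword_set), number=5)
print(f"list: {list_time:.4f}s, set: {set_time:.4f}s")
```

The gap should widen as the stop-word collection grows, because each list lookup scans the list while each set lookup hashes the token once.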
2. Use str.split() with expand=True and stack()
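You can also express the filtering with pandas operations: split each string into columns, stack the tokens into one long Series, and filter that Series with isin. Note that expand=True builds a frame as wide as the longest document, so memory use grows with the longest row; benchmark it against the list comprehension on your own data before relying on it.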
import pandas as pd
# Split the 'Clean' column into tokens
split_tokens = df_dset['Clean'].str.split(expand=True)
# Stack the tokens into a single column and filter stopwords
filtered_tokens = split_tokens.stack().loc[lambda x: ~x.isin(set_CustomStopWord)]
# Group the filtered tokens back into lists
# Note: a row whose tokens were all stopwords has no group, so it gets NaN rather than []
df_dset['CleanToken'] = filtered_tokens.groupby(level=0).apply(list)
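A variant of the same idea that keeps rows whose tokens were all removed is sketched below. It swaps `str.split(expand=True)` plus `stack()` for `str.split()` plus `explode()`, which avoids the wide intermediate frame. The sample data is hypothetical, and the sketch assumes the DataFrame index is unique, since rows are regrouped by index label.

```python
import pandas as pd

# Hypothetical sample data; the names mirror the ones used above.
df_dset = pd.DataFrame({"Clean": ["the cat sat", "the of", "dogs bark"]})
set_CustomStopWord = {"the", "of"}

# One token per row; the original row label is repeated in the index.
tokens = df_dset["Clean"].str.split().explode()

# Keep tokens that are present (not NaN) and are not stop words.
kept = tokens[tokens.notna() & ~tokens.isin(set_CustomStopWord)]

# Regroup by the original row label, then restore rows that lost every token.
grouped = kept.groupby(level=0).agg(list)
df_dset["CleanToken"] = grouped.reindex(df_dset.index).apply(
    lambda value: value if isinstance(value, list) else []
)
```

The `reindex` step is what brings back the rows that vanished during filtering; the final `apply` turns their missing values into empty lists.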
3. Use numpy for Vectorized Operations
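You can also pull the column out as a numpy array before filtering. Be aware that the code below still tokenizes and filters inside a Python list comprehension, so it does not avoid Python loops; numpy only removes some pandas overhead during iteration and stores each result as an array. Any speed-up is therefore modest and should be measured.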
import numpy as np
# Convert the 'Clean' column to a numpy array
clean_texts = df_dset['Clean'].to_numpy()
# Split and filter stopwords; each row's result is stored as a numpy array
def filter_stopwords(texts, stopwords):
    return [np.array([token for token in text.split() if token not in stopwords]) for text in texts]
df_dset['CleanToken'] = filter_stopwords(clean_texts, set_CustomStopWord)
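For comparison, here is a sketch in which the membership test really is a single numpy call: all tokens are flattened into one array, tested once with `np.isin`, and then cut back into rows. The function name `filter_stopwords_flat` is hypothetical. Tokenizing still happens in Python, and this is not guaranteed to beat a plain set lookup, so treat it as something to benchmark rather than a recommendation.

```python
import numpy as np


def filter_stopwords_flat(texts, stopwords):
    """Filter stop words with one np.isin call over all tokens."""
    token_lists = [text.split() for text in texts]
    if not token_lists:
        return []

    # Flatten every token into one string array and remember the row lengths.
    lengths = [len(tokens) for tokens in token_lists]
    flat = np.array([t for tokens in token_lists for t in tokens], dtype=str)
    stop_array = np.array(sorted(stopwords), dtype=str)

    # Single vectorized membership test over all tokens.
    keep = ~np.isin(flat, stop_array)

    # Cut the flat arrays back into per-row pieces at the row boundaries.
    boundaries = np.cumsum(lengths)[:-1]
    rows = np.split(flat, boundaries)
    masks = np.split(keep, boundaries)
    return [row[mask].tolist() for row, mask in zip(rows, masks)]
```

A call such as `filter_stopwords_flat(df_dset['Clean'].to_numpy(), set_CustomStopWord)` would return one Python list per input row, including empty lists for rows that contained only stop words.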
4. Use multiprocessing for Parallel Processing
If you have a very large dataset and a multi-core CPU, you can parallelize the filtering process using Python's multiprocessing module.
from multiprocessing import Pool, cpu_count
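# Function to filter stopwords; it and set_CustomStopWord must live at module level so workers can find them
def filter_stopwords(text):
    return [token for token in text.split() if token not in set_CustomStopWord]
# Parallelize the filtering process; the guard is required where workers are spawned (Windows, macOS)
if __name__ == '__main__':
    with Pool(cpu_count()) as pool:
        df_dset['CleanToken'] = pool.map(filter_stopwords, df_dset['Clean'])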
5. Use swifter for Pandas Apply
The swifter library can automatically parallelize pandas apply operations, making it faster without requiring significant code changes.
pip install swifter
import swifter
df_dset['CleanToken'] = df_dset['Clean'].swifter.apply(lambda x: [token for token in x.split() if token not in set_CustomStopWord])
6. Precompile the Stopwords into a Regex Pattern
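If your stopwords are fixed and don't change, you can precompile them into a single regex pattern and reuse it. The code below tests each token against the pattern; applying the pattern's sub method to the whole string is an alternative. For exact-token matching a set lookup is usually hard to beat, so measure both on your data.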
import re
# Create a regex pattern for stopwords
stopword_pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, set_CustomStopWord))))
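# Remove tokens that are exactly a stopword; fullmatch is used because match would also drop tokens such as 'the-end' that merely start with one
df_dset['CleanToken'] = df_dset['Clean'].apply(lambda x: [token for token in x.split() if not stopword_pattern.fullmatch(token)])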
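The `re.sub` route mentioned above can be sketched as follows: stop words are deleted from the whole string in one pass, and the remainder is split afterwards. The sample stop words and the helper name `remove_stopwords` are hypothetical. The pattern uses whitespace lookarounds instead of `\b` so that only whole whitespace-delimited tokens are removed, which matches what `str.split()` treats as a token.

```python
import re

set_CustomStopWord = {"the", "of", "and"}  # hypothetical sample stop words

# Escape each stop word so regex metacharacters are matched literally.
alternatives = sorted(map(re.escape, set_CustomStopWord))

# The lookarounds require whitespace (or a string edge) on both sides.
stopword_pattern = re.compile(
    r"(?<!\S)(?:{})(?!\S)".format("|".join(alternatives))
)


def remove_stopwords(text):
    """Delete whole-token stop words, then tokenize what is left."""
    return stopword_pattern.sub(" ", text).split()
```

With a DataFrame this would be applied as `df_dset['Clean'].apply(remove_stopwords)`. Matching is case-sensitive here; pass `re.IGNORECASE` to `re.compile` if the text has not already been lowercased.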
Performance Comparison
| Method | Speed (Relative) | Best Use Case |
| --- | --- | --- |
| List comprehension with set | Fast | Small to medium datasets |
| Pandas str.split + stack() | Faster | Medium to large datasets |
| Numpy vectorized operations | Very Fast | Very large datasets |
| Multiprocessing | Fastest (CPU-bound) | Extremely large datasets with multi-core CPUs |
| Swifter | Fast | Easy parallelization for pandas apply |
| Regex pattern | Fast | Fixed stopwords and large texts |
Recommendation
For small to medium datasets, the list comprehension with a set is sufficient and easy to implement.
For large datasets, use pandas str.split + stack() or numpy vectorized operations.
For extremely large datasets, consider multiprocessing or swifter.