Python: Creating a Subsample of a Dataset

Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He is studying at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).
Creating a subsample of a dataset in Python can be easily achieved using the Pandas library. This post discusses a few common methods to create a subsample.
The following is an example dataset from the previous post (Improving Text Classification — a step by step approach):
import pandas as pd
data_text = [
# Positive reviews
"I love this product! It's amazing and works perfectly.",
"Fantastic quality! I'm very happy with my purchase.",
"Good value for the price. Satisfied with the performance.",
"Exceeded my expectations. Will buy again!",
"Highly recommend! Five stars!",
"Absolutely wonderful experience. The product is superb.",
"Brilliant product, top quality.",
"The best purchase I've made in a while!",
"Very pleased with this product, fantastic buy!",
"Exceptional quality and value.",
"Remarkable product, highly satisfied.",
"An exceptional find, totally worth it.",
"Really impressive, I couldn't be happier.",
"Absolutely in love with this item!",
"This product blew me away, amazing.",
"Five stars all the way, no complaints.",
"A fantastic purchase, exceeded my hopes.",
"Such high quality, definitely recommend.",
"A brilliant choice, top-notch item.",
"A stellar product, worth every penny.",
"Excellent performance, absolutely love it.",
"I can't say enough good things about this.",
"Superb craftsmanship, will buy again.",
"Outstanding, exceeded all my expectations.",
"Wonderful addition to my collection.",
# Neutral reviews
"It's okay, but I've seen better.",
"Mediocre at best. Not impressed.",
"Average product, nothing special.",
"It's fine, does the job.",
"Not good, not bad, just average.",
"Serviceable product. Not much to say.",
"It's alright, wouldn't write home about it.",
"Meets expectations, no more, no less.",
"A standard product, gets the job done.",
"Neither here nor there, just okay.",
"Decent item, nothing to rave about.",
"It's passable, meets basic needs.",
"A typical product, nothing extraordinary.",
"Pretty average, does what it should.",
"Nothing special, just another item.",
"Barely meets the mark, but acceptable.",
"It’s adequate, but far from amazing.",
"Just a regular product, does the job.",
"A run-of-the-mill product, works fine.",
"The definition of average, nothing more.",
"Meh, it's just okay.",
"Expected more, but it's not bad.",
"It's alright for its price, I guess.",
"Could be better, but not the worst.",
"It's decent, no major complaints.",
# Negative reviews
"This is the worst thing I've ever bought. Totally disappointing.",
"Not worth the money. I regret buying it.",
"Terrible customer service. The product arrived broken.",
"Completely dissatisfied with the quality.",
"Horrible experience. Would not recommend.",
"Very poor quality, extremely disappointed.",
"Awful product, don't waste your money.",
"Absolutely terrible, never buying again.",
"Low quality and unreliable.",
"The worst product I have ever used.",
"Terrible, wouldn’t wish this on anyone.",
"Absolute waste of money, so disappointing.",
"Such poor quality, do not recommend.",
"Horrendous experience, truly awful.",
"Awful customer service, never again.",
"Broke within days, very poor quality.",
"A total failure, regret this purchase.",
"Not even close to what I expected.",
"Defective and useless, avoid at all costs.",
"Worst product ever, extremely dissatisfied.",
"A nightmare to use, don't buy it.",
"Complete garbage, regret every penny.",
"Utterly disappointing, can't believe it.",
"A real letdown, avoid this product.",
"Fails to meet any standards, terrible."
]
# Corresponding labels
data_label = [
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, # Positive
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, # Neutral
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 # Negative
]
# Create DataFrame
df = pd.DataFrame({
'text': data_text,
'label': data_label
})
df
Output:
text label
0 I love this product! It's amazing and works pe... 2
1 Fantastic quality! I'm very happy with my purc... 2
2 Good value for the price. Satisfied with the p... 2
3 Exceeded my expectations. Will buy again! 2
4 Highly recommend! Five stars! 2
... ... ...
70 A nightmare to use, don't buy it. 0
71 Complete garbage, regret every penny. 0
72 Utterly disappointing, can't believe it. 0
73 A real letdown, avoid this product. 0
74 Fails to meet any standards, terrible. 0
[1] Random Sampling
You can use the sample() method to randomly select a specified number of rows from your DataFrame.
import pandas as pd
# Assuming df is your DataFrame
subsample = df.sample(n=10, random_state=42) # Select 10 random rows
subsample
Output:
text label
4 Highly recommend! Five stars! 2
63 Horrendous experience, truly awful. 0
10 Remarkable product, highly satisfied. 2
0 I love this product! It's amazing and works pe... 2
35 Decent item, nothing to rave about. 1
61 Absolute waste of money, so disappointing. 0
28 It's fine, does the job. 1
12 Really impressive, I couldn't be happier. 2
69 Worst product ever, extremely dissatisfied. 0
64 Awful customer service, never again. 0
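A short sketch of two sample() options that often matter in practice: random_state for reproducibility and replace for drawing with replacement. The demo DataFrame here is a hypothetical stand-in, not the review dataset above.

```python
import pandas as pd

# Hypothetical stand-in data (not the review dataset from this post)
demo = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [i % 3 for i in range(20)],
})

# The same random_state returns the same rows on every call
first = demo.sample(n=5, random_state=42)
second = demo.sample(n=5, random_state=42)
same_rows = first.equals(second)

# Sampling is without replacement by default, so no row is picked twice
no_duplicates = first.index.is_unique

# replace=True draws with replacement, which also allows n > len(demo)
bootstrap = demo.sample(n=30, replace=True, random_state=42)

print(same_rows, no_duplicates, len(bootstrap))
```

If you omit random_state, each call will generally return a different set of rows.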
[2] Fractional Sampling
If you want to select a fraction of your DataFrame, you can specify the frac parameter.
subsample = df.sample(frac=0.1, random_state=42) # Select 10% of the rows (8 of the 75 rows here)
subsample
Output:
text label
4 Highly recommend! Five stars! 2
63 Horrendous experience, truly awful. 0
10 Remarkable product, highly satisfied. 2
0 I love this product! It's amazing and works pe... 2
35 Decent item, nothing to rave about. 1
61 Absolute waste of money, so disappointing. 0
28 It's fine, does the job. 1
12 Really impressive, I couldn't be happier. 2
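The frac parameter is also handy for two related tasks, sketched below on a hypothetical stand-in DataFrame: shuffling all rows with frac=1, and keeping the rows that were not sampled. As far as I know, pandas rounds frac × len(df) to a whole number of rows, which is consistent with the 8 rows returned for 10% of 75 rows above.

```python
import pandas as pd

# Hypothetical stand-in data (not the review dataset from this post)
demo = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [i % 3 for i in range(20)],
})

# 25% of 20 rows -> 5 rows
subsample = demo.sample(frac=0.25, random_state=42)

# frac=1 returns every row in random order (a shuffle)
shuffled = demo.sample(frac=1, random_state=42)

# Rows that were NOT selected, found by dropping the sampled index labels
rest = demo.drop(subsample.index)

print(len(subsample), len(shuffled), len(rest))
```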
[3] Stratified Sampling
If you need the subsample to keep the same class proportions as the full dataset, you can use the train_test_split function from Scikit-learn: pass the categorical column to the stratify parameter and keep the test split as your subsample.
from sklearn.model_selection import train_test_split
# Assuming 'label' is your categorical column
train, subsample = train_test_split(df, test_size=0.1, stratify=df['label'], random_state=42)
subsample
Output:
text label
64 Awful customer service, never again. 0
41 It’s adequate, but far from amazing. 1
36 It's passable, meets basic needs. 1
60 Terrible, wouldn’t wish this on anyone. 0
34 Neither here nor there, just okay. 1
19 A stellar product, worth every penny. 2
20 Excellent performance, absolutely love it. 2
1 Fantastic quality! I'm very happy with my purc... 2
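If you prefer to stay within Pandas, a similar result can be sketched with groupby().sample(), which samples within each group separately. This is an alternative to the train_test_split approach, not part of the original example; it assumes pandas 1.1 or later, and the demo DataFrame is a hypothetical, deliberately imbalanced stand-in.

```python
import pandas as pd

# Hypothetical imbalanced stand-in data: 10 rows of label 0, 20 rows of label 1
demo = pd.DataFrame({
    "text": [f"review {i}" for i in range(30)],
    "label": [0] * 10 + [1] * 20,
})

# Take 20% of each label group, so the 1:2 class ratio is preserved
stratified = demo.groupby("label").sample(frac=0.2, random_state=42)

# Count the sampled rows per label
counts = stratified["label"].value_counts().to_dict()
print(counts)
```

Using n instead of frac in the same call would draw a fixed number of rows per class, which gives a balanced rather than a proportional subsample.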
[4] Conditional Sampling
You can also create a subsample based on a condition. For example, you can select only the rows where a specific column meets a condition.
subsample = df[df['label'] > 1] # Select rows where label is greater than 1 (the positive reviews)
subsample
Output:
text label
0 I love this product! It's amazing and works pe... 2
1 Fantastic quality! I'm very happy with my purc... 2
2 Good value for the price. Satisfied with the p... 2
3 Exceeded my expectations. Will buy again! 2
4 Highly recommend! Five stars! 2
5 Absolutely wonderful experience. The product i... 2
6 Brilliant product, top quality. 2
7 The best purchase I've made in a while! 2
8 Very pleased with this product, fantastic buy! 2
9 Exceptional quality and value. 2
10 Remarkable product, highly satisfied. 2
11 An exceptional find, totally worth it. 2
12 Really impressive, I couldn't be happier. 2
13 Absolutely in love with this item! 2
14 This product blew me away, amazing. 2
15 Five stars all the way, no complaints. 2
16 A fantastic purchase, exceeded my hopes. 2
17 Such high quality, definitely recommend. 2
18 A brilliant choice, top-notch item. 2
19 A stellar product, worth every penny. 2
20 Excellent performance, absolutely love it. 2
21 I can't say enough good things about this. 2
22 Superb craftsmanship, will buy again. 2
23 Outstanding, exceeded all my expectations. 2
24 Wonderful addition to my collection. 2
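Conditions can also be combined or chained with sample(). The following is a minimal sketch on a hypothetical stand-in DataFrame; the column names mirror the example above, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in data (not the review dataset from this post)
demo = pd.DataFrame({
    "text": ["Great quality", "Poor quality", "Just okay",
             "Broke quickly", "Top quality item", "Nothing special"],
    "label": [2, 0, 1, 0, 2, 1],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
negative_quality = demo[(demo["label"] == 0) & (demo["text"].str.contains("quality"))]

# isin() keeps rows whose label matches any value in the list
non_neutral = demo[demo["label"].isin([0, 2])]

# A filter can be followed by sample() to draw randomly within the condition
one_positive = demo[demo["label"] == 2].sample(n=1, random_state=42)

print(negative_quality.index.tolist(), non_neutral.index.tolist())
```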
[5] Using .iloc for Index-Based Subsampling
You can use .iloc to select rows by their integer positions (rather than by their index labels).
# Select the last 20 rows
subsample = df.iloc[-20:]
subsample
Output:
text label
55 Very poor quality, extremely disappointed. 0
56 Awful product, don't waste your money. 0
57 Absolutely terrible, never buying again. 0
58 Low quality and unreliable. 0
59 The worst product I have ever used. 0
60 Terrible, wouldn’t wish this on anyone. 0
61 Absolute waste of money, so disappointing. 0
62 Such poor quality, do not recommend. 0
63 Horrendous experience, truly awful. 0
64 Awful customer service, never again. 0
65 Broke within days, very poor quality. 0
66 A total failure, regret this purchase. 0
67 Not even close to what I expected. 0
68 Defective and useless, avoid at all costs. 0
69 Worst product ever, extremely dissatisfied. 0
70 A nightmare to use, don't buy it. 0
71 Complete garbage, regret every penny. 0
72 Utterly disappointing, can't believe it. 0
73 A real letdown, avoid this product. 0
74 Fails to meet any standards, terrible. 0
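Other slices work the same way. The sketch below, on a hypothetical stand-in DataFrame, shows a leading slice, a middle slice, and a step slice that takes every fourth row.

```python
import pandas as pd

# Hypothetical stand-in data (not the review dataset from this post)
demo = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [i % 3 for i in range(20)],
})

# First 5 rows (positions 0 to 4)
first_five = demo.iloc[:5]

# Rows at positions 5 to 9; the end position is excluded
middle = demo.iloc[5:10]

# Every 4th row, starting from position 0
every_fourth = demo.iloc[::4]

print(every_fourth.index.tolist())
```

Keep in mind that positional slices depend on row order. The review dataset above is sorted by label, which is why its last 20 rows are all negative reviews; a step slice or a prior shuffle may spread the selection more evenly.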
In the examples above, we explored five ways to create a subsample of a DataFrame: random sampling, fractional sampling, stratified sampling, conditional filtering, and position-based selection with .iloc. These techniques let researchers and data scientists work with a manageable subset of a larger dataset during analysis and preprocessing. Random and stratified sampling help keep the subsample representative of the full data, whereas conditional filtering and .iloc deliberately select specific rows, so choose the method that matches the goal of your analysis.