Python: Creating a Subsample of a Dataset

Creating a subsample of a dataset in Python is straightforward with the Pandas library. This post covers five common methods: random sampling, fractional sampling, stratified sampling, conditional filtering, and position-based selection with .iloc.

The following is an example dataset from the previous post (Improving Text Classification — a step by step approach):

import pandas as pd

data_text = [
    # Positive reviews
    "I love this product! It's amazing and works perfectly.",
    "Fantastic quality! I'm very happy with my purchase.",
    "Good value for the price. Satisfied with the performance.",
    "Exceeded my expectations. Will buy again!",
    "Highly recommend! Five stars!",
    "Absolutely wonderful experience. The product is superb.",
    "Brilliant product, top quality.",
    "The best purchase I've made in a while!",
    "Very pleased with this product, fantastic buy!",
    "Exceptional quality and value.",
    "Remarkable product, highly satisfied.",
    "An exceptional find, totally worth it.",
    "Really impressive, I couldn't be happier.",
    "Absolutely in love with this item!",
    "This product blew me away, amazing.",
    "Five stars all the way, no complaints.",
    "A fantastic purchase, exceeded my hopes.",
    "Such high quality, definitely recommend.",
    "A brilliant choice, top-notch item.",
    "A stellar product, worth every penny.",
    "Excellent performance, absolutely love it.",
    "I can't say enough good things about this.",
    "Superb craftsmanship, will buy again.",
    "Outstanding, exceeded all my expectations.",
    "Wonderful addition to my collection.",

    # Neutral reviews
    "It's okay, but I've seen better.",
    "Mediocre at best. Not impressed.",
    "Average product, nothing special.",
    "It's fine, does the job.",
    "Not good, not bad, just average.",
    "Serviceable product. Not much to say.",
    "It's alright, wouldn't write home about it.",
    "Meets expectations, no more, no less.",
    "A standard product, gets the job done.",
    "Neither here nor there, just okay.",
    "Decent item, nothing to rave about.",
    "It's passable, meets basic needs.",
    "A typical product, nothing extraordinary.",
    "Pretty average, does what it should.",
    "Nothing special, just another item.",
    "Barely meets the mark, but acceptable.",
    "It’s adequate, but far from amazing.",
    "Just a regular product, does the job.",
    "A run-of-the-mill product, works fine.",
    "The definition of average, nothing more.",
    "Meh, it's just okay.",
    "Expected more, but it's not bad.",
    "It's alright for its price, I guess.",
    "Could be better, but not the worst.",
    "It's decent, no major complaints.",

    # Negative reviews
    "This is the worst thing I've ever bought. Totally disappointing.",
    "Not worth the money. I regret buying it.",
    "Terrible customer service. The product arrived broken.",
    "Completely dissatisfied with the quality.",
    "Horrible experience. Would not recommend.",
    "Very poor quality, extremely disappointed.",
    "Awful product, don't waste your money.",
    "Absolutely terrible, never buying again.",
    "Low quality and unreliable.",
    "The worst product I have ever used.",
    "Terrible, wouldn’t wish this on anyone.",
    "Absolute waste of money, so disappointing.",
    "Such poor quality, do not recommend.",
    "Horrendous experience, truly awful.",
    "Awful customer service, never again.",
    "Broke within days, very poor quality.",
    "A total failure, regret this purchase.",
    "Not even close to what I expected.",
    "Defective and useless, avoid at all costs.",
    "Worst product ever, extremely dissatisfied.",
    "A nightmare to use, don't buy it.",
    "Complete garbage, regret every penny.",
    "Utterly disappointing, can't believe it.",
    "A real letdown, avoid this product.",
    "Fails to meet any standards, terrible."
]

# Corresponding labels
data_label = [
    2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,  # Positive
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  # Neutral
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0   # Negative
]

# Create DataFrame
df = pd.DataFrame({
    'text': data_text,
    'label': data_label
})

df

Output:

text label
0 I love this product! It's amazing and works pe... 2
1 Fantastic quality! I'm very happy with my purc... 2
2 Good value for the price. Satisfied with the p... 2
3 Exceeded my expectations. Will buy again! 2
4 Highly recommend! Five stars! 2
... ... ...
70 A nightmare to use, don't buy it. 0
71 Complete garbage, regret every penny. 0
72 Utterly disappointing, can't believe it. 0
73 A real letdown, avoid this product. 0
74 Fails to meet any standards, terrible. 0

[1] Random Sampling

You can use the sample() method to randomly select a specified number of rows from your DataFrame.

import pandas as pd

# Assuming df is your DataFrame
subsample = df.sample(n=10, random_state=42)  # Select 10 random rows

subsample

Output:

 text label
4 Highly recommend! Five stars! 2
63 Horrendous experience, truly awful. 0
10 Remarkable product, highly satisfied. 2
0 I love this product! It's amazing and works pe... 2
35 Decent item, nothing to rave about. 1
61 Absolute waste of money, so disappointing. 0
28 It's fine, does the job. 1
12 Really impressive, I couldn't be happier. 2
69 Worst product ever, extremely dissatisfied. 0
64 Awful customer service, never again. 0
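
As a small sketch of how random_state and replace behave, the snippet below uses a hypothetical toy DataFrame (the name toy and its contents are made up for illustration, not the review dataset above). It assumes only the standard sample() parameters n, random_state, and replace.

```python
import pandas as pd

# Hypothetical toy data, not the review dataset from this post
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(20)],
    "label": [i % 3 for i in range(20)],
})

# The same random_state gives the same rows on every call
first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)
print(first.equals(second))

# By default rows are drawn without replacement, so no row repeats
print(first.index.is_unique)

# replace=True allows drawing more rows than the DataFrame holds
boot = toy.sample(n=30, replace=True, random_state=42)
print(len(boot))
```

Leaving out random_state gives a different draw each time, which is usually fine for exploration but makes results harder to reproduce.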

[2] Fractional Sampling

If you want to select a fraction of your DataFrame, you can specify the frac parameter.

subsample = df.sample(frac=0.1, random_state=42)  # Select about 10% of the rows (here 8 of 75)

subsample

Output:

 text label
4 Highly recommend! Five stars! 2
63 Horrendous experience, truly awful. 0
10 Remarkable product, highly satisfied. 2
0 I love this product! It's amazing and works pe... 2
35 Decent item, nothing to rave about. 1
61 Absolute waste of money, so disappointing. 0
28 It's fine, does the job. 1
12 Really impressive, I couldn't be happier. 2
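
The following is a minimal sketch of two common uses of frac, again on a hypothetical toy DataFrame (the name toy and the column value are assumptions for illustration). It shows a 10% sample and the frac=1 idiom, which returns every row in random order.

```python
import pandas as pd

# Hypothetical toy data with 100 rows
toy = pd.DataFrame({"value": range(100)})

# 10% of 100 rows
tenth = toy.sample(frac=0.1, random_state=0)
print(len(tenth))

# frac=1 keeps every row but in random order, i.e. a shuffle;
# reset_index(drop=True) discards the old index afterwards
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)
print(len(shuffled))
```

Use either n or frac in a single call, not both.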

[3] Stratified Sampling

If the subsample must keep roughly the same class proportions as the full dataset, you can use the train_test_split function from scikit-learn with the stratify argument and keep the test split as your subsample.

from sklearn.model_selection import train_test_split

# Assuming 'label' is your categorical column
train, subsample = train_test_split(df, test_size=0.1, stratify=df['label'], random_state=42)

subsample

Output:

 text label
64 Awful customer service, never again. 0
41 It’s adequate, but far from amazing. 1
36 It's passable, meets basic needs. 1
60 Terrible, wouldn’t wish this on anyone. 0
34 Neither here nor there, just okay. 1
19 A stellar product, worth every penny. 2
20 Excellent performance, absolutely love it. 2
1 Fantastic quality! I'm very happy with my purc... 2
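
If you would rather stay within Pandas, one alternative (not the method used above) is to sample inside each group with groupby(...).sample(), which is available in pandas 1.1 and later. The sketch below assumes a hypothetical toy DataFrame with three equally sized classes.

```python
import pandas as pd

# Hypothetical toy data: 10 rows per class
toy = pd.DataFrame({
    "text": [f"review {i}" for i in range(30)],
    "label": [0] * 10 + [1] * 10 + [2] * 10,
})

# Draw a fixed number of rows from each class
per_class = toy.groupby("label").sample(n=3, random_state=42)
print(per_class["label"].value_counts().sort_index())

# Or draw the same fraction from each class, which preserves proportions
per_class_frac = toy.groupby("label").sample(frac=0.2, random_state=42)
print(len(per_class_frac))
```

Sampling a fixed n per class yields a balanced subsample, while sampling a fixed frac per class keeps the original class proportions.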

[4] Conditional Sampling

You can also create a subsample based on a condition, for example by selecting only the rows where a specific column meets that condition. Strictly speaking this is filtering rather than sampling, since every matching row is kept.

subsample = df[df['label'] > 1]  # Select rows where label is greater than 1 (positive reviews)

subsample

Output:

 text label
0 I love this product! It's amazing and works pe... 2
1 Fantastic quality! I'm very happy with my purc... 2
2 Good value for the price. Satisfied with the p... 2
3 Exceeded my expectations. Will buy again! 2
4 Highly recommend! Five stars! 2
5 Absolutely wonderful experience. The product i... 2
6 Brilliant product, top quality. 2
7 The best purchase I've made in a while! 2
8 Very pleased with this product, fantastic buy! 2
9 Exceptional quality and value. 2
10 Remarkable product, highly satisfied. 2
11 An exceptional find, totally worth it. 2
12 Really impressive, I couldn't be happier. 2
13 Absolutely in love with this item! 2
14 This product blew me away, amazing. 2
15 Five stars all the way, no complaints. 2
16 A fantastic purchase, exceeded my hopes. 2
17 Such high quality, definitely recommend. 2
18 A brilliant choice, top-notch item. 2
19 A stellar product, worth every penny. 2
20 Excellent performance, absolutely love it. 2
21 I can't say enough good things about this. 2
22 Superb craftsmanship, will buy again. 2
23 Outstanding, exceeded all my expectations. 2
24 Wonderful addition to my collection. 2
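
Conditions can also be combined, or followed by a random draw. The sketch below uses a small hypothetical DataFrame (toy and its contents are made up for illustration) and shows a combined boolean mask, isin(), and filtering before sampling.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "text": ["great", "fine", "bad", "superb value", "awful", "ok"],
    "label": [2, 1, 0, 2, 0, 1],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
short_non_negative = toy[(toy["label"] >= 1) & (toy["text"].str.len() <= 5)]
print(short_non_negative)

# isin() keeps rows whose value belongs to a given set
extremes = toy[toy["label"].isin([0, 2])]
print(extremes)

# Filter first, then sample from the matching rows
one_positive = toy[toy["label"] == 2].sample(n=1, random_state=42)
print(one_positive)
```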

[5] Using `.iloc` for Index-Based Subsampling

You can use .iloc to select rows by their integer positions, regardless of the index labels. Unlike sample(), this selection is deterministic.

# Select the last 20 rows
subsample = df.iloc[-20:]

subsample

Output:

 text label
55 Very poor quality, extremely disappointed. 0
56 Awful product, don't waste your money. 0
57 Absolutely terrible, never buying again. 0
58 Low quality and unreliable. 0
59 The worst product I have ever used. 0
60 Terrible, wouldn’t wish this on anyone. 0
61 Absolute waste of money, so disappointing. 0
62 Such poor quality, do not recommend. 0
63 Horrendous experience, truly awful. 0
64 Awful customer service, never again. 0
65 Broke within days, very poor quality. 0
66 A total failure, regret this purchase. 0
67 Not even close to what I expected. 0
68 Defective and useless, avoid at all costs. 0
69 Worst product ever, extremely dissatisfied. 0
70 A nightmare to use, don't buy it. 0
71 Complete garbage, regret every penny. 0
72 Utterly disappointing, can't believe it. 0
73 A real letdown, avoid this product. 0
74 Fails to meet any standards, terrible. 0
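
A few other position-based patterns, sketched on a hypothetical toy DataFrame with a letter index to make clear that .iloc works on positions and not on labels:

```python
import pandas as pd

# Hypothetical toy data with a non-numeric index
toy = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

# First three rows
first_three = toy.iloc[:3]
print(first_three)

# Every third row, starting from position 0 (systematic subsample)
every_third = toy.iloc[::3]
print(every_third)

# Specific positions given as a list
picked = toy.iloc[[0, 4, 9]]
print(picked)
```

Taking every k-th row is only representative when the row order is unrelated to the values; in a dataset sorted by label, like the one in this post, slicing can return a single class.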

In this post, we explored five ways to create a subsample of a DataFrame: random sampling and fractional sampling with sample(), stratified sampling with scikit-learn's train_test_split, conditional filtering with a boolean mask, and position-based selection with .iloc. These techniques let you work with a manageable subset of a larger dataset during analysis and preprocessing. Keep in mind that only random and stratified sampling aim for a representative subset; conditional and position-based selection deliberately pick specific rows.
