Pandas Sample – pd.DataFrame.sample()

Pandas Sample is used when you need to pull random rows or columns from a DataFrame.

Why would you ever want random rows? Say you’re running a data science model, and you want to test a subset of data. If you’re not using train test split, you can use DataFrame.sample() to pull a small section of rows.
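
Here is a minimal sketch of that idea: a quick train/test split done with sample(). The toy DataFrame and the names data, train and test are made up for illustration, and this is not a replacement for a proper train test split utility.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
data = pd.DataFrame({"x": range(10), "y": [v * 2 for v in range(10)]})

# Randomly pick 80% of the rows for training (random_state makes it repeatable)
train = data.sample(frac=0.8, random_state=0)

# Every row that was not sampled becomes the test set
test = data.drop(train.index)

print(len(train), len(test))  # 8 2
```

Because drop() removes the sampled index labels, the two pieces never overlap, as long as the index labels are unique.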

I use Pandas Sample mostly when I want to view a small section of data, but DataFrame.head() shows me data that is too homogeneous. I want some variability!

df.sample(n=number_of_samples, axis=rows_or_columns)

Pseudo Code: With your DataFrame, return random rows or columns.

Pandas Sample

Pandas Sample - Return a random sample of rows (or columns) from your DataFrame.

Sample Parameters

Sample has some of my favorite parameters of any Pandas function. Each one is packed with dense functionality.

  • n – The number of samples you want to return. You can specify either n or frac (below), but not both; if you specify neither, pandas returns a single row. Unless replace=True, ‘n’ can’t be larger than the number of rows in your DataFrame.
  • frac – If you did not specify an ‘n’ (above) then you can specify ‘frac’, or fraction. As in, what fraction of your dataset do you want returned? Ex: “Return 10% of my DataFrame: frac=.1”
  • replace (Default: False) – Do you want your rows to be able to be randomly picked twice? By default, each row can be picked at most once. However, if replace=True, then pandas can pick the same row again.
  • weights (Optional) – Super awesome parameter! By default, pandas will apply the same weights to all of your rows. Meaning, each row has an equal chance of being randomly picked. But what if you wanted some rows to have a higher chance to be picked than others? You can set a weight per row which will cause pandas to more heavily pick some rows than others. Check out the example for details.
  • random_state (Optional) – By default, pandas will pick different random numbers each time you sample. However, what if you wanted to pick the same random numbers each time? By setting random_state to an int, you’ll ensure consistency.
  • axis (Default: 0 or ‘index’) – Did you know you could also select random columns from your DataFrame? If you wanted to, set axis=1 or ‘columns’.
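
To make the rules around n and frac concrete, here is a small sketch on a made-up 5-row DataFrame (the name toy and the flag variables are just for illustration). It checks what pandas does when you break the rules described above.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(5)})  # hypothetical 5-row frame

# n and frac are mutually exclusive: passing both raises a ValueError
try:
    toy.sample(n=2, frac=0.5)
    both_allowed = True
except ValueError:
    both_allowed = False

# Without replacement, n cannot exceed the number of rows
try:
    toy.sample(n=6)
    oversample_allowed = True
except ValueError:
    oversample_allowed = False

# n equal to the row count is fine: you get every row back, shuffled
shuffled = toy.sample(n=5, random_state=1)

# With neither n nor frac, sample() returns a single row
one_row = toy.sample(random_state=1)

print(both_allowed, oversample_allowed, len(shuffled), len(one_row))
```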

Now the fun part, let’s take a look at a code sample.

In [2]:
import pandas as pd

Pandas Sample

Pandas Sample is a great way to pull a random sample of rows from your DataFrame. I use this most often when I need to subset my data, but I want to do it randomly.

Examples we'll run through:

  1. Simple sample setting 'n'
  2. Simple sample setting 'frac'
  3. Sample setting 'n' and replace
  4. Sample with weights
  5. Sample random columns

But first, let's start with a list of restaurants and bars in San Francisco:

In [24]:
df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))
df
Out[24]:
             name        type  AvgBill
0  Foreign Cinema  Restaurant   289.00
1       Liho Liho  Restaurant   224.00
2        500 Club         bar    80.50
3      The Square         bar    25.30
4            Page         bar    80.34
5        Tompkins         bar    34.20
6       Als Place  Restaurant    56.52

1. Simple sample setting 'n'

Specifying 'n' is specifying the number of random rows you want to return.

Notice how I specify n=2 and I get two random rows back.

In [25]:
df.sample(n=2)
Out[25]:
             name        type  AvgBill
3      The Square         bar     25.3
0  Foreign Cinema  Restaurant    289.0

If I do it again, I get another set of random rows.

In [26]:
df.sample(n=2)
Out[26]:
        name        type  AvgBill
6  Als Place  Restaurant    56.52
2   500 Club         bar    80.50
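
If you want the same rows back on every run, pass random_state. A minimal sketch using the same restaurant data (the variable names first and second are just for illustration):

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# Same seed, same rows: both calls return an identical sample
first = df.sample(n=2, random_state=1)
second = df.sample(n=2, random_state=1)

print(first.equals(second))  # True
```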

2. Simple sample setting 'frac'

Instead of setting 'n', you could specify 'frac', which tells pandas what fraction of your DataFrame you want randomly returned.

Here I'm setting frac=.4 or 40%. Since I have 7 rows, 40% is 2.8 rows, which pandas rounds to the nearest whole number: 3 rows.

In [27]:
df.sample(frac=.4)
Out[27]:
         name type  AvgBill
3  The Square  bar    25.30
5    Tompkins  bar    34.20
4        Page  bar    80.34
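
To see the rounding in action, here is a quick sketch on a made-up 7-row DataFrame (seven_rows is a hypothetical name). The row count is frac times the number of rows, rounded to the nearest whole number, so it can round down as well as up.

```python
import pandas as pd

seven_rows = pd.DataFrame({"value": range(7)})  # hypothetical 7-row frame

# 0.4 * 7 = 2.8, which rounds to 3 rows
n_for_40_percent = len(seven_rows.sample(frac=0.4))

# 0.3 * 7 = 2.1, which rounds to 2 rows
n_for_30_percent = len(seven_rows.sample(frac=0.3))

print(n_for_40_percent, n_for_30_percent)  # 3 2
```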

3. Sample setting 'n' and replace

By default, pandas will only select a random row once. However, if you wanted to be able to select the same row more than once, then you can set replace=True. This will 'replace' your rows back into the DataFrame for sampling again.

In this case, you can even set n greater than the number of rows in your DataFrame.

Notice the same row below is randomly picked twice now.

In [28]:
df.sample(n=5, replace=True)
Out[28]:
         name        type  AvgBill
3  The Square         bar    25.30
2    500 Club         bar    80.50
6   Als Place  Restaurant    56.52
5    Tompkins         bar    34.20
5    Tompkins         bar    34.20
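
Here is a small sketch of sampling more rows than the DataFrame holds. With 10 draws from 7 rows, at least one row has to repeat. The name oversampled is just for illustration.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# 10 draws from 7 rows only works because replace=True
oversampled = df.sample(n=10, replace=True, random_state=0)

print(len(oversampled))                       # 10
print(oversampled.index.duplicated().any())   # True: some row repeats
```

Notice the repeated rows keep their original index labels, so the index of the result is no longer unique.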

4. Sample with weights

By default, pandas gives each row an equal chance to be selected. However, what if you wanted to select restaurants more often than bars? You could give restaurants a higher chance (higher weights) to be picked.

First let me add weights to my DataFrame. I want each restaurant to be 5x as likely to be randomly picked as each bar. I'll give each restaurant a weight of 5 and each bar a weight of 1.

In [40]:
# Weight per venue type: each restaurant counts 5x as much as each bar
weights = {'Restaurant': 5,
           'bar': 1}
df['weights'] = df['type'].map(weights)
df
Out[40]:
             name        type  AvgBill  weights
0  Foreign Cinema  Restaurant   289.00        5
1       Liho Liho  Restaurant   224.00        5
2        500 Club         bar    80.50        1
3      The Square         bar    25.30        1
4            Page         bar    80.34        1
5        Tompkins         bar    34.20        1
6       Als Place  Restaurant    56.52        5

Here I'll pull a random sample of 3 rows from my DataFrame and pass my weights column. I set random_state to make sure I get the same rows each time. Notice how 2 restaurants pop up out of the 3 rows. That is because they had higher weights and therefore a bigger chance to be picked.

In [43]:
df.sample(n=3, weights='weights', random_state=42)
Out[43]:
        name        type  AvgBill  weights
1  Liho Liho  Restaurant   224.00        5
6  Als Place  Restaurant    56.52        5
5   Tompkins         bar    34.20        1
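
Three rows is too few to really see the weights at work, so here is a rough sketch that draws many rows with replacement and checks how often each type shows up. The names draws and share are made up for illustration. The weights add up to 15 for the three restaurants and 4 for the four bars, so restaurants should make up roughly 15/19 (about 79%) of the draws.

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))
df['weights'] = df['type'].map({'Restaurant': 5, 'bar': 1})

# Draw many rows with replacement so the weights show up in the frequencies
draws = df.sample(n=20000, replace=True, weights='weights', random_state=0)

# Fraction of draws that landed on each type
share = draws['type'].value_counts(normalize=True)
print(share)
```

The weights don't need to sum to 1. Pandas normalizes them for you.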

5. Sample random columns

Say you wanted to randomly select columns instead of rows. Just set axis=1.

In [45]:
df.sample(n=2, axis=1)
Out[45]:
         type  AvgBill
0  Restaurant   289.00
1  Restaurant   224.00
2         bar    80.50
3         bar    25.30
4         bar    80.34
5         bar    34.20
6  Restaurant    56.52

Remember, you'll get random items each time you run your code unless you set a random_state.

In [46]:
df.sample(n=2, axis=1)
Out[46]:
   AvgBill  weights
0   289.00        5
1   224.00        5
2    80.50        1
3    25.30        1
4    80.34        1
5    34.20        1
6    56.52        5
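
The same random_state trick works for columns. A minimal sketch, with cols_a and cols_b as made-up names, using axis='columns' (which means the same thing as axis=1):

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Page', 'bar', 80.34),
                   ('Tompkins', 'bar', 34.2),
                   ('Als Place', 'Restaurant', 56.52)],
                  columns=('name', 'type', 'AvgBill'))

# Same seed, same two columns on every run
cols_a = df.sample(n=2, axis='columns', random_state=7)
cols_b = df.sample(n=2, axis=1, random_state=7)

print(list(cols_a.columns) == list(cols_b.columns))  # True
print(cols_a.shape)  # (7, 2): all rows, two columns
```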
