If you’re looking for information on how to find data or a cell within a Pandas DataFrame or Series, check out a future post, Locating Data Within A DataFrame. This post is about finding substrings within a series of strings.
Often you want to know where a substring sits inside a bigger string. You could be trying to extract an address, remove a piece of text, or simply find the first instance of a substring.
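pd.Series.str.find() helps you locate substrings within larger strings. It works much like =FIND() in Excel or Google Sheets, with two differences: pandas counts positions from 0 rather than 1, and it returns -1 instead of an error when the substring is missing.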
Example: “day” is a substring within “Monday.” However, “day” is not a substring of “November,” since “day” does not appear in “November.”
Pseudo code: “Monday”.find(“day”) returns 3. Positions are counted from 0, so “day” starts at index 3, which is the 4th character in “Monday.”
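To make the counting concrete, here is a minimal sketch using Python's built-in str.find, which behaves the same way on a single string as the pandas method does on each element of a series. The variable names are only for illustration.

```python
# Plain Python: str.find returns the zero-based index of the first match,
# or -1 when the substring is absent.
monday_pos = "Monday".find("day")      # M=0, o=1, n=2, d=3
november_pos = "November".find("day")  # "day" never appears

print(monday_pos)    # 3
print(november_pos)  # -1
```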
But first, what is a string and substring?
- String = Data type within Python that represents text
- Substring = A piece of text within a larger piece of text
To find where a substring exists (if it does at all) within a larger series of strings, you need to call pd.Series.str.find().
Pandas find returns an integer giving the location of the substring (the number of characters from the left, counting from 0). It will return -1 if the substring does not exist.
Besides the substring itself, find takes two optional arguments: start and end.
- Start (default = 0): Where you want .find() to start looking for your substring. By default you’ll start at the beginning of the string (location 0).
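- End (default = None): Where you want .find() to finish looking for your substring. The end position itself is not included, just like a Python slice. By default the search runs to the end of the string.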
Note: You would only use start & end if you didn’t want to search the entire string.
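Here is a small sketch of how start and end narrow the search. The one-element series and the word "banana" are made up for this illustration. Notice that the whole substring has to fit inside the window, and that the returned position is still counted from the start of the full string.

```python
import pandas as pd

# Hypothetical one-element series, used only for illustration.
# "banana" -> b=0, a=1, n=2, a=3, n=4, a=5
demo = pd.Series(["banana"])

whole = demo.str.find("an")                    # search the entire string
from_two = demo.str.find("an", start=2)        # skip positions 0 and 1
capped = demo.str.find("an", start=2, end=4)   # only positions 2 and 3

print(whole.iloc[0])     # 1
print(from_two.iloc[0])  # 3
print(capped.iloc[0])    # -1, because "an" at position 3 would need position 4 too
```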
import pandas as pd
Pandas Find | pd.Series.str.find()
Say you have a series of strings and you want to find the position of a substring.
Pandas .find() will return the location (number of characters from the left) of a certain substring. Let's look at an example.
First, create a series of strings. Note: You can also do this with a column in a pandas DataFrame.
my_string_series = pd.Series(['San Francisco', 'Chicago', 'Traveling', 'Pandas', 'Remote Worker'], name="string_series")
Now say we want to find if and where the substring "cago" sits within each string in our series. In order to do this, we call .find("cago").
find_result = my_string_series.str.find("cago")  # Calling .find and passing "cago"
find_result.name = "find_result"  # naming the series so we can call it later
find_result  # displaying the results
0   -1
1    3
2   -1
3   -1
4   -1
Name: find_result, dtype: int64
To view the output easily, let's concatenate our original series with the result.
pd.concat([my_string_series, find_result], axis=1)
Notice how San Francisco, Traveling, Pandas, and Remote Worker all return -1 for .find(). This is because the substring "cago" does not exist within those strings.
However, "Chicago" returns 3. This is because "cago" starts at position 3 within Chicago!
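Since -1 always means "not found", the result of .find() can also be turned into a True/False mask. The sketch below rebuilds the same series so it runs on its own; the names has_cago and matches are made up for this example.

```python
import pandas as pd

my_string_series = pd.Series(
    ['San Francisco', 'Chicago', 'Traveling', 'Pandas', 'Remote Worker'],
    name="string_series",
)

# -1 means the substring was not found, so any other value marks a match.
has_cago = my_string_series.str.find("cago") != -1

# Keep only the strings that contain "cago".
matches = my_string_series[has_cago]
print(matches.tolist())  # ['Chicago']
```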
Let's try some more. This time, I want to find the first instance of the letter "o" within our series of strings.
find_result = my_string_series.str.find("o")
pd.concat([my_string_series, find_result], axis=1)
Traveling and Pandas return -1 because neither contains an "o", while San Francisco, Chicago, and Remote Worker return the position of their first "o" (12, 6, and 3).
What if I only wanted to search part of each string, from position 3 up to (but not including) position 8? Then we would pass start= and end=.
find_result = my_string_series.str.find("o", start=3, end=8)
pd.concat([my_string_series, find_result], axis=1)
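In this case, San Francisco does contain the letter "o", but only at position 12, which is outside positions 3 through 7, so .find() returns -1 for San Francisco. Chicago and Remote Worker both return results (6 and 3), and those positions are still counted from the start of the full string.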
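The same calls work on a DataFrame column. Here is a minimal sketch that puts the unrestricted search and the windowed search side by side; the DataFrame and the column names text, o_anywhere, and o_in_window are hypothetical and chosen just for this example.

```python
import pandas as pd

# Hypothetical DataFrame holding the same five strings in a column.
df = pd.DataFrame(
    {"text": ['San Francisco', 'Chicago', 'Traveling', 'Pandas', 'Remote Worker']}
)

# Search the whole string, then only positions 3 through 7.
df["o_anywhere"] = df["text"].str.find("o")
df["o_in_window"] = df["text"].str.find("o", start=3, end=8)

print(df)
```

Only the San Francisco row differs between the two new columns, since its first "o" falls outside the window.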