Python’s Pandas library can be a bit of an oddity to students. Students coming from a pure Python background have likely not had the opportunity or need to explore its data-analysis-focused tooling. Physicists are commonly extremely familiar with Numpy arrays, but have had fewer occasions to need the mixture of text and numeric data types that Pandas dataframes provide. And students coming from a background in R already have R, which provides most of the same general structure and functionality. This guide is an attempt to introduce the basics, so that students can get up and running with the least amount of trouble.
I consider this guide a living document, so it is subject to change, expansion, and revision.
1 Installation and Import
If you do not already have Pandas on your system, you’ll need to start there. On most Linux systems, it is probably available through your distribution repositories as python-pandas. On Windows or macOS systems, the easiest route is to install it with Pip, the Python package manager:
pip3 install pandas
Once installed, you’ll need to import the Pandas library into whatever script or notebook you are working in. The overwhelming convention is to import the library under the name pd:
import pandas as pd
At which point you can continue with the rest of this guide.
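A quick way to confirm that the install and import worked is to print the installed version. This is only a minimal sanity check; the version string you see will depend on your system.

```python
import pandas as pd

# If this import succeeds, the library is installed and usable
print(pd.__version__)
```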
2 The Core Data Types
Pandas introduces two fundamental data types which are used extensively: Series and DataFrames.
2.1 Series
A Pandas series represents a single column or array of information. In this respect it is very similar to a Python list, except that a single data type is associated with all values within the series. Additionally, each row has an associated index label, which may or may not be a simple integer.
Creating a series is as simple as passing in a list or array:
S = pd.Series(['Apple', 'Banana', 'Cherry', 'Date'])
S
0 Apple
1 Banana
2 Cherry
3 Date
dtype: object
In the output, the data values are listed in the right column, one entry per row. To the left is the default index used if none is provided, which consists of standard integer values counting up from 0. At the bottom is the associated data type, which is listed as object, the general-purpose type used here for text values.
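For contrast, here is a small sketch of a series built from numbers instead of text. The variable name temps and its values are made up for illustration; the point is only that the data type follows from the values passed in.

```python
import pandas as pd

# A series built from floats gets a numeric dtype rather than object
temps = pd.Series([21.5, 19.0, 23.25])

print(temps.dtype)          # float64
print(list(temps.index))    # [0, 1, 2]
print(temps.tolist())       # [21.5, 19.0, 23.25]
```

The index is still the default integer index, since none was supplied.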
You can optionally provide a name and/or an index array when creating the series. This will associate a name with the entire column or overwrite the standard index values:
S2 = pd.Series(
    ['Aardvark', 'Bobcat', 'Cougar', 'Damsel Fly'],
    name='Animals',
    index=['A', 'B', 'C', 'D'],
)
S2
A Aardvark
B Bobcat
C Cougar
D Damsel Fly
Name: Animals, dtype: object
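A series can also be built from a Python dictionary, in which case the keys serve as the index labels. The sketch below uses a made-up series called legs to show this alongside the name and index attributes.

```python
import pandas as pd

# Dictionary keys become the index labels; values become the data
legs = pd.Series({'Aardvark': 4, 'Bobcat': 4, 'Damsel Fly': 6}, name='Legs')

print(legs.name)            # Legs
print(list(legs.index))     # ['Aardvark', 'Bobcat', 'Damsel Fly']
print(legs['Damsel Fly'])   # 6
```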
2.2 DataFrames
Whereas a series is a one-dimensional array of information, a dataframe is a two-dimensional grid of information. Within this grid, each column has the same number of rows (or, conversely, each row has the same number of columns). Every column has a name and data type associated with it; every row has an index associated with it. In my experience, the easiest way to create a dataframe from scratch is to use a Python dictionary, where the keys of the dictionary become the column headings and the values are arrays of the same length.
df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True]
})
df
|  | name | age | enrolled |
|---|---|---|---|
| 0 | Billy | 23 | True |
| 1 | Henry | 25 | False |
| 2 | Jill | 21 | False |
| 3 | Beth | 19 | True |
Note that in this instance the default index is used since none was provided.
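Once a dataframe exists, a few attributes are handy for checking that it came out as intended. The sketch below rebuilds the same df and inspects its shape, column names, and per-column data types.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

print(df.shape)           # (4, 3): four rows, three columns
print(list(df.columns))   # ['name', 'age', 'enrolled']
print(df.dtypes)          # one data type per column
```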
You can also create a dataframe from a 2D array, where each interior list is considered a row:
df2 = pd.DataFrame(
    [
        [1, 3, 5, 7],
        ['A', 'B', 'C', 'D'],
        [2.3, 9.1, 7.5, 5.7],
    ],
    columns=['C1', 'C2', 'C3', 'C4'],
)
df2
|  | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| 0 | 1 | 3 | 5 | 7 |
| 1 | A | B | C | D |
| 2 | 2.3 | 9.1 | 7.5 | 5.7 |
In this case, if the additional columns option is not provided, the default column names will be numbers, counting up from 0.
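To see the default column names, the same nested list can be passed without the columns option. The names rows and df_default below are placeholders chosen for this sketch.

```python
import pandas as pd

rows = [
    [1, 3, 5, 7],
    ['A', 'B', 'C', 'D'],
    [2.3, 9.1, 7.5, 5.7],
]

# No columns argument, so the columns are labelled 0, 1, 2, 3
df_default = pd.DataFrame(rows)

print(list(df_default.columns))   # [0, 1, 2, 3]
print(df_default.shape)           # (3, 4): three rows, four columns
```

Note that each column in this example mixes integers, strings, and floats, so Pandas cannot assign it a numeric data type.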
3 Making Selections
One of the fundamental operations when working with compound data structures is being able to select desired material. Pandas gives several different ways of achieving this.
3.1 Basic Indexing
When working with a series object, you can select an entry using standard square brackets. Doing so will return the value of the object at that index:
S[1]
'Banana'
The value in the square brackets refers to the listed index label, so if the index has been changed, the value in the square brackets must change as well:
S2['A']
'Aardvark'
When using square brackets to index from a dataframe, the value within the square brackets represents a column name. Note that this is different from classically indexing a two-dimensional array!
df['name']
0 Billy
1 Henry
2 Jill
3 Beth
Name: name, dtype: object
Note that when you select a column out of a dataframe, you get back a series!
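If you are selecting just a single element from a series or a single column from a dataframe, and the corresponding index or column name is a string, you can also use dot notation to access the value. This only works when the name is a valid Python identifier that does not clash with an existing attribute or method.
S2.A
'Aardvark'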
or
df.name
0 Billy
1 Henry
2 Jill
3 Beth
Name: name, dtype: object
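Dot notation is convenient, but it has limits that square brackets do not. The sketch below uses a hypothetical dataframe called survey, with column names picked specifically to show two cases where only square brackets reach the column.

```python
import pandas as pd

# Hypothetical dataframe whose column names are awkward for dot notation
survey = pd.DataFrame({
    'count': [3, 5, 2],
    'first name': ['Ann', 'Bo', 'Cy'],
})

# 'count' clashes with the DataFrame.count method, so the dot form
# returns the method, not the column
print(callable(survey.count))          # True
print(survey['count'].tolist())        # [3, 5, 2]

# A name containing a space is not a valid identifier, so only
# square brackets can reach it
print(survey['first name'].tolist())   # ['Ann', 'Bo', 'Cy']
```

For this reason, square brackets are the safer habit when column names come from an outside data file.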
If you want to select multiple values from a series or dataframe, they need to be passed as a list:
S2[['A', 'B']]
A Aardvark
B Bobcat
Name: Animals, dtype: object
When doing so, the result has the same type as the original, so selecting multiple values from a series gives another series.
Selecting multiple columns from a dataframe works the same way:
df[['name', 'enrolled']]
|  | name | enrolled |
|---|---|---|
| 0 | Billy | True |
| 1 | Henry | False |
| 2 | Jill | False |
| 3 | Beth | True |
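The difference between passing a single name and passing a list shows up in the type of the result, even when the list holds only one name. A minimal sketch, reusing the df from earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

single = df['name']      # one column name -> Series
double = df[['name']]    # list with one column name -> DataFrame

print(type(single).__name__)   # Series
print(type(double).__name__)   # DataFrame
print(double.shape)            # (4, 1)
```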
3.2 Location Indexing
The above works well for series, but for a dataframe it generally only gets us whole columns. Sometimes you want to select both a subset of columns and a subset of rows. In these cases, the loc attribute allows you to specify both which rows and which columns you want. Note that you still use square brackets after df.loc, which can take some getting used to. For example, to select the first three names from the df dataframe, you could write:
df.loc[0:2, 'name']
0 Billy
1 Henry
2 Jill
Name: name, dtype: object
Note that if the output is one-dimensional, then a series object is returned. If multiple rows and columns are selected, then a dataframe object will be returned.
When providing slicing syntax to the rows or columns of .loc, it is important to realize that here the endpoints are included! The above grabs the rows with index values of 0, 1 and 2. This is different from how Python’s normal range function works, and indeed different from how .iloc works, as we’ll see shortly.
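The sketch below checks the inclusive endpoint on the df from earlier, and also shows that lists of labels can be given for both rows and columns. The names first_three and subset are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

# Slice of labels 0 through 2: the endpoint 2 is included
first_three = df.loc[0:2, 'name']
print(first_three.tolist())     # ['Billy', 'Henry', 'Jill']

# Lists of row labels and column names give back a dataframe
subset = df.loc[[0, 3], ['name', 'age']]
print(subset.shape)             # (2, 2)
print(subset.loc[3, 'name'])    # Beth
```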
It is important to realize that, with loc, if your row indexes have been set to something different, you specify those indexes directly. So, for example, if you had the dataframe:
df3 = pd.DataFrame(
    {'name': ['Jill', 'Ben', 'Jane', 'Beth'],
     'age': [21, 19, 20, 20]},
    index=['001', '002', '003', '004'],
)
df3
|  | name | age |
|---|---|---|
| 001 | Jill | 21 |
| 002 | Ben | 19 |
| 003 | Jane | 20 |
| 004 | Beth | 20 |
Then you’d need to select out the rows using that specific index:
df3.loc['002':'004', :]
|  | name | age |
|---|---|---|
| 002 | Ben | 19 |
| 003 | Jane | 20 |
| 004 | Beth | 20 |
Here you can see as well that using a plain : for one of the dimension slices will do the normal Python action of taking everything from the start to the end, or all of the entries. This can be used to choose either all columns or all rows, as desired.
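As a small sketch of working with these string labels, the example below pulls out a single cell and a full column from df3, and checks which values actually count as row labels.

```python
import pandas as pd

df3 = pd.DataFrame(
    {'name': ['Jill', 'Ben', 'Jane', 'Beth'],
     'age': [21, 19, 20, 20]},
    index=['001', '002', '003', '004'],
)

# A single row label and a single column name give back one value
print(df3.loc['002', 'name'])        # Ben

# A plain : for the rows selects every row of the chosen column
print(df3.loc[:, 'age'].tolist())    # [21, 19, 20, 20]

# The labels are strings, so the integer 1 is not a valid row label
print('002' in df3.index)            # True
print(1 in df3.index)                # False
```

Because 1 is not among the labels, asking loc for row 1 on this dataframe would fail rather than return the second row.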
Using loc can be particularly important if you are trying to assign values to a cell or collection of cells in a dataframe, since doing something like this:
# DON'T DO THIS!
df['age'][3] = 20
technically first selects out a series object, which may be a copy of the original data. Setting the 3 index of that series is therefore not guaranteed to change the underlying dataframe, as was desired. Using
# This is the way
df.loc[3, 'age'] = 20
however will directly alter the original dataframe.
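A short sketch of assignment through loc, using the df from earlier. It sets a single cell and then a range of rows in one column, and reads the values back to confirm that the dataframe itself changed.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

# Assign to one cell: row label 3, column 'age'
df.loc[3, 'age'] = 20
print(df.loc[3, 'age'])           # 20

# Assign to several rows at once: labels 0 and 1 (endpoint included)
df.loc[0:1, 'enrolled'] = True
print(df['enrolled'].tolist())    # [True, True, False, True]
```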
3.3 Index Location Indexing
Sometimes it is useful to ignore all the column or row labels and just work with numeric positions, to talk about the 3rd column or 9th row, for instance. In these cases, the attribute .iloc will do what you want. That is to say, if you wanted to select the first two rows and first two columns of a dataframe:
df.iloc[:2, :2]
|  | name | age |
|---|---|---|
| 0 | Billy | 23 |
| 1 | Henry | 25 |
delivers exactly what you want, without needing to worry about column names or row indexes (which might not be numeric!). Observe that ranges given to iloc work exactly as they do with Python’s range function: you specify the start and an exclusive limit, not the start and an included stop point. If you want to pick and choose rows or columns that do not follow a regular pattern, you can provide a list of positions:
df.iloc[::2, [0, 2]]
|  | name | enrolled |
|---|---|---|
| 0 | Billy | True |
| 2 | Jill | False |
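To put the two slicing conventions side by side, the sketch below selects the same two rows of df3 once by position and once by label. The names by_position and by_label are placeholders.

```python
import pandas as pd

df3 = pd.DataFrame(
    {'name': ['Jill', 'Ben', 'Jane', 'Beth'],
     'age': [21, 19, 20, 20]},
    index=['001', '002', '003', '004'],
)

# iloc: positions 0 and 1, the limit 2 is excluded
by_position = df3.iloc[0:2]

# loc: labels '001' through '002', the endpoint is included
by_label = df3.loc['001':'002']

print(by_position.equals(by_label))    # True

# Negative positions count from the end, as with Python lists
print(df3.iloc[-1, 0])                 # Beth

# A list of row positions combined with a single column position
print(df3.iloc[[0, 3], 1].tolist())    # [21, 20]
```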
…to be continued…