A Crash-Course in Pandas

Author

Jed Rembold

Published

February 3, 2023

Python’s Pandas library can be a bit of an oddity to students. Students coming from a pure Python background have likely not had the opportunity or need to explore its data-analysis-focused tooling. Physicists are often very familiar with NumPy arrays, but have rarely needed the mixtures of text and numeric data types that Pandas dataframes provide. And students coming from a background in R already have R, which provides most of the same general structure and functionality. This guide is an attempt to introduce the basics, so that students can get up and running with the least amount of trouble.

I consider this guide a live document, and thus it is subject to change, expansion, and revision.

1 Installation and Import

If you do not already have Pandas on your system, you’ll need to start there. On most Linux systems, it is probably available through your distribution repositories as python-pandas. On Windows or macOS systems, the easiest route is to install it with pip, the Python package manager:

pip3 install pandas

Once installed, you’ll need to import the Pandas library into whatever script or notebook you need. The overwhelming convention is to import the library into the pd namespace:

import pandas as pd

At this point, you can continue with the rest of this guide.

2 The Core Data Types

Pandas introduces two fundamental data types which are used extensively: Series and DataFrames.

2.1 Series

A Pandas series represents a single column or array of information. In this respect it is highly similar to a Python list, except that it has a specific data type associated with all values within the series. Additionally, each row has an index associated with it, which may or may not be a simple integer.

Creating a series is as simple as passing in a list or array:

S = pd.Series(['Apple', 'Banana', 'Cherry', 'Date'])
S
0     Apple
1    Banana
2    Cherry
3      Date
dtype: object

In the output you can see the data values listed in the right column, one entry per row. To the left is the default index used if none is provided, which consists of integer values counting up from 0. At the bottom you can see the associated data type, which is listed as object, the general-purpose type used here for the text values.

You can optionally provide a name and/or an index array when creating the series. This will associate a name with the entire column or overwrite the standard index values:

S2 = pd.Series(
        ['Aardvark', 'Bobcat', 'Cougar', 'Damsel Fly'],
        name='Animals',
        index=['A', 'B', 'C', 'D'],
        )
S2
A      Aardvark
B        Bobcat
C        Cougar
D    Damsel Fly
Name: Animals, dtype: object
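The pieces described above (values, index, name, and data type) can each be read back from a series. The following is a minimal sketch that rebuilds S2 from the text and adds one extra numeric series, called numbers here purely for illustration, to show a non-object data type:

```python
import pandas as pd

# Rebuild the S2 series from the text
S2 = pd.Series(
    ['Aardvark', 'Bobcat', 'Cougar', 'Damsel Fly'],
    name='Animals',
    index=['A', 'B', 'C', 'D'],
)

print(S2.name)          # the name attached to the whole column
print(list(S2.index))   # the row labels
print(len(S2))          # the number of entries
print(S2.dtype)         # the data type shared by all values

# Illustrative extra series (not from the text): floats get a numeric dtype
numbers = pd.Series([1.5, 2.5, 3.5])
print(numbers.dtype)    # float64
```

The exact dtype reported for text values can depend on the installed Pandas version, so it is printed here rather than assumed.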

2.2 DataFrames

Whereas a series is a one-dimensional array of information, a dataframe is a two-dimensional grid of information. Within this grid, each column has the same number of rows (or, conversely, each row has the same number of columns). Every column has a name and data type associated with it; every row has an index associated with it. In my experience, the easiest way to create a dataframe from scratch is to use a Python dictionary, where the keys of the dictionary become the column headings and the values should be arrays of the same length.

df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})
df
    name  age  enrolled
0  Billy   23      True
1  Henry   25     False
2   Jill   21     False
3   Beth   19      True

Note that in this instance the default index is used since none was provided.

You can also create a dataframe from a 2D array, where each interior list is considered a row:

df2 = pd.DataFrame(
    [
        [1, 3, 5, 7],
        ['A', 'B', 'C', 'D'],
        [2.3, 9.1, 7.5, 5.7],
    ],
    columns=['C1', 'C2', 'C3', 'C4'],
)
df2
    C1   C2   C3   C4
0    1    3    5    7
1    A    B    C    D
2  2.3  9.1  7.5  5.7

In this case, if the additional columns option is not provided, the default column names will be numbers, counting up from 0.
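To make the difference between the two construction routes concrete, here is a small sketch that builds both kinds of dataframe and inspects their shape, column names, and per-column data types. The variable names by_column and by_row are illustrative choices, not names used elsewhere in this guide:

```python
import pandas as pd

# Dictionary route: each key is a column, each value is that column's data
by_column = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})
print(by_column.shape)    # (4, 3): four rows, three columns
print(by_column.dtypes)   # one data type per column

# Nested-list route: each interior list is a row; no column names given
by_row = pd.DataFrame([
    [1, 3, 5, 7],
    ['A', 'B', 'C', 'D'],
    [2.3, 9.1, 7.5, 5.7],
])
print(by_row.shape)           # (3, 4): three rows, four columns
print(list(by_row.columns))   # [0, 1, 2, 3], the default column names
print(by_row.dtypes)          # each column mixes ints, strings and floats
```

Because a data type belongs to a column rather than a row, the dictionary route keeps age numeric and enrolled boolean, while the row-wise example ends up with columns that each mix several kinds of value.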

3 Making Selections

One of the fundamental operations when working with compound data structures is being able to select desired material. Pandas gives several different ways of achieving this.

3.1 Basic Indexing

When working with a series object, you can select an index using standard square brackets. Doing so will return the value of the object at that index:

S[1]
'Banana'

The index value here is the same as the listed index, so if it has been changed, the value in the square brackets must change as well:

S2['A']
'Aardvark'

When using square brackets to index from a dataframe, the value within the square brackets represents a column name. Note that this is different from classically indexing a two-dimensional array!

df['name']
0    Billy
1    Henry
2     Jill
3     Beth
Name: name, dtype: object

Note that when you select a column out of a dataframe, you get back a series!

Using dot notation

If you are selecting just a single element from a series or a single column from a dataframe, and if the corresponding index or column name is a string, you can also use dot notation to access the value.

S2.A
'Aardvark'

or

df.name
0    Billy
1    Henry
2     Jill
3     Beth
Name: name, dtype: object
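Dot notation only works when the name is a valid Python identifier that does not collide with an existing attribute or method. The sketch below uses a hypothetical inventory table, invented for this illustration, whose column names are deliberately awkward:

```python
import pandas as pd

# Hypothetical table with column names chosen to cause trouble
inventory = pd.DataFrame({
    'part': ['bolt', 'nut', 'washer'],
    'count': [10, 25, 40],             # same name as the DataFrame.count method
    'unit price': [0.5, 0.1, 0.05],    # contains a space
})

print(inventory.part.tolist())            # a plain name works with dot notation
print(callable(inventory.count))          # True: this is the method, not the column
print(inventory['count'].tolist())        # square brackets always reach the column
print(inventory['unit price'].tolist())   # names with spaces need brackets too
```

When in doubt, square brackets are the safer choice, since they work for every column name.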

If you want to select multiple values from a series or dataframe, they need to be passed as a list:

S2[['A', 'B']]
A    Aardvark
B      Bobcat
Name: Animals, dtype: object

When doing so, the end result will be the same data type as the original, so selecting multiple values from a series gives another series.

Selecting multiple columns from a dataframe works the same way:

df[['name', 'enrolled']]
    name  enrolled
0  Billy      True
1  Henry     False
2   Jill     False
3   Beth      True
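The return types described in this section can be checked directly. This is a minimal sketch that rebuilds df from earlier in the guide; the names one_column, two_columns, and one_column_frame are illustrative:

```python
import pandas as pd

# Rebuild the df dataframe from the text
df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

one_column = df['name']                  # a single label gives a series
two_columns = df[['name', 'enrolled']]   # a list of labels gives a dataframe
one_column_frame = df[['name']]          # a one-item list still gives a dataframe

print(type(one_column).__name__)         # Series
print(type(two_columns).__name__)        # DataFrame
print(type(one_column_frame).__name__)   # DataFrame
```

Passing a one-item list is a handy way to keep a single column as a dataframe when later code expects two-dimensional data.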

3.2 Location Indexing

The above works well for series, but it generally only gets us columns of a dataframe. Sometimes you want to select both a subset of columns and a subset of rows. In these cases, the loc attribute allows you to specify both which rows and which columns you want. Note that you still use square brackets after df.loc, which can take some getting used to. For example, if you wanted to select the first three names from the df dataframe, you could do so as:

df.loc[0:2, 'name']
0    Billy
1    Henry
2     Jill
Name: name, dtype: object

Note that if the output is one-dimensional, then a series object is returned. If multiple rows and columns are selected, then a dataframe object will be returned.

Warning

When providing slicing syntax to the rows or columns of .loc, it is important to realize that here the endpoints are included! The above grabs the rows with index values of 0, 1 and 2. This is different from how Python’s normal range function works, and indeed different from how .iloc works, as we’ll see shortly.

It is important to realize that, with loc, if your row indexes have been set to something different, you specify those indexes directly. So, for example, if you had the dataframe:

df3 = pd.DataFrame({'name': ['Jill', 'Ben', 'Jane', 'Beth'],
                    'age': [21, 19, 20, 20]},
                    index=['001', '002', '003', '004']
                    )
df3
     name  age
001  Jill   21
002   Ben   19
003  Jane   20
004  Beth   20

Then you’d need to select out the rows using that specific index:

df3.loc['002':'004', :]
     name  age
002   Ben   19
003  Jane   20
004  Beth   20

Here you can see as well that using a plain : for one of the dimension slices will do the normal Python action of taking everything from the start to the end, or all of the entries. This can be used to choose either all columns or all rows, as desired.
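Row and column selections can be combined freely within loc. The following sketch rebuilds df3 from above and shows a label slice paired with a column list, and a plain : paired with a single column; the result names middle_names and all_ages are illustrative:

```python
import pandas as pd

# Rebuild the df3 dataframe from the text, with its string index
df3 = pd.DataFrame(
    {'name': ['Jill', 'Ben', 'Jane', 'Beth'],
     'age': [21, 19, 20, 20]},
    index=['001', '002', '003', '004'],
)

# Label slice for the rows (both endpoints included), list for the columns
middle_names = df3.loc['002':'003', ['name']]
print(middle_names)

# Plain ':' selects every row; a single column label gives back a series
all_ages = df3.loc[:, 'age']
print(all_ages.tolist())   # [21, 19, 20, 20]
```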

Setting values using loc

Using loc can be particularly important if you are trying to assign values to a cell or collection of cells in a dataframe, since doing something like this:

# DON'T DO THIS!
df['age'][3] = 20

technically first selects out a series object, which may be a copy of the original data. Setting the 3 index of that series is therefore not guaranteed to change the underlying dataframe, as was desired, and Pandas will typically issue a warning about it. Using

# This is the way
df.loc[3, 'age'] = 20

however will directly alter the original dataframe.
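As a quick check that assignment through loc really reaches the original dataframe, here is a minimal sketch using the df dataframe from earlier. The second assignment, which sets several rows at once, is an added illustration rather than something taken from the text above:

```python
import pandas as pd

# Rebuild the df dataframe from the text
df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})

# Set a single cell: row label 3, column 'age'
df.loc[3, 'age'] = 20
print(df.loc[3, 'age'])          # 20

# Set several cells at once: row labels 0 through 1 (inclusive)
df.loc[0:1, 'enrolled'] = False
print(df['enrolled'].tolist())   # [False, False, False, True]
```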

3.3 Index Location Indexing

Sometimes it is useful to ignore all the column or row labels and just work with numeric positions, for instance to talk about the 3rd column or 9th row. In these cases, the .iloc attribute will do what you want. That is to say, if you wanted to select the first two rows and first two columns of a dataframe:

df.iloc[:2, :2]
    name  age
0  Billy   23
1  Henry   25

delivers exactly what you want, without needing to worry about column names or row indexes (which might not be numeric!). Observe that slices given to iloc work exactly as they do with Python’s range function or ordinary list slicing: the start is included, but the stop is not. If you want to pick rows or columns that do not follow a simple start-to-stop range, you can use a step in the slice or provide a list of index positions:

df.iloc[::2, [0, 2]]
    name  enrolled
0  Billy      True
2   Jill     False
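The difference in endpoint handling between loc and iloc is easy to trip over, so here is a side-by-side sketch using the df and df3 dataframes from earlier in the guide. The result names by_label and by_position are illustrative:

```python
import pandas as pd

# Rebuild df (default integer index) and df3 (string index) from the text
df = pd.DataFrame({
    'name': ['Billy', 'Henry', 'Jill', 'Beth'],
    'age': [23, 25, 21, 19],
    'enrolled': [True, False, False, True],
})
df3 = pd.DataFrame(
    {'name': ['Jill', 'Ben', 'Jane', 'Beth'],
     'age': [21, 19, 20, 20]},
    index=['001', '002', '003', '004'],
)

# The same "0:2" means different things to loc and iloc
by_label = df.loc[0:2, 'name']    # labels 0, 1 and 2: endpoint included
by_position = df.iloc[0:2, 0]     # positions 0 and 1: endpoint excluded
print(by_label.tolist())          # ['Billy', 'Henry', 'Jill']
print(by_position.tolist())       # ['Billy', 'Henry']

# With a non-numeric index, iloc still counts positions from 0
print(df3.iloc[1:3, :].index.tolist())          # ['002', '003']
print(df3.loc['002':'003', :].index.tolist())   # ['002', '003']
```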

…to be continued…