Pandas is a data analysis library; it is based on NumPy
import pandas as pd
Example:
| | Area (million km²) | Population (millions) | Capital |
|---|---|---|---|
| CN | 9.6 | 1386 | Beijing |
| RU | 17.0 | 144 | Moscow |
| US | 9.8 | 327 | Washington, D.C. |
creating series from dicts:
area = pd.Series({'CN': 9.6, 'RU': 17, 'US': 9.8})
population = pd.Series({'CN': 1386, 'RU': 144, 'US': 327})
equivalent: creating series from a list of values and a list of index entries:
area = pd.Series([9.6, 17, 9.8], ["CN", "RU", "US"])
population = pd.Series([1386, 144, 327], ["CN", "RU", "US"])
Reading a value via the index value:
area['CN'] # 9.6
Each series has a specific data type
area.dtype # float64
Manually setting the data type:
area = pd.Series(
{"CN": 9.6, "RU": 17, "US": 9.8}, dtype="float32"
)
countries = pd.DataFrame(
{"area": area, "population": population}
)
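The resulting DataFrame combines both series via their shared index:
countries
#     area  population
# CN   9.6        1386
# RU  17.0         144
# US   9.8         327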
Data formats:
The following functions can import / export data from files; imports may read files from online sources.
- import: pd.read_csv, pd.read_excel, ...
- export: df.to_csv, df.to_excel, ...
Example: Euribor (Euro Interbank Offered Rate, an interest rate for interbank lending in euros)
euribor = pd.read_csv(
"https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv"
)
Possible keyword arguments for read_csv:
- index_col: identifies a column to be used as an index instead of a numerical index
- header: passing a value of None signifies that there is no header row
- names: column names to be used in the resulting DataFrame
- sep: for specifying separators other than a comma
- usecols: for importing only specific columns
- parse_dates: expects a list of column names
- dtype: may be a dictionary that specifies data types for certain columns

See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Advanced Euribor example:
euribor = pd.read_csv(
"https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv",
parse_dates=["date"],
index_col="date",
usecols=["date", "rate"]
)
Task: Import the following data sources, making sure the data is formatted nicely: monthly S&P 500 data, the Iris flower dataset, and the Titanic passenger data.
possible solutions:
sp500 = pd.read_csv(
"https://raw.githubusercontent.com/datasets/s-and-p-500/main/data/data.csv",
index_col="Date",
parse_dates=["Date"],
)
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header=None,
names=["sepal_length", "sepal_width", "petal_length",
"petal_width", "species"],
)
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
index_col="PassengerId",
)
Excel:
requires the package openpyxl
reading / writing a single Excel sheet:
- pd.read_excel
- df.to_excel
reading / writing complete documents (including formatting):
- pd.ExcelFile
- pd.ExcelWriter
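A minimal sketch of reading several sheets from a single document via pd.ExcelFile (the file name and sheet names here are assumptions):
with pd.ExcelFile("data.xlsx") as file:
    # each sheet is read into its own DataFrame
    euribor = pd.read_excel(file, sheet_name="euribor")
    sp500 = pd.read_excel(file, sheet_name="sp500")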
HDF5:
requires PyTables (package name: tables)
euribor.to_hdf("data.hdf5", "euribor")
sp500.to_hdf("data.hdf5", "sp500")
euribor = pd.read_hdf("data.hdf5", "euribor")
for many array operations, there is equivalent functionality on Pandas DataFrames / Series:
df.abs()
df1 + df2
df.shape
conversion:
df.to_numpy()
series.to_numpy()
titanic["Age"].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
The above computes the following data:
- .count()
- .mean()
- .std()
- .quantile(0) or .min()
- .quantile(0.25)
- .quantile(0.5) or .median()
- .quantile(0.75)
- .quantile(1) or .max()
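For example, the median age can be computed directly; the value matches the 50% entry of the describe output above:
titanic["Age"].quantile(0.5)
# 28.0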
get a list of unique values:
titanic["Pclass"].unique()
[2, 3, 1]
count occurrences:
titanic["Pclass"].value_counts()
3 491
1 216
2 184
- df.index: row index values
- df.columns: column names

accessing a column (as a Series):
titanic["Age"]
accessing multiple columns (as a DataFrame):
titanic[["Name", "Age"]]
short notation (does not work for all column names):
titanic.Age
cases where the short notation cannot be used:
df["first name"] # space in name
df["class"] # reserved word
df["mean"] # existing method
selecting rows via .loc (by index value):
single row as a Series:
sp500.loc["1872-01-01"]
multiple rows as a DataFrame (both limits inclusive):
sp500.loc["1872-01-01" : "1872-12-31"]
if the index column was parsed as dates:
sp500.loc["2009"]
selecting rows via .iloc (by row number):
single row as a Series:
sp500.iloc[0]
multiple rows as a DataFrame:
sp500.iloc[0 : 10]
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
df.sample()
- random entrydf.sample(5)
- five entriesdf.sample(frac=0.1)
- 10% of all entriessp500.iloc[0]
sp500.iloc[-1]
sp500.iloc[-10:]
sp500.sample(10)
sp500.loc["2009-01-01"]
sp500.loc["2009-01-01": "2009-12-31"]
sp500[sp500["Long Interest Rate"] > 14]
When was the S&P 500 at its highest value? (compute the maximum, then get the corresponding row in the DataFrame)
sp500_max = sp500["SP500"].max()
# returns a DataFrame
sp500_max_row = sp500[sp500["SP500"] == sp500_max]
shorter alternative (using .idxmax()
):
# returns a Series
sp500_max_row = sp500.loc[sp500["SP500"].idxmax()]
titanic[titanic["Survived"] == 0]
titanic[titanic["Age"] == 60]
titanic["Survived"].value_counts()
titanic[titanic["Pclass"] == 1]["Survived"].value_counts()
querying by row and column:
df.loc["2009-01-02", "rate"]
df.iloc[5, 1]
sorting:
- series.sort_values()
- df.sort_values("column_name")
- df.sort_values(["col1", "col2"])
- df.sort_index(ascending=False)
basic method:
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
male passengers in second or third class:
titanic[(titanic["Pclass"] >= 2) & (titanic["Sex"] == "male")]
passengers who embarked in Southampton or Queenstown:
titanic[titanic["Embarked"].isin(["S", "Q"])]
titanic.query("Pclass >= 2 and Sex == 'male'")
titanic.query("Embarked in ['S', 'Q']")
adult (>= 18) males, sorted by age:
titanic[
    (titanic["Age"] >= 18) & (titanic["Sex"] == "male")
].sort_values("Age")
adult survivors:
titanic[(titanic["Age"] >= 18) & (titanic["Survived"] == 1)]
iris.sort_values(by="petal_length")
iris[iris["species"] == "Iris-setosa"].sort_values(
by="petal_length"
)
plotting functionality of pandas is based on pyplot (similar commands)
basic functionality of pyplot:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3])
y1 = x*2
y2 = x**2
plt.plot(x, y1)
plt.plot(x, y2)
result: a figure showing both line plots (a straight line for y1, a parabola for y2)
Basic plot types: line graph, bar chart, scatter plot, histogram, box plot, pie chart
example pyplot options:
plt.plot(x, y1, color="C3", marker="X", linestyle="dashed")
wrappers for pyplot functions that exist on Series and DataFrame objects:
- data.plot() or data.plot.line()
- data.plot.bar()
- data.plot.scatter()
- data.plot.hist()
- data.plot.box()
- data.plot.pie()
examples:
euribor.plot.line()
sp500["SP500"].plot.line()
example:
euribor.iloc[-12:].plot.bar()
Exercise: visualize the medians of sepal_width and sepal_length for all three species of flowers
Exercise: histogram of the sepal length
Exercise: box plots of all iris measurements
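possible solutions for the three exercises above (a sketch, assuming the iris DataFrame from earlier; groupby is covered in more detail later):
# bar chart of the median sepal measurements per species
iris.groupby("species")[["sepal_length", "sepal_width"]].median().plot.bar()
# histogram of the sepal length
iris["sepal_length"].plot.hist()
# box plots of all four measurements
iris.plot.box()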
iris.plot.scatter(
x="sepal_length",
y="sepal_width",
)
with color and size:
iris.plot.scatter(
x="sepal_length",
y="sepal_width",
s="petal_length",
c="petal_width"
)
Exercise: scatter plot for iris setosa
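a possible solution: filter the DataFrame first, then plot:
iris[iris["species"] == "Iris-setosa"].plot.scatter(
    x="sepal_length",
    y="sepal_width",
)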
data = pd.Series({"a": 50, "b": 30, "c": 20})
data.plot.pie(ylabel="Data")
distribution of Titanic passengers into passenger classes (1 - 3)
titanic["Pclass"].value_counts().plot.pie()
renaming columns:
df.columns = ["name1", "name2"]
dropping rows:
sp500_new = sp500.drop(["1871-01-01", "1871-02-01"])
dropping columns:
sp500_new = sp500.drop(columns=["Real Earnings", "PE10"])
converting types:
titanic["Survived"] = titanic["Survived"].astype("bool")
replacing values:
titanic["Female"] = titanic["Sex"].replace(
{"female": True, "male": False}
)
Adding a new column:
iris["sepal_ratio"] = iris["sepal_length"] / iris["sepal_width"]
iris["sepal_ratio"].mean()
iris["sepal_ratio"].std()
iris_setosa = iris.loc[
    iris["species"] == "Iris-setosa"
]
iris_setosa["sepal_ratio"].mean()
iris_setosa["sepal_ratio"].std()
Task:
Analyze the S&P 500 monthly data and determine the monthly gain / loss in percent for each month
hint: use the .diff() method to get the difference between the previous row and the current one
sp500["Diff"] = sp500["SP500"].diff()
# relative gain / loss compared to the previous month's value
sp500["Gain"] = sp500["Diff"] / sp500["SP500"].shift(1)
def classifier(age):
    if pd.isna(age):
        # keep missing ages as missing (otherwise they would fall
        # through to the "senior" branch)
        return None
    elif age < 18:
        return "youth"
    elif age < 60:
        return "adult"
    else:
        return "senior"
titanic["AgeCls"] = titanic["Age"].apply(classifier)
(a more efficient alternative: pd.cut())
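A sketch of the same classification via pd.cut; the bin edges and labels mirror the function above, but the upper bound of 120 is an assumption and pd.cut bins are right-inclusive by default, so the boundaries differ slightly:
titanic["AgeCls"] = pd.cut(
    titanic["Age"],
    bins=[0, 18, 60, 120],
    labels=["youth", "adult", "senior"],
)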
setting values:
iris.iloc[0, 0] = 6
iris.loc[:, "sepal_ratio"] = float('nan')
Values that represent missing data (since pandas 1.0):
- NaN (as usual in Python)
- NA (from the pandas package)
checking how many values are present:
titanic["Age"].shape
# (891,)
titanic["Age"].count()
# 714
show all rows with missing age entries:
titanic[titanic["Age"].isna()]
removing any rows with missing data in any location:
titanic = titanic.dropna()
removing any rows with missing data in the age column:
titanic = titanic.dropna(subset=["Age"])
Filling missing data with zeros:
titanic["Age"] = titanic["Age"].fillna(0)
Filling missing data by using the last valid datapoint:
titanic["Age"] = titanic["Age"].ffill()
Filling missing data by using the next valid datapoint:
titanic["Age"] = titanic["Age"].bfill()
(in older pandas versions: .fillna(method="ffill") / .fillna(method="bfill"))
data = pd.Series(
[1, 2, 4, np.nan, 16, 32, np.nan, np.nan, 256]
)
data.interpolate("nearest")
data.interpolate("linear") # default
data.interpolate("slinear")
data.interpolate("quadratic")
data.interpolate("cubic")
Exercise:
Use the data from sp500 and euribor to compare the development of American and European interest rates
Note:
The sp500 has data for the first day of each month, the euribor has data for the first workday of each month
solution:
interest = pd.DataFrame({
"us": sp500["Long Interest Rate"],
"eu": euribor["rate"]
})
interest["eu"] = interest["eu"].interpolate("slinear")
interest = interest.dropna()
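The combined data could then be visualized, e.g. via:
interest.plot.line()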
index column: column that uniquely identifies a row in a DataFrame
multi index: combination of multiple columns for unique identification
example: exchange rates

| Date | Country | Exchange rate |
|---|---|---|
| 1971-01-01 | Australia | 0.894 |
| 1971-02-01 | Australia | 0.890 |
| 1971-03-01 | Australia | 0.890 |
A row can be uniquely identified by a combination of date and country
Importing with a multi index:
exchange_rates = pd.read_csv(
"https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
index_col=["Country", "Date"],
parse_dates=["Date"])
in some cases, a multi index may allow for using a Series instead of a DataFrame:
exchange_rates_series = exchange_rates["Exchange rate"]
querying the first part of a multi index (will return a Series with only the remainder of the index):
exchange_rates_series.loc["France"]
querying multiple parts of the index:
exchange_rates_series.loc["France", "1971-01-01":"1971-12-31"]
Querying in a DataFrame:
exchange_rates.loc[("France", "1990"), :]
exchange_rates.loc[(slice(None, None), "1990-01"), :]
Note: in Python, x[a:b] is equivalent to x[slice(a, b)]
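pd.IndexSlice can make such queries more readable; this is equivalent to the slice(None, None) expression above:
exchange_rates.loc[pd.IndexSlice[:, "1990-01"], :]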
Note: the Date column could be removed from the index via reset_index
Join: combining multiple DataFrames or Series into a single DataFrame
types: inner join, outer join, left join, right join
Joining Series objects by their indices:
outer join:
interest_rates = pd.DataFrame({
"usd": sp500["Long Interest Rate"],
"eur": euribor["rate"]
})
inner join:
interest_rates = pd.DataFrame({
"usd": sp500["Long Interest Rate"],
"eur": euribor["rate"]
}).dropna()
joining DataFrame objects by their indices:
inner join:
pd.merge(sp500, euribor, left_index=True, right_index=True)
outer join:
pd.merge(sp500, euribor, left_index=True, right_index=True,
how="outer")
joining on specific columns (not on the index):
sp500_no_index = pd.read_csv(
"https://datahub.io/core/s-and-p-500/r/data.csv",
parse_dates=["Date"],
)
euribor_no_index = pd.read_csv(
"https://datahub.io/core/euribor/r/euribor-12m-monthly.csv",
parse_dates=["date"],
usecols=["date", "rate"]
)
pd.merge(sp500_no_index, euribor_no_index, left_on="Date",
right_on="date")
short form if the corresponding columns have the same name:
pd.merge(sp500_no_index, euribor_no_index, on="date")
(will result in one date column instead of two)
Exercise: country comparisons (scatter plots)
data sources: GDP per capita (Maddison Project Database, via Our World in Data) and COVID-19 vaccination data (Our World in Data)
solution:
gdp = pd.read_csv(
"https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020))/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020)).csv",
index_col=["Entity", "Year"]
)
gdp_series = gdp["GDP per capita"]
gdp_2018 = gdp_series[:, 2018]
vac = pd.read_csv(
"https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv",
index_col=["location", "date"]
)
vac_series = vac["total_vaccinations_per_hundred"]
vac_2021_08 = vac_series[:, "2021-08-01"]
data = pd.DataFrame({"gdp": gdp_2018, "vaccinations": vac_2021_08}).dropna()
data.plot.scatter(x="gdp", y="vaccinations")
see also: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Aggregation: computing a derived value based on multiple data entries in a series / data frame (e.g. number of entries, average, median)
examples related to the Titanic passenger data follow below
functions and methods:
- series.value_counts()
- series.groupby() / df.groupby()
- pd.crosstab()
- pd.pivot_table()
number of passengers in each class:
titanic["Pclass"].value_counts()
# 3 491
# 1 216
# 2 184
median values of all numeric columns for passengers in each class:
titanic.groupby("Pclass").median(numeric_only=True)
median age of passengers in each class:
titanic.groupby("Pclass")["Age"].median()
# 1 37.0
# 2 29.0
# 3 24.0
number of passengers by class and sex (frequency table):
pd.crosstab(titanic["Pclass"], titanic["Sex"])
Sex     female  male
Pclass
1           94   122
2           76   108
3          144   347
average age by class and sex:
pd.crosstab(
index=titanic["Pclass"],
columns=titanic["Sex"],
values=titanic["Age"],
aggfunc="mean")
pd.pivot_table(
data=titanic,
values="Age",
index=["Pclass"],
columns=["Sex"],
aggfunc="mean")
Exercises:
- compute the mean values of each of the four iris measurements within each type of flower
- compute the mean USD exchange rates for each currency in the 90s, based on the following data:
exchange_rates = pd.read_csv(
"https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
index_col=["Country", "Date"],
parse_dates=["Date"])
solutions:
iris.groupby("species").mean()
# "Date" is part of the multi index, so it is accessed via get_level_values
er_90s = exchange_rates[
    (exchange_rates.index.get_level_values("Date") >= "1990-01-01")
    & (exchange_rates.index.get_level_values("Date") <= "1999-12-31")
]
er_90s_means = er_90s.groupby("Country").mean()
Based on the Movielens dataset, find the top-rated movies that were rated by at least 10 users
ratings = pd.read_csv(
"https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
sep="\t",
header=None,
usecols=[0, 1, 2],
names=["user_id", "movie_id", "rating"],
)
movie_titles = pd.read_csv(
"https://files.grouplens.org/datasets/movielens/ml-100k/u.item",
sep="|",
header=None,
encoding="latin1",
usecols=[0, 1],
names=["movie_id", "title"],
)
movie_names = movie_titles.groupby("movie_id")["title"].first()
movie_mean_ratings = ratings.groupby("movie_id")["rating"].mean()
movie_num_ratings = ratings.groupby("movie_id")["rating"].count()
movie_ratings = pd.DataFrame({
"name": movie_names,
"mean_rating": movie_mean_ratings,
"num_ratings": movie_num_ratings
})
movie_ratings.query(
    "num_ratings >= 10 and mean_rating > 4.2"
).sort_values(by="mean_rating", ascending=False)
data types: pd.Timestamp / DatetimeIndex (points in time), pd.Period / PeriodIndex (time periods), pd.Timedelta / TimedeltaIndex (durations)
creating date ranges (as index objects):
every_day = pd.date_range("2000-01-01", "2000-12-31")
last_of_each_month = pd.date_range(
    "2000-01-01", "2000-12-31", freq="M"  # freq="ME" in newer pandas versions
)
first_of_each_month = pd.date_range(
"2000-01-01", "2000-12-31", freq="MS"
)
every_10_days = pd.date_range(
"2000-01-01", "2000-12-31", freq="10d"
)
converting between time stamps and time periods:
- .to_period()
- .to_timestamp()
(works on index objects and on Series / DataFrame objects)
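For example, a sketch of converting the date-indexed sp500 data to monthly periods and back:
sp500_periods = sp500.to_period("M")             # DatetimeIndex -> PeriodIndex
sp500_timestamps = sp500_periods.to_timestamp()  # back to time stamps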
resampling: getting derived data for different time intervals
example: based on monthly data:
sp500.resample("Y").mean()
sp500.resample("D").interpolate()
Checking if the first day of each month is present in the S&P 500 data:
sp500.index.equals(
pd.date_range(
sp500.index[0], sp500.index[-1], freq="MS"
)
)
# True
pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf