Pandas is a data analysis library; it is based on NumPy
import pandas as pd
Example:
| Area | Population | Capital | |
|---|---|---|---|
| CN | 9.6 | 1386 | Beijing | 
| RU | 17 | 144 | Moscow | 
| US | 9.8 | 327 | Washington, D.C. | 
area = pd.Series({'CN': 9.6, 'RU': 17, 'US': 9.8})
population = pd.Series({'CN': 1386, 'RU': 144, 'US': 327})
area = pd.Series([9.6, 17, 9.8], ["CN", "RU", "US"])
population = pd.Series([1386, 144, 327], ["CN", "RU", "US"])
Reading a value via the index value:
area['CN'] # 9.6
Each series has a specific data type
area.dtype # float64
Manually setting the data type:
area = pd.Series(
    {"CN": 9.6, "RU": 17, "US": 9.8}, dtype="float32"
)
countries = pd.DataFrame(
    {"area": area, "population": population}
)
data formats:
The following functions can import / export data from files. Imports may read files from online sources.
import: pd.read_csv, pd.read_excel, ...
export: df.to_csv, df.to_excel, ...
Example: Euribor (interest rates of European bonds)
euribor = pd.read_csv(
    "https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv"
)
Possible keyword arguments for read_csv:
index_col: identifies a column to be used as an index instead of a numerical indexheader: passing a value of None signifies that there is no header rownames: column names to be used in the resulting DataFramesep: for specifying other separators than a commausecols: for importing only specific columnsparse_dates: expects a list of column namesdtype: may be a dictionary that specifies data types for certain columnsSee also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Advanced Euribor example:
euribor = pd.read_csv(
    "https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv",
    parse_dates=["date"],
    index_col="date",
    usecols=["date", "rate"]
)
Task: Import the following data sources, ensuring the data is formatted nicely:
possible solutions:
sp500 = pd.read_csv(
    "https://raw.githubusercontent.com/datasets/s-and-p-500/main/data/data.csv",
    index_col="Date",
    parse_dates=["Date"],
)
iris = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    header=None,
    names=["sepal_length", "sepal_width", "petal_length",
           "petal_width", "species"],
)
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    index_col="PassengerId",
)
requires package openpyxl
reading / writing a single Excel sheet:
pd.read_exceldf.to_excelreading / writing a complete documents (including formatting):
pd.ExcelFilepd.ExcelWriterrequires PyTables (package name: tables)
euribor.to_hdf("data.hdf5", "euribor")
sp500.to_hdf("data.hdf5", "sp500")
euribor = pd.read_hdf("data.hdf5", "euribor")
for many array operations, there is equivalent functionality on Pandas DataFrames / Series:
df.abs()df1 + df2df.shapeconversion:
df.to_numpy()series.to_numpy()titanic["Age"].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
(see next slide for explanations)
titanic["Age"].describe()
The above computes the following data:
.count().mean().std().quantile(0) or .min().quantile(0.25).quantile(0.5) or .median().quantile(0.75).quantile(1) or .max()get a list of unique values:
titanic["Pclass"].unique()
[2, 3, 1]
count occurences:
titanic["Pclass"].value_counts()
3    491
1    216
2    184
df.index: row index valuesdf.columns: column namesaccessing a column (as a Series):
titanic["Age"]
accessing multiple columns (as a DataFrame):
titanic[["Name", "Age"]]
short notation (does not work for all column names):
titanic.Age
cases where the short notation cannot be used:
df["first name"]  # space in name
df["class"]       # reserved word
df["mean"]        # existing method
single row as a Series:
sp500.loc["1872-01-01"]
multiple rows as a DataFrame (both limits inclusive):
sp500.loc["1872-01-01" : "1872-12-31"]
if the index column was parsed as dates:
sp500.loc["2009"]
single row as a Series:
sp500.iloc[0]
multiple rows as a DataFrame:
sp500.iloc[0 : 10]
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
df.sample() - random entrydf.sample(5) - five entriesdf.sample(frac=0.1) - 10% of all entriessp500.iloc[0]sp500.iloc[-1]sp500.iloc[-10:]sp500.sample(10)sp500.loc["2009-01-01"]sp500.loc["2009-01-01": "2009-12-31"]sp500[sp500["Long Interest Rate"] > 14]When was the S&P 500 at its highest value? (compute the maximum, then get the corresponding row in the DataFrame)
sp500_max = sp500["SP500"].max()
# returns a DataFrame
sp500_max_row = sp500[sp500["SP500"] == sp500_max]
shorter alternative (using .idxmax()):
# returns a Series
sp500_max_row = sp500.loc[sp500["SP500"].idxmax()]
titanic[titanic["Survived"] == 0]titanic[titanic["Age"] == 60]titanic["Survived"].value_counts()titanic[titanic["Pclass"] == 1]["Survived"].value_counts()querying by row and column:
df.loc["2009-01-02", "rate"]df.iloc[5, 1]series.sort_values()df.sort_values("column_name")df.sort_values(["col1", "col2"])df.sort_index(ascending=False)basic method:
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
male passengers in second or third class:
titanic[(titanic["Pclass"] >= 2) & (titanic["Sex"] == "male")]
passengers who embarked in Southampton or Queenstown:
titanic[titanic["Embarked"].isin(["S", "Q"])]
titanic.query("Pclass >= 2 and Sex == 'male'")
titanic.query("Embarked in ['S', 'Q']")
adult (>= 18) males, sorted by age:
titanic[
    (titanic["Age"] >= 18) & (titanic["Sex"] = "male")
].sort_values("Age")
adult survivors
titanic[(titanic["Age"] >= 18) & (titanic["Survived"] == 1)]
iris.sort_values(by="petal_length")
iris[iris["species"] == "Iris-setosa"].sort_values(
    by="petal_length"
)
plotting functionality of pandas is based on pyplot (similar commands)
basic functionality of pyplot:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3])
y1 = x*2
y2 = x**2
plt.plot(x, y1)
plt.plot(x, y2)
result:
Basic plot types:
example pyplot options:
plt.plot(x, y1, color="C3", marker="X", linestyle="dashed")
wrappers for pyplot functions that exist on Series and DataFrame objects:
data.plot() or data.plot.line()data.plot.bar()data.plot.scatter()data.plot.hist()data.plot.box()data.plot.pie()examples:
euribor.plot.line()
sp500["SP500"].plot.line()
example:
euribor.iloc[-12:].plot.bar()
exercise: Visualize the median of sepal-width and sepal-length for all three types of flowers
Exercise: histogram of the sepal length
Exercise: box plots of all iris measurements
iris.plot.scatter(
    x="sepal_length",
    y="sepal_width",
)
with color and size:
iris.plot.scatter(
    x="sepal_length",
    y="sepal_width",
    s="petal_length",
    c="petal_width"
)
Exercise: scatter plot for iris setosa
data = pd.Series({"a": 50, "b": 30, "c": 20})
data.plot.pie(ylabel="Data")
distribution of Titanic passengers into passenger classes (1 - 3)
titanic["Pclass"].value_counts().plot.pie()
df.columns = ["name1", "name2"]
dropping rows:
sp500_new = sp500.drop(["1871-01-01", "1871-02-01"])
dropping columns:
sp500_new = sp500.drop(columns=["Real Earnings", "PE10"])
converting types:
titanic["Survived"] = titanic["Survived"].astype("bool")
replacing values:
titanic["Female"] = titanic["Sex"].replace(
    {"female": True, "male": False}
)
Adding a new column:
iris["sepal_ratio"] = iris["sepal_length"] / iris["sepal_width"]
iris["sepal_ratio"].mean()
iris["sepal_ratio"].std()
iris_setosa = iris.loc[
    iris["name"] == "Iris-setosa"
]
iris_setosa["sepal_ratio"].mean()
iris_setosa["sepal_ratio"].std()
Task:
Analyze the S&P 500 monthly data and determine the monthly gain / loss in percent for each month
hint: use the .diff() method to get the difference between the previous row and the current one
sp500["Diff"] = sp500["SP500"].diff()
sp500["Gain"] = sp500["Diff"] / sp500["SP500"]
def classifier(age):
    if age < 18:
        return "youth"
    elif age < 60:
        return "adult"
    else:
        return "senior"
titanic["AgeCls"] = titanic["Age"].apply(classifier)
(more efficient alternative: pd.cut())
iris.iloc[0, 0] = 6
iris.loc[:, "sepal_ratio"] = float('nan')
Values that represent missing data (since pandas 1.0):
NaN (as usual in Python)NA (from the pandas package)titanic["Age"].shape
# (891,)
titanic["Age"].count()
# 714
show all rows with missing age entries:
titanic[titanic["Age"].isna()]
removing any rows with missing data in any location:
titanic = titanic.dropna()
removing any rows with missing data in the age column:
titanic = titanic.dropna(subset=["Age"])
Filling missing data with zeros:
titanic["Age"] = titanic["Age"].fillna(0)
Filling missing data by using the last valid datapoint:
titanic["Age"] = titanic["Age"].fillna(method="ffill")
Filling missing data by using the next valid datapoint:
titanic["Age"] = titanic["Age"].fillna(method="bfill")
data = pd.Series(
    [1, 2, 4, np.nan, 16, 32, np.nan, np.nan, 256]
)
data.interpolate("nearest")
data.interpolate("linear") # default
data.interpolate("slinear")
data.interpolate("quadratic")
data.interpolate("cubic")
Exercise:
Use the data from sp500 and euribor to compare the development of American and European interest rates
Note:
The sp500 has data for the first day of each month, the euribor has data for the first workday of each month
solution:
interest = pd.DataFrame({
    "us": sp500["Long Interest Rate"],
    "eu": euribor["rate"]
})
interest["eu"] = interest["eu"].interpolate("slinear")
interest = interest.dropna()
index column: column that uniquely identifies a row in a DataFrame
multi index: combination of multiple columns for unique identification
example: exchange rates
| Date | Country | Exchange rate | 
|---|---|---|
| 1971-01-01 | Australia | 0.894 | 
| 1971-02-01 | Australia | 0.890 | 
| 1971-03-01 | Australia | 0.890 | 
A row can be uniquely identified by a combination of date and country
Importing with a multi index:
exchange_rates = pd.read_csv(
    "https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
    index_col=["Country", "Date"],
    parse_dates=["Date"])
in some cases, a multi index may allow for using a Series instead of a DataFrame:
exchange_rates_series = exchange_rates["Exchange rate"]
querying the first part of a multi index (will return a Series with only the remainder of the index):
exchange_rates_series.loc["France"]
querying multiple parts of the index:
exchange_rates_series.loc["France", "1971-01-01":"1971-12-31"]
Querying in a DataFrame:
exchange_rates.loc[("France", "1990"), :]
exchange_rates.loc[(slice(None, None), "1990-01"), :]
Note: in Python x[a:b] is equivalent to x[slice(a, b)]
Note: the Date column could be removed from the index via reset_index
Join: Combining multiple DataFrames or Series into a single DataFrame
types:
Joining Series objects by their indices:
outer join:
interest_rates = pd.DataFrame({
    "usd": sp500["Long Interest Rate"],
    "eur": euribor["rate"]
})
inner join:
interest_rates = pd.DataFrame({
    "usd": sp500["Long Interest Rate"],
    "eur": euribor["rate"]
}).dropna()
joining DataFrame objects by their indices:
inner join:
pd.merge(sp500, euribor, left_index=True, right_index=True)
outer join:
pd.merge(sp500, euribor, left_index=True, right_index=True,
         how="outer")
joining on specific columns (not on the index):
sp500_no_index = pd.read_csv(
    "https://datahub.io/core/s-and-p-500/r/data.csv",
    parse_dates=["Date"],
)
euribor_no_index = pd.read_csv(
    "https://datahub.io/core/euribor/r/euribor-12m-monthly.csv",
    parse_dates=["date"],
    usecols=["date", "rate"]
)
pd.merge(sp500_no_index, euribor_no_index, left_on="Date",
         right_on="date")
short form if the corresponding columns have the same name:
pd.merge(sp500_no_index, euribor_no_index, on="date")
will result in one date column instead of two
Exercise: country comparisons (scatter plots):
data sources:
solution:
gdp = pd.read_csv(
    "https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020))/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020)).csv",
    index_col=["Entity", "Year"]
)
gdp_series = gdp["GDP per capita"]
gdp_2018 = gdp_series[:, 2018]
vac = pd.read_csv(
    "https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv",
    index_col=["location", "date"]
)
vac_series = vac["total_vaccinations_per_hundred"]
vac_2021_08 = vac_series[:, "2021-08-01"]
data = pd.DataFrame({"gdp": gdp_2018, "vaccinations": vac_2021_08}).dropna()
data.plot.scatter(x="gdp", y="vaccinations")
see also: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
examples related to Titanic passenger data:
Aggregation: Computing a derived value based on multiple data entries in a series / data frame (e.g. number of entries, average, median)
functions and methods:
series.value_counts()series.groupby() / df.groupby()pd.crosstab()pd.pivot_table()number of passengers in each class:
titanic["Pclass"].value_counts()
# 3    491
# 1    216
# 2    184
median numeric values of passengers in each class:
titanic.groupby("Pclass").median()
median age of passengers in each class:
titanic.groupby("Pclass")["Age"].median()
# 1    37.0
# 2    29.0
# 3    24.0
number of passengers by class and sex (frequency table):
pd.crosstab(titanic["Pclass"], titanic["Sex"])
sex     female  male
pclass
1           94   122
2           76   108
3          144   347
average age by class and sex:
pd.crosstab(
    index=titanic["Pclass"],
    columns=titanic["Sex"],
    values=titanic["Age"],
    aggfunc=np.mean)
pd.pivot_table(
    data=titanic,
    values="Age",
    index=["Pclass"],
    columns=["Sex"],
    aggfunc=np.mean)
mean values for each of the four iris measurements within each type of flower
mean USD exchange rates for each currency in the 90s based on the following data:
exchange_rates = pd.read_csv(
    "https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
    index_col=["Country", "Date"],
    parse_dates=["Date"])
iris.groupby("species").mean()
er_90s = exchange_rates.loc[
    (exchange_rates["Date"] >= "1990-01-01") &
    (exchange_rates["Date"] <= "1999-12-31")
]
er_90s_means = er_90s.groupby("Country").mean()
Based on the Movielens dataset, find the top-rated movies that were rated by at least 10 users
ratings = pd.read_csv(
    "https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
    sep="\t",
    header=None,
    usecols=[0, 1, 2],
    names=["user_id", "movie_id", "rating"],
)
movie_titles = pd.read_csv(
    "https://files.grouplens.org/datasets/movielens/ml-100k/u.item",
    sep="|",
    header=None,
    encoding="latin1",
    usecols=[0, 1],
    names=["movie_id", "title"],
)
movie_names = movie_titles.groupby("movie_id")["title"].first()
movie_mean_ratings = ratings.groupby("movie_id")["rating"].mean()
movie_num_ratings = ratings.groupby("movie_id")["rating"].count()
movie_ratings = pd.DataFrame({
  "name": movie_names,
  "mean_rating": movie_mean_ratings,
  "num_ratings": movie_num_ratings
})
movie_ratings.query(
    "num_ratings >= 10 and 4.2 < mean_rating"
).sort_values(by="mean_rating", ascending=False)
data types:
creating date ranges (as index objects):
every_day = pd.date_range("2000-01-01", "2000-12-31")
last_of_each_month = pd.date_range(
    "2000-01-01", "2000-12-31", freq="M"
)
first_of_each_month = pd.date_range(
    "2000-01-01", "2000-12-31", freq="MS"
)
every_10_days = pd.date_range(
    "2000-01-01", "2000-12-31", freq="10d"
)
converting between time stamps and time periods:
.to_period().to_timestamp()(works on index objects and on Series / DataFrame objects)
resampling: getting derived data for different time intervals
example: based on monthly data:
sp500.resample("Y").mean()
sp500.resample("D").interpolate()
Checking if the first day of each month is present in the S&P 500 data:
sp500.index.equals(
    pd.date_range(
        sp500.index[0], sp500.index[-1], freq="MS"
    )
)
# True
pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf