Pandas is a data analysis library; it is based on NumPy
import pandas as pd
Example:
| | Area (million km²) | Population (millions) | Capital |
|---|---|---|---|
| CN | 9.6 | 1386 | Beijing |
| RU | 17.0 | 144 | Moscow |
| US | 9.8 | 327 | Washington, D.C. |
creating series from dicts:
area = pd.Series({'CN': 9.6, 'RU': 17, 'US': 9.8})
population = pd.Series({'CN': 1386, 'RU': 144, 'US': 327})
equivalent: creating series from a list of values and a list of index entries:
area = pd.Series([9.6, 17, 9.8], ["CN", "RU", "US"])
population = pd.Series([1386, 144, 327], ["CN", "RU", "US"])
Reading a value via the index value:
area['CN'] # 9.6
Each series has a specific data type
area.dtype # float64
Manually setting the data type:
area = pd.Series(
{"CN": 9.6, "RU": 17, "US": 9.8}, dtype="float32"
)
countries = pd.DataFrame(
{"area": area, "population": population}
)
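The resulting DataFrame combines both series via their shared index:
countries
#     area  population
# CN   9.6        1386
# RU  17.0         144
# US   9.8         327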
Data formats:
The following functions can import / export data from files; imports may read files from online sources.
- import: pd.read_csv, pd.read_excel, ...
- export: df.to_csv, df.to_excel, ...
Example: Euribor (Euro Interbank Offered Rate, an interest rate for interbank lending in euros)
euribor = pd.read_csv(
"https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv"
)
Possible keyword arguments for read_csv:
- index_col: identifies a column to be used as an index instead of a numerical index
- header: passing a value of None signifies that there is no header row
- names: column names to be used in the resulting DataFrame
- sep: for specifying separators other than a comma
- usecols: for importing only specific columns
- parse_dates: expects a list of column names
- dtype: may be a dictionary that specifies data types for certain columns

See also: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Advanced Euribor example:
euribor = pd.read_csv(
"https://raw.githubusercontent.com/datasets/euribor/master/data/euribor-12m-monthly.csv",
parse_dates=["date"],
index_col="date",
usecols=["date", "rate"]
)
Task: Import the following data sources, making sure the data is formatted nicely: monthly S&P 500 data, the Iris flower dataset, and the Titanic passenger data.
possible solutions:
sp500 = pd.read_csv(
"https://raw.githubusercontent.com/datasets/s-and-p-500/main/data/data.csv",
index_col="Date",
parse_dates=["Date"],
)
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header=None,
names=["sepal_length", "sepal_width", "petal_length",
"petal_width", "species"],
)
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
index_col="PassengerId",
)
Excel:
requires the package openpyxl
reading / writing a single Excel sheet:
- pd.read_excel
- df.to_excel
reading / writing complete documents (including formatting):
- pd.ExcelFile
- pd.ExcelWriter
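A minimal sketch of reading several sheets from a single document via pd.ExcelFile (the file name and sheet names here are assumptions):
with pd.ExcelFile("data.xlsx") as file:
    # each sheet is read into its own DataFrame
    euribor = pd.read_excel(file, sheet_name="euribor")
    sp500 = pd.read_excel(file, sheet_name="sp500")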
HDF5:
requires PyTables (package name: tables)
euribor.to_hdf("data.hdf5", "euribor")
sp500.to_hdf("data.hdf5", "sp500")
euribor = pd.read_hdf("data.hdf5", "euribor")
for many array operations, there is equivalent functionality on Pandas DataFrames / Series:
df.abs()
df1 + df2
df.shape
conversion:
df.to_numpy()
series.to_numpy()
titanic["Age"].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
The above computes the following data:
- .count()
- .mean()
- .std()
- .quantile(0) or .min()
- .quantile(0.25)
- .quantile(0.5) or .median()
- .quantile(0.75)
- .quantile(1) or .max()
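For example, the median age can be computed directly; the value matches the 50% entry of the describe output above:
titanic["Age"].quantile(0.5)
# 28.0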
get a list of unique values:
titanic["Pclass"].unique()
[2, 3, 1]
count occurrences:
titanic["Pclass"].value_counts()
3 491
1 216
2 184
- df.index: row index values
- df.columns: column names

accessing a column (as a Series):
titanic["Age"]
accessing multiple columns (as a DataFrame):
titanic[["Name", "Age"]]
short notation (does not work for all column names):
titanic.Age
cases where the short notation cannot be used:
df["first name"] # space in name
df["class"] # reserved word
df["mean"] # existing method
selecting rows via .loc (by index value):
single row as a Series:
sp500.loc["1872-01-01"]
multiple rows as a DataFrame (both limits inclusive):
sp500.loc["1872-01-01" : "1872-12-31"]
if the index column was parsed as dates:
sp500.loc["2009"]
selecting rows via .iloc (by row number):
single row as a Series:
sp500.iloc[0]
multiple rows as a DataFrame:
sp500.iloc[0 : 10]
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
df.sample()
- random entrydf.sample(5)
- five entriesdf.sample(frac=0.1)
- 10% of all entriessp500.iloc[0]
sp500.iloc[-1]
sp500.iloc[-10:]
sp500.sample(10)
sp500.loc["2009-01-01"]
sp500.loc["2009-01-01": "2009-12-31"]
sp500[sp500["Long Interest Rate"] > 14]
When was the S&P 500 at its highest value? (compute the maximum, then get the corresponding row in the DataFrame)
sp500_max = sp500["SP500"].max()
# returns a DataFrame
sp500_max_row = sp500[sp500["SP500"] == sp500_max]
shorter alternative (using .idxmax()
):
# returns a Series
sp500_max_row = sp500.loc[sp500["SP500"].idxmax()]
titanic[titanic["Survived"] == 0]
titanic[titanic["Age"] == 60]
titanic["Survived"].value_counts()
titanic[titanic["Pclass"] == 1]["Survived"].value_counts()
querying by row and column:
df.loc["2009-01-02", "rate"]
df.iloc[5, 1]
sorting:
- series.sort_values()
- df.sort_values("column_name")
- df.sort_values(["col1", "col2"])
- df.sort_index(ascending=False)
basic method:
titanic[titanic["Pclass"] == 1]
titanic[titanic["Age"] >= 70]
male passengers in second or third class:
titanic[(titanic["Pclass"] >= 2) & (titanic["Sex"] == "male")]
passengers who embarked in Southampton or Queenstown:
titanic[titanic["Embarked"].isin(["S", "Q"])]
titanic.query("Pclass >= 2 and Sex == 'male'")
titanic.query("Embarked in ['S', 'Q']")
adult (>= 18) males, sorted by age:
titanic[
    (titanic["Age"] >= 18) & (titanic["Sex"] == "male")
].sort_values("Age")
adult survivors:
titanic[(titanic["Age"] >= 18) & (titanic["Survived"] == 1)]
iris.sort_values(by="petal_length")
iris[iris["species"] == "Iris-setosa"].sort_values(
by="petal_length"
)
plotting functionality of pandas is based on pyplot (similar commands)
basic functionality of pyplot:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3])
y1 = x*2
y2 = x**2
plt.plot(x, y1)
plt.plot(x, y2)
result: a figure showing both line plots (a straight line for y1, a parabola for y2)
Basic plot types: line graph, bar chart, scatter plot, histogram, box plot, pie chart
example pyplot options:
plt.plot(x, y1, color="C3", marker="X", linestyle="dashed")
wrappers for pyplot functions that exist on Series and DataFrame objects:
- data.plot() or data.plot.line()
- data.plot.bar()
- data.plot.scatter()
- data.plot.hist()
- data.plot.box()
- data.plot.pie()
examples:
euribor.plot.line()
sp500["SP500"].plot.line()
example:
euribor.iloc[-12:].plot.bar()
Exercise: visualize the medians of sepal_width and sepal_length for all three species of flowers
Exercise: histogram of the sepal length
Exercise: box plots of all iris measurements
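possible solutions for the three exercises above (a sketch, assuming the iris DataFrame from earlier; groupby is covered in more detail later):
# bar chart of the median sepal measurements per species
iris.groupby("species")[["sepal_length", "sepal_width"]].median().plot.bar()
# histogram of the sepal length
iris["sepal_length"].plot.hist()
# box plots of all four measurements
iris.plot.box()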
iris.plot.scatter(
x="sepal_length",
y="sepal_width",
)
with color and size:
iris.plot.scatter(
x="sepal_length",
y="sepal_width",
s="petal_length",
c="petal_width"
)
Exercise: scatter plot for iris setosa
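a possible solution: filter the DataFrame first, then plot:
iris[iris["species"] == "Iris-setosa"].plot.scatter(
    x="sepal_length",
    y="sepal_width",
)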
data = pd.Series({"a": 50, "b": 30, "c": 20})
data.plot.pie(ylabel="Data")
distribution of Titanic passengers into passenger classes (1 - 3)
titanic["Pclass"].value_counts().plot.pie()
renaming columns:
df.columns = ["name1", "name2"]
dropping rows:
sp500_new = sp500.drop(["1871-01-01", "1871-02-01"])
dropping columns:
sp500_new = sp500.drop(columns=["Real Earnings", "PE10"])
converting types:
titanic["Survived"] = titanic["Survived"].astype("bool")
replacing values:
titanic["Female"] = titanic["Sex"].replace(
{"female": True, "male": False}
)
Adding a new column:
iris["sepal_ratio"] = iris["sepal_length"] / iris["sepal_width"]
iris["sepal_ratio"].mean()
iris["sepal_ratio"].std()
iris_setosa = iris.loc[
    iris["species"] == "Iris-setosa"
]
iris_setosa["sepal_ratio"].mean()
iris_setosa["sepal_ratio"].std()
Task:
Analyze the S&P 500 monthly data and determine the monthly gain / loss in percent for each month
hint: use the .diff() method to get the difference between the previous row and the current one
sp500["Diff"] = sp500["SP500"].diff()
# relative gain / loss compared to the previous month's value
sp500["Gain"] = sp500["Diff"] / sp500["SP500"].shift(1)
def classifier(age):
    if pd.isna(age):
        # keep missing ages as missing (otherwise they would fall
        # through to the "senior" branch)
        return None
    elif age < 18:
        return "youth"
    elif age < 60:
        return "adult"
    else:
        return "senior"
titanic["AgeCls"] = titanic["Age"].apply(classifier)
(a more efficient alternative: pd.cut())
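A sketch of the same classification via pd.cut; the bin edges and labels mirror the function above, but the upper bound of 120 is an assumption and pd.cut bins are right-inclusive by default, so the boundaries differ slightly:
titanic["AgeCls"] = pd.cut(
    titanic["Age"],
    bins=[0, 18, 60, 120],
    labels=["youth", "adult", "senior"],
)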
setting values:
iris.iloc[0, 0] = 6
iris.loc[:, "sepal_ratio"] = float('nan')
Values that represent missing data (since pandas 1.0):
- NaN (as usual in Python)
- NA (from the pandas package)
checking how many values are present:
titanic["Age"].shape
# (891,)
titanic["Age"].count()
# 714
show all rows with missing age entries:
titanic[titanic["Age"].isna()]
removing any rows with missing data in any location:
titanic = titanic.dropna()
removing any rows with missing data in the age column:
titanic = titanic.dropna(subset=["Age"])
Filling missing data with zeros:
titanic["Age"] = titanic["Age"].fillna(0)
Filling missing data by using the last valid datapoint:
titanic["Age"] = titanic["Age"].ffill()
Filling missing data by using the next valid datapoint:
titanic["Age"] = titanic["Age"].bfill()
(in older pandas versions: .fillna(method="ffill") / .fillna(method="bfill"))
data = pd.Series(
[1, 2, 4, np.nan, 16, 32, np.nan, np.nan, 256]
)
data.interpolate("nearest")
data.interpolate("linear") # default
data.interpolate("slinear")
data.interpolate("quadratic")
data.interpolate("cubic")
Exercise:
Use the data from sp500 and euribor to compare the development of American and European interest rates
Note:
The sp500 has data for the first day of each month, the euribor has data for the first workday of each month
solution:
interest = pd.DataFrame({
"us": sp500["Long Interest Rate"],
"eu": euribor["rate"]
})
interest["eu"] = interest["eu"].interpolate("slinear")
interest = interest.dropna()
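The combined data could then be visualized, e.g. via:
interest.plot.line()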
index column: column that uniquely identifies a row in a DataFrame
multi index: combination of multiple columns for unique identification
example: exchange rates

| Date | Country | Exchange rate |
|---|---|---|
| 1971-01-01 | Australia | 0.894 |
| 1971-02-01 | Australia | 0.890 |
| 1971-03-01 | Australia | 0.890 |
A row can be uniquely identified by a combination of date and country
Importing with a multi index:
exchange_rates = pd.read_csv(
"https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
index_col=["Country", "Date"],
parse_dates=["Date"])
in some cases, a multi index may allow for using a Series instead of a DataFrame:
exchange_rates_series = exchange_rates["Exchange rate"]
querying the first part of a multi index (will return a Series with only the remainder of the index):
exchange_rates_series.loc["France"]
querying multiple parts of the index:
exchange_rates_series.loc["France", "1971-01-01":"1971-12-31"]
Querying in a DataFrame:
exchange_rates.loc[("France", "1990"), :]
exchange_rates.loc[(slice(None, None), "1990-01"), :]
Note: in Python, x[a:b] is equivalent to x[slice(a, b)]
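pd.IndexSlice can make such queries more readable; this is equivalent to the slice(None, None) expression above:
exchange_rates.loc[pd.IndexSlice[:, "1990-01"], :]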
Note: the Date column could be removed from the index via reset_index
Join: combining multiple DataFrames or Series into a single DataFrame
types: inner join, outer join, left join, right join
Joining Series objects by their indices:
outer join:
interest_rates = pd.DataFrame({
"usd": sp500["Long Interest Rate"],
"eur": euribor["rate"]
})
inner join:
interest_rates = pd.DataFrame({
"usd": sp500["Long Interest Rate"],
"eur": euribor["rate"]
}).dropna()
joining DataFrame objects by their indices:
inner join:
pd.merge(sp500, euribor, left_index=True, right_index=True)
outer join:
pd.merge(sp500, euribor, left_index=True, right_index=True,
how="outer")
joining on specific columns (not on the index):
sp500_no_index = pd.read_csv(
"https://datahub.io/core/s-and-p-500/r/data.csv",
parse_dates=["Date"],
)
euribor_no_index = pd.read_csv(
"https://datahub.io/core/euribor/r/euribor-12m-monthly.csv",
parse_dates=["date"],
usecols=["date", "rate"]
)
pd.merge(sp500_no_index, euribor_no_index, left_on="Date",
right_on="date")
short form if the corresponding columns have the same name:
pd.merge(sp500_no_index, euribor_no_index, on="date")
(will result in one date column instead of two)
Exercise: country comparisons (scatter plots)
data sources: GDP per capita (Maddison Project Database, via Our World in Data) and COVID-19 vaccination data (Our World in Data)
solution:
gdp = pd.read_csv(
"https://raw.githubusercontent.com/owid/owid-datasets/master/datasets/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020))/Maddison%20Project%20Database%202020%20(Bolt%20and%20van%20Zanden%20(2020)).csv",
index_col=["Entity", "Year"]
)
gdp_series = gdp["GDP per capita"]
gdp_2018 = gdp_series[:, 2018]
vac = pd.read_csv(
"https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/vaccinations/vaccinations.csv",
index_col=["location", "date"]
)
vac_series = vac["total_vaccinations_per_hundred"]
vac_2021_08 = vac_series[:, "2021-08-01"]
data = pd.DataFrame({"gdp": gdp_2018, "vaccinations": vac_2021_08}).dropna()
data.plot.scatter(x="gdp", y="vaccinations")
see also: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Aggregation: computing a derived value based on multiple data entries in a series / data frame (e.g. number of entries, average, median)
examples related to the Titanic passenger data follow below
functions and methods:
- series.value_counts()
- series.groupby() / df.groupby()
- pd.crosstab()
- pd.pivot_table()
number of passengers in each class:
titanic["Pclass"].value_counts()
# 3 491
# 1 216
# 2 184
median values of all numeric columns for passengers in each class:
titanic.groupby("Pclass").median(numeric_only=True)
median age of passengers in each class:
titanic.groupby("Pclass")["Age"].median()
# 1 37.0
# 2 29.0
# 3 24.0
number of passengers by class and sex (frequency table):
pd.crosstab(titanic["Pclass"], titanic["Sex"])
Sex     female  male
Pclass
1           94   122
2           76   108
3          144   347
average age by class and sex:
pd.crosstab(
index=titanic["Pclass"],
columns=titanic["Sex"],
values=titanic["Age"],
aggfunc="mean")
pd.pivot_table(
data=titanic,
values="Age",
index=["Pclass"],
columns=["Sex"],
aggfunc="mean")
Exercises:
- compute the mean values of each of the four iris measurements within each type of flower
- compute the mean USD exchange rates for each currency in the 90s, based on the following data:
exchange_rates = pd.read_csv(
"https://datahub.io/core/us-euro-foreign-exchange-rate/r/monthly.csv",
index_col=["Country", "Date"],
parse_dates=["Date"])
solutions:
iris.groupby("species").mean()
# "Date" is part of the multi index, so it is accessed via get_level_values
er_90s = exchange_rates[
    (exchange_rates.index.get_level_values("Date") >= "1990-01-01")
    & (exchange_rates.index.get_level_values("Date") <= "1999-12-31")
]
er_90s_means = er_90s.groupby("Country").mean()
Based on the Movielens dataset, find the top-rated movies that were rated by at least 10 users
ratings = pd.read_csv(
"https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
sep="\t",
header=None,
usecols=[0, 1, 2],
names=["user_id", "movie_id", "rating"],
)
movie_titles = pd.read_csv(
"https://files.grouplens.org/datasets/movielens/ml-100k/u.item",
sep="|",
header=None,
encoding="latin1",
usecols=[0, 1],
names=["movie_id", "title"],
)
movie_names = movie_titles.groupby("movie_id")["title"].first()
movie_mean_ratings = ratings.groupby("movie_id")["rating"].mean()
movie_num_ratings = ratings.groupby("movie_id")["rating"].count()
movie_ratings = pd.DataFrame({
"name": movie_names,
"mean_rating": movie_mean_ratings,
"num_ratings": movie_num_ratings
})
movie_ratings.query(
    "num_ratings >= 10 and mean_rating > 4.2"
).sort_values(by="mean_rating", ascending=False)
data types: pd.Timestamp / DatetimeIndex (points in time), pd.Period / PeriodIndex (time periods), pd.Timedelta / TimedeltaIndex (durations)
creating date ranges (as index objects):
every_day = pd.date_range("2000-01-01", "2000-12-31")
last_of_each_month = pd.date_range(
    "2000-01-01", "2000-12-31", freq="M"  # freq="ME" in newer pandas versions
)
first_of_each_month = pd.date_range(
"2000-01-01", "2000-12-31", freq="MS"
)
every_10_days = pd.date_range(
"2000-01-01", "2000-12-31", freq="10d"
)
converting between time stamps and time periods:
- .to_period()
- .to_timestamp()
(works on index objects and on Series / DataFrame objects)
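For example, a sketch of converting the date-indexed sp500 data to monthly periods and back:
sp500_periods = sp500.to_period("M")             # DatetimeIndex -> PeriodIndex
sp500_timestamps = sp500_periods.to_timestamp()  # back to time stamps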
resampling: getting derived data for different time intervals
example: based on monthly data:
sp500.resample("Y").mean()
sp500.resample("D").interpolate()
Checking if the first day of each month is present in the S&P 500 data:
sp500.index.equals(
pd.date_range(
sp500.index[0], sp500.index[-1], freq="MS"
)
)
# True
pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf