Given some random results from a probability distribution, we want to determine the distribution
example: age distribution of the titanic data
titanic["Age"].plot.hist(density=True, bins=20)
visualization of the kernel density estimation (KDE):
titanic["Age"].plot.kde(xlim=(0, 80))
kernel density estimation = approximation of density by combining (summing) many base functions (e.g. normal distributions)
see: Python Data Science Handbook: Kernel Density Estimation
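a minimal sketch of this idea (assuming the titanic DataFrame from above; scipy.stats.gaussian_kde is one readily available implementation):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# kernel density estimate of the age distribution, evaluated on a grid
ages = titanic["Age"].dropna()
kde = stats.gaussian_kde(ages)
x = np.linspace(0, 80, 100)
plt.plot(x, kde(x))
titanic["Age"].plot.hist(density=True, bins=20)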
package scipy.stats: allows for fitting various types of distributions
approximating a distribution - via normal or skew normal:
from scipy import stats
# location=mu, scale=sigma
(location, scale) = stats.norm.fit(titanic["Age"].dropna())
dist_norm = stats.norm(location, scale)
(shape, location, scale) = stats.skewnorm.fit(titanic["Age"].dropna())
dist_skewnorm = stats.skewnorm(shape, location, scale)
visualize approximations:
x = np.linspace(0, 80, 100)
plt.plot(x, dist_norm.pdf(x))
plt.plot(x, dist_skewnorm.pdf(x))
titanic["Age"].plot.hist(density=True, bins=20)
Task: fit some other distributions (e.g. lognorm, gamma, ...)
evaluation (Kolmogorov-Smirnov-test):
stats.kstest(titanic["Age"].dropna(), dist_norm.cdf)
# KstestResult(statistic=0.054, pvalue=0.029)
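a possible sketch for the above task: fitting several candidate distributions (e.g. lognorm, gamma) and comparing them via the Kolmogorov-Smirnov test:
from scipy import stats

ages = titanic["Age"].dropna()
# fit each candidate distribution and print its KS statistic / p-value
for dist in (stats.norm, stats.skewnorm, stats.lognorm, stats.gamma):
    params = dist.fit(ages)
    result = stats.kstest(ages, dist(*params).cdf)
    print(dist.name, result.statistic, result.pvalue)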
Linear regression: a linear function is fitted to given data points (usually via least squares)
Example: various purchases in different supermarkets:
task: estimate the prices of the individual items
this may be solved via linear regression
input data (quantities of the two items → total price of the purchase):
1, 1 → 5.00
2, 3 → 13.50
3, 2 → 10.90
0, 0 → 0.00
result of a linear regression:
price = 0.05 + 1.13*x + 3.73*y
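a minimal sketch reproducing this result (scikit-learn is used here as one option):
from sklearn.linear_model import LinearRegression

# quantities of the two items and total prices of the purchases
X = [[1, 1], [2, 3], [3, 2], [0, 0]]
y = [5.00, 13.50, 10.90, 0.00]

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)  # ≈ 0.05, [1.13, 3.73]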
most common variant:
OLS (ordinary least squares): the sum of squared errors (deviations) should be minimal
general: y = c0 + c1*x1 + c2*x2 + ...
one coefficient value per feature / independent variable
Python packages for linear regression: statsmodels, scikit-learn
goal: find correlations between patient data and diabetes disease progression after one year
from sklearn.datasets import load_diabetes
# load dataset as a pandas dataframe
dataset = load_diabetes(as_frame=True)
print(dataset.DESCR)
print(dataset.data)
print(dataset.target)
print(dataset.target.describe())
import statsmodels.api as sm
# add constant column for base value of y (intercept)
x = sm.add_constant(dataset.data)
y = dataset.target
model = sm.OLS(y, x)
res = model.fit()
res.summary()
only keep columns whose influence is significant (low p-value):
x_cleaned = x.drop(["age", "s3", "s6", "s4"], axis=1)
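a sketch of re-fitting the model with the reduced set of columns:
model_cleaned = sm.OLS(y, x_cleaned)
res_cleaned = model_cleaned.fit()
res_cleaned.summary()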
instead of a linear relation, we could assume a polynomial (or other) relation between independent and dependent variables
assuming a quadratic relation with age in the diabetes dataset
adding another column:
x["age_squared"] = x["age"] ** 2
regularization to avoid large coefficients:
# elastic net regularization: alpha = overall penalty strength,
# L1_wt = weighting of the L1 (lasso) vs the L2 (ridge) penalty
model.fit_regularized(alpha=0.01, L1_wt=0.1)
appropriate regression for binary classification (0 / 1): logistic regression
https://www.kaggle.com/jojoker/titanic-survival-chance-with-statsmodels
import data:
import pandas as pd
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
index_col="PassengerId",
)
prepare data:
# add a numeric column derived from the "Sex" column (female: 1, male: 0)
titanic["Female"] = titanic["Sex"].replace(
{"female": 1, "male": 0}
)
# remove rows with missing age
titanic = titanic.dropna(subset=["Age"])
import statsmodels.api as sm
x = titanic[["Pclass", "Age", "SibSp", "Parch", "Fare", "Female"]]
x = sm.add_constant(x)
y = titanic["Survived"]
model = sm.Logit(y, x)
results = model.fit()
results.summary()
dropping columns with a high p-value:
x = x.drop(["Parch", "Fare"], axis=1)
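sketch: re-fit the logistic regression with the remaining columns and compute predicted probabilities:
model = sm.Logit(y, x)
results = model.fit()
results.summary()
# predicted survival probabilities for the training data
results.predict(x)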
Simplification of data with a large number of attributes/features to data with fewer, but more expressive/meaningful attributes
see also: Dimensionality Reduction on Machine Learning Mastery
We will apply a dimensionality reduction algorithm to the MovieLens user ratings, reducing the rating data to a small number of attributes (roughly 25); based on these attributes, we can then recommend similar movies
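a hypothetical sketch of such a reduction (random data stands in for the actual MovieLens rating matrix; NMF from scikit-learn is just one possible algorithm):
import numpy as np
from sklearn.decomposition import NMF

# stand-in for the rating matrix: rows = users, columns = movies, 0 = no rating
ratings = np.random.randint(0, 6, size=(100, 300)).astype(float)

# reduce to 25 latent attributes
model = NMF(n_components=25, max_iter=500)
user_factors = model.fit_transform(ratings)  # shape: (100, 25)
movie_factors = model.components_  # shape: (25, 300)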
finding clusters in input data
strategies: e.g. k-means clustering (described below)
process: cluster centers are determined in n-dimensional space; a data point is assigned to the cluster whose center is closest
determining the cluster centers:
random initialization of the centers
then, repeatedly:
assign each data point to the closest cluster center
recompute each cluster center as the mean of its assigned data points
this process converges (see the sketch below)
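a minimal sketch of this iteration (not the implementation used by scikit-learn; edge cases such as empty clusters are ignored):
import numpy as np

def k_means_sketch(points, k, n_iterations=10):
    # random initialization: use k of the data points as initial centers
    centers = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(n_iterations):
        # assign each point to the closest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return centers, labels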
example for iris measurements (assuming species are unknown):
measurements = np.array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3.0, 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2], ...])
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(measurements)
cluster centers:
model.cluster_centers_
visualizing clusters:
species_pred = model.predict(measurements)
plt.scatter(
measurements[:, 0],
measurements[:, 1],
c=species_pred
)
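optionally, the cluster centers can be marked in the same plot:
plt.scatter(
    model.cluster_centers_[:, 0],
    model.cluster_centers_[:, 1],
    c="red",
    marker="X",
)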
examples: