Distribution fitting and kernel density estimation

Distribution fitting and kernel density estimation

Given random samples from an unknown probability distribution, we want to determine (approximate) that distribution

example: age distribution of the titanic data

titanic["Age"].plot.hist(density=True, bins=20)

Kernel density estimation

visualization of the kernel density estimation (KDE):

titanic["Age"].plot.kde(xlim=(0, 80))

kernel density estimation = approximation of density by combining (summing) many base functions (e.g. normal distributions)

see: Python Data Science Handbook: Kernel Density Estimation
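
The idea can be illustrated by computing a KDE "by hand": place one normal distribution on each observation and average them. A minimal sketch, assuming the titanic DataFrame from above and an arbitrarily chosen bandwidth:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

ages = titanic["Age"].dropna().to_numpy()
x = np.linspace(0, 80, 200)
bandwidth = 3  # assumed smoothing width (not from the course material)

# average one normal distribution centered on each data point
kde_values = np.mean(
    [stats.norm(age, bandwidth).pdf(x) for age in ages], axis=0
)
plt.plot(x, kde_values)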

Distribution fitting

package scipy.stats: allows for fitting various types of distributions:

  • normal
  • skew normal
  • log-normal
  • gamma
  • ...

Distribution fitting

approximating the distribution via a normal or a skew normal distribution:

from scipy import stats

# location=mu, scale=sigma
(location, scale) = stats.norm.fit(titanic["Age"].dropna())
dist_norm = stats.norm(location, scale)

(shape, location, scale) = stats.skewnorm.fit(titanic["Age"].dropna())
dist_skewnorm = stats.skewnorm(shape, location, scale)

Distribution fitting

visualize approximations:

x = np.linspace(0, 80, 100)
plt.plot(x, dist_norm.pdf(x))
plt.plot(x, dist_skewnorm.pdf(x))
titanic["Age"].plot.hist(density=True, bins=20)

Distribution fitting

Task: fit some other distributions (e.g. lognorm, gamma, ...)
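
one possible sketch (note that lognorm and gamma have an additional shape parameter):

from scipy import stats

ages = titanic["Age"].dropna()

(shape, location, scale) = stats.lognorm.fit(ages)
dist_lognorm = stats.lognorm(shape, location, scale)

(shape, location, scale) = stats.gamma.fit(ages)
dist_gamma = stats.gamma(shape, location, scale)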

Distribution fitting

evaluation (Kolmogorov-Smirnov test):

stats.kstest(titanic["Age"].dropna(), dist_norm.cdf)

# KstestResult(statistic=0.054, pvalue=0.029)
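
the same test can be applied to the other fitted distributions; a smaller statistic (and larger p-value) indicates a better fit:

stats.kstest(titanic["Age"].dropna(), dist_skewnorm.cdf)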

Linear regression

Linear regression

Linear regression: a linear function is fitted to given data points (usually via least squares)

  • independent variable(s) (x)
  • dependent variable(s) (y)

Linear regression

Example: various purchases in different supermarkets:

  • 1 l of milk, 1 kg of bread: 5.00€
  • 2 l of milk, 3 kg of bread: 13.50€
  • 3 l of milk, 2 kg of bread: 10.90€
  • (0 l of milk, 0 kg of bread: 0€)

task: estimate prices of:

  • 1 l of milk
  • 1 kg of bread
  • 2 l of milk and 2 kg of bread

This may be solved via regression

Linear regression

input data:

1, 1 → 5.00
2, 3 → 13.50
3, 2 → 10.90
0, 0 → 0.00

result of a linear regression:

price = 0.05 + 1.13*x + 3.73*y
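
as a sketch, this result can be reproduced with NumPy's least squares solver (variable names are chosen for illustration):

import numpy as np

# one row per purchase: [1 (for the intercept), liters of milk, kg of bread]
X = np.array([[1, 1, 1],
              [1, 2, 3],
              [1, 3, 2],
              [1, 0, 0]])
prices = np.array([5.00, 13.50, 10.90, 0.00])

# least squares solution: [intercept, price of milk, price of bread]
coefficients, *_ = np.linalg.lstsq(X, prices, rcond=None)
# ≈ [0.05, 1.13, 3.73]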

Linear regression

most common variant:

OLS (ordinary least squares): the sum of squared errors (deviations) should be minimal
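
as a sketch, the quantity that is minimized (reusing X, prices and coefficients from the supermarket example above):

predicted = X @ coefficients
squared_error_sum = ((prices - predicted) ** 2).sum()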

Metrics

general:

  • F-test / F-statistic: tests whether the model as a whole explains the data significantly better than chance (assuming normally distributed errors)
  • RMSE (root mean squared error): square root of the mean of the squared errors
  • R² ("R squared", coefficient of determination)
    • measure between 0 and 1 (higher = better fit)
    • how much of the variance of the dependent variable can be explained by the independent variables?
  • Akaike information criterion (AIC), Bayesian information criterion (BIC) (lower is better)

Metrics

values per feature / independent variable:

  • value: estimated coefficient
  • std error: standard error of the coefficient
  • p-value: probability of obtaining such a coefficient by chance if the output were actually independent of this input (low value = significant influence)
  • t-statistic: ratio of the estimated coefficient to its standard error

Linear regression with statsmodels

Linear regression in Python

Python packages for linear regression:

  • statsmodels
  • scikit-learn

Example: diabetes data

goal: find correlation between patient data and diabetes disease progression after one year

from sklearn.datasets import load_diabetes

# load dataset as a pandas dataframe
dataset = load_diabetes(as_frame=True)

print(dataset.DESCR)

print(dataset.data)
print(dataset.target)
print(dataset.target.describe())

Example: diabetes data

import statsmodels.api as sm

# add constant column for base value of y (intercept)
x = sm.add_constant(dataset.data)

y = dataset.target

Example: diabetes data

model = sm.OLS(y, x)

res = model.fit()

res.summary()
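
the metrics from the previous slides can also be read off the result object individually (standard statsmodels attributes):

print(res.rsquared)              # R²
print(res.fvalue, res.f_pvalue)  # F-statistic and its p-value
print(res.aic, res.bic)          # information criteria
print(res.params)                # estimated coefficients
print(res.pvalues)               # p-value per feature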

Example: diabetes data

only keep features whose influence is significant:

x_cleaned = x.drop(["age", "s3", "s6", "s4"], axis=1)
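
refitting with the reduced feature set (a sketch, same procedure as before):

model_cleaned = sm.OLS(y, x_cleaned)
res_cleaned = model_cleaned.fit()
res_cleaned.summary()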

Regression: polynomial regression and regularization

Polynomial regression

instead of a linear relation, we could assume a polynomial (or other) relation between independent and dependent variables

Polynomial regression

assuming a quadratic relation with age in the diabetes dataset

adding another column:

x["age_squared"] = x["age"] ** 2

Regularization

regularizations to avoid big coefficients:

  • L1 regularization (Lasso): absolute values of coefficients are penalized
  • L2 regularization (Ridge): squared values of coefficients are penalized

in statsmodels via elastic net (L1_wt=1 corresponds to pure Lasso, L1_wt=0 to pure Ridge):

model.fit_regularized(alpha=0.01, L1_wt=0.1)

Logistic regression with statsmodels

Logistic regression with statsmodels

appropriate regression for binary classification (0 / 1): logistic regression

Logistic regression with statsmodels

https://www.kaggle.com/jojoker/titanic-survival-chance-with-statsmodels

Logistic regression with statsmodels

import data:

import pandas as pd
titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    index_col="PassengerId",
)

Logistic regression with statsmodels

prepare data:

# column with a numeric value
titanic["Female"] = titanic["Sex"].replace(
    {"female": 1, "male": 0}
)

# remove rows with missing age
titanic = titanic.dropna(subset=["Age"])

Logistic regression with statsmodels

import statsmodels.api as sm

x = titanic[["Pclass", "Age", "SibSp", "Parch", "Fare", "Female"]]
x = sm.add_constant(x)

y = titanic["Survived"]

Logistic regression with statsmodels

model = sm.Logit(y, x)
results = model.fit()
results.summary()
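
the fitted model can predict survival probabilities (a sketch using the statsmodels predict method):

# predicted survival probabilities for the training data
survival_prob = results.predict(x)

# e.g. classify as "survived" if the probability is above 0.5
survival_pred = survival_prob > 0.5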

Logistic regression with statsmodels

discarding features with a high p-value:

x = x.drop(["Parch", "Fare"], axis=1)

Dimensionality reduction

Dimensionality reduction

Simplification of data with a large number of attributes/features to data with fewer, but more expressive/meaningful attributes

Techniques

  • feature selection: select some features, discard the rest
  • principal component analysis
  • autoencoder (neural network)
  • manifold learning

see also: Dimensionality Reduction on Machine Learning Mastery

Techniques

Example: MovieLens recommendations

We will apply a dimensionality reduction algorithm to the MovieLens user ratings, reducing the rating data to a few attributes (roughly 25); based on these attributes we can then recommend similar movies
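
a possible sketch with scikit-learn; the variable ratings (a user-by-movie matrix with users as rows and movies as columns) is an assumption, not part of the course data:

from sklearn.decomposition import TruncatedSVD

# reduce each movie (a column of the rating matrix) to about 25 latent attributes
svd = TruncatedSVD(n_components=25)
movie_features = svd.fit_transform(ratings.T)

# movies with similar feature vectors (e.g. by cosine similarity) can then be recommended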

Clustering

Clustering

finding clusters in input data

strategies:

  • k-means clustering
  • gaussian mixture models

k-means clustering

process: Cluster centers are determined in n-dimensional space. Each data point is assigned to the cluster whose center is closest

determining the cluster centers:

random initialization of centers

repeatedly:

  • classify each data point according to the closest cluster center
  • recompute the cluster centers as the mean of all associated points

This process converges

see: Python Data Science Handbook - k-Means Clustering
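
a minimal sketch of this procedure in NumPy (assumes that no cluster ends up empty and that a fixed number of iterations is enough for convergence):

import numpy as np

def k_means(points, k, iterations=10):
    # random initialization: pick k of the data points as initial centers
    rng = np.random.default_rng(0)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iterations):
        # assign each point to its closest center
        distances = np.linalg.norm(points[:, np.newaxis] - centers, axis=2)
        labels = distances.argmin(axis=1)
        # recompute each center as the mean of its assigned points
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels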

k-means clustering: example

example for iris measurements (assuming species are unknown):

measurements = np.array([[5.1, 3.5, 1.4, 0.2],
                         [4.9, 3.0, 1.4, 0.2],
                         [4.7, 3.2, 1.3, 0.2], ...])

from sklearn.cluster import KMeans

model = KMeans(n_clusters=3)
model.fit(measurements)

k-means clustering: example

cluster centers:

model.cluster_centers_

k-means clustering: example

visualizing clusters:

species_pred = model.predict(measurements)

plt.scatter(
    measurements[:, 0],
    measurements[:, 1],
    c=species_pred
)

k-means clustering

examples: