Given some random results from a probability distribution, we want to determine the distribution
example: age distribution of the titanic data
titanic["Age"].plot.hist(density=True, bins=20)
visualization of the kernel density estimation (KDE):
titanic["Age"].plot.kde(xlim=(0, 80))
kernel density estimation = approximation of density by combining (summing) many base functions (e.g. normal distributions)
see: Python Data Science Handbook: Kernel Density Estimation
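a minimal sketch of this idea (assuming the titanic DataFrame from above; scipy.stats.gaussian_kde is one readily available implementation):
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# kernel density estimate of the age distribution, evaluated on a grid
ages = titanic["Age"].dropna()
kde = stats.gaussian_kde(ages)
x = np.linspace(0, 80, 100)
plt.plot(x, kde(x))
titanic["Age"].plot.hist(density=True, bins=20)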
package scipy.stats: allows for fitting various types of distributions
approximating a distribution - via normal or skew normal:
from scipy import stats
# location=mu, scale=sigma
(location, scale) = stats.norm.fit(titanic["Age"].dropna())
dist_norm = stats.norm(location, scale)
(shape, location, scale) = stats.skewnorm.fit(titanic["Age"].dropna())
dist_skewnorm = stats.skewnorm(shape, location, scale)
visualize approximations:
x = np.linspace(0, 80, 100)
plt.plot(x, dist_norm.pdf(x))
plt.plot(x, dist_skewnorm.pdf(x))
titanic["Age"].plot.hist(density=True, bins=20)
Task: fit some other distributions (e.g. lognorm, gamma, ...)
evaluation (Kolmogorov-Smirnov-test):
stats.kstest(titanic["Age"].dropna(), dist_norm.cdf)
# KstestResult(statistic=0.054, pvalue=0.029)
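a possible sketch for the above task: fitting several candidate distributions (e.g. lognorm, gamma) and comparing them via the Kolmogorov-Smirnov test:
from scipy import stats

ages = titanic["Age"].dropna()
# fit each candidate distribution and print its KS statistic / p-value
for dist in (stats.norm, stats.skewnorm, stats.lognorm, stats.gamma):
    params = dist.fit(ages)
    result = stats.kstest(ages, dist(*params).cdf)
    print(dist.name, result.statistic, result.pvalue)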
Linear regression: a linear function is fitted to given data points (usually via least squares)
Example: various purchases in different supermarkets:
task: estimate the prices of the individual items
this may be solved via linear regression
input data (quantities of the two items → total price of the purchase):
1, 1 → 5.00
2, 3 → 13.50
3, 2 → 10.90
0, 0 → 0.00
result of a linear regression:
price = 0.05 + 1.13*x + 3.73*y
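a minimal sketch reproducing this result (scikit-learn is used here as one option):
from sklearn.linear_model import LinearRegression

# quantities of the two items and total prices of the purchases
X = [[1, 1], [2, 3], [3, 2], [0, 0]]
y = [5.00, 13.50, 10.90, 0.00]

model = LinearRegression()
model.fit(X, y)
print(model.intercept_, model.coef_)  # ≈ 0.05, [1.13, 3.73]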
most common variant:
OLS (ordinary least squares): the sum of squared errors (deviations) should be minimal
general: y = c0 + c1*x1 + c2*x2 + ...
one coefficient value per feature / independent variable
Python packages for linear regression: statsmodels, scikit-learn
goal: find correlations between patient data and diabetes disease progression after one year
from sklearn.datasets import load_diabetes
# load dataset as a pandas dataframe
dataset = load_diabetes(as_frame=True)
print(dataset.DESCR)
print(dataset.data)
print(dataset.target)
print(dataset.target.describe())
import statsmodels.api as sm
# add constant column for base value of y (intercept)
x = sm.add_constant(dataset.data)
y = dataset.target
model = sm.OLS(y, x)
res = model.fit()
res.summary()
only keep columns whose influence is significant (low p-value):
x_cleaned = x.drop(["age", "s3", "s6", "s4"], axis=1)
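a sketch of re-fitting the model with the reduced set of columns:
model_cleaned = sm.OLS(y, x_cleaned)
res_cleaned = model_cleaned.fit()
res_cleaned.summary()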
instead of a linear relation, we could assume a polynomial (or other) relation between independent and dependent variables
assuming a quadratic relation with age in the diabetes dataset
adding another column:
x["age_squared"] = x["age"] ** 2
regularization to avoid large coefficients:
# elastic net regularization: alpha = overall penalty strength,
# L1_wt = weighting of the L1 (lasso) vs the L2 (ridge) penalty
model.fit_regularized(alpha=0.01, L1_wt=0.1)
appropriate regression for binary classification (0 / 1): logistic regression
https://www.kaggle.com/jojoker/titanic-survival-chance-with-statsmodels
import data:
import pandas as pd
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
index_col="PassengerId",
)
prepare data:
# add a numeric column derived from the "Sex" column (female: 1, male: 0)
titanic["Female"] = titanic["Sex"].replace(
{"female": 1, "male": 0}
)
# remove rows with missing age
titanic = titanic.dropna(subset=["Age"])
import statsmodels.api as sm
x = titanic[["Pclass", "Age", "SibSp", "Parch", "Fare", "Female"]]
x = sm.add_constant(x)
y = titanic["Survived"]
model = sm.Logit(y, x)
results = model.fit()
results.summary()
dropping columns with a high p-value:
x = x.drop(["Parch", "Fare"], axis=1)
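sketch: re-fit the logistic regression with the remaining columns and compute predicted probabilities:
model = sm.Logit(y, x)
results = model.fit()
results.summary()
# predicted survival probabilities for the training data
results.predict(x)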
Simplification of data with a large number of attributes/features to data with fewer, but more expressive/meaningful attributes
see also: Dimensionality Reduction on Machine Learning Mastery
We will apply a dimensionality reduction algorithm to the MovieLens user ratings, reducing the rating data to a small number of attributes (roughly 25); based on these attributes, we can then recommend similar movies
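a hypothetical sketch of such a reduction (random data stands in for the actual MovieLens rating matrix; NMF from scikit-learn is just one possible algorithm):
import numpy as np
from sklearn.decomposition import NMF

# stand-in for the rating matrix: rows = users, columns = movies, 0 = no rating
ratings = np.random.randint(0, 6, size=(100, 300)).astype(float)

# reduce to 25 latent attributes
model = NMF(n_components=25, max_iter=500)
user_factors = model.fit_transform(ratings)  # shape: (100, 25)
movie_factors = model.components_  # shape: (25, 300)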
finding clusters in input data
strategies: e.g. k-means clustering (described below)
process: cluster centers are determined in n-dimensional space; a data point is assigned to the cluster whose center is closest
determining the cluster centers:
random initialization of the centers
then, repeatedly:
assign each data point to the closest cluster center
recompute each cluster center as the mean of its assigned data points
this process converges (see the sketch below)
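a minimal sketch of this iteration (not the implementation used by scikit-learn; edge cases such as empty clusters are ignored):
import numpy as np

def k_means_sketch(points, k, n_iterations=10):
    # random initialization: use k of the data points as initial centers
    centers = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(n_iterations):
        # assign each point to the closest center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return centers, labels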
example for iris measurements (assuming species are unknown):
measurements = np.array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3.0, 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2], ...])
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(measurements)
cluster centers:
model.cluster_centers_
visualizing clusters:
species_pred = model.predict(measurements)
plt.scatter(
measurements[:, 0],
measurements[:, 1],
c=species_pred
)
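optionally, the cluster centers can be marked in the same plot:
plt.scatter(
    model.cluster_centers_[:, 0],
    model.cluster_centers_[:, 1],
    c="red",
    marker="X",
)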
examples: