Regression: assigning numeric values to numeric input data
examples:
Classification: assigning labels to numeric input data
examples:
Clustering: recognizing groups / clusters in input data
examples:
Dimensionality reduction: mapping points in n-dimensional space to points in m-dimensional space (m << n, mapping is mostly reversible)
Reinforcement learning: optimization of strategies within a simulation
examples:
Python libraries for machine learning:
scikit-learn: classic machine learning algorithms (regression, classification, clustering, ...), built on NumPy, designed for simplicity
keras: high-level API for defining and training neural networks, bundled with TensorFlow
pytorch: NumPy-like tensor library with GPU support and tools for building neural networks
steps:
in scikit-learn:
create an input matrix (commonly named x or X) and a target vector / matrix (commonly named y or Y)
instantiate an algorithm class, e.g. KNeighborsClassifier, MLPClassifier, LinearRegression, ...
"learn" via model.fit(x, y)
evaluate via model.score(x_val, y_val) or metrics.accuracy_score(y_val, y_pred), ...
predict via model.predict(...) or model.predict_proba(...)
in keras:
create an input array (commonly named x) and a target array (commonly named y)
build a model from various layers
compile via model.compile() and "learn" via model.fit(x, y)
evaluate via model.evaluate(x_val, y_val)
predict via model.predict(...)
Iris dataset: simple example dataset for machine learning / data science
contains measurements of 150 iris plants: 3 different species with 50 samples each
properties in the dataset: sepal length, sepal width, petal length, petal width, species name
example CSV data from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
7.0,3.2,4.7,1.4,Iris-versicolor
...
6.3,3.3,6.0,2.5,Iris-virginica
...
Task: Train an algorithm to classify iris plants based on their measurements
# load data via pandas
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None,
names=["sepal_length", "sepal_width", "petal_length",
"petal_width", "species"]
)
# shuffle the entries (draw 100% of rows in random order)
iris_shuffled = iris.sample(frac=1.0)
# convert to numerical numpy arrays
measurements = iris_shuffled[[
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
]].to_numpy()
species = (
iris_shuffled["species"]
.replace({
"Iris-setosa": 0,
"Iris-versicolor": 1,
"Iris-virginica": 2,
})
.to_numpy()
)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
from tensorflow import keras
model = keras.Sequential([
keras.layers.Dense(8),
keras.layers.Activation("relu"),
keras.layers.Dense(3),
keras.layers.Activation("softmax")
])
model.compile(
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
import numpy as np

class NearestNeighborClassifier:
    def fit(self, x, y):
        # In a "real" machine learning algorithm,
        # a lot of processing could happen here.
        # In this case we're just storing the training data.
        self.x = x
        self.y = y

    def predict_single(self, x):
        # distances from the new point to every stored training point
        vectors = self.x - x
        distances = np.linalg.norm(vectors, axis=1)
        # return the class of the closest training point
        min_index = np.argmin(distances)
        return self.y[min_index]

model = NearestNeighborClassifier()
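For comparison with the scikit-learn / keras training calls below, a minimal usage sketch of this custom classifier (note that predict_single classifies a single measurement, unlike scikit-learn's predict):

model.fit(measurements[:130], species[:130])
print(model.predict_single(np.array([5.3, 3.4, 1.9, 0.6])))
# likely 0 (Iris setosa)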
Training based on the first 130 entries:
in scikit-learn:
model.fit(measurements[:130], species[:130])
in keras:
model.fit(
measurements[:130],
species[:130],
epochs=300,
validation_split=0.1
)
Applying classification to new data (in scikit-learn):
demo_measurements = np.array([
[5.3, 3.4, 1.9, 0.6],
[6.0, 3.0, 4.7, 1.5],
[6.5, 3.1, 5.0, 1.7]
])
model.predict(demo_measurements)
# e.g. [0, 1, 1] in scikit-learn
model.predict_proba(demo_measurements)
# [[0.9 0.1 0. ]
# [0. 0.8 0.2]
# [0. 0.7 0.3]]
measurements_val = measurements[130:]
species_val = species[130:]
in scikit-learn (accuracy):
print(model.score(measurements_val, species_val))
in keras (categorical crossentropy, accuracy):
print(model.evaluate(measurements_val, species_val))
Regression: assigning numeric values to numeric input data (e.g. estimating prices from purchase quantities)
Classification: assigning labels / classes to numeric input data (e.g. determining the species of an iris plant from its measurements)
Linear regression: a linear function is fitted to given data points (usually via least squares)
Example: various purchases in different supermarkets: quantities of two products together with the total price paid
task: estimate the price of each individual product
This may be solved via regression
input data:
1, 1 → 5.00
2, 3 → 13.50
3, 2 → 10.90
0, 0 → 0.00
result of a linear regression:
price = 0.05 + 1.13*x + 3.73*y
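A sketch of how this result could be reproduced with scikit-learn's LinearRegression (the two quantity columns from the input data above are the features, the total price is the target):

import numpy as np
from sklearn.linear_model import LinearRegression

quantities = np.array([[1, 1], [2, 3], [3, 2], [0, 0]])
prices = np.array([5.00, 13.50, 10.90, 0.00])

model = LinearRegression()
model.fit(quantities, prices)

print(model.intercept_, model.coef_)
# approximately 0.05 and [1.13, 3.73]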
Neural networks: machine learning strategy that vaguely resembles how neurons in the brain interact
K-nearest-neighbors: classification algorithm that assigns a class to a data point by looking at similar data points with a known classification
Decision trees: classification via a tree of simple yes / no decisions based on individual feature values
Example decision tree for iris classification:
Logistic regression: at the boundary of two classes, a logistic function is used to determine how likely it is that the data point belongs to one or the other class
The logistic function itself is determined via regression (hence the name)
Naive Bayes: data points are assumed to be part of a specific probability distribution; these distributions are derived from the training data
For a new data point, the algorithm determines under which of the distributions it would most likely occur
two important distributions: the normal (Gaussian) distribution for continuous values and the multinomial distribution for counts
Support vector machines: in the simplest case, separation of classes via lines / planes / hyperplanes; these separators should have maximum distance from the separated points
Borders may take different shapes by using kernel functions - e.g. conic sections or other curves
task: use other classifiers, e.g.:
sklearn.tree.DecisionTreeClassifier
sklearn.svm.SVC
sklearn.naive_bayes.GaussianNB
sklearn.neural_network.MLPClassifier
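For example, swapping in a decision tree only changes the model creation; fitting and evaluation stay the same (a sketch reusing the measurements / species arrays from above):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(measurements[:130], species[:130])
print(model.score(measurements[130:], species[130:]))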
desired data format for machine learning algorithms: numeric values on comparable scales, no missing entries, categorical / text data encoded as numbers
tasks: scaling values, handling missing data, encoding categorical data, preprocessing text data
Which of these stars is more similar to the sun?
# data: radius (km), mass (kg), temperature (K)
sun = [7.0e7, 2.0e30, 5.8e3]
star_a = [6.5e7, 2.2e30, 5.2e3]
star_b = [7.0e8, 2.1e30, 8.1e3]
some machine learning algorithms (like k-nearest-neighbors) directly compare absolute values / distances
here the algorithm would effectively only take the mass into account, as all other values are tiny in comparison
Solution: Before applying an algorithm, the values are centered and scaled (e.g. so their mean is 0 and the standard deviation is 1)
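A small sketch of such centering and scaling with scikit-learn's StandardScaler, applied to the star data above:

import numpy as np
from sklearn.preprocessing import StandardScaler

stars = np.array([
    [7.0e7, 2.0e30, 5.8e3],
    [6.5e7, 2.2e30, 5.2e3],
    [7.0e8, 2.1e30, 8.1e3],
])

scaler = StandardScaler()
stars_scaled = scaler.fit_transform(stars)
# each column now has mean 0 and standard deviation 1,
# so radius, mass and temperature contribute comparably to distances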
Missing data will often appear as NaNs
possible handling: removing rows / columns that contain missing data, or interpolating / filling in the missing values
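A brief sketch of both options with pandas (the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, np.nan, 7.0, 8.0],
})

# option 1: drop rows that contain missing values
df_dropped = df.dropna()

# option 2: fill missing values, e.g. with the column mean
df_filled = df.fillna(df.mean())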
input or output data may be categorical data - e.g. country, occupation, measuring method
example input data:
[["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]
desired result: encoding as numeric values
input data:
[["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]
encoding as ordinals (not appropriate for all algorithms, as there is an implicit order):
[[0., 0.], [1., 0.], [2., 1.]]
input data:
[["fr", "chrome"],
["uk", "chrome"],
["us", "firefox"]]
one-hot-encoding:
# fr?, uk?, us?, chrome?, firefox?
[[1., 0., 0., 1., 0.],
[0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1.]]
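Both encodings can be produced via scikit-learn's preprocessing module; a sketch for the input data above:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

data = [["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]

ordinal = OrdinalEncoder().fit_transform(data)
# [[0., 0.], [1., 0.], [2., 1.]]

one_hot = OneHotEncoder().fit_transform(data).toarray()
# [[1., 0., 0., 1., 0.],
#  [0., 1., 0., 1., 0.],
#  [0., 0., 1., 0., 1.]]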
example preprocessing for text classification: counting words
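A minimal sketch of word counting with scikit-learn's CountVectorizer (the example texts are made up):

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "this is a text about machine learning",
    "this text is about text classification",
]

vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(texts).toarray()

print(vectorizer.get_feature_names_out())
print(word_counts)  # one row of word counts per text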
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None)
iris_measures = iris.iloc[:, :4].to_numpy()
iris_species = iris.iloc[:, 4].to_numpy()
encoder = LabelBinarizer()
encoder.fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
scaler = StandardScaler()
scaler.fit(iris_measures)
iris_measures_scaled = scaler.transform(iris_measures)
x = iris_measures_scaled
y = iris_species_one_hot
In order to verify the results of an algorithm:
Data are split into training data and test data / validation data
for iterative algorithms (e.g. neural networks in keras): pass validation_split or validation_data to model.fit()
for other algorithms (e.g. in sklearn): split the data manually or via train_test_split
To find the best model: train several models / configurations on the training data and compare their scores on the test data
see Python Data Science Handbook → Hyperparameters and Model Validation → Selecting the Best Model
classification metrics: accuracy, confusion matrix, precision / recall / f-score, ROC / AUC, cross entropy (log loss)
regression metrics: mean squared error, coefficient of determination (R²)
example:
a basket of fruits contains 10 apples, 10 oranges and 10 peaches
a classification algorithm classifies 8 apples, all 10 oranges and 9 peaches correctly; the remaining fruits are misclassified (see the confusion matrix below)
accuracy: relative amount of correct classifications (in our example: 27/30=0.9)
confusion matrix: table of classifications for each category
|         | apples | oranges | peaches |
| ------- | ------ | ------- | ------- |
| apples  | 8      | 0       | 2       |
| oranges | 0      | 10      | 0       |
| peaches | 1      | 0       | 9       |
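Such a table could be computed via scikit-learn; a sketch with label arrays matching the fruit example (rows = actual class, columns = predicted class):

from sklearn.metrics import confusion_matrix

# 0 = apple, 1 = orange, 2 = peach
actual = [0] * 10 + [1] * 10 + [2] * 10
predicted = [0] * 8 + [2] * 2 + [1] * 10 + [0] * 1 + [2] * 9

print(confusion_matrix(actual, predicted))
# [[ 8  0  2]
#  [ 0 10  0]
#  [ 1  0  9]]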
mean squared error
coefficient of determination (R²):
compares the mean squared error of the regression with the variance of the dataset
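A sketch of both regression metrics via scikit-learn's metrics module, using made-up true and predicted values:

from sklearn import metrics

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print(metrics.mean_squared_error(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))
# R² is 1 for a perfect fit and 0 for a model that always predicts the mean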
binary classification: spam detection
example:
60 regular messages, 40 spam messages
1 regular message is misclassified as spam
5 spam messages are misclassified as regular messages
precision = 35/36 = 0.97 (35 out of 36 messages that were classified as spam are actually spam)
recall = 35/40 = 0.88 (35 out of 40 spam messages were recognized as spam)
see also: precision and recall on Wikipedia
precision and recall have different relevance in different scenarios
example: when classifying emails as spam, precision is very important (avoiding classifying a regular email as spam)
f-score = harmonic mean of precision and recall: 2 * precision * recall / (precision + recall)
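A sketch of these metrics in scikit-learn, with label arrays built from the spam example above (1 = spam, 0 = regular):

from sklearn import metrics

# 60 regular messages (1 misclassified as spam),
# 40 spam messages (5 misclassified as regular)
y_true = [0] * 60 + [1] * 40
y_pred = [0] * 59 + [1] * 1 + [1] * 35 + [0] * 5

print(metrics.precision_score(y_true, y_pred))  # 35/36 ≈ 0.97
print(metrics.recall_score(y_true, y_pred))     # 35/40 ≈ 0.88
print(metrics.f1_score(y_true, y_pred))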
ROC (receiver operating characteristic): metric that represents a classifier's true positives and false positives
a classification algorithm can be fine-tuned with respect to its true positives rate and false positives rate, e.g. by choosing a stricter or less strict classification threshold
The ROC may be displayed as a curve; the bigger the area under the curve (AUC), the better the classification
cross entropy (log loss): measures how well a model of a probability distribution approximates the actual probability distribution
relevant for neural networks and logistic regression
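A sketch with scikit-learn's log_loss, using made-up predicted class probabilities for three samples:

from sklearn.metrics import log_loss

y_true = [0, 1, 1]
# predicted probabilities for classes 0 and 1
y_pred_proba = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]

print(log_loss(y_true, y_pred_proba))
# lower values mean the predicted distribution is closer to the actual one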
possible problem for machine learning algorithms: The algorithm is too flexible and recognizes apparent patterns in the training data
Algorithms that are vulnerable to overfitting: e.g. decision trees, polynomial regression, neural networks
for polynomial regression see: Data Science Handbook - Regularization
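A tiny illustration of the problem with polynomial regression (made-up data points; a degree-4 polynomial has enough parameters to pass exactly through all five training points, but may behave erratically between and beyond them):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

# degree-4 polynomial: fits the five training points exactly
coefficients = np.polyfit(x, y, deg=4)
polynomial = np.poly1d(coefficients)

print(polynomial(x) - y)  # residuals are (numerically) zero
print(polynomial(5.0))    # extrapolation may be far off the roughly linear trend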
manual train-test split:
rng = np.random.default_rng(seed=1)
random_indexes = rng.permutation(x.shape[0])
# e.g. [65, 44, 22, 133, 47, ...]
x_train = x[random_indexes[:120]]
y_train = y[random_indexes[:120]]
x_test = x[random_indexes[120:]]
y_test = y[random_indexes[120:]]
automated train-test split via scikit-learn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)
validation based on test data:
from sklearn import metrics
y_pred = model.predict(x_test)
score = metrics.accuracy_score(y_test, y_pred)
print("accuracy:", score)
import pandas as pd
from sklearn.preprocessing import (
LabelBinarizer,
StandardScaler,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# loading data
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None)
iris_measures = iris.iloc[:, :4].to_numpy()
iris_species = iris.iloc[:, 4].to_numpy()
# preparing data
encoder = LabelBinarizer()
encoder.fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
scaler = StandardScaler()
scaler.fit(iris_measures)
iris_measures_scaled = scaler.transform(iris_measures)
X = iris_measures_scaled
Y = iris_species_one_hot
# train-test-split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
# training
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
# validation
Y_prediction = model.predict(X_test)
score = metrics.accuracy_score(Y_test, Y_prediction)
print("accuracy:", score)
# predicting further species
new_iris_data = [
[5.3, 3.4, 1.9, 0.6],
[6.0, 3.0, 4.7, 1.5],
[6.5, 3.1, 5.0, 1.7]
]
new_iris_predictions = model.predict(
scaler.transform(new_iris_data)
)
print("prediction data:")
print(new_iris_predictions)
predicted_labels = encoder.inverse_transform(
new_iris_predictions
)
print("predicted labels:")
print(predicted_labels)