Regression: assigning numeric values to numeric input data
examples:
Classification: assigning labels to numeric input data
examples:
Clustering: recognizing groups / clusters in input data
examples:
Dimensionality reduction: mapping points in n-dimensional space to points in m-dimensional space (m << n, mapping is mostly reversible)
Reinforcement learning: optimization of strategies within a simulation
examples:
Python libraries for machine learning:
scikit-learn: classic machine learning algorithms (regression, classification, clustering, ...), built on NumPy, designed for simplicity
keras: high-level API for defining and training neural networks, bundled with TensorFlow
pytorch: NumPy-like tensor library with GPU support and tools for building neural networks
steps:
in scikit-learn:
create an input matrix (commonly named x or X) and a target vector / matrix (commonly named y or Y)
instantiate an algorithm class, e.g. KNeighborsClassifier, MLPClassifier, LinearRegression, ...
"learn" via model.fit(x, y)
evaluate via model.score(x_val, y_val) or metrics.accuracy_score(y_val, y_pred), ...
predict via model.predict(...) or model.predict_proba(...)
in keras:
create an input array (commonly named x) and a target array (commonly named y)
build a model from various layers
compile via model.compile() and "learn" via model.fit(x, y)
evaluate via model.evaluate(x_val, y_val)
predict via model.predict(...)
Iris dataset: simple example dataset for machine learning / data science
contains measurements of 150 iris plants: 3 different species with 50 samples each
properties in the dataset: sepal length, sepal width, petal length, petal width, species name
example CSV data from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
...
7.0,3.2,4.7,1.4,Iris-versicolor
...
6.3,3.3,6.0,2.5,Iris-virginica
...
Task: Train an algorithm to classify iris plants based on their measurements
# load data via pandas
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None,
names=["sepal_length", "sepal_width", "petal_length",
"petal_width", "species"]
)
# shuffle the entries (draw 100% of rows in random order)
iris_shuffled = iris.sample(frac=1.0)
# convert to numerical numpy arrays
measurements = iris_shuffled[[
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
]].to_numpy()
species = (
iris_shuffled["species"]
.replace({
"Iris-setosa": 0,
"Iris-versicolor": 1,
"Iris-virginica": 2,
})
.to_numpy()
)
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
from tensorflow import keras
model = keras.Sequential([
keras.layers.Dense(8),
keras.layers.Activation("relu"),
keras.layers.Dense(3),
keras.layers.Activation("softmax")
])
model.compile(
loss="sparse_categorical_crossentropy",
metrics=["accuracy"]
)
import numpy as np

class NearestNeighborClassifier:
    def fit(self, x, y):
        # In a "real" machine learning algorithm,
        # a lot of processing could happen here.
        # In this case we're just storing the training data.
        self.x = x
        self.y = y

    def predict_single(self, x):
        # distances from the new point to every stored training point
        vectors = self.x - x
        distances = np.linalg.norm(vectors, axis=1)
        # return the class of the closest training point
        min_index = np.argmin(distances)
        return self.y[min_index]

model = NearestNeighborClassifier()
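For comparison with the scikit-learn / keras training calls below, a minimal usage sketch of this custom classifier (note that predict_single classifies a single measurement, unlike scikit-learn's predict):

model.fit(measurements[:130], species[:130])
print(model.predict_single(np.array([5.3, 3.4, 1.9, 0.6])))
# likely 0 (Iris setosa)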
Training based on the first 130 entries:
in scikit-learn:
model.fit(measurements[:130], species[:130])
in keras:
model.fit(
measurements[:130],
species[:130],
epochs=300,
validation_split=0.1
)
Applying classification to new data (in scikit-learn):
demo_measurements = np.array([
[5.3, 3.4, 1.9, 0.6],
[6.0, 3.0, 4.7, 1.5],
[6.5, 3.1, 5.0, 1.7]
])
model.predict(demo_measurements)
# e.g. [0, 1, 1] in scikit-learn
model.predict_proba(demo_measurements)
# [[0.9 0.1 0. ]
# [0. 0.8 0.2]
# [0. 0.7 0.3]]
measurements_val = measurements[130:]
species_val = species[130:]
in scikit-learn (accuracy):
print(model.score(measurements_val, species_val))
in keras (categorical crossentropy, accuracy):
print(model.evaluate(measurements_val, species_val))
Regression: assigning numeric values to numeric input data (e.g. estimating prices from purchase quantities)
Classification: assigning labels / classes to numeric input data (e.g. determining the species of an iris plant from its measurements)
Linear regression: a linear function is fitted to given data points (usually via least squares)
Example: various purchases in different supermarkets: quantities of two products together with the total price paid
task: estimate the price of each individual product
This may be solved via regression
input data:
1, 1 → 5.00
2, 3 → 13.50
3, 2 → 10.90
0, 0 → 0.00
result of a linear regression:
price = 0.05 + 1.13*x + 3.73*y
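A sketch of how this result could be reproduced with scikit-learn's LinearRegression (the two quantity columns from the input data above are the features, the total price is the target):

import numpy as np
from sklearn.linear_model import LinearRegression

quantities = np.array([[1, 1], [2, 3], [3, 2], [0, 0]])
prices = np.array([5.00, 13.50, 10.90, 0.00])

model = LinearRegression()
model.fit(quantities, prices)

print(model.intercept_, model.coef_)
# approximately 0.05 and [1.13, 3.73]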
Neural networks: machine learning strategy that vaguely resembles how neurons in the brain interact
K-nearest-neighbors: classification algorithm that assigns a class to a data point by looking at similar data points with a known classification
Decision trees: classification via a tree of simple yes / no decisions based on individual feature values
Example decision tree for iris classification:
Logistic regression: at the boundary of two classes, a logistic function is used to determine how likely it is that the data point belongs to one or the other class
The logistic function itself is determined via regression (hence the name)
Naive Bayes: data points are assumed to be part of a specific probability distribution; these distributions are derived from the training data
For a new data point, the algorithm determines under which of the distributions it would most likely occur
two important distributions: the normal (Gaussian) distribution for continuous values and the multinomial distribution for counts
Support vector machines: in the simplest case, separation of classes via lines / planes / hyperplanes; these separators should have maximum distance from the separated points
Borders may take different shapes by using kernel functions - e.g. conic sections or other curves
task: use other classifiers, e.g.:
sklearn.tree.DecisionTreeClassifier
sklearn.svm.SVC
sklearn.naive_bayes.GaussianNB
sklearn.neural_network.MLPClassifier
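For example, swapping in a decision tree only changes the model creation; fitting and evaluation stay the same (a sketch reusing the measurements / species arrays from above):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(measurements[:130], species[:130])
print(model.score(measurements[130:], species[130:]))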
desired data format for machine learning algorithms: numeric values on comparable scales, no missing entries, categorical / text data encoded as numbers
tasks: scaling values, handling missing data, encoding categorical data, preprocessing text data
Which of these stars is more similar to the sun?
# data: radius (km), mass (kg), temperature (K)
sun = [7.0e7, 2.0e30, 5.8e3]
star_a = [6.5e7, 2.2e30, 5.2e3]
star_b = [7.0e8, 2.1e30, 8.1e3]
some machine learning algorithms (like k-nearest-neighbors) directly compare absolute values / distances
here the algorithm would effectively only take the mass into account, as all other values are tiny in comparison
Solution: Before applying an algorithm, the values are centered and scaled (e.g. so their mean is 0 and the standard deviation is 1)
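A small sketch of such centering and scaling with scikit-learn's StandardScaler, applied to the star data above:

import numpy as np
from sklearn.preprocessing import StandardScaler

stars = np.array([
    [7.0e7, 2.0e30, 5.8e3],
    [6.5e7, 2.2e30, 5.2e3],
    [7.0e8, 2.1e30, 8.1e3],
])

scaler = StandardScaler()
stars_scaled = scaler.fit_transform(stars)
# each column now has mean 0 and standard deviation 1,
# so radius, mass and temperature contribute comparably to distances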
Missing data will often appear as NaNs
possible handling: removing rows / columns that contain missing data, or interpolating / filling in the missing values
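A brief sketch of both options with pandas (the column names are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [5.0, np.nan, 7.0, 8.0],
})

# option 1: drop rows that contain missing values
df_dropped = df.dropna()

# option 2: fill missing values, e.g. with the column mean
df_filled = df.fillna(df.mean())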
input or output data may be categorical data - e.g. country, occupation, measuring method
example input data:
[["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]
desired result: encoding as numeric values
input data:
[["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]
encoding as ordinals (not appropriate for all algorithms, as there is an implicit order):
[[0., 0.], [1., 0.], [2., 1.]]
input data:
[["fr", "chrome"],
["uk", "chrome"],
["us", "firefox"]]
one-hot-encoding:
# fr?, uk?, us?, chrome?, firefox?
[[1., 0., 0., 1., 0.],
[0., 1., 0., 1., 0.],
[0., 0., 1., 0., 1.]]
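Both encodings can be produced via scikit-learn's preprocessing module; a sketch for the input data above:

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

data = [["fr", "chrome"], ["uk", "chrome"], ["us", "firefox"]]

ordinal = OrdinalEncoder().fit_transform(data)
# [[0., 0.], [1., 0.], [2., 1.]]

one_hot = OneHotEncoder().fit_transform(data).toarray()
# [[1., 0., 0., 1., 0.],
#  [0., 1., 0., 1., 0.],
#  [0., 0., 1., 0., 1.]]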
example preprocessing for text classification: counting words
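A minimal sketch of word counting with scikit-learn's CountVectorizer (the example texts are made up):

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "this is a text about machine learning",
    "this text is about text classification",
]

vectorizer = CountVectorizer()
word_counts = vectorizer.fit_transform(texts).toarray()

print(vectorizer.get_feature_names_out())
print(word_counts)  # one row of word counts per text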
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler

iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None)
iris_measures = iris.iloc[:, :4].to_numpy()
iris_species = iris.iloc[:, 4].to_numpy()
encoder = LabelBinarizer()
encoder.fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
scaler = StandardScaler()
scaler.fit(iris_measures)
iris_measures_scaled = scaler.transform(iris_measures)
x = iris_measures_scaled
y = iris_species_one_hot
In order to verify the results of an algorithm:
Data are split into training data and test data / validation data
for iterative algorithms (e.g. neural networks in keras): pass validation_split or validation_data to model.fit()
for other algorithms (e.g. in sklearn): split the data manually or via train_test_split
To find the best model: train several models / configurations on the training data and compare their scores on the test data
see Python Data Science Handbook → Hyperparameters and Model Validation → Selecting the Best Model
classification metrics: accuracy, confusion matrix, precision / recall / f-score, ROC / AUC, cross entropy (log loss)
regression metrics: mean squared error, coefficient of determination (R²)
example:
a basket of fruits contains 10 apples, 10 oranges and 10 peaches
a classification algorithm classifies 8 apples, all 10 oranges and 9 peaches correctly; the remaining fruits are misclassified (see the confusion matrix below)
accuracy: relative amount of correct classifications (in our example: 27/30=0.9)
confusion matrix: table of classifications for each category
|         | apples | oranges | peaches |
| ------- | ------ | ------- | ------- |
| apples  | 8      | 0       | 2       |
| oranges | 0      | 10      | 0       |
| peaches | 1      | 0       | 9       |
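Such a table could be computed via scikit-learn; a sketch with label arrays matching the fruit example (rows = actual class, columns = predicted class):

from sklearn.metrics import confusion_matrix

# 0 = apple, 1 = orange, 2 = peach
actual = [0] * 10 + [1] * 10 + [2] * 10
predicted = [0] * 8 + [2] * 2 + [1] * 10 + [0] * 1 + [2] * 9

print(confusion_matrix(actual, predicted))
# [[ 8  0  2]
#  [ 0 10  0]
#  [ 1  0  9]]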
mean squared error
coefficient of determination (R²):
compares the mean squared error of the regression with the variance of the dataset
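A sketch of both regression metrics via scikit-learn's metrics module, using made-up true and predicted values:

from sklearn import metrics

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

print(metrics.mean_squared_error(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))
# R² is 1 for a perfect fit and 0 for a model that always predicts the mean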
binary classification: spam detection
example:
60 regular messages, 40 spam messages
1 regular message is misclassified as spam
5 spam messages are misclassified as regular messages
precision = 35/36 = 0.97 (35 out of 36 messages that were classified as spam are actually spam)
recall = 35/40 = 0.88 (35 out of 40 spam messages were recognized as spam)
see also: precision and recall on Wikipedia
precision and recall have different relevance in different scenarios
example: when classifying emails as spam, precision is very important (avoiding classifying a regular email as spam)
f-score = harmonic mean of precision and recall: 2 * precision * recall / (precision + recall)
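A sketch of these metrics in scikit-learn, with label arrays built from the spam example above (1 = spam, 0 = regular):

from sklearn import metrics

# 60 regular messages (1 misclassified as spam),
# 40 spam messages (5 misclassified as regular)
y_true = [0] * 60 + [1] * 40
y_pred = [0] * 59 + [1] * 1 + [1] * 35 + [0] * 5

print(metrics.precision_score(y_true, y_pred))  # 35/36 ≈ 0.97
print(metrics.recall_score(y_true, y_pred))     # 35/40 ≈ 0.88
print(metrics.f1_score(y_true, y_pred))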
ROC (receiver operating characteristic): metric that represents a classifier's true positives and false positives
a classification algorithm can be fine-tuned with respect to its true positives rate and false positives rate, e.g. by choosing a stricter or less strict classification threshold
The ROC may be displayed as a curve; the bigger the area under the curve (AUC), the better the classification
cross entropy (log loss): measures how well a model of a probability distribution approximates the actual probability distribution
relevant for neural networks and logistic regression
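A sketch with scikit-learn's log_loss, using made-up predicted class probabilities for three samples:

from sklearn.metrics import log_loss

y_true = [0, 1, 1]
# predicted probabilities for classes 0 and 1
y_pred_proba = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]

print(log_loss(y_true, y_pred_proba))
# lower values mean the predicted distribution is closer to the actual one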
possible problem for machine learning algorithms: The algorithm is too flexible and recognizes apparent patterns in the training data
Algorithms that are vulnerable to overfitting: e.g. decision trees, polynomial regression, neural networks
for polynomial regression see: Data Science Handbook - Regularization
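A tiny illustration of the problem with polynomial regression (made-up data points; a degree-4 polynomial has enough parameters to pass exactly through all five training points, but may behave erratically between and beyond them):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

# degree-4 polynomial: fits the five training points exactly
coefficients = np.polyfit(x, y, deg=4)
polynomial = np.poly1d(coefficients)

print(polynomial(x) - y)  # residuals are (numerically) zero
print(polynomial(5.0))    # extrapolation may be far off the roughly linear trend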
manual train-test split:
rng = np.random.default_rng(seed=1)
random_indexes = rng.permutation(x.shape[0])
# e.g. [65, 44, 22, 133, 47, ...]
x_train = x[random_indexes[:120]]
y_train = y[random_indexes[:120]]
x_test = x[random_indexes[120:]]
y_test = y[random_indexes[120:]]
automated train-test split via scikit-learn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)
validation based on test data:
from sklearn import metrics
y_pred = model.predict(x_test)
score = metrics.accuracy_score(y_test, y_pred)
print("accuracy:", score)
import pandas as pd
from sklearn.preprocessing import (
LabelBinarizer,
StandardScaler,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# loading data
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/" +
"machine-learning-databases/iris/iris.data",
header=None)
iris_measures = iris.iloc[:, :4].to_numpy()
iris_species = iris.iloc[:, 4].to_numpy()
# preparing data
encoder = LabelBinarizer()
encoder.fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
scaler = StandardScaler()
scaler.fit(iris_measures)
iris_measures_scaled = scaler.transform(iris_measures)
X = iris_measures_scaled
Y = iris_species_one_hot
# train-test-split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
# training
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
# validation
Y_prediction = model.predict(X_test)
score = metrics.accuracy_score(Y_test, Y_prediction)
print("accuracy:", score)
# predicting further species
new_iris_data = [
[5.3, 3.4, 1.9, 0.6],
[6.0, 3.0, 4.7, 1.5],
[6.5, 3.1, 5.0, 1.7]
]
new_iris_predictions = model.predict(
scaler.transform(new_iris_data)
)
print("prediction data:")
print(new_iris_predictions)
predicted_labels = encoder.inverse_transform(
new_iris_predictions
)
print("predicted labels:")
print(predicted_labels)