steps:
create an input matrix X and a target vector y / a target matrix Y
choose an algorithm class, e.g. KNeighborsClassifier, MLPClassifier, LinearRegression, ...
train the model: model.fit(X, y)
make predictions: model.predict(...)
Example: classification of iris plants
known data: measurements and classification of 150 iris plants
Task: Train an algorithm to classify iris plants based on their measurements
example data (sepal length, sepal width, petal length, petal width, name):
[5.1, 3.5, 1.4, 0.2] → "Iris-setosa"
[7.0, 3.2, 4.7, 1.4] → "Iris-versicolor"
[6.3, 3.3, 6.0, 2.5] → "Iris-virginica"
in our data: setosa=0, versicolor=1, virginica=2
loading data:
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
Training an algorithm (k-nearest-neighbor):
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(X, y)
Applying classification to new data:
test_data = [
[5.3, 3.4, 1.9, 0.6],
[6.0, 3.0, 4.7, 1.5],
[6.5, 3.1, 5.0, 1.7]
]
y_pred = model.predict(test_data)
# [0, 1, 1]
y_pred_proba = model.predict_proba(test_data)
# [[1. 0. 0. ]
# [0. 0.8 0.2]
# [0. 0.6 0.4]]
task: use other classifiers, e.g. the following (a sketch for one of them follows the list):
sklearn.neural_network.MLPClassifier
sklearn.svm.SVC
sklearn.tree.DecisionTreeClassifier
sklearn.naive_bayes.GaussianNB
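One possible solution sketch, swapping in a decision tree (X, y and test_data as defined above):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
print(model.predict(test_data))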
Example: recognizing handwritten digits
input data: grayscale images of 1797 handwritten digits
target data: digit (e.g. 0, 1, 2, 3, ...)
from sklearn import datasets
digits = datasets.load_digits()
images are in digits.images
labels are in digits.target
task:
display some of the images and their correct labels via pyplot's imshow
simple solution:
import matplotlib.pyplot as plt

plt.imshow(digits.images[3], cmap="gray")
plt.axis("off")
plt.title(digits.target[3])
plt.show()
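A sketch that displays several images with their labels (the 1×5 layout is an arbitrary choice):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 5)
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.imshow(image, cmap="gray")
    ax.axis("off")
    ax.set_title(label)
plt.show()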
task:
flatten input array from 1797x8x8 to 1797x64
explicit solution:
x = digits.images.reshape(1797, 64)
robust solution:
x = digits.images.reshape(digits.images.shape[0], -1)
Task: select the first 1500 entries as training data and train the model
Solution:
from sklearn.neighbors import KNeighborsClassifier

y = digits.target
x_train = x[:1500]
y_train = y[:1500]
model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)
Task: select the remaining entries as testing data and compute the percentage of correct classifications
Solution:
x_test = x[1500:]
y_test = y[1500:]
y_pred = model.predict(x_test)
import numpy as np
num_correct = np.sum(y_pred == y_test)
print(num_correct / y_test.size)
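Equivalently, via scikit-learn's metrics module:

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))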
desired data format for machine learning algorithms:
a two-dimensional numeric array of input data (x / X) and a one- or two-dimensional numeric array of target data (y / Y)
tasks:
scaling values, interpolating missing data, encoding categorical data as numbers, preprocessing text data
Classes for preparing data have these methods:
.fit: creates a data transformation based on existing input data (X1)
.transform: transforms input data (X2) based on the transformation
.fit_transform: does both in one step (for the same data)
.inverse_transform: reverses a transformation (not available for all transformations)
Centering and scaling values so the mean is 0 and the standard deviation is 1:
from sklearn import preprocessing
import numpy as np
stars = np.array([[ 7.0e7, 2.0e30, 5.8e3],
[ 6.5e7, 2.2e30, 5.2e3],
[ 7.0e9, 2.1e30, 3.1e3]])
scaler = preprocessing.StandardScaler().fit(stars)
X = scaler.transform(stars)
scaled values:
array([[-0.70634165, -1.22474487, 0.95025527],
[-0.70787163, 1.22474487, 0.43193421],
[ 1.41421329, 0. , -1.38218948]])
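Since StandardScaler supports inverse_transform, the original values can be recovered from the scaled ones:

scaler.inverse_transform(X)
# ≈ the original stars array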
interpolation:
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[ np.nan, 0, 3 ],
[ 3, 7, 9 ],
[ 3, 5, 2 ],
[ 4, np.nan, 6 ],
[ 8, 8, 1 ]])
imputer = SimpleImputer(strategy="mean").fit(X)
imputer.transform(X)
imputer.transform(np.array([[np.nan, 1, 1]]))
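Expected result of the mean strategy here:

# the column means of the non-missing values are 4.5 (column 0) and 5.0 (column 1);
# both calls replace NaNs accordingly, e.g. the second one yields [[4.5, 1, 1]]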
preprocessors:
OrdinalEncoder (ordinals for input categories)
LabelEncoder (ordinals for target categories)
OneHotEncoder (one-hot encoding for input categories, sparse by default)
LabelBinarizer (one-hot encoding for target categories)
example:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer().fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
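LabelBinarizer sorts the classes alphabetically; assuming iris_species is an array of the three species names, the encoding is:

# "Iris-setosa"     → [1, 0, 0]
# "Iris-versicolor" → [0, 1, 0]
# "Iris-virginica"  → [0, 0, 1]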
example for preprocessing text data: counting words
from sklearn.feature_extraction.text import CountVectorizer
sample = ['problem of evil',
'evil queen',
'horizon problem']
vectorizer = CountVectorizer().fit(sample)
print(vectorizer.vocabulary_)
X = vectorizer.transform(sample)
print(X)
print(X.todense())
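Vocabulary indices are assigned alphabetically, so the result looks like this:

# 'evil' → 0, 'horizon' → 1, 'of' → 2, 'problem' → 3, 'queen' → 4
# X.todense():
# [[1 0 1 1 0]
#  [1 0 0 0 1]
#  [0 1 0 1 0]]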
import pandas as pd
iris = pd.read_csv(
"http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
header=None)
first line: 5.1,3.5,1.4,0.2,Iris-setosa
tasks:
extract the numeric measurements as input data and the species names as target data
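A possible solution sketch (the species name is in column 4):

X = iris.iloc[:, :4].to_numpy()
iris_species = iris.iloc[:, 4].to_numpy()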
Pipelines can be composed from several transforming algorithms and one predicting algorithm:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
model = make_pipeline(
SimpleImputer(strategy='mean'),
StandardScaler(),
LinearRegression()
)
task:
create a pipeline that categorizes iris data
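A possible solution sketch (the choice of scaler and classifier is one option among many):

from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
model.fit(iris.data, iris.target)
print(model.predict([[5.3, 3.4, 1.9, 0.6]]))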
A trained model can be saved for later use
In Python, objects can be saved and loaded via the pickle module
import pickle
with open("model.pickle", mode="wb") as picklefile:
pickle.dump(model, picklefile)
import pickle
with open("model.pickle", mode="rb") as picklefile:
model = pickle.load(picklefile)
model.predict(data)
regression:
sklearn.linear_model.LinearRegression
sklearn.neural_network.MLPRegressor
classification:
sklearn.neighbors.KNeighborsClassifier
sklearn.tree.DecisionTreeClassifier
sklearn.ensemble.RandomForestClassifier
sklearn.linear_model.LogisticRegression
sklearn.naive_bayes.GaussianNB
sklearn.naive_bayes.MultinomialNB
sklearn.svm.SVC
sklearn.neural_network.MLPClassifier
sklearn.neighbors.KNeighborsClassifier
The number k of neighbors can be chosen (default: 5)
See: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
see: Python Data Science Handbook - Decision Trees and Random Forests
random forests: data are split into different subsets; for each subset a separate decision tree is created; all decision trees are combined into a so-called random forest
RandomForestClassifier(n_estimators=100)
See: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html
LogisticRegression(solver="liblinear", multi_class="auto")
Example: various purchases in different supermarkets (input: quantities of two articles, target: total price):
from sklearn.linear_model import LinearRegression
X = [[1, 1], [2, 3], [3, 2], [0, 0]]
y = [5.00, 13.50, 10.90, 0.0]
model = LinearRegression()
model.fit(X, y)
yfit = model.predict([[1, 0], [0, 1], [2, 2]])
print(yfit)
# [1.18333333 3.78333333 9.78333333]
characteristic numbers of the regression (printed below for this example):
model.coef_ (the estimated coefficients)
model.intercept_ (the estimated intercept)
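For the supermarket example above this yields approximately:

print(model.coef_)       # [1.13333333 3.73333333]
print(model.intercept_)  # 0.05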
Iris data: Estimate the petal width (column 3) based on the petal length (column 2)
from sklearn import datasets
iris = datasets.load_iris()
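A possible solution sketch (simple linear regression with one input feature):

from sklearn import datasets
from sklearn.linear_model import LinearRegression

iris = datasets.load_iris()
X = iris.data[:, 2:3]  # petal length as a 2D input array
y = iris.data[:, 3]    # petal width
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)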
Some data won't fit a linear relation like:
y = a*x + b
We could try a polynomial relation, e.g.:
y = a*x^2 + b*x + c
y = a*x^3 + b*x^2 + c*x + d
scikit-learn offers a preprocessor called PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly_model = make_pipeline(
    PolynomialFeatures(2),
    LinearRegression()
)
# x must be a two-dimensional array of inputs
poly_model.fit(x, y)
Iris data: Estimate the sepal length (column 0) based on the sepal width (column 1) and petal length (column 2)
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
iris = datasets.load_iris()
X = iris.data[:,1:3]
y = iris.data[:, 0]
model = MLPRegressor(
hidden_layer_sizes=(8, 8),
alpha=1.0,
max_iter=2000
)
model.fit(X, y)
test_data = [
[3.4, 1.9],
[3.0, 4.7],
[3.1, 5.0]
]
y_pred = model.predict(test_data)
print(y_pred)
metrics for classification:
e.g. accuracy, confusion matrix, precision / recall / f-score
metrics for regression:
e.g. mean squared error, coefficient of determination (R²)
See also: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
How well does a model categorize iris data?
from sklearn import metrics
y_prediction = model.predict(x_test)
print(metrics.accuracy_score(y_test, y_prediction))
print(metrics.confusion_matrix(y_test, y_prediction))
print(list(metrics.precision_recall_fscore_support(
y_test, y_prediction)))
helper function in scikit-learn:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y)
optional parameters:
test_size (default value: 0.25)
random_state (integer seed for shuffling)
cross-validation: data are repeatedly split into different training and test sets so that each entry appears in a test set once
from sklearn.model_selection import cross_validate
test_results = cross_validate(
model, X, y, cv=5, scoring="accuracy"
)
print(test_results["test_score"])
computing the ROC with scikit-learn:
# false positive rates, true positive rates, thresholds
fpr, tpr, thresholds = metrics.roc_curve(
y_test,
classifier.predict_proba(X_test)[:, 1]  # probabilities of the positive class
)
ideal combination: false positive rate = 0, true positive rate = 1
plotting the ROC:
plt.plot(fpr, tpr, marker="o")
determining the AUC:
auc = metrics.auc(fpr, tpr)
pipelines can abstract the processing of input values x
custom classes can abstract the processing of both x and y
direct model usage to predict survival on the Titanic:
model.predict([[2, 0, 28.0, 0]])
# [0]
abstracted interface:
classifier.predict_survival(
pclass=2, sex="male", age=28.0, sibsp=0
)
# False
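Such an interface can be provided by a small wrapper class; a minimal sketch (class name, encoding and inner model are illustrative assumptions):

class TitanicClassifier:
    def __init__(self, model):
        # model: a trained classifier expecting [pclass, sex, age, sibsp]
        self.model = model

    def predict_survival(self, pclass, sex, age, sibsp):
        # assumed encoding: "male" → 0, "female" → 1 (must match training)
        sex_encoded = 0 if sex == "male" else 1
        return bool(self.model.predict([[pclass, sex_encoded, age, sibsp]])[0])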
Example: facial recognition
input data: grayscale images of famous people (sized 62 x 47) and their names
goal: train a neural network to recognize a person
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)
entries:
faces.images: array of images (size: 1248 x 62 x 47)
faces.target: array of numeric labels (1, 3, 3, 3, 5, ...)
faces.target_names: array of label names (0="Ariel Sharon", 1="Colin Powell", ...)

num_images = faces.images.shape[0]
num_pixels = faces.images.shape[1] * faces.images.shape[2]
X = faces.images.reshape(num_images, num_pixels)
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer().fit(faces.target)
Y = encoder.transform(faces.target)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(250, 150, 100),
early_stopping=True,
n_iter_no_change=100,
max_iter=2000,
verbose=True)
model.fit(X_train, Y_train)
algorithm configuration: three hidden layers with 250, 150 and 100 neurons; training stops early if the validation score hasn't improved for 100 iterations
from sklearn import metrics
real_labels = Y_test.argmax(axis=1)
pred_labels = model.predict_proba(X_test).argmax(axis=1)
print(metrics.accuracy_score(real_labels, pred_labels))
argmax returns the index of the biggest entry in the array
Display a random face and print the real name and the predicted name:
import matplotlib.pyplot as plt
from random import randrange
# randomly select a face
index = randrange(X_test.shape[0])
plt.imshow(X_test[index].reshape(62, 47), cmap="gray")
real_label = real_labels[index]
pred_label = pred_labels[index]
print("real name:", faces.target_names[real_label])
print("predicted name:", faces.target_names[pred_label])