Supervised learning in scikit-learn

steps:

  • create an input matrix X and a target vector y / a target matrix Y
  • instantiate an algorithm class - e.g. KNeighborsClassifier, MLPClassifier, LinearRegression, ...
  • "learn" via model.fit(X, y)
  • predict results for new input data via model.predict(...)

Example: classification of iris plants

known data: measurements and classification of 150 iris plants

Task: Train an algorithm to classify iris plants based on their measurements

example data (sepal length, sepal width, petal length, petal width, name):

  • [5.1, 3.5, 1.4, 0.2] → "Iris-setosa"
  • [7.0, 3.2, 4.7, 1.4] → "Iris-versicolor"
  • [6.3, 3.3, 6.0, 2.5] → "Iris-virginica"

in scikit-learn's copy of the data, the species are encoded as numbers: setosa=0, versicolor=1, virginica=2

loading data:

from sklearn import datasets

iris = datasets.load_iris()

# X: 150x4 array of measurements
X = iris.data
# y: array of 150 class labels (0, 1, 2)
y = iris.target

Training an algorithm (k-nearest-neighbors):

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(X, y)

Applying classification to new data:

test_data = [
    [5.3, 3.4, 1.9, 0.6],
    [6.0, 3.0, 4.7, 1.5],
    [6.5, 3.1, 5.0, 1.7]
]

y_pred = model.predict(test_data)
# [0, 1, 1]

y_pred_proba = model.predict_proba(test_data)
# [[1.  0.  0. ]
#  [0.  0.8 0.2]
#  [0.  0.6 0.4]]

Task: other classifiers

try other classifiers instead of KNeighborsClassifier (a sketch follows the list), e.g.:

  • sklearn.neural_network.MLPClassifier
  • sklearn.svm.SVC
  • sklearn.tree.DecisionTreeClassifier
  • sklearn.naive_bayes.GaussianNB
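All scikit-learn classifiers share the fit/predict interface, so any of them can be swapped in for the k-nearest-neighbors classifier above; a minimal sketch with a decision tree:

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X, y)
y_pred = model.predict(test_data)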

Example: digit recognition

input data: grayscale images of 1797 handwritten digits (8 x 8 pixels each)

target data: the digit each image shows (0, 1, ..., 9)

Loading digit data

from sklearn import datasets

digits = datasets.load_digits()

images are in digits.images

labels are in digits.target

Visualizing digits

task:

display some of the images and their correct labels via pyplot's imshow

simple solution:

import matplotlib.pyplot as plt

plt.imshow(digits.images[3], cmap="gray")
plt.axis("off")
plt.title(digits.target[3])
plt.show()

Preparing data

task:

flatten input array from 1797x8x8 to 1797x64

explicit solution:

x = digits.images.reshape(1797, 64)

robust solution:

x = digits.images.reshape(digits.images.shape[0], -1)

Training

Task: select the first 1500 entries as training data and train the model

Solution:

from sklearn.neighbors import KNeighborsClassifier

# the labels are in digits.target
y = digits.target

x_train = x[:1500]
y_train = y[:1500]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(x_train, y_train)

Testing

Task: select the remaining entries as testing data and compute the percentage of correct classifications

Solution:

import numpy as np

x_test = x[1500:]
y_test = y[1500:]

y_pred = model.predict(x_test)

num_correct = np.sum(y_pred == y_test)
print(num_correct / y_test.size)
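Note: classifiers also offer a score method that computes the accuracy directly:

print(model.score(x_test, y_test))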

Preparing data

desired data format for machine learning algorithms:

  • x or X: two-dimensional array with numeric input data
  • y or Y: one- or two-dimensional array with numeric results

tasks:

  • flattening nested arrays
  • scaling values
  • handling missing data
  • encoding categorical data as numerical data
  • encoding text data as numerical data

Preparing data in sklearn

Classes for preparing data have these methods:

  • .fit: learns transformation parameters from existing input data (X1)
  • .transform: transforms input data (X2) based on the learned parameters
  • .fit_transform: does both in one step (for the same data)
  • .inverse_transform: reverses a transformation (not available for all transformations)

Scaling values

Centering and scaling values so the mean is 0 and the standard deviation is 1:

from sklearn import preprocessing
import numpy as np

# columns with vastly different magnitudes
stars = np.array([[ 7.0e7, 2.0e30, 5.8e3],
                  [ 6.5e7, 2.2e30, 5.2e3],
                  [ 7.0e9, 2.1e30, 3.1e3]])

scaler = preprocessing.StandardScaler().fit(stars)
X = scaler.transform(stars)

scaled values:

array([[-0.70634165, -1.22474487,  0.95025527],
       [-0.70787163,  1.22474487,  0.43193421],
       [ 1.41421329,  0.        , -1.38218948]])
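The transformation can be reversed (where supported) via inverse_transform:

# recovers the original star data (up to floating-point rounding)
scaler.inverse_transform(X)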

Handling missing data

imputation (filling in missing values, here with the column mean):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[ np.nan, 0,   3  ],
              [ 3,   7,   9  ],
              [ 3,   5,   2  ],
              [ 4,   np.nan, 6  ],
              [ 8,   8,   1  ]])

imputer = SimpleImputer(strategy="mean").fit(X)

# each nan is replaced by the mean of its column
imputer.transform(X)
imputer.transform(np.array([[np.nan, 1, 1]]))
# → [[4.5, 1. , 1. ]] (the mean of the first column is 4.5)

Categories as data

preprocessors:

  • OrdinalEncoder (ordinals for input categories)
  • LabelEncoder (ordinals for target categories)
  • OneHotEncoder (one-hot-encoding for input categories, sparse by default)
  • LabelBinarizer (one-hot-encoding for target categories)

example:

from sklearn.preprocessing import LabelBinarizer

# iris_species: array of strings ("Iris-setosa", "Iris-versicolor", ...)
encoder = LabelBinarizer().fit(iris_species)
iris_species_one_hot = encoder.transform(iris_species)
# e.g. "Iris-setosa" → [1, 0, 0]

Text data

example for preprocessing text data: counting words

from sklearn.feature_extraction.text import CountVectorizer

sample = ['problem of evil',
          'evil queen',
          'horizon problem']

vectorizer = CountVectorizer().fit(sample)
print(vectorizer.vocabulary_)
# {'problem': 3, 'of': 2, 'evil': 0, 'queen': 4, 'horizon': 1}
X = vectorizer.transform(sample)
print(X)
print(X.todense())
# [[1 0 1 1 0]
#  [1 0 0 0 1]
#  [0 1 0 1 0]]

Task: preparing iris data

import pandas as pd
iris = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    header=None)

first line: 5.1,3.5,1.4,0.2,Iris-setosa

tasks:

  • represent categories via one-hot-encoding
  • scale input data
  • compare k-nearest-neighbor classification on scaled and unscaled data
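A possible solution sketch for the first two tasks (column indices follow the CSV structure above):

from sklearn.preprocessing import LabelBinarizer, StandardScaler

# columns 0-3: measurements, column 4: species name
measurements = iris.iloc[:, :4].to_numpy()
species = iris.iloc[:, 4]

species_one_hot = LabelBinarizer().fit_transform(species)
measurements_scaled = StandardScaler().fit_transform(measurements)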

Pipelines

Pipelines can be composed from several transforming algorithms and one predicting algorithm:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

model = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(),
    LinearRegression()
)

task:

create a pipeline that categorizes iris data
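A minimal sketch, assuming X and y hold the iris measurements and species as before:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier()
)
model.fit(X, y)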

Saving and loading models

A trained model can be saved for later use

In Python, objects can be saved and loaded via the pickle module

Saving models

import pickle

with open("model.pickle", mode="wb") as picklefile:
    pickle.dump(model, picklefile)

Loading models

import pickle

with open("model.pickle", mode="rb") as picklefile:
    model = pickle.load(picklefile)

model.predict(data)
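scikit-learn's documentation also suggests the joblib package for persisting models; it is more efficient for objects that carry large NumPy arrays:

from joblib import dump, load

dump(model, "model.joblib")
model = load("model.joblib")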

Supervised learning algorithms in scikit-learn

Algorithms in scikit-learn

regression:

  • sklearn.linear_model.LinearRegression
  • sklearn.neural_network.MLPRegressor

classification:

  • sklearn.neighbors.KNeighborsClassifier
  • sklearn.tree.DecisionTreeClassifier
  • sklearn.ensemble.RandomForestClassifier
  • sklearn.linear_model.LogisticRegression
  • sklearn.naive_bayes.GaussianNB
  • sklearn.naive_bayes.MultinomialNB
  • sklearn.svm.SVC
  • sklearn.neural_network.MLPClassifier

K-nearest-neighbors

sklearn.neighbors.KNeighborsClassifier

The number k of neighbors can be chosen (default: 5)

See: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
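for example, classifying via the three nearest neighbors:

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)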

Decision trees

see: Python Data Science Handbook - Decision Trees and Random Forests

random forests: the training data are randomly resampled into multiple subsets; a separate decision tree is built for each subset; the trees' predictions are combined (e.g. by majority vote) into a so-called random forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)

Logistic regression

See: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver="liblinear", multi_class="auto")

Naive Bayes

see: Python Data Science Handbook - Naive Bayes

Support vector machines

see: Python Data Science Handbook - Support Vector Machines

Linear regression with scikit-learn

Example: various purchases in different supermarkets:

  • 1 l of milk, 1 kg of bread: 5.00€
  • 2 l of milk, 3 kg of bread: 13.50€
  • 3 l of milk, 2 kg of bread: 10.90€
  • (0 l of milk, 0 kg of bread: 0€)

from sklearn.linear_model import LinearRegression

X = [[1, 1], [2, 3], [3, 2], [0, 0]]  # [liters of milk, kg of bread]
y = [5.00, 13.50, 10.90, 0.0]         # total price in €

model = LinearRegression()
model.fit(X, y)

yfit = model.predict([[1, 0], [0, 1], [2, 2]])
print(yfit)
# [1.18333333 3.78333333 9.78333333]

characteristic numbers of the regression:

  • model.coef_
  • model.intercept_
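For the purchases example above, these values can be read off from the predictions shown earlier:

model.coef_       # approx. [1.13, 3.73]: price per liter of milk / kg of bread
model.intercept_  # approx. 0.05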

Task: estimating the petal width

Iris data: Estimate the petal width (column 3) based on the petal length (column 2)

from sklearn import datasets
iris = datasets.load_iris()
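A possible solution sketch:

from sklearn.linear_model import LinearRegression

X = iris.data[:, 2:3]  # petal length; kept two-dimensional
y = iris.data[:, 3]    # petal width

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)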

Polynomial regression with scikit-learn

Some data won't fit a linear relation like:

y = a*x + b

We could try a polynomial relation, e.g.:

y = a*x^2 + b*x + c

y = a*x^3 + b*x^2 + c*x + d

scikit-learn offers a preprocessor called PolynomialFeatures

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# PolynomialFeatures(2) adds squared (and interaction) features
poly_model = make_pipeline(
    PolynomialFeatures(2),
    LinearRegression()
)

poly_model.fit(x, y)

Exercises

  • use a polynomial regression instead of a linear regression for one of the previous examples
  • use a polynomial regression for dataset II of the so-called anscombe data (can be loaded via the seaborn library)
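A sketch for loading dataset II of the anscombe data (column names as provided by seaborn):

import seaborn as sns

anscombe = sns.load_dataset("anscombe")
data_2 = anscombe[anscombe["dataset"] == "II"]
x = data_2[["x"]]  # two-dimensional input
y = data_2["y"]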

Regression via a neural network

Iris data: Estimate the sepal length (column 0) based on the sepal width (column 1) and petal length (column 2)

from sklearn import datasets
from sklearn.neural_network import MLPRegressor

iris = datasets.load_iris()

X = iris.data[:, 1:3]  # sepal width, petal length
y = iris.data[:, 0]    # sepal length

model = MLPRegressor(
    hidden_layer_sizes=(8, 8),
    alpha=1.0,
    max_iter=2000
)
model.fit(X, y)

test_data = [
    [3.4, 1.9],
    [3.0, 4.7],
    [3.1, 5.0]
]

y_pred = model.predict(test_data)
print(y_pred)

Validation

Validation metrics in scikit-learn

classification:

  • accuracy_score
  • confusion_matrix
  • precision_recall_fscore_support
  • log_loss
  • roc_curve
  • roc_auc_score

regression:

  • mean_squared_error
  • r2_score

See also https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Train-test split

How well does a model categorize iris data?

from sklearn import metrics

y_prediction = model.predict(x_test)

print(metrics.accuracy_score(y_test, y_prediction))
print(metrics.confusion_matrix(y_test, y_prediction))
print(list(metrics.precision_recall_fscore_support(
           y_test, y_prediction)))

Train-test split

helper function in scikit-learn:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y)

optional parameters:

  • test_size (default value: 0.25)
  • random_state (integer seed for shuffling)
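for example:

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)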

Cross validation

Data are repeatedly split into different training and test sets so that each entry appears in a test set exactly once; cv=5 below splits the data into five such parts

from sklearn.model_selection import cross_validate

test_results = cross_validate(
    model, X, y, cv=5, scoring="accuracy"
)
print(test_results["test_score"])

Example: ROC of the iris classification

computing the ROC with scikit-learn:

# false positive rates, true positive rates, thresholds
# (treating class 0 - "setosa" - as the positive class)
fpr, tpr, thresholds = metrics.roc_curve(
    y_test,
    classifier.predict_proba(X_test)[:, 0],
    pos_label=0
)

ideal combination: false positive rate = 0, true positive rate = 1

Example: ROC of the iris classification

plotting the ROC:

plt.plot(fpr, tpr, marker="o")

determining the AUC:

auc = metrics.auc(fpr, tpr)

Abstraction

  • pipelines
  • custom classes

pipelines can abstract the processing of input values x

custom classes can abstract the processing of both x and y

direct model usage to predict survival on the Titanic:

model.predict([[2, 0, 28.0, 0]])
# [0]

abstracted interface:

classifier.predict_survival(
    pclass=2, sex="male", age=28.0, sibsp=0
)
# False
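A minimal sketch of such a wrapper class (the numeric encoding of sex follows the raw example above; all names are illustrative):

class TitanicClassifier:
    def __init__(self, model):
        self.model = model

    def predict_survival(self, pclass, sex, age, sibsp):
        # encode the categorical input the same way as the training data
        sex_encoded = 0 if sex == "male" else 1
        prediction = self.model.predict([[pclass, sex_encoded, age, sibsp]])
        return bool(prediction[0])

classifier = TitanicClassifier(model)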

Example: labeled faces

data source: the "Labeled Faces in the Wild" dataset

input data: grayscale images of famous people (sized 62 x 47 pixels) and their names

goal: train a neural network to recognize a person

Getting data

from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

entries:

  • faces.images: array of images (size: 1348 x 62 x 47)
  • faces.target: array of numeric labels (1, 3, 3, 3, 5, ...)
  • faces.target_names: array of label names (0="Ariel Sharon", 1="Colin Powell", ...)

Preparing data

num_images = faces.images.shape[0]
num_pixels = faces.images.shape[1] * faces.images.shape[2]
X = faces.images.reshape(num_images, num_pixels)

from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer().fit(faces.target)
Y = encoder.transform(faces.target)

Train-test split

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

Create a classifier and train it

from sklearn.neural_network import MLPClassifier

model = MLPClassifier(hidden_layer_sizes=(250, 150, 100),
                      early_stopping=True,
                      n_iter_no_change=100,
                      max_iter=2000,
                      verbose=True)
model.fit(X_train, Y_train)

algorithm configuration:

  • three hidden layers with 250, 150 and 100 neurons respectively
  • training stops early if the last 100 iterations did not yield improvements
  • training stops after a maximum of 2000 iterations

Test the classifier

from sklearn import metrics

real_labels = Y_test.argmax(axis=1)
pred_labels = model.predict_proba(X_test).argmax(axis=1)

print(metrics.accuracy_score(real_labels, pred_labels))

argmax returns the index of the largest entry in an array

Test the classifier

Display a random face and print the real name and the predicted name:

import matplotlib.pyplot as plt
from random import randrange

# randomly select a face
index = randrange(X_test.shape[0])

plt.imshow(X_test[index].reshape(62, 47), cmap="gray")
plt.show()

real_label = real_labels[index]
pred_label = pred_labels[index]

print("real name:", faces.target_names[real_label])
print("predicted name:", faces.target_names[pred_label])

Exercises and datasets

Exercises

  • diabetes progression (regression)
  • digits (classification)
  • Titanic survival (classification, e.g. via logistic regression)
  • labeled faces (classification via neural networks)