It may take several months before compiled packages (like NumPy, TensorFlow or PyTables) become available for the newest Python version.
recommendation: use an older Python version (e.g. 3.8 instead of 3.9) or a pre-built distribution (like Anaconda)
installing the most important packages in an existing Python environment:
pip install jupyter numpy pandas matplotlib scikit-learn tensorflow
IPython = advanced interactive Python console, supports features like autocompletion
free online Jupyter environments:
VS Code can connect to the IPython kernel:
In VS Code's command palette (F1), search for: Python: Create New Blank Jupyter Notebook
run JupyterLab from the terminal:
jupyter-lab
Write code into a cell, e.g.
import time
time.sleep(3)
1 + 1
and press Shift + Enter
IPython has numbered inputs / outputs, e.g. [1]
If the last statement in a cell evaluates to a value, that value is treated as the cell's output and displayed
To suppress this behavior, end the statement with a semicolon
interface functionality (varies amongst notebook types):
We can add documentation via the standardized markdown language:
Switch from Code to Markdown and try the following code:
# Heading
- item 1
- item 2
Run (or leave) the cell to display the result, double click to edit again
displaying documentation in any Python console:
help(str)
(navigate through long outputs via Enter, exit via Q)
shortcut for IPython / Jupyter:
str?
IPython includes direct access to many terminal commands, e.g. ls, cd, ...
We can execute any terminal command by prefixing it with !
Anaconda = Python distribution that includes many pre-built packages and developer tools
Uses ~5GB of disk space
Conda: environment and package manager for Anaconda
download from https://www.anaconda.com/products/individual
On Windows, the installation path should not contain spaces or underscores (recommendation: C:/anaconda) - see https://docs.anaconda.com/anaconda/user-guide/faq/#distribution-faq-windows-folder
options during installation:
to launch a Jupyter notebook: use the Jupyter Notebook entry in the start menu or run the terminal command jupyter notebook
Stopping Jupyter: Press Quit in the top right corner of the directory tree view (usually under http://localhost:8888/tree)
major phases in CRISP-DM:
resource: PDF version
phases:
NumPy: library for efficient processing of numerical data - based on multidimensional arrays
import numpy as np
# create a 2-dimensional array
iris = np.array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3.0, 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5.0, 3.6, 1.4, 0.2]])
# get the first column
iris[:, 0] # [5.1, 4.9, 4.7, 4.6, 5.0]
# get the second column
iris[:, 1] # [3.5, 3.0, 3.2, 3.1, 3.6]
# get the mean value of the first column
iris[:, 0].mean() # 4.86
# divide the entries in the first column by the entries
# in the second column
iris[:, 0] / iris[:, 1] # [1.46, 1.63, 1.47, 1.48, 1.39]
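The same array can also be summarized and filtered without loops. The following is a minimal sketch that reuses the five rows from the example above; the variable names column_means, mask and selected_rows are chosen here for illustration only.

```python
import numpy as np

# same five rows as in the example above
iris = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [4.7, 3.2, 1.3, 0.2],
                 [4.6, 3.1, 1.5, 0.2],
                 [5.0, 3.6, 1.4, 0.2]])

# mean of every column at once (axis=0 aggregates down the rows)
column_means = iris.mean(axis=0)  # approximately [4.86, 3.28, 1.4, 0.2]

# boolean mask: True for rows whose first column is greater than 4.8
mask = iris[:, 0] > 4.8

# indexing with the mask keeps only the matching rows
selected_rows = iris[mask]  # 3 of the 5 rows remain
```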
Pandas: library for data analysis, based on NumPy
load a data table (DataFrame) from a CSV file:
import pandas as pd
titanic = pd.read_csv(
"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
index_col="PassengerId",
)
display data:
titanic
display one column (series):
titanic["Age"]
summarize all numeric data:
titanic.describe()
summarize one column (series):
titanic["Age"].describe()
mean value of one column (series):
titanic["Age"].mean()
categorical data:
titanic["Pclass"].value_counts()
querying data: passengers younger than 1 year
titanic[titanic["Age"] < 1]
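Several conditions can be combined in one query. The sketch below uses a tiny made-up table (not the real Titanic data) whose column names mirror the data set; the names passengers, children_who_survived and same_result are illustrative.

```python
import pandas as pd

# tiny made-up table; column names mirror the Titanic data set
passengers = pd.DataFrame(
    {
        "Age": [0.8, 35.0, 4.0, 62.0],
        "Pclass": [3, 1, 2, 1],
        "Survived": [1, 0, 1, 0],
    }
)

# combine conditions with & (and) or | (or);
# each condition needs its own parentheses
children_who_survived = passengers[
    (passengers["Age"] < 10) & (passengers["Survived"] == 1)
]

# the same query written as a string expression
same_result = passengers.query("Age < 10 and Survived == 1")
```

Both forms select the same rows; the string form is often easier to read once more than two conditions are involved.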
preparing data for a machine learning exercise:
# column with a numeric value
titanic["Female"] = titanic["Sex"].map(
    {"female": 1, "male": 0}
)
# remove rows with missing age
titanic = titanic.dropna(subset=["Age"])
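To see what these two preparation steps do, it can help to run them on a small table first. The following sketch uses made-up rows (not the real data set) and encodes the column with Series.map; the names sample, missing_ages, rows_before and rows_after are illustrative.

```python
import pandas as pd

# made-up rows with one missing age
sample = pd.DataFrame(
    {
        "Sex": ["female", "male", "male", "female"],
        "Age": [29.0, None, 40.0, 2.0],
    }
)

# encode the text column as a numeric 0/1 column
sample["Female"] = sample["Sex"].map({"female": 1, "male": 0})

# count missing ages before dropping anything
missing_ages = int(sample["Age"].isna().sum())

# drop the rows without an age and compare the row counts
rows_before = len(sample)
sample = sample.dropna(subset=["Age"])
rows_after = len(sample)
```

Checking the row counts before and after dropna shows how much data is lost by removing incomplete rows.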
Pyplot: data plotting interface - included in matplotlib, accessible from pandas
using pyplot directly:
import matplotlib.pyplot as plt
plt.hist(
titanic["Pclass"],
bins=[1, 2, 3, 4],
align="left",
)
plt.xticks([1, 2, 3]);
using pyplot from pandas:
titanic["Pclass"].plot.hist(
bins=[1, 2, 3, 4],
align="left",
xticks=[1, 2, 3],
);
plt.boxplot(
titanic["Age"].dropna(),
whis=(0, 100),
labels=["Age"]
);
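With whis=(0, 100), the whiskers of the box plot above are placed at the 0th and 100th percentile, i.e. at the minimum and maximum of the data. A minimal sketch of the five values such a plot draws, computed with NumPy on made-up ages (not the Titanic data):

```python
import numpy as np

# made-up ages, only for illustration
ages = np.array([2, 18, 24, 30, 41, 58, 80])

# whisker ends (0 and 100), box edges (25 and 75) and median (50)
minimum, q1, median, q3, maximum = np.percentile(
    ages, [0, 25, 50, 75, 100]
)
```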
plt.hist(
titanic["Age"],
bins=[0, 10, 20, 30, 40, 50, 60, 70, 80],
);
exercise: predicting survival on the Titanic via linear regression
simple algorithms can be trained to predict survival with 80% accuracy (based on sex, passenger class, age, number of siblings or spouses, number of parents or children)
defining input data and output data:
passenger_data = titanic[
["Female", "Pclass", "Age", "SibSp", "Parch"]
]
survived = titanic["Survived"]
"training" a model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(passenger_data, survived)
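predicting a survival value for a 40-year-old woman in first class and a 40-year-old man in second class (both travelling without relatives):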
new_passenger_data = pd.DataFrame(
[
[1, 1, 40, 0, 0],
[0, 2, 40, 0, 0]
],
columns=["Female", "Pclass", "Age", "SibSp", "Parch"],
)
model.predict(new_passenger_data)
# [0.93, 0.23]
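A linear regression returns continuous values rather than the classes 0 and 1. One common way to turn such values into an accuracy figure is to treat every value of at least 0.5 as "survived" and compare the result with the known labels. The sketch below assumes this approach; the notes do not state how the accuracy was measured, and the helper accuracy_at_threshold, the threshold of 0.5 and the data are all made up for illustration.

```python
import numpy as np


def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Share of cases where (score >= threshold) matches the 0/1 label."""
    predicted = (np.asarray(scores) >= threshold).astype(int)
    return float((predicted == np.asarray(labels)).mean())


# made-up regression outputs and true labels, only for illustration
scores = [0.93, 0.23, 0.61, 0.48, 0.10]
labels = [1, 0, 0, 1, 0]

accuracy = accuracy_at_threshold(scores, labels)  # 3 of 5 correct
```

With a trained model, the output of model.predict(passenger_data) and the survived column could be passed in place of the made-up lists. Measuring on the training data tends to overestimate how well the model generalizes.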