Python and Data Science: Overview

Packages

Python packages for data science

  • Jupyter and IPython: interactive Python environments
  • NumPy: library for efficient processing of numerical data
  • Pandas: library for data analysis, based on NumPy
  • Matplotlib and Pyplot: library for data visualization
  • Scikit-Learn: library for machine learning, based on NumPy
  • Tensorflow / Keras: library for deep learning

Compiled packages and Python versions

Compiled packages (like NumPy, TensorFlow or PyTables) may take several months before they are available for the newest Python version.

recommendation: use an older Python version (e.g. 3.8 instead of 3.9) or a pre-built distribution (like Anaconda)

Python packages for data science

installing the most important packages in an existing Python environment:

pip install jupyter numpy pandas matplotlib scikit-learn tensorflow

Jupyter and IPython

IPython

IPython = advanced interactive Python console, supports features like autocompletion

Jupyter notebooks

  • interactive Python document (based on IPython)
  • file format .ipynb
  • may contain code, output text / graphics, documentation / notes

Jupyter interfaces

  • Jupyter Notebook: web-based interface that can run on a remote server or locally
  • JupyterLab: successor to Jupyter Notebook
  • VS Code: supports Jupyter notebooks

Jupyter notebook - online

free online Jupyter environments:

Jupyter notebook - VS Code

VS Code can connect to the IPython kernel:

In VS Code's command palette (F1), search for: Python: Create New Blank Jupyter Notebook

Jupyter notebook - JupyterLab

run JupyterLab from the terminal:

jupyter-lab

Writing and evaluating code

Write code into a cell, e.g.

import time
time.sleep(3)
1 + 1

and press Shift + Enter

Writing and evaluating code

IPython has numbered inputs / outputs, e.g. [1]

Writing and evaluating code

If the last statement in a cell evaluates to a value, that value is treated as the cell's output and displayed

To suppress this behavior, end the statement with a semicolon
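For example (cell outputs shown as comments; this display behavior applies in IPython / Jupyter):

```python
# the value of the last expression becomes the cell output
1 + 1    # cell output: 2

# a trailing semicolon suppresses the displayed output
1 + 1;   # no output shown
```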

Writing and evaluating code

interface functionality (varies amongst notebook types):

  • run cell
  • restart (forgets previous variables and state)
  • run all cells / restart and run all cells
  • interrupt evaluation

Writing documentation via markdown

We can add documentation via the standardized Markdown language:

Switch the cell type from Code to Markdown and try the following:

# Heading

- item 1
- item 2

Run (or leave) the cell to display the result, double click to edit again

Markdown cheatsheet

Documentation

displaying documentation in any Python console:

help(str)

(navigate through long outputs via Enter, exit via Q)

shortcut for IPython / Jupyter:

str?

Running terminal commands

IPython includes direct access to many terminal commands, e.g. ls, cd, ...

We can execute any terminal command by prefixing it with !
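A sketch of what this looks like in a notebook cell (the commands shown are just examples; this `!` syntax is IPython-specific and not valid in plain Python):

```
# inside a Jupyter / IPython cell: the "!" prefix hands the
# rest of the line to the system shell
!pip list

# the output of a command can be captured into a Python variable
files = !ls
```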

In depth: Anaconda

Anaconda

Anaconda = Python distribution that includes many pre-built packages and developer tools

Uses ~5GB of disk space

Conda

Conda: environment and package manager for Anaconda

  • pre-built binaries for many packages
  • environments: installation of different packages and different versions of packages for different projects

Anaconda installation

download from https://www.anaconda.com/products/individual

On Windows, the installation path should not contain spaces or underscores (recommendation: C:/anaconda) - see https://docs.anaconda.com/anaconda/user-guide/faq/#distribution-faq-windows-folder

options during installation:

  • check "Add Anaconda3 to my PATH environment variable" (even if it says it's not recommended)
  • check "Register Anaconda3 as my default Python 3.x"

Anaconda

to launch a Jupyter notebook: use the Jupyter Notebook entry in the start menu, or run jupyter notebook in a terminal

Stopping Jupyter: Press Quit in the top right corner of the directory tree view (usually under http://localhost:8888/tree)

In depth: data mining process models

In depth: data mining process models

  • CRISP-DM: Cross-industry standard process for data mining
  • ASUM-DM: Analytics Solutions Unified Method for Data Mining/Predictive Analytics

CRISP-DM

major phases in CRISP-DM:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

CRISP-DM

resource: PDF version

ASUM-DM

phases:

  • Analyze
  • Design
  • Configure and Build
  • Operate and Optimize

NumPy: overview and demo

NumPy

NumPy: library for efficient processing of numerical data - based on multidimensional arrays

NumPy: overview and demo

import numpy as np

# create a 2-dimensional array
iris = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [4.7, 3.2, 1.3, 0.2],
                 [4.6, 3.1, 1.5, 0.2],
                 [5.0, 3.6, 1.4, 0.2]])

NumPy: overview and demo

# get the first column
iris[:, 0]  # [5.1, 4.9, 4.7, 4.6, 5.0]
# get the second column
iris[:, 1]  # [3.5, 3.0, 3.2, 3.1, 3.6]

NumPy: overview and demo

# get the mean value of the first column
iris[:, 0].mean()  # 4.86

# divide the entries in the first column by the entries
# in the second column
iris[:, 0] / iris[:, 1]  # [1.46, 1.63, 1.47, 1.48, 1.39]
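These column-wise operations generalize: aggregations such as .mean() accept an axis argument. A small sketch using the same array:

```python
import numpy as np

iris = np.array([[5.1, 3.5, 1.4, 0.2],
                 [4.9, 3.0, 1.4, 0.2],
                 [4.7, 3.2, 1.3, 0.2],
                 [4.6, 3.1, 1.5, 0.2],
                 [5.0, 3.6, 1.4, 0.2]])

# axis=0 aggregates over rows: one mean per column
col_means = iris.mean(axis=0)   # first entry: 4.86

# axis=1 aggregates over columns: one mean per row
row_means = iris.mean(axis=1)
```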

Pandas: overview and demo

Pandas

Pandas: library for data analysis, based on NumPy

Pandas: overview and demo

load a data table (DataFrame) from a CSV file:

import pandas as pd

titanic = pd.read_csv(
    "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
    index_col="PassengerId",
)

Pandas: overview and demo

display data:

titanic

display one column (series):

titanic["Age"]

Pandas: overview and demo

summarize all numeric data:

titanic.describe()

summarize one column (series):

titanic["Age"].describe()

mean value of one column (series):

titanic["Age"].mean()

Pandas: overview and demo

categorical data:

titanic["Pclass"].value_counts()

Pandas: overview and demo

querying data: passengers younger than 1 year

titanic[titanic["Age"] < 1]

Pandas: overview and demo

preparing data for a machine learning exercise:

# column with a numeric value
titanic["Female"] = titanic["Sex"].replace(
    {"female": 1, "male": 0}
)

# remove rows with missing age
titanic = titanic.dropna(subset=["Age"])
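The same preparation steps can be sketched on a tiny stand-in DataFrame (the values are hypothetical, not the real Titanic data):

```python
import pandas as pd

# tiny stand-in for the Titanic data
df = pd.DataFrame({
    "Sex": ["female", "male", "male"],
    "Age": [29.0, None, 40.0],
})

# derive a numeric column from a categorical one
df["Female"] = df["Sex"].replace({"female": 1, "male": 0})

# drop rows where "Age" is missing
df = df.dropna(subset=["Age"])
```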

Pyplot: overview and demo

Pyplot: overview and demo

Pyplot: data plotting interface - included in matplotlib, accessible from pandas

Pyplot: overview and demo

using pyplot directly:

import matplotlib.pyplot as plt

plt.hist(
    titanic["Pclass"],
    bins=[1, 2, 3, 4],
    align="left",
)
plt.xticks([1, 2, 3]);

using pyplot from pandas:

titanic["Pclass"].plot.hist(
    bins=[1, 2, 3, 4],
    align="left",
    xticks=[1, 2, 3],
);

Pyplot: overview and demo

plt.boxplot(
    titanic["Age"].dropna(),
    whis=(0, 100),
    labels=["Age"]
);

Pyplot: overview and demo

plt.hist(
    titanic["Age"],
    bins=[0, 10, 20, 30, 40, 50, 60, 70, 80],
);

Scikit-learn: overview and demo

Scikit-learn: overview and demo

exercise: predicting survival on the Titanic via a linear regression

simple algorithms can be trained to predict survival with 80% accuracy (based on sex, passenger class, age, number of siblings or spouses, number of parents or children)

Scikit-learn: overview and demo

defining input data and output data:

passenger_data = titanic[
    ["Female", "Pclass", "Age", "SibSp", "Parch"]
]
survived = titanic["Survived"]

Scikit-learn: overview and demo

"training" a model:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(passenger_data, survived)
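To sanity-check a fitted model, model.score returns the R² value. A self-contained sketch on synthetic data (the array values below are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data where the target is an exact linear function of the inputs
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = 2 * X[:, 0] - X[:, 1] + 1

model = LinearRegression()
model.fit(X, y)

# .score returns R^2 -- 1.0 here, since the target is perfectly linear
r2 = model.score(X, y)
```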

Scikit-learn: overview and demo

predicting a value for the survival of:

  • 40-year-old woman in first class (without companions)
  • 40-year-old man in second class (without companions)

new_passenger_data = pd.DataFrame(
    [
        [1, 1, 40, 0, 0],
        [0, 2, 40, 0, 0]
    ],
    columns=["Female", "Pclass", "Age", "SibSp", "Parch"],
)
model.predict(new_passenger_data)
# [0.93, 0.23]
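Note that a linear regression outputs continuous values rather than class labels; a common follow-up step (an assumption here, not part of the original exercise) is to threshold the predictions at 0.5:

```python
import numpy as np

predictions = np.array([0.93, 0.23])  # values from the example above

# interpret predictions above 0.5 as "survived"
survived_pred = predictions > 0.5     # [True, False]
```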