Library for efficient data processing
Data are stored in multidimensional arrays of numeric values, which are implemented efficiently
Data can represent images, sound, measurements and much more
common import convention:
import numpy as np
creating a 1-dimensional array:
a1d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
creating a 2-dimensional array:
a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
output:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
creating a 3-dimensional array:
a3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
output:
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
Arrays are implemented in C; the numeric entries are not full Python objects and require fewer resources
Python list (with references to Python integer objects):
list_a = [1, 2, 3, 4]
NumPy array (data are contained within the array without referencing Python integers):
array_a = np.array(list_a)
Fast element-wise operation (implemented in C):
array_a * array_a
Exercise:
Compare the execution time of an operation in pure Python and in NumPy by using time.perf_counter()
e.g. compute the square roots of all numbers from 0 to 1,000,000
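One possible solution sketch for this exercise. It assumes the range is meant to include both 0 and 1,000,000 and uses math.sqrt for the pure Python variant; the measured times depend on the machine, so no expected values are given.

```python
import math
import time

import numpy as np

n = 1_000_000  # assumption: upper bound is included

# pure Python: one math.sqrt call per number
start = time.perf_counter()
roots_py = [math.sqrt(i) for i in range(n + 1)]
time_py = time.perf_counter() - start

# NumPy: a single vectorized call on the whole array
start = time.perf_counter()
roots_np = np.sqrt(np.arange(n + 1))
time_np = time.perf_counter() - start

print(f"pure Python: {time_py:.4f} s")
print(f"NumPy:       {time_np:.4f} s")
```

Both variants produce the same values; only the time needed should differ.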
We can query these attributes:
a3d.shape # (2, 2, 2)
a3d.ndim # 3
a3d.size # 8
from the Zen of Python:
There should be one-- and preferably only one --obvious way to do it.
this philosophy is often not applied in NumPy
example: transposing an array
a2d.T
a2d.transpose()
np.transpose(a2d)
many operations are available in two ways:
as functions in the numpy package
as methods of the array class
we will be using mostly functions
available as functions and methods:
np.max(a2d)
a2d.max()
np.round(a2d)
a2d.round()
available as functions only:
np.sin(a2d)
np.exp(a2d)
np.expand_dims(a2d, 2)
creating a 2x6 array filled with 0:
np.zeros((2, 6))
# or
np.full((2, 6), 0.0)
creating number sequences:
np.linspace(0, 1.0, 11)
# [0.0, 0.1, ... 1.0]
np.arange(0, 3.14, 0.1)
# [0.0, 0.1, ... 3.1]
creating a 2x2 array of random values:
# create a random number generator
rng = np.random.default_rng(seed=1)
# floats between 0 and 1:
rng.random((2, 2))
# integers between 1 and 6:
rng.integers(1, 7, (2, 2))
older interface: np.random.random() and np.random.randint()
a1d[0] # 0
a2d[0, 1] # 2
a2d[0, :] # [1, 2, 3]
a2d[:, 0] # [1, 4, 7]
with 2D arrays: [row index, column index]
in general:
a2d[0, :] # [1, 2, 3]
shorter form:
a2d[0] # [1, 2, 3]
a1d[:3] # [0, 1, 2]
a1d[3:6] # [3, 4, 5]
a1d[6:] # [6, 7, 8, 9]
a1d[0:8:2] # [0, 2, 4, 6]
a1d[3:0:-1] # [3, 2, 1]
a1d[::-1] # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
a2d[1:, :] # [[4, 5, 6], [7, 8, 9]]
slicing along a single axis (e.g. a1d[3:6]) also works on Python lists
Operators are applied element-wise:
a = np.array([0, 1, 2, 3])
b = np.array([2, 2, 2, 2])
-a
# np.array([0, -1, -2, -3])
a + b
# np.array([2, 3, 4, 5])
a * b
# np.array([0, 2, 4, 6])
element-wise comparison of arrays:
a < b
# np.array([True, True, False, False])
a == b
# np.array([False, False, True, False])
Warning: a == b yields an array of booleans and cannot be used reasonably in if statements; use np.array_equal(a, b) instead
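A short sketch of why the warning matters, reusing the arrays a and b from above. The variable names ambiguous, all_equal and any_equal are only chosen for this illustration.

```python
import numpy as np

a = np.array([0, 1, 2, 3])
b = np.array([2, 2, 2, 2])

# a == b is an array of booleans, so its truth value is ambiguous
try:
    if a == b:
        pass
    ambiguous = False
except ValueError:
    ambiguous = True

# explicit alternatives that reduce the comparison to a single bool
all_equal = np.array_equal(a, b)      # are all entries equal?
any_equal = bool(np.any(a == b))      # is at least one entry equal?

print(ambiguous, all_equal, any_equal)  # True False True
```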
operations with single numbers (broadcasting):
print(a + 1)
# np.array([1, 2, 3, 4])
Some constants are available directly in NumPy:
print(a + np.pi)
print(a + np.e)
print(np.nan)
NumPy provides some mathematical functions that are applied element-wise:
print(np.sin(a))
# [0.0, 0.84147098, 0.9... ]
print(np.sqrt(a))
# [0.0, 1.0, 1.414... ]
abs
sin
cos
sqrt
exp
log
log10
round
Aggregations compute scalar values for an entire array or for each of its rows / columns / ...
sum over all entries:
np.sum(a2d)
sum along axis 0 ("downwards"):
np.sum(a2d, axis=0)
sum along axis 1 ("rightwards"):
np.sum(a2d, axis=1)
sum
min
max
std
percentile
(see next slides)
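The axis argument works the same way for the other aggregations. A small sketch using the a2d array from earlier:

```python
import numpy as np

a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

total = np.sum(a2d)              # 45
col_sums = np.sum(a2d, axis=0)   # [12, 15, 18]: one value per column
row_sums = np.sum(a2d, axis=1)   # [ 6, 15, 24]: one value per row
row_max = np.max(a2d, axis=1)    # [3, 6, 9]: maximum of each row

print(total, col_sums, row_sums, row_max)
```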
given an array of prices and an array of quantities, determine the total price:
prices = np.array([3.99, 4.99, 3.99, 12.99])
# buying the first item 3 times and the last item 2 times
quantities = np.array([3, 0, 0, 2])
# solution: 37.95
given an array of masses and an array of velocities of some bodies, determine the kinetic energy of every body and the total kinetic energy of all bodies together
masses = np.array([1.2, 2.2, 1.5, 2.0])
velocities = np.array([12.0, 14.0, 14.0, 7.5])
formula: E = m*v^2 / 2
given the coordinates of the vertices of a triangle, determine its centroid (arithmetic mean of its vertices)
a = np.array([5, 1])
b = np.array([6, 8])
c = np.array([1, 3])
# solution: [4, 4]
create a "value table" for the sine and cosine functions in the interval between 0 and 2*pi.
result:
# x, sin(x), cos(x)
np.array([[0.0, 0.01, 0.02, ...],
          [0.0, 0.0099998, 0.0199999, ...],
          [1.0, 0.99995, 0.99980, ...]])
using this data, verify the following equation: sin(x)^2 + cos(x)^2 = 1
Simulate 1 million dice rolls with 10 dice each
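One way to approach this exercise, as a sketch: each row holds one roll of 10 dice. The seed and the choice to sum the dice of each roll are assumptions made for this illustration; the exercise itself does not say what to compute from the rolls.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen arbitrarily

# 1 million rows, each with 10 dice showing values from 1 to 6
rolls = rng.integers(1, 7, (1_000_000, 10))

# sum of the 10 dice for every roll
sums = np.sum(rolls, axis=1)

print(rolls.shape)    # (1000000, 10)
print(np.mean(sums))  # close to the expected value of 35
```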
a = np.array([4.1, 2.7, -1, 3.8, -1])
a_valid = a > 0
# array([True, True, False, True, False])
a_filtered = a[a_valid]
# array([4.1, 2.7, 3.8])
a_invalid = a < 0
a_with_nans = np.copy(a)
a_with_nans[a_invalid] = np.nan
# array([4.1, 2.7, nan, 3.8, nan])
a = np.array([4.1, 2.7, -1, 3.8, -1])
a_filtered = a[a >= 0]
a_with_nans = np.copy(a)
a_with_nans[a_with_nans < 0] = np.nan
an int8 consists of 8 bits and can store 2^8 (256) different numbers
number of representable values for integer types:
an unsigned integer (uint) can only represent non-negative numbers
e.g. uint8: 0 to 255
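The limits of the integer types can be queried with np.iinfo. The sketch below also shows what happens when an array operation leaves the representable range: the result wraps around silently.

```python
import numpy as np

# smallest and largest representable value for some integer types
for int_type in [np.int8, np.uint8, np.int16, np.int32, np.int64]:
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

x = np.array([250], dtype="uint8")
y = np.array([10], dtype="uint8")

# 260 does not fit into a uint8: the result wraps around to 260 - 256 = 4
wrapped = x + y
print(wrapped)  # [4]
```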
standardized way of representing real numbers in computers: IEEE 754
common floating point types:
rounding errors: some numbers cannot be represented as floating point numbers, they will always be approximations
examples in the decimal system: 1/3, 1/7, π
examples in the binary system (i.e. floats): 1/10, 1/5, 1/3, π
example: π + π evaluates to 6.2 when using decimal numbers with a precision of 2 (a more exact result would be 6.3)
example: 0.1 + 0.2 evaluates to ~ 0.30000000000000004 when using 64 bit floats
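Because of such rounding errors, floats are usually compared with a tolerance instead of ==. A minimal sketch using np.isclose and np.allclose with their default tolerances:

```python
import numpy as np

total = 0.1 + 0.2

exact_match = total == 0.3                 # False: the values differ slightly
close_match = bool(np.isclose(total, 0.3)) # True: equal within a tolerance

# np.allclose does the same check for whole arrays
arrays_match = np.allclose(np.array([0.1, 0.2]) * 3, [0.3, 0.6])

print(exact_match, close_match, arrays_match)  # False True True
```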
some operations result in loss of precision - e.g. subtracting numbers that are close to each other
example:
a = 0.001234567 (7 significant digits)
b = 0.001234321 (7 significant digits)
c = a - b
c = 0.000000246 (3 significant digits)
Special values in IEEE 754:
inf and -inf (infinite values)
nan (not-a-number: undefined / unknown value)
storage format:
(-) 2^e * s
pi as float32:
0 10000000 10010010000111111011011
2*pi as float32:
0 10000001 10010010000111111011011
pi/2 as float32:
0 01111111 10010010000111111011011
numbers 0.20000000, 0.20000001, ... 0.20000005 expressed as closest float32 numbers:
0 01111100 10011001100110011001101
0 01111100 10011001100110011001101
0 01111100 10011001100110011001110
0 01111100 10011001100110011001111
0 01111100 10011001100110011001111
0 01111100 10011001100110011010000
Avogadro constant (6.02214076 * 10^23):
0 11001101 11111110000110001000001
planck length (1.61625518 * 10^-35):
0 00001011 01010111101011110110100
largest float32 number:
0 11111110 11111111111111111111111
~ 2^127.9999 ~ 3.403e38
smallest positive float32 number with full precision:
0 00000001 00000000000000000000000
= 2^-126 ~ 1.175e-38
larger numbers will yield inf
smaller numbers will lose precision or yield 0.0
inf: 0 11111111 00000000000000000000000
nan: 0 11111111 00000000000000000000001 (any non-zero mantissa with this exponent is a nan)
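The bit patterns above can be reproduced in NumPy by reinterpreting the bytes of a float32 as an unsigned integer. The helper float32_bits is a hypothetical name introduced only for this sketch.

```python
import numpy as np


def float32_bits(value):
    """Return the sign, exponent and mantissa bits of value as a float32."""
    # reinterpret the 4 bytes of the float32 as an unsigned 32 bit integer
    as_int = np.array([value], dtype=np.float32).view(np.uint32)[0]
    bits = f"{int(as_int):032b}"
    # 1 sign bit, 8 exponent bits, 23 mantissa bits
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(float32_bits(np.pi))      # 0 10000000 10010010000111111011011
print(float32_bits(2 * np.pi))  # 0 10000001 10010010000111111011011
print(float32_bits(np.inf))     # 0 11111111 00000000000000000000000
```

Multiplying by 2 only changes the exponent bits; the mantissa stays the same.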
Each NumPy array can only hold data of one type (e.g. only 64 bit floats or only bytes)
Each array has a predefined data type for all entries
a = np.array([1])
a.dtype # int64 on most platforms (int32 on Windows with NumPy < 2.0)
b = np.array([1.0])
b.dtype # float64
c = np.array(['abc'])
c.dtype # <U3
d = np.array([b'abc'])
d.dtype # |S3
Types may be stated explicitly:
a = np.array([1, 2, 3, 4], dtype='int64')
b = np.array([1, 2, 3, 4], dtype='uint8')
If possible, types are converted automatically:
c = a + b
c.dtype # int64
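The resulting type of an operation can be checked beforehand with np.result_type, and arrays can be converted explicitly with astype. A small sketch based on the arrays a and b from above:

```python
import numpy as np

a = np.array([1, 2, 3, 4], dtype="int64")
b = np.array([1, 2, 3, 4], dtype="uint8")

# type that an operation on both operands would produce
print(np.result_type(a, b))               # int64
print(np.result_type(np.uint8, np.int8))  # int16: smallest type that fits both

# mixing integers and floats yields floats
print((a + 0.5).dtype)                    # float64

# explicit conversion creates a new array of the requested type
b_float = b.astype("float32")
print(b_float.dtype)                      # float32
```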
common types:
precision for float types:
float16: exact for ~3 decimal digits
np.array([2.71828, 0.271828], dtype="float16")
# array([2.719 , 0.2717])
float16: overflow
np.array([65450, 65500, 65550], dtype="float16")
# array([65440, 65500, inf])
float16: underflow
np.array(
    [3.141e-5, 3.141e-6, 3.141e-7, 3.141e-8, 3.141e-9],
    dtype="float16"
)
# array([3.14e-05, 3.16e-06, 2.98e-07, 5.96e-08, 0.00e+00])
Several operations in numpy will produce views of the data - multiple numpy arrays can refer to the same data in the background (for efficiency)
comparison: creating a copy of a list, creating a view of an array
numbers = [1, 2, 3] # avoid the name "list", which would shadow the built-in
numbers_copy = numbers[:]
numbers_copy[0] = 10 # does NOT change numbers
array = np.array([1, 2, 3])
array_view = array[:]
array_view[0] = 10 # DOES change array
Arrays can be copied via np.copy()
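Whether two arrays refer to the same data can be checked with np.shares_memory. A minimal sketch; the names original, view and copy are chosen only for this illustration.

```python
import numpy as np

original = np.array([1, 2, 3])
view = original[:]          # refers to the same data
copy = np.copy(original)    # independent data

view[0] = 10   # visible in original
copy[1] = 20   # not visible in original

print(original)                         # [10  2  3]
print(np.shares_memory(original, view)) # True
print(np.shares_memory(original, copy)) # False
```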
np.reshape(a3d, (8, )) # 1d array
np.reshape(a3d, (2, 4)) # 2d array
automatic sizing for one axis:
np.ravel(a3d) # 1d array
np.reshape(a3d, (-1, )) # 1d array
np.reshape(a3d, (2, -1)) # 2d array
these operations will create views whenever possible (otherwise they return copies)
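A sketch of one case where reshaping returns a view and one where it has to copy, checked with np.shares_memory. The transposed array is used here only as an example of a memory layout that cannot be flattened without copying.

```python
import numpy as np

a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# flattening the original array: the data can be reused, so this is a view
flat = np.reshape(a2d, (-1,))
print(np.shares_memory(a2d, flat))    # True

# flattening the transposed array: the entries are needed in a different
# order than they are stored, so NumPy has to create a copy
flat_t = np.reshape(a2d.T, (-1,))
print(flat_t)                         # [1 4 7 2 5 8 3 6 9]
print(np.shares_memory(a2d, flat_t))  # False
```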
reversing order of axes (flipping axes in 2D):
np.transpose(a2d)
a2d.T
concatenating along an existing axis (axis 0 by default):
np.concatenate([a1d, a1d])
np.concatenate([a2d, a2d])
np.concatenate([a2d, a2d], axis=1)
concatenating along a new axis:
np.stack([a1d, a1d])
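Comparing the resulting shapes makes the difference between the two functions visible. A short sketch with the arrays a1d and a2d from earlier:

```python
import numpy as np

a1d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# concatenate keeps the number of dimensions and extends one axis
print(np.concatenate([a1d, a1d]).shape)          # (20,)
print(np.concatenate([a2d, a2d]).shape)          # (6, 3)
print(np.concatenate([a2d, a2d], axis=1).shape)  # (3, 6)

# stack adds a new axis
print(np.stack([a1d, a1d]).shape)                # (2, 10)
print(np.stack([a2d, a2d]).shape)                # (2, 3, 3)
```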
np.transpose(m)
np.linalg.inv(m)
np.eye(2) # unit matrix
matrix multiplication: via the binary operator @
example: rotating several points by 45° / 90° (counterclockwise):
points = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])
m = np.array([[np.sqrt(0.5), np.sqrt(0.5)],
              [-np.sqrt(0.5), np.sqrt(0.5)]])
print(points @ m)
print(points @ m @ m)
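The results of the rotation example can be checked numerically. The sketch below compares with np.allclose, since the entries of m are rounded floats; the names s and rotated_90 are chosen only for this illustration.

```python
import numpy as np

s = np.sqrt(0.5)
m = np.array([[s, s],
              [-s, s]])
points = np.array([[0, 0], [0, 1], [1, 1], [1, 0]])

# applying the 45° rotation twice gives a 90° rotation
rotated_90 = points @ m @ m
expected = np.array([[0, 0], [-1, 0], [-1, 1], [0, 1]])
print(np.allclose(rotated_90, expected))               # True

# for a rotation matrix the inverse equals the transpose
print(np.allclose(np.linalg.inv(m), np.transpose(m)))  # True
print(np.allclose(m @ np.linalg.inv(m), np.eye(2)))    # True
```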
example:
known data: prices of various products, number of items in stock for different stores
prices = np.array([3.99, 12.99, 5.90, 15])
quantities = np.array([[0, 80, 80, 100],
                       [100, 0, 0, 0],
                       [50, 0, 0, 50]])
wanted: total value for each of the three stores
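A possible way to compute this is a matrix-vector product: every row of quantities is multiplied element-wise with prices and summed up. The name store_values is chosen only for this sketch.

```python
import numpy as np

prices = np.array([3.99, 12.99, 5.90, 15])
quantities = np.array([[0, 80, 80, 100],
                       [100, 0, 0, 0],
                       [50, 0, 0, 50]])

# one total per store (one entry per row of quantities)
store_values = quantities @ prices
print(store_values)  # approximately [3011.2, 399.0, 949.5]

# the same result via element-wise multiplication and a sum along axis 1
same_values = np.sum(quantities * prices, axis=1)
```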