Python: scientific libraries

23 May 2024 - Houston

NumPy

Version

import numpy as np
np.__version__

'1.12.1'

Print options

np.set_printoptions(precision=4)

Creating from Python data structures

Every element in an np array must have the same type.

np.array([1, 2, 3, 4, 5])

array([1, 2, 3, 4, 5])

Data type promotion (all elements converted to consistent type):

mixed_nums = (14, -3.54, 5+7j)
np.array(mixed_nums)

array([ 14.00+0.j,  -3.54+0.j,   5.00+7.j])
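The promotion can be checked by inspecting `dtype`. A minimal sketch; the variable names `ints_only` and `ints_and_float` are illustrative and not part of the original notes.

```python
import numpy as np

# Integers only: NumPy keeps an integer dtype (exact width depends on the platform).
ints_only = np.array([1, 2, 3])

# A single float promotes every element to float64.
ints_and_float = np.array([1, 2, 3.5])

# A single complex number promotes every element to complex128.
mixed_nums = np.array((14, -3.54, 5 + 7j))

print(ints_only.dtype.kind)   # i (signed integer)
print(ints_and_float.dtype)   # float64
print(mixed_nums.dtype)       # complex128
```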

Creating using NumPy methods

np.arange(10, step=2)

array([0, 2, 4, 6, 8])

np.arange(5, 10) + 1

array([ 6,  7,  8,  9, 10])

len(np.arange(0, 10, 2))
np.arange(10).size

5
10

np.arange(24, 25)

array([24])

Operations are performed element-wise

np.array([1, 2, 3, 4]) * 10

array([10, 20, 30, 40])

linspace, zeros, ones, data types

`linspace`

Returns `num` (default 50) evenly spaced numbers over the given interval. The interval is closed by default: the stop value is included.

np.linspace(5, 10, 9).size

Pass `retstep=True` to also return the step size between entries:

np.linspace(5, 15, 3, retstep=True)

(array([  5.,  10.,  15.]), 5.0)
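A short sketch of how the closed interval affects the step size, using the `endpoint` parameter of `np.linspace`. The variable names are illustrative.

```python
import numpy as np

# Default: the stop value is included, so 9 points span 8 equal gaps.
closed, closed_step = np.linspace(5, 10, 9, retstep=True)

# endpoint=False excludes the stop value, so 9 points span 9 equal gaps.
half_open, open_step = np.linspace(5, 10, 9, endpoint=False, retstep=True)

print(closed.size, closed[-1], closed_step)   # 9 10.0 0.625
print(half_open.size, half_open[-1] < 10)     # 9 True
```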

`zeros`

np.zeros(5)

array([ 0.,  0.,  0.,  0.,  0.])

np.zeros((5, 3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

np.zeros(11, dtype='int64')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

`ones`

np.ones(5)

array([ 1.,  1.,  1.,  1.,  1.])

np.ones((3, 2))

array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

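`ndarray`

The low-level constructor allocates memory without initializing it, so the values shown are whatever happened to be in that memory.

np.ndarray(shape=(2, 4), dtype=float)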

array([[  1.2882e-231,   1.2882e-231,   1.2882e-231,   1.2882e-231],
       [  1.2882e-231,   1.2882e-231,   1.7587e-310,   3.5098e+064]])

np.ndarray(shape=(3,), dtype=int, order='C')

array([4617315517961601024, 4621819117588971520, 4624633867356078080])

Slicing and iterating

Arrays

np_arr = np.array([-17, -4, 0, 2, 21])
np_arr[0]
np_arr[-1]
np_arr[-1] = 33
np_arr

-17
21
array([-17,  -4,   0,   2,  33])

Multidimensional arrays: `shape`

matr = np.arange(35)
matr.shape = (7, 5)
matr[2]
matr[2, 3]
matr[2][3]

array([10, 11, 12, 13, 14])
13
13

3-D Arrays

array_3d = np.arange(70)
array_3d.shape = (2, 7, 5)
array_3d


array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34]],

       [[35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59],
        [60, 61, 62, 63, 64],
        [65, 66, 67, 68, 69]]])

array_3d[1][4][3]

array_3d[1][4][3] = 1111
array_3d


array([[[   0,    1,    2,    3,    4],
        [   5,    6,    7,    8,    9],
        [  10,   11,   12,   13,   14],
        [  15,   16,   17,   18,   19],
        [  20,   21,   22,   23,   24],
        [  25,   26,   27,   28,   29],
        [  30,   31,   32,   33,   34]],

       [[  35,   36,   37,   38,   39],
        [  40,   41,   42,   43,   44],
        [  45,   46,   47,   48,   49],
        [  50,   51,   52,   53,   54],
        [  55,   56,   57, 1111,   59],
        [  60,   61,   62,   63,   64],
        [  65,   66,   67,   68,   69]]])

Boolean mask arrays

vector = np.array([-17, -4, 0, 21, 37])
divisible_by_7_mask = (vector % 7) == 0
divisible_by_7_mask

array([False, False,  True,  True, False], dtype=bool)

vector[vector % 7 == 0]

array([ 0, 21])

div_by_3_test = vector % 3 == 0
positive_test = vector > 0
combined_test = np.logical_and(div_by_3_test, positive_test)
vector[combined_test]

array([21])

Broadcasting

Broadcasting is how NumPy handles operations between arrays of different shapes.
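A minimal sketch of the rule: shapes are compared from the trailing dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The arrays `grid`, `row_offsets`, and `col_scale` are made up for illustration.

```python
import numpy as np

grid = np.arange(35).reshape((7, 5))            # shape (7, 5)
row_offsets = np.array([0, 10, 20, 30, 40])     # shape (5,)
col_scale = np.arange(7).reshape((7, 1))        # shape (7, 1)

# (7, 5) + (5,): the 1-D array is applied to every row.
shifted = grid + row_offsets

# (7, 5) * (7, 1): the single column is stretched across all 5 columns.
scaled = grid * col_scale

print(shifted.shape, scaled.shape)   # (7, 5) (7, 5)
print(shifted[0])                    # [ 0 11 22 33 44]
```

Shapes that are neither equal nor 1 in some dimension, such as (7, 5) and (7,), raise a `ValueError`.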

Matrix attributes

my_3d_array = np.arange(70)
my_3d_array.shape = (2, 7, 5)
my_3d_array.ndim
my_3d_array.size
my_3d_array.dtype

3
70
dtype('int64')

Scalars

5 * my_3d_array - 2

array([[[ -2,   3,   8,  13,  18],
        [ 23,  28,  33,  38,  43],
        [ 48,  53,  58,  63,  68],
        [ 73,  78,  83,  88,  93],
        [ 98, 103, 108, 113, 118],
        [123, 128, 133, 138, 143],
        [148, 153, 158, 163, 168]],

       [[173, 178, 183, 188, 193],
        [198, 203, 208, 213, 218],
        [223, 228, 233, 238, 243],
        [248, 253, 258, 263, 268],
        [273, 278, 283, 288, 293],
        [298, 303, 308, 313, 318],
        [323, 328, 333, 338, 343]]])

Vectors

inner product: np.inner and np.dot

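np.inner: inner product of two arrays. For 1D arrays this is the ordinary inner product of vectors. For higher dimensions it sums over the last axis of both arrays, so their last dimensions must match.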

left_matrix = np.arange(6).reshape((2, 3))
right_matrix = np.arange(15).reshape((3, 5))
np.inner(left_matrix, right_matrix)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: shapes (2,3) and (5,3) not aligned: 3 (dim 1) != 5 (dim 0)

For 2D arrays, use the matrix product np.dot instead. For 1D arrays, np.dot also gives the inner product of vectors.

np.dot(left_matrix, right_matrix)

array([[ 25,  28,  31,  34,  37],
       [ 70,  82,  94, 106, 118]])
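The relationship between the two functions can be sketched as follows: for 2D inputs, `np.inner(a, b)` gives the same result as `np.dot(a, b.T)`. The names `via_inner` and `via_dot` are illustrative.

```python
import numpy as np

left_matrix = np.arange(6).reshape((2, 3))
right_matrix = np.arange(15).reshape((3, 5))

# np.inner contracts the LAST axis of both arguments, so the second
# operand must have shape (5, 3) here: the transpose of right_matrix.
via_inner = np.inner(left_matrix, right_matrix.T)

# np.dot contracts the last axis of the first argument with the
# second-to-last axis of the second: the usual matrix product.
via_dot = np.dot(left_matrix, right_matrix)

print(np.array_equal(via_inner, via_dot))   # True
print(np.inner([1, 2, 3], [4, 5, 6]))       # 32
```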

Operations along axes

one_to_three = np.arange(3) + 1
matrix = [one_to_three, one_to_three, one_to_three]
np.array(one_to_three).sum()
np.array(matrix).sum() # sum all elements
np.array(matrix).sum(axis=0) # sum down the rows: one total per column
np.array(matrix).sum(axis=1) # sum across the columns: one total per row

6
18
array([3, 6, 9])
array([6, 6, 6])

my_3d_array.sum(axis=0)

array([[ 35,  37,  39,  41,  43],
       [ 45,  47,  49,  51,  53],
       [ 55,  57,  59,  61,  63],
       [ 65,  67,  69,  71,  73],
       [ 75,  77,  79,  81,  83],
       [ 85,  87,  89,  91,  93],
       [ 95,  97,  99, 101, 103]])
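A small sketch of what reducing along an axis does to the shape: the summed axis disappears unless `keepdims=True` is passed. The name `totals` is illustrative.

```python
import numpy as np

my_3d_array = np.arange(70).reshape((2, 7, 5))

# Summing over an axis removes that axis from the result's shape.
print(my_3d_array.sum(axis=0).shape)   # (7, 5)
print(my_3d_array.sum(axis=1).shape)   # (2, 5)
print(my_3d_array.sum(axis=2).shape)   # (2, 7)

# keepdims=True keeps the reduced axis with length 1, which is handy
# when the result should broadcast against the original array.
totals = my_3d_array.sum(axis=2, keepdims=True)
print(totals.shape)                    # (2, 7, 1)
```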

Structured and record arrays

Structured arrays for data definition

person_data_def = [('name', 'S6'), ('height', 'f8'), ('weight', 'f8'), ('age', 'i8')]
people_array = np.zeros((4), dtype=person_data_def)
people_array[0] = ('Alpha', 65, 112, 25)
people_array[1] = ('Beta', 43, 128, 33)
people_array[2] = ('Gamma', 29, 188, 35)
people_array[3] = ('Delta', 73, 205, 34)
people_array

array([(b'Alpha',  65.,  112., 25), (b'Beta',  43.,  128., 33),
       (b'Gamma',  29.,  188., 35), (b'Delta',  73.,  205., 34)],
      dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])

Accessing data in structured arrays

people_array[2:]

array([(b'Gamma',  29.,  188., 35), (b'Delta',  73.,  205., 34)],
      dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])

ages = people_array['age']
ages

array([25, 33, 35, 34])

Record arrays: A wrapper around structured arrays

Fields can be accessed as attributes instead of by string index.

person_record_array = np.rec.array(people_array)
person_record_array

rec.array([
 (b'Alpha',  65.,  112., 25),
 (b'Beta',  43.,  128., 33),
 (b'Gamma',  29.,  188., 35),
 (b'Delta',  73.,  205., 34)
],
 dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])

person_record_array[0].age

Views and copies

Assigning an array to a new variable creates a new reference to the same NumPy object: same location in memory, same underlying data.

import numpy as np
mi_casa = np.array([-45, -31, -12, 0])
su_casa = mi_casa

# same object
id(mi_casa)
id(su_casa)
mi_casa is su_casa

# equal values
mi_casa == su_casa

# values remain in sync when mutated
su_casa[0] = 100
mi_casa is su_casa
mi_casa == su_casa

4642127504
4642127504
True
array([ True,  True,  True,  True], dtype=bool)
True
array([ True,  True,  True,  True], dtype=bool)

Views

`view()` returns a new array object that shares the original's data (a shallow copy).

dog_house = mi_casa.view()

dog_house is mi_casa # new array object
dog_house == mi_casa # same values as original

mi_casa[0] = 345
dog_house is mi_casa # still a different object
dog_house == mi_casa # values remain in sync because the data is shared

False
array([ True,  True,  True,  True], dtype=bool)
False
array([ True,  True,  True,  True], dtype=bool)

Copies

`copy()` makes a complete copy of the array and its data (a deep copy).

tree_house = mi_casa.copy()

# different object, same values
tree_house is mi_casa
tree_house == mi_casa

# values are distinct
tree_house[0] = 983798739
tree_house is mi_casa
tree_house == mi_casa

False
array([ True,  True,  True,  True], dtype=bool)
False
array([False,  True,  True,  True], dtype=bool)
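The three cases above can also be told apart without mutating anything. A sketch using `np.shares_memory` and the `.base` attribute; the slice name `first_two` is illustrative.

```python
import numpy as np

mi_casa = np.array([-45, -31, -12, 0])
su_casa = mi_casa            # second name for the same object
dog_house = mi_casa.view()   # new array object, same data buffer
tree_house = mi_casa.copy()  # new array object, new data buffer

print(su_casa is mi_casa)                      # True
print(np.shares_memory(dog_house, mi_casa))    # True
print(np.shares_memory(tree_house, mi_casa))   # False

# A view records the array that owns its data in .base; a copy owns its own.
print(dog_house.base is mi_casa)               # True
print(tree_house.base is None)                 # True

# Basic slicing also returns a view, so writes show up in the original.
first_two = mi_casa[:2]
first_two[0] = 999
print(mi_casa[0])                              # 999
```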

Array attributes

import numpy as np
arr = np.array(np.arange(24)).reshape((2, 3, 4))
arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

ndim
```
arr.ndim
```
```
3
```

shape
```
arr.shape
```
```
(2, 3, 4)
```

size
```
arr.size
```
```
24
```

dtype
```
arr.dtype
```
```
dtype('int64')
```

itemsize
```
arr.itemsize
```
```
8
```

Add and remove elements

`append`

Appends values to a copy of the given array. Without an `axis` argument both inputs are flattened first, so the shape of the given array is not maintained. Returns a copy, not a view.

arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

arr2 = np.append(arr, [5, 6, 7, 8])
arr2.shape


(28,)

Use `reshape` to restore a multidimensional shape:

arr2.reshape((7, 4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [ 5,  6,  7,  8]])

`append` to a specific axis

matrix = np.array(np.arange(9)).reshape((3, 3))
matrix


array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Append new matrix as new rows:

new_matrix = np.array(np.arange(9) + 10).reshape((3, 3))
np.append(matrix, new_matrix, axis=0)


array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

Append new matrix as new columns:

hstack = np.append(matrix, new_matrix, axis=1)
hstack


array([[ 0,  1,  2, 10, 11, 12],
       [ 3,  4,  5, 13, 14, 15],
       [ 6,  7,  8, 16, 17, 18]])

Horizontal stacking: `hstack`

Convenience function for concatenating along the second axis (axis 1); 1-D arrays are joined along their only axis. Returns a copy, not a view, like `append`.

a = np.array(np.arange(9)).reshape((3, 3))
b = np.array(np.arange(9) + 10).reshape((3, 3))
haystack = np.hstack((a, b))
haystack


array([[ 0,  1,  2, 10, 11, 12],
       [ 3,  4,  5, 13, 14, 15],
       [ 6,  7,  8, 16, 17, 18]])

`insert`

Inserts values before the given index along the given axis. Returns a new array; the original is unchanged.

arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

np.insert(arr, 1, 444, axis=0)

array([[[  0,   1,   2,   3],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[444, 444, 444, 444],
        [444, 444, 444, 444],
        [444, 444, 444, 444]],

       [[ 12,  13,  14,  15],
        [ 16,  17,  18,  19],
        [ 20,  21,  22,  23]]])

np.insert(arr, 1, 444, axis=1)

array([[[  0,   1,   2,   3],
        [444, 444, 444, 444],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[ 12,  13,  14,  15],
        [444, 444, 444, 444],
        [ 16,  17,  18,  19],
        [ 20,  21,  22,  23]]])

`delete`

arr

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

Delete element 1 at axis 0:

np.delete(arr, 1, axis=0)

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]])

Delete element 0 at axis 1:

np.delete(arr, 0, axis=1)

array([[[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23]]])

Delete element 2 at axis 2:

np.delete(arr, 2, axis=2)

array([[[ 0,  1,  3],
        [ 4,  5,  7],
        [ 8,  9, 11]],

       [[12, 13, 15],
        [16, 17, 19],
        [20, 21, 23]]])

Joining and splitting

`concatenate`

Returns a copy, not a view.

import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
np.concatenate((arr1, arr2), axis=0)
np.concatenate((arr1, arr2), axis=1)

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

`stack`

np.stack([arr1, arr2], axis=0)

array([[[1, 2], [3, 4]],
       [[5, 6], [7, 8]]])

`split`

temp = np.arange(9).reshape((3, 3))
np.split(temp, 3, axis=0)

[array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]

Rearrange elements

`fliplr`

Reverse order of elements along the second axis

orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.fliplr(orig_array)


array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
array([[ 4,  3,  2,  1,  0],
       [ 9,  8,  7,  6,  5],
       [14, 13, 12, 11, 10]])

orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.fliplr(orig_array)


array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])

array([[[2, 3], [0, 1]],
       [[6, 7], [4, 5]]])

`flipud`

Reverse order of elements along the first axis

orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.flipud(orig_array)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
array([[10, 11, 12, 13, 14],
       [ 5,  6,  7,  8,  9],
       [ 0,  1,  2,  3,  4]])

orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.flipud(orig_array)


array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])

array([[[4, 5], [6, 7]],
       [[0, 1], [2, 3]]])

`roll`

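Shifts elements by n positions; elements that roll past the last position wrap around to the first. Without an `axis` argument the array is flattened, shifted, and restored to its original shape. Shifts forward for n > 0, backward for n < 0.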

arr
np.roll(arr, 4)
print('------')
np.roll(arr, 5)
np.roll(arr, 1)
print('------')
np.roll(arr, -1)

array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
array([[[20, 21, 22, 23],
        [ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])
------
array([[[19, 20, 21, 22],
        [23,  0,  1,  2],
        [ 3,  4,  5,  6]],

       [[ 7,  8,  9, 10],
        [11, 12, 13, 14],
        [15, 16, 17, 18]]])
array([[[23,  0,  1,  2],
        [ 3,  4,  5,  6],
        [ 7,  8,  9, 10]],

       [[11, 12, 13, 14],
        [15, 16, 17, 18],
        [19, 20, 21, 22]]])
------
array([[[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12]],

       [[13, 14, 15, 16],
        [17, 18, 19, 20],
        [21, 22, 23,  0]]])
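The outputs above all use the default flattened roll. A sketch of how the optional `axis` argument of `np.roll` changes the result; the result names are illustrative.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

# Without axis: flatten, shift by one position, restore the shape.
flat_roll = np.roll(arr, 1)

# With axis=2: shift within each row of 4; values wrap around inside the row.
row_roll = np.roll(arr, 1, axis=2)

# With axis=0: shift whole 3x4 blocks, which swaps the two blocks here.
block_roll = np.roll(arr, 1, axis=0)

print(flat_roll[0, 0])    # [23  0  1  2]
print(row_roll[0, 0])     # [3 0 1 2]
print(block_roll[0, 0])   # [12 13 14 15]
```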

`rot90`

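Rotates the array by 90 degrees in the plane of the first two axes: counterclockwise by default, clockwise for negative `k`.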

orig_array

array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])

np.rot90(orig_array)

array([[[2, 3], [6, 7]],
       [[0, 1], [4, 5]]])

np.rot90(orig_array, k=-1)

array([[[4, 5], [0, 1]],
       [[6, 7], [2, 3]]])

Transpose-like operations

arr_3x8 = np.array(np.arange(24)).reshape((3, 8))
arr_3x8

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23]])

`transpose`

np.transpose(arr_3x8)

array([[ 0,  8, 16],
       [ 1,  9, 17],
       [ 2, 10, 18],
       [ 3, 11, 19],
       [ 4, 12, 20],
       [ 5, 13, 21],
       [ 6, 14, 22],
       [ 7, 15, 23]])

np.transpose(arr_3x8, axes=(1, 0))

array([[ 0,  8, 16],
       [ 1,  9, 17],
       [ 2, 10, 18],
       [ 3, 11, 19],
       [ 4, 12, 20],
       [ 5, 13, 21],
       [ 6, 14, 22],
       [ 7, 15, 23]])

`swapaxes`

arr_3x2x4 = np.array(np.arange(24)).reshape((3, 2, 4))
arr_3x2x4
np.swapaxes(arr_3x2x4, axis1=0, axis2=2)


array([[[ 0,  1,  2,  3], [ 4,  5,  6,  7]],
       [[ 8,  9, 10, 11], [12, 13, 14, 15]],
       [[16, 17, 18, 19], [20, 21, 22, 23]]])
array([[[ 0,  8, 16], [ 4, 12, 20]],
       [[ 1,  9, 17], [ 5, 13, 21]],
       [[ 2, 10, 18], [ 6, 14, 22]],
       [[ 3, 11, 19], [ 7, 15, 23]]])

`rollaxis`

A view is returned.

mat_4d = np.ones((3, 4, 5, 6))
rolled = np.rollaxis(mat_4d, 1)
rolled.shape

(4, 3, 5, 6)
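For comparison, `np.moveaxis` expresses the same operation with an explicit source and destination axis. A minimal sketch; the name `moved` is illustrative.

```python
import numpy as np

mat_4d = np.ones((3, 4, 5, 6))

# rollaxis(a, 1) moves axis 1 to the front (position 0).
rolled = np.rollaxis(mat_4d, 1)

# moveaxis states source and destination explicitly.
moved = np.moveaxis(mat_4d, 1, 0)

print(rolled.shape, moved.shape)          # (4, 3, 5, 6) (4, 3, 5, 6)
print(np.shares_memory(moved, mat_4d))    # True: the result is a view
```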

Applications

Universal functions

np.frompyfunc

ufunc: a function that operates on `ndarray`s element-wise, supporting array broadcasting, type casting, and other standard features. `np.frompyfunc` wraps an ordinary Python function so that it behaves this way.

import numpy as np

def truncated_binomial(x):
    return (x + 1)**3 - x**3

truncated_binomial(4)

Arguments to `np.frompyfunc`: the function, its number of input arguments, and the number of values it returns.

nums = np.ones(6).reshape((2, 3)) * 4
nums
trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
trunc_binom(nums)

array([[ 4.,  4.,  4.],
       [ 4.,  4.,  4.]])
array([[61.0, 61.0, 61.0],
       [61.0, 61.0, 61.0]], dtype=object)
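Note the `dtype=object` in the result. A sketch of converting back to a numeric dtype, and of the observation that this particular function already works on arrays directly because it only uses arithmetic operators. The names `as_objects`, `as_floats`, and `direct` are illustrative.

```python
import numpy as np

def truncated_binomial(x):
    return (x + 1) ** 3 - x ** 3

nums = np.ones(6).reshape((2, 3)) * 4

# frompyfunc calls the Python function once per element and always
# returns an object array, so convert back to a numeric dtype if needed.
trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
as_objects = trunc_binom(nums)
as_floats = as_objects.astype(float)

# The expression only uses arithmetic NumPy already vectorizes, so
# calling the function on the array directly gives the same numbers.
direct = truncated_binomial(nums)

print(as_objects.dtype, as_floats.dtype, direct.dtype)   # object float64 float64
print(np.array_equal(as_floats, direct))                 # True
```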

Linear algebra

Matrices

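With `np.matrix`, common operations such as transpose and inverse are available as properties rather than function calls. The NumPy documentation no longer recommends `np.matrix` for new code; regular arrays with the `@` operator and `np.linalg` functions are preferred.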

my_matrix = np.matrix([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
my_matrix

matrix([[3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]])

Transpose:

my_matrix.T

matrix([[3, 1, 2],
        [1, 5, 6],
        [4, 9, 5]])

Inverse:

my_matrix.I

matrix([[ 0.3222, -0.2111,  0.1222],
        [-0.1444, -0.0778,  0.2556],
        [ 0.0444,  0.1778, -0.1556]])

Identity matrices

np.eye(3, dtype=int)

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

Solving systems of linear equations

\begin{align} A \mathbf{x} &= \mathbf{b} \\ A^{-1} A \mathbf{x} &= A^{-1}\mathbf{b} \\ \mathbf{x} &= A^{-1}\mathbf{b} \end{align}

my_matrix
rhs = np.matrix([[11], [22], [33]])
inverse = my_matrix.I
solution = inverse * rhs
solution

matrix([[3, 1, 4],
        [1, 5, 9],
        [2, 6, 5]])
matrix([[ 2.9333],
        [ 5.1333],
        [-0.7333]])

Optimized version, which solves the system directly instead of forming the inverse:

from numpy.linalg import solve
solve(my_matrix, rhs)

matrix([[ 2.9333],
        [ 5.1333],
        [-0.7333]])
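The same system can be solved with plain arrays, and the result checked by substituting it back. A minimal sketch; `coefficients` is an illustrative name for the matrix above.

```python
import numpy as np

coefficients = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
rhs = np.array([11, 22, 33])

solution = np.linalg.solve(coefficients, rhs)

# Substituting the solution back should reproduce the right-hand side.
print(np.allclose(coefficients @ solution, rhs))   # True
print(np.round(solution, 4))
```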

Computing eigenvalues / eigenvectors

Use eig to compute the eigenvalues and right eigenvectors of the given matrix

from numpy.linalg import eig
eigvals, eigvects = eig(my_matrix)
eigvals
eigvects


array([ 13.0858,   2.58  ,  -2.6658])
matrix([[-0.3154, -0.9512, -0.3237],
        [-0.7231,  0.3078, -0.7022],
        [-0.6146,  0.0229,  0.6341]])

Pattern detection

Given a sequence of numbers as an array, find the next number in the sequence.

import numpy as np
seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])
np.diff(seq_array) # calculate first differences
np.diff(seq_array, n=2) # second differences
np.diff(seq_array, n=3) # third differences

array([ 6, 12, 18, 24, 30, 36, 42, 48, 54, 60])
array([6, 6, 6, 6, 6, 6, 6, 6, 6])
array([0, 0, 0, 0, 0, 0, 0, 0])
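Since the second differences are constant, the next term can be extrapolated from them. A sketch under the assumption that the pattern continues; `next_first_diff` and `next_term` are illustrative names.

```python
import numpy as np

seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])

first_diffs = np.diff(seq_array)          # 6, 12, 18, ...
second_diffs = np.diff(seq_array, n=2)    # constant 6

# If the second differences stay constant, the next first difference is
# the last one plus that constant, and the next term follows from it.
next_first_diff = first_diffs[-1] + second_diffs[-1]
next_term = seq_array[-1] + next_first_diff

print(next_term)   # 397
```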

Symbolic Python
Use Jupyter notebooks for rendered output similar to Wolfram or Matlab.
```
from sympy import init_session
init_session()
```

Statistics

Basic statistics: mean, median, min, max, std, var

import scipy as sp
import numpy as np
from scipy.stats import norm

Generate a data set from normally distributed data points.

number_of_data_points = 10000
data_set = np.random.randn(number_of_data_points)
type(data_set)

<class 'numpy.ndarray'>

mean
```
data_set.mean()
```
```
0.0016312105909250228
```

np.median

np.median(data_set)

0.0028936738357498988

min
```
data_set.min()
```
```
-3.4733354484199768
```

max
```
data_set.max()
```
```
3.5011182300650168
```

np.std
```
np.std(data_set)
```
```
1.0046141873536045
```

np.var
```
np.var(data_set)
```
```
1.0092496654321432
```

Probability distributions
- Continuous
  - Normal: norm
  - Chi squared: chi2
  - Student’s T: t
  - Uniform: uniform
- Discrete
  - Poisson: poisson
  - Binomial: binom

Example: Normal Distribution

Print random variates from the IQ distribution

iq_mean = 100
iq_std_dev = 15
iq_distribution = norm(loc=iq_mean, scale=iq_std_dev)
for n in np.arange(8):
    print('{:6.2f}'.format(iq_distribution.rvs()))
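Besides drawing random variates, a frozen distribution also exposes `pdf`, `cdf`, and `ppf`. A short sketch with the same IQ parameters; `within_one_sd` is an illustrative name.

```python
from scipy.stats import norm

iq_distribution = norm(loc=100, scale=15)

# Probability density at the mean.
print(iq_distribution.pdf(100))

# Fraction of the population within one standard deviation of the mean.
within_one_sd = iq_distribution.cdf(115) - iq_distribution.cdf(85)
print(round(within_one_sd, 4))   # 0.6827

# Score below which 95% of the population falls (inverse of the cdf).
print(iq_distribution.ppf(0.95))
```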

Print a histogram

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 100, 15
dataset = mu + sigma * np.random.randn(10_000)

# density=True normalizes the histogram so that the bar areas sum to 1
n, bins, patches = plt.hist(dataset, 50, density=True, facecolor='g', alpha=0.75)
plt.xlabel('IQ Score')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

Pandas

Object creation

Integer index (default)

import pandas as pd
import numpy as np

default_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(default_series)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Datetime index

dates_index = pd.date_range('20170101', periods=6)
print(dates_index)

DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')

Sample numpy data

print(np.arange(5))
print(np.array(np.arange(5)))

sample_data = np.array(np.arange(24)).reshape((6, 4))
print(sample_data)

[0 1 2 3 4]
[0 1 2 3 4]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]

Data Frames

With specified index and column headers

sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))
print(sample_df)

             A   B   C   D
2017-01-01   0   1   2   3
2017-01-02   4   5   6   7
2017-01-03   8   9  10  11
2017-01-04  12  13  14  15
2017-01-05  16  17  18  19
2017-01-06  20  21  22  23

From a python dictionary

py_dict_to_df = pd.DataFrame(
    dict(
        float=1.0,
        time=pd.Timestamp('20170101'),
        series=pd.Series(1, index=list(range(4)), dtype='float32'),
        array=np.array([3] * 4, dtype='int32'),
        categories=pd.Categorical(['test', 'train', 'taxes', 'tools']),
        dull='boring data'))

print(py_dict_to_df)

   array categories         dull  float  series       time
0      3       test  boring data    1.0     1.0 2017-01-01
1      3      train  boring data    1.0     1.0 2017-01-01
2      3      taxes  boring data    1.0     1.0 2017-01-01
3      3      tools  boring data    1.0     1.0 2017-01-01

Attributes info: dtypes

print(py_dict_to_df.dtypes)

array                  int32
categories          category
dull                  object
float                float64
series               float32
time          datetime64[ns]
dtype: object

Peeking: head and tail

print(py_dict_to_df.head())
print(py_dict_to_df.tail(2))

   array categories         dull  float  series       time
0      3       test  boring data    1.0     1.0 2017-01-01
1      3      train  boring data    1.0     1.0 2017-01-01
2      3      taxes  boring data    1.0     1.0 2017-01-01
3      3      tools  boring data    1.0     1.0 2017-01-01
   array categories         dull  float  series       time
2      3      taxes  boring data    1.0     1.0 2017-01-01
3      3      tools  boring data    1.0     1.0 2017-01-01

Underlying data: values, index, columns

values

print(py_dict_to_df.values)
print(sample_df.values)

[[3 'test' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
 [3 'train' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
 [3 'taxes' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
 [3 'tools' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]]

index

print(py_dict_to_df.index)
print(sample_df.index)

Int64Index([0, 1, 2, 3], dtype='int64')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06'],
              dtype='datetime64[ns]', freq='D')

columns

print(py_dict_to_df.columns)
print(sample_df.columns)

Index(['array', 'categories', 'dull', 'float', 'series', 'time'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')

Statistical summary: describe

print(py_dict_to_df.describe())

       array  float  series
count    4.0    4.0     4.0
mean     3.0    1.0     1.0
std      0.0    0.0     0.0
min      3.0    1.0     1.0
25%      3.0    1.0     1.0
50%      3.0    1.0     1.0
75%      3.0    1.0     1.0
max      3.0    1.0     1.0

print(sample_df.describe())

               A          B          C          D
count   6.000000   6.000000   6.000000   6.000000
mean   10.000000  11.000000  12.000000  13.000000
std     7.483315   7.483315   7.483315   7.483315
min     0.000000   1.000000   2.000000   3.000000
25%     5.000000   6.000000   7.000000   8.000000
50%    10.000000  11.000000  12.000000  13.000000
75%    15.000000  16.000000  17.000000  18.000000
max    20.000000  21.000000  22.000000  23.000000

Control floating-point display precision (and other options): set_option

pd.set_option('display.precision', 2)
print(sample_df.describe())

           A      B      C      D
count   6.00   6.00   6.00   6.00
mean   10.00  11.00  12.00  13.00
std     7.48   7.48   7.48   7.48
min     0.00   1.00   2.00   3.00
25%     5.00   6.00   7.00   8.00
50%    10.00  11.00  12.00  13.00
75%    15.00  16.00  17.00  18.00
max    20.00  21.00  22.00  23.00

Transpose: dataframe.T

print(sample_df.T)

   2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
A           0           4           8          12          16          20
B           1           5           9          13          17          21
C           2           6          10          14          18          22
D           3           7          11          15          19          23

Sort by an axis: sort_index

axis: 0 for rows, 1 for cols

# sort columns in descending order
print(sample_df.sort_index(axis=1, ascending=False))

             D   C   B   A
2017-01-01   3   2   1   0
2017-01-02   7   6   5   4
2017-01-03  11  10   9   8
2017-01-04  15  14  13  12
2017-01-05  19  18  17  16
2017-01-06  23  22  21  20

# sort rows in descending order
print(sample_df.sort_index(axis=0, ascending=False))

             A   B   C   D
2017-01-06  20  21  22  23
2017-01-05  16  17  18  19
2017-01-04  12  13  14  15
2017-01-03   8   9  10  11
2017-01-02   4   5   6   7
2017-01-01   0   1   2   3

Sort by data within a given column: sort_values

# sort rows by the 'B' column in descending order; the other columns follow
print(sample_df.sort_values(by='B', ascending=False))

             A   B   C   D
2017-01-06  20  21  22  23
2017-01-05  16  17  18  19
2017-01-04  12  13  14  15
2017-01-03   8   9  10  11
2017-01-02   4   5   6   7
2017-01-01   0   1   2   3

Selecting values

Pandas in production

For production (as opposed to interactive) work, the pandas team recommends the optimized data access methods: .at, .iat, .loc, and .iloc.

.at: Fast label-based scalar accessor.
.iat: Fast integer-location scalar accessor.
.loc: Purely label-location based indexer for selection by label.
.iloc: Purely integer-location based indexing for selection by position.

.ix: A primarily label-location based indexer with integer position fallback. It was deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.

See the docs for more details.
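A compact sketch of the four accessors side by side, on the same kind of frame used in the rest of this section. The result names are illustrative.

```python
import numpy as np
import pandas as pd

dates_index = pd.date_range('20160601', periods=6)
sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)), index=dates_index, columns=list('ABCD'))

# Scalars: .at takes labels, .iat takes integer positions.
by_label = sample_df.at[dates_index[2], 'C']
by_position = sample_df.iat[2, 2]

# Sub-frames: .loc slices by label (end included),
# .iloc slices by position (end excluded).
loc_rows = sample_df.loc['2016-06-02':'2016-06-03', ['A', 'B']]
iloc_rows = sample_df.iloc[1:3, 0:2]

print(by_label, by_position)        # 10 10
print(loc_rows.equals(iloc_rows))   # True
```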

import numpy as np
import pandas as pd

sample_numpy_data = np.array(np.arange(24)).reshape((6, 4))
dates_index = pd.date_range('20160601', periods=6)
sample_df = pd.DataFrame(
    sample_numpy_data, index=dates_index, columns=list('ABCD'))
print(sample_df.head())

             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15
2016-06-05  16  17  18  19

Selection using column name

col_c = sample_df['C']
print(col_c)

2016-06-01     2
2016-06-02     6
2016-06-03    10
2016-06-04    14
2016-06-05    18
2016-06-06    22
Freq: D, Name: C, dtype: int64

Selection using slice

first_4_rows = sample_df[:4]
print(first_4_rows)

             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15

Selection by datetime index

first_four_periods = sample_df['2016-06-01':'2016-06-04']
print(first_four_periods)

             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15

Selection by label

print(dates_index[1:3])
date_selection = sample_df.loc[dates_index[1:3]]
print(date_selection)

DatetimeIndex(['2016-06-02', '2016-06-03'], dtype='datetime64[ns]', freq='D')
            A  B   C   D
2016-06-02  4  5   6   7
2016-06-03  8  9  10  11

Selection (multi-axis) by label

all_rows_of_cols_a_and_b = sample_df.loc[:, ['A', 'B']]
print(all_rows_of_cols_a_and_b)

             A   B
2016-06-01   0   1
2016-06-02   4   5
2016-06-03   8   9
2016-06-04  12  13
2016-06-05  16  17
2016-06-06  20  21

Label slicing, including both endpoints

a_and_b_between_dates = sample_df.loc['2016-06-01':'2016-06-04', ['A', 'B']]
print(a_and_b_between_dates)

             A   B
2016-06-01   0   1
2016-06-02   4   5
2016-06-03   8   9
2016-06-04  12  13

Reduce dimensions of returned object

print(sample_df.loc['2016-06-03', ['D', 'B']])
print(sample_df.loc['2016-06-03', ['B', 'D']])

D    11
B     9
Name: 2016-06-03 00:00:00, dtype: int64
B     9
D    11
Name: 2016-06-03 00:00:00, dtype: int64

Working with result objects

result = sample_df.loc['2016-06-03', ['D', 'B']]
print(result.iloc[0] * 4)

Selecting scalars

print(sample_df.loc[:, 'C'])
print('------------')
print(dates_index[2])
print('------------')
print(sample_df.loc[dates_index[2], 'C'])

2016-06-01     2
2016-06-02     6
2016-06-03    10
2016-06-04    14
2016-06-05    18
2016-06-06    22
Freq: D, Name: C, dtype: int64
------------
2016-06-03 00:00:00
------------
10

Selecting by position: `iloc`

sample_numpy_data[3]

array([12, 13, 14, 15])

sample_df.iloc[3]

A    12
B    13
C    14
D    15
Name: 2016-06-04 00:00:00, dtype: int64

Selecting using integer slices with iloc

sample_df.iloc[1:3, 2:4]

             C   D
2016-06-02   6   7
2016-06-03  10  11

Selecting lists of rows with iloc

sample_df.iloc[[0, 1, 3], [0, 2]]

             A   C
2016-06-01   0   2
2016-06-02   4   6
2016-06-04  12  14

Slicing rows explicitly (selecting all cols implicitly)

sample_df.iloc[1:3, :]

            A  B   C   D
2016-06-02  4  5   6   7
2016-06-03  8  9  10  11

Slicing cols explicitly, all rows implicitly

sample_df.iloc[:, 1:3]

             B   C
2016-06-01   1   2
2016-06-02   5   6
2016-06-03   9  10
2016-06-04  13  14
2016-06-05  17  18
2016-06-06  21  22

Boolean indexing

Test based upon one column’s data

sample_df.C >= 14

2016-06-01    False
2016-06-02    False
2016-06-03    False
2016-06-04     True
2016-06-05     True
2016-06-06     True
Freq: D, Name: C, dtype: bool

Test based upon the entire data set

sample_df
sample_df[sample_df >= 14]

             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15
2016-06-05  16  17  18  19
2016-06-06  20  21  22  23
               A     B     C     D
2016-06-01   NaN   NaN   NaN   NaN
2016-06-02   NaN   NaN   NaN   NaN
2016-06-03   NaN   NaN   NaN   NaN
2016-06-04   NaN   NaN  14.0  15.0
2016-06-05  16.0  17.0  18.0  19.0
2016-06-06  20.0  21.0  22.0  23.0
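Column tests can be combined into a single row mask with `&` and `|`. A minimal sketch; the names `both` and `either` are illustrative.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)),
    index=pd.date_range('20160601', periods=6),
    columns=list('ABCD'))

# Element-wise AND (&) and OR (|); each comparison needs its own parentheses.
both = sample_df[(sample_df['C'] >= 14) & (sample_df['A'] < 20)]
either = sample_df[(sample_df['A'] == 0) | (sample_df['D'] == 23)]

# A row mask keeps whole rows, so no NaN values are introduced.
print(both)
print(either)
```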

isin method

Returns a boolean series showing whether each element in the series is exactly contained in the passed sequence of values.

sample_df_2 = sample_df.copy()
sample_df_2['Fruits'] = [
    'apple', 'orange', 'banana', 'strawberry', 'blueberry', 'pineapple'
]
sample_df_2

             A   B   C   D      Fruits
2016-06-01   0   1   2   3       apple
2016-06-02   4   5   6   7      orange
2016-06-03   8   9  10  11      banana
2016-06-04  12  13  14  15  strawberry
2016-06-05  16  17  18  19   blueberry
2016-06-06  20  21  22  23   pineapple

Generate a boolean vector marking which entries of the given column are in the given set of values.

selection = sample_df_2['Fruits'].isin(['banana', 'pineapple', 'smoothy'])
print(selection)

2016-06-01    False
2016-06-02    False
2016-06-03     True
2016-06-04    False
2016-06-05    False
2016-06-06     True
Freq: D, Name: Fruits, dtype: bool

Select all rows whose value in the given column is in the given set of values.

sample_df_2[selection]

             A   B   C   D     Fruits
2016-06-03   8   9  10  11     banana
2016-06-06  20  21  22  23  pineapple

Missing data

import numpy as np
import pandas as pd

start_date = '20160101'
dates_index = pd.date_range(start_date, periods=6)
sample_data = np.array(np.arange(24)).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))

sample_df_2 = sample_df.copy()
sample_df_2['Fruits'] = 'apple orange banana strawberry blueberry pineapple'.split()

sample_series = pd.Series(
    np.arange(6) + 1, index=pd.date_range(start_date, periods=6))
sample_df_2['Extra Data'] = sample_series * 3 + 1

second_numpy_array = np.array(np.arange(len(sample_df_2))) * 100 + 7
sample_df_2['G'] = second_numpy_array

sample_df_2

             A   B   C   D      Fruits  Extra Data    G
2016-01-01   0   1   2   3       apple           4    7
2016-01-02   4   5   6   7      orange           7  107
2016-01-03   8   9  10  11      banana          10  207
2016-01-04  12  13  14  15  strawberry          13  307
2016-01-05  16  17  18  19   blueberry          16  407
2016-01-06  20  21  22  23   pineapple          19  507

reindex

Creates a copy rather than a view

browser_index = 'Firefox Chrome Safari IE10 Konqueror'.split()

browser_df = pd.DataFrame(
    dict(
        http_status=[200, 200, 404, 404, 301],
        response_time=[0.04, 0.02, 0.07, 0.08, 1.0]),
    index=browser_index)

browser_df

           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

Create a reindexed copy with `reindex`

new_index = 'Safari Iceweasel ComodoDragon IE10 Chrome'.split()
browser_df_2 = browser_df.reindex(new_index)
browser_df_2

              http_status  response_time
Safari              404.0           0.07
Iceweasel             NaN            NaN
ComodoDragon          NaN            NaN
IE10                404.0           0.08
Chrome              200.0           0.02
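
Labels missing from the original index come back as NaN, which also turns the integer column into floats. When a default makes sense, `reindex` accepts `fill_value`. A minimal sketch on a cut-down copy of the browser frame (the name `filled` is illustrative):

```python
import pandas as pd

browser_df = pd.DataFrame(
    dict(http_status=[200, 200, 404], response_time=[0.04, 0.02, 0.07]),
    index=['Firefox', 'Chrome', 'Safari'])

# 'Iceweasel' is not in the original index, so its row is filled with 0
filled = browser_df.reindex(['Safari', 'Iceweasel', 'Chrome'], fill_value=0)

print(filled)
print(filled.dtypes)   # http_status stays an integer column
```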

Drop rows with missing data

browser_df_3 = browser_df_2.dropna(how='any')
browser_df_3

        http_status  response_time
Safari        404.0           0.07
IE10          404.0           0.08
Chrome        200.0           0.02

Fill in missing data

browser_df_2.fillna(value=-0.05555)

              http_status  response_time
Safari          404.00000        0.07000
Iceweasel        -0.05555       -0.05555
ComodoDragon     -0.05555       -0.05555
IE10            404.00000        0.08000
Chrome          200.00000        0.02000

Boolean mask for NA values

pd.isnull(browser_df_2)

              http_status  response_time
Safari              False          False
Iceweasel            True           True
ComodoDragon         True           True
IE10                False          False
Chrome              False          False

NaNs propagate during calculations

browser_df_2 * 3 + 10

              http_status  response_time
Safari             1222.0          10.21
Iceweasel             NaN            NaN
ComodoDragon          NaN            NaN
IE10               1222.0          10.24
Chrome              610.0          10.06
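
Propagation applies to element-wise arithmetic. Reductions such as `sum` and `mean` skip NaN by default, unless `skipna=False` is passed. A short sketch with made-up response times:

```python
import numpy as np
import pandas as pd

times = pd.Series([0.07, np.nan, np.nan, 0.08, 0.02])

shifted = times * 3 + 10            # element-wise: NaN in, NaN out
total = times.sum()                 # NaN values are skipped by default
strict = times.sum(skipna=False)    # NaN as soon as one value is missing

print(shifted.isna().sum(), total, strict)
```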

Operations

Descriptive statistics: describe

pd.set_option('display.precision', 2)
sample_df_2.describe()

           A      B      C      D  Extra Data       G
count   6.00   6.00   6.00   6.00        6.00    6.00
mean   10.00  11.00  12.00  13.00       11.50  257.00
std     7.48   7.48   7.48   7.48        5.61  187.08
min     0.00   1.00   2.00   3.00        4.00    7.00
25%     5.00   6.00   7.00   8.00        7.75  132.00
50%    10.00  11.00  12.00  13.00       11.50  257.00
75%    15.00  16.00  17.00  18.00       15.25  382.00
max    20.00  21.00  22.00  23.00       19.00  507.00

Column mean

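# numeric_only=True skips the string column 'Fruits'; pandas 2.0+ raises a TypeError without it
sample_df_2.mean(numeric_only=True)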

A              10.0
B              11.0
C              12.0
D              13.0
Extra Data     11.5
G             257.0
dtype: float64

Row mean

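# numeric_only=True skips the string column 'Fruits'; pandas 2.0+ raises a TypeError without it
sample_df_2.mean(axis=1, numeric_only=True)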

2016-01-01      2.83
2016-01-02     22.67
2016-01-03     42.50
2016-01-04     62.33
2016-01-05     82.17
2016-01-06    102.00
Freq: D, dtype: float64

Apply a function to a data frame: `apply`

sample_df_2[['A', 'B', 'C', 'Fruits']]

             A   B   C      Fruits
2016-01-01   0   1   2       apple
2016-01-02   4   5   6      orange
2016-01-03   8   9  10      banana
2016-01-04  12  13  14  strawberry
2016-01-05  16  17  18   blueberry
2016-01-06  20  21  22   pineapple

sample_df_2[['A', 'B', 'Fruits']].apply(np.cumsum, axis=0)

             A   B                                         Fruits
2016-01-01   0   1                                          apple
2016-01-02   4   6                                    appleorange
2016-01-03  12  15                              appleorangebanana
2016-01-04  24  28                    appleorangebananastrawberry
2016-01-05  40  45           appleorangebananastrawberryblueberry
2016-01-06  60  66  appleorangebananastrawberryblueberrypineapple

sample_df_2[['A', 'B', 'C']].apply(np.cumsum, axis=1)

             A   B   C
2016-01-01   0   1   3
2016-01-02   4   9  15
2016-01-03   8  17  27
2016-01-04  12  25  39
2016-01-05  16  33  51
2016-01-06  20  41  63
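
`apply` also accepts any callable that reduces a column or a row to a single value, in which case the result is a Series. A minimal sketch computing the value range along each axis (the frame and the lambdas are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(24).reshape((6, 4)), columns=list('ABCD'))

# axis=0 (default): the function receives one column at a time
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.tolist())
print(row_range.tolist())
```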

String methods

series = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
series.str.lower()
series.str.len()


0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

0    1.0
1    1.0
2    1.0
3    4.0
4    4.0
5    NaN
6    4.0
7    3.0
8    3.0
dtype: float64
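
The `.str` methods pass NaN through, as the outputs above show. For tests such as `str.contains`, the `na` argument decides what a missing value should count as, which keeps the result usable as a boolean mask. A small sketch (the name `has_a` is illustrative):

```python
import numpy as np
import pandas as pd

series = pd.Series(['A', 'Aaba', np.nan, 'CABA', 'dog'])

# Lower-case first so the match ignores case; treat NaN as "no match"
has_a = series.str.lower().str.contains('a', na=False)

print(has_a.tolist())
print(series[has_a].tolist())
```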

Merging data frames

import numpy as np
import pandas as pd
import random as rand

index = np.arange(1, 7)
attrs = 'clicks height score time'.split()
# 24 distinct random integers below 50; the output shown is one possible draw
values = rand.sample(range(50), 24)
sample_data = np.array(values).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=index, columns=attrs)
sample_df

   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
4      40      48     36     0
5       6      47      1     3
6      14      43     13    34

`concat`

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

pd.concat([sample_df[0:3], sample_df[0:3]])

   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35

pd.concat([sample_df.iloc[0:3], sample_df.iloc[0:3]], axis=1)

   clicks  height  score  time  clicks  height  score  time
1      19      39     11    17      19      39     11    17
2       7      25     26    44       7      25     26    44
3       2       8     21    35       2       8     21    35

`join`

Join columns with another DataFrame, either on the index or on a key column. Passing a list joins multiple DataFrame objects by index in a single call.

sample_df.join(sample_df.iloc[0:3], how='inner', rsuffix='_r')

   clicks  height  score  time  clicks_r  height_r  score_r  time_r
1      19      39     11    17        19        39       11      17
2       7      25     26    44         7        25       26      44
3       2       8     21    35         2         8       21      35

`append`

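Append the rows of another frame to the end of this frame, returning a new object. Columns not in this frame are added as new columns. `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the call below runs only on older versions.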

new_row = pd.DataFrame(dict(clicks=10, height=20, score=30, time=40), index=[10])
sample_df.iloc[0:3].append(new_row)

    clicks  height  score  time
1       19      39     11    17
2        7      25     26    44
3        2       8     21    35
10      10      20     30    40
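
On pandas 2.0 and later, where `DataFrame.append` no longer exists, the same result can be obtained with `pd.concat`. A minimal sketch on a two-column frame (the names `base` and `extended` are illustrative):

```python
import pandas as pd

base = pd.DataFrame(dict(clicks=[19, 7, 2], height=[39, 25, 8]), index=[1, 2, 3])
new_row = pd.DataFrame(dict(clicks=10, height=20), index=[10])

# concat stacks the frames row-wise and returns a new object; base is unchanged
extended = pd.concat([base, new_row])

print(extended)
```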

`merge`

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. With no `on` argument, `merge` joins on the columns the two frames have in common, here `height`.

sample_df

   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
4      40      48     36     0
5       6      47      1     3
6      14      43     13    34

entries = {
    1: dict(height=10, width=20), 2: dict(height=34, width=35),
    3: dict(height=5, width=80), 4: dict(height=39, width=32),
}
related_df = pd.DataFrame(entries)
related_df.T

   height  width
1      10     20
2      34     35
3       5     80
4      39     32

sample_df.merge(related_df.T)

   clicks  height  score  time  width
0      19      39     11    17     32

sample_df.merge(related_df.T, how='left')

   clicks  height  score  time  width
0      19      39     11    17   32.0
1       7      25     26    44    NaN
2       2       8     21    35    NaN
3      40      48     36     0    NaN
4       6      47      1     3    NaN
5      14      43     13    34    NaN

sample_df.merge(related_df.T, how='outer')

   clicks  height  score  time  width
0    19.0      39   11.0  17.0   32.0
1     7.0      25   26.0  44.0    NaN
2     2.0       8   21.0  35.0    NaN
3    40.0      48   36.0   0.0    NaN
4     6.0      47    1.0   3.0    NaN
5    14.0      43   13.0  34.0    NaN
6     NaN      10    NaN   NaN   20.0
7     NaN      34    NaN   NaN   35.0
8     NaN       5    NaN   NaN   80.0
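
Naming the key with `on` makes the join column explicit, and `indicator=True` adds a `_merge` column recording which side each row came from. A sketch with fixed values so the counts are reproducible (the frames `left` and `right` are illustrative stand-ins for the ones above):

```python
import pandas as pd

left = pd.DataFrame(dict(clicks=[19, 7, 2], height=[39, 25, 8]))
right = pd.DataFrame(dict(height=[10, 34, 5, 39], width=[20, 35, 80, 32]))

# Inner join: only height == 39 occurs in both frames
inner = left.merge(right, on='height')

# Outer join with provenance: 'both', 'left_only' or 'right_only'
outer = left.merge(right, on='height', how='outer', indicator=True)
counts = outer['_merge'].value_counts()

print(inner)
print(counts)
```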

Categoricals

import numpy as np
import pandas as pd
from io import StringIO

csv_data = """
Department,Name,YearsOfService,Grade\n0,Marketing,Able,4,a\n1,Engineering,Baker,7,b\n2,Accounting,Charlie,12,c\n3,Marketing,Delta,1,d\n4,Engineering,Echo,15,f\n5,Accounting,Foxtrot,9,a\n6,Marketing,Golf,3,b\n7,Engineering,Hotel,1,c\n8,Accounting,India,2,d\n9,Marketing,Juliet,5,f\n10,Engineering,Kilo,7,a\n11,Accounting,Lima,11,b\n12,Marketing,Mike,2,c\n13,Engineering,November,3,d\n14,Accounting,Oscar,4,f\n15,Marketing,Papa,9,a\n16,Engineering,Quebec,1,b\n17,Accounting,Romeo,1,c\n18,Marketing,Sierra,1,d\n19,Engineering,Tango,7,f\n20,Accounting,Uniform,5,a\n21,Marketing,Victor,19,b\n22,Engineering,Whiskey,2,c\n23,Accounting,Xray,3,d\n24,Marketing,Yankee,8,f\n25,Engineering,Zulu,17,a\n
"""

employees = pd.read_csv(StringIO(csv_data))
employees.head()

    Department     Name  YearsOfService Grade
0    Marketing     Able               4     a
1  Engineering    Baker               7     b
2   Accounting  Charlie              12     c
3    Marketing    Delta               1     d
4  Engineering     Echo              15     f

Convert String data to categorical data

employees.dtypes

Department        object
Name              object
YearsOfService     int64
Grade             object
dtype: object

employees['Department'] = employees['Department'].astype('category')
employees.dtypes

Department        category
Name                object
YearsOfService       int64
Grade               object
dtype: object

Rename categories

employees['Grade'] = employees['Grade'].astype('category')
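# rename_categories returns a new Series; assigning to .cat.categories no longer works in recent pandas
employees['Grade'] = employees['Grade'].cat.rename_categories(
    'excellent good acceptable poor unacceptable'.split())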
employees.head()

    Department     Name  YearsOfService         Grade
0    Marketing     Able               4     excellent
1  Engineering    Baker               7          good
2   Accounting  Charlie              12    acceptable
3    Marketing    Delta               1          poor
4  Engineering     Echo              15  unacceptable

Categories before and after renaming:

# Index(['a', 'b', 'c', 'd', 'f'], dtype='object')
# Index(['excellent', 'good', 'acceptable', 'poor', 'unacceptable'], dtype='object')
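
Renaming by position relies on the categories being in sorted order ('a' to 'f'). `rename_categories` also accepts a dict, which states the mapping explicitly. A minimal sketch on a short series of grades (the names `grades`, `labels` and `renamed` are illustrative):

```python
import pandas as pd

grades = pd.Series(list('abcdfa')).astype('category')

# Explicit old -> new mapping, independent of category order
labels = dict(a='excellent', b='good', c='acceptable', d='poor', f='unacceptable')
renamed = grades.cat.rename_categories(labels)

print(list(grades.cat.categories))
print(renamed.tolist())
```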

Grouping

Total years of service across the employees in each department.

employees.groupby('Department').sum(numeric_only=True)

             YearsOfService
Department
Accounting               47
Engineering              60
Marketing                52

Number of employees per grade.

employees.groupby('Grade').count()['Name']

Grade
excellent       6
good            5
acceptable      5
poor            5
unacceptable    5
Name: Name, dtype: int64

Number of employees, by department, obtaining each grade.

employees.groupby(['Grade', 'Department']).count()['Name']

Grade         Department
excellent     Accounting     2
              Engineering    2
              Marketing      2
good          Accounting     1
              Engineering    2
              Marketing      2
acceptable    Accounting     2
              Engineering    2
              Marketing      1
poor          Accounting     2
              Engineering    1
              Marketing      2
unacceptable  Accounting     1
              Engineering    2
              Marketing      2
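
Several statistics can be computed per group in one pass by selecting a column and calling `agg` with a list of function names. A sketch on a six-row subset of the employee data (the name `summary` is illustrative):

```python
import pandas as pd

employees = pd.DataFrame(dict(
    Department=['Marketing', 'Engineering', 'Accounting',
                'Marketing', 'Engineering', 'Accounting'],
    YearsOfService=[4, 7, 12, 1, 15, 9]))

# One row per department, one column per statistic
summary = employees.groupby('Department')['YearsOfService'].agg(['sum', 'mean', 'max'])

print(summary)
```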

Time series resampling

Create a date range to use as an index: `pandas.date_range`

my_index = pd.date_range('9/1/2016', periods=9, freq='min')
my_index

DatetimeIndex(['2016-09-01 00:00:00', '2016-09-01 00:01:00',
               '2016-09-01 00:02:00', '2016-09-01 00:03:00',
               '2016-09-01 00:04:00', '2016-09-01 00:05:00',
               '2016-09-01 00:06:00', '2016-09-01 00:07:00',
               '2016-09-01 00:08:00'],
              dtype='datetime64[ns]', freq='T')

Create a time series that includes a simple pattern: `pandas.Series`

my_series = pd.Series(np.arange(9), index=my_index)
my_series

2016-09-01 00:00:00    0
2016-09-01 00:01:00    1
2016-09-01 00:02:00    2
2016-09-01 00:03:00    3
2016-09-01 00:04:00    4
2016-09-01 00:05:00    5
2016-09-01 00:06:00    6
2016-09-01 00:07:00    7
2016-09-01 00:08:00    8
Freq: T, dtype: int64

Downsampling: `pandas.Series.resample`

my_series.resample('3min')

DatetimeIndexResampler [freq=<3 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]

my_series.resample('3min').sum()

2016-09-01 00:00:00     3
2016-09-01 00:03:00    12
2016-09-01 00:06:00    21
Freq: 3T, dtype: int64

Use upper bound for each time period as the label.

my_series.resample('3min', label='right').sum()

2016-09-01 00:03:00     3
2016-09-01 00:06:00    12
2016-09-01 00:09:00    21
Freq: 3T, dtype: int64

Close the right side of the bin interval.

my_series.resample('3min', label='right', closed='right').sum()

2016-09-01 00:00:00     0
2016-09-01 00:03:00     6
2016-09-01 00:06:00    15
2016-09-01 00:09:00    15
Freq: 3T, dtype: int64

Upsampling

my_series.resample('30s').asfreq().head()

2016-09-01 00:00:00    0.0
2016-09-01 00:00:30    NaN
2016-09-01 00:01:00    1.0
2016-09-01 00:01:30    NaN
2016-09-01 00:02:00    2.0
Freq: 30S, dtype: float64
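
`asfreq` leaves the new time slots empty. Two common ways to fill them are carrying the last value forward with `ffill` and interpolating linearly with `interpolate`. A minimal sketch on a three-point series (the names `held` and `interp` are illustrative):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('9/1/2016', periods=3, freq='min')
short_series = pd.Series(np.arange(3), index=idx)

held = short_series.resample('30s').ffill()           # repeat the last known value
interp = short_series.resample('30s').interpolate()   # linear interpolation between points

print(held.tolist())
print(interp.tolist())
```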

Custom function to use with resampling

def custom_arithmetic(array_like):
    temp = 3 * np.sum(array_like) + 5
    return temp

my_series.resample('3min').apply(custom_arithmetic)

2016-09-01 00:00:00    14
2016-09-01 00:03:00    41
2016-09-01 00:06:00    68
Freq: 3T, dtype: int64

Series

Create series

my_simple_series = pd.Series(np.random.randn(5), index=list('abcde'))
my_simple_series

a    1.186168
b    0.606623
c    1.862614
d   -1.180305
e    0.615774
dtype: float64

my_dictionary = dict(a=45, b=-19.5, c=4444)
my_second_series = pd.Series(my_dictionary)
my_second_series

a      45.0
b     -19.5
c    4444.0
dtype: float64

pd.Series(my_dictionary, index=list('bcda'))

b     -19.5
c    4444.0
d       NaN
a      45.0
dtype: float64

my_dictionary.get('a')

legit = my_dictionary.get('a')
type(legit)
unknown = my_dictionary.get('f')
type(unknown)

<class 'int'>
<class 'NoneType'>

Create a series from a scalar

pd.Series(5, index=list('abcd'))

a    5
b    5
c    5
d    5
dtype: int64

Vectorized operations

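A key difference between Series and ndarrays is that operations between Series automatically align data by label rather than by position. The two results below agree only because both operands share the same index.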

my_series.head() + my_series.head()

2016-09-01 00:00:00    0
2016-09-01 00:01:00    2
2016-09-01 00:02:00    4
2016-09-01 00:03:00    6
2016-09-01 00:04:00    8
Freq: T, dtype: int64

np.array(my_series.head()) + np.array(my_series.head())

array([0, 2, 4, 6, 8])
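
The difference shows up once the labels are ordered differently or only partly overlap. A small sketch with hand-made series (the names `left`, `right`, `by_label`, `by_position` and `partial` are illustrative):

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list('abc'))
right = pd.Series([10, 20, 30], index=list('cba'))

by_label = left + right                    # a: 1 + 30, b: 2 + 20, c: 3 + 10
by_position = left.values + right.values   # plain ndarrays add position by position

# Labels present on only one side produce NaN
partial = left + pd.Series([100], index=['a'])

print(by_label.to_dict())
print(by_position.tolist())
print(partial.to_dict())
```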

Date arithmetic

from datetime import datetime
now = datetime.now()
now

datetime.datetime(2017, 9, 22, 14, 30, 18, 504458)

Time deltas

delta = now - datetime(2001, 1, 1)
delta

datetime.timedelta(6108, 52218, 504458)

delta.days

pd.Timedelta(6108, unit='d')

Timedelta('6108 days 00:00:00')

Range from timedelta

us_memorial_day = datetime(2016, 5, 30)
us_labor_day = datetime(2016, 9, 5)
us_summer_2016 = us_labor_day - us_memorial_day
us_summer_2016

datetime.timedelta(98)

summer_2016_days = pd.date_range(
    us_memorial_day, periods=us_summer_2016.days, freq='D')
summer_2016_days[:4]
summer_2016_days[-4:]

DatetimeIndex(['2016-05-30', '2016-05-31', '2016-06-01', '2016-06-02'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2016-09-01', '2016-09-02', '2016-09-03', '2016-09-04'], dtype='datetime64[ns]', freq='D')
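
Using `periods=us_summer_2016.days` stops one day short of Labor Day, as the last output shows. Passing the end date instead gives a range that includes both endpoints. A minimal sketch comparing the two (the names `half_open` and `inclusive` are illustrative):

```python
from datetime import datetime

import pandas as pd

start = datetime(2016, 5, 30)   # Memorial Day 2016
end = datetime(2016, 9, 5)      # Labor Day 2016
span = end - start

half_open = pd.date_range(start, periods=span.days, freq='D')   # ends the day before `end`
inclusive = pd.date_range(start, end, freq='D')                 # includes `end`

print(len(half_open), half_open[-1])
print(len(inclusive), inclusive[-1])
```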

Data Frames and Panels

Creating data frames from various source types

vals = dict(a=40, b=29, c=292, d=-5.03)
pd.DataFrame(vals, index='first again'.split())

        a   b    c     d
first  40  29  292 -5.03
again  40  29  292 -5.03

Without an explicit index

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
pd.DataFrame(series_dict)

   a    b   c
0  4    9   3
1  5  322  45
2  6  455  22

Dictionary of tuples, with a MultiIndex

dict_of_tuples = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'c'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}
}
pd.DataFrame(dict_of_tuples)

     a        b
     a  b  c  a  b
A B  1  1  1  1  1
  C  2  2  2  2  2

Create panels

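3D analogues of DataFrames. `Panel` was deprecated in pandas 0.20 and removed in 0.25, and the `.ix` indexer was removed in 1.0, so the examples in this section run only on older versions. A DataFrame with a MultiIndex covers the same use cases.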

Initialized natively

pd.Panel(np.random.randn(2, 5, 4),
         items='item1 item2'.split(),
         major_axis=pd.date_range('9/6/2016', periods=5),
         minor_axis=list('ABCD'))

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 2016-09-06 00:00:00 to 2016-09-10 00:00:00
Minor_axis axis: A to D

Initialized from a dictionary of data frames

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10
dict_of_dfs = dict(df1=df1, df2=df2)
pd.Panel(dict_of_dfs)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c

from_dict factory method

panel = pd.Panel.from_dict(dict_of_dfs, orient='minor')

pd.Panel.from_dict(dict_of_dfs, orient='items')

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c

panel.ix[:, 0, :]

      a   b   c
df1   4   9   3
df2  14  19  13

panel.ix['a':, 0, :'df1']

     a  b  c
df1  4  9  3
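
On pandas versions without `Panel`, one possible substitute is to stack the frames into a single DataFrame with a two-level index using `pd.concat`. The sketch below reproduces the first-row selection shown above; the level names `item` and `row` are assumptions chosen for illustration.

```python
import pandas as pd

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10

# Dict keys become the outer index level, the original row labels the inner one
stacked = pd.concat(dict(df1=df1, df2=df2), names=['item', 'row'])

# Cross-section: row 0 of every frame, indexed by frame name
first_rows = stacked.xs(0, level='row')

print(stacked.shape)
print(first_rows)
```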

SciPy

[WIP]

scikit-learn

[WIP]

