Python: scientific libraries

23 May 2024 - Houston

Contents
  1. NumPy
    1. Version
    2. Print options
    3. Creating from Python data structures
    4. Creating using NumPy methods
    5. Operations are performed element-wise
    6. linspace, zeros, ones, data types
      1. linspace
      2. zeros
      3. ones
      4. ndarray
    7. Slicing and iterating
      1. Arrays
      2. Multidimensional arrays: shape
      3. 3-D Arrays
    8. Boolean mask arrays
    9. Broadcasting
      1. Matrix attributes
      2. Scalars
      3. Vectors
    10. Structured and record arrays
      1. Structured arrays for data definition
      2. Record arrays: A wrapper around structured arrays
    11. Views and copies
      1. Views
      2. Copies
    12. Array attributes
    13. Add and remove elements
      1. append
      2. append to a specific axis
      3. Horizontal stacking: hstack
      4. insert
      5. delete
    14. Joining and splitting
      1. concatenate
      2. stack
      3. split
    15. Rearrange elements
      1. fliplr
      2. flipud
      3. roll
      4. rot90
    16. Transpose-like operations
      1. transpose
      2. swapaxes
      3. rollaxis
    17. Applications
      1. Universal functions
      2. Linear algebra
      3. Pattern detection
      4. Statistics
  2. Pandas
    1. Object creation
    2. Data Frames
    3. Selecting values
      1. Pandas in production
      2. Selection using column name
      3. Selection using slice
      4. Selection by datetime index
      5. Selection by label
      6. Selection (multi-axis) by label
      7. Label slicing, including both endpoints
      8. Reduce dimensions of returned object
      9. Working with result objects
      10. Selecting scalars
      11. Selecting by position: iloc
      12. Boolean indexing
      13. Missing data
      14. Operations
    4. Merging data frames
      1. concat
      2. join
      3. append
      4. merge
    5. Categoricals
      1. Convert String data to categorical data
    6. Grouping
    7. Time series resampling
      1. Create a date range to use as an index: pandas.date_range
      2. Create a time series that includes a simple pattern: pandas.Series
      3. Downsampling: pandas.resample
      4. Upsampling
    8. Series
      1. Create series
      2. Vectorized operations
      3. Date arithmetic
    9. Data Frames and Panels
      1. Creating data frames from various source types
      2. Create panels
  3. SciPy
  4. scikit-learn

NumPy

Version

import numpy as np
np.__version__
'1.12.1'

Print options

np.set_printoptions(precision=4)

Creating from Python data structures

Every element in a NumPy array must have the same type.

np.array([1, 2, 3, 4, 5])
array([1, 2, 3, 4, 5])

Data type promotion (all elements converted to consistent type):

mixed_nums = (14, -3.54, 5+7j)
np.array(mixed_nums)
array([ 14.00+0.j,  -3.54+0.j,   5.00+7.j])
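The promotion rule above can be checked directly. The following is a minimal sketch; the variable names are illustrative only and do not come from the original notes.

```python
import numpy as np

# Integers and floats promote to float64.
int_and_float = np.array([1, 2.5])

# Adding a complex value promotes every element to complex128.
mixed_nums = np.array((14, -3.54, 5 + 7j))

# np.result_type reports the promoted dtype without building an array.
promoted = np.result_type(np.int64, np.float64, np.complex128)

print(int_and_float.dtype)  # float64
print(mixed_nums.dtype)     # complex128
print(promoted)             # complex128
```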

Creating using NumPy methods

np.arange(10, step=2)
array([0, 2, 4, 6, 8])
np.arange(5, 10) + 1
array([ 6,  7,  8,  9, 10])
len(np.arange(0, 10, 2))
np.arange(10).size
5
10
np.arange(24, 25)
array([24])

Operations are performed element-wise

np.array([1, 2, 3, 4]) * 10
array([10, 20, 30, 40])

linspace, zeros, ones, data types

linspace

Returns num (default 50) evenly spaced numbers over the given interval. The interval is closed by default: the stop value is included.

np.linspace(5, 10, 9).size
9

Pass retstep=True to also return the step size between entries:

np.linspace(5, 15, 3, retstep=True)
(array([  5.,  10.,  15.]), 5.0)
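To make the closed-interval behaviour concrete, the sketch below contrasts linspace with and without the endpoint, and compares the half-open form with arange. The names are illustrative assumptions.

```python
import numpy as np

# linspace includes the stop value by default...
closed = np.linspace(0, 1, 5)

# ...unless endpoint=False is passed, which matches arange's half-open interval.
half_open = np.linspace(0, 1, 5, endpoint=False)
stepped = np.arange(0, 1, 0.2)

print(closed[-1])                       # 1.0
print(np.allclose(half_open, stepped))  # True
```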

zeros

np.zeros(5)
array([ 0.,  0.,  0.,  0.,  0.])
np.zeros((5, 3))
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
np.zeros(11, dtype='int64')
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

ones

np.ones(5)
array([ 1.,  1.,  1.,  1.,  1.])
np.ones((3, 2))
array([[ 1.,  1.],
       [ 1.,  1.],
       [ 1.,  1.]])

ndarray

np.ndarray(shape=(2, 4), dtype=float)
array([[  1.2882e-231,   1.2882e-231,   1.2882e-231,   1.2882e-231],
       [  1.2882e-231,   1.2882e-231,   1.7587e-310,   3.5098e+064]])
np.ndarray(shape=(3,), dtype=int, order='C')
array([4617315517961601024, 4621819117588971520, 4624633867356078080])
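The values shown above are arbitrary because calling np.ndarray directly allocates memory without initializing it. As a hedged sketch, np.empty is the usual function for the same purpose, and np.zeros or np.full give predictable contents. The names below are illustrative.

```python
import numpy as np

# np.empty allocates without initializing, like calling np.ndarray directly;
# the contents are whatever happened to be in memory.
scratch = np.empty((2, 4), dtype=float)

# np.full and np.zeros give predictable contents.
filled = np.full((2, 4), 7.0)
zeroed = np.zeros((2, 4))

print(scratch.shape)  # (2, 4)
print(filled[0, 0])   # 7.0
```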

Slicing and iterating

Arrays

np_arr = np.array([-17, -4, 0, 2, 21])
np_arr[0]
np_arr[-1]
np_arr[-1] = 33
np_arr
-17
21
array([-17,  -4,   0,   2,  33])

Multidimensional arrays: shape

matr = np.arange(35)
matr.shape = (7, 5)
matr[2]
matr[2, 3]
matr[2][3]
array([10, 11, 12, 13, 14])
13
13

3-D Arrays

array_3d = np.arange(70)
array_3d.shape = (2, 7, 5)
array_3d

array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29],
        [30, 31, 32, 33, 34]],

       [[35, 36, 37, 38, 39],
        [40, 41, 42, 43, 44],
        [45, 46, 47, 48, 49],
        [50, 51, 52, 53, 54],
        [55, 56, 57, 58, 59],
        [60, 61, 62, 63, 64],
        [65, 66, 67, 68, 69]]])
array_3d[1][4][3]
58
array_3d[1][4][3] = 1111
array_3d

array([[[   0,    1,    2,    3,    4],
        [   5,    6,    7,    8,    9],
        [  10,   11,   12,   13,   14],
        [  15,   16,   17,   18,   19],
        [  20,   21,   22,   23,   24],
        [  25,   26,   27,   28,   29],
        [  30,   31,   32,   33,   34]],

       [[  35,   36,   37,   38,   39],
        [  40,   41,   42,   43,   44],
        [  45,   46,   47,   48,   49],
        [  50,   51,   52,   53,   54],
        [  55,   56,   57, 1111,   59],
        [  60,   61,   62,   63,   64],
        [  65,   66,   67,   68,   69]]])

Boolean mask arrays

vector = np.array([-17, -4, 0, 21, 37])
divisible_by_7_mask = (vector % 7) == 0
divisible_by_7_mask
array([False, False,  True,  True, False], dtype=bool)
vector[vector % 7 == 0]
array([ 0, 21])
div_by_3_test = vector % 3 == 0
positive_test = vector > 0
combined_test = np.logical_and(div_by_3_test, positive_test)
vector[combined_test]
array([21])
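The same masks can be combined with the element-wise operators &, | and ~ instead of np.logical_and. This is a small sketch reusing the vector above; the result names are illustrative.

```python
import numpy as np

vector = np.array([-17, -4, 0, 21, 37])

# Parentheses are required: & and | bind tighter than the comparisons.
div_by_3_and_positive = vector[(vector % 3 == 0) & (vector > 0)]
negative_or_large = vector[(vector < 0) | (vector > 30)]
not_div_by_7 = vector[~(vector % 7 == 0)]

# Assigning through a mask modifies only the selected elements.
clipped = vector.copy()
clipped[clipped < 0] = 0

print(div_by_3_and_positive)  # [21]
```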

Broadcasting

Broadcasting is how NumPy handles operations between arrays of different shapes.

Matrix attributes

my_3d_array = np.arange(70)
my_3d_array.shape = (2, 7, 5)
my_3d_array.ndim
my_3d_array.size
my_3d_array.dtype
3
70
dtype('int64')

Scalars

5 * my_3d_array - 2
array([[[ -2,   3,   8,  13,  18],
        [ 23,  28,  33,  38,  43],
        [ 48,  53,  58,  63,  68],
        [ 73,  78,  83,  88,  93],
        [ 98, 103, 108, 113, 118],
        [123, 128, 133, 138, 143],
        [148, 153, 158, 163, 168]],

       [[173, 178, 183, 188, 193],
        [198, 203, 208, 213, 218],
        [223, 228, 233, 238, 243],
        [248, 253, 258, 263, 268],
        [273, 278, 283, 288, 293],
        [298, 303, 308, 313, 318],
        [323, 328, 333, 338, 343]]])
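Broadcasting also applies between two arrays, not only between an array and a scalar. The sketch below shows the general rule under the assumption that shapes are compared from the trailing axis; all names are illustrative.

```python
import numpy as np

column = np.arange(3).reshape((3, 1))   # shape (3, 1)
row = np.arange(4).reshape((1, 4))      # shape (1, 4)

# Size-1 axes are stretched to match, so the result has shape (3, 4).
table = column * 10 + row

# Shapes are aligned from the right: (7, 5) and (5,) are compatible.
matrix_7x5 = np.arange(35).reshape((7, 5))
shifted = matrix_7x5 - np.arange(5)

# Incompatible trailing dimensions raise ValueError.
try:
    matrix_7x5 + np.arange(7)
    message = ''
except ValueError as err:
    message = str(err)

print(table.shape)  # (3, 4)
```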

Vectors

  • inner product: np.inner and np.dot

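    np.inner: inner product of two arrays. For 1-D arrays this is the ordinary inner product of vectors. For higher dimensions it sums over the last axis of both arrays, so the last dimensions must match.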

    left_matrix = np.arange(6).reshape((2, 3))
    right_matrix = np.arange(15).reshape((3, 5))
    np.inner(left_matrix, right_matrix)
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: shapes (2,3) and (5,3) not aligned: 3 (dim 1) != 5 (dim 0)
    

    For 2-D arrays, use np.dot to get the matrix product. For 1-D arrays, np.dot also gives the inner product of vectors.

    np.dot(left_matrix, right_matrix)
    
    array([[ 25,  28,  31,  34,  37],
           [ 70,  82,  94, 106, 118]])
    
  • Operations along axes

    one_to_three = np.arange(3) + 1
    matrix = [one_to_three, one_to_three, one_to_three]
    np.array(one_to_three).sum()
    np.array(matrix).sum() # sum all elements
    np.array(matrix).sum(axis=0) # collapse axis 0: sum down each column
    np.array(matrix).sum(axis=1) # collapse axis 1: sum across each row
    
    6
    18
    array([3, 6, 9])
    array([6, 6, 6])
    
    my_3d_array.sum(axis=0)
    
    array([[ 35,  37,  39,  41,  43],
           [ 45,  47,  49,  51,  53],
           [ 55,  57,  59,  61,  63],
           [ 65,  67,  69,  71,  73],
           [ 75,  77,  79,  81,  83],
           [ 85,  87,  89,  91,  93],
           [ 95,  97,  99, 101, 103]])
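The relationship between np.inner and np.dot described above can be sketched as follows. It reuses left_matrix and right_matrix from the example; the other names are illustrative.

```python
import numpy as np

left_matrix = np.arange(6).reshape((2, 3))
right_matrix = np.arange(15).reshape((3, 5))

# np.dot contracts the last axis of the left operand with the
# second-to-last axis of the right operand (ordinary matrix product).
product = np.dot(left_matrix, right_matrix)

# np.inner contracts the last axis of both operands, so it needs the
# transpose of the right operand to give the same result.
via_inner = np.inner(left_matrix, right_matrix.T)

# The @ operator is equivalent to np.dot for 2-D arrays.
via_operator = left_matrix @ right_matrix

print(np.array_equal(product, via_inner))  # True
```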
    

Structured and record arrays

Structured arrays for data definition

person_data_def = [('name', 'S6'), ('height', 'f8'), ('weight', 'f8'), ('age', 'i8')]
people_array = np.zeros((4), dtype=person_data_def)
people_array[0] = ('Alpha', 65, 112, 25)
people_array[1] = ('Beta', 43, 128, 33)
people_array[2] = ('Gamma', 29, 188, 35)
people_array[3] = ('Delta', 73, 205, 34)
people_array
array([(b'Alpha',  65.,  112., 25), (b'Beta',  43.,  128., 33),
       (b'Gamma',  29.,  188., 35), (b'Delta',  73.,  205., 34)],
      dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
  • Accessing data in structured arrays

    people_array[2:]
    
    array([(b'Gamma',  29.,  188., 35), (b'Delta',  73.,  205., 34)],
          dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
    
    ages = people_array['age']
    ages
    
    array([25, 33, 35, 34])
    

Record arrays: A wrapper around structured arrays

Access fields as attributes instead of by index:

person_record_array = np.rec.array(people_array)
person_record_array
rec.array([
 (b'Alpha',  65.,  112., 25),
 (b'Beta',  43.,  128., 33),
 (b'Gamma',  29.,  188., 35),
 (b'Delta',  73.,  205., 34)
],
 dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
person_record_array[0].age
25
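Fields of a structured array can also drive sorting and filtering. A minimal sketch, assuming the same field layout as person_data_def above; the derived names are illustrative.

```python
import numpy as np

person_data_def = [('name', 'S6'), ('height', 'f8'), ('weight', 'f8'), ('age', 'i8')]
people_array = np.array(
    [('Alpha', 65, 112, 25), ('Beta', 43, 128, 33),
     ('Gamma', 29, 188, 35), ('Delta', 73, 205, 34)],
    dtype=person_data_def)

# Sort whole records by one field.
by_age = np.sort(people_array, order='age')

# Filter records with a boolean mask built from a field.
over_30 = people_array[people_array['age'] > 30]

# A record-array view exposes the same fields as attributes.
people_records = people_array.view(np.recarray)
mean_height = people_records.height.mean()

print(by_age['name'])  # names ordered by age
print(mean_height)     # 52.5
```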

Views and copies

Assigning an array to a new variable creates a new reference, not a new array: both names point to the same NumPy object in memory, with the same underlying data.

import numpy as np
mi_casa = np.array([-45, -31, -12, 0])
su_casa = mi_casa
# same object
id(mi_casa)
id(su_casa)
mi_casa is su_casa

# equal values
mi_casa == su_casa

# values remain in sync when mutated
su_casa[0] = 100
mi_casa is su_casa
mi_casa == su_casa
4642127504
4642127504
True
array([ True,  True,  True,  True], dtype=bool)
True
array([ True,  True,  True,  True], dtype=bool)

Views

view() returns a new array object that shares the original's data buffer (a shallow copy).

dog_house = mi_casa.view()

dog_house is mi_casa # new object at different location
dog_house == mi_casa # same values as original

mi_casa[0] = 345
dog_house is mi_casa # still a new object
dog_house == mi_casa # values remain in sync
False
array([ True,  True,  True,  True], dtype=bool)
False
array([ True,  True,  True,  True], dtype=bool)

Copies

copy() returns a new array with its own data buffer, so changes to one array do not affect the other.

tree_house = mi_casa.copy()

# different object, same values
tree_house is mi_casa
tree_house == mi_casa

# values are distinct
tree_house[0] = 983798739
tree_house is mi_casa
tree_house == mi_casa
False
array([ True,  True,  True,  True], dtype=bool)
False
array([False,  True,  True,  True], dtype=bool)
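Identity checks with `is` cannot tell a view from a copy, since both are new objects. A small sketch of two ways to check for shared data, np.shares_memory and the .base attribute, follows. The name front_rooms is an illustrative addition.

```python
import numpy as np

mi_casa = np.array([-45, -31, -12, 0])
dog_house = mi_casa.view()    # new array object, same data buffer
tree_house = mi_casa.copy()   # new array object, new data buffer
front_rooms = mi_casa[:2]     # basic slicing also returns a view

print(np.shares_memory(mi_casa, dog_house))   # True
print(np.shares_memory(mi_casa, tree_house))  # False
print(front_rooms.base is mi_casa)            # True

# Writing through a view changes the original; writing to a copy does not.
front_rooms[0] = 100
tree_house[1] = -1
print(mi_casa)  # [100 -31 -12   0]
```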

Array attributes

import numpy as np
arr = np.array(np.arange(24)).reshape((2, 3, 4))
arr
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
  • ndim

    arr.ndim
    
    3
    
  • shape

    arr.shape
    
    (2, 3, 4)
    
  • size

    arr.size
    
    24
    
  • dtype

    arr.dtype
    
    dtype('int64')
    
  • itemsize

    arr.itemsize
    
    8
    

Add and remove elements

append

Appends values to a copy of the given array. Without an axis argument, both inputs are flattened first, so the shape of the given array is not maintained. Returns a copy, not a view.

arr
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
arr2 = np.append(arr, [5, 6, 7, 8])
arr2.shape

(28,)

Use reshape to restore a 2-D layout:

arr2.reshape((7, 4))
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [ 5,  6,  7,  8]])

append to a specific axis

matrix = np.array(np.arange(9)).reshape((3, 3))
matrix

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

Append new matrix as new rows:

new_matrix = np.array(np.arange(9) + 10).reshape((3, 3))
np.append(matrix, new_matrix, axis=0)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

Append new matrix as new columns:

hstack = np.append(matrix, new_matrix, axis=1)
hstack

array([[ 0,  1,  2, 10, 11, 12],
       [ 3,  4,  5, 13, 14, 15],
       [ 6,  7,  8, 16, 17, 18]])

Horizontal stacking: hstack

Convenience function for concatenating along the second axis (the first axis for 1-D arrays). Returns a copy, not a view, the same as append.

a = np.array(np.arange(9)).reshape((3, 3))
b = np.array(np.arange(9) + 10).reshape((3, 3))
haystack = np.hstack((a, b))
haystack

array([[ 0,  1,  2, 10, 11, 12],
       [ 3,  4,  5, 13, 14, 15],
       [ 6,  7,  8, 16, 17, 18]])

insert

Inserts values before the given index along the given axis. Returns a new array containing the inserted data.

arr
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
np.insert(arr, 1, 444, axis=0)
array([[[  0,   1,   2,   3],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[444, 444, 444, 444],
        [444, 444, 444, 444],
        [444, 444, 444, 444]],

       [[ 12,  13,  14,  15],
        [ 16,  17,  18,  19],
        [ 20,  21,  22,  23]]])
np.insert(arr, 1, 444, axis=1)
array([[[  0,   1,   2,   3],
        [444, 444, 444, 444],
        [  4,   5,   6,   7],
        [  8,   9,  10,  11]],

       [[ 12,  13,  14,  15],
        [444, 444, 444, 444],
        [ 16,  17,  18,  19],
        [ 20,  21,  22,  23]]])

delete

arr
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

Delete element 1 at axis 0:

np.delete(arr, 1, axis=0)
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]])

Delete element 0 at axis 1:

np.delete(arr, 0, axis=1)
array([[[ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[16, 17, 18, 19],
        [20, 21, 22, 23]]])

Delete element 2 at axis 2:

np.delete(arr, 2, axis=2)
array([[[ 0,  1,  3],
        [ 4,  5,  7],
        [ 8,  9, 11]],

       [[12, 13, 15],
        [16, 17, 19],
        [20, 21, 23]]])
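Two further properties of np.delete are worth a quick check: it accepts a list of indices, and it never changes its input. A minimal sketch, with illustrative result names:

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

# Delete several indices at once by passing a list.
trimmed = np.delete(arr, [0, 3], axis=2)

# np.delete returns a new array; the input keeps its shape.
print(trimmed.shape)  # (2, 3, 2)
print(arr.shape)      # (2, 3, 4)

# Without axis, the input is flattened first.
flat = np.delete(arr, [0, 1])
print(flat.shape)     # (22,)
```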

Joining and splitting

concatenate

Returns a copy not a view

import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
np.concatenate((arr1, arr2), axis=0)
np.concatenate((arr1, arr2), axis=1)
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])
array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

stack

np.stack([arr1, arr2], axis=0)
array([[[1, 2], [3, 4]],
       [[5, 6], [7, 8]]])

split

temp = np.arange(9).reshape((3, 3))
np.split(temp, 3, axis=0)
[array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]
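The three functions above fit together: split and concatenate are inverses along the same axis, while stack introduces a new axis. The sketch below also uses np.array_split, which is not covered in the notes above, to show the uneven case. Names are illustrative.

```python
import numpy as np

temp = np.arange(9).reshape((3, 3))

# Splitting and concatenating along the same axis is a round trip.
pieces = np.split(temp, 3, axis=1)          # three (3, 1) columns
rebuilt = np.concatenate(pieces, axis=1)

# np.split needs an even division; np.array_split tolerates a remainder.
uneven = np.array_split(np.arange(7), 3)    # lengths 3, 2, 2

# np.stack adds a new axis, np.concatenate joins along an existing one.
stacked = np.stack(pieces, axis=0)          # shape (3, 3, 1)

print(np.array_equal(rebuilt, temp))  # True
```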

Rearrange elements

fliplr

Reverse order of elements along the second axis

orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.fliplr(orig_array)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
array([[ 4,  3,  2,  1,  0],
       [ 9,  8,  7,  6,  5],
       [14, 13, 12, 11, 10]])
orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.fliplr(orig_array)

array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])

array([[[2, 3], [0, 1]],
       [[6, 7], [4, 5]]])

flipud

Reverse order of elements along the first axis

orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.flipud(orig_array)
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
array([[10, 11, 12, 13, 14],
       [ 5,  6,  7,  8,  9],
       [ 0,  1,  2,  3,  4]])
orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.flipud(orig_array)

array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])

array([[[4, 5], [6, 7]],
       [[0, 1], [2, 3]]])

roll

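Shift elements n positions, wrapping around at the end. Without an axis argument the array is flattened, shifted, and restored to its original shape. Elements move toward higher indices for n > 0 and toward lower indices for n < 0.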

arr
np.roll(arr, 4)
print('------')
np.roll(arr, 5)
np.roll(arr, 1)
print('------')
np.roll(arr, -1)
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
array([[[20, 21, 22, 23],
        [ 0,  1,  2,  3],
        [ 4,  5,  6,  7]],

       [[ 8,  9, 10, 11],
        [12, 13, 14, 15],
        [16, 17, 18, 19]]])
------
array([[[19, 20, 21, 22],
        [23,  0,  1,  2],
        [ 3,  4,  5,  6]],

       [[ 7,  8,  9, 10],
        [11, 12, 13, 14],
        [15, 16, 17, 18]]])
array([[[23,  0,  1,  2],
        [ 3,  4,  5,  6],
        [ 7,  8,  9, 10]],

       [[11, 12, 13, 14],
        [15, 16, 17, 18],
        [19, 20, 21, 22]]])
------
array([[[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12]],

       [[13, 14, 15, 16],
        [17, 18, 19, 20],
        [21, 22, 23,  0]]])
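All the calls above omit the axis argument. As a hedged sketch of the alternative, passing axis makes elements wrap within that axis only. The result names are illustrative.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

# With axis given, elements wrap around within that axis only.
shifted_cols = np.roll(arr, 1, axis=2)
shifted_rows = np.roll(arr, 1, axis=1)

print(shifted_cols[0, 0])  # [3 0 1 2]

# Without axis, roll is equivalent to flattening, shifting, and reshaping.
flat_roll = np.roll(arr.ravel(), 4).reshape(arr.shape)
print(np.array_equal(flat_roll, np.roll(arr, 4)))  # True
```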

rot90

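Rotate by 90 degrees counterclockwise in the plane of the first two axes. k sets the number of quarter turns; a negative k rotates clockwise.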

orig_array
array([[[0, 1], [2, 3]],
       [[4, 5], [6, 7]]])
np.rot90(orig_array)
array([[[2, 3], [6, 7]],
       [[0, 1], [4, 5]]])
np.rot90(orig_array, k=-1)
array([[[4, 5], [0, 1]],
       [[6, 7], [2, 3]]])
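The rotation direction is easier to see on a small 2-D array. In this sketch, grid is an illustrative array that does not appear in the notes above.

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))

# One counterclockwise quarter turn equals a transpose followed by flipud.
ccw = np.rot90(grid)
same_ccw = np.flipud(grid.T)

# k=-1 (or k=3) is one clockwise quarter turn.
cw = np.rot90(grid, k=-1)

print(ccw)
# [[2 5]
#  [1 4]
#  [0 3]]
```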

Transpose-like operations

arr_3x8 = np.array(np.arange(24)).reshape((3, 8))
arr_3x8
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23]])

transpose

np.transpose(arr_3x8)
array([[ 0,  8, 16],
       [ 1,  9, 17],
       [ 2, 10, 18],
       [ 3, 11, 19],
       [ 4, 12, 20],
       [ 5, 13, 21],
       [ 6, 14, 22],
       [ 7, 15, 23]])
np.transpose(arr_3x8, axes=(1, 0))
array([[ 0,  8, 16],
       [ 1,  9, 17],
       [ 2, 10, 18],
       [ 3, 11, 19],
       [ 4, 12, 20],
       [ 5, 13, 21],
       [ 6, 14, 22],
       [ 7, 15, 23]])

swapaxes

arr_3x2x4 = np.array(np.arange(24)).reshape((3, 2, 4))
arr_3x2x4
np.swapaxes(arr_3x2x4, axis1=0, axis2=2)

array([[[ 0,  1,  2,  3], [ 4,  5,  6,  7]],
       [[ 8,  9, 10, 11], [12, 13, 14, 15]],
       [[16, 17, 18, 19], [20, 21, 22, 23]]])
array([[[ 0,  8, 16], [ 4, 12, 20]],
       [[ 1,  9, 17], [ 5, 13, 21]],
       [[ 2, 10, 18], [ 6, 14, 22]],
       [[ 3, 11, 19], [ 7, 15, 23]]])

rollaxis

Rolls the given axis backwards to the start position (0 by default). A view is returned.

mat_4d = np.ones((3, 4, 5, 6))
rolled = np.rollaxis(mat_4d, 1)
rolled.shape
(4, 3, 5, 6)
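NumPy's documentation recommends np.moveaxis over np.rollaxis for new code. A short sketch of the equivalent call, together with the explicit np.transpose permutation, follows. The result names are illustrative.

```python
import numpy as np

mat_4d = np.ones((3, 4, 5, 6))

# Move axis 1 to the front; same result as np.rollaxis(mat_4d, 1).
moved = np.moveaxis(mat_4d, 1, 0)
print(moved.shape)  # (4, 3, 5, 6)

# np.transpose with an explicit permutation does the same job.
permuted = np.transpose(mat_4d, axes=(1, 0, 2, 3))
print(permuted.shape)  # (4, 3, 5, 6)

# Both results are views onto the original data.
print(np.shares_memory(mat_4d, moved))  # True
```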

Applications

Universal functions

  • np.frompyfunc

    ufunc: a function that operates on ndarrays element-wise, supporting array broadcasting, type casting, and other standard features. np.frompyfunc creates a vectorized wrapper for a Python function.

    import numpy as np
    
    def truncated_binomial(x):
        return (x + 1)**3 - x**3
    
    truncated_binomial(4)
    
    61
    

    Arguments to np.frompyfunc: the Python function, its number of input arguments, and the number of objects it returns.

    nums = np.ones(6).reshape((2, 3)) * 4
    nums
    trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
    trunc_binom(nums)
    
    array([[ 4.,  4.,  4.],
           [ 4.,  4.,  4.]])
    array([[61.0, 61.0, 61.0],
           [61.0, 61.0, 61.0]], dtype=object)
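Note the dtype=object in the output above: np.frompyfunc always returns object arrays. The sketch below shows one way to get a numeric dtype back, and that this particular function already works on arrays without a wrapper. The names as_float and direct are illustrative.

```python
import numpy as np

def truncated_binomial(x):
    return (x + 1)**3 - x**3

nums = np.ones(6).reshape((2, 3)) * 4

# frompyfunc returns dtype=object; cast when numbers are needed.
trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
as_float = trunc_binom(nums).astype(float)

# The function only uses arithmetic operators, so it already works
# element-wise on arrays and keeps a numeric dtype.
direct = truncated_binomial(nums)

print(as_float.dtype, direct.dtype)  # float64 float64
```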
    

Linear algebra

  • Matrices

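    np.matrix exposes common operations as properties (.T, .I) rather than function calls. Note that NumPy's documentation no longer recommends np.matrix; regular arrays are preferred for new code.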

    my_matrix = np.matrix([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
    my_matrix
    
    matrix([[3, 1, 4],
            [1, 5, 9],
            [2, 6, 5]])
    
    • Transpose:

      my_matrix.T
      
      matrix([[3, 1, 2],
              [1, 5, 6],
              [4, 9, 5]])
      
    • Inverse:

      my_matrix.I
      
      matrix([[ 0.3222, -0.2111,  0.1222],
              [-0.1444, -0.0778,  0.2556],
              [ 0.0444,  0.1778, -0.1556]])
      
  • Identity matrices

    np.eye(3, dtype=int)
    
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
    
  • Solving systems of linear equations

    \begin{align} A \mathbf{x} &= \mathbf{b} \\ A^{-1} A \mathbf{x} &= A^{-1}\mathbf{b} \\ \mathbf{x} &= A^{-1}\mathbf{b} \end{align}

    my_matrix
    rhs = np.matrix([[11], [22], [33]])
    inverse = my_matrix.I
    solution = inverse * rhs
    solution
    
    matrix([[3, 1, 4],
            [1, 5, 9],
            [2, 6, 5]])
    matrix([[ 2.9333],
            [ 5.1333],
            [-0.7333]])
    

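    Preferred version: solve the system directly instead of forming the inverse, which is faster and numerically more stable: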

    from numpy.linalg import solve
    solve(my_matrix, rhs)
    
    matrix([[ 2.9333],
            [ 5.1333],
            [-0.7333]])
    
  • Computing eigenvalues / eigenvectors

    Use eig to compute the eigenvalues and right eigenvectors of the given matrix

    from numpy.linalg import eig
    eigvals, eigvects = eig(my_matrix)
    eigvals
    eigvects
    
    
    array([ 13.0858,   2.58  ,  -2.6658])
    matrix([[-0.3154, -0.9512, -0.3237],
            [-0.7231,  0.3078, -0.7022],
            [-0.6146,  0.0229,  0.6341]])
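The linear algebra results above can be verified with plain ndarrays and the @ operator. This is a minimal sketch; coeffs, residual, and eig_errors are illustrative names, and the matrix and right-hand side are the ones used above.

```python
import numpy as np

coeffs = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
rhs = np.array([11, 22, 33])

# Solve A x = b, then substitute back: A x - b should be close to zero.
solution = np.linalg.solve(coeffs, rhs)
residual = coeffs @ solution - rhs
print(np.allclose(residual, 0))  # True

# Each column of eigvects is a right eigenvector: A v = lambda v.
eigvals, eigvects = np.linalg.eig(coeffs)
eig_errors = [
    np.abs(coeffs @ eigvects[:, i] - eigvals[i] * eigvects[:, i]).max()
    for i in range(len(eigvals))
]
print(np.allclose(eig_errors, 0))  # True

# The eigenvalues sum to the trace of the matrix.
print(np.isclose(eigvals.sum(), np.trace(coeffs)))  # True
```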
    

Pattern detection

Given a sequence of numbers as an array, find the next number in the sequence.

import numpy as np
seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])
np.diff(seq_array) # calculate first differences
np.diff(seq_array, n=2) # second differences
np.diff(seq_array, n=3) # third differences
array([ 6, 12, 18, 24, 30, 36, 42, 48, 54, 60])
array([6, 6, 6, 6, 6, 6, 6, 6, 6])
array([0, 0, 0, 0, 0, 0, 0, 0])
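The differences above only diagnose the pattern. To actually produce the next number, one possible approach (a sketch, not the only method) is to extend the difference table by one step. The quadratic fit with np.polyfit is an added cross-check that is not part of the original notes.

```python
import numpy as np

seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])

first_diff = np.diff(seq_array)
second_diff = np.diff(seq_array, n=2)

# The second differences are constant, so the next first difference is the
# last one plus that constant, and the next term follows from it.
next_first_diff = first_diff[-1] + second_diff[-1]
next_term = seq_array[-1] + next_first_diff
print(next_term)  # 397

# Cross-check: constant second differences imply a quadratic sequence.
n = np.arange(len(seq_array))
quadratic = np.polyfit(n, seq_array, deg=2)
fitted_next = np.polyval(quadratic, len(seq_array))
print(round(float(fitted_next)))  # 397
```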
  • Symbolic Python

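    SymPy's init_session() starts an interactive symbolic session with pretty-printed output, similar to Wolfram or Matlab. It works best in a Jupyter notebook.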

    from sympy import init_session
    init_session()
    

Statistics

  • Basic statistics: mean, median, min, max, std, var

    import scipy as sp
    import numpy as np
    from scipy.stats import norm
    

    Generate a data set from normally distributed data points.

    number_of_data_points = 10000
    data_set = np.random.randn(number_of_data_points)  # NumPy equivalent of the deprecated sp.randn
    type(data_set)
    
    <class 'numpy.ndarray'>
    
    • mean

      data_set.mean()
      
      0.0016312105909250228
      
    • np.median

      np.median(data_set)
      
      0.0028936738357498988
      
    • min

      data_set.min()
      
      -3.4733354484199768
      
    • max

      data_set.max()
      
      3.5011182300650168
      
    • np.std

      np.std(data_set)
      
      1.0046141873536045
      
    • np.var

      np.var(data_set)
      
      1.0092496654321432
      
  • Probability distributions

    • Continuous
      • Normal: norm
      • Chi squared: chi2
      • Student’s T: t
      • Uniform: uniform
    • Discrete
      • Poisson: poisson
      • Binomial: binom
  • Example: Normal Distribution

    Print random variates from the IQ distribution

    iq_mean = 100
    iq_std_dev = 15
    iq_distribution = norm(loc=iq_mean, scale=iq_std_dev)
    for n in np.arange(8):
        print('{:6.2f}'.format(iq_distribution.rvs()))
    
     98.10
    112.58
    107.61
    111.16
     95.47
     71.72
    112.54
     87.38
    

    Plot a histogram

    import numpy as np
    import matplotlib.pyplot as plt
    
    mu, sigma = 100, 15
    dataset = mu + sigma * np.random.randn(10_000)
    
    # density=True replaces the normed argument, which newer Matplotlib releases no longer accept
    n, bins, patches = plt.hist(dataset, 50, density=True, facecolor='g', alpha=0.75)
    plt.xlabel('IQ Score')
    plt.ylabel('Probability')
    plt.title('Histogram of IQ')
    plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
    plt.axis([40, 160, 0, 0.03])
    plt.grid(True)
    plt.show()
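One detail behind the std and var figures in the basic statistics above: NumPy defaults to the population formulas. The sketch below uses a seeded generator so it is repeatable, which is an assumption beyond the original notes, and compares the population and sample estimates.

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # seeded for repeatable output
data_set = rng.standard_normal(10_000)

# np.std and np.var default to the population formulas (ddof=0).
population_std = np.std(data_set)
# ddof=1 gives the sample (Bessel-corrected) estimate.
sample_std = np.std(data_set, ddof=1)

# Variance is the square of the standard deviation in both cases.
print(np.isclose(np.var(data_set), population_std**2))      # True
print(np.isclose(np.var(data_set, ddof=1), sample_std**2))  # True
print(sample_std > population_std)                          # True
```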
    

Pandas

Object creation

  • Integer index (default)

    import pandas as pd
    import numpy as np
    
    default_series = pd.Series([1, 3, 5, np.nan, 6, 8])
    print(default_series)
    
    0    1.0
    1    3.0
    2    5.0
    3    NaN
    4    6.0
    5    8.0
    dtype: float64
    
  • Datetime index

    dates_index = pd.date_range('20170101', periods=6)
    print(dates_index)
    
    DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                   '2017-01-05', '2017-01-06'],
                  dtype='datetime64[ns]', freq='D')
    
  • Sample numpy data

    print(np.arange(5))
    print(np.array(np.arange(5)))
    
    sample_data = np.array(np.arange(24)).reshape((6, 4))
    print(sample_data)
    
    [0 1 2 3 4]
    [0 1 2 3 4]
    [[ 0  1  2  3]
     [ 4  5  6  7]
     [ 8  9 10 11]
     [12 13 14 15]
     [16 17 18 19]
     [20 21 22 23]]
    

Data Frames

  • With specified index and column headers

    sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))
    print(sample_df)
    
                 A   B   C   D
    2017-01-01   0   1   2   3
    2017-01-02   4   5   6   7
    2017-01-03   8   9  10  11
    2017-01-04  12  13  14  15
    2017-01-05  16  17  18  19
    2017-01-06  20  21  22  23
    
  • From a python dictionary

    py_dict_to_df = pd.DataFrame(
        dict(
            float=1.0,
            time=pd.Timestamp('20170101'),
            series=pd.Series(1, index=list(range(4)), dtype='float32'),
            array=np.array([3] * 4, dtype='int32'),
            categories=pd.Categorical(['test', 'train', 'taxes', 'tools']),
            dull='boring data'))
    
    print(py_dict_to_df)
    
       array categories         dull  float  series       time
    0      3       test  boring data    1.0     1.0 2017-01-01
    1      3      train  boring data    1.0     1.0 2017-01-01
    2      3      taxes  boring data    1.0     1.0 2017-01-01
    3      3      tools  boring data    1.0     1.0 2017-01-01
    
  • Attributes info: dtypes

    print(py_dict_to_df.dtypes)
    
    array                  int32
    categories          category
    dull                  object
    float                float64
    series               float32
    time          datetime64[ns]
    dtype: object
    
  • Peeking: head and tail

    print(py_dict_to_df.head())
    print(py_dict_to_df.tail(2))
    
       array categories         dull  float  series       time
    0      3       test  boring data    1.0     1.0 2017-01-01
    1      3      train  boring data    1.0     1.0 2017-01-01
    2      3      taxes  boring data    1.0     1.0 2017-01-01
    3      3      tools  boring data    1.0     1.0 2017-01-01
       array categories         dull  float  series       time
    2      3      taxes  boring data    1.0     1.0 2017-01-01
    3      3      tools  boring data    1.0     1.0 2017-01-01
    
  • Underlying data: values, index, columns

    • values

      print(py_dict_to_df.values)
      print(sample_df.values)
      
      [[3 'test' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
       [3 'train' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
       [3 'taxes' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
       [3 'tools' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]]
      [[ 0  1  2  3]
       [ 4  5  6  7]
       [ 8  9 10 11]
       [12 13 14 15]
       [16 17 18 19]
       [20 21 22 23]]
      
    • index

      print(py_dict_to_df.index)
      print(sample_df.index)
      
      Int64Index([0, 1, 2, 3], dtype='int64')
      DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
                     '2017-01-05', '2017-01-06'],
                    dtype='datetime64[ns]', freq='D')
      
    • columns

      print(py_dict_to_df.columns)
      print(sample_df.columns)
      
      Index(['array', 'categories', 'dull', 'float', 'series', 'time'], dtype='object')
      Index(['A', 'B', 'C', 'D'], dtype='object')
      
  • Statistical summary: describe

    print(py_dict_to_df.describe())
    
           array  float  series
    count    4.0    4.0     4.0
    mean     3.0    1.0     1.0
    std      0.0    0.0     0.0
    min      3.0    1.0     1.0
    25%      3.0    1.0     1.0
    50%      3.0    1.0     1.0
    75%      3.0    1.0     1.0
    max      3.0    1.0     1.0
    
    print(sample_df.describe())
    
                   A          B          C          D
    count   6.000000   6.000000   6.000000   6.000000
    mean   10.000000  11.000000  12.000000  13.000000
    std     7.483315   7.483315   7.483315   7.483315
    min     0.000000   1.000000   2.000000   3.000000
    25%     5.000000   6.000000   7.000000   8.000000
    50%    10.000000  11.000000  12.000000  13.000000
    75%    15.000000  16.000000  17.000000  18.000000
    max    20.000000  21.000000  22.000000  23.000000
    
  • Control floating-point display precision (and other options): set_option

    pd.set_option('display.precision', 2)
    print(sample_df.describe())
    
               A      B      C      D
    count   6.00   6.00   6.00   6.00
    mean   10.00  11.00  12.00  13.00
    std     7.48   7.48   7.48   7.48
    min     0.00   1.00   2.00   3.00
    25%     5.00   6.00   7.00   8.00
    50%    10.00  11.00  12.00  13.00
    75%    15.00  16.00  17.00  18.00
    max    20.00  21.00  22.00  23.00
    
  • Transpose: dataframe.T

    print(sample_df.T)
    
       2017-01-01  2017-01-02  2017-01-03  2017-01-04  2017-01-05  2017-01-06
    A           0           4           8          12          16          20
    B           1           5           9          13          17          21
    C           2           6          10          14          18          22
    D           3           7          11          15          19          23
    
  • Sort by an axis: sort_index

    axis: 0 for rows, 1 for cols

    # sort columns in descending order
    print(sample_df.sort_index(axis=1, ascending=False))
    
                 D   C   B   A
    2017-01-01   3   2   1   0
    2017-01-02   7   6   5   4
    2017-01-03  11  10   9   8
    2017-01-04  15  14  13  12
    2017-01-05  19  18  17  16
    2017-01-06  23  22  21  20
    
    # sort rows in descending order
    print(sample_df.sort_index(axis=0, ascending=False))
    
                 A   B   C   D
    2017-01-06  20  21  22  23
    2017-01-05  16  17  18  19
    2017-01-04  12  13  14  15
    2017-01-03   8   9  10  11
    2017-01-02   4   5   6   7
    2017-01-01   0   1   2   3
    
  • Sort by data within a given column: sort_values

    # sort the 'B' column in descending order, adjust others to match
    print(sample_df.sort_values(by='B', ascending=False))
    
                 A   B   C   D
    2017-01-06  20  21  22  23
    2017-01-05  16  17  18  19
    2017-01-04  12  13  14  15
    2017-01-03   8   9  10  11
    2017-01-02   4   5   6   7
    2017-01-01   0   1   2   3
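Sorting returns a new frame and leaves the original untouched, and sort_values accepts several keys. A small sketch of both points follows; the group column is a hypothetical addition made only for this example.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)),
    index=pd.date_range('20170101', periods=6),
    columns=list('ABCD'))

# Sorting returns a new frame; the original order is untouched.
by_b_desc = sample_df.sort_values(by='B', ascending=False)
print(by_b_desc.index[0])   # 2017-01-06 00:00:00
print(sample_df.index[0])   # 2017-01-01 00:00:00

# Several sort keys, each with its own direction (hypothetical extra column).
with_group = sample_df.assign(group=[1, 0, 1, 0, 1, 0])
by_group_then_a = with_group.sort_values(by=['group', 'A'], ascending=[True, False])
print(by_group_then_a['A'].tolist())  # [20, 12, 4, 16, 8, 0]
```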
    

Selecting values

Pandas in production

For production (as opposed to interactive) work, the pandas team recommends the optimized data access methods: .at, .iat, .loc, and .iloc.

  • .at: Fast label-based scalar accessor.

  • .iat: Fast integer-location scalar accessor.

  • .loc: Purely label-location based indexer for selection by label.

  • .iloc: Purely integer-location based indexing for selection by position.

  • .ix: A primarily label-location based indexer, with integer position fallback. Deprecated in pandas 0.20 and removed in pandas 1.0; use .loc or .iloc instead.

    See the docs for more details.

    import numpy as np
    import pandas as pd
    
    sample_numpy_data = np.array(np.arange(24)).reshape((6, 4))
    dates_index = pd.date_range('20160601', periods=6)
    sample_df = pd.DataFrame(
        sample_numpy_data, index=dates_index, columns=list('ABCD'))
    print(sample_df.head())
    
                 A   B   C   D
    2016-06-01   0   1   2   3
    2016-06-02   4   5   6   7
    2016-06-03   8   9  10  11
    2016-06-04  12  13  14  15
    2016-06-05  16  17  18  19
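The accessors listed above can be compared side by side on the same frame. This is a minimal sketch; the result names are illustrative.

```python
import numpy as np
import pandas as pd

dates_index = pd.date_range('20160601', periods=6)
sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)), index=dates_index, columns=list('ABCD'))

# Scalar access by label and by position.
by_label = sample_df.at[dates_index[2], 'C']
by_position = sample_df.iat[2, 2]
print(by_label, by_position)  # 10 10

# Row and column selection by label and by position.
rows_by_label = sample_df.loc['2016-06-02':'2016-06-03', ['A', 'B']]
rows_by_position = sample_df.iloc[1:3, 0:2]
print(rows_by_label.equals(rows_by_position))  # True
```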
    

Selection using column name

col_c = sample_df['C']
print(col_c)
2016-06-01     2
2016-06-02     6
2016-06-03    10
2016-06-04    14
2016-06-05    18
2016-06-06    22
Freq: D, Name: C, dtype: int64

Selection using slice

first_4_rows = sample_df[:4]
print(first_4_rows)
             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15

Selection by datetime index

first_four_periods = sample_df['2016-06-01':'2016-06-04']
print(first_four_periods)
             A   B   C   D
2016-06-01   0   1   2   3
2016-06-02   4   5   6   7
2016-06-03   8   9  10  11
2016-06-04  12  13  14  15

Selection by label

print(dates_index[1:3])
date_selection = sample_df.loc[dates_index[1:3]]
print(date_selection)
DatetimeIndex(['2016-06-02', '2016-06-03'], dtype='datetime64[ns]', freq='D')
            A  B   C   D
2016-06-02  4  5   6   7
2016-06-03  8  9  10  11

Selection (multi-axis) by label

all_rows_of_cols_a_and_b = sample_df.loc[:, ['A', 'B']]
print(all_rows_of_cols_a_and_b)
             A   B
2016-06-01   0   1
2016-06-02   4   5
2016-06-03   8   9
2016-06-04  12  13
2016-06-05  16  17
2016-06-06  20  21

Label slicing, including both endpoints

a_and_b_between_dates = sample_df.loc['2016-06-01':'2016-06-04', ['A', 'B']]
print(a_and_b_between_dates)
             A   B
2016-06-01   0   1
2016-06-02   4   5
2016-06-03   8   9
2016-06-04  12  13

Reduce dimensions of returned object

print(sample_df.loc['2016-06-03', ['D', 'B']])
print(sample_df.loc['2016-06-03', ['B', 'D']])
D    11
B     9
Name: 2016-06-03 00:00:00, dtype: int64
B     9
D    11
Name: 2016-06-03 00:00:00, dtype: int64

Working with result objects

result = sample_df.loc['2016-06-03', ['D', 'B']]
# Use .iloc for positional access; result[0] on a label-indexed Series is deprecated.
print(result.iloc[0] * 4)
44

Selecting scalars

print(sample_df.loc[:, 'C'])
print('------------')
print(dates_index[2])
print('------------')
print(sample_df.loc[dates_index[2], 'C'])
2016-06-01     2
2016-06-02     6
2016-06-03    10
2016-06-04    14
2016-06-05    18
2016-06-06    22
Freq: D, Name: C, dtype: int64
------------
2016-06-03 00:00:00
------------
10

Selecting by position: iloc

sample_numpy_data[3]
array([12, 13, 14, 15])
sample_df.iloc[3]
A    12
B    13
C    14
D    15
Name: 2016-06-04 00:00:00, dtype: int64
  • Selecting using integer slices with iloc

    sample_df.iloc[1:3, 2:4]
    
                 C   D
    2016-06-02   6   7
    2016-06-03  10  11
    
  • Selecting lists of rows with iloc

    sample_df.iloc[[0, 1, 3], [0, 2]]
    
                 A   C
    2016-06-01   0   2
    2016-06-02   4   6
    2016-06-04  12  14
    
  • Slicing rows explicitly (selecting all cols implicitly)

    sample_df.iloc[1:3, :]
    
                A  B   C   D
    2016-06-02  4  5   6   7
    2016-06-03  8  9  10  11
    
  • Slicing cols explicitly, all rows implicitly

    sample_df.iloc[:, 1:3]
    
                 B   C
    2016-06-01   1   2
    2016-06-02   5   6
    2016-06-03   9  10
    2016-06-04  13  14
    2016-06-05  17  18
    2016-06-06  21  22
    

Boolean indexing

  • Test based upon one column’s data

    sample_df.C >= 14
    
    2016-06-01    False
    2016-06-02    False
    2016-06-03    False
    2016-06-04     True
    2016-06-05     True
    2016-06-06     True
    Freq: D, Name: C, dtype: bool
    
  • Test based upon the entire data set

    sample_df
    sample_df[sample_df >= 14]
    
                 A   B   C   D
    2016-06-01   0   1   2   3
    2016-06-02   4   5   6   7
    2016-06-03   8   9  10  11
    2016-06-04  12  13  14  15
    2016-06-05  16  17  18  19
    2016-06-06  20  21  22  23
                   A     B     C     D
    2016-06-01   NaN   NaN   NaN   NaN
    2016-06-02   NaN   NaN   NaN   NaN
    2016-06-03   NaN   NaN   NaN   NaN
    2016-06-04   NaN   NaN  14.0  15.0
    2016-06-05  16.0  17.0  18.0  19.0
    2016-06-06  20.0  21.0  22.0  23.0
    
  • isin method

    Returns a boolean series showing whether each element in the series is exactly contained in the passed sequence of values.

    sample_df_2 = sample_df.copy()
    sample_df_2['Fruits'] = [
        'apple', 'orange', 'banana', 'strawberry', 'blueberry', 'pineapple'
    ]
    sample_df_2
    
                 A   B   C   D      Fruits
    2016-06-01   0   1   2   3       apple
    2016-06-02   4   5   6   7      orange
    2016-06-03   8   9  10  11      banana
    2016-06-04  12  13  14  15  strawberry
    2016-06-05  16  17  18  19   blueberry
    2016-06-06  20  21  22  23   pineapple
    

    Generate a boolean Series marking the rows whose value in the given column is one of the given values.

    selection = sample_df_2['Fruits'].isin(['banana', 'pineapple', 'smoothy'])
    print(selection)
    
    2016-06-01    False
    2016-06-02    False
    2016-06-03     True
    2016-06-04    False
    2016-06-05    False
    2016-06-06     True
    Freq: D, Name: Fruits, dtype: bool
    

    Select all rows whose value in the given column is one of the given values.

    sample_df_2[selection]
    
                 A   B   C   D     Fruits
    2016-06-03   8   9  10  11     banana
    2016-06-06  20  21  22  23  pineapple
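
A boolean Series can also be used directly as a row mask, and several masks can be combined. The following is a rough sketch on the same sample data; the names is_berry, berries_with_large_c and not_berries are made up for illustration.

```python
import numpy as np
import pandas as pd

dates_index = pd.date_range('20160601', periods=6)
sample_df_2 = pd.DataFrame(
    np.arange(24).reshape((6, 4)), index=dates_index, columns=list('ABCD'))
sample_df_2['Fruits'] = [
    'apple', 'orange', 'banana', 'strawberry', 'blueberry', 'pineapple']

# A boolean Series used as a row mask keeps only the rows marked True.
rows_c_at_least_14 = sample_df_2[sample_df_2['C'] >= 14]

# Combine masks with & (and), | (or) and ~ (not).
# Each comparison needs its own parentheses.
is_berry = sample_df_2['Fruits'].isin(['strawberry', 'blueberry'])
berries_with_large_c = sample_df_2[(sample_df_2['C'] >= 14) & is_berry]
not_berries = sample_df_2[~is_berry]

print(rows_c_at_least_14)
print(berries_with_large_c)
print(not_berries)
```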
    

Missing data

import numpy as np
import pandas as pd

start_date = '20160101'
dates_index = pd.date_range(start_date, periods=6)
sample_data = np.arange(24).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))

sample_df_2 = sample_df.copy()
sample_df_2['Fruits'] = 'apple orange banana strawberry blueberry pineapple'.split()

sample_series = pd.Series(
    np.arange(6) + 1, index=pd.date_range(start_date, periods=6))
sample_df_2['Extra Data'] = sample_series * 3 + 1

second_numpy_array = np.arange(len(sample_df_2)) * 100 + 7
sample_df_2['G'] = second_numpy_array

sample_df_2
             A   B   C   D      Fruits  Extra Data    G
2016-01-01   0   1   2   3       apple           4    7
2016-01-02   4   5   6   7      orange           7  107
2016-01-03   8   9  10  11      banana          10  207
2016-01-04  12  13  14  15  strawberry          13  307
2016-01-05  16  17  18  19   blueberry          16  407
2016-01-06  20  21  22  23   pineapple          19  507
  • reindex

    Creates a copy rather than a view

    browser_index = 'Firefox Chrome Safari IE10 Konqueror'.split()
    
    browser_df = pd.DataFrame(
        dict(
            http_status=[200, 200, 404, 404, 301],
            response_time=[0.04, 0.02, 0.07, 0.08, 1.0]),
        index=browser_index)
    
    browser_df
    
               http_status  response_time
    Firefox            200           0.04
    Chrome             200           0.02
    Safari             404           0.07
    IE10               404           0.08
    Konqueror          301           1.00
    
  • Create a reindexed copy

    new_index = 'Safari Iceweasel ComodoDragon IE10 Chrome'.split()
    browser_df_2 = browser_df.reindex(new_index)
    browser_df_2
    
                  http_status  response_time
    Safari              404.0           0.07
    Iceweasel             NaN            NaN
    ComodoDragon          NaN            NaN
    IE10                404.0           0.08
    Chrome              200.0           0.02
    
  • Drop rows with missing data

    browser_df_3 = browser_df_2.dropna(how='any')
    browser_df_3
    
            http_status  response_time
    Safari        404.0           0.07
    IE10          404.0           0.08
    Chrome        200.0           0.02
    
  • Fill in missing data

    browser_df_2.fillna(value=-0.05555)
    
                  http_status  response_time
    Safari          404.00000        0.07000
    Iceweasel        -0.05555       -0.05555
    ComodoDragon     -0.05555       -0.05555
    IE10            404.00000        0.08000
    Chrome          200.00000        0.02000
    
  • Boolean mask for NA values

    pd.isnull(browser_df_2)
    
                  http_status  response_time
    Safari              False          False
    Iceweasel            True           True
    ComodoDragon         True           True
    IE10                False          False
    Chrome              False          False
    
  • NaNs propagate during calculations

    browser_df_2 * 3 + 10
    
                  http_status  response_time
    Safari             1222.0          10.21
    Iceweasel             NaN            NaN
    ComodoDragon          NaN            NaN
    IE10               1222.0          10.24
    Chrome              610.0          10.06
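
Before dropping or filling, it can help to count the missing values, and fillna also accepts one replacement per column. The following is a minimal sketch; the replacement values (0 for the status, the column mean for the response time) are arbitrary choices for illustration.

```python
import pandas as pd

browser_df = pd.DataFrame(
    dict(http_status=[200, 200, 404, 404, 301],
         response_time=[0.04, 0.02, 0.07, 0.08, 1.0]),
    index='Firefox Chrome Safari IE10 Konqueror'.split())
browser_df_2 = browser_df.reindex(
    'Safari Iceweasel ComodoDragon IE10 Chrome'.split())

# Count the missing values in each column.
missing_per_column = browser_df_2.isna().sum()

# Fill each column with its own replacement value (illustrative choices).
# Series.mean skips NaN by default.
fill_values = {
    'http_status': 0,
    'response_time': browser_df_2['response_time'].mean(),
}
filled = browser_df_2.fillna(value=fill_values)

print(missing_per_column)
print(filled)
```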
    

Operations

  • Descriptive statistics: describe

    pd.set_option('display.precision', 2)
    sample_df_2.describe()
    
               A      B      C      D  Extra Data       G
    count   6.00   6.00   6.00   6.00        6.00    6.00
    mean   10.00  11.00  12.00  13.00       11.50  257.00
    std     7.48   7.48   7.48   7.48        5.61  187.08
    min     0.00   1.00   2.00   3.00        4.00    7.00
    25%     5.00   6.00   7.00   8.00        7.75  132.00
    50%    10.00  11.00  12.00  13.00       11.50  257.00
    75%    15.00  16.00  17.00  18.00       15.25  382.00
    max    20.00  21.00  22.00  23.00       19.00  507.00
    
  • Column mean

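    # numeric_only=True skips the non-numeric Fruits column (required since pandas 2.0).
    sample_df_2.mean(numeric_only=True)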
    
    A              10.0
    B              11.0
    C              12.0
    D              13.0
    Extra Data     11.5
    G             257.0
    dtype: float64
    
  • Row mean

    sample_df_2.mean(axis=1, numeric_only=True)
    
    2016-01-01      2.83
    2016-01-02     22.67
    2016-01-03     42.50
    2016-01-04     62.33
    2016-01-05     82.17
    2016-01-06    102.00
    Freq: D, dtype: float64
    
  • apply a function to a data frame

    sample_df_2[['A', 'B', 'C', 'Fruits']]
    
                 A   B   C      Fruits
    2016-01-01   0   1   2       apple
    2016-01-02   4   5   6      orange
    2016-01-03   8   9  10      banana
    2016-01-04  12  13  14  strawberry
    2016-01-05  16  17  18   blueberry
    2016-01-06  20  21  22   pineapple
    
    sample_df_2[['A', 'B', 'Fruits']].apply(np.cumsum, axis=0)
    
                 A   B                                         Fruits
    2016-01-01   0   1                                          apple
    2016-01-02   4   6                                    appleorange
    2016-01-03  12  15                              appleorangebanana
    2016-01-04  24  28                    appleorangebananastrawberry
    2016-01-05  40  45           appleorangebananastrawberryblueberry
    2016-01-06  60  66  appleorangebananastrawberryblueberrypineapple
    
    sample_df_2[['A', 'B', 'C']].apply(np.cumsum, axis=1)
    
                 A   B   C
    2016-01-01   0   1   3
    2016-01-02   4   9  15
    2016-01-03   8  17  27
    2016-01-04  12  25  39
    2016-01-05  16  33  51
    2016-01-06  20  41  63
    
  • String methods

    series = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
    series.str.lower()
    series.str.len()
    
    
    0       a
    1       b
    2       c
    3    aaba
    4    baca
    5     NaN
    6    caba
    7     dog
    8     cat
    dtype: object
    
    0    1.0
    1    1.0
    2    1.0
    3    4.0
    4    4.0
    5    NaN
    6    4.0
    7    3.0
    8    3.0
    dtype: float64
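
apply also accepts a user-defined function, called once per column (axis=0, the default) or once per row (axis=1). The following is a small sketch; value_range is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)),
    index=pd.date_range('20160101', periods=6),
    columns=list('ABCD'))


def value_range(values):
    """Return the spread between the largest and smallest value."""
    return values.max() - values.min()


# axis=0 (default): the function receives each column as a Series.
column_ranges = sample_df.apply(value_range)

# axis=1: the function receives each row as a Series.
row_ranges = sample_df.apply(value_range, axis=1)

print(column_ranges)
print(row_ranges)
```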
    

Merging data frames

import numpy as np
import pandas as pd
import random as rand

index = np.arange(1, 7)
attrs = 'clicks height score time'.split()
values = rand.sample(range(50), 24)
sample_data = np.array(values).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=index, columns=attrs)
sample_df
   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
4      40      48     36     0
5       6      47      1     3
6      14      43     13    34

concat

Concatenate pandas objects along a particular axis with optional set logic along the other axes.

pd.concat([sample_df[0:3], sample_df[0:3]])
   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
pd.concat([sample_df.iloc[0:3], sample_df.iloc[0:3]], axis=1)
   clicks  height  score  time  clicks  height  score  time
1      19      39     11    17      19      39     11    17
2       7      25     26    44       7      25     26    44
3       2       8     21    35       2       8     21    35
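
Concatenating rows as above leaves duplicate index labels. Two common ways around that are sketched here on a small fixed frame (fixed rather than random so the result is reproducible); the keys 'first' and 'second' are arbitrary labels.

```python
import pandas as pd

sample_df = pd.DataFrame(
    dict(clicks=[19, 7, 2], height=[39, 25, 8]), index=[1, 2, 3])

# ignore_index=True discards the old labels and renumbers the rows 0..n-1.
renumbered = pd.concat([sample_df, sample_df], ignore_index=True)

# keys=... adds an outer index level recording which input each row came from.
labelled = pd.concat([sample_df, sample_df], keys=['first', 'second'])

print(renumbered)
print(labelled)
print(labelled.loc['second'])
```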

join

Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.

sample_df.join(sample_df.iloc[0:3], how='inner', rsuffix='_r')
   clicks  height  score  time  clicks_r  height_r  score_r  time_r
1      19      39     11    17        19        39       11      17
2       7      25     26    44         7        25       26      44
3       2       8     21    35         2         8       21      35

append

Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns. DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat gives the same result.

new_row = pd.DataFrame(dict(clicks=10, height=20, score=30, time=40), index=[10])
pd.concat([sample_df.iloc[0:3], new_row])
    clicks  height  score  time
1       19      39     11    17
2        7      25     26    44
3        2       8     21    35
10      10      20     30    40

merge

Merge DataFrame objects by performing a database-style join operation by columns or indexes.

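If joining columns on columns, the DataFrame indexes will be ignored. Otherwise, if joining indexes on indexes or indexes on a column or columns, the index will be passed on. With no on argument, merge joins on the columns the two frames have in common, which is height in the examples below.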

sample_df
   clicks  height  score  time
1      19      39     11    17
2       7      25     26    44
3       2       8     21    35
4      40      48     36     0
5       6      47      1     3
6      14      43     13    34
entries = {
    1: dict(height=10, width=20),
    2: dict(height=34, width=35),
    3: dict(height=5, width=80),
    4: dict(height=39, width=32),
}
related_df = pd.DataFrame(entries)
related_df.T
   height  width
1      10     20
2      34     35
3       5     80
4      39     32
sample_df.merge(related_df.T)
   clicks  height  score  time  width
0      19      39     11    17     32
sample_df.merge(related_df.T, how='left')
   clicks  height  score  time  width
0      19      39     11    17   32.0
1       7      25     26    44    NaN
2       2       8     21    35    NaN
3      40      48     36     0    NaN
4       6      47      1     3    NaN
5      14      43     13    34    NaN
sample_df.merge(related_df.T, how='outer')
   clicks  height  score  time  width
0    19.0      39   11.0  17.0   32.0
1     7.0      25   26.0  44.0    NaN
2     2.0       8   21.0  35.0    NaN
3    40.0      48   36.0   0.0    NaN
4     6.0      47    1.0   3.0    NaN
5    14.0      43   13.0  34.0    NaN
6     NaN      10    NaN   NaN   20.0
7     NaN      34    NaN   NaN   35.0
8     NaN       5    NaN   NaN   80.0
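
The merges above rely on the shared height column being picked up automatically. Naming the key with on= and adding indicator=True makes the join easier to audit. The following is a sketch using the values printed above; related_rows stands in for related_df.T.

```python
import pandas as pd

sample_df = pd.DataFrame(
    dict(clicks=[19, 7, 2, 40, 6, 14],
         height=[39, 25, 8, 48, 47, 43],
         score=[11, 26, 21, 36, 1, 13],
         time=[17, 44, 35, 0, 3, 34]),
    index=range(1, 7))
related_rows = pd.DataFrame(
    dict(height=[10, 34, 5, 39], width=[20, 35, 80, 32]), index=range(1, 5))

# Name the join key explicitly instead of relying on common column names.
inner = sample_df.merge(related_rows, on='height', how='inner')

# indicator=True adds a _merge column: left_only, right_only or both.
outer = sample_df.merge(
    related_rows, on='height', how='outer', indicator=True)

print(inner)
print(outer['_merge'].value_counts())
```

Only height 39 appears in both frames, so the inner merge has a single row and the outer merge marks one row as both.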

Categoricals

import numpy as np
import pandas as pd
from io import StringIO

csv_data = """
Department,Name,YearsOfService,Grade\n0,Marketing,Able,4,a\n1,Engineering,Baker,7,b\n2,Accounting,Charlie,12,c\n3,Marketing,Delta,1,d\n4,Engineering,Echo,15,f\n5,Accounting,Foxtrot,9,a\n6,Marketing,Golf,3,b\n7,Engineering,Hotel,1,c\n8,Accounting,India,2,d\n9,Marketing,Juliet,5,f\n10,Engineering,Kilo,7,a\n11,Accounting,Lima,11,b\n12,Marketing,Mike,2,c\n13,Engineering,November,3,d\n14,Accounting,Oscar,4,f\n15,Marketing,Papa,9,a\n16,Engineering,Quebec,1,b\n17,Accounting,Romeo,1,c\n18,Marketing,Sierra,1,d\n19,Engineering,Tango,7,f\n20,Accounting,Uniform,5,a\n21,Marketing,Victor,19,b\n22,Engineering,Whiskey,2,c\n23,Accounting,Xray,3,d\n24,Marketing,Yankee,8,f\n25,Engineering,Zulu,17,a\n
"""

employees = pd.read_csv(StringIO(csv_data))
employees.head()
    Department     Name  YearsOfService Grade
0    Marketing     Able               4     a
1  Engineering    Baker               7     b
2   Accounting  Charlie              12     c
3    Marketing    Delta               1     d
4  Engineering     Echo              15     f

Convert String data to categorical data

employees.dtypes
Department        object
Name              object
YearsOfService     int64
Grade             object
dtype: object
employees['Department'] = employees['Department'].astype('category')
employees.dtypes
Department        category
Name                object
YearsOfService       int64
Grade               object
dtype: object
  • Rename categories

    employees['Grade'] = employees['Grade'].astype('category')
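    # Assigning to .cat.categories was removed in pandas 2.0; rename_categories returns a new Series.
    employees['Grade'] = employees['Grade'].cat.rename_categories(
        'excellent good acceptable poor unacceptable'.split())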
    employees.head()
    
        Department     Name  YearsOfService         Grade
    0    Marketing     Able               4     excellent
    1  Engineering    Baker               7          good
    2   Accounting  Charlie              12    acceptable
    3    Marketing    Delta               1          poor
    4  Engineering     Echo              15  unacceptable
    

    Categories before and after renaming:

    # Index(['a', 'b', 'c', 'd', 'f'], dtype='object')
    # Index(['excellent', 'good', 'acceptable', 'poor', 'unacceptable'], dtype='object')
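
Renaming by position depends on the order of the existing categories. A dictionary passed to rename_categories spells out each mapping instead. The following is a minimal sketch on a short, made-up grade Series; grade_names is an illustrative name.

```python
import pandas as pd

grades = pd.Series(list('abcdfab'), dtype='category')

# Map each old category to its new name explicitly.
grade_names = {
    'a': 'excellent',
    'b': 'good',
    'c': 'acceptable',
    'd': 'poor',
    'f': 'unacceptable',
}
renamed = grades.cat.rename_categories(grade_names)

print(list(grades.cat.categories))
print(list(renamed.cat.categories))
print(renamed.tolist())
```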
    

Grouping

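Total length of service of the employees in each department.

# numeric_only=True restricts the sum to YearsOfService; Name and Grade are not numeric.
employees.groupby('Department').sum(numeric_only=True)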
             YearsOfService
Department
Accounting               47
Engineering              60
Marketing                52

Number of employees per grade.

employees.groupby('Grade').count()['Name']
Grade
excellent       6
good            5
acceptable      5
poor            5
unacceptable    5
Name: Name, dtype: int64

Number of employees, by department, obtaining each grade.

employees.groupby(['Grade', 'Department']).count()['Name']
Grade         Department

excellent     Accounting     2
              Engineering    2
              Marketing      2

good          Accounting     1
              Engineering    2
              Marketing      2

acceptable    Accounting     2
              Engineering    2
              Marketing      1

poor          Accounting     2
              Engineering    1
              Marketing      2

unacceptable  Accounting     1
              Engineering    2
              Marketing      2
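
Two variations on the grouping above: several aggregates in one call, and the two-key count pivoted into a table. This sketch uses a small made-up frame with the same column names rather than the full employee data.

```python
import pandas as pd

# Illustrative data only; not the employee table from this section.
employees = pd.DataFrame(dict(
    Department=['Marketing', 'Engineering', 'Accounting',
                'Marketing', 'Engineering'],
    Name=['Able', 'Baker', 'Charlie', 'Delta', 'Echo'],
    YearsOfService=[4, 7, 12, 1, 15],
    Grade=['a', 'b', 'c', 'a', 'b']))

# Several aggregates of one column, one row per department.
per_department = employees.groupby('Department')['YearsOfService'].agg(
    ['count', 'sum', 'max'])

# Count rows per (Grade, Department) pair, then pivot Department into columns.
grade_table = employees.groupby(
    ['Grade', 'Department']).size().unstack(fill_value=0)

print(per_department)
print(grade_table)
```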

Time series resampling

Create a date range to use as an index: pandas.date_range

my_index = pd.date_range('9/1/2016', periods=9, freq='min')
my_index
DatetimeIndex(['2016-09-01 00:00:00', '2016-09-01 00:01:00',
               '2016-09-01 00:02:00', '2016-09-01 00:03:00',
               '2016-09-01 00:04:00', '2016-09-01 00:05:00',
               '2016-09-01 00:06:00', '2016-09-01 00:07:00',
               '2016-09-01 00:08:00'],
              dtype='datetime64[ns]', freq='T')

Create a time series that includes a simple pattern: pandas.Series

my_series = pd.Series(np.arange(9), index=my_index)
my_series
2016-09-01 00:00:00    0
2016-09-01 00:01:00    1
2016-09-01 00:02:00    2
2016-09-01 00:03:00    3
2016-09-01 00:04:00    4
2016-09-01 00:05:00    5
2016-09-01 00:06:00    6
2016-09-01 00:07:00    7
2016-09-01 00:08:00    8
Freq: T, dtype: int64

Downsampling: pandas.resample

my_series.resample('3min')
DatetimeIndexResampler [freq=<3 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]
my_series.resample('3min').sum()
2016-09-01 00:00:00     3
2016-09-01 00:03:00    12
2016-09-01 00:06:00    21
Freq: 3T, dtype: int64

Use upper bound for each time period as the label.

my_series.resample('3min', label='right').sum()
2016-09-01 00:03:00     3
2016-09-01 00:06:00    12
2016-09-01 00:09:00    21
Freq: 3T, dtype: int64

Close the right side of the bin interval.

my_series.resample('3min', label='right', closed='right').sum()
2016-09-01 00:00:00     0
2016-09-01 00:03:00     6
2016-09-01 00:06:00    15
2016-09-01 00:09:00    15
Freq: 3T, dtype: int64
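
To see which samples land in which bin, it can help to aggregate with min, max and count instead of sum. The following is a sketch on the same series, comparing the default left-closed bins with right-closed bins.

```python
import numpy as np
import pandas as pd

my_index = pd.date_range('9/1/2016', periods=9, freq='min')
my_series = pd.Series(np.arange(9), index=my_index)

# Default: each bin includes its left edge, [00:00, 00:03), [00:03, 00:06), ...
left_closed = my_series.resample('3min').agg(['min', 'max', 'count'])

# closed='right': each bin includes its right edge, (23:57, 00:00], (00:00, 00:03], ...
right_closed = my_series.resample(
    '3min', label='right', closed='right').agg(['min', 'max', 'count'])

print(left_closed)
print(right_closed)
```

With right-closed bins the first sample falls into a bin of its own and the last bin holds only two samples, which is why the sums above change from 3, 12, 21 to 0, 6, 15, 15.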

Upsampling

my_series.resample('30s').asfreq().head()
2016-09-01 00:00:00    0.0
2016-09-01 00:00:30    NaN
2016-09-01 00:01:00    1.0
2016-09-01 00:01:30    NaN
2016-09-01 00:02:00    2.0
Freq: 30S, dtype: float64
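
The NaN gaps that asfreq leaves when upsampling can be filled in the same step. The following is a minimal sketch of two options, forward fill and linear interpolation, on the same series.

```python
import numpy as np
import pandas as pd

my_index = pd.date_range('9/1/2016', periods=9, freq='min')
my_series = pd.Series(np.arange(9), index=my_index)

# Repeat the last known value in each new 30-second slot.
forward_filled = my_series.resample('30s').ffill()

# Or interpolate linearly between the neighbouring known values.
interpolated = my_series.resample('30s').interpolate()

print(forward_filled.head())
print(interpolated.head())
```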
  • Custom function to use with resampling

    def custom_arithmetic(array_like):
        temp = 3 * np.sum(array_like) + 5
        return temp
    
    my_series.resample('3min').apply(custom_arithmetic)
    
    2016-09-01 00:00:00    14
    2016-09-01 00:03:00    41
    2016-09-01 00:06:00    68
    Freq: 3T, dtype: int64
    

Series

Create series

my_simple_series = pd.Series(np.random.randn(5), index=list('abcde'))
my_simple_series
a    1.186168
b    0.606623
c    1.862614
d   -1.180305
e    0.615774
dtype: float64
my_dictionary = dict(a=45, b=-19.5, c=4444)
my_second_series = pd.Series(my_dictionary)
my_second_series
a      45.0
b     -19.5
c    4444.0
dtype: float64
pd.Series(my_dictionary, index=list('bcda'))
b     -19.5
c    4444.0
d       NaN
a      45.0
dtype: float64
my_dictionary.get('a')
45
legit = my_dictionary.get('a')
type(legit)
unknown = my_dictionary.get('f')
type(unknown)
<class 'int'>
<class 'NoneType'>

Create a series from a scalar

pd.Series(5, index=list('abcd'))
a    5
b    5
c    5
d    5
dtype: int64

Vectorized operations

A key difference between Series and ndarrays is that Series operations automatically align data based on labels.

my_series.head() + my_series.head()
2016-09-01 00:00:00    0
2016-09-01 00:01:00    2
2016-09-01 00:02:00    4
2016-09-01 00:03:00    6
2016-09-01 00:04:00    8
Freq: T, dtype: int64
np.array(my_series.head()) + np.array(my_series.head())
array([0, 2, 4, 6, 8])
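
Label alignment only becomes visible when the two operands have different labels or a different label order. The following is a small sketch with made-up series named left and right.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list('abc'))
right = pd.Series([10, 20, 30], index=list('cab'))

# Series addition pairs values by label: a: 1 + 20, b: 2 + 30, c: 3 + 10.
aligned_sum = left + right

# ndarray addition pairs values by position: 1 + 10, 2 + 20, 3 + 30.
positional_sum = left.to_numpy() + right.to_numpy()

# Labels present on only one side produce NaN in the result.
partial = left + pd.Series([100], index=['a'])

print(aligned_sum)
print(positional_sum)
print(partial)
```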

Date arithmetic

from datetime import datetime
now = datetime.now()
now
datetime.datetime(2017, 9, 22, 14, 30, 18, 504458)
  • delta

    delta = now - datetime(2001, 1, 1)
    delta
    
    datetime.timedelta(6108, 52218, 504458)
    
    delta.days
    
    6108
    
    pd.Timedelta(6108, unit='d')
    
    Timedelta('6108 days 00:00:00')
    
  • Range from timedelta

    us_memorial_day = datetime(2016, 5, 30)
    us_labor_day = datetime(2016, 9, 5)
    us_summer_2016 = us_labor_day - us_memorial_day
    us_summer_2016
    
    datetime.timedelta(98)
    
    summer_2016_days = pd.date_range(
        us_memorial_day, periods=us_summer_2016.days, freq='D')
    summer_2016_days[:4]
    summer_2016_days[-4:]
    
    DatetimeIndex(['2016-05-30', '2016-05-31', '2016-06-01', '2016-06-02'], dtype='datetime64[ns]', freq='D')
    DatetimeIndex(['2016-09-01', '2016-09-02', '2016-09-03', '2016-09-04'], dtype='datetime64[ns]', freq='D')
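
date_range can also be given a start and an end instead of a period count, in which case both endpoints are included. The following is a sketch using the same two dates; both_endpoints and elapsed are illustrative names.

```python
from datetime import datetime

import pandas as pd

us_memorial_day = datetime(2016, 5, 30)
us_labor_day = datetime(2016, 9, 5)

# start and end are both included, so this has one more entry than the
# periods=us_summer_2016.days version above.
both_endpoints = pd.date_range(us_memorial_day, us_labor_day, freq='D')

# Subtracting pandas Timestamps yields a pandas Timedelta.
elapsed = pd.Timestamp(us_labor_day) - pd.Timestamp(us_memorial_day)

print(len(both_endpoints), both_endpoints[-1])
print(elapsed)
```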
    

Data Frames and Panels

Creating data frames from various source types

vals = dict(a=40, b=29, c=292, d=-5.03)
pd.DataFrame(vals, index='first again'.split())
        a   b    c     d
first  40  29  292 -5.03
again  40  29  292 -5.03

Without an explicit index

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
pd.DataFrame(series_dict)
   a    b   c
0  4    9   3
1  5  322  45
2  6  455  22

Dictionary of tuples, with a MultiIndex

dict_of_tuples = {
    ('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'c'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('b', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}
}
pd.DataFrame(dict_of_tuples)
     a        b
     a  b  c  a  b
A B  1  1  1  1  1
  C  2  2  2  2  2

Create panels

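3D analogues of DataFrames. Panel was deprecated in pandas 0.20 and removed in pandas 0.25, so the examples below run only on older versions; current pandas represents such data with a MultiIndex DataFrame or with the xarray package.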

Initialized natively

pd.Panel(np.random.randn(2, 5, 4),
         items='item1 item2'.split(),
         major_axis=pd.date_range('9/6/2016', periods=5),
         minor_axis=list('ABCD'))
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 2016-09-06 00:00:00 to 2016-09-10 00:00:00
Minor_axis axis: A to D

Initialized from a dictionary of data frames

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10
dict_of_dfs = dict(df1=df1, df2=df2)
pd.Panel(dict_of_dfs)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c

from_dict factory method

panel = pd.Panel.from_dict(dict_of_dfs, orient='minor')
pd.Panel.from_dict(dict_of_dfs, orient='items')
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c
panel.ix[:, 0, :]
      a   b   c
df1   4   9   3
df2  14  19  13
panel.ix['a':, 0, :'df1']
     a  b  c
df1  4  9  3
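
On pandas versions without Panel, a dictionary of data frames can be stacked into one DataFrame with a two-level index. The following is a rough sketch of that approach, not a drop-in replacement for Panel; the level names item and row are made up for illustration.

```python
import pandas as pd

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10

# The dictionary keys become the outer index level.
stacked = pd.concat(dict(df1=df1, df2=df2), names=['item', 'row'])

# Row 0 of every frame, comparable to panel.ix[:, 0, :] above.
first_rows = stacked.xs(0, level='row')

# One whole frame, selected by its key.
only_df2 = stacked.loc['df2']

print(stacked)
print(first_rows)
print(only_df2)
```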

SciPy

[WIP]

scikit-learn

[WIP]