Python: scientific libraries
NumPy
Version
import numpy as np
np.__version__
'1.12.1'
Print options
np.set_printoptions(precision=4)
Creating from Python data structures
Every element in an np array must have the same type.
np.array([1, 2, 3, 4, 5])
array([1, 2, 3, 4, 5])
Data type promotion (all elements converted to consistent type):
mixed_nums = (14, -3.54, 5+7j)
np.array(mixed_nums)
array([ 14.00+0.j, -3.54+0.j, 5.00+7.j])
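The promotion rule above can be checked by inspecting the resulting dtype. This is a minimal sketch; the variable names other than mixed_nums are illustrative and not part of the original notes.

```python
import numpy as np

# Mixed ints, floats and complex values are promoted to one common dtype.
mixed_nums = (14, -3.54, 5 + 7j)
promoted = np.array(mixed_nums)
print(promoted.dtype)        # complex128

# Ints mixed with floats are promoted to float64.
int_and_float = np.array([1, 2.5])
print(int_and_float.dtype)   # float64

# An explicit dtype overrides the inferred one.
as_float32 = np.array([1, 2, 3], dtype=np.float32)
print(as_float32.dtype)      # float32
```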
Creating using NumPy methods
np.arange(10, step=2)
array([0, 2, 4, 6, 8])
np.arange(5, 10) + 1
array([ 6, 7, 8, 9, 10])
len(np.arange(0, 10, 2))
np.arange(10).size
5
10
np.arange(24, 25)
array([24])
Operations are performed element-wise
np.array([1, 2, 3, 4]) * 10
array([10, 20, 30, 40])
linspace, zeros, ones, data types
linspace
Returns num (default 50) evenly spaced numbers over the given interval. The interval is closed by default: the stop value is included.
np.linspace(5, 10, 9).size
9
With retstep=True, also return the step size between entries:
np.linspace(5, 15, 3, retstep=True)
(array([ 5., 10., 15.]), 5.0)
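A short sketch of the closed-interval behaviour, next to the endpoint=False option that makes linspace behave like a half-open range. The variable names are illustrative only.

```python
import numpy as np

# Default: the stop value (1.0) is the last sample.
closed = np.linspace(0, 1, 5)
print(closed.tolist())      # [0.0, 0.25, 0.5, 0.75, 1.0]

# endpoint=False: the stop value is excluded, like np.arange.
half_open = np.linspace(0, 1, 4, endpoint=False)
print(half_open.tolist())   # [0.0, 0.25, 0.5, 0.75]
```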
zeros
np.zeros(5)
array([ 0., 0., 0., 0., 0.])
np.zeros((5, 3))
array([[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
np.zeros(11, dtype='int64')
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
ones
np.ones(5)
array([ 1., 1., 1., 1., 1.])
np.ones((3, 2))
array([[ 1., 1.],
[ 1., 1.],
[ 1., 1.]])
ndarray
np.ndarray(shape=(2, 4), dtype=float)
array([[ 1.2882e-231, 1.2882e-231, 1.2882e-231, 1.2882e-231],
[ 1.2882e-231, 1.2882e-231, 1.7587e-310, 3.5098e+064]])
np.ndarray(shape=(3,), dtype=int, order='F')
array([4617315517961601024, 4621819117588971520, 4624633867356078080])
Slicing and iterating
Arrays
np_arr = np.array([-17, -4, 0, 2, 21])
np_arr[0]
np_arr[-1]
np_arr[-1] = 33
np_arr
-17
21
array([-17, -4, 0, 2, 33])
Multidimensional arrays: shape
matr = np.arange(35)
matr.shape = (7, 5)
matr[2]
matr[2, 3]
matr[2][3]
array([10, 11, 12, 13, 14])
13
13
3-D Arrays
array_3d = np.arange(70)
array_3d.shape = (2, 7, 5)
array_3d
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34]],
[[35, 36, 37, 38, 39],
[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49],
[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59],
[60, 61, 62, 63, 64],
[65, 66, 67, 68, 69]]])
array_3d[1][4][3]
58
array_3d[1][4][3] = 1111
array_3d
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14],
[ 15, 16, 17, 18, 19],
[ 20, 21, 22, 23, 24],
[ 25, 26, 27, 28, 29],
[ 30, 31, 32, 33, 34]],
[[ 35, 36, 37, 38, 39],
[ 40, 41, 42, 43, 44],
[ 45, 46, 47, 48, 49],
[ 50, 51, 52, 53, 54],
[ 55, 56, 57, 1111, 59],
[ 60, 61, 62, 63, 64],
[ 65, 66, 67, 68, 69]]])
Boolean mask arrays
vector = np.array([-17, -4, 0, 21, 37])
divisible_by_7_mask = (vector % 7) == 0
divisible_by_7_mask
array([False, False, True, True, False], dtype=bool)
vector[vector % 7 == 0]
array([ 0, 21])
div_by_3_test = vector % 3 == 0
positive_test = vector > 0
combined_test = np.logical_and(div_by_3_test, positive_test)
vector[combined_test]
array([21])
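Masks can also be combined with the element-wise operators &, | and ~ instead of np.logical_and. A minimal sketch using the same vector; the names both and either are illustrative.

```python
import numpy as np

vector = np.array([-17, -4, 0, 21, 37])

# & and | combine masks element-wise. The parentheses are required
# because & and | bind more tightly than comparison operators.
both = (vector % 3 == 0) & (vector > 0)
either = (vector % 3 == 0) | (vector > 0)
print(vector[both].tolist())     # [21]
print(vector[either].tolist())   # [0, 21, 37]

# ~ negates a mask.
print(vector[~both].tolist())    # [-17, -4, 0, 37]
```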
Broadcasting
How numpy handles operations between arrays of different sizes.
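A minimal sketch of broadcasting between two arrays of different shapes. The arrays column and row are made up for illustration and do not appear elsewhere in these notes.

```python
import numpy as np

# A (3, 1) column and a (4,) row broadcast to a common (3, 4) shape.
column = np.arange(3).reshape((3, 1))   # [[0], [1], [2]]
row = np.arange(4) * 10                 # [0, 10, 20, 30]
table = column + row
print(table.shape)   # (3, 4)
print(table)

# Shapes are compared from the trailing dimension backwards: each pair of
# dimensions must be equal, or one of them must be 1.
try:
    np.arange(3) + np.arange(4)
except ValueError as err:
    print("incompatible shapes:", err)
```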
Matrix attributes
my_3d_array = np.arange(70)
my_3d_array.shape = (2, 7, 5)
my_3d_array.ndim
my_3d_array.size
my_3d_array.dtype
3
70
dtype('int64')
Scalars
5 * my_3d_array - 2
array([[[ -2, 3, 8, 13, 18],
[ 23, 28, 33, 38, 43],
[ 48, 53, 58, 63, 68],
[ 73, 78, 83, 88, 93],
[ 98, 103, 108, 113, 118],
[123, 128, 133, 138, 143],
[148, 153, 158, 163, 168]],
[[173, 178, 183, 188, 193],
[198, 203, 208, 213, 218],
[223, 228, 233, 238, 243],
[248, 253, 258, 263, 268],
[273, 278, 283, 288, 293],
[298, 303, 308, 313, 318],
[323, 328, 333, 338, 343]]])
Vectors
inner product: np.inner and np.dot
np.inner: inner product of two arrays. For 1-D arrays, the ordinary inner product of vectors. For higher dimensions, a sum product over the last axes, so the last dimensions of both arrays must match.
left_matrix = np.arange(6).reshape((2, 3))
right_matrix = np.arange(15).reshape((3, 5))
np.inner(left_matrix, right_matrix)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: shapes (2,3) and (5,3) not aligned: 3 (dim 1) != 5 (dim 0)
np.dot: for 2-D arrays, the matrix product. For 1-D arrays, the inner product of vectors.
np.dot(left_matrix, right_matrix)
array([[ 25, 28, 31, 34, 37],
[ 70, 82, 94, 106, 118]])
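The relationship between the two functions can be sketched as follows: np.inner contracts the last axis of both operands, so it matches np.dot once the right operand is transposed. The names via_inner, via_dot, u and v are illustrative.

```python
import numpy as np

left_matrix = np.arange(6).reshape((2, 3))
right_matrix = np.arange(15).reshape((3, 5))

# Transposing the right operand to shape (5, 3) makes the last axes match.
via_inner = np.inner(left_matrix, right_matrix.T)
via_dot = np.dot(left_matrix, right_matrix)
print(np.array_equal(via_inner, via_dot))   # True

# For 1-D arrays, inner and dot agree.
u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
print(np.inner(u, v), np.dot(u, v))         # 32 32
```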
Operations along axes
one_to_three = np.arange(3) + 1
matrix = [one_to_three, one_to_three, one_to_three]
np.array(one_to_three).sum()
np.array(matrix).sum() # sum all elements
np.array(matrix).sum(axis=0) # collapse axis 0 (rows): one sum per column
np.array(matrix).sum(axis=1) # collapse axis 1 (columns): one sum per row
6
18
array([3, 6, 9])
array([6, 6, 6])
my_3d_array.sum(axis=0)
array([[ 35, 37, 39, 41, 43],
[ 45, 47, 49, 51, 53],
[ 55, 57, 59, 61, 63],
[ 65, 67, 69, 71, 73],
[ 75, 77, 79, 81, 83],
[ 85, 87, 89, 91, 93],
[ 95, 97, 99, 101, 103]])
Structured and record arrays
Structured arrays for data definition
person_data_def = [('name', 'S6'), ('height', 'f8'), ('weight', 'f8'), ('age', 'i8')]
people_array = np.zeros((4), dtype=person_data_def)
people_array[0] = ('Alpha', 65, 112, 25)
people_array[1] = ('Beta', 43, 128, 33)
people_array[2] = ('Gamma', 29, 188, 35)
people_array[3] = ('Delta', 73, 205, 34)
people_array
array([(b'Alpha', 65., 112., 25), (b'Beta', 43., 128., 33),
(b'Gamma', 29., 188., 35), (b'Delta', 73., 205., 34)],
dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
Accessing data in structured arrays
people_array[2:]
array([(b'Gamma', 29., 188., 35), (b'Delta', 73., 205., 34)], dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
ages = people_array['age']
ages
array([25, 33, 35, 34])
Record arrays: A wrapper around structured arrays
Instead of using indexes, use attributes
person_record_array = np.rec.array(people_array)
person_record_array
rec.array([
(b'Alpha', 65., 112., 25),
(b'Beta', 43., 128., 33),
(b'Gamma', 29., 188., 35),
(b'Delta', 73., 205., 34)
],
dtype=[('name', 'S6'), ('height', '<f8'), ('weight', '<f8'), ('age', '<i8')])
person_record_array[0].age
25
Views and copies
Assigning to a new variable creates a new reference. Same NumPy object / location in memory, same underlying data.
import numpy as np
mi_casa = np.array([-45, -31, -12, 0])
su_casa = mi_casa
# same object
id(mi_casa)
id(su_casa)
mi_casa is su_casa
# equal values
mi_casa == su_casa
# values remain in sync when mutated
su_casa[0] = 100
mi_casa is su_casa
mi_casa == su_casa
4642127504
4642127504
True
array([ True, True, True, True], dtype=bool)
True
array([ True, True, True, True], dtype=bool)
Views
view() returns a new array object that shares the original's data (a shallow copy).
dog_house = mi_casa.view()
dog_house is mi_casa # new object at different location
dog_house == mi_casa # same values as original
mi_casa[0] = 345
dog_house is mi_casa # still a new object
dog_house == mi_casa # values remain in sync
False
array([ True, True, True, True], dtype=bool)
False
array([ True, True, True, True], dtype=bool)
Copies
Provides a deep copy.
tree_house = mi_casa.copy()
# different object, same values
tree_house is mi_casa
tree_house == mi_casa
# values are distinct
tree_house[0] = 983798739
tree_house is mi_casa
tree_house == mi_casa
False
array([ True, True, True, True], dtype=bool)
False
array([False, True, True, True], dtype=bool)
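Whether two arrays share data can also be tested directly rather than inferred from mutation. A minimal sketch, assuming np.shares_memory and the base attribute are acceptable checks for this purpose; the variable names are illustrative.

```python
import numpy as np

original = np.array([-45, -31, -12, 0])
view = original.view()      # new object, shared data
copy = original.copy()      # new object, independent data

print(view.base is original)              # True
print(copy.base is None)                  # True
print(np.shares_memory(original, view))   # True
print(np.shares_memory(original, copy))   # False

# Basic slicing also returns a view, so writes go through to the original.
tail = original[2:]
tail[0] = 99
print(original.tolist())                  # [-45, -31, 99, 0]
```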
Array attributes
import numpy as np
arr = np.array(np.arange(24)).reshape((2, 3, 4))
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
ndim
arr.ndim
3
shape
arr.shape
(2, 3, 4)
size
arr.size
24
dtype
arr.dtype
dtype('int64')
itemsize
arr.itemsize
8
Add and remove elements
append
Appends values to the end of an array. Unless axis is given, both arrays are flattened first, so the original shape is not preserved. Returns a copy, not a view.
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
arr2 = np.append(arr, [5, 6, 7, 8])
arr2.shape
(28,)
Use reshape to restore a multidimensional shape:
arr2.reshape((7, 4))
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[ 5, 6, 7, 8]])
append to a specific axis
matrix = np.array(np.arange(9)).reshape((3, 3))
matrix
array([[0, 1, 2],
[3, 4, 5],
[6, 7, 8]])
Append new matrix as new rows:
new_matrix = np.array(np.arange(9) + 10).reshape((3, 3))
np.append(matrix, new_matrix, axis=0)
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[10, 11, 12],
[13, 14, 15],
[16, 17, 18]])
Append new matrix as new columns:
hstack = np.append(matrix, new_matrix, axis=1)
hstack
array([[ 0, 1, 2, 10, 11, 12],
[ 3, 4, 5, 13, 14, 15],
[ 6, 7, 8, 16, 17, 18]])
Horizontal stacking: hstack
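Convenience function that stacks arrays horizontally (column-wise): it concatenates along the second axis, or along the first axis for 1-D arrays. Returns a copy, not a view. (Same as append with axis=1 for the 2-D case above.)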
a = np.array(np.arange(9)).reshape((3, 3))
b = np.array(np.arange(9) + 10).reshape((3, 3))
haystack = np.hstack((a, b))
haystack
array([[ 0, 1, 2, 10, 11, 12],
[ 3, 4, 5, 13, 14, 15],
[ 6, 7, 8, 16, 17, 18]])
insert
Inserts values before the given index along the given axis. Returns a new array; the original is unchanged.
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
np.insert(arr, 1, 444, axis=0)
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[444, 444, 444, 444],
[444, 444, 444, 444],
[444, 444, 444, 444]],
[[ 12, 13, 14, 15],
[ 16, 17, 18, 19],
[ 20, 21, 22, 23]]])
np.insert(arr, 1, 444, axis=1)
array([[[ 0, 1, 2, 3],
[444, 444, 444, 444],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[ 12, 13, 14, 15],
[444, 444, 444, 444],
[ 16, 17, 18, 19],
[ 20, 21, 22, 23]]])
delete
arr
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
Delete element 1 at axis 0:
np.delete(arr, 1, axis=0)
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]]])
Delete element 0 at axis 1:
np.delete(arr, 0, axis=1)
array([[[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[16, 17, 18, 19],
[20, 21, 22, 23]]])
Delete element 2 at axis 2:
np.delete(arr, 2, axis=2)
array([[[ 0, 1, 3],
[ 4, 5, 7],
[ 8, 9, 11]],
[[12, 13, 15],
[16, 17, 19],
[20, 21, 23]]])
Joining and splitting
concatenate
Returns a copy not a view
import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
np.concatenate((arr1, arr2), axis=0)
np.concatenate((arr1, arr2), axis=1)
array([[1, 2], [3, 4],
[5, 6], [7, 8]])
array([[1, 2, 5, 6],
[3, 4, 7, 8]])
stack
np.stack([arr1, arr2], axis=0)
array([[[1, 2], [3, 4]],
[[5, 6], [7, 8]]])
split
temp = np.arange(9).reshape((3, 3))
np.split(temp, 3, axis=0)
[array([[0, 1, 2]]), array([[3, 4, 5]]), array([[6, 7, 8]])]
Rearrange elements
fliplr
Reverse order of elements along the second axis
orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.fliplr(orig_array)
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
array([[ 4, 3, 2, 1, 0],
[ 9, 8, 7, 6, 5],
[14, 13, 12, 11, 10]])
orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.fliplr(orig_array)
array([[[0, 1], [2, 3]],
[[4, 5], [6, 7]]])
array([[[2, 3], [0, 1]],
[[6, 7], [4, 5]]])
flipud
Reverse order of elements along the first axis
orig_array = np.array(np.arange(15)).reshape((3, 5))
orig_array
np.flipud(orig_array)
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
array([[10, 11, 12, 13, 14],
[ 5, 6, 7, 8, 9],
[ 0, 1, 2, 3, 4]])
orig_array = np.array(np.arange(8)).reshape((2, 2, 2))
orig_array
np.flipud(orig_array)
array([[[0, 1], [2, 3]],
[[4, 5], [6, 7]]])
array([[[4, 5], [6, 7]],
[[0, 1], [2, 3]]])
roll
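Shift elements by n positions, wrapping around at the end. When no axis is given, the array is flattened, shifted, and restored to its original shape. Elements move toward higher indices for n > 0 and toward lower indices for n < 0.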
arr
np.roll(arr, 4)
print('------')
np.roll(arr, 5)
np.roll(arr, 1)
print('------')
np.roll(arr, -1)
array([[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
array([[[20, 21, 22, 23],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7]],
[[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19]]])
------
array([[[19, 20, 21, 22],
[23, 0, 1, 2],
[ 3, 4, 5, 6]],
[[ 7, 8, 9, 10],
[11, 12, 13, 14],
[15, 16, 17, 18]]])
array([[[23, 0, 1, 2],
[ 3, 4, 5, 6],
[ 7, 8, 9, 10]],
[[11, 12, 13, 14],
[15, 16, 17, 18],
[19, 20, 21, 22]]])
------
array([[[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]],
[[13, 14, 15, 16],
[17, 18, 19, 20],
[21, 22, 23, 0]]])
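The examples above all use the flattened form. A brief sketch of what the optional axis argument changes; the names flat_roll, row_roll and block_roll are illustrative.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))

# Without axis: flatten, shift, restore the shape.
flat_roll = np.roll(arr, 1)
same = np.roll(arr.ravel(), 1).reshape(arr.shape)
print(np.array_equal(flat_roll, same))   # True

# With axis=2, each row of four elements is shifted independently.
row_roll = np.roll(arr, 1, axis=2)
print(row_roll[0, 0].tolist())           # [3, 0, 1, 2]

# With axis=0, the two (3, 4) blocks swap places.
block_roll = np.roll(arr, 1, axis=0)
print(block_roll[0, 0].tolist())         # [12, 13, 14, 15]
```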
rot90
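Rotate by 90 degrees in the plane of the first two axes: counterclockwise for k > 0 (default k=1), clockwise for k < 0.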
orig_array
array([[[0, 1], [2, 3]],
[[4, 5], [6, 7]]])
np.rot90(orig_array)
array([[[2, 3], [6, 7]],
[[0, 1], [4, 5]]])
np.rot90(orig_array, k=-1)
array([[[4, 5], [0, 1]],
[[6, 7], [2, 3]]])
Transpose-like operations
arr_3x8 = np.array(np.arange(24)).reshape((3, 8))
arr_3x8
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23]])
transpose
np.transpose(arr_3x8)
array([[ 0, 8, 16],
[ 1, 9, 17],
[ 2, 10, 18],
[ 3, 11, 19],
[ 4, 12, 20],
[ 5, 13, 21],
[ 6, 14, 22],
[ 7, 15, 23]])
np.transpose(arr_3x8, axes=(1, 0))
array([[ 0, 8, 16],
[ 1, 9, 17],
[ 2, 10, 18],
[ 3, 11, 19],
[ 4, 12, 20],
[ 5, 13, 21],
[ 6, 14, 22],
[ 7, 15, 23]])
swapaxes
arr_3x2x4 = np.array(np.arange(24)).reshape((3, 2, 4))
arr_3x2x4
np.swapaxes(arr_3x2x4, axis1=0, axis2=2)
array([[[ 0, 1, 2, 3], [ 4, 5, 6, 7]],
[[ 8, 9, 10, 11], [12, 13, 14, 15]],
[[16, 17, 18, 19], [20, 21, 22, 23]]])
array([[[ 0, 8, 16], [ 4, 12, 20]],
[[ 1, 9, 17], [ 5, 13, 21]],
[[ 2, 10, 18], [ 6, 14, 22]],
[[ 3, 11, 19], [ 7, 15, 23]]])
rollaxis
A view is returned.
mat_4d = np.ones((3, 4, 5, 6))
rolled = np.rollaxis(mat_4d, 1)
rolled.shape
(4, 3, 5, 6)
Applications
Universal functions
np.frompyfunc
ufunc: a function that operates on ndarrays element-wise, supporting array broadcasting, type casting, and other standard features. np.frompyfunc wraps a Python function so that it can be called element-wise on arrays.
import numpy as np
def truncated_binomial(x):
    return (x + 1)**3 - x**3
truncated_binomial(4)
61
args: the function, number of input arguments, number of output values
nums = np.ones(6).reshape((2, 3)) * 4
nums
trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
trunc_binom(nums)
array([[ 4., 4., 4.],
[ 4., 4., 4.]])
array([[61.0, 61.0, 61.0],
[61.0, 61.0, 61.0]], dtype=object)
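Note the dtype=object in the result above. A small sketch of two ways to get a numeric array back, assuming the wrapped function uses only arithmetic operators (as truncated_binomial does); the names as_objects, as_floats and direct are illustrative.

```python
import numpy as np

def truncated_binomial(x):
    return (x + 1)**3 - x**3

nums = np.ones(6).reshape((2, 3)) * 4

# frompyfunc always produces object arrays; astype converts them back.
trunc_binom = np.frompyfunc(truncated_binomial, 1, 1)
as_objects = trunc_binom(nums)
as_floats = as_objects.astype(float)
print(as_objects.dtype, as_floats.dtype)   # object float64

# Because the function uses only arithmetic operators, it can be applied
# to the array directly, which keeps the float64 dtype.
direct = truncated_binomial(nums)
print(np.array_equal(direct, as_floats))   # True
```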
Linear algebra
Matrices
Common operations are accessed via properties instead of functions
my_matrix = np.matrix([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
my_matrix
matrix([[3, 1, 4],
[1, 5, 9],
[2, 6, 5]])
Transpose:
my_matrix.T
matrix([[3, 1, 2],
[1, 5, 6],
[4, 9, 5]])
Inverse:
my_matrix.I
matrix([[ 0.3222, -0.2111, 0.1222],
[-0.1444, -0.0778, 0.2556],
[ 0.0444, 0.1778, -0.1556]])
Identity matrices
np.eye(3, dtype=int)
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
Solving systems of linear equations
\begin{align} A \mathbf{x} &= \mathbf{b} \\ A^{-1} A \mathbf{x} &= A^{-1}\mathbf{b} \\ \mathbf{x} &= A^{-1}\mathbf{b} \end{align}
my_matrix
rhs = np.matrix([[11], [22], [33]])
inverse = my_matrix.I
solution = inverse * rhs
solution
matrix([[3, 1, 4],
[1, 5, 9],
[2, 6, 5]])
matrix([[ 2.9333],
[ 5.1333],
[-0.7333]])
Optimized version: solve the system directly, which is faster and more numerically stable than forming the inverse.
from numpy.linalg import solve
solve(my_matrix, rhs)
matrix([[ 2.9333],
[ 5.1333],
[-0.7333]])
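A solution can be checked by substituting it back into the system. The sketch below uses plain ndarrays and the @ operator instead of np.matrix, which is an assumption about style rather than something these notes prescribe; A, b and x are illustrative names.

```python
import numpy as np

A = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
b = np.array([11, 22, 33])

x = np.linalg.solve(A, b)
print(x)

# Substituting back should reproduce the right-hand side.
print(np.allclose(A @ x, b))                  # True

# The explicit-inverse route gives the same answer.
print(np.allclose(np.linalg.inv(A) @ b, x))   # True
```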
Computing eigenvalues / eigenvectors
Use eig to compute the eigenvalues and right eigenvectors of the given matrix:
from numpy.linalg import eig
eigvals, eigvects = eig(my_matrix)
eigvals
eigvects
array([ 13.0858, 2.58 , -2.6658])
matrix([[-0.3154, -0.9512, -0.3237],
[-0.7231, 0.3078, -0.7022],
[-0.6146, 0.0229, 0.6341]])
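The eigenvectors are returned as columns, which is easy to get wrong. A minimal check of the defining relation A v = λ v, written with a plain ndarray (an illustrative choice; the names A, checks and v are not from the notes).

```python
import numpy as np

A = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
eigvals, eigvects = np.linalg.eig(A)

# Column i of eigvects is the eigenvector paired with eigvals[i].
checks = []
for i in range(len(eigvals)):
    v = eigvects[:, i]
    checks.append(np.allclose(A @ v, eigvals[i] * v))
print(checks)   # [True, True, True]
```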
Pattern detection
Given a sequence of numbers as an array, find the next number in the sequence.
import numpy as np
seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])
np.diff(seq_array) # calculate first differences
np.diff(seq_array, n=2) # second differences
np.diff(seq_array, n=3) # third differences
array([ 6, 12, 18, 24, 30, 36, 42, 48, 54, 60])
array([6, 6, 6, 6, 6, 6, 6, 6, 6])
array([0, 0, 0, 0, 0, 0, 0, 0])
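Since the second differences are constant and the third differences are zero, the sequence is quadratic, and the next term can be rebuilt by adding the differences back up. A minimal sketch; the polynomial fit is only a cross-check, and the names next_first, next_value and coeffs are illustrative.

```python
import numpy as np

seq_array = np.array([1, 7, 19, 37, 61, 91, 127, 169, 217, 271, 331])

first = np.diff(seq_array)
second = np.diff(seq_array, n=2)

# Constant second differences: extend them by one step and add back up.
next_first = first[-1] + second[-1]       # 60 + 6
next_value = seq_array[-1] + next_first   # 331 + 66
print(next_value)                         # 397

# Cross-check: fit a degree-2 polynomial and evaluate it at the next index.
coeffs = np.polyfit(np.arange(len(seq_array)), seq_array, deg=2)
print(round(float(np.polyval(coeffs, len(seq_array)))))   # 397
```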
Symbolic Python
Best used in Jupyter notebooks, where output is typeset much like in Wolfram or Matlab.
from sympy import init_session
init_session()
Statistics
Basic statistics:
mean, median, min, max, std, var
import scipy as sp
import numpy as np
from scipy.stats import norm
Generate a data set from normally distributed data points.
number_of_data_points = 10000
data_set = np.random.randn(number_of_data_points)
type(data_set)
<class 'numpy.ndarray'>
mean
data_set.mean()
0.0016312105909250228
np.median
np.median(data_set)
0.0028936738357498988
min
data_set.min()
-3.4733354484199768
max
data_set.max()
3.5011182300650168
np.std
np.std(data_set)
1.0046141873536045
np.var
np.var(data_set)
1.0092496654321432
Probability distributions
- Continuous
  - Normal: norm
  - Chi squared: chi2
  - Student's t: t
  - Uniform: uniform
- Discrete
  - Poisson: poisson
  - Binomial: binom
Example: Normal Distribution
Print random variates from the IQ distribution
iq_mean = 100
iq_std_dev = 15
iq_distribution = norm(loc=iq_mean, scale=iq_std_dev)
for n in np.arange(8):
    print('{:6.2f}'.format(iq_distribution.rvs()))
98.10
112.58
107.61
111.16
95.47
71.72
112.54
87.38
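Besides drawing random variates, a frozen distribution object also exposes cdf and ppf. A small sketch on the same IQ distribution; the chosen scores, the seed and the name samples are illustrative.

```python
from scipy.stats import norm

iq_distribution = norm(loc=100, scale=15)

# Fraction of the population at or below one standard deviation above the mean.
print(round(iq_distribution.cdf(115), 4))     # 0.8413

# Score below which 97.5% of the population falls (inverse of the cdf).
print(round(iq_distribution.ppf(0.975), 1))   # 129.4

# random_state makes the draws reproducible.
samples = iq_distribution.rvs(size=5, random_state=0)
print(samples.shape)                          # (5,)
```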
Print a histogram
import numpy as np
import matplotlib.pyplot as plt
mu, sigma = 100, 15
dataset = mu + sigma * np.random.randn(10_000)
# density=True normalizes the histogram so that the bar areas sum to 1
# (it replaces the normed argument, which was removed in Matplotlib 3.1)
n, bins, patches = plt.hist(dataset, 50, density=True, facecolor='g', alpha=0.75)
plt.xlabel('IQ Score')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()
Pandas
Object creation
Integer index (default)
import pandas as pd
import numpy as np
default_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(default_series)
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Datetime index
dates_index = pd.date_range('20170101', periods=6)
print(dates_index)
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06'],
dtype='datetime64[ns]', freq='D')
Sample numpy data
print(np.arange(5))
print(np.array(np.arange(5)))
sample_data = np.array(np.arange(24)).reshape((6, 4))
print(sample_data)
[0 1 2 3 4]
[0 1 2 3 4]
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
Data Frames
With specified index and column headers
sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))
print(sample_df)
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
From a python dictionary
py_dict_to_df = pd.DataFrame(
    dict(
        float=1.0,
        time=pd.Timestamp('20170101'),
        series=pd.Series(1, index=list(range(4)), dtype='float32'),
        array=np.array([3] * 4, dtype='int32'),
        categories=pd.Categorical(['test', 'train', 'taxes', 'tools']),
        dull='boring data'))
print(py_dict_to_df)
array categories dull float series time
0 3 test boring data 1.0 1.0 2017-01-01
1 3 train boring data 1.0 1.0 2017-01-01
2 3 taxes boring data 1.0 1.0 2017-01-01
3 3 tools boring data 1.0 1.0 2017-01-01
Attributes info: dtypes
print(py_dict_to_df.dtypes)
array int32
categories category
dull object
float float64
series float32
time datetime64[ns]
dtype: object
Peeking: head and tail
print(py_dict_to_df.head())
print(py_dict_to_df.tail(2))
array categories dull float series time
0 3 test boring data 1.0 1.0 2017-01-01
1 3 train boring data 1.0 1.0 2017-01-01
2 3 taxes boring data 1.0 1.0 2017-01-01
3 3 tools boring data 1.0 1.0 2017-01-01
array categories dull float series time
2 3 taxes boring data 1.0 1.0 2017-01-01
3 3 tools boring data 1.0 1.0 2017-01-01
Underlying data: values, index, columns
values
print(py_dict_to_df.values)
print(sample_df.values)
[[3 'test' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
[3 'train' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
[3 'taxes' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]
[3 'tools' 'boring data' 1.0 1.0 Timestamp('2017-01-01 00:00:00')]]
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
index
print(py_dict_to_df.index)
print(sample_df.index)
Int64Index([0, 1, 2, 3], dtype='int64')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06'],
dtype='datetime64[ns]', freq='D')
columns
print(py_dict_to_df.columns)
print(sample_df.columns)
Index(['array', 'categories', 'dull', 'float', 'series', 'time'], dtype='object')
Index(['A', 'B', 'C', 'D'], dtype='object')
Statistical summary: describe
print(py_dict_to_df.describe())
array float series
count 4.0 4.0 4.0
mean 3.0 1.0 1.0
std 0.0 0.0 0.0
min 3.0 1.0 1.0
25% 3.0 1.0 1.0
50% 3.0 1.0 1.0
75% 3.0 1.0 1.0
max 3.0 1.0 1.0
print(sample_df.describe())
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 10.000000 11.000000 12.000000 13.000000
std 7.483315 7.483315 7.483315 7.483315
min 0.000000 1.000000 2.000000 3.000000
25% 5.000000 6.000000 7.000000 8.000000
50% 10.000000 11.000000 12.000000 13.000000
75% 15.000000 16.000000 17.000000 18.000000
max 20.000000 21.000000 22.000000 23.000000
Control floating-point display precision (and other options): set_option
pd.set_option('display.precision', 2)
print(sample_df.describe())
A B C D
count 6.00 6.00 6.00 6.00
mean 10.00 11.00 12.00 13.00
std 7.48 7.48 7.48 7.48
min 0.00 1.00 2.00 3.00
25% 5.00 6.00 7.00 8.00
50% 10.00 11.00 12.00 13.00
75% 15.00 16.00 17.00 18.00
max 20.00 21.00 22.00 23.00
Transpose: dataframe.T
print(sample_df.T)
2017-01-01 2017-01-02 2017-01-03 2017-01-04 2017-01-05 2017-01-06
A 0 4 8 12 16 20
B 1 5 9 13 17 21
C 2 6 10 14 18 22
D 3 7 11 15 19 23
Sort by an axis: sort_index
axis: 0 for rows, 1 for cols
# sort columns in descending order
print(sample_df.sort_index(axis=1, ascending=False))
D C B A
2017-01-01 3 2 1 0
2017-01-02 7 6 5 4
2017-01-03 11 10 9 8
2017-01-04 15 14 13 12
2017-01-05 19 18 17 16
2017-01-06 23 22 21 20
# sort rows in descending order
print(sample_df.sort_index(axis=0, ascending=False))
A B C D
2017-01-06 20 21 22 23
2017-01-05 16 17 18 19
2017-01-04 12 13 14 15
2017-01-03 8 9 10 11
2017-01-02 4 5 6 7
2017-01-01 0 1 2 3
Sort by data within a given column: sort_values
# sort by the 'B' column in descending order; whole rows move together
print(sample_df.sort_values(by='B', ascending=False))
A B C D
2017-01-06 20 21 22 23
2017-01-05 16 17 18 19
2017-01-04 12 13 14 15
2017-01-03 8 9 10 11
2017-01-02 4 5 6 7
2017-01-01 0 1 2 3
Selecting values
Pandas in production
For production (as opposed to interactive) work, the pandas team recommends the optimized data access methods: .at, .iat, .loc, .iloc.
.at: Fast label-based scalar accessor
.iat: Fast integer-location scalar accessor
.loc: Purely label-location based indexer for selection by label
.iloc: Purely integer-location based indexing for selection by position
.ix: A primarily label-location based indexer, with integer position fallback. Deprecated in pandas 0.20 and removed in 1.0; use .loc or .iloc instead.
See the docs for more details.
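The differences between the accessors listed above can be sketched as follows. The frame mirrors the sample data used in this section; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

dates_index = pd.date_range('20160601', periods=6)
sample_df = pd.DataFrame(
    np.arange(24).reshape((6, 4)), index=dates_index, columns=list('ABCD'))

# Scalar access: .at by label, .iat by integer position.
print(sample_df.at[dates_index[2], 'C'])   # 10
print(sample_df.iat[2, 2])                 # 10

# .loc slices by label and includes both endpoints;
# .iloc slices by position and excludes the stop.
by_label = sample_df.loc['2016-06-02':'2016-06-03', ['A', 'B']]
by_position = sample_df.iloc[1:3, 0:2]
print(by_label.equals(by_position))        # True
```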
import numpy as np
import pandas as pd
sample_numpy_data = np.array(np.arange(24)).reshape((6, 4))
dates_index = pd.date_range('20160601', periods=6)
sample_df = pd.DataFrame(
    sample_numpy_data, index=dates_index, columns=list('ABCD'))
print(sample_df.head())
A B C D
2016-06-01 0 1 2 3
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
2016-06-04 12 13 14 15
2016-06-05 16 17 18 19
Selection using column name
col_c = sample_df['C']
print(col_c)
2016-06-01 2
2016-06-02 6
2016-06-03 10
2016-06-04 14
2016-06-05 18
2016-06-06 22
Freq: D, Name: C, dtype: int64
Selection using slice
first_4_rows = sample_df[:4]
print(first_4_rows)
A B C D
2016-06-01 0 1 2 3
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
2016-06-04 12 13 14 15
Selection by datetime index
first_four_periods = sample_df['2016-06-01':'2016-06-04']
print(first_four_periods)
A B C D
2016-06-01 0 1 2 3
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
2016-06-04 12 13 14 15
Selection by label
print(dates_index[1:3])
date_selection = sample_df.loc[dates_index[1:3]]
print(date_selection)
DatetimeIndex(['2016-06-02', '2016-06-03'], dtype='datetime64[ns]', freq='D')
A B C D
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
Selection (multi-axis) by label
all_rows_of_cols_a_and_b = sample_df.loc[:, ['A', 'B']]
print(all_rows_of_cols_a_and_b)
A B
2016-06-01 0 1
2016-06-02 4 5
2016-06-03 8 9
2016-06-04 12 13
2016-06-05 16 17
2016-06-06 20 21
Label slicing, including both endpoints
a_and_b_between_dates = sample_df.loc['2016-06-01':'2016-06-04', ['A', 'B']]
print(a_and_b_between_dates)
A B
2016-06-01 0 1
2016-06-02 4 5
2016-06-03 8 9
2016-06-04 12 13
Reduce dimensions of returned object
print(sample_df.loc['2016-06-03', ['D', 'B']])
print(sample_df.loc['2016-06-03', ['B', 'D']])
D 11
B 9
Name: 2016-06-03 00:00:00, dtype: int64
B 9
D 11
Name: 2016-06-03 00:00:00, dtype: int64
Working with result objects
result = sample_df.loc['2016-06-03', ['D', 'B']]
print(result.iloc[0] * 4)
44
Selecting scalars
print(sample_df.loc[:, 'C'])
print('------------')
print(dates_index[2])
print('------------')
print(sample_df.loc[dates_index[2], 'C'])
2016-06-01 2
2016-06-02 6
2016-06-03 10
2016-06-04 14
2016-06-05 18
2016-06-06 22
Freq: D, Name: C, dtype: int64
------------
2016-06-03 00:00:00
------------
10
Selecting by position: iloc
sample_numpy_data[3]
array([12, 13, 14, 15])
sample_df.iloc[3]
A 12
B 13
C 14
D 15
Name: 2016-06-04 00:00:00, dtype: int64
Selecting using integer slices with iloc
sample_df.iloc[1:3, 2:4]
C D
2016-06-02 6 7
2016-06-03 10 11
Selecting lists of rows and columns with iloc
sample_df.iloc[[0, 1, 3], [0, 2]]
A C
2016-06-01 0 2
2016-06-02 4 6
2016-06-04 12 14
Slicing rows explicitly (selecting all cols implicitly)
sample_df.iloc[1:3, :]
A B C D
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
Slicing cols explicitly, all rows implicitly
sample_df.iloc[:, 1:3]
B C
2016-06-01 1 2
2016-06-02 5 6
2016-06-03 9 10
2016-06-04 13 14
2016-06-05 17 18
2016-06-06 21 22
Boolean indexing
Test based upon one column’s data
sample_df.C >= 14
2016-06-01 False
2016-06-02 False
2016-06-03 False
2016-06-04 True
2016-06-05 True
2016-06-06 True
Freq: D, Name: C, dtype: bool
Test based upon the entire data set
sample_df
sample_df[sample_df >= 14]
A B C D
2016-06-01 0 1 2 3
2016-06-02 4 5 6 7
2016-06-03 8 9 10 11
2016-06-04 12 13 14 15
2016-06-05 16 17 18 19
2016-06-06 20 21 22 23
A B C D
2016-06-01 NaN NaN NaN NaN
2016-06-02 NaN NaN NaN NaN
2016-06-03 NaN NaN NaN NaN
2016-06-04 NaN NaN 14.0 15.0
2016-06-05 16.0 17.0 18.0 19.0
2016-06-06 20.0 21.0 22.0 23.0
isin method
Returns a boolean Series showing whether each element in the Series is exactly contained in the passed sequence of values.
sample_df_2 = sample_df.copy()
sample_df_2['Fruits'] = [
    'apple', 'orange', 'banana', 'strawberry', 'blueberry', 'pineapple'
]
sample_df_2
A B C D Fruits
2016-06-01 0 1 2 3 apple
2016-06-02 4 5 6 7 orange
2016-06-03 8 9 10 11 banana
2016-06-04 12 13 14 15 strawberry
2016-06-05 16 17 18 19 blueberry
2016-06-06 20 21 22 23 pineapple
Generate a boolean vector describing whether each value in the given column is in the given set of values.
selection = sample_df_2['Fruits'].isin(['banana', 'pineapple', 'smoothy'])
print(selection)
2016-06-01 False
2016-06-02 False
2016-06-03 True
2016-06-04 False
2016-06-05 False
2016-06-06 True
Freq: D, Name: Fruits, dtype: bool
Select all rows whose value in the given column is in the given set of values.
sample_df_2[selection]
A B C D Fruits
2016-06-03 8 9 10 11 banana
2016-06-06 20 21 22 23 pineapple
Missing data
import numpy as np
import pandas as pd
start_date = '20160101'
dates_index = pd.date_range(start_date, periods=6)
sample_data = np.array(np.arange(24)).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=dates_index, columns=list('ABCD'))
sample_df_2 = sample_df.copy()
sample_df_2[
'Fruits'] = 'apple orange banana strawberry blueberry pineapple'.split()
sample_series = pd.Series(
np.arange(6) + 1, index=pd.date_range(start_date, periods=6))
sample_df_2['Extra Data'] = sample_series * 3 + 1
second_numpy_array = np.array(np.arange(len(sample_df_2))) * 100 + 7
sample_df_2['G'] = second_numpy_array
sample_df_2
A B C D Fruits Extra Data G
2016-01-01 0 1 2 3 apple 4 7
2016-01-02 4 5 6 7 orange 7 107
2016-01-03 8 9 10 11 banana 10 207
2016-01-04 12 13 14 15 strawberry 13 307
2016-01-05 16 17 18 19 blueberry 16 407
2016-01-06 20 21 22 23 pineapple 19 507
reindex
Creates a copy rather than a view
browser_index = 'Firefox Chrome Safari IE10 Konqueror'.split()
browser_df = pd.DataFrame(
    dict(
        http_status=[200, 200, 404, 404, 301],
        response_time=[0.04, 0.02, 0.07, 0.08, 1.0]),
    index=browser_index)
browser_df
http_status response_time
Firefox 200 0.04
Chrome 200 0.02
Safari 404 0.07
IE10 404 0.08
Konqueror 301 1.00
Create a reindexed copy
new_index = 'Safari Iceweasel ComodoDragon IE10 Chrome'.split()
browser_df_2 = browser_df.reindex(new_index)
browser_df_2
http_status response_time
Safari 404.0 0.07
Iceweasel NaN NaN
ComodoDragon NaN NaN
IE10 404.0 0.08
Chrome 200.0 0.02
Drop rows with missing data
browser_df_3 = browser_df_2.dropna(how='any')
browser_df_3
http_status response_time
Safari 404.0 0.07
IE10 404.0 0.08
Chrome 200.0 0.02
Fill in missing data
browser_df_2.fillna(value=-0.05555)
http_status response_time
Safari 404.00000 0.07000
Iceweasel -0.05555 -0.05555
ComodoDragon -0.05555 -0.05555
IE10 404.00000 0.08000
Chrome 200.00000 0.02000
Boolean mask for NA values
pd.isnull(browser_df_2)
http_status response_time
Safari False False
Iceweasel True True
ComodoDragon True True
IE10 False False
Chrome False False
NaNs propagate during calculations
browser_df_2 * 3 + 10
http_status response_time
Safari 1222.0 10.21
Iceweasel NaN NaN
ComodoDragon NaN NaN
IE10 1222.0 10.24
Chrome 610.0 10.06
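Two related behaviours are worth sketching here: fillna accepts a per-column mapping, and reductions such as sum skip NaN unless told otherwise. The three-row frame below is a cut-down stand-in for browser_df_2, and the fill values are arbitrary illustrations.

```python
import numpy as np
import pandas as pd

browser_df_2 = pd.DataFrame(
    dict(http_status=[404, np.nan, 200], response_time=[0.07, np.nan, 0.02]),
    index=['Safari', 'Iceweasel', 'Chrome'])

# Count missing values per column.
print(browser_df_2.isna().sum().tolist())   # [1, 1]

# Fill each column with its own default instead of one global value.
filled = browser_df_2.fillna({
    'http_status': 0,
    'response_time': browser_df_2['response_time'].mean(),
})
print(filled)

# Reductions skip NaN by default; skipna=False lets it propagate.
print(browser_df_2['http_status'].sum())               # 604.0
print(browser_df_2['http_status'].sum(skipna=False))   # nan
```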
Operations
Descriptive statistics: describe
pd.set_option('display.precision', 2)
sample_df_2.describe()
A B C D Extra Data G
count 6.00 6.00 6.00 6.00 6.00 6.00
mean 10.00 11.00 12.00 13.00 11.50 257.00
std 7.48 7.48 7.48 7.48 5.61 187.08
min 0.00 1.00 2.00 3.00 4.00 7.00
25% 5.00 6.00 7.00 8.00 7.75 132.00
50% 10.00 11.00 12.00 13.00 11.50 257.00
75% 15.00 16.00 17.00 18.00 15.25 382.00
max 20.00 21.00 22.00 23.00 19.00 507.00
Column mean
# numeric_only=True skips the non-numeric Fruits column
sample_df_2.mean(numeric_only=True)
A 10.0
B 11.0
C 12.0
D 13.0
Extra Data 11.5
G 257.0
dtype: float64
Row mean
sample_df_2.mean(axis=1, numeric_only=True)
2016-01-01 2.83
2016-01-02 22.67
2016-01-03 42.50
2016-01-04 62.33
2016-01-05 82.17
2016-01-06 102.00
Freq: D, dtype: float64
apply a function to a data frame
sample_df_2[['A', 'B', 'C', 'Fruits']]
A B C Fruits
2016-01-01 0 1 2 apple
2016-01-02 4 5 6 orange
2016-01-03 8 9 10 banana
2016-01-04 12 13 14 strawberry
2016-01-05 16 17 18 blueberry
2016-01-06 20 21 22 pineapple
sample_df_2[['A', 'B', 'Fruits']].apply(np.cumsum, axis=0)
A B Fruits
2016-01-01 0 1 apple
2016-01-02 4 6 appleorange
2016-01-03 12 15 appleorangebanana
2016-01-04 24 28 appleorangebananastrawberry
2016-01-05 40 45 appleorangebananastrawberryblueberry
2016-01-06 60 66 appleorangebananastrawberryblueberrypineapple
sample_df_2[['A', 'B', 'C']].apply(np.cumsum, axis=1)
A B C
2016-01-01 0 1 3
2016-01-02 4 9 15
2016-01-03 8 17 27
2016-01-04 12 25 39
2016-01-05 16 33 51
2016-01-06 20 41 63
String methods
series = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
series.str.lower()
series.str.len()
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
0 1.0
1 1.0
2 1.0
3 4.0
4 4.0
5 NaN
6 4.0
7 3.0
8 3.0
dtype: float64
Merging data frames
import numpy as np
import pandas as pd
import random as rand
index = np.arange(1, 7)
attrs = 'clicks height score time'.split()
values = rand.sample(range(50), 24)
sample_data = np.array(values).reshape((6, 4))
sample_df = pd.DataFrame(sample_data, index=index, columns=attrs)
sample_df
clicks height score time
1 19 39 11 17
2 7 25 26 44
3 2 8 21 35
4 40 48 36 0
5 6 47 1 3
6 14 43 13 34
concat
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
pd.concat([sample_df[0:3], sample_df[0:3]])
clicks height score time
1 19 39 11 17
2 7 25 26 44
3 2 8 21 35
1 19 39 11 17
2 7 25 26 44
3 2 8 21 35
pd.concat([sample_df.iloc[0:3], sample_df.iloc[0:3]], axis=1)
clicks height score time clicks height score time
1 19 39 11 17 19 39 11 17
2 7 25 26 44 7 25 26 44
3 2 8 21 35 2 8 21 35
join
Join columns with another DataFrame, either on the index or on a key column. Multiple DataFrame objects can be joined by index efficiently in one call by passing a list.
sample_df.join(sample_df.iloc[0:3], how='inner', rsuffix='_r')
clicks height score time clicks_r height_r score_r time_r
1 19 39 11 17 19 39 11 17
2 7 25 26 44 7 25 26 44
3 2 8 21 35 2 8 21 35
append
Append rows of other to the end of this frame, returning a new object. Columns not in this frame are added as new columns.
new_row = pd.DataFrame(dict(clicks=10, height=20, score=30, time=40), index=[10])
sample_df.iloc[0:3].append(new_row)
clicks height score time
1 19 39 11 17
2 7 25 26 44
3 2 8 21 35
10 10 20 30 40
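DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer releases the same result can be obtained with pd.concat; a minimal sketch, assuming a small hypothetical `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'clicks': [19, 7], 'height': [39, 25]}, index=[1, 2])
new_row = pd.DataFrame({'clicks': [10], 'height': [20]}, index=[10])

# Equivalent of frame.append(new_row): returns a new frame, inputs unchanged.
extended = pd.concat([frame, new_row])
```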
merge
Merge DataFrame objects by performing a database-style join operation by columns or indexes.
If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on.
sample_df
clicks height score time
1 19 39 11 17
2 7 25 26 44
3 2 8 21 35
4 40 48 36 0
5 6 47 1 3
6 14 43 13 34
entries = {
    1: dict(height=10, width=20),
    2: dict(height=34, width=35),
    3: dict(height=5, width=80),
    4: dict(height=39, width=32)
}
related_df = pd.DataFrame(entries)
related_df.T
height width
1 10 20
2 34 35
3 5 80
4 39 32
sample_df.merge(related_df.T)
clicks height score time width
0 19 39 11 17 32
sample_df.merge(related_df.T, how='left')
clicks height score time width
0 19 39 11 17 32.0
1 7 25 26 44 NaN
2 2 8 21 35 NaN
3 40 48 36 0 NaN
4 6 47 1 3 NaN
5 14 43 13 34 NaN
sample_df.merge(related_df.T, how='outer')
clicks height score time width
0 19.0 39 11.0 17.0 32.0
1 7.0 25 26.0 44.0 NaN
2 2.0 8 21.0 35.0 NaN
3 40.0 48 36.0 0.0 NaN
4 6.0 47 1.0 3.0 NaN
5 14.0 43 13.0 34.0 NaN
6 NaN 10 NaN NaN 20.0
7 NaN 34 NaN NaN 35.0
8 NaN 5 NaN NaN 80.0
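The merges above rely on pandas picking the shared column name (height) as the key. A sketch that names the key explicitly and records where each row came from, assuming two small hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'height': [39, 25, 8], 'clicks': [19, 7, 2]})
right = pd.DataFrame({'height': [10, 39], 'width': [20, 32]})

# Name the key explicitly rather than relying on shared column names, and
# ask for a `_merge` column that records where each row came from.
merged = left.merge(right, on='height', how='outer', indicator=True)

# Rows whose key appeared in both frames.
both = merged[merged['_merge'] == 'both']
```

The `_merge` column takes the values left_only, right_only and both, which makes unmatched keys easy to audit after an outer join.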
Categoricals
import numpy as np
import pandas as pd
from io import StringIO
csv_data = """
Department,Name,YearsOfService,Grade\n0,Marketing,Able,4,a\n1,Engineering,Baker,7,b\n2,Accounting,Charlie,12,c\n3,Marketing,Delta,1,d\n4,Engineering,Echo,15,f\n5,Accounting,Foxtrot,9,a\n6,Marketing,Golf,3,b\n7,Engineering,Hotel,1,c\n8,Accounting,India,2,d\n9,Marketing,Juliet,5,f\n10,Engineering,Kilo,7,a\n11,Accounting,Lima,11,b\n12,Marketing,Mike,2,c\n13,Engineering,November,3,d\n14,Accounting,Oscar,4,f\n15,Marketing,Papa,9,a\n16,Engineering,Quebec,1,b\n17,Accounting,Romeo,1,c\n18,Marketing,Sierra,1,d\n19,Engineering,Tango,7,f\n20,Accounting,Uniform,5,a\n21,Marketing,Victor,19,b\n22,Engineering,Whiskey,2,c\n23,Accounting,Xray,3,d\n24,Marketing,Yankee,8,f\n25,Engineering,Zulu,17,a\n
"""
employees = pd.read_csv(StringIO(csv_data))
employees.head()
Department Name YearsOfService Grade
0 Marketing Able 4 a
1 Engineering Baker 7 b
2 Accounting Charlie 12 c
3 Marketing Delta 1 d
4 Engineering Echo 15 f
Convert string data to categorical data
employees.dtypes
Department object
Name object
YearsOfService int64
Grade object
dtype: object
employees['Department'] = employees['Department'].astype('category')
employees.dtypes
Department category
Name object
YearsOfService int64
Grade object
dtype: object
Rename categories
employees['Grade'] = employees['Grade'].astype('category')
employees['Grade'].cat.categories = 'excellent good acceptable poor unacceptable'.split()
employees.head()
Department Name YearsOfService Grade
0 Marketing Able 4 excellent
1 Engineering Baker 7 good
2 Accounting Charlie 12 acceptable
3 Marketing Delta 1 poor
4 Engineering Echo 15 unacceptable
Categories before and after renaming:
# Index(['a', 'b', 'c', 'd', 'f'], dtype='object')
# Index(['excellent', 'good', 'acceptable', 'poor', 'unacceptable'], dtype='object')
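Assigning to .cat.categories works on the pandas version used here, but newer releases (2.0 and later) no longer accept it. A sketch of the same renaming with rename_categories, assuming a short hypothetical `grades` series:

```python
import pandas as pd

grades = pd.Series(list('abcdfa'), dtype='category')
before = list(grades.cat.categories)

# rename_categories returns a new Series; a dict makes the mapping explicit.
labels = {'a': 'excellent', 'b': 'good', 'c': 'acceptable',
          'd': 'poor', 'f': 'unacceptable'}
renamed = grades.cat.rename_categories(labels)
after = list(renamed.cat.categories)
```

A dict mapping avoids depending on the order of the existing categories, which a positional list would silently rely on.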
Grouping
Total years of service across the employees in each department.
employees.groupby('Department').sum()
YearsOfService
Department
Accounting 47
Engineering 60
Marketing 52
Number of employees per grade.
employees.groupby('Grade').count()['Name']
Grade
excellent 6
good 5
acceptable 5
poor 5
unacceptable 5
Name: Name, dtype: int64
Number of employees, by department, obtaining each grade.
employees.groupby(['Grade', 'Department']).count()['Name']
Grade Department
excellent Accounting 2
Engineering 2
Marketing 2
good Accounting 1
Engineering 2
Marketing 2
acceptable Accounting 2
Engineering 2
Marketing 1
poor Accounting 2
Engineering 1
Marketing 2
unacceptable Accounting 1
Engineering 2
Marketing 2
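The counts above come from count() followed by a column selection. Two alternative spellings are sketched below, assuming a tiny hypothetical `staff` frame rather than the full employees data:

```python
import pandas as pd

staff = pd.DataFrame({
    'Department': ['Marketing', 'Engineering', 'Marketing', 'Accounting'],
    'Grade': ['a', 'b', 'a', 'b'],
})

# size() counts rows per group directly, without picking a column afterwards.
per_grade = staff.groupby('Grade').size()

# crosstab lays the same two-key count out as a grade-by-department table.
table = pd.crosstab(staff['Grade'], staff['Department'])
```

Note that count() skips missing values in the selected column, while size() counts every row in the group.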
Time series resampling
Create a date range to use as an index: pandas.date_range
my_index = pd.date_range('9/1/2016', periods=9, freq='min')
my_index
DatetimeIndex(['2016-09-01 00:00:00', '2016-09-01 00:01:00',
'2016-09-01 00:02:00', '2016-09-01 00:03:00',
'2016-09-01 00:04:00', '2016-09-01 00:05:00',
'2016-09-01 00:06:00', '2016-09-01 00:07:00',
'2016-09-01 00:08:00'],
dtype='datetime64[ns]', freq='T')
Create a time series that includes a simple pattern: pandas.Series
my_series = pd.Series(np.arange(9), index=my_index)
my_series
2016-09-01 00:00:00 0
2016-09-01 00:01:00 1
2016-09-01 00:02:00 2
2016-09-01 00:03:00 3
2016-09-01 00:04:00 4
2016-09-01 00:05:00 5
2016-09-01 00:06:00 6
2016-09-01 00:07:00 7
2016-09-01 00:08:00 8
Freq: T, dtype: int64
Downsampling: pandas.Series.resample
my_series.resample('3min')
DatetimeIndexResampler [freq=<3 * Minutes>, axis=0, closed=left, label=left, convention=start, base=0]
my_series.resample('3min').sum()
2016-09-01 00:00:00 3
2016-09-01 00:03:00 12
2016-09-01 00:06:00 21
Freq: 3T, dtype: int64
Use upper bound for each time period as the label.
my_series.resample('3min', label='right').sum()
2016-09-01 00:03:00 3
2016-09-01 00:06:00 12
2016-09-01 00:09:00 21
Freq: 3T, dtype: int64
Close the right side of the bin interval.
my_series.resample('3min', label='right', closed='right').sum()
2016-09-01 00:00:00 0
2016-09-01 00:03:00 6
2016-09-01 00:06:00 15
2016-09-01 00:09:00 15
Freq: 3T, dtype: int64
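The closed argument decides which edge of each bin is inclusive, which is why the last call above produces four bins instead of three. A sketch that rebuilds the same nine-minute series under the hypothetical name `series` and compares the two settings:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2016-09-01', periods=9, freq='min')
series = pd.Series(np.arange(9), index=index)

# Default closed='left': bins are [00:00, 00:03), [00:03, 00:06), [00:06, 00:09).
left_closed = series.resample('3min').sum()

# closed='right': bins are (23:57, 00:00], (00:00, 00:03], (00:03, 00:06], ...
# so 00:00 sits alone in the first bin and 00:07, 00:08 form a partial last bin.
right_closed = series.resample('3min', label='right', closed='right').sum()
```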
Upsampling
my_series.resample('30s').asfreq().head()
2016-09-01 00:00:00 0.0
2016-09-01 00:00:30 NaN
2016-09-01 00:01:00 1.0
2016-09-01 00:01:30 NaN
2016-09-01 00:02:00 2.0
Freq: 30S, dtype: float64
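Upsampling with asfreq() leaves the new slots as NaN. A sketch of two ways to fill them, assuming a three-point hypothetical series:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2016-09-01', periods=3, freq='min')
series = pd.Series(np.arange(3), index=index)

# asfreq() leaves the new 30-second slots empty (NaN).
sparse = series.resample('30s').asfreq()

# ffill() repeats the last known value into each gap.
filled = series.resample('30s').ffill()

# interpolate() draws a straight line between known points (linear by default).
smooth = series.resample('30s').interpolate()
```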
Custom function to use with resampling
def custom_arithmetic(array_like):
    temp = 3 * np.sum(array_like) + 5
    return temp
my_series.resample('3min').apply(custom_arithmetic)
2016-09-01 00:00:00 14
2016-09-01 00:03:00 41
2016-09-01 00:06:00 68
Freq: 3T, dtype: int64
Series
Create series
my_simple_series = pd.Series(np.random.randn(5), index=list('abcde'))
my_simple_series
a 1.186168
b 0.606623
c 1.862614
d -1.180305
e 0.615774
dtype: float64
my_dictionary = dict(a=45, b=-19.5, c=4444)
my_second_series = pd.Series(my_dictionary)
my_second_series
a 45.0
b -19.5
c 4444.0
dtype: float64
pd.Series(my_dictionary, index=list('bcda'))
b -19.5
c 4444.0
d NaN
a 45.0
dtype: float64
my_dictionary.get('a')
45
legit = my_dictionary.get('a')
type(legit)
unknown = my_dictionary.get('f')
type(unknown)
<class 'int'>
<class 'NoneType'>
Create a series from a scalar
pd.Series(5, index=list('abcd'))
a 5
b 5
c 5
d 5
dtype: int64
Vectorized operations
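A key difference between series and ndarrays is that series operations automatically align data based on labels. In the example that follows, both operands share the same index, so the series result matches the positional ndarray result.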
my_series.head() + my_series.head()
2016-09-01 00:00:00 0
2016-09-01 00:01:00 2
2016-09-01 00:02:00 4
2016-09-01 00:03:00 6
2016-09-01 00:04:00 8
Freq: T, dtype: int64
np.array(my_series.head()) + np.array(my_series.head())
array([0, 2, 4, 6, 8])
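Label alignment only becomes visible when the two operands carry different labels. A minimal sketch, assuming two hypothetical series `left` and `right` with partly overlapping indexes:

```python
import numpy as np
import pandas as pd

left = pd.Series([1, 2, 3], index=list('abc'))
right = pd.Series([10, 20, 30], index=list('bcd'))

# Series addition matches on labels; labels found on only one side give NaN.
by_label = left + right

# ndarray addition is purely positional and knows nothing about labels.
by_position = left.values + right.values
```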
Date arithmetic
from datetime import datetime
now = datetime.now()
now
datetime.datetime(2017, 9, 22, 14, 30, 18, 504458)
delta
delta = now - datetime(2001, 1, 1)
delta
datetime.timedelta(6108, 52218, 504458)
delta.days
6108
pd.Timedelta(6108, unit='d')
Timedelta('6108 days 00:00:00')
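The days attribute holds only the whole-day component of a timedelta. A sketch of the other parts, assuming a fixed hypothetical timestamp `moment` in place of datetime.now() so the numbers are repeatable:

```python
from datetime import datetime

import pandas as pd

# A fixed timestamp stands in for datetime.now() so the result is repeatable.
moment = datetime(2017, 9, 22, 14, 30, 18)
delta = moment - datetime(2001, 1, 1)

whole_days = delta.days              # day component only
leftover_seconds = delta.seconds     # seconds beyond the whole days
total_seconds = delta.total_seconds()

# pandas wraps the same duration and supports the same arithmetic.
pd_delta = pd.Timedelta(delta)
back_again = pd.Timestamp('2001-01-01') + pd_delta
```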
Range from timedelta
us_memorial_day = datetime(2016, 5, 30)
us_labor_day = datetime(2016, 9, 5)
us_summer_2016 = us_labor_day - us_memorial_day
us_summer_2016
datetime.timedelta(98)
summer_2016_days = pd.date_range(
    us_memorial_day, periods=us_summer_2016.days, freq='D')
summer_2016_days[:4]
summer_2016_days[-4:]
DatetimeIndex(['2016-05-30', '2016-05-31', '2016-06-01', '2016-06-02'], dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2016-09-01', '2016-09-02', '2016-09-03', '2016-09-04'], dtype='datetime64[ns]', freq='D')
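Because the range above is built from a period count, it ends on 2016-09-04, the day before Labor Day. A sketch comparing that with an explicit end date, which pd.date_range includes by default:

```python
from datetime import datetime

import pandas as pd

us_memorial_day = datetime(2016, 5, 30)
us_labor_day = datetime(2016, 9, 5)

# With start and end, both endpoints are included by default.
inclusive = pd.date_range(us_memorial_day, us_labor_day, freq='D')

# With periods, the range stops one day short of Labor Day.
by_periods = pd.date_range(
    us_memorial_day, periods=(us_labor_day - us_memorial_day).days, freq='D')
```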
Data Frames and Panels
Creating data frames from various source types
vals = dict(a=40, b=29, c=292, d=-5.03)
pd.DataFrame(vals, index='first again'.split())
a b c d
first 40 29 292 -5.03
again 40 29 292 -5.03
Without an explicit index
series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
pd.DataFrame(series_dict)
a b c
0 4 9 3
1 5 322 45
2 6 455 22
Dictionary of tuples, with a MultiIndex on both axes
dict_of_tuples = {
('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
('a', 'c'): {('A', 'B'): 1, ('A', 'C'): 2},
('b', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
('b', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}
}
pd.DataFrame(dict_of_tuples)
a b
a b c a b
A B 1 1 1 1 1
C 2 2 2 2 2
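A frame built this way has a MultiIndex on both rows and columns. A sketch of basic selection, assuming a smaller hypothetical dictionary with distinct values so each lookup is easy to check:

```python
import pandas as pd

dict_of_tuples = {
    ('a', 'a'): {('A', 'B'): 1, ('A', 'C'): 2},
    ('a', 'b'): {('A', 'B'): 3, ('A', 'C'): 4},
    ('b', 'a'): {('A', 'B'): 5, ('A', 'C'): 6},
}
frame = pd.DataFrame(dict_of_tuples)

# An outer-level label selects every column beneath it.
under_a = frame['a']

# A full tuple selects exactly one column; .loc combines row and column keys.
one_column = frame[('b', 'a')]
one_cell = frame.loc[('A', 'C'), ('a', 'b')]
```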
Create panels
3D analogues of DataFrames
Initialized natively
pd.Panel(np.random.randn(2, 5, 4),
items='item1 item2'.split(),
major_axis=pd.date_range('9/6/2016', periods=5),
minor_axis=list('ABCD'))
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: item1 to item2
Major_axis axis: 2016-09-06 00:00:00 to 2016-09-10 00:00:00
Minor_axis axis: A to D
Initialized from a dictionary of data frames
series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10
dict_of_dfs = dict(df1=df1, df2=df2)
pd.Panel(dict_of_dfs)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c
from_dict
Factory method
panel = pd.Panel.from_dict(dict_of_dfs, orient='minor')
pd.Panel.from_dict(dict_of_dfs, orient='items')
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 3 (minor_axis)
Items axis: df1 to df2
Major_axis axis: 0 to 2
Minor_axis axis: a to c
panel.ix[:, 0, :]
a b c
df1 4 9 3
df2 14 19 13
panel.ix['a':, 0, :'df1']
a b c
df1 4 9 3
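Panel was deprecated in pandas 0.20 and removed in 0.25, and the .ix indexer was removed in 1.0, so the Panel examples need an older release. One possible substitute, sketched here as an assumption rather than a drop-in replacement, is a DataFrame with a two-level row index; the level names `item` and `row` are illustrative:

```python
import pandas as pd

series_dict = dict(a=[4, 5, 6], b=[9, 322, 455], c=[3, 45, 22])
df1 = pd.DataFrame(series_dict)
df2 = pd.DataFrame(series_dict) + 10

# Concatenating a dict of frames uses the dict keys as an outer index level,
# which plays the role of the Panel "items" axis.
stacked = pd.concat({'df1': df1, 'df2': df2}, names=['item', 'row'])

# One whole "item" (comparable to selecting a single frame from the Panel).
second = stacked.loc['df2']

# Row 0 across every item (comparable to the panel.ix[:, 0, :] slice).
first_rows = stacked.xs(0, level='row')
```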
SciPy
[WIP]
scikit-learn
[WIP]