DS Notebook 03

Posted on Wed 04 December 2019 in Projects

Pandas review

And parsing thousands of recipes scraped from the internet. For example:
In [160]:
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
Out[160]:
2
In [161]:
recipes.name[selection.index]
Out[161]:
1965      Spinach, caramelised garlic, red pepper and po...
171726          Lamb shoulder with harissa and lamb stovies
Name: name, dtype: object

In [1]:
import numpy as np
import pandas as pd

Series as generalized NumPy array

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
Out[2]:
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
In [3]:
data.values
Out[3]:
array([0.25, 0.5 , 0.75, 1.  ])
In [4]:
data.index
Out[4]:
RangeIndex(start=0, stop=4, step=1)
In [5]:
data[1]
Out[5]:
0.5
In [6]:
data[1:3]
Out[6]:
1    0.50
2    0.75
dtype: float64

Pandas Series object has an explicitly defined index associated with the values (as opposed to NumPy arrays' implicit integer index).
Need not be an integer:

In [33]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
Out[33]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
In [8]:
data['a']
Out[8]:
0.25
In [31]:
# non-contiguous or non-sequential:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data
Out[31]:
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
In [10]:
data[7]
Out[10]:
1.0

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [11]:
population_dict = {'CA': 38332521,
                   'TX': 26448193,
                   'NY': 19651127,
                   'FL': 19552860,
                   'IL': 12882135}
population = pd.Series(population_dict)
population
Out[11]:
CA    38332521
TX    26448193
NY    19651127
FL    19552860
IL    12882135
dtype: int64
In [12]:
population['CA']
Out[12]:
38332521

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [13]:
population['CA':'NY']
Out[13]:
CA    38332521
TX    26448193
NY    19651127
dtype: int64
In general, a Series is constructed as:
>>> pd.Series(data, index=index)
In [14]:
pd.Series({2:'a', 1:'b', 3:'c'})
Out[14]:
2    a
1    b
3    c
dtype: object
In [15]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
Out[15]:
3    c
2    a
dtype: object

DataFrame as generalized NumPy array

In [16]:
area_dict = {'CA': 423967, 'TX': 695662, 'NY': 141297,
             'FL': 170312, 'IL': 149995}
area = pd.Series(area_dict)
area
Out[16]:
CA    423967
TX    695662
NY    141297
FL    170312
IL    149995
dtype: int64
In [17]:
states = pd.DataFrame({'pop': population,
                       'area': area})
states
Out[17]:
pop area
CA 38332521 423967
TX 26448193 695662
NY 19651127 141297
FL 19552860 170312
IL 12882135 149995
In [18]:
states.index
Out[18]:
Index(['CA', 'TX', 'NY', 'FL', 'IL'], dtype='object')
In [19]:
states.columns
Out[19]:
Index(['pop', 'area'], dtype='object')
In [20]:
states['area']  # returns a Series
Out[20]:
CA    423967
TX    695662
NY    141297
FL    170312
IL    149995
Name: area, dtype: int64

The Pandas Index Object

Immutable Array, Ordered Set

In [21]:
ind = pd.Index([2, 3, 5, 7, 11])
In [22]:
ind[0]
Out[22]:
2
In [23]:
ind[::2]
Out[23]:
Int64Index([2, 5, 11], dtype='int64')
In [24]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64
In [25]:
# Immutable
ind[1] = 0
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-83a953ce5d2f> in <module>()
      1 # Immutable
----> 2 ind[1] = 0

~\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   4258 
   4259     def __setitem__(self, key, value):
-> 4260         raise TypeError("Index does not support mutable operations")
   4261 
   4262     def __getitem__(self, key):

TypeError: Index does not support mutable operations

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic:

In [26]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
In [27]:
indA & indB              # intersection via the & operator
indA.intersection(indB)  # equivalent method form; the output below comes from this last line
Out[27]:
Int64Index([3, 5, 7], dtype='int64')
In [28]:
indA | indB # union
Out[28]:
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
In [29]:
indA ^ indB # xor
Out[29]:
Int64Index([1, 2, 9, 11], dtype='int64')

Selecting data

In [34]:
data['b']
Out[34]:
0.5
In [35]:
'b' in data
Out[35]:
True
In [36]:
data.keys()
Out[36]:
Index(['a', 'b', 'c', 'd'], dtype='object')
In [37]:
list(data.items())
Out[37]:
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
In [38]:
data['e'] = 1.25
data
Out[38]:
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
In [39]:
#masking
data[(data > 0.3) & (data < 0.8)]
Out[39]:
b    0.50
c    0.75
dtype: float64

Indexers loc, iloc

In [40]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
Out[40]:
1    a
3    b
5    c
dtype: object
In [41]:
# explicit
data[1]
Out[41]:
'a'
In [42]:
# implicit
data[1:4]
Out[42]:
3    b
5    c
dtype: object

.loc always references the explicit index:

In [43]:
data.loc[3]
Out[43]:
'b'

.iloc always references the implicit Python-style index:

In [44]:
data.iloc[2]
Out[44]:
'c'

DataFrame as a dict of Series:

In [45]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
Out[45]:
area pop
California 423967 38332521
Texas 695662 26448193
New York 141297 19651127
Florida 170312 19552860
Illinois 149995 12882135
In [46]:
#access the individual series:
data['area']
Out[46]:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
In [47]:
#Equivalently, use attribute-style access with column names that are strings:
data.area
Out[47]:
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
In [48]:
# they are exactly the same:
data.area is data['area']
Out[48]:
True

In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
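The reason is that pop is also a DataFrame method, so data.pop resolves to the bound method rather than the column, and attribute assignment just attaches an ordinary Python attribute without creating a column. A minimal sketch of the pitfall (the small DataFrame and the density2 name below are just for illustration):

small = pd.DataFrame({'area': [423967, 695662],
                      'pop': [38332521, 26448193]},
                     index=['California', 'Texas'])
print(type(small.pop))                             # <class 'method'> -- the DataFrame.pop method, not the column
small['density'] = small['pop'] / small['area']    # dictionary-style assignment creates a real column
small.density2 = small['pop'] / small['area']      # attribute-style assignment does NOT; it only sets
                                                   # a plain attribute, so no 'density2' column appears
print(small.columns.tolist())                      # ['area', 'pop', 'density']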

Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:

In [49]:
data['density'] = data['pop'] / data['area']
data
Out[49]:
area pop density
California 423967 38332521 90.413926
Texas 695662 26448193 38.018740
New York 141297 19651127 139.076746
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
In [50]:
# as enhanced 2D array:
data.values
Out[50]:
array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
In [51]:
data.T
Out[51]:
California Texas New York Florida Illinois
area 4.239670e+05 6.956620e+05 1.412970e+05 1.703120e+05 1.499950e+05
pop 3.833252e+07 2.644819e+07 1.965113e+07 1.955286e+07 1.288214e+07
density 9.041393e+01 3.801874e+01 1.390767e+02 1.148061e+02 8.588376e+01
In [52]:
data.values[0]
Out[52]:
array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
In [53]:
data.iloc[:3, 1:3]
Out[53]:
pop density
California 38332521 90.413926
Texas 26448193 38.018740
New York 19651127 139.076746
In [54]:
data.loc[:'New York','pop':'density']
Out[54]:
pop density
California 38332521 90.413926
Texas 26448193 38.018740
New York 19651127 139.076746
In [55]:
data.loc[data.density > 100, ['pop', 'density']]
Out[55]:
pop density
New York 19651127 139.076746
Florida 19552860 114.806121
In [56]:
# modify values:
data.iloc[0, 2] = 90
data
Out[56]:
area pop density
California 423967 38332521 90.000000
Texas 695662 26448193 38.018740
New York 141297 19651127 139.076746
Florida 170312 19552860 114.806121
Illinois 149995 12882135 85.883763
In [57]:
data[data.density > 100]
Out[57]:
area pop density
New York 141297 19651127 139.076746
Florida 170312 19552860 114.806121
In [58]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
Out[58]:
0    6
1    3
2    7
3    4
dtype: int32
In [59]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df
Out[59]:
A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
In [60]:
np.exp(ser)
Out[60]:
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64
In [61]:
np.sin(df * np.pi/4)
Out[61]:
A B C D
0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
In [64]:
# Pandas aligns indices, in case data is incomplete:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area
Out[64]:
Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
In [66]:
area.index | population.index  # union of indices of two input arrays
Out[66]:
Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
In [69]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
print(A)
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(B)
print(A + B)
0    2
1    4
2    6
dtype: int64
1    1
2    3
3    5
dtype: int64
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

Calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:

In [70]:
A.add(B, fill_value=0)
Out[70]:
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
In [74]:
# similarly for dfs:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('12'))
A
Out[74]:
1 2
0 8 1
1 19 14
In [79]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('213'))
B
Out[79]:
2 1 3
0 3 7 6
1 8 7 4
2 1 4 7
In [80]:
A + B
Out[80]:
1 2 3
0 15.0 4.0 NaN
1 26.0 22.0 NaN
2 NaN NaN NaN
In [82]:
A.stack()
Out[82]:
0  1     8
   2     1
1  1    19
   2    14
dtype: int32
In [83]:
A.stack().mean()
Out[83]:
10.5
In [84]:
A.add(B, fill_value=A.stack().mean())
Out[84]:
1 2 3
0 15.0 4.0 16.5
1 26.0 22.0 14.5
2 14.5 11.5 17.5
Python Operator    Pandas Method(s)
+                  add()
-                  sub(), subtract()
*                  mul(), multiply()
/                  truediv(), div(), divide()
//                 floordiv()
%                  mod()
**                 pow()
In [85]:
A = rng.randint(10, size=(3, 4))
A
Out[85]:
array([[9, 8, 8, 0],
       [8, 6, 8, 7],
       [0, 7, 7, 2]])
In [86]:
A[0]
Out[86]:
array([9, 8, 8, 0])
In [87]:
A-A[0]
Out[87]:
array([[ 0,  0,  0,  0],
       [-1, -2,  0,  7],
       [-9, -1, -1,  2]])
In [90]:
#pandas similarly operates row-wise:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
Out[90]:
Q R S T
0 0 0 0 0
1 -1 -2 0 7
2 -9 -1 -1 2
In [91]:
#column-wise goes like this:
df.subtract(df['R'], axis=0)
Out[91]:
Q R S T
0 1 0 0 -8
1 2 0 2 1
2 -7 0 0 -5
In [92]:
df
Out[92]:
Q R S T
0 9 8 8 0
1 8 6 8 7
2 0 7 7 2
In [93]:
df.iloc[0, ::2]
Out[93]:
Q    9
S    8
Name: 0, dtype: int32
In [94]:
df - df.iloc[0, ::2]
Out[94]:
Q R S T
0 0.0 NaN 0.0 NaN
1 -1.0 NaN 0.0 NaN
2 -9.0 NaN -1.0 NaN

Missing Data

In [98]:
# sentinel values
vals1 = np.array([1, None, 3, 4])
vals1  # <-- common type = object
Out[98]:
array([1, None, 3, 4], dtype=object)
In [100]:
vals1.sum()  #addition between int and None is undefined
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-100-7de6b8507c78> in <module>()
----> 1 vals1.sum()  #addition between int and None is undefined

~\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where)
     36 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     37          initial=_NoValue, where=True):
---> 38     return umr_sum(a, axis, dtype, out, keepdims, initial, where)
     39 
     40 def _prod(a, axis=None, dtype=None, out=None, keepdims=False,

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
In [101]:
# NaN is different:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype
Out[101]:
dtype('float64')
In [104]:
1 + np.nan  #data virus, infects any object it touches:
Out[104]:
nan
In [105]:
0 * np.nan
Out[105]:
nan
In [106]:
vals2.sum(), vals2.min(), vals2.max()
Out[106]:
(nan, nan, nan)
In [108]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[108]:
(8.0, 1.0, 4.0)
In [109]:
pd.Series([1, np.nan, 2, None])
Out[109]:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
In [110]:
# detection
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
Out[110]:
0    False
1     True
2    False
3     True
dtype: bool
In [111]:
data[data.notnull()]
Out[111]:
0        1
2    hello
dtype: object
In [112]:
data.dropna()
Out[112]:
0        1
2    hello
dtype: object
In [113]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df
Out[113]:
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6

By default, dropna() will drop all rows in which any null value is present:

In [115]:
df.dropna()
Out[115]:
0 1 2
1 2.0 3.0 5
In [116]:
df.dropna(axis='columns')
Out[116]:
2
0 2
1 5
2 6

But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.

The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:

In [117]:
df[3] = np.nan
df
Out[117]:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In [118]:
df.dropna(axis='columns', how='all')
Out[118]:
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6

For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [119]:
df.dropna(axis='rows', thresh=3)
Out[119]:
0 1 2 3
1 2.0 3.0 5 NaN
In [120]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
Out[120]:
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
In [121]:
data.fillna(0)
Out[121]:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
In [122]:
#forward-fill
data.fillna(method='ffill')
Out[122]:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
In [123]:
#back-fill
data.fillna(method='bfill')
Out[123]:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
In [124]:
df
Out[124]:
0 1 2 3
0 1.0 NaN 2 NaN
1 2.0 3.0 5 NaN
2 NaN 4.0 6 NaN
In [126]:
df.fillna(method='ffill', axis=1)
Out[126]:
0 1 2 3
0 1.0 1.0 2.0 2.0
1 2.0 3.0 5.0 5.0
2 NaN 4.0 6.0 6.0

Hierarchical Indexing (MultiIndex)

Indexing works like always, but keep in mind that you can pass level= to many methods. For example:

In [3]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
Out[3]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 47.0 36.2 30.0 37.9 51.0 36.6
2 20.0 37.8 50.0 37.0 49.0 36.5
2014 1 40.0 36.6 19.0 37.5 37.0 37.5
2 60.0 37.4 26.0 37.4 37.0 37.2
In [4]:
health_data.index.levels
Out[4]:
FrozenList([[2013, 2014], [1, 2]])
In [10]:
health_data['Bob']
Out[10]:
type HR Temp
year visit
2013 1 47.0 36.2
2 20.0 37.8
2014 1 40.0 36.6
2 60.0 37.4
In [11]:
health_data['Bob', 'HR']
Out[11]:
year  visit
2013  1        47.0
      2        20.0
2014  1        40.0
      2        60.0
Name: (Bob, HR), dtype: float64
In [12]:
health_data.loc[:, ('Bob', 'HR')]
Out[12]:
year  visit
2013  1        47.0
      2        20.0
2014  1        40.0
      2        60.0
Name: (Bob, HR), dtype: float64
In [13]:
health_data.sort_index()
Out[13]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year visit
2013 1 47.0 36.2 30.0 37.9 51.0 36.6
2 20.0 37.8 50.0 37.0 49.0 36.5
2014 1 40.0 36.6 19.0 37.5 37.0 37.5
2 60.0 37.4 26.0 37.4 37.0 37.2
In [14]:
health_data.unstack(level=0)
Out[14]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year 2013 2014 2013 2014 2013 2014 2013 2014 2013 2014 2013 2014
visit
1 47.0 40.0 36.2 36.6 30.0 19.0 37.9 37.5 51.0 37.0 36.6 37.5
2 20.0 60.0 37.8 37.4 50.0 26.0 37.0 37.4 49.0 37.0 36.5 37.2
In [15]:
health_data.unstack(level=1)
Out[15]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
visit 1 2 1 2 1 2 1 2 1 2 1 2
year
2013 47.0 20.0 36.2 37.8 30.0 50.0 37.9 37.0 51.0 49.0 36.6 36.5
2014 40.0 60.0 36.6 37.4 19.0 26.0 37.5 37.4 37.0 37.0 37.5 37.2
In [16]:
health_data.stack(level=0)
Out[16]:
type HR Temp
year visit subject
2013 1 Bob 47.0 36.2
Guido 30.0 37.9
Sue 51.0 36.6
2 Bob 20.0 37.8
Guido 50.0 37.0
Sue 49.0 36.5
2014 1 Bob 40.0 36.6
Guido 19.0 37.5
Sue 37.0 37.5
2 Bob 60.0 37.4
Guido 26.0 37.4
Sue 37.0 37.2
In [17]:
health_data.stack(level=1)
Out[17]:
subject Bob Guido Sue
year visit type
2013 1 HR 47.0 30.0 51.0
Temp 36.2 37.9 36.6
2 HR 20.0 50.0 49.0
Temp 37.8 37.0 36.5
2014 1 HR 40.0 19.0 37.0
Temp 36.6 37.5 37.5
2 HR 60.0 26.0 37.0
Temp 37.4 37.4 37.2
In [24]:
health_data.reset_index()  #turn index into columns
Out[24]:
subject year visit Bob Guido Sue
type HR Temp HR Temp HR Temp
0 2013 1 47.0 36.2 30.0 37.9 51.0 36.6
1 2013 2 20.0 37.8 50.0 37.0 49.0 36.5
2 2014 1 40.0 36.6 19.0 37.5 37.0 37.5
3 2014 2 60.0 37.4 26.0 37.4 37.0 37.2
In [27]:
data_mean = health_data.mean(level='year')
data_mean
Out[27]:
subject Bob Guido Sue
type HR Temp HR Temp HR Temp
year
2013 33.5 37.0 40.0 37.45 50.0 36.55
2014 50.0 37.0 22.5 37.45 37.0 37.35
In [31]:
data_mean.mean(axis=1, level='type')
Out[31]:
type HR Temp
year
2013 41.166667 37.000000
2014 36.500000 37.266667

Merge and Join with pd.merge()

In [165]:
# For visualization purposes, to see two dfs side-by-side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

1-to-1

Like column-wise concatenation: pd.merge() finds the common column between the two DataFrames ("employee") and combines on it. The order of entries does not matter, and the index is discarded:

In [168]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
Out[168]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df2

employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
In [170]:
df3 = pd.merge(df1, df2)
df3
Out[170]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

N-to-1

Occurs when one of the two key columns (in this case, "group") contains duplicate entries (in this case, "Engineering"); the resulting "supervisor" column repeats information where necessary.

In [171]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
Out[171]:

df3

employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

df4

group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve

pd.merge(df3, df4)

employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve

N-to-N

Occurs when the key column in both the left and right DataFrames contains duplicates:

In [172]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Out[172]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df5

group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization

pd.merge(df1, df5)

employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization

Options for pd.merge()

Specify merge key with on

In [175]:
pd.merge(df1, df2, on='employee')
Out[175]:
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014

left_on and right_on

We may have a dataset in which the employee name is labeled "name" rather than "employee". In this case, we can use the left_on and right_on keywords to specify the two column names:

In [176]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3')
Out[176]:

df1

employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR

df3

name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
In [178]:
pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1)
Out[178]:
employee group salary
0 Bob Accounting 70000
1 Jake Engineering 80000
2 Lisa Engineering 120000
3 Sue HR 90000

left_index and right_index

In [185]:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
Out[185]:

df1a

group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR

df2a

hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
In [183]:
pd.merge(df1a, df2a, left_index=True, right_index=True)
Out[183]:
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
For convenience, DataFrames implement join(), which performs a merge that defaults to joining on indices:
In [184]:
df1a.join(df2a)
Out[184]:
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
In [182]:
pd.merge(df1a, df3, left_index=True, right_on='name')
Out[182]:
group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000

Set Arithmetic (how='inner', how='outer', etc.)

When a value appears in one key column but not the other:

In [186]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
Out[186]:

df6

name food
0 Peter fish
1 Paul beans
2 Mary bread

df7

name drink
0 Mary wine
1 Joseph beer

pd.merge(df6, df7)

name food drink
0 Mary bread wine

By default, pd.merge() uses how='inner', which keeps only the intersection of the keys:

In [187]:
pd.merge(df6, df7, how='inner')
Out[187]:
name food drink
0 Mary bread wine
In [191]:
# outer is union, filled with NAs:
pd.merge(df6, df7, how='outer')
Out[191]:
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
In [193]:
# left keeps all left entries, ignoring Joseph
pd.merge(df6, df7, how='left')
Out[193]:
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
In [194]:
# Right keeps all right entries, ignoring Peter and Paul:
pd.merge(df6, df7, how='right')
Out[194]:
name food drink
0 Mary bread wine
1 Joseph NaN beer

suffixes

For conflicting column names you wish to keep separate

In [195]:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
Out[195]:

df8

name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4

df9

name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2

pd.merge(df8, df9, on="name")

name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
In [198]:
pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))
Out[198]:
name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2

Example: US States Data

Q: rank US states and territories by their 2010 population density

In [199]:
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 57935  100 57935    0     0  57935      0  0:00:01 --:--:--  0:00:01  190k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   835  100   835    0     0    835      0  0:00:01 --:--:--  0:00:01  3568
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   872  100   872    0     0    872      0  0:00:01 --:--:--  0:00:01  4000
In [2]:
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
In [3]:
pop.head()
Out[3]:
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
In [4]:
areas.head()
Out[4]:
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
In [5]:
abbrevs.head()
Out[5]:
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [6]:
merged = pd.merge(areas, abbrevs, on='state')
merged.head()
Out[6]:
state area (sq. mi) abbreviation
0 Alabama 52423 AL
1 Alaska 656425 AK
2 Arizona 114006 AZ
3 Arkansas 53182 AR
4 California 163707 CA
In [7]:
final = pd.merge(merged, pop, left_on='abbreviation', right_on='state/region').drop('state/region', axis=1)
final.head()
Out[7]:
state area (sq. mi) abbreviation ages year population
0 Alabama 52423 AL under18 2012 1117489.0
1 Alabama 52423 AL total 2012 4817528.0
2 Alabama 52423 AL under18 2010 1130966.0
3 Alabama 52423 AL total 2010 4785570.0
4 Alabama 52423 AL under18 2011 1125763.0
In [8]:
final['density'] = final['population'] / final['area (sq. mi)']
final.head()
Out[8]:
state area (sq. mi) abbreviation ages year population density
0 Alabama 52423 AL under18 2012 1117489.0 21.316769
1 Alabama 52423 AL total 2012 4817528.0 91.897221
2 Alabama 52423 AL under18 2010 1130966.0 21.573851
3 Alabama 52423 AL total 2010 4785570.0 91.287603
4 Alabama 52423 AL under18 2011 1125763.0 21.474601
In [9]:
data2010 = final[(final['year'] == 2010) & (final['ages'] == 'total')]
data2010
Out[9]:
state area (sq. mi) abbreviation ages year population density
3 Alabama 52423 AL total 2010 4785570.0 91.287603
91 Alaska 656425 AK total 2010 713868.0 1.087509
101 Arizona 114006 AZ total 2010 6408790.0 56.214497
189 Arkansas 53182 AR total 2010 2922280.0 54.948667
197 California 163707 CA total 2010 37333601.0 228.051342
283 Colorado 104100 CO total 2010 5048196.0 48.493718
293 Connecticut 5544 CT total 2010 3579210.0 645.600649
379 Delaware 1954 DE total 2010 899711.0 460.445752
427 Florida 65758 FL total 2010 18846054.0 286.597129
437 Georgia 59441 GA total 2010 9713248.0 163.409902
522 Hawaii 10932 HI total 2010 1363731.0 124.746707
533 Idaho 83574 ID total 2010 1570718.0 18.794338
618 Illinois 57918 IL total 2010 12839695.0 221.687472
629 Indiana 36420 IN total 2010 6489965.0 178.197831
714 Iowa 56276 IA total 2010 3050314.0 54.202751
725 Kansas 82282 KS total 2010 2858910.0 34.745266
810 Kentucky 40411 KY total 2010 4347698.0 107.586994
821 Louisiana 51843 LA total 2010 4545392.0 87.676099
906 Maine 35387 ME total 2010 1327366.0 37.509990
917 Maryland 12407 MD total 2010 5787193.0 466.445797
1002 Massachusetts 10555 MA total 2010 6563263.0 621.815538
1013 Michigan 96810 MI total 2010 9876149.0 102.015794
1098 Minnesota 86943 MN total 2010 5310337.0 61.078373
1109 Mississippi 48434 MS total 2010 2970047.0 61.321530
1194 Missouri 69709 MO total 2010 5996063.0 86.015622
1205 Montana 147046 MT total 2010 990527.0 6.736171
1290 Nebraska 77358 NE total 2010 1829838.0 23.654153
1301 Nevada 110567 NV total 2010 2703230.0 24.448796
1386 New Hampshire 9351 NH total 2010 1316614.0 140.799273
1397 New Jersey 8722 NJ total 2010 8802707.0 1009.253268
1482 New Mexico 121593 NM total 2010 2064982.0 16.982737
1493 New York 54475 NY total 2010 19398228.0 356.094135
1578 North Carolina 53821 NC total 2010 9559533.0 177.617157
1589 North Dakota 70704 ND total 2010 674344.0 9.537565
1674 Ohio 44828 OH total 2010 11545435.0 257.549634
1685 Oklahoma 69903 OK total 2010 3759263.0 53.778278
1770 Oregon 98386 OR total 2010 3837208.0 39.001565
1781 Pennsylvania 46058 PA total 2010 12710472.0 275.966651
1866 Rhode Island 1545 RI total 2010 1052669.0 681.339159
1877 South Carolina 32007 SC total 2010 4636361.0 144.854594
1962 South Dakota 77121 SD total 2010 816211.0 10.583512
1973 Tennessee 42146 TN total 2010 6356683.0 150.825298
2058 Texas 268601 TX total 2010 25245178.0 93.987655
2069 Utah 84904 UT total 2010 2774424.0 32.677188
2154 Vermont 9615 VT total 2010 625793.0 65.085075
2165 Virginia 42769 VA total 2010 8024417.0 187.622273
2250 Washington 71303 WA total 2010 6742256.0 94.557817
2261 West Virginia 24231 WV total 2010 1854146.0 76.519582
2346 Wisconsin 65503 WI total 2010 5689060.0 86.851900
2357 Wyoming 97818 WY total 2010 564222.0 5.768079
2405 District of Columbia 68 DC total 2010 605125.0 8898.897059
In [10]:
data2010.set_index('abbreviation', inplace=True)
data2010.head()
Out[10]:
state area (sq. mi) ages year population density
abbreviation
AL Alabama 52423 total 2010 4785570.0 91.287603
AK Alaska 656425 total 2010 713868.0 1.087509
AZ Arizona 114006 total 2010 6408790.0 56.214497
AR Arkansas 53182 total 2010 2922280.0 54.948667
CA California 163707 total 2010 37333601.0 228.051342
In [11]:
data2010.density.sort_values(ascending=False)
Out[11]:
abbreviation
DC    8898.897059
NJ    1009.253268
RI     681.339159
CT     645.600649
MA     621.815538
MD     466.445797
DE     460.445752
NY     356.094135
FL     286.597129
PA     275.966651
OH     257.549634
CA     228.051342
IL     221.687472
VA     187.622273
IN     178.197831
NC     177.617157
GA     163.409902
TN     150.825298
SC     144.854594
NH     140.799273
HI     124.746707
KY     107.586994
MI     102.015794
WA      94.557817
TX      93.987655
AL      91.287603
LA      87.676099
WI      86.851900
MO      86.015622
WV      76.519582
VT      65.085075
MS      61.321530
MN      61.078373
AZ      56.214497
AR      54.948667
IA      54.202751
OK      53.778278
CO      48.493718
OR      39.001565
ME      37.509990
KS      34.745266
UT      32.677188
NV      24.448796
NE      23.654153
ID      18.794338
NM      16.982737
SD      10.583512
ND       9.537565
MT       6.736171
WY       5.768079
AK       1.087509
Name: density, dtype: float64
In [12]:
merged.isnull().any()
Out[12]:
state            False
area (sq. mi)    False
abbreviation     False
dtype: bool


GroupBy

In [1]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Out[1]:
(1035, 6)
In [2]:
planets.head()
Out[2]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
In [3]:
planets.dropna().describe()
Out[3]:
number orbital_period mass distance year
count 498.00000 498.000000 498.000000 498.000000 498.000000
mean 1.73494 835.778671 2.509320 52.068213 2007.377510
std 1.17572 1469.128259 3.636274 46.596041 4.167284
min 1.00000 1.328300 0.003600 1.350000 1989.000000
25% 1.00000 38.272250 0.212500 24.497500 2005.000000
50% 1.00000 357.000000 1.245000 39.940000 2009.000000
75% 2.00000 999.600000 2.867500 59.332500 2011.000000
max 6.00000 17337.500000 25.000000 354.000000 2014.000000

The following table summarizes some other built-in Pandas aggregations:

Aggregation        Description
count()            Total number of items
first(), last()    First and last item
mean(), median()   Mean and median
min(), max()       Minimum and maximum
std(), var()       Standard deviation and variance
mad()              Mean absolute deviation
prod()             Product of all items
sum()              Sum of all items

These are all methods of DataFrame and Series objects.

In [6]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df
Out[6]:
key data
0 A 0
1 B 1
2 C 2
3 A 3
4 B 4
5 C 5
In [9]:
df.groupby('key')  # think of this as a special view of the df.
                   # ready to perform some aggregation
Out[9]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000027F216AA978>
In [10]:
df.groupby('key').sum()
Out[10]:
data
key
A 3
B 5
C 7
In [11]:
planets.head()
Out[11]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
In [15]:
planets.groupby('method')['orbital_period'].median()
Out[15]:
method
Astrometry                         631.180000
Eclipse Timing Variations         4343.500000
Imaging                          27500.000000
Microlensing                      3300.000000
Orbital Brightness Modulation        0.342887
Pulsar Timing                       66.541900
Pulsation Timing Variations       1170.000000
Radial Velocity                    360.200000
Transit                              5.714932
Transit Timing Variations           57.011000
Name: orbital_period, dtype: float64
In [16]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))
Astrometry                     shape=(2, 6)
Eclipse Timing Variations      shape=(9, 6)
Imaging                        shape=(38, 6)
Microlensing                   shape=(23, 6)
Orbital Brightness Modulation  shape=(3, 6)
Pulsar Timing                  shape=(5, 6)
Pulsation Timing Variations    shape=(1, 6)
Radial Velocity                shape=(553, 6)
Transit                        shape=(397, 6)
Transit Timing Variations      shape=(4, 6)
In [18]:
planets.groupby('method')['year'].describe()
Out[18]:
count mean std min 25% 50% 75% max
method
Astrometry 2.0 2011.500000 2.121320 2010.0 2010.75 2011.5 2012.25 2013.0
Eclipse Timing Variations 9.0 2010.000000 1.414214 2008.0 2009.00 2010.0 2011.00 2012.0
Imaging 38.0 2009.131579 2.781901 2004.0 2008.00 2009.0 2011.00 2013.0
Microlensing 23.0 2009.782609 2.859697 2004.0 2008.00 2010.0 2012.00 2013.0
Orbital Brightness Modulation 3.0 2011.666667 1.154701 2011.0 2011.00 2011.0 2012.00 2013.0
Pulsar Timing 5.0 1998.400000 8.384510 1992.0 1992.00 1994.0 2003.00 2011.0
Pulsation Timing Variations 1.0 2007.000000 NaN 2007.0 2007.00 2007.0 2007.00 2007.0
Radial Velocity 553.0 2007.518987 4.249052 1989.0 2005.00 2009.0 2011.00 2014.0
Transit 397.0 2011.236776 2.077867 2002.0 2010.00 2012.0 2013.00 2014.0
Transit Timing Variations 4.0 2012.500000 1.290994 2011.0 2011.75 2012.5 2013.25 2014.0

Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.

This is just one example of the utility of dispatch methods. Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!

GroupBy objects have aggregate(), filter(), transform(), and apply() methods

These efficiently implement a variety of useful operations before combining the grouped data.

In [19]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df
Out[19]:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9

.aggregate()

It can take a string, a function, or a list thereof:

In [22]:
df.groupby('key').aggregate(['min', np.median, max])
Out[22]:
data1 data2
min median max min median max
key
A 0 1.5 3 3 4.0 5
B 1 2.5 4 0 3.5 7
C 2 3.5 5 3 6.0 9

Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

In [35]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
Out[35]:
data1 data2
key
A 0 5
B 1 7
C 2 9

.filter()

A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

In [36]:
def filter_func(x):
    return x['data2'].std() > 4
In [37]:
df
Out[37]:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
In [38]:
df.groupby('key').std()
Out[38]:
data1 data2
key
A 2.12132 1.414214
B 2.12132 4.949747
C 2.12132 4.242641
In [44]:
df.groupby('key').filter(filter_func)
Out[44]:
key data1 data2
1 B 1 0
2 C 2 3
4 B 4 7
5 C 5 9

.apply()

The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.

For example, here is an apply() that normalizes the first column by the sum of the second:

In [46]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x
In [50]:
df
Out[50]:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
In [48]:
df.groupby('key').apply(norm_by_data2)
Out[48]:
key data1 data2
0 A 0.000000 5
1 B 0.142857 0
2 C 0.166667 3
3 A 0.375000 3
4 B 0.571429 7
5 C 0.416667 9
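
transform() was listed above but hasn't been shown: unlike aggregate(), it returns output with the same shape as the input, one transformed value per row. A minimal sketch using the same df, centering each value on its group mean (the center name is just for illustration):

def center(x):
    # x is the data for one group (applied per column); the result must keep the same length
    return x - x.mean()

df.groupby('key').transform(center)   # same shape as df[['data1', 'data2']], group-wise centered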

splitting by other methods

Another method is to provide a dictionary that maps index values to the group keys:

In [55]:
df
Out[55]:
key data1 data2
0 A 0 5
1 B 1 0
2 C 2 3
3 A 3 3
4 B 4 7
5 C 5 9
In [57]:
df2 = df.set_index('key')
df2
Out[57]:
data1 data2
key
A 0 5
B 1 0
C 2 3
A 3 3
B 4 7
C 5 9
In [58]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()
Out[58]:
data1 data2
consonant 12 19
vowel 3 8

Similar to mapping, you can pass any Python function that will input the index value and output the group:

In [59]:
df2.groupby(str.lower).mean()
Out[59]:
data1 data2
a 1.5 4.0
b 2.5 3.5
c 3.5 6.0
In [60]:
df2.groupby([str.lower, mapping]).mean()
Out[60]:
data1 data2
a vowel 1.5 4.0
b consonant 2.5 3.5
c consonant 3.5 6.0
In [65]:
# grouping example
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna('-')
Out[65]:
decade 1980s 1990s 2000s 2010s
method
Astrometry - - - 2
Eclipse Timing Variations - - 5 10
Imaging - - 29 21
Microlensing - - 12 15
Orbital Brightness Modulation - - - 5
Pulsar Timing - 9 1 1
Pulsation Timing Variations - - 1 -
Radial Velocity 1 52 475 424
Transit - - 64 712
Transit Timing Variations - - - 9

Pivot Tables

Like multidimensional GroupBys

In [66]:
titanic = sns.load_dataset('titanic')
titanic.head()
Out[66]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [67]:
titanic.groupby('sex')['survived'].mean()
Out[67]:
sex
female    0.742038
male      0.188908
Name: survived, dtype: float64

This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed using something like this: we

  • group by class and gender
  • select survival
  • apply a mean aggregate
  • combine the resulting groups
  • and then unstack the hierarchical index to reveal the hidden multidimensionality.

In code:

In [68]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Out[68]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

While each step of this pipeline makes sense in light of the tools we've previously discussed, the long string of code is not particularly easy to read or use. This two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multi-dimensional aggregation.

In [69]:
titanic.pivot_table('survived', index='sex', columns='class')
Out[69]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

multi-level pivot tables

In [78]:
age = pd.cut(titanic['age'], [0, 18, 100])
titanic.pivot_table('survived', ['sex', age], 'class')
Out[78]:
class First Second Third
sex age
female (0, 18] 0.909091 1.000000 0.511628
(18, 100] 0.972973 0.900000 0.423729
male (0, 18] 0.800000 0.600000 0.215686
(18, 100] 0.375000 0.071429 0.133663

We can apply the same strategy when working with the columns as well; let's add info on the fare paid using pd.qcut to automatically compute quantiles:

In [79]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
Out[79]:
fare (-0.001, 14.454] (14.454, 512.329]
class First Second Third First Second Third
sex age
female (0, 18] NaN 1.000000 0.714286 0.909091 1.000000 0.318182
(18, 100] NaN 0.880000 0.444444 0.972973 0.914286 0.391304
male (0, 18] NaN 0.000000 0.260870 0.800000 0.818182 0.178571
(18, 100] 0.0 0.098039 0.125000 0.391304 0.030303 0.192308
# full call signature (as of Pandas 0.18)
DataFrame.pivot_table(data, values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None,
                      margins=False, dropna=True, margins_name='All')
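
fill_value has not been used above; as a quick, purely illustrative sketch, it would replace the NaN cells of the previous table (a filled 0 here means "no passengers in that bin", not a 0% survival rate):

titanic.pivot_table('survived', ['sex', age], [fare, 'class'], fill_value=0)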
In [80]:
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived':sum, 'fare':'mean'})
Out[80]:
fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47

At times it's useful to compute totals along each grouping. This can be done via the margins keyword:

In [81]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[81]:
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
In [82]:
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  258k  100  258k    0     0   258k      0  0:00:01 --:--:--  0:00:01  662k
In [85]:
births = pd.read_csv('births.csv')
births.head()
Out[85]:
year month day gender births
0 1969 1 1.0 F 4046
1 1969 1 1.0 M 4440
2 1969 1 2.0 F 4454
3 1969 1 2.0 M 4548
4 1969 1 3.0 F 4548
In [86]:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
Out[86]:
gender F M
decade
1960 1753634 1846572
1970 16263075 17121550
1980 18310351 19243452
1990 19479454 20420553
2000 18229309 19106428
In [87]:
import matplotlib.pyplot as plt
sns.set()  # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');

Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation:

In [88]:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]                           # median
sig = 0.74 * (quartiles[2] - quartiles[0])  # robust std-dev estimate: 0.74 ≈ 1/1.349, the IQR of a Gaussian
In [89]:
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
In [90]:
births['day'] = births['day'].astype(int)
In [91]:
births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')

births['dayofweek'] = births.index.dayofweek
In [92]:
births.pivot_table('births', index='dayofweek',
                    columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
In [94]:
births_by_date = births.pivot_table('births', 
                                    [births.index.month, births.index.day])
births_by_date.index = [pd.datetime(2012, month, day)
                        for (month, day) in births_by_date.index]
births_by_date.head()
Out[94]:
births
2012-01-01 4009.225
2012-01-02 4247.400
2012-01-03 4500.900
2012-01-04 4571.350
2012-01-05 4603.625
In [95]:
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);

Vectorized .str operations

In [97]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names.str.capitalize()
Out[97]:
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
In [98]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
Nearly all of Python's built-in string methods are mirrored by a vectorized .str method:

len()      lower()       translate()   islower()
ljust()    upper()       startswith()  isupper()
rjust()    find()        endswith()    isnumeric()
center()   rfind()       isalnum()     isdecimal()
zfill()    index()       isalpha()     split()
strip()    rindex()      isdigit()     rsplit()
rstrip()   capitalize()  isspace()     partition()
lstrip()   swapcase()    istitle()     rpartition()
In [99]:
monte.str.lower()
Out[99]:
0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object
In [100]:
monte.str.len()
Out[100]:
0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64
In [101]:
monte.str.startswith('T')
Out[101]:
0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
In [102]:
monte.str.startswith('t')
Out[102]:
0    False
1    False
2    False
3    False
4    False
5    False
dtype: bool
In [103]:
monte.str.split()
Out[103]:
0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
Method       Description
match()      Call re.match() on each element, returning a boolean
extract()    Call re.match() on each element, returning matched groups as strings
findall()    Call re.findall() on each element
replace()    Replace occurrences of pattern with some other string
contains()   Call re.search() on each element, returning a boolean
count()      Count occurrences of pattern
split()      Equivalent to str.split(), but accepts regexps
rsplit()     Equivalent to str.rsplit(), but accepts regexps
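
For example, extract() can pull the first name out of each element by requesting a single group of contiguous letters (a sketch using the monte Series above):

monte.str.extract('([A-Za-z]+)', expand=False)   # Graham, John, Terry, Eric, Terry, Michael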
In [104]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Out[104]:
0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
Get Dummies
In [105]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte
Out[105]:
name info
0 Graham Chapman B|C|D
1 John Cleese B|D
2 Terry Gilliam A|C
3 Eric Idle B|D
4 Terry Jones B|C
5 Michael Palin B|C|D
In [106]:
full_monte['info'].str.get_dummies('|')
Out[106]:
A B C D
0 0 1 1 1
1 0 1 0 1
2 1 0 1 0
3 0 1 0 1
4 0 1 1 0
5 0 1 1 1

Recipe Recommendations

In [135]:
!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  1 29.3M    1  407k    0     0   407k      0  0:01:13 --:--:--  0:01:13  636k
 50 29.3M   50 14.9M    0     0  14.9M      0  0:00:01  0:00:01 --:--:-- 9397k
100 29.3M  100 29.3M    0     0  14.6M      0  0:00:02  0:00:02 --:--:-- 11.7M
In [131]:
!gzip -d recipeitems-latest.json.gz
In [132]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)
ValueError: Trailing data
In [133]:
with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(line).shape
Out[133]:
(2, 12)
In [137]:
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r', encoding="utf8") as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)
In [138]:
recipes.shape
Out[138]:
(173278, 17)
In [139]:
recipes.iloc[0]
Out[139]:
_id                                {'$oid': '5160756b96cc62079cc2db15'}
name                                    Drop Biscuits and Sausage Gravy
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url                   http://thepioneerwoman.com/cooking/2013/03/dro...
image                 http://static.thepioneerwoman.com/cooking/file...
ts                                             {'$date': 1365276011104}
cookTime                                                          PT30M
source                                                  thepioneerwoman
recipeYield                                                          12
datePublished                                                2013-03-11
prepTime                                                          PT10M
description           Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                           NaN
creator                                                             NaN
recipeCategory                                                      NaN
dateModified                                                        NaN
recipeInstructions                                                  NaN
Name: 0, dtype: object
In [140]:
recipes.ingredients.str.len().describe()
Out[140]:
count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64
In [144]:
recipes.name[np.argmax(np.array(recipes.ingredients.str.len()))]
Out[144]:
'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'
In [145]:
recipes.description.str.contains('[Bb]reakfast').sum()
Out[145]:
3524
In [146]:
recipes.ingredients.str.contains('[Cc]innamon').sum()
Out[146]:
10526
In [147]:
recipes.ingredients.str.contains('[Cc]inamon').sum()
Out[147]:
11
In [156]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
In [157]:
import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()
Out[157]:
salt pepper oregano sage parsley rosemary tarragon thyme paprika cumin
0 False False False True False False False False False False
1 False False False False False False False False False False
2 True True False False False False False False False True
3 False False False False False False False False False False
4 False False False False False False False False False False
In [160]:
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
Out[160]:
2
In [161]:
recipes.name[selection.index]
Out[161]:
1965      Spinach, caramelised garlic, red pepper and po...
171726          Lamb shoulder with harissa and lamb stovies
Name: name, dtype: object

query() and eval()

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression. You can check the approximate size of your array in bytes with:

In [ ]:
df.values.nbytes
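
As a minimal sketch of what these expressions look like (the DataFrame and its columns A, B, C here are hypothetical):

rng = np.random.RandomState(42)
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])

# eval() evaluates a compound column-wise expression without allocating
# full-size temporaries for every intermediate result:
result = df.eval('(A + B) / (C - 1)')

# query() filters rows with a string expression; @ references local variables:
Cmean = df['C'].mean()
subset = df.query('A < @Cmean and B < @Cmean')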