DS Notebook 03
Posted on Wed 04 December 2019 in Projects
Pandas review¶
And parsing thousands of recipes scraped from the internet, ex:¶
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
recipes.name[selection.index]
import numpy as np
import pandas as pd
Series as generalized NumPy array¶
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
data.values
data.index
data[1]
data[1:3]
A Pandas Series object has an explicitly defined index associated with the values (as opposed to NumPy arrays' implicit integer index).
Need not be an integer:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
data['a']
# non-contiguous or non-sequential:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
data[7]
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:
population_dict = {'CA': 38332521,
'TX': 26448193,
'NY': 19651127,
'FL': 19552860,
'IL': 12882135}
population = pd.Series(population_dict)
population
population['CA']
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
population['CA':'NY']
In general:¶
>>> pd.Series(data, index=index)
pd.Series({2:'a', 1:'b', 3:'c'})
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
DataFrame as generalized NumPy array¶
area_dict = {'CA': 423967, 'TX': 695662, 'NY': 141297,
'FL': 170312, 'IL': 149995}
area = pd.Series(area_dict)
area
states = pd.DataFrame({'pop': population,
'area': area})
states
states.index
states.columns
states['area'] # returns a Series
The Pandas Index Object¶
Immutable Array, Ordered Set
ind = pd.Index([2, 3, 5, 7, 11])
ind[0]
ind[::2]
print(ind.size, ind.shape, ind.ndim, ind.dtype)
# Immutable: assignment raises a TypeError
ind[1] = 0
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
indA.intersection(indB)
indA | indB # union
indA ^ indB # xor
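As an aside (not part of the original notes): newer pandas versions deprecate these operator forms on Index objects; the method spellings are the stable equivalents:
indA.union(indB)                  # same result as indA | indB
indA.symmetric_difference(indB)   # same result as indA ^ indB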
Selecting data¶
data['b']
'b' in data
data.keys()
list(data.items())
data['e'] = 1.25
data
#masking
data[(data > 0.3) & (data < 0.8)]
Indexers: loc, iloc¶
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit
data[1]
# implicit
data[1:4]
.loc always references the explicit index:
data.loc[3]
.iloc always references the implicit Python-style index:
data.iloc[2]
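Slicing follows the same rule; a small sketch with the same data Series, where .loc slices by label (inclusive of the stop label) and .iloc slices by position (exclusive, like NumPy):
data.loc[1:3]    # rows labelled 1 and 3
data.iloc[1:3]   # positions 1 and 2 (labels 3 and 5)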
DataFrame as a dict of Series:
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
#access the individual series:
data['area']
#Equivalently, use attribute-style access with column names that are strings:
data.area
# they are exactly the same:
data.area is data['area']
In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
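A quick illustration of why (a sketch, nothing is assigned here): pop is already a DataFrame method name, so attribute access resolves to the method rather than the column:
data.pop          # the bound DataFrame.pop method, not the 'pop' column
data['pop']       # bracket access always returns the column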
Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
data['density'] = data['pop'] / data['area']
data
# as enhanced 2D array:
data.values
data.T
data.values[0]
data.iloc[:3, 1:3]
data.loc[:'New York','pop':'density']
data.loc[data.density > 100, ['pop', 'density']]
# modify values:
data.iloc[0, 2] = 90
data
data[data.density > 100]
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df
np.exp(ser)
np.sin(df * np.pi/4)
# Pandas aligns indices, in case data is incomplete:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
population / area
area.index | population.index # union of indices of two input arrays
A = pd.Series([2, 4, 6], index=[0, 1, 2])
print(A)
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(B)
print(A + B)
Calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:
A.add(B, fill_value=0)
# similarly for dfs:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('12'))
A
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('213'))
B
A + B
A.stack()
A.stack().mean()
A.add(B, fill_value=A.stack().mean())
Python Operator | Pandas Method(s) |
---|---|
+ | add() |
- | sub(), subtract() |
* | mul(), multiply() |
/ | truediv(), div(), divide() |
// | floordiv() |
% | mod() |
** | pow() |
A = rng.randint(10, size=(3, 4))
A
A[0]
A-A[0]
#pandas similarly operates row-wise:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
#column-wise goes like this:
df.subtract(df['R'], axis=0)
df
df.iloc[0, ::2]
df - df.iloc[0, ::2]
Missing Data¶
# sentinel values
vals1 = np.array([1, None, 3, 4])
vals1 # <-- common type = object
vals1.sum() # raises TypeError: addition between int and None is undefined
# NaN is different:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
1 + np.nan #data virus, infects any object it touches:
0 * np.nan
vals2.sum(), vals2.min(), vals2.max()
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
pd.Series([1, np.nan, 2, None])
# detection
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
data[data.notnull()]
data.dropna()
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
By default, dropna() will drop all rows in which any null value is present:
df.dropna()
df.dropna(axis='columns')
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
df[3] = np.nan
df
df.dropna(axis='columns', how='all')
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
data.fillna(0)
#forward-fill
data.fillna(method='ffill')
#back-fill
data.fillna(method='bfill')
df
df.fillna(method='ffill', axis=1)
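Aside (not in the original notes): in newer pandas the method= keyword of fillna() is deprecated; the dedicated methods are direct equivalents:
data.ffill()          # forward-fill
data.bfill()          # back-fill
df.ffill(axis=1)      # same as df.fillna(method='ffill', axis=1)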
Hierarchical Indexing (MultiIndex)¶
Indexing and slicing work as usual, but keep in mind that you can pass level= to many methods. For example:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
health_data.index.levels
health_data['Bob']
health_data['Bob', 'HR']
health_data.loc[:, ('Bob', 'HR')]
health_data.sort_index()
health_data.unstack(level=0)
health_data.unstack(level=1)
health_data.stack(level=0)
health_data.stack(level=1)
health_data.reset_index() #turn index into columns
data_mean = health_data.mean(level='year')
data_mean
data_mean.mean(axis=1, level='type')
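Aside (applies to newer pandas versions, where passing level= to a reduction is deprecated): the groupby spelling below should be equivalent:
health_data.groupby(level='year').mean()     # same as health_data.mean(level='year')
data_mean.T.groupby(level='type').mean().T   # same as data_mean.mean(axis=1, level='type')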
Merge and Join with pd.merge()¶
# For visualization purposes, to see two dfs side-by-side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
1-to-1¶
Like column-wise concatenation. Finds the common column between the two dfs ("employee"), and combines. Order does not matter, and the index is discarded:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df3 = pd.merge(df1, df2)
df3
N-to-1¶
This occurs when one of the two key columns (in this case, "group") contains duplicate entries (here, "Engineering"). The resulting "supervisor" column repeats its information where necessary.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
N-to-N¶
When the key column in both the left and right dfs contains duplicates:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Options for pd.merge()¶
Specify merge key with on¶
pd.merge(df1, df2, on='employee')
left_on and right_on¶
We may have a dataset in which the employee name is labeled "name" rather than "employee". In this case, we can use the left_on and right_on keywords to specify the two column names:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3')
pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1)
left_index and right_index¶
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
pd.merge(df1a, df2a, left_index=True, right_index=True)
For convenience, merging on indices can be done with join()¶
df1a.join(df2a)
pd.merge(df1a, df3, left_index=True, right_on='name')
Set arithmetic for joins (how='inner', how='outer', etc.)¶
When a value appears in one key column but not the other:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
By default, pd.merge() performs an inner join (how='inner'), which finds the intersection of the keys:¶
pd.merge(df6, df7, how='inner')
# outer is union, filled with NAs:
pd.merge(df6, df7, how='outer')
# left keeps all left entries, ignoring Joseph
pd.merge(df6, df7, how='left')
# Right keeps all right entries, ignoring Peter and Paul:
pd.merge(df6, df7, how='right')
suffixes¶
For conflicting column names you wish to keep separate:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))
Example: US States Data¶
Q: rank US states and territories by their 2010 population density
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
pop = pd.read_csv('state-population.csv')
areas = pd.read_csv('state-areas.csv')
abbrevs = pd.read_csv('state-abbrevs.csv')
pop.head()
areas.head()
abbrevs.head()
merged = pd.merge(areas, abbrevs, on='state')
merged.head()
final = pd.merge(merged, pop, left_on='abbreviation', right_on='state/region').drop('state/region', axis=1)
final.head()
final['density'] = final['population'] / final['area (sq. mi)']
final.head()
data2010 = final[(final['year'] == 2010) & (final['ages'] == 'total')].copy()  # copy so the in-place set_index below doesn't warn
data2010
data2010.set_index('abbreviation', inplace=True)
data2010.head()
data2010.density.sort_values(ascending=False)
merged.isnull().any()
GroupBy¶
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
planets.dropna().describe()
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() | Total number of items |
first(), last() | First and last item |
mean(), median() | Mean and median |
min(), max() | Minimum and maximum |
std(), var() | Standard deviation and variance |
mad() | Mean absolute deviation |
prod() | Product of all items |
sum() | Sum of all items |
These are all methods of DataFrame and Series objects.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
df.groupby('key') # think of this as a special view of the df.
# ready to perform some aggregation
df.groupby('key').sum()
planets.head()
planets.groupby('method')['orbital_period'].median()
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
planets.groupby('method')['year'].describe()
Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
This is just one example of the utility of dispatch methods. Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!
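A couple more dispatch examples, for illustration (using columns that exist in the seaborn planets dataset):
planets.groupby('method')['year'].min()           # earliest discovery year per method
planets.groupby('method')['distance'].median()    # typical distance per method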
GroupBy objects have aggregate(), filter(), transform(), and apply() methods¶
These efficiently implement a variety of useful operations before combining the grouped data.
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
.aggregate()¶
It can take a string, a function, or a list thereof:
df.groupby('key').aggregate(['min', np.median, max])
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
.filter()¶
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
def filter_func(x):
return x['data2'].std() > 4
df
df.groupby('key').std()
df.groupby('key').filter(filter_func)
.apply()¶
The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an apply() that normalizes the first column by the sum of the second:
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
df
df.groupby('key').apply(norm_by_data2)
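transform() appears in the heading above but is not demonstrated in these notes; a minimal sketch: it returns an object indexed like the input, which makes it handy for group-wise operations such as centering:
df.groupby('key').transform(lambda x: x - x.mean())   # subtract each group's mean from its members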
splitting by other methods¶
Another method is to provide a dictionary that maps index values to the group keys:
df
df2 = df.set_index('key')
df2
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()
Similar to mapping, you can pass any Python function that will input the index value and output the group:
df2.groupby(str.lower).mean()
df2.groupby([str.lower, mapping]).mean()
# grouping example
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna('-')
titanic = sns.load_dataset('titanic')
titanic.head()
titanic.groupby('sex')['survived'].mean()
This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class.
Using the vocabulary of GroupBy, we might proceed using something like this: we
- group by class and gender
- select survival
- apply a mean aggregate
- combine the resulting groups
- and then unstack the hierarchical index to reveal the hidden multidimensionality.
In code:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
While each step of this pipeline makes sense in light of the tools we've previously discussed, the long string of code is not particularly easy to read or use.
This two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multi-dimensional aggregation.
titanic.pivot_table('survived', index='sex', columns='class')
multi-level pivot tables¶
age = pd.cut(titanic['age'], [0, 18, 100])
titanic.pivot_table('survived', ['sex', age], 'class')
We can apply the same strategy when working with the columns as well; let's add info on the fare paid using pd.qcut to automatically compute quantiles:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
# full call signature as of Pandas 0.18
pd.pivot_table(data, values=None, index=None, columns=None,
               aggfunc='mean', fill_value=None,
               margins=False, dropna=True, margins_name='All')
titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
At times it's useful to compute totals along each grouping. This can be done via the margins keyword:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
births.head()
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])  # robust std estimate: for a Gaussian, IQR ≈ 1.349σ, so 0.74·IQR ≈ σ
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births['day'] = births['day'].astype(int)
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
births_by_date = births.pivot_table('births',
[births.index.month, births.index.day])
from datetime import datetime
births_by_date.index = [datetime(2012, month, day)   # dummy year; 2012 is a leap year, so Feb 29 is valid
                        for (month, day) in births_by_date.index]
births_by_date.head()
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
Vectorized .str operations¶
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names.str.capitalize()
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
Pandas str methods that mirror Python string methods: len(), lower(), translate(), islower(), ljust(), upper(), startswith(), isupper(), rjust(), find(), endswith(), isnumeric(), center(), rfind(), isalnum(), isdecimal(), zfill(), index(), isalpha(), split(), strip(), rindex(), isdigit(), rsplit(), rstrip(), capitalize(), isspace(), partition(), lstrip(), swapcase(), istitle(), rpartition()
monte.str.lower()
monte.str.len()
monte.str.startswith('T')
monte.str.startswith('t')
monte.str.split()
Method | Description |
---|---|
match() | Call re.match() on each element, returning a boolean |
extract() | Call re.match() on each element, returning matched groups as strings |
findall() | Call re.findall() on each element |
replace() | Replace occurrences of pattern with some other string |
contains() | Call re.search() on each element, returning a boolean |
count() | Count occurrences of pattern |
split() | Equivalent to str.split(), but accepts regexps |
rsplit() | Equivalent to str.rsplit(), but accepts regexps |
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Get Dummies¶
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
full_monte['info'].str.get_dummies('|')
Recipe Recommendations¶
!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
!gzip -d 20170107-061401-recipeitems.json.gz
!mv 20170107-061401-recipeitems.json recipeitems-latest.json
try:
recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
print("ValueError:", e)
with open('recipeitems-latest.json') as f:
line = f.readline()
pd.read_json(line).shape
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r', encoding="utf8") as f:
# Extract each line
data = (line.strip() for line in f)
# Reformat so each line is the element of a list
data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)
recipes.shape
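Aside (not in the original notes): since the file is newline-delimited JSON, recent pandas versions can read it directly, which should produce the same DataFrame in one call:
recipes = pd.read_json('recipeitems-latest.json', lines=True)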
recipes.iloc[0]
recipes.ingredients.str.len().describe()
recipes.name[np.argmax(np.array(recipes.ingredients.str.len()))]
recipes.description.str.contains('[Bb]reakfast').sum()
recipes.ingredients.str.contains('[Cc]innamon').sum()
recipes.ingredients.str.contains('[Cc]inamon').sum()
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
import re
# pass flags= by keyword; the second positional argument of str.contains is case, not flags
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, flags=re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
recipes.name[selection.index]
Time Series¶
query() and eval()¶
If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression.
You can check the approximate size of your array in bytes using this:
df.values.nbytes
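A minimal sketch of the two interfaces, reusing the small key/data1/data2 frame from the GroupBy section (the benefit only shows up on much larger frames, where compound masking allocates large temporary arrays):
mask_result = df[(df.data1 > 2) & (df.data2 < 5)]     # standard boolean masking
via_query = df.query('data1 > 2 and data2 < 5')       # same result via query()
via_eval = df[df.eval('data1 > 2 and data2 < 5')]     # eval() returns the boolean mask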