DS Notebook 03
Posted on Wed 04 December 2019 in Projects
Pandas review¶
And parsing thousands of recipes scraped from the internet, ex:¶
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
recipes.name[selection.index]
import numpy as np
import pandas as pd
Series as generalized NumPy array¶
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
data.values
data.index
data[1]
data[1:3]
A Pandas Series object has an explicitly defined index associated with the values (as opposed to NumPy arrays' implicit integer index).
Need not be an integer:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
data['a']
# non-contiguous or non-sequential:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=[2, 5, 3, 7])
data
data[7]
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:
population_dict = {'CA': 38332521,
'TX': 26448193,
'NY': 19651127,
'FL': 19552860,
'IL': 12882135}
population = pd.Series(population_dict)
population
population['CA']
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
population['CA':'NY']
In general:¶
>>> pd.Series(data, index=index)
pd.Series({2:'a', 1:'b', 3:'c'})
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
DataFrame as generalized NumPy array¶
area_dict = {'CA': 423967, 'TX': 695662, 'NY': 141297,
'FL': 170312, 'IL': 149995}
area = pd.Series(area_dict)
area
states = pd.DataFrame({'pop': population,
'area': area})
states
states.index
states.columns
states['area'] # returns a Series
The Pandas Index Object¶
Immutable Array, Ordered Set
ind = pd.Index([2, 3, 5, 7, 11])
ind[0]
ind[::2]
print(ind.size, ind.shape, ind.ndim, ind.dtype)
# Immutable: assignment raises a TypeError
ind[1] = 0
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
indA.intersection(indB)
indA | indB # union
indA ^ indB # xor
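As an aside (not part of the original notes): newer pandas versions deprecate these operator forms on Index objects; the method spellings are the stable equivalents:
indA.union(indB)                  # same result as indA | indB
indA.symmetric_difference(indB)   # same result as indA ^ indB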
Selecting data¶
data['b']
'b' in data
data.keys()
list(data.items())
data['e'] = 1.25
data
#masking
data[(data > 0.3) & (data < 0.8)]
Indexers: loc, iloc¶
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
# explicit
data[1]
# implicit
data[1:4]
.loc always references the explicit index:
data.loc[3]
.iloc always references the implicit Python-style index:
data.iloc[2]
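Slicing follows the same rule; a small sketch with the same data Series, where .loc slices by label (inclusive of the stop label) and .iloc slices by position (exclusive, like NumPy):
data.loc[1:3]    # rows labelled 1 and 3
data.iloc[1:3]   # positions 1 and 2 (labels 3 and 5)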
DataFrame as a dict of Series:
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
#access the individual series:
data['area']
#Equivalently, use attribute-style access with column names that are strings:
data.area
# they are exactly the same:
data.area is data['area']
In particular, you should avoid the temptation to try column assignment via attribute (i.e., use data['pop'] = z rather than data.pop = z).
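A quick illustration of why (a sketch, nothing is assigned here): pop is already a DataFrame method name, so attribute access resolves to the method rather than the column:
data.pop          # the bound DataFrame.pop method, not the 'pop' column
data['pop']       # bracket access always returns the column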
Like with the Series objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
data['density'] = data['pop'] / data['area']
data
# as enhanced 2D array:
data.values
data.T
data.values[0]
data.iloc[:3, 1:3]
data.loc[:'New York','pop':'density']
data.loc[data.density > 100, ['pop', 'density']]
# modify values:
data.iloc[0, 2] = 90
data
data[data.density > 100]
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df
np.exp(ser)
np.sin(df * np.pi/4)
# Pandas aligns indices, in case data is incomplete:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
population / area
area.index | population.index # union of indices of two input arrays
A = pd.Series([2, 4, 6], index=[0, 1, 2])
print(A)
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(B)
print(A + B)
Calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:
A.add(B, fill_value=0)
# similarly for dfs:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('12'))
A
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('213'))
B
A + B
A.stack()
A.stack().mean()
A.add(B, fill_value=A.stack().mean())
Python Operator | Pandas Method(s) |
---|---|
+ | add() |
- | sub(), subtract() |
* | mul(), multiply() |
/ | truediv(), div(), divide() |
// | floordiv() |
% | mod() |
** | pow() |
A = rng.randint(10, size=(3, 4))
A
A[0]
A-A[0]
#pandas similarly operates row-wise:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
#column-wise goes like this:
df.subtract(df['R'], axis=0)
df
df.iloc[0, ::2]
df - df.iloc[0, ::2]
Missing Data¶
# sentinel values
vals1 = np.array([1, None, 3, 4])
vals1 # <-- common type = object
vals1.sum() # raises TypeError: addition between int and None is undefined
# NaN is different:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
1 + np.nan #data virus, infects any object it touches:
0 * np.nan
vals2.sum(), vals2.min(), vals2.max()
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
pd.Series([1, np.nan, 2, None])
# detection
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
data[data.notnull()]
data.dropna()
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
By default, dropna() will drop all rows in which any null value is present:
df.dropna()
df.dropna(axis='columns')
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
df[3] = np.nan
df
df.dropna(axis='columns', how='all')
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
data.fillna(0)
#forward-fill
data.fillna(method='ffill')
#back-fill
data.fillna(method='bfill')
df
df.fillna(method='ffill', axis=1)
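Aside (not in the original notes): in newer pandas the method= keyword of fillna() is deprecated; the dedicated methods are direct equivalents:
data.ffill()          # forward-fill
data.bfill()          # back-fill
df.ffill(axis=1)      # same as df.fillna(method='ffill', axis=1)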
Hierarchical Indexing (MultiIndex)¶
Indexing and slicing work as usual, but keep in mind that you can pass level= to many methods. For example:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
health_data.index.levels
health_data['Bob']
health_data['Bob', 'HR']
health_data.loc[:, ('Bob', 'HR')]
health_data.sort_index()
health_data.unstack(level=0)
health_data.unstack(level=1)
health_data.stack(level=0)
health_data.stack(level=1)
health_data.reset_index() #turn index into columns
data_mean = health_data.mean(level='year')
data_mean
data_mean.mean(axis=1, level='type')
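Aside (applies to newer pandas versions, where passing level= to a reduction is deprecated): the groupby spelling below should be equivalent:
health_data.groupby(level='year').mean()     # same as health_data.mean(level='year')
data_mean.T.groupby(level='type').mean().T   # same as data_mean.mean(axis=1, level='type')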
Merge and Join with pd.merge()¶
# For visualization purposes, to see two dfs side-by-side
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
1-to-1¶
Like column-wise concatenation. Finds the common column between the two dfs ("employee"), and combines. Order does not matter, and the index is discarded:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df3 = pd.merge(df1, df2)
df3
N-to-1¶
This occurs when one of the two key columns (in this case, "group") contains duplicate entries (here, "Engineering"). The resulting "supervisor" column repeats its information where necessary.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
N-to-N¶
When the key column in both the left and right dfs contains duplicates:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
Options for pd.merge()¶
Specify merge key with on¶
pd.merge(df1, df2, on='employee')
left_on and right_on¶
We may have a dataset in which the employee name is labeled "name" rather than "employee". In this case, we can use the left_on and right_on keywords to specify the two column names:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3')
pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1)
left_index and right_index¶
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
pd.merge(df1a, df2a, left_index=True, right_index=True)
For convenience, merging on indices can be done with join()¶
df1a.join(df2a)
pd.merge(df1a, df3, left_index=True, right_on='name')
Set arithmetic for joins (how='inner', how='outer', etc.)¶
When a value appears in one key column but not the other:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
By default, pd.merge() performs an inner join (how='inner'), which finds the intersection of the keys:¶
pd.merge(df6, df7, how='inner')
# outer is union, filled with NAs:
pd.merge(df6, df7, how='outer')
# left keeps all left entries, ignoring Joseph
pd.merge(df6, df7, how='left')
# Right keeps all right entries, ignoring Peter and Paul:
pd.merge(df6, df7, how='right')
suffixes¶
For conflicting column names you wish to keep separate:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
pd.merge(df8, df9, on='name', suffixes=('_L', '_R'))
Example: US States Data¶
Q: rank US states and territories by their 2010 population density
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
!curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
pop = pd.read_csv('state-population.csv')
areas = pd.read_csv('state-areas.csv')
abbrevs = pd.read_csv('state-abbrevs.csv')
pop.head()
areas.head()
abbrevs.head()
merged = pd.merge(areas, abbrevs, on='state')
merged.head()
final = pd.merge(merged, pop, left_on='abbreviation', right_on='state/region').drop('state/region', axis=1)
final.head()
final['density'] = final['population'] / final['area (sq. mi)']
final.head()
data2010 = final[(final['year'] == 2010) & (final['ages'] == 'total')].copy()  # copy so the in-place set_index below doesn't warn
data2010
data2010.set_index('abbreviation', inplace=True)
data2010.head()
data2010.density.sort_values(ascending=False)
merged.isnull().any()
GroupBy¶
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
planets.head()
planets.dropna().describe()
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() | Total number of items |
first(), last() | First and last item |
mean(), median() | Mean and median |
min(), max() | Minimum and maximum |
std(), var() | Standard deviation and variance |
mad() | Mean absolute deviation |
prod() | Product of all items |
sum() | Sum of all items |
These are all methods of DataFrame and Series objects.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
df.groupby('key') # think of this as a special view of the df.
# ready to perform some aggregation
df.groupby('key').sum()
planets.head()
planets.groupby('method')['orbital_period'].median()
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
planets.groupby('method')['year'].describe()
Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade. The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
This is just one example of the utility of dispatch methods. Notice that they are applied to each individual group, and the results are then combined within GroupBy and returned. Again, any valid DataFrame/Series method can be used on the corresponding GroupBy object, which allows for some very flexible and powerful operations!
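A couple more dispatch examples, for illustration (using columns that exist in the seaborn planets dataset):
planets.groupby('method')['year'].min()           # earliest discovery year per method
planets.groupby('method')['distance'].median()    # typical distance per method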
GroupBy objects have aggregate(), filter(), transform(), and apply() methods¶
These efficiently implement a variety of useful operations before combining the grouped data.
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
.aggregate()¶
It can take a string, a function, or a list thereof:
df.groupby('key').aggregate(['min', np.median, max])
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
.filter()¶
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
def filter_func(x):
return x['data2'].std() > 4
df
df.groupby('key').std()
df.groupby('key').filter(filter_func)
.apply()¶
The apply() method lets you apply an arbitrary function to the group results. The function should take a DataFrame, and return either a Pandas object (e.g., DataFrame, Series) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an apply() that normalizes the first column by the sum of the second:
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
df
df.groupby('key').apply(norm_by_data2)
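transform() appears in the heading above but is not demonstrated in these notes; a minimal sketch: it returns an object indexed like the input, which makes it handy for group-wise operations such as centering:
df.groupby('key').transform(lambda x: x - x.mean())   # subtract each group's mean from its members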
splitting by other methods¶
Another method is to provide a dictionary that maps index values to the group keys:
df
df2 = df.set_index('key')
df2
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()
Similar to mapping, you can pass any Python function that will input the index value and output the group:
df2.groupby(str.lower).mean()
df2.groupby([str.lower, mapping]).mean()
# grouping example
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna('-')
titanic = sns.load_dataset('titanic')
titanic.head()
titanic.groupby('sex')['survived'].mean()
This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class.
Using the vocabulary of GroupBy, we might proceed using something like this: we
- group by class and gender
- select survival
- apply a mean aggregate
- combine the resulting groups
- and then unstack the hierarchical index to reveal the hidden multidimensionality.
In code:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
While each step of this pipeline makes sense in light of the tools we've previously discussed, the long string of code is not particularly easy to read or use.
This two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multi-dimensional aggregation.
titanic.pivot_table('survived', index='sex', columns='class')
multi-level pivot tables¶
age = pd.cut(titanic['age'], [0, 18, 100])
titanic.pivot_table('survived', ['sex', age], 'class')
We can apply the same strategy when working with the columns as well; let's add info on the fare paid using pd.qcut to automatically compute quantiles:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
# full call signature as of Pandas 0.18
pd.pivot_table(data, values=None, index=None, columns=None,
               aggfunc='mean', fill_value=None,
               margins=False, dropna=True, margins_name='All')
titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
At times it's useful to compute totals along each grouping. This can be done via the margins keyword:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
!curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
births = pd.read_csv('births.csv')
births.head()
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])  # robust std estimate: for a Gaussian, IQR ≈ 1.349σ, so 0.74·IQR ≈ σ
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
births['day'] = births['day'].astype(int)
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
births_by_date = births.pivot_table('births',
[births.index.month, births.index.day])
from datetime import datetime
births_by_date.index = [datetime(2012, month, day)   # dummy year; 2012 is a leap year, so Feb 29 is valid
                        for (month, day) in births_by_date.index]
births_by_date.head()
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
Vectorized .str operations¶
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
names = pd.Series(data)
names.str.capitalize()
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
Pandas str methods that mirror Python string methods: len(), lower(), translate(), islower(), ljust(), upper(), startswith(), isupper(), rjust(), find(), endswith(), isnumeric(), center(), rfind(), isalnum(), isdecimal(), zfill(), index(), isalpha(), split(), strip(), rindex(), isdigit(), rsplit(), rstrip(), capitalize(), isspace(), partition(), lstrip(), swapcase(), istitle(), rpartition()
monte.str.lower()
monte.str.len()
monte.str.startswith('T')
monte.str.startswith('t')
monte.str.split()
Method | Description |
---|---|
match() | Call re.match() on each element, returning a boolean |
extract() | Call re.match() on each element, returning matched groups as strings |
findall() | Call re.findall() on each element |
replace() | Replace occurrences of pattern with some other string |
contains() | Call re.search() on each element, returning a boolean |
count() | Count occurrences of pattern |
split() | Equivalent to str.split(), but accepts regexps |
rsplit() | Equivalent to str.rsplit(), but accepts regexps |
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
Get Dummies¶
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
full_monte['info'].str.get_dummies('|')
Recipe Recommendations¶
!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
!gzip -d 20170107-061401-recipeitems.json.gz
!mv 20170107-061401-recipeitems.json recipeitems-latest.json
try:
recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
print("ValueError:", e)
with open('recipeitems-latest.json') as f:
line = f.readline()
pd.read_json(line).shape
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r', encoding="utf8") as f:
# Extract each line
data = (line.strip() for line in f)
# Reformat so each line is the element of a list
data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)
recipes.shape
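Aside (not in the original notes): since the file is newline-delimited JSON, recent pandas versions can read it directly, which should produce the same DataFrame in one call:
recipes = pd.read_json('recipeitems-latest.json', lines=True)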
recipes.iloc[0]
recipes.ingredients.str.len().describe()
recipes.name[np.argmax(np.array(recipes.ingredients.str.len()))]
recipes.description.str.contains('[Bb]reakfast').sum()
recipes.ingredients.str.contains('[Cc]innamon').sum()
recipes.ingredients.str.contains('[Cc]inamon').sum()
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
import re
# pass flags= by keyword; the second positional argument of str.contains is case, not flags
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, flags=re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()
selection = spice_df.query('rosemary & cumin & tarragon')
len(selection)
recipes.name[selection.index]
Time Series¶
query() and eval()¶
If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it's a good idea to use an eval() or query() expression.
You can check the approximate size of your array in bytes using this:
df.values.nbytes
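A minimal sketch of the two interfaces, reusing the small key/data1/data2 frame from the GroupBy section (the benefit only shows up on much larger frames, where compound masking allocates large temporary arrays):
mask_result = df[(df.data1 > 2) & (df.data2 < 5)]     # standard boolean masking
via_query = df.query('data1 > 2 and data2 < 5')       # same result via query()
via_eval = df[df.eval('data1 > 2 and data2 < 5')]     # eval() returns the boolean mask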