pandas, MLR, and RESTful APIs, oh my

Posted on Wed 14 February 2018 in Projects

Munge, Plot, Verify Data

We start by importing the data, which were downloaded as multiple different .csv files from the business's financial management system. We import the data into pandas dataframes:

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import time
In [2]:
cat = pd.read_csv('brewery data/category sales 2017.csv', parse_dates=True, index_col=0)
df1 = cat.transpose()
df1.columns = [c.lower().replace(' ', '_') for c in df1.columns]
df1.head(3)
Out[2]:
beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch flights bottle_sales spirits crowlers uncategorized kegs
1/1/2017 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/2/2017 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/3/2017 48.26 36.8 58.49 10.38 13.57 12.27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [3]:
daily = pd.read_csv('brewery data/daily sales summary 2017.csv', parse_dates=True, index_col=0)
df2 = daily.transpose()
df2.columns = [c.lower().replace(' ', '_') for c in df2.columns]
df2.head(3)
Out[3]:
gross_sales refunds discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total
1/1/2017 0.00 0.0 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00
1/2/2017 0.00 0.0 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00
1/3/2017 179.77 0.0 -42.35 137.42 0.0 8.23 28.15 0.0 173.8 -4.08 169.72

Some munging gets us a final dataframe: columns are the various parameters of interest, and rows are the days of the 2017 calendar year (so we have 365 rows).

In [4]:
df = pd.concat([df1, df2], axis = 1)
df = df.reset_index()
df = df.rename(columns={'index':'date'})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['weekday'] = df['date'].apply(lambda x: x.weekday())   # Monday=0, Sunday=6
df.head(3)
Out[4]:
date beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch ... discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total weekday
0 2017-01-01 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00 6
1 2017-01-02 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00 0
2 2017-01-03 48.26 36.8 58.49 10.38 13.57 12.27 0.0 0.0 0.0 ... -42.35 137.42 0.0 8.23 28.15 0.0 173.8 -4.08 169.72 1

3 rows × 28 columns

In [5]:
len(df)
Out[5]:
365
In [6]:
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), legend=None)
plt.ylim([0, 4000])
plt.ylabel('net_total ($)')
Out[6]:
<matplotlib.text.Text at 0xb5ddfd0>
In [7]:
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), label = 'net total ($)')
plt.ylim([0,10000])
plt.xlim([dt.date(2017, 2, 1), dt.date(2017, 3, 1)])
plt.ylabel('net_total ($)')

palisade_reunion = dt.date(2017, 2, 18)
devin_tnt = dt.date(2017, 2, 21)
the_foggy_tops = dt.date(2017, 2, 24)
the_aristocats = dt.date(2017, 2, 25)

plt.axvline(palisade_reunion, c = 'r', ls = '--', label = 'palisade reunion bottle release')
plt.axvline(devin_tnt, c = 'k', ls = '-', label = "devin's TNT")
plt.axvline(the_foggy_tops, c = 'c', ls = '-.', label = 'live music: the foggy tops')
plt.axvline(the_aristocats, c = 'y', ls = '-', label = 'live music: the aristocats')

plt.legend(loc='best')
Out[7]:
<matplotlib.legend.Legend at 0xbd3dac8>
In [8]:
daily_avg = df.groupby('weekday').mean().round(2)
daily_avg   # Monday=0, Sunday=6
Out[8]:
beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch flights ... refunds discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total
weekday
0 40.93 53.74 43.70 18.14 5.12 13.10 1.17 10.89 1.27 2.05 ... -1.31 -11.76 344.04 7.27 20.52 34.13 -1.23 404.73 -9.78 394.95
1 100.08 104.03 123.44 71.33 5.04 32.16 2.63 0.00 0.82 1.27 ... -2.65 -20.68 528.47 0.00 31.47 85.09 -2.77 642.26 -15.25 627.01
2 104.21 130.61 110.60 51.93 5.31 34.90 2.29 3.63 1.27 1.58 ... -9.16 -8.56 614.00 0.00 36.67 66.21 -3.81 713.07 -15.88 697.18
3 109.84 97.45 128.56 54.19 6.44 34.99 2.12 7.26 1.54 2.96 ... -1.14 -6.15 543.62 0.00 32.49 77.78 -0.69 653.21 -15.12 638.09
4 191.38 200.40 270.38 116.00 14.87 98.85 8.17 3.63 0.73 5.10 ... -2.47 -33.19 1028.48 0.00 61.27 161.86 -3.59 1248.01 -29.44 1218.57
5 193.17 297.63 218.42 138.50 30.59 85.91 10.25 7.26 5.00 8.84 ... -2.58 -47.44 1258.22 0.00 74.90 165.14 -3.98 1494.28 -34.22 1460.06
6 81.36 109.08 94.83 55.88 15.67 26.48 3.65 0.00 1.88 5.19 ... -0.68 -27.51 447.71 0.00 26.40 72.21 0.00 546.32 -12.94 533.38

7 rows × 26 columns

In [9]:
f, ax = plt.subplots()

var = 'net_total'
df.groupby('weekday')[var].mean().plot()
weekday_names = "Mon Tue Wed Thu Fri Sat Sun".split(' ')
ax.set_xticks(range(7))
ax.set_xticklabels(weekday_names, rotation=45)
plt.ylabel('average ' + var + (' ($)'))
Out[9]:
<matplotlib.text.Text at 0xbcb0c88>

Multiple Linear Regression (MLR) with statsmodels

$y = \sum_{n} c_n x_n + \epsilon = c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_n x_n + \epsilon$

where:

  • y = the dependent (predicted) variable, e.g. 'net_total'.
  • $x_n$ = the independent (predictor) variables, e.g. 'beers sold', 'hours open', or 'day of week'.
  • $c_n$ = the regression coefficients, i.e. the change in the dependent variable per unit change in the corresponding predictor.
  • $\epsilon$ = the residual, i.e. the error in the fit.
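For intuition, the single-predictor case of the model above has a closed-form least-squares solution: the slope is cov(x, y) / var(x), and the intercept makes the line pass through the point of means. A minimal sketch on made-up numbers (values chosen to lie near y = 2x + 1):

```python
# Closed-form simple linear regression (y = c1*x + intercept + error).
# The data below are made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope c1 = cov(x, y) / var(x)
c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
# intercept puts the fitted line through (mean_x, mean_y)
intercept = mean_y - c1 * mean_x

print(round(c1, 2), round(intercept, 2))   # 1.97 1.06
```

With more than one predictor, the same idea generalizes to solving the normal equations, which is what statsmodels does for us below.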

In this application, we wish to use the following variables to predict the dependent variable, net profit (net_total):

beer_flagships , beer_seasonal , brew_guest , food , growlers , wine_and_cider , soda , merch , flights , bottle_sales , spirits , crowlers , kegs
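The notebook below fits one predictor at a time, but statsmodels' formula API accepts any number of predictors joined with '+', so the full model can be built as a string. A quick sketch (column names taken from the dataframe above):

```python
# Build the multi-predictor formula string for statsmodels' formula API
# by joining the predictor column names with '+'.
predictors = ['beer_flagships', 'beer_seasonal', 'brew_guest', 'food',
              'growlers', 'wine_and_cider', 'soda', 'merch', 'flights',
              'bottle_sales', 'spirits', 'crowlers', 'kegs']

formula = 'net_total ~ ' + ' + '.join(predictors)
print(formula)
# results = smf.ols(formula, data=df).fit()   # would fit all predictors at once
```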

In [10]:
import statsmodels.formula.api as smf
In [11]:
results = smf.ols('net_total ~ beer_seasonal', data=df).fit()

results.summary()
Out[11]:
OLS Regression Results
Dep. Variable: net_total R-squared: 0.668
Model: OLS Adj. R-squared: 0.667
Method: Least Squares F-statistic: 730.2
Date: Thu, 01 Mar 2018 Prob (F-statistic): 6.44e-89
Time: 12:26:55 Log-Likelihood: -2770.1
No. Observations: 365 AIC: 5544.
Df Residuals: 363 BIC: 5552.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 394.5451 29.153 13.534 0.000 337.215 451.875
beer_seasonal 2.8241 0.105 27.022 0.000 2.619 3.030
Omnibus: 177.280 Durbin-Watson: 1.364
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2065.999
Skew: 1.730 Prob(JB): 0.00
Kurtosis: 14.130 Cond. No. 324.

Visualize each parameter's relationship to net_total

In [12]:
fig, ax = plt.subplots(figsize=(12, 6))
plt.axis('off')

for key, num in zip(df1.keys(), range(1, 16)):
    ax = fig.add_subplot(3, 5, num)
    ax.scatter(df[key], df['net_total'], s=5, c='k', marker='.')
    ax.set_title(key)
    ax.set_ylim(-500,5000)
    ax.set_xlim(-100,1000)
    ax.xaxis.set_major_locator(plt.MaxNLocator(4))
    ax.yaxis.set_major_locator(plt.MaxNLocator(4))
    plt.tight_layout(w_pad=0.2)

plt.show()
In [13]:
for key in df1.keys():
    results = smf.ols('net_total ~ '+ key, data=df).fit()
    print(key, results.rsquared)
beer_flagships 0.227081519369
beer_seasonal 0.667941650209
brew_guest 0.172562855854
food 0.342457812996
growlers 0.0322134612388
wine_and_cider 0.236260517853
soda 0.134356060618
memberships 0.0176718839336
merch 0.0109826887202
flights 0.0207434099073
bottle_sales 0.715102607252
spirits 0.000120601443241
crowlers 0.0212923225072
uncategorized 0.000278440322601
kegs 0.0197746113386
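To see at a glance which predictors explain the most variance, the R² values can be collected in a dict and sorted descending. A small sketch (values copied from the output above, rounded to three places):

```python
# Rank the single-predictor fits by R^2 (values rounded from the output above)
r_squared = {
    'beer_flagships': 0.227, 'beer_seasonal': 0.668, 'brew_guest': 0.173,
    'food': 0.342, 'growlers': 0.032, 'wine_and_cider': 0.236,
    'soda': 0.134, 'memberships': 0.018, 'merch': 0.011, 'flights': 0.021,
    'bottle_sales': 0.715, 'spirits': 0.000, 'crowlers': 0.021,
    'uncategorized': 0.000, 'kegs': 0.020}

ranked = sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

bottle_sales and beer_seasonal lead by a wide margin; the rest trail off quickly.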

Harvest tweets with Twitter's API

I set up a dev account with Twitter to programmatically query the "Full Archive" (which goes back to ~2006). I used the "Sandbox" environment, which imposes rate and query limits. Fun to play with, at least.

Twitter's API is OAuth protected, so after setting up your dev account, you have to authenticate by passing in your consumer key and secret:

In [14]:
import base64
import requests
In [ ]:
client_key = 'XXXXXXXXXXXXXXXXXXXXXXXXXX'
client_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN= 'XXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN_SECRET= 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')
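The Basic-auth header value is just base64("key:secret"). With dummy credentials (these are not real keys), the encoding step looks like this:

```python
import base64

# Encode a dummy "key:secret" pair the same way as above
dummy = 'key:secret'.encode('ascii')
encoded = base64.b64encode(dummy).decode('ascii')
print(encoded)   # a2V5OnNlY3JldA==
```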
In [ ]:
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
auth_data = {
    'grant_type': 'client_credentials'}
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
access_token = auth_resp.json()['access_token']

search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)}

auth_resp.status_code
In [ ]:
def twitter_search(query, fromDate):
    search_url = '{}1.1/tweets/search/fullarchive/learning.json'.format(base_url)

    search_params = {
        'query': query,
        'fromDate': fromDate,
        'toDate': time.strftime("%Y%m%d0000")}

    search_resp = requests.get(search_url, headers=search_headers, params=search_params)
    print(search_resp.status_code)

    tweets = search_resp.json()['results']

    # Page through the archive: each response carries a 'next' token
    # until the final page, which omits it
    while 'next' in search_resp.json():
        time.sleep(31)   # stay under the sandbox rate limit
        search_params['next'] = search_resp.json()['next']

        search_resp = requests.get(search_url, headers=search_headers, params=search_params)
        print(search_resp.status_code)

        tweets.extend(search_resp.json()['results'])

    return tweets
In [ ]:
tweets = twitter_search('"this exact phrase" OR @thisuser', '201401010000')
In [ ]:
for tweet in tweets:
    print(tweet['created_at'] + ' - ' + tweet['user']['screen_name'] + ': ' + tweet['text'] + '\n' +
          '_________________________________________________________________________________________')