pandas, MLR, and RESTful APIs, oh my

Posted on Wed 14 February 2018 in Projects

Munge, Plot, Verify Data

We start by importing the data, which were downloaded as multiple different .csv files from the business's financial management system. We import the data into pandas dataframes:

In [1]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import time
In [2]:
cat = pd.read_csv('brewery data/category sales 2017.csv', parse_dates=True, index_col=0)
df1 = cat.transpose()
df1.columns = [c.lower().replace(' ', '_') for c in df1.columns]
df1.head(3)
Out[2]:
beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch flights bottle_sales spirits crowlers uncategorized kegs
1/1/2017 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/2/2017 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1/3/2017 48.26 36.8 58.49 10.38 13.57 12.27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [3]:
daily = pd.read_csv('brewery data/daily sales summary 2017.csv', parse_dates=True, index_col=0)
df2 = daily.transpose()
df2.columns = [c.lower().replace(' ', '_') for c in df2.columns]
df2.head(3)
Out[3]:
gross_sales refunds discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total
1/1/2017 0.00 0.0 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00
1/2/2017 0.00 0.0 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00
1/3/2017 179.77 0.0 -42.35 137.42 0.0 8.23 28.15 0.0 173.8 -4.08 169.72

Some munging gets us a final dataframe: columns are the various parameters of interest, and rows are the days of the 2017 calendar year (so we have 365 rows).

In [4]:
df = pd.concat([df1, df2], axis = 1)
df = df.reset_index()
df = df.rename(columns={'index':'date'})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['weekday'] = df['date'].apply(lambda x: x.weekday())   # Monday=0, Sunday=6
df.head(3)
Out[4]:
date beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch ... discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total weekday
0 2017-01-01 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00 6
1 2017-01-02 0.00 0.0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 ... 0.00 0.00 0.0 0.00 0.00 0.0 0.0 0.00 0.00 0
2 2017-01-03 48.26 36.8 58.49 10.38 13.57 12.27 0.0 0.0 0.0 ... -42.35 137.42 0.0 8.23 28.15 0.0 173.8 -4.08 169.72 1

3 rows × 28 columns

In [5]:
len(df)
Out[5]:
365
In [6]:
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), legend=None)
plt.ylim([0, 4000])
plt.ylabel('net_total ($)')
Out[6]:
<matplotlib.text.Text at 0xb5ddfd0>
In [7]:
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), label = 'net total ($)')
plt.ylim([0,10000])
plt.xlim([dt.date(2017, 2, 1), dt.date(2017, 3, 1)])
plt.ylabel('net_total ($)')

palisade_reunion = dt.date(2017, 2, 18)
devin_tnt = dt.date(2017, 2, 21)
the_foggy_tops = dt.date(2017, 2, 24)
the_aristocats = dt.date(2017, 2, 25)

plt.axvline(palisade_reunion, c = 'r', ls = '--', label = 'palisade reunion bottle release')
plt.axvline(devin_tnt, c = 'k', ls = '-', label = "devin's TNT")
plt.axvline(the_foggy_tops, c = 'c', ls = '-.', label = 'live music: the foggy tops')
plt.axvline(the_aristocats, c = 'y', ls = '-', label = 'live music: the aristocats')

plt.legend(loc='best')
Out[7]:
<matplotlib.legend.Legend at 0xbd3dac8>
In [8]:
daily_avg = df.groupby('weekday').mean().round(2)
daily_avg   # Monday=0, Sunday=6
Out[8]:
beer_flagships beer_seasonal brew_guest food growlers wine_and_cider soda memberships merch flights ... refunds discounts_and_comps net_sales gift_card_sales tax tip partial_refunds total_collected fees net_total
weekday
0 40.93 53.74 43.70 18.14 5.12 13.10 1.17 10.89 1.27 2.05 ... -1.31 -11.76 344.04 7.27 20.52 34.13 -1.23 404.73 -9.78 394.95
1 100.08 104.03 123.44 71.33 5.04 32.16 2.63 0.00 0.82 1.27 ... -2.65 -20.68 528.47 0.00 31.47 85.09 -2.77 642.26 -15.25 627.01
2 104.21 130.61 110.60 51.93 5.31 34.90 2.29 3.63 1.27 1.58 ... -9.16 -8.56 614.00 0.00 36.67 66.21 -3.81 713.07 -15.88 697.18
3 109.84 97.45 128.56 54.19 6.44 34.99 2.12 7.26 1.54 2.96 ... -1.14 -6.15 543.62 0.00 32.49 77.78 -0.69 653.21 -15.12 638.09
4 191.38 200.40 270.38 116.00 14.87 98.85 8.17 3.63 0.73 5.10 ... -2.47 -33.19 1028.48 0.00 61.27 161.86 -3.59 1248.01 -29.44 1218.57
5 193.17 297.63 218.42 138.50 30.59 85.91 10.25 7.26 5.00 8.84 ... -2.58 -47.44 1258.22 0.00 74.90 165.14 -3.98 1494.28 -34.22 1460.06
6 81.36 109.08 94.83 55.88 15.67 26.48 3.65 0.00 1.88 5.19 ... -0.68 -27.51 447.71 0.00 26.40 72.21 0.00 546.32 -12.94 533.38

7 rows × 26 columns

In [9]:
f, ax = plt.subplots()

var = 'net_total'
df.groupby('weekday')[var].mean().plot()
weekday_names = "Mon Tue Wed Thu Fri Sat Sun".split(' ')
ax.set_xticks(range(7))
ax.set_xticklabels(weekday_names, rotation=45)
plt.ylabel('average ' + var + (' ($)'))
Out[9]:
<matplotlib.text.Text at 0xbcb0c88>

Multiple Linear Regression (MLR) with statsmodels

$y = \sum_{n} c_n x_n + \epsilon = c_1 x_1 + c_2 x_2 + c_3 x_3 + \cdots + c_n x_n + \epsilon$

where:

  • y = the dependent (predicted) variable, e.g. 'net_total'.
  • $x_n$ = the independent (predictor) variables, e.g. 'beers sold', 'hours open', or 'day of week'.
  • $c_n$ = the regression coefficients, i.e. the change in the dependent variable per unit change in the corresponding predictor.
  • $\epsilon$ = the residual, i.e. the error in the fit.
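For intuition, the single-predictor case of the model above has a closed-form least-squares solution: the slope is cov(x, y) / var(x), and the intercept makes the line pass through the point of means. A minimal sketch on made-up numbers (values chosen to lie near y = 2x + 1):

```python
# Closed-form simple linear regression (y = c1*x + intercept + error).
# The data below are made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]   # roughly y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope c1 = cov(x, y) / var(x)
c1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
     sum((x - mean_x) ** 2 for x in xs)
# intercept puts the fitted line through (mean_x, mean_y)
intercept = mean_y - c1 * mean_x

print(round(c1, 2), round(intercept, 2))   # 1.97 1.06
```

With more than one predictor, the same idea generalizes to solving the normal equations, which is what statsmodels does for us below.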

In this application, we wish to use the following variables to predict the dependent variable, net profit (net_total):

beer_flagships , beer_seasonal , brew_guest , food , growlers , wine_and_cider , soda , merch , flights , bottle_sales , spirits , crowlers , kegs
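The notebook below fits one predictor at a time, but statsmodels' formula API accepts any number of predictors joined with '+', so the full model can be built as a string. A quick sketch (column names taken from the dataframe above):

```python
# Build the multi-predictor formula string for statsmodels' formula API
# by joining the predictor column names with '+'.
predictors = ['beer_flagships', 'beer_seasonal', 'brew_guest', 'food',
              'growlers', 'wine_and_cider', 'soda', 'merch', 'flights',
              'bottle_sales', 'spirits', 'crowlers', 'kegs']

formula = 'net_total ~ ' + ' + '.join(predictors)
print(formula)
# results = smf.ols(formula, data=df).fit()   # would fit all predictors at once
```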

In [10]:
import statsmodels.formula.api as smf
In [11]:
results = smf.ols('net_total ~ beer_seasonal', data=df).fit()

results.summary()
Out[11]:
OLS Regression Results
Dep. Variable: net_total R-squared: 0.668
Model: OLS Adj. R-squared: 0.667
Method: Least Squares F-statistic: 730.2
Date: Thu, 01 Mar 2018 Prob (F-statistic): 6.44e-89
Time: 12:26:55 Log-Likelihood: -2770.1
No. Observations: 365 AIC: 5544.
Df Residuals: 363 BIC: 5552.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 394.5451 29.153 13.534 0.000 337.215 451.875
beer_seasonal 2.8241 0.105 27.022 0.000 2.619 3.030
Omnibus: 177.280 Durbin-Watson: 1.364
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2065.999
Skew: 1.730 Prob(JB): 0.00
Kurtosis: 14.130 Cond. No. 324.

Visualize each parameter's relationship to net_total

In [12]:
fig, ax = plt.subplots(figsize=(12, 6))
plt.axis('off')

for key, num in zip(df1.keys(), range(1, 16)):
    ax = fig.add_subplot(3, 5, num)
    ax.scatter(df[key], df['net_total'], s=5, c='k', marker='.')
    ax.set_title(key)
    ax.set_ylim(-500,5000)
    ax.set_xlim(-100,1000)
    ax.xaxis.set_major_locator(plt.MaxNLocator(4))
    ax.yaxis.set_major_locator(plt.MaxNLocator(4))
    plt.tight_layout(w_pad=0.2)

plt.show()
In [13]:
for key in df1.keys():
    results = smf.ols('net_total ~ '+ key, data=df).fit()
    print(key, results.rsquared)
beer_flagships 0.227081519369
beer_seasonal 0.667941650209
brew_guest 0.172562855854
food 0.342457812996
growlers 0.0322134612388
wine_and_cider 0.236260517853
soda 0.134356060618
memberships 0.0176718839336
merch 0.0109826887202
flights 0.0207434099073
bottle_sales 0.715102607252
spirits 0.000120601443241
crowlers 0.0212923225072
uncategorized 0.000278440322601
kegs 0.0197746113386
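To see at a glance which predictors explain the most variance, the R² values can be collected in a dict and sorted descending. A small sketch (values copied from the output above, rounded to three places):

```python
# Rank the single-predictor fits by R^2 (values rounded from the output above)
r_squared = {
    'beer_flagships': 0.227, 'beer_seasonal': 0.668, 'brew_guest': 0.173,
    'food': 0.342, 'growlers': 0.032, 'wine_and_cider': 0.236,
    'soda': 0.134, 'memberships': 0.018, 'merch': 0.011, 'flights': 0.021,
    'bottle_sales': 0.715, 'spirits': 0.000, 'crowlers': 0.021,
    'uncategorized': 0.000, 'kegs': 0.020}

ranked = sorted(r_squared.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:3])
```

bottle_sales and beer_seasonal lead by a wide margin; the rest trail off quickly.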

Harvest tweets with Twitter's API

I set up a dev account with Twitter to programmatically query the "Full Archive" (which goes back to ~2006). I used the "Sandbox" environment, which imposes rate and query limits. Fun to play with, at least.

Twitter's API is OAuth protected, so after setting up your dev account, you have to authenticate by passing in your consumer key and secret:

In [14]:
import base64
import requests
In [ ]:
client_key = 'XXXXXXXXXXXXXXXXXXXXXXXXXX'
client_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN= 'XXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN_SECRET= 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')
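The Basic-auth header value is just base64("key:secret"). With dummy credentials (these are not real keys), the encoding step looks like this:

```python
import base64

# Encode a dummy "key:secret" pair the same way as above
dummy = 'key:secret'.encode('ascii')
encoded = base64.b64encode(dummy).decode('ascii')
print(encoded)   # a2V5OnNlY3JldA==
```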
In [ ]:
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
auth_data = {
    'grant_type': 'client_credentials'}
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
access_token = auth_resp.json()['access_token']

search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)}

auth_resp.status_code
In [ ]:
def twitter_search(query, fromDate):
    search_url = '{}1.1/tweets/search/fullarchive/learning.json'.format(base_url)

    search_params = {
        'query': query,
        'fromDate': fromDate,
        'toDate': time.strftime("%Y%m%d0000")}

    search_resp = requests.get(search_url, headers=search_headers, params=search_params)
    print(search_resp.status_code)

    tweets = search_resp.json()['results']

    # Page through the archive: each response carries a 'next' token
    # until the final page, which omits it
    while 'next' in search_resp.json():
        time.sleep(31)   # stay under the sandbox rate limit
        search_params['next'] = search_resp.json()['next']

        search_resp = requests.get(search_url, headers=search_headers, params=search_params)
        print(search_resp.status_code)

        tweets.extend(search_resp.json()['results'])

    return tweets
In [ ]:
tweets = twitter_search('"this exact phrase" OR @thisuser', '201401010000')
In [ ]:
for tweet in tweets:
    print(tweet['created_at'] + ' - ' + tweet['user']['screen_name'] + ': ' + tweet['text'] + '\n' +
          '_________________________________________________________________________________________')