pandas, MLR, and RESTful APIs, oh my
Posted on Wed 14 February 2018 in Projects
Munge, Plot, Verify Data¶
We start by importing the data, which were downloaded as several .csv files from the business's financial management system, into pandas dataframes:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import time
# category-level sales: the export has dates as columns, so transpose to one row per day
cat = pd.read_csv('brewery data/category sales 2017.csv', parse_dates=True, index_col=0)
df1 = cat.transpose()
df1.columns = [c.lower().replace(' ', '_') for c in df1.columns]
df1.head(3)
# daily summary totals get the same treatment
daily = pd.read_csv('brewery data/daily sales summary 2017.csv', parse_dates=True, index_col=0)
df2 = daily.transpose()
df2.columns = [c.lower().replace(' ', '_') for c in df2.columns]
df2.head(3)
Some munging gets us a final dataframe: columns are the parameters of interest, and rows are the days of the 2017 calendar year (so we have 365 rows).
df = pd.concat([df1, df2], axis = 1)
df = df.reset_index()
df = df.rename(columns={'index':'date'})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df['weekday'] = df['date'].apply(lambda x: x.weekday()) # Monday=0, Sunday=6
df.head(3)
len(df)
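A couple of quick checks verify the shape of the data: one unique row per calendar day of 2017. This is a minimal sketch, assuming the business recorded every day of the year:
# sanity checks: one unique row per day of 2017 (assumes no missing days)
assert len(df) == 365
assert df['date'].is_unique
assert df['date'].min() == pd.Timestamp('2017-01-01')
assert df['date'].max() == pd.Timestamp('2017-12-31')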
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), legend=None)
plt.ylim([0, 4000])
plt.ylabel('net_total ($)')
df.plot(kind='line', x='date', y='net_total', figsize=(10, 3), label = 'net total ($)')
plt.ylim([0,10000])
plt.xlim([dt.date(2017, 2, 1), dt.date(2017, 3, 1)])
plt.ylabel('net_total ($)')
palisade_reunion = dt.date(2017, 2, 18)
devin_tnt = dt.date(2017, 2, 21)
the_foggy_tops = dt.date(2017, 2, 24)
the_aristocats = dt.date(2017, 2, 25)
plt.axvline(palisade_reunion, c = 'r', ls = '--', label = 'palisade reunion bottle release')
plt.axvline(devin_tnt, c = 'k', ls = '-', label = "devin's TNT")
plt.axvline(the_foggy_tops, c = 'c', ls = '-.', label = 'live music: the foggy tops')
plt.axvline(the_aristocats, c = 'y', ls = '-', label = 'live music: the aristocats')
plt.legend(loc='best')
daily_avg = df.groupby('weekday').mean(numeric_only=True).round(2)  # numeric_only skips the date column
daily_avg # Monday=0, Sunday=6
f, ax = plt.subplots()
var = 'net_total'
df.groupby('weekday')[var].mean().plot(ax=ax)
weekday_names = "Mon Tue Wed Thu Fri Sat Sun".split(' ')
ax.set_xticks(range(7))  # pin one tick per weekday before relabeling
ax.set_xticklabels(weekday_names, rotation=45)
plt.ylabel('average ' + var + ' ($)')
Multiple Linear Regression (MLR) with statsmodels¶
$y = \sum_{n} c_n x_n + \epsilon = c_1 x_1 + c_2 x_2 + c_3 x_3 + \dots + c_n x_n + \epsilon$
where:
- $y$ = the dependent (predicted) variable, e.g. net_total.
- $x_n$ = the independent (predictor) variables, e.g. beers sold, hours open, or day of week.
- $c_n$ = the regression coefficients, i.e. the expected change in the dependent variable per unit change in the corresponding predictor, holding the others fixed.
- $\epsilon$ = the residual, i.e. the error in the fit.
In this application, we wish to use the following variables to predict the dependent variable, net_total: beer_flagships, beer_seasonal, brew_guest, food, growlers, wine_and_cider, soda, merch, flights, bottle_sales, spirits, crowlers, and kegs.
import statsmodels.formula.api as smf
# start with a single predictor to see the workflow
results = smf.ols('net_total ~ beer_seasonal', data=df).fit()
results.summary()
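That fits only one predictor. Here is a minimal sketch of the full multiple regression, assuming every category listed above survives the column renaming as-is; the fitted full_results.params are the $c_n$ in the equation above:
predictors = ['beer_flagships', 'beer_seasonal', 'brew_guest', 'food',
              'growlers', 'wine_and_cider', 'soda', 'merch', 'flights',
              'bottle_sales', 'spirits', 'crowlers', 'kegs']
# build the formula 'net_total ~ x1 + x2 + ...' and fit all terms at once
full_formula = 'net_total ~ ' + ' + '.join(predictors)
full_results = smf.ols(full_formula, data=df).fit()
full_results.summary()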
Visualize each parameter's relationship to net_total¶
fig, ax = plt.subplots(figsize=(12, 6))
plt.axis('off')  # hide the parent axes; small multiples are added below
# scatter each category against net_total in a 3x5 grid
for key, num in zip(df1.keys(), range(1, 16)):
    ax = fig.add_subplot(3, 5, num)
    ax.scatter(df[key], df['net_total'], s=5, c='k', marker='.')
    ax.set_title(key)
    ax.set_ylim(-500, 5000)
    ax.set_xlim(-100, 1000)
    ax.xaxis.set_major_locator(plt.MaxNLocator(4))
    ax.yaxis.set_major_locator(plt.MaxNLocator(4))
plt.tight_layout(w_pad=0.2)
plt.show()
for key in df1.keys():
    results = smf.ols('net_total ~ ' + key, data=df).fit()
    print(key, results.rsquared)
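The loop prints the fits in column order; a small sketch that sorts them by R² instead makes the strongest single predictors easier to spot:
# collect each single-predictor R-squared and rank, strongest first
r2 = {key: smf.ols('net_total ~ ' + key, data=df).fit().rsquared
      for key in df1.keys()}
for key, val in sorted(r2.items(), key=lambda kv: kv[1], reverse=True):
    print('{:<16} {:.3f}'.format(key, val))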
Harvest tweets with Twitter's API¶
I set up a dev account with Twitter to programmatically query the "Full Archive" (which goes back to ~2006). I used a "Sandbox" environment, which has rate and query limits. Fun to play with, at least.
Twitter's API is OAuth protected, so after setting up your dev account, you have to authenticate by passing in your consumer key and secret. Details in the links below:
import base64
import requests
client_key = 'XXXXXXXXXXXXXXXXXXXXXXXXXX'
client_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN= 'XXXXXXXXX-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
# OAUTH_TOKEN_SECRET= 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
key_secret = '{}:{}'.format(client_key, client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')
base_url = 'https://api.twitter.com/'
auth_url = '{}oauth2/token'.format(base_url)
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'}
auth_data = {
    'grant_type': 'client_credentials'}
# exchange the base64-encoded key/secret for a bearer token
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)
access_token = auth_resp.json()['access_token']
search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)}
auth_resp.status_code  # 200 means the bearer-token request succeeded
def twitter_search(query, fromDate):
    search_url = '{}1.1/tweets/search/fullarchive/learning.json'.format(base_url)
    search_params = {
        'query': query,
        'fromDate': fromDate,
        'toDate': time.strftime("%Y%m%d0000")}
    search_resp = requests.get(search_url, headers=search_headers, params=search_params)
    print(search_resp.status_code)
    tweets = search_resp.json()['results']
    # page through the archive while the response carries a 'next' token
    while search_resp.json().get('next'):
        time.sleep(31)  # stay under the sandbox rate limit
        search_params = {
            'query': query,
            'fromDate': fromDate,
            'toDate': time.strftime("%Y%m%d0000"),
            'next': search_resp.json()['next']}
        search_resp = requests.get(search_url, headers=search_headers, params=search_params)
        print(search_resp.status_code)
        print(search_resp.json().get('next'))
        tweets.extend(search_resp.json()['results'])
    return tweets
tweets = twitter_search('"this exact phrase" OR @thisuser', '201401010000')
for tweet in tweets:
    print(tweet['created_at'] + ' - ' + tweet['user']['screen_name'] + ': ' + tweet['text'] + '\n' +
          '_________________________________________________________________________________________')