As a High School Cross Country coach, I visit co.milesplit.com quite often. This website is a nationwide database of High School Running (Cross Country and Track & Field) performances. For a week~~end~~long project, I thought I'd try to visualize some runner data.

Part 1 - Webscraping

Part 2 - Exploring the Data

Part 3 - Making Predictions (you are here)

I gave a Tuesday Nerd Talk at a local brewery about this project. Check it out here

Statistics

In [290]:

dfsimple = df.drop(['Athlete/School', 'Meet'], axis = 1)
dfsimple['Elevation'] = dfsimple['Elevation'].astype(float)
dfsimple.head()

Out[290]:

	Rank	Time	Grade	Year	State	Gender	Elevation
0	1	959.9	2002	2001	AL	Boys	500.0
1	2	1077.4	2002	2001	AL	Boys	500.0
2	1	1032.3	2005	2002	AL	Boys	500.0
3	2	1103.2	2005	2002	AL	Boys	500.0
4	3	1109.7	2006	2002	AL	Boys	500.0

In [291]:

dfsimple.dtypes

Out[291]:

Rank           int64
Time         float64
Grade          int64
Year           int64
State         object
Gender        object
Elevation    float64
dtype: object

In [292]:

import statsmodels.formula.api as smf

# Fit an ordinary least squares model of Time (seconds) on Elevation (feet)
results = smf.ols(formula='Time ~ Elevation', data=dfsimple).fit()

In [293]:

results.summary()

Out[293]:

OLS Regression Results
Dep. Variable:	Time	R-squared:	0.001
Model:	OLS	Adj. R-squared:	0.001
Method:	Least Squares	F-statistic:	851.7
Date:	Mon, 26 Feb 2018	Prob (F-statistic):	3.79e-187
Time:	13:35:45	Log-Likelihood:	-6.3315e+06
No. Observations:	948591	AIC:	1.266e+07
Df Residuals:	948589	BIC:	1.266e+07
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	1213.5399	0.270	4489.630	0.000	1213.010	1214.070
Elevation	0.0033	0.000	29.184	0.000	0.003	0.003

Omnibus:	262278.890	Durbin-Watson:	0.029
Prob(Omnibus):	0.000	Jarque-Bera (JB):	853107.316
Skew:	1.406	Prob(JB):	0.00
Kurtosis:	6.699	Cond. No.	3.30e+03

In [301]:

results.params

Out[301]:

Intercept    1213.539906
Elevation       0.003280
dtype: float64

So for every 1000 ft of elevation, 5K time increases by about 3.3 seconds. Here in Boulder, at roughly 5000 ft, that works out to a time penalty of about 3.3 * 5 = 16.5 seconds.

The import statement loads the statsmodels formula API, which lets us run a simple linear regression.
The smf.ols line defines the model and the variables we wish to fit, then fits it. In this case, we're looking at the impact that Elevation has on Time.
Finally, results.params prints the fitted parameters.
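For a single predictor, the intercept and slope that an ordinary least squares fit reports have a simple closed form: the slope is the covariance of the two variables divided by the variance of the predictor. The sketch below computes both in plain Python. The function name ols_fit and the sample values are made up for illustration and are not taken from the scraped data.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up example (NOT real data): elevations in feet, 5K times in seconds
elevations = [0.0, 1000.0, 2000.0, 5000.0]
times = [1200.0, 1203.0, 1206.0, 1215.0]

intercept, slope = ols_fit(elevations, times)
print(intercept, slope)
```

On real data the points will not sit exactly on a line, and statsmodels additionally reports standard errors and the other diagnostics shown in the summary table.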

We interpret the slope as follows: "for every foot of Elevation gain, Time increases by 0.003280 seconds". In other words, for every 1000 feet of elevation, a runner's 5K time gets about 3.3 seconds slower. For a performance at 5000 feet, which is commonly thought of as an "altitude performance", that comes to 5 * 3.3 seconds = 16.5 seconds. Keep in mind that the R-squared of 0.001 means elevation explains very little of the overall spread in times, even though the slope itself is clearly nonzero.
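The arithmetic is easy to reuse for any elevation. Here is a minimal sketch that plugs the coefficients from results.params into the fitted line. The helper names predicted_time and altitude_penalty are hypothetical, and the numbers only describe this particular model's estimate.

```python
# Coefficients copied from the results.params output above
INTERCEPT = 1213.539906  # model's predicted time (seconds) at 0 ft
SLOPE = 0.003280         # seconds added per foot of elevation


def predicted_time(elevation_ft):
    """Time in seconds that the fitted line predicts at a given elevation."""
    return INTERCEPT + SLOPE * elevation_ft


def altitude_penalty(elevation_ft):
    """Seconds added relative to the model's sea-level prediction."""
    return SLOPE * elevation_ft


for elevation_ft in (1000, 5000):
    print(elevation_ft, round(altitude_penalty(elevation_ft), 1),
          round(predicted_time(elevation_ft), 1))
```

Using the unrounded slope, the penalty at 5000 ft comes out to 16.4 seconds. The 16.5 figure in the text comes from rounding the slope to 3.3 seconds per 1000 ft first.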

Cool! Our estimate lands in the same ballpark as the "typical" 20-30 second adjustment used in the sport, if a little lower. That rule is applied across the board, whereas our number was derived from the top 1000 performances in each state each year. It makes sense that our number is a bit lower if fitter athletes are less susceptible to the effects of altitude.

Repeating this linear regression for the other race distances and scaling each slope to 5000 ft of elevation gives us the following:

In [294]:

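import matplotlib.pyplot as plt

# Race distances in meters
x = [800, 1600, 3200, 5000]
# Elevation slope from each regression (seconds per 1000 ft), scaled to 5000 ft
y = [0.381 * 5, 1.095 * 5, 0.398 * 5, 3.280 * 5]

plt.scatter(x, y)
plt.title('Altitude allowance per 5000 ft elevation gain')
plt.xlabel('Distance (meters)')
plt.ylabel('Time added (seconds)')

plt.show()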