milesplit, part 3 - Making Predictions

Posted on Thu 30 November 2017 in Projects

As a High School Cross Country coach, I visit co.milesplit.com quite often. This website is a nationwide database of High School Running (Cross Country and Track & Field) performances. For a weekend-long project, I thought I'd try to visualize some runner data.

I gave a Tuesday Nerd Talk at a local brewery about this project. Check it out here

Statistics

In [290]:
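# Drop the text columns that aren't needed for the regression
dfsimple = df.drop(['Athlete/School', 'Meet'], axis=1)
# Make sure Elevation is numeric before fitting
dfsimple['Elevation'] = dfsimple['Elevation'].astype(float)
dfsimple.head()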
Out[290]:
   Rank    Time  Grade  Year State Gender  Elevation
0     1   959.9   2002  2001    AL   Boys      500.0
1     2  1077.4   2002  2001    AL   Boys      500.0
2     1  1032.3   2005  2002    AL   Boys      500.0
3     2  1103.2   2005  2002    AL   Boys      500.0
4     3  1109.7   2006  2002    AL   Boys      500.0
In [291]:
dfsimple.dtypes
Out[291]:
Rank           int64
Time         float64
Grade          int64
Year           int64
State         object
Gender        object
Elevation    float64
dtype: object
In [292]:
import statsmodels.formula.api as smf

# create a fitted model with Time and Elevation
results = smf.ols(formula='Time ~ Elevation', data=dfsimple).fit()
In [293]:
results.summary()
Out[293]:
OLS Regression Results
Dep. Variable:                   Time   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     851.7
Date:                Mon, 26 Feb 2018   Prob (F-statistic):          3.79e-187
Time:                        13:35:45   Log-Likelihood:            -6.3315e+06
No. Observations:              948591   AIC:                         1.266e+07
Df Residuals:                  948589   BIC:                         1.266e+07
Df Model:                           1
Covariance Type:            nonrobust

                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept   1213.5399      0.270   4489.630      0.000    1213.010    1214.070
Elevation      0.0033      0.000     29.184      0.000       0.003       0.003

Omnibus:                   262278.890   Durbin-Watson:                   0.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           853107.316
Skew:                           1.406   Prob(JB):                         0.00
Kurtosis:                       6.699   Cond. No.                     3.30e+03
In [301]:
results.params
Out[301]:
Intercept    1213.539906
Elevation       0.003280
dtype: float64

So, for every 1000 ft increase in Elevation, Time increases by about 3.3 sec. That means that here in Boulder at ~5000 ft, the time penalty is about 3.3 * 5 = 16.5 s.

  • In the code above, the first line imports the statsmodels formula API, which allows us to do a simple linear regression.
  • The second line defines the model we will be using and the variables we wish to fit, then fits it. In this case, we're looking at the impact that Elevation has on Time.
  • The third and final line simply prints the fitted parameters.
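To make the fitting step less of a black box, here is a minimal sketch of what a one-variable least-squares fit computes. It uses the closed-form formulas for a simple linear regression rather than statsmodels, the helper name fit_line is hypothetical, and the data is synthetic (made up to lie exactly on a line), not milesplit data.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example: times that rise exactly 0.003 s per foot of elevation
elevation_ft = [0.0, 1000.0, 2000.0, 5000.0]
time_s = [1200.0, 1203.0, 1206.0, 1215.0]

intercept, slope = fit_line(elevation_ft, time_s)
print(f"Intercept: {intercept:.1f} s, Elevation: {slope:.4f} s/ft")
```

The two returned values play the same role as the Intercept and Elevation entries in results.params, although statsmodels also reports standard errors and the other diagnostics shown in the summary.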

We interpret this as follows: "for every unit of Elevation gain (in feet), Time increases by 0.003280 units (in seconds)". In other words, for every 1000 feet of elevation gain, a runner's 5K time increases (so, they go slower) by about 3.3 seconds. For a performance at 5000 feet, which is commonly thought of as an "altitude performance", that number is 5 * 3.3 seconds = 16.5 seconds.
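The arithmetic above can be written out as a small sketch. The function name altitude_penalty is hypothetical; the coefficient is the Elevation value from results.params. Using the unrounded coefficient gives 16.4 s at 5000 ft, and the 16.5 s figure comes from rounding to 3.3 s per 1000 ft first.

```python
def altitude_penalty(coef_s_per_ft, elevation_ft):
    """Predicted extra time (seconds) at a given elevation (feet)."""
    return coef_s_per_ft * elevation_ft


# Elevation coefficient from the fitted model, in seconds per foot
elevation_coef = 0.003280

per_1000_ft = altitude_penalty(elevation_coef, 1000)
at_5000_ft = altitude_penalty(elevation_coef, 5000)

print(f"Penalty per 1000 ft: {per_1000_ft:.2f} s")
print(f"Penalty at 5000 ft:  {at_5000_ft:.1f} s")
```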

Cool! Our number is close to the "typical" number used in the sport, the 20-30 sec rule. That rule applies across the board, whereas our number was derived from the top 1000 performances in each state each year. It makes sense, then, that our number is a little lower, since athletes who are more athletically fit are less susceptible to the effects of altitude.

Repeating this linear regression analysis for other distance races gives us the following:

In [294]:
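import matplotlib.pyplot as plt

# Race distances (meters) and the Elevation coefficient from each regression,
# in seconds per 1000 ft, multiplied by 5 to give the penalty at 5000 ft
x = [800, 1600, 3200, 5000]
y = [0.381*5, 1.095*5, 0.398*5, 3.280*5]

plt.scatter(x, y)
plt.title('Altitude allowance per 5000 ft elevation gain')
plt.xlabel('Distance (meters)')
plt.ylabel('Time added (seconds)')

plt.show()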