milesplit, part 3 - Making Predictions
Posted on Thu 30 November 2017 in Projects
As a High School Cross Country coach, I visit co.milesplit.com quite often. This website is a nationwide database of High School Running (Cross Country and Track & Field) performances. For a weekend-long project, I thought I'd try to visualize some runner data.
- Part 1 - Webscraping
- Part 2 - Exploring the Data
- Part 3 - Making Predictions (you are here)
I gave a Tuesday Nerd Talk at a local brewery about this project. Check it out here
Statistics
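# Drop the text columns we don't need for the regression
# (df is the DataFrame built earlier in this series)
dfsimple = df.drop(['Athlete/School', 'Meet'], axis=1)
# Make sure Elevation is numeric so it can be used in the fit
dfsimple['Elevation'] = dfsimple['Elevation'].astype(float)
dfsimple.head()
dfsimple.dtypes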
import statsmodels.formula.api as smf
# create a fitted model with Time and Elevation
results = smf.ols(formula='Time ~ Elevation', data=dfsimple).fit()
results.summary()
results.params
So, as Elevation increases 1000 ft, time increases 3.3 sec. That means that here in Boulder at ~5000 ft, the time penalty is 3.3 * 5 = 16.5 s.
- In the first line, our import statement imports the statsmodels package, which allows us to do a simple linear regression.
- The second line defines the model we will be using, and the variables we wish to fit. In this case, we're looking at the impact that Elevation has on Time.
- The third and final line simply prints our parameters.
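To make the fitting step less of a black box, here is a minimal sketch of what an ordinary least squares fit with a single predictor boils down to. It does not use statsmodels; the helper `fit_line` and the toy numbers are hypothetical and are not the milesplit data.

```python
# Closed-form least squares for one predictor: y = intercept + slope * x.
# The data below is made up for illustration; it is NOT the milesplit data.

def fit_line(xs, ys):
    """Return (intercept, slope) of the line minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical elevations (ft) and 5K times (s) that lie exactly on a line
elevations = [0.0, 1000.0, 2000.0, 5000.0]
times = [1000.0 + 0.003 * e for e in elevations]

intercept, slope = fit_line(elevations, times)
print(f"intercept = {intercept:.1f} s, slope = {slope:.4f} s/ft")
```

The slope returned here plays the same role as the Elevation entry in results.params, and the intercept corresponds to the predicted Time at zero elevation.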
We interpret the Elevation coefficient as follows: "for every unit of Elevation gain (in feet), Time increases by 0.003280 units (in seconds)". Evidently, our data is telling us that for every 1000 feet of elevation gain, a runner's 5K time increases (so, they go slower) by about 3.3 seconds. For a performance at 5000 feet, which is commonly thought of as an "altitude performance", that number is 5 * 3.3 seconds = 16.5 seconds.
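The arithmetic above can be sketched as a small helper. The slope is the 0.003280 s/ft value quoted in the text; the function name `altitude_penalty` is made up for this example, and the sketch assumes the linear relationship holds over the whole elevation range.

```python
# Slope quoted in the post: seconds of 5K time added per foot of elevation
SLOPE_SEC_PER_FT = 0.003280

def altitude_penalty(elevation_ft, slope_sec_per_ft=SLOPE_SEC_PER_FT):
    """Predicted extra seconds at a given elevation, per the linear fit."""
    return slope_sec_per_ft * elevation_ft

# Print the predicted penalty for 1000 ft and for a 5000 ft "altitude" race
for elevation_ft in (1000, 5000):
    print(f"{elevation_ft} ft: +{altitude_penalty(elevation_ft):.1f} s")
```

Using the unrounded coefficient gives about 16.4 s at 5000 ft; the 16.5 s figure comes from rounding the slope to 3.3 s per 1000 ft before multiplying.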
Cool! Our result is in the same ballpark as the "typical" number used in the sport. The 20-30 sec rule applies across the board, whereas our number was derived from the top 1000 performances in each state each year. Therefore, it makes sense that our number is a little bit lower, since fitter athletes are less susceptible to the effects of altitude.
Repeating this linear regression analysis for other distance races gives us the following:
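import matplotlib.pyplot as plt

# Race distances in meters
x = [800, 1600, 3200, 5000]
# Fitted slope for each distance (seconds per 1000 ft), scaled to 5000 ft
y = [0.381*5, 1.095*5, 0.398*5, 3.280*5]
plt.scatter(x, y)
plt.title('Altitude allowance per 5000 ft elevation gain')
plt.xlabel('Distance (meters)')
plt.ylabel('Time added (seconds)')
plt.show()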