As a High School Cross Country coach, I visit co.milesplit.com quite often. This website is a nationwide database of High School Running (Cross Country and Track & Field) performances. For a week~~end~~long project, I thought I'd try to visualize some runner data.

Part 1 - Webscraping

Part 2 - Exploring the Data (you are here)

Part 3 - Making Predictions

I gave a Tuesday Nerd Talk at a local brewery about this project. Check it out here

Exploring the Data

Let's start by reading in the data we scraped during Part 1:

In [13]:

import pandas as pd
dfL = pd.read_csv('milesplit data/5K.csv')

For reasons that will become clear later on, I'm also going to read in per-state elevation data and merge the two DataFrames:

In [14]:

dfR = pd.read_csv('milesplit data/elevation.csv', names = ['State', 'Elevation'])
df = pd.merge(dfL, dfR, on='State')
df.head()

Out[14]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation
0	1	959.9	Tyler Stanfield Homewood HS	2002	Foot Locker Nationals 2001 26th Dec 8, 2001	2001	AL	Boys	500.0
1	2	1077.4	Scott Fuqua Oak Mountain HS	2002	Foot Locker Nationals 2001 32nd Dec 8, 2001	2001	AL	Boys	500.0
2	1	1032.3	Robert Bedsole Hoover	2005	USATF National Junior Olympic XC Championships...	2002	AL	Boys	500.0
3	2	1103.2	Joshua Pawlik Homewood HS	2005	USATF National Junior Olympic XC Championships...	2002	AL	Boys	500.0
4	3	1109.7	Jeremy Moujoodi Hoover	2006	USATF National Junior Olympic XC Championships...	2002	AL	Boys	500.0
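
One thing worth checking at this step: `pd.merge` defaults to an inner join, so any state missing from the elevation file would silently lose its rows. Here is a minimal sketch of a more defensive merge. The two small frames are made up for illustration (including the fake state 'ZZ') and are not the real CSVs:

```python
import pandas as pd

# Toy stand-ins for the scraped results and the elevation table (hypothetical values).
results = pd.DataFrame({
    'State': ['AL', 'AL', 'CO', 'ZZ'],
    'Time': [959.9, 1077.4, 943.5, 1000.0],
})
elevation = pd.DataFrame({'State': ['AL', 'CO'], 'Elevation': [500.0, 6800.0]})

# how='left' keeps every result row; validate= raises if a state appears twice
# in the elevation table; indicator= adds a '_merge' column describing each match.
merged = pd.merge(results, elevation, on='State', how='left',
                  validate='many_to_one', indicator=True)

# Rows whose state had no elevation entry.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['State', 'Time']])
```

If `unmatched` comes back empty on the real data, the inner join used in the cell above loses nothing.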

Now it's time to really dive in. Let's do some grouping:

In [16]:

boys = df.groupby('Gender').get_group('Boys')
girls = df.groupby('Gender').get_group('Girls')
print(str(len(boys)) + ' boy records since 2000')
print(str(len(girls)) + ' girl records since 2000')

526350 boy records since 2000
422241 girl records since 2000

In [18]:

df.dtypes

Out[18]:

Rank                int64
Time              float64
Athlete/School     object
Grade               int64
Meet               object
Year                int64
State              object
Gender             object
Elevation         float64
dtype: object

The 'object' dtype here is pandas' catch-all for arbitrary Python objects; in this dataset, those columns hold strings. We'll see more on that a bit later.
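
A small illustration of what an object column can hold, using made-up values rather than the scraped data:

```python
import pandas as pd

# A column mixing a string and an integer falls back to the object dtype.
mixed = pd.Series(['Tyler Stanfield Homewood HS', 2002])
print(mixed.dtype)
print(mixed.map(type).tolist())   # the actual Python type of each element

# The .str accessor works element-wise on the string values
# and yields a missing value for anything that is not a string.
upper = mixed.str.upper()
print(upper.tolist())
```

This is why the `.str` methods used later in the notebook work on the 'Athlete/School' column.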

Let's plot some stuff

In [19]:

athlete_school = df['Athlete/School'].str.split(expand = True)
athlete_school.columns = ['First_Name' , 'Next_Name' , 'C' , 'D' , 'E' , 'F' , 'G' , 'H', 'I', 'J' , 'K']
athlete_school.head()

Out[19]:

	First_Name	Next_Name	C	D	E	F	G	H	I	J	K
0	Tyler	Stanfield	Homewood	HS	None	None	None	None	None	None	None
1	Scott	Fuqua	Oak	Mountain	HS	None	None	None	None	None	None
2	Robert	Bedsole	Hoover	None	None	None	None	None	None	None	None
3	Joshua	Pawlik	Homewood	HS	None	None	None	None	None	None	None
4	Jeremy	Moujoodi	Hoover	None	None	None	None	None	None	None	None
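
Splitting on every space needs as many column names as the longest entry has words (eleven above). If only the first two tokens matter, a sketch like the following keeps the rest of the string together. The column name `School` is my own label, and multi-word surnames would still spill into it:

```python
import pandas as pd

names = pd.Series([
    'Tyler Stanfield Homewood HS',
    'Scott Fuqua Oak Mountain HS',
    'Robert Bedsole Hoover',
])

# n=2 stops after two splits, so the result has at most three columns:
# first token, second token, and the untouched remainder.
parts = names.str.split(n=2, expand=True)
parts.columns = ['First_Name', 'Next_Name', 'School']
print(parts)
```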

In [22]:

athlete_school[['First_Name','Next_Name']].head()

Out[22]:

	First_Name	Next_Name
0	Tyler	Stanfield
1	Scott	Fuqua
2	Robert	Bedsole
3	Joshua	Pawlik
4	Jeremy	Moujoodi

In [23]:

athlete_school.First_Name.describe()

Out[23]:

count     948591
unique     34217
top         Alex
freq        9531
Name: First_Name, dtype: object

In [214]:

import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)
athlete_school.First_Name.value_counts().head(25).plot.barh(figsize=(10,5), color='b')
ax1 = plt.gca()
ax1.invert_yaxis()
plt.title('First Names')

plt.subplot(1, 2, 2)
athlete_school.Next_Name.value_counts().head(25).plot.barh(figsize=(10,5), color='b')
ax2 = plt.gca()
ax2.invert_yaxis()
plt.title('Next Names')

plt.tight_layout()
plt.show()

In [211]:

plt.figure(figsize=(9,4))
athlete_school.First_Name.value_counts().head(500).plot.barh()
ax = plt.gca()
ax.invert_yaxis()
plt.yticks([])
plt.xlabel('Frequency')
plt.ylabel('First Names')
plt.title('Occurrence of First Names')
plt.show()

In [38]:

devins = sorted(athlete_school.loc[athlete_school['First_Name'] == 'Devin','Next_Name'].unique())
len(devins)

Out[38]:

In [43]:

## Fastest boys/girls times run since 2000
boys.sort_values('Time').head()
# girls.sort_values('Time').head()

Out[43]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation
418739	1	850.4	Dathan Ritzenhein Rockford	2001	MHSAA State Championships LP 1st Nov 4, 2000	2000	MI	Boys	900.0
570173	1	858.7	Edward Cheserek St. Benedict's Prep	2013	Essex County Championships 1st Oct 26, 2012	2012	NJ	Boys	250.0
569173	1	860.0	Edward Cheserek St. Benedict's Prep	2013	Essex County Championships 1st Oct 28, 2011	2011	NJ	Boys	250.0
867606	1	860.8	Andrew Hunter Loudoun Valley	2016	Third Battle Invitational 1st Oct 17, 2015	2015	VA	Boys	950.0
78805	1	864.0	German Fernandez Riverbank High School (SJ)	2008	CIF State Cross Country Championships 1st Nov...	2007	CA	Boys	2900.0

In [44]:

## lookup a particular record (in this case, the one with the quickest time)
df.loc[boys.Time.idxmin()]  # idxmin returns an index label, so look it up with .loc

Out[44]:

Rank                                                          1
Time                                                      850.4
Athlete/School                      Dathan Ritzenhein  Rockford
Grade                                                      2001
Meet              MHSAA State Championships LP  1st Nov 4, 2000
Year                                                       2000
State                                                        MI
Gender                                                     Boys
Elevation                                                   900
Name: 418739, dtype: object
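
A note on this lookup: `idxmin` returns an index label rather than a row position, so it pairs naturally with `.loc`. In this notebook labels and positions happen to coincide because the merged frame has a default integer index. A toy sketch where they differ:

```python
import pandas as pd

# Toy frame whose index labels deliberately differ from row positions.
times = pd.DataFrame(
    {'Athlete': ['A', 'B', 'C'], 'Time': [900.0, 850.4, 870.0]},
    index=[10, 20, 30],
)

label = times.Time.idxmin()     # index label of the fastest row
fastest = times.loc[label]      # label-based lookup
print('label {} -> athlete {}'.format(label, fastest.Athlete))

# times.iloc[label] would raise IndexError here: there is no row at position 20.
```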

In [104]:

import requests
from io import BytesIO
from PIL import Image

response1 = requests.get('http://archive.dyestat.com/rivals/pics/2000100701880790.jpg')
response2 = requests.get('http://archive.dyestat.com/image/4tr/April/30StanfordCardinalInv/040430DathanRitzenheinStanfordInvMGallagher.jpg')
response3 = requests.get('http://1.bp.blogspot.com/-fJEQUnFpGCk/UBnndSVzPBI/AAAAAAAAE1w/lJbObv9k0hQ/s1600/Dathan+Ritzenhein-1.jpg')

ritzhs = Image.open(BytesIO(response1.content))
ritzcu = Image.open(BytesIO(response2.content))
ritzpro = Image.open(BytesIO(response3.content))

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10,5))
ax1.imshow(ritzhs)
ax1.set_title('HS 2000 14:10')
ax1.axis('off')
ax2.imshow(ritzcu)
ax2.set_title('Colorado 2002 13:27')
ax2.axis('off')
ax3.imshow(ritzpro)
ax3.set_title('American Record 2009 12:56')
ax3.axis('off')

Out[104]:

(-0.5, 299.5, 417.5, -0.5)

In [105]:

## Calculate nationwide averages
import datetime

avg_5k = datetime.timedelta(seconds = df.Time.mean())
boys_avg_5k = datetime.timedelta(seconds = boys.Time.mean())
girls_avg_5k = datetime.timedelta(seconds = girls.Time.mean())

# str(timedelta) looks like '0:20:18.123456'; characters 2 to 6 are the mm:ss part
print('Average 5K:   ' + str(avg_5k)[2:7])
print('Boys Average 5K:   ' + str(boys_avg_5k)[2:7])
print('Girls Average 5K:   ' + str(girls_avg_5k)[2:7])

Average 5K:   20:18
Boys Average 5K:   18:34
Girls Average 5K:   22:29

In [108]:

# Were YOU a fast runner in HS??
df[df['Athlete/School'].str.contains('Devin Rourke')] # Nope

Out[108]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation

In [110]:

df[df['Athlete/School'].str.contains('Elise Cranny')]  # YUP.

Out[110]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation
122922	32	1135.0	Elise Cranny Niwot High School	2014	Colorado 4A Region 2 Cross Country 1st Oct 21...	2010	CO	Girls	6800.0
123907	17	1114.0	Elise Cranny Niwot High School	2014	St. Vrain Cross Country Invitational 1st Sep ...	2011	CO	Girls	6800.0
124901	10	1073.0	Elise Cranny Niwot High School	2014	Andy Myers Invitational 1st Oct 5, 2012	2012	CO	Girls	6800.0
125892	1	1005.0	Elise Cranny Niwot High School	2014	Northern Conference Championship 1st Oct 5, 2013	2013	CO	Girls	6800.0

Plotting

In [114]:

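## Function to label y-axis of plots with (mm:ss) instead of (sec)
import datetime as dt
from matplotlib import ticker

def timeTicks(x, pos):
    d = dt.timedelta(seconds = x)
    return str(d)[2:7]  # '0:14:10.400000' -> '14:10'; assumes x is under one hour

formatter = ticker.FuncFormatter(timeTicks)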
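
Slicing `str(timedelta)` works for times under an hour, but any tick at or above one hour would lose its hour digit (3600 seconds would render as 00:00). A minimal alternative using `divmod`; the name `mmss` is made up for this sketch:

```python
def mmss(x, pos=None):
    """Format a non-negative number of seconds as mm:ss (minutes may exceed 59)."""
    if x < 0:
        return ''   # ticks can fall outside the data range; show nothing for them
    minutes, seconds = divmod(int(x), 60)   # int() truncates fractional seconds
    return '{:02d}:{:02d}'.format(minutes, seconds)

print(mmss(850.4))   # 14:10
print(mmss(3600))    # 60:00
```

It takes the same `(x, pos)` arguments as `timeTicks`, so it could be wrapped in `ticker.FuncFormatter` the same way.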

In [215]:

f, (ax1, ax2) = plt.subplots(1, 2, sharey = True, figsize=(10, 5))
ax1.scatter(boys.Year, boys.Time, s = 2, marker = ".", color = 'b')
ax1.set_title('Boys')
ax1.set_xlim([1999, 2017])
ax1.set_ylim([10*60, 60*60])
ax1.set_ylabel('Time')

ax2.scatter(girls.Year, girls.Time, s = 2, marker = ".", color = 'r')
ax2.set_title('Girls')
ax2.set_xlim([1999, 2017])

ax1.yaxis.set_major_formatter(formatter)

plt.tight_layout()
plt.show()

In [216]:

import seaborn as sns  # seaborn.apionly was deprecated and later removed

fig = plt.figure(figsize=(10, 5))
ax = sns.violinplot(x=df.Year, y=df.Time, hue = df.Gender, cut = 0,
                    scale = "count", scale_hue = False, split = True,
                    palette = {"Boys": "blue", "Girls": "red"},
                    linewidth = 0, saturation = 0.9)
ax.set_ylim([15*60, 30*60])
ax.yaxis.set_major_formatter(formatter)
plt.title('Nationwide 5K times')
plt.xlabel('Year')
plt.ylabel('Time')

plt.show()

Completeness of the database

In [308]:

int_years = list(range(2000, 2017))  # seasons 2000 through 2016; a list can be reused on every loop pass
ind = 0

for key in sorted(df.Time.groupby(df.State).groups.keys()):    
    ind = ind + 1
    ax = plt.subplot(6, 10, ind)
    plt.axis([2000, 2016, 0, 1000])
    plt.xticks([])
    frame1 = plt.gca()
    frame1.axes.get_xaxis().set_visible(False)
    df.Time.groupby([df.State, df.Year]).count().get(key).reindex(index=int_years).plot.bar(
        title = key, figsize = (15, 8), sharey = True, sharex = False, color = 'b') 
    plt.tight_layout()
    plt.xlabel('')
plt.show()
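
The same completeness check can also be read as a table instead of a grid of bar charts. A sketch on toy records (hypothetical values; the real frame has one row per performance):

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['AL', 'AL', 'AL', 'CO', 'CO'],
    'Year':  [2000, 2000, 2002, 2001, 2002],
    'Time':  [960.0, 1077.4, 1032.3, 943.5, 962.6],
})

# Rows = states, columns = years, values = number of records.
# State/year pairs with no records show up as 0 instead of being absent.
counts = (toy.groupby(['State', 'Year']).size()
             .unstack(fill_value=0)
             .reindex(columns=list(range(2000, 2003)), fill_value=0))
print(counts)
```

Zeros in such a table mark the state/year combinations that the site simply does not cover.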

In [172]:

## State Record holding Boys/Girls (only online records, only dating back to 2000)
boys.sort_values('Time').groupby('State').head(1)
# girls.sort_values('Time').groupby('State').head(1).sort_index()  #.sort_index() sorts by the groupby index, ie. 'State' in this case.

Out[172]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation
418739	1	850.4	Dathan Ritzenhein Rockford	2001	MHSAA State Championships LP 1st Nov 4, 2000	2000	MI	Boys	900.0
570173	1	858.7	Edward Cheserek St. Benedict's Prep	2013	Essex County Championships 1st Oct 26, 2012	2012	NJ	Boys	250.0
867606	1	860.8	Andrew Hunter Loudoun Valley	2016	Third Battle Invitational 1st Oct 17, 2015	2015	VA	Boys	950.0
78805	1	864.0	German Fernandez Riverbank High School (SJ)	2008	CIF State Cross Country Championships 1st Nov...	2007	CA	Boys	2900.0
36452	1	867.2	Bernie Montoya Cibola High School	2013	Division 1 Section 1 1st Oct 26, 2012	2012	AZ	Boys	4100.0
888984	1	871.7	Tanner Anderson North Central High School	2015	WIAA State Championship 1st Nov 8, 2014	2014	WA	Boys	1700.0
798976	1	872.2	Brodey Hasty BrentwoodH High School	2018	Great American Cross Country Festival 1st Sep...	2016	TN	Boys	900.0
755672	1	872.4	David Principe Jr. La Salle Academy	2017	Great American Cross Country Festival 2nd Sep...	2016	RI	Boys	200.0
834704	1	872.5	Conner Mantz Sky View	2015	Utah Region 5 XC Championships 1st Oct 16, 2013	2013	UT	Boys	6100.0
715918	1	872.7	Matthew Maton Summit	2015	George Fox Cross Country Classic 1st Oct 12, ...	2013	OR	Boys	3300.0
814836	1	874.0	Craig Lutz Marcus	2011	Foot Locker South Regional 1st Nov 28, 2009	2009	TX	Boys	1700.0
270612	1	875.0	Curtis Eckstein Oldenburg Academy	2017	IHSAA Cross Country Semi-State 2 - Shelbyville...	2016	IN	Boys	700.0
931718	1	879.4	Finn Gessner Madison La Follette	2017	Nike Cross Nationals Heartland Regional 1st N...	2016	WI	Boys	1050.0
520628	1	879.5	Seth Hirsch Millard West High School	2017	Nike Cross Nationals Heartland Regional 2nd N...	2016	NE	Boys	2600.0
203462	1	881.0	Josh Brickell Peachtree Ridge High School	2013	Foot Locker South Regional 4th Nov 26, 2011	2011	GA	Boys	600.0
323837	1	883.9	Jacob Thomson Holy Cross (Louisville)	2013	Great American Cross Country Festival 2nd Sep ...	2012	KY	Boys	750.0
22427	1	883.9	Levi Thomet Kodiak High School	2015	George Fox Cross Country Classic 2nd Oct 12, ...	2013	AK	Boys	1900.0
281009	1	884.0	Ellen Ries North-Linn High School	2005	Iowa State Cross Country Meet 1st Nov 1, 2003	2003	IA	Boys	1100.0
642538	1	884.0	Ben Huffman Providence Day School	2014	Foot Locker South Regional 2nd Nov 30, 2013	2013	NC	Boys	700.0
247153	1	884.2	Lukas Verzbicas Sandburg High School	2011	Foot Locker Midwest Regional 1st Nov 27, 2010	2010	IL	Boys	600.0
678380	1	884.7	Andrew Jordan Watkins Memorial	2016	Galion Cross Country Festival 1st Sep 19, 2015	2015	OH	Boys	850.0
113327	1	886.5	Cerake Geberkidane Denver East High School	2014	Arvada West Cross Country Invitational 1st Se...	2013	CO	Boys	6800.0
766366	1	886.8	Brent Demarest Porter Gaud	2014	SCISA Championships 1st Oct 26, 2013	2013	SC	Boys	350.0
617832	1	887.0	Jeriqho Gadway Plattsburgh HS	2015	Section 7 Championships 1st Oct 31, 2014	2014	NY	Boys	1000.0
739802	1	887.0	Noah Affolder Carlisle	2017	24th Carlisle High School Invitational 1st Se...	2016	PA	Boys	1100.0
136645	1	887.0	Alex Ostberg Darien High School	2015	FCIAC XC Championships 1st Oct 20, 2014	2014	CT	Boys	500.0
700649	1	891.0	Ben Barrett Norman North High School	2015	Foot Locker South Regional 6th Nov 30, 2013	2013	OK	Boys	1300.0
178506	1	891.0	Matt Mizereck Leon HS	2010	Foot Locker South Regional 4th Nov 28, 2009	2009	FL	Boys	100.0
461397	1	894.1	Seth Eliason Hopkins High School	2017	Nike Cross Nationals Heartland Regional 4th N...	2016	MN	Boys	1200.0
594997	1	894.4	Luis Martinez Sue Cleveland High School	2013	Nike Cross Regionals - Southwest 3rd Nov 17, ...	2012	NM	Boys	5700.0
663319	1	896.0	Jake Leingang Bismarck High School	2013	Foot Locker Midwest Regional 1st Nov 24, 2012	2012	ND	Boys	1900.0
787577	1	897.1	Derick Peters West Central High School	2018	Lennox Invitational 1st Sep 30, 2016	2016	SD	Boys	2200.0
908753	1	898.0	Jacob Burcham Cabell Midland	2013	West Virginia State XC Championships 1st Oct ...	2012	WV	Boys	1500.0
304929	1	898.3	Stuart Mcnutt Blue Valley West High School	2015	6A Regional - Olathe Northwest 1st Oct 26, 2013	2013	KS	Boys	2000.0
533020	1	899.7	Henry Weisberg McQueen High School	2017	Stanford Invitational 4th Oct 1, 2016	2016	NV	Boys	5500.0
488396	1	900.6	Caleb Hoover College Heights Christian	2011	Missouri Southern Stampede 1st Sep 18, 2010	2010	MO	Boys	800.0
232316	1	901.0	Elijah Armstrong Pocatello High School	2015	Asics Clovis Invitational 1st Oct 11, 2014	2014	ID	Boys	5000.0
349865	1	902.0	Ben True Greely High School	2004	Foot Locker Nationals 5th Dec 13, 2003	2003	ME	Boys	600.0
60602	1	903.0	Jacob Shiohira Bentonville High School	2015	Foot Locker South Regional 13th Nov 30, 2013	2013	AR	Boys	650.0
5752	1	903.3	Mac Macoy Vestavia Hills HS	2014	FSU Invitational (Pre-State) 2nd Oct 11, 2013...	2013	AL	Boys	500.0
549475	1	903.4	Patrick O'brien Oyster River High School	2017	NH Meet of Champions 1st Nov 5, 2016	2016	NH	Boys	1000.0
345960	1	903.8	Eric Coston St. Paul's School	2017	MC Watson Ford Invitational 1st Oct 8, 2016	2016	LA	Boys	100.0
508744	1	904.4	Marshall Beatty Sentinel High School	2017	Missoula Coaches Meet 1st Sep 1, 2016	2016	MT	Boys	3400.0
399220	1	905.2	John Murray Shrewsbury Senior High School	2011	Brown Invitational 2nd Oct 16, 2010	2010	MA	Boys	500.0
163089	1	906.8	Kevin Murray Charter School of Wilmington	2016	Nike Cross Nationals Southeast Regional 2nd N...	2015	DE	Boys	60.0
380593	1	907.9	Tyler Spear Loyola-Blakefield	2014	Nike Cross Nationals Southeast Regional 3rd N...	2013	MD	Boys	350.0
148550	1	916.5	Mike Crozier Gonzaga College High School	2012	Nike Cross Nationals Southeast Regional 4th N...	2011	DC	Boys	150.0
943190	1	917.5	Brody Smith Cody High School	2016	Nike Portland XC Invite 3rd Sep 26, 2015	2015	WY	Boys	6700.0
471601	1	919.2	Max Holman Tupelo High School	2010	Trinity/Valkyrie Invitational 4th Sep 18, 200...	2009	MS	Boys	300.0
221048	1	922.2	Kaeo Kruse Kamehameha Schools Oahu	2016	Hawaii HHSAA Cross Country Championships at CO...	2015	HI	Boys	3030.0
853024	1	924.2	Tyler Marshall Champlain Valley Union High Sc...	2017	NVAC Championship 1st Oct 29, 2016	2016	VT	Boys	1000.0

In [134]:

boys.Time.groupby(df.Year).mean().plot(color = 'b')
girls.Time.groupby(df.Year).mean().plot(color = 'r')
plt.axis([1998, 2018, 15*60, 30*60])
plt.ylabel('Time')
plt.title('Nationwide average 5k time')
ax = plt.gca()
ax.yaxis.set_major_formatter(formatter)
plt.legend(['boys','girls'])
plt.show()

In [225]:

dfboys = boys.Time.groupby([df.State, df.Year]).mean()
dfgirls = girls.Time.groupby([df.State, df.Year]).mean()
years = list(range(2000, 2017))  # seasons 2000 through 2016; a list can be reused on every loop pass
ind = 0
for key in sorted(df.Time.groupby(df.State).groups.keys()):    
    ind = ind + 1 
    plt.subplot(6, 10, ind)
    plt.xticks([])
    frame1 = plt.gca()
    frame1.axes.get_xaxis().set_visible(False)
    dfboys[key].fillna(0).reindex(index = years).plot(title = key, figsize = (16, 9), yticks = [15*60, 20*60, 25*60],
                            color = 'b', sharey = True, sharex = False)
    dfgirls[key].fillna(0).reindex(index = years).plot(title = key, figsize = (16, 9), yticks = [15*60, 20*60, 25*60],
                            color = 'r', sharey = True, sharex = False)
    plt.axis([1998, 2018, 14*60, 30*60])
    ax = plt.gca()
    ax.yaxis.set_major_formatter(formatter)    
    plt.tight_layout()
    plt.xlabel('')
plt.show()

In [201]:

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [ ]:

year = ['2000','2001','2002','2003','2004','2005','2006','2007',
        '2008','2009','2010','2011','2012','2013','2014','2015','2016']

dfplotly = pd.read_csv('milesplit data/d3usa5kboys.csv')

scl = [[0.0, '#a80000'],[0.25, '#ff3f3f'],[0.5, '#ff7f7f'],
       [0.75, '#ffbfbf'],[1.0, '#ffffff']]

for yr in year:
  
    # df['text'] = df['state'] + '<br>' +\
    #     '2000 '+df['2000']+' 2001 '+df['2001']+'<br>'+\
    #     '2002 '+df['2002']+' 2003 ' + df['2003']+'<br>'+\
    #     '2004 '+df['2004']+' 2005 '+df['2005']

    data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = dfplotly['code'],
        z = dfplotly[yr].astype(float),
        zmax = 20.0,
        zmin = -20.0,
        zauto = False,
        locationmode = 'USA-states',
        marker = dict(
            line = dict(
                color = 'rgb(255,255,255)',
                width = 1
            )
        ),
        colorbar = dict(
            title = "% Difference"
        ))]

    layout = dict(
        title = 'Percent difference from yearly national average<br>Boys 5K<br><br>'+yr,
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa'),
            showlakes = False))
    
    fig = dict(data=data, layout=layout)
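    # `py` is never imported in the cells shown; this assumes it was plotly.offline, whose `plot` is imported above
    plot(fig, filename='choropleth-map_' + yr, image_height = 600, image_width = 900,  image = 'png')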
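
The file `d3usa5kboys.csv` is read in ready-made, and the notebook does not show how it was built. Below is one plausible way to derive a "percent difference from yearly national average" table, sketched on toy data. The formula and the choice to average over all records (rather than over state means) are my assumptions, not necessarily what produced the real file:

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['AL', 'AL', 'CO', 'CO'],
    'Year':  [2000, 2001, 2000, 2001],
    'Time':  [1100.0, 1000.0, 900.0, 1000.0],
})

state_avg = toy.groupby(['State', 'Year']).Time.mean().unstack()   # states x years
national_avg = toy.groupby('Year').Time.mean()                     # one value per year

# DataFrame minus Series aligns the Series index with the columns, so each
# year's column is compared against that year's national mean.
pct_diff = (state_avg - national_avg) / national_avg * 100
print(pct_diff.round(1))
```

With times, a negative value means a state ran faster than the national average that year.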

In [228]:

dfCHS = df[df['Athlete/School'].str.contains("Centaurus High School")].sort_values(by=['Time', 'Athlete/School'])
dfCHS.head()

Out[228]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation
112338	12	943.5	Chandler Reid Centaurus High School	2013	Liberty Bell Cross Country Invitational 1st S...	2012	CO	Boys	6800.0
112365	39	962.6	Jake Link Centaurus High School	2015	Liberty Bell Cross Country Invitational 3rd S...	2012	CO	Boys	6800.0
111352	29	967.7	Chandler Reid Centaurus High School	2013	Pat Amato Classic 1st Oct 7, 2011	2011	CO	Boys	6800.0
105896	11	969.6	Tim Gaskins Centaurus High School	2005	Colorado State Cross Country Championships 5t...	2004	CO	Boys	6800.0
113378	52	973.0	Jake Link Centaurus High School	2015	Colorado 4A Region 3 Cross Country 3rd Oct 18...	2013	CO	Boys	6800.0

In [234]:

# Centaurus School Records
dfCHS['Duration'] = pd.to_datetime(df['Time'], unit='s')
dfCHS[df.Year == df.Grade - 1].head(10) #senior class records
# dfCHS[df.Year == df.Grade - 2] #junior class records
# dfCHS[df.Year == df.Grade - 3] #sophomore class records
# dfCHS[df.Year == df.Grade - 4] #freshman class records

C:\Users\Devin\Anaconda2\lib\site-packages\ipykernel_launcher.py:3: UserWarning:

Boolean Series key will be reindexed to match DataFrame index.

Out[234]:

	Rank	Time	Athlete/School	Grade	Meet	Year	State	Gender	Elevation	Duration
112338	12	943.5	Chandler Reid Centaurus High School	2013	Liberty Bell Cross Country Invitational 1st S...	2012	CO	Boys	6800.0	1970-01-01 00:15:43.500
105896	11	969.6	Tim Gaskins Centaurus High School	2005	Colorado State Cross Country Championships 5t...	2004	CO	Boys	6800.0	1970-01-01 00:16:09.600
114401	75	976.0	Jake Link Centaurus High School	2015	Colorado 4A Region 3 Cross Country 1st Oct 17...	2014	CO	Boys	6800.0	1970-01-01 00:16:16.000
115416	90	988.0	Brooks Macdonald Centaurus High School	2016	Liberty Bell Cross Country Invitational 2nd S...	2015	CO	Boys	6800.0	1970-01-01 00:16:28.000
111409	85	990.0	Jack Marshall Centaurus High School	2012	Liberty Bell Cross Country Invitational 8th S...	2011	CO	Boys	6800.0	1970-01-01 00:16:30.000
114459	133	991.0	Ben Patzer Centaurus High School	2015	Colorado 4A Region 3 Cross Country 4th Oct 17...	2014	CO	Boys	6800.0	1970-01-01 00:16:31.000
112512	186	1004.8	Sam Patzer Centaurus High School	2013	Pat Amato Classic 27th Oct 5, 2012	2012	CO	Boys	6800.0	1970-01-01 00:16:44.800
107045	18	1012.6	Jesse Fassler Centaurus High School	2007	Colorado State Cross Country Championships 10...	2006	CO	Boys	6800.0	1970-01-01 00:16:52.600
111495	171	1016.0	Logan Goodrich Centaurus High School	2012	Liberty Bell Cross Country Invitational 18th ...	2011	CO	Boys	6800.0	1970-01-01 00:16:56.000
112603	277	1020.3	Tyler Menger Centaurus High School	2013	Liberty Bell Cross Country Invitational 26th ...	2012	CO	Boys	6800.0	1970-01-01 00:17:00.300
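
The `UserWarning` above appears because the boolean mask is built from `df` while it is applied to the much smaller `dfCHS`, so pandas has to reindex it. The sketch below builds the mask from the filtered frame itself, and uses `pd.to_timedelta` so the duration column does not carry a 1970-01-01 date. The rows and the name `chs` are illustrative only:

```python
import pandas as pd

toy = pd.DataFrame({
    'Athlete/School': ['A Centaurus High School', 'B Centaurus High School',
                       'C Niwot High School'],
    'Time': [943.5, 962.6, 1005.0],
    'Grade': [2013, 2015, 2014],   # graduation year
    'Year': [2012, 2012, 2013],    # season the race was run
})

# .copy() makes the subset an independent frame, so adding a column is unambiguous.
chs = toy[toy['Athlete/School'].str.contains('Centaurus High School')].copy()
chs['Duration'] = pd.to_timedelta(chs['Time'], unit='s')

# Mask built from chs itself: races run in the fall before graduation (senior season).
seniors = chs[chs.Year == chs.Grade - 1]
print(seniors[['Athlete/School', 'Time', 'Duration']])
```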

In [243]:

dfCHS = df[df['Athlete/School'].str.contains("Centaurus High School")]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.scatter(dfCHS.Year, dfCHS.Rank)
ax1.set_title('Centaurus "Top 1000" Performances')
ax1.set_xlabel('Year')
ax1.set_ylabel('Rank')
ax1.set_ylim([-100, 1100])


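perf_counts = dfCHS.groupby('Year').Time.count()  # performances per year, index sorted by year
ax2.plot(perf_counts.index, perf_counts.values)   # x and y come from the same Series, so they stay aligned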
ax2.set_title('Number of Centaurus Performances in the Top 1000')
ax2.set_xlim([1999, 2017])
ax2.set_ylim([0, 25])

plt.tight_layout()
plt.show()

To Do

Get "Year = All" records
Integrate "course difficulty" ratings
Include track events and field events
Determine seconds / ft conversion
Machine learning and feature selection: predict times?