milesplit, part 2 - Exploring the Data

Posted on Thu 29 December 2016 in Projects

As a High School Cross Country coach, I visit co.milesplit.com quite often. This website is a nationwide database of High School Running (Cross Country and Track & Field) performances. For a weekend-long project, I thought I'd try to visualize some runner data.

I gave a Tuesday Nerd Talk at a local brewery about this project. Check it out here

Exploring the Data

Let's start by reading in the data we scraped during Part 1:

In [13]:
dfL = pd.read_csv('milesplit data/5K.csv')

For reasons that will become clear later on, I'm also going to read in elevation data for each state and merge the two DataFrames:

In [14]:
dfR = pd.read_csv('milesplit data/elevation.csv', names = ['State', 'Elevation'])
df = pd.merge(dfL, dfR, on='State')
df.head()
Out[14]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
0 1 959.9 Tyler Stanfield Homewood HS 2002 Foot Locker Nationals 2001 26th Dec 8, 2001 2001 AL Boys 500.0
1 2 1077.4 Scott Fuqua Oak Mountain HS 2002 Foot Locker Nationals 2001 32nd Dec 8, 2001 2001 AL Boys 500.0
2 1 1032.3 Robert Bedsole Hoover 2005 USATF National Junior Olympic XC Championships... 2002 AL Boys 500.0
3 2 1103.2 Joshua Pawlik Homewood HS 2005 USATF National Junior Olympic XC Championships... 2002 AL Boys 500.0
4 3 1109.7 Jeremy Moujoodi Hoover 2006 USATF National Junior Olympic XC Championships... 2002 AL Boys 500.0
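
One thing worth checking after a merge like this: pd.merge defaults to an inner join, so any result whose State has no row in the elevation table is silently dropped. The sketch below uses made-up toy frames (the names results and elevation, and the 'ZZ' state code, are placeholders, not the real files) to show how a left join with indicator=True can surface unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the scraped results and the elevation table (made-up values).
results = pd.DataFrame({
    'State': ['AL', 'AL', 'CO', 'ZZ'],
    'Time': [959.9, 1077.4, 943.5, 1000.0],
})
elevation = pd.DataFrame({'State': ['AL', 'CO'], 'Elevation': [500.0, 6800.0]})

# Default merge is an inner join: the 'ZZ' row has no elevation and is dropped.
inner = pd.merge(results, elevation, on='State')

# A left join keeps every result; indicator=True adds a '_merge' column
# that marks rows with no match in the elevation table as 'left_only'.
left = pd.merge(results, elevation, on='State', how='left', indicator=True)
unmatched = left.loc[left['_merge'] == 'left_only', 'State'].tolist()
```

If unmatched comes back empty on the real data, the inner join lost nothing.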

Now it's time to really dive in. Let's do some grouping:

In [16]:
boys = df.groupby('Gender').get_group('Boys')
girls = df.groupby('Gender').get_group('Girls')
print str(len(boys)) + ' boy records since 2000'
print str(len(girls)) + ' girl records since 2000'
526350 boy records since 2000
422241 girl records since 2000
In [18]:
df.dtypes
Out[18]:
Rank                int64
Time              float64
Athlete/School     object
Grade               int64
Meet               object
Year                int64
State              object
Gender             object
Elevation         float64
dtype: object

In pandas, the 'object' dtype marks a column that holds arbitrary Python objects; in this DataFrame, those columns hold strings. We'll see more on that a bit later.
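
As a small illustration of what an object column is, here is a sketch with toy Series (the values are made up). The dtype is requested explicitly so the example behaves the same across pandas versions.

```python
import pandas as pd

# An object column stores references to ordinary Python objects.
names = pd.Series(['Tyler Stanfield', 'Scott Fuqua'], dtype=object)
mixed = pd.Series(['Tyler Stanfield', 42, None], dtype=object)

is_object = names.dtype == object

# Each element keeps its own Python type, so an object column can be mixed.
element_types = [type(v).__name__ for v in mixed]

# When the elements are strings, the .str accessor applies string methods row by row.
upper = names.str.upper().tolist()
```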

Let's plot some stuff

In [19]:
athlete_school = df['Athlete/School'].str.split(expand = True)
athlete_school.columns = ['First_Name' , 'Next_Name' , 'C' , 'D' , 'E' , 'F' , 'G' , 'H', 'I', 'J' , 'K']
athlete_school.head()
Out[19]:
First_Name Next_Name C D E F G H I J K
0 Tyler Stanfield Homewood HS None None None None None None None
1 Scott Fuqua Oak Mountain HS None None None None None None
2 Robert Bedsole Hoover None None None None None None None None
3 Joshua Pawlik Homewood HS None None None None None None None
4 Jeremy Moujoodi Hoover None None None None None None None None
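
The split above needs one column name per whitespace-separated token, so the hard-coded list of eleven names would break if a longer entry appeared. A possible alternative, sketched here on three sample strings, is to cap the number of splits with n=2 so the remainder stays in a single column. The column name 'Rest' is a placeholder of mine, and like the original this assumes the first two tokens are the athlete's name.

```python
import pandas as pd

athlete_school = pd.Series([
    'Tyler Stanfield Homewood HS',
    'Scott Fuqua Oak Mountain HS',
    'Robert Bedsole Hoover',
])

# n=2 performs at most two splits: first token, second token, everything else.
parts = athlete_school.str.split(n=2, expand=True)
parts.columns = ['First_Name', 'Next_Name', 'Rest']

first_names = parts['First_Name'].tolist()
rest = parts['Rest'].tolist()
```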
In [22]:
athlete_school[['First_Name','Next_Name']].head()
Out[22]:
First_Name Next_Name
0 Tyler Stanfield
1 Scott Fuqua
2 Robert Bedsole
3 Joshua Pawlik
4 Jeremy Moujoodi
In [23]:
athlete_school.First_Name.describe()
Out[23]:
count     948591
unique     34217
top         Alex
freq        9531
Name: First_Name, dtype: object
In [214]:
plt.subplot(1, 2, 1)
athlete_school.First_Name.value_counts().head(25).plot.barh(figsize=(10,5), color='b')
ax1 = plt.gca()
ax1.invert_yaxis()
plt.title('First Names')

plt.subplot(1, 2, 2)
athlete_school.Next_Name.value_counts().head(25).plot.barh(figsize=(10,5), color='b')
ax2 = plt.gca()
ax2.invert_yaxis()
plt.title('Next Names')

plt.tight_layout()
plt.show()
In [211]:
plt.figure(figsize=(9,4))
athlete_school.First_Name.value_counts().head(500).plot.barh()
ax = plt.gca()
ax.invert_yaxis()
plt.yticks([])
plt.xlabel('Frequency')
plt.ylabel('First Names')
plt.title('Occurrence of First Names')
plt.show()
In [38]:
devins = sorted(athlete_school.loc[athlete_school['First_Name'] == 'Devin','Next_Name'].unique())
len(devins)
Out[38]:
621
In [43]:
## Fastest boys/girls times run since 2000
boys.sort_values('Time').head()
# girls.sort_values('Time').head()
Out[43]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
418739 1 850.4 Dathan Ritzenhein Rockford 2001 MHSAA State Championships LP 1st Nov 4, 2000 2000 MI Boys 900.0
570173 1 858.7 Edward Cheserek St. Benedict's Prep 2013 Essex County Championships 1st Oct 26, 2012 2012 NJ Boys 250.0
569173 1 860.0 Edward Cheserek St. Benedict's Prep 2013 Essex County Championships 1st Oct 28, 2011 2011 NJ Boys 250.0
867606 1 860.8 Andrew Hunter Loudoun Valley 2016 Third Battle Invitational 1st Oct 17, 2015 2015 VA Boys 950.0
78805 1 864.0 German Fernandez Riverbank High School (SJ) 2008 CIF State Cross Country Championships 1st Nov... 2007 CA Boys 2900.0
In [44]:
## lookup a particular record (in this case, the one with the quickest time)
df.loc[boys.Time.idxmin()]  # idxmin returns an index label, so look it up with .loc
Out[44]:
Rank                                                          1
Time                                                      850.4
Athlete/School                      Dathan Ritzenhein  Rockford
Grade                                                      2001
Meet              MHSAA State Championships LP  1st Nov 4, 2000
Year                                                       2000
State                                                        MI
Gender                                                     Boys
Elevation                                                   900
Name: 418739, dtype: object
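
A note on the lookup: idxmin returns an index label, not a row position. The two only coincide when the index is the default 0..n-1 range, which happens to be the case for the merged df here. The toy frame below (made-up index labels) sketches the difference.

```python
import pandas as pd

# Index labels deliberately differ from row positions.
toy = pd.DataFrame({'Time': [900.0, 850.4, 870.0]}, index=[10, 20, 30])

label = toy.Time.idxmin()             # index label of the fastest time
row_by_label = toy.loc[label]         # .loc looks rows up by label

position = toy.Time.values.argmin()   # 0-based position of the fastest time
row_by_position = toy.iloc[position]  # .iloc looks rows up by position

# toy.iloc[label] would raise IndexError here: there is no 21st row.
```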
In [104]:
response1 = requests.get('http://archive.dyestat.com/rivals/pics/2000100701880790.jpg')
response2 = requests.get('http://archive.dyestat.com/image/4tr/April/30StanfordCardinalInv/040430DathanRitzenheinStanfordInvMGallagher.jpg')
response3 = requests.get('http://1.bp.blogspot.com/-fJEQUnFpGCk/UBnndSVzPBI/AAAAAAAAE1w/lJbObv9k0hQ/s1600/Dathan+Ritzenhein-1.jpg')

ritzhs = Image.open(BytesIO(response1.content))
ritzcu = Image.open(BytesIO(response2.content))
ritzpro = Image.open(BytesIO(response3.content))

f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(10,5))
ax1.imshow(ritzhs)
ax1.set_title('HS 2000 14:10')
ax1.axis('off')
ax2.imshow(ritzcu)
ax2.set_title('Colorado 2002 13:27')
ax2.axis('off')
ax3.imshow(ritzpro)
ax3.set_title('American Record 2009 12:56')
ax3.axis('off')
Out[104]:
(-0.5, 299.5, 417.5, -0.5)
In [105]:
## Calculate nationwide averages
avg_5k = datetime.timedelta(seconds = df.Time.mean())
boys_avg_5k = datetime.timedelta(seconds = boys.Time.mean())
girls_avg_5k = datetime.timedelta(seconds = girls.Time.mean())

print 'Average 5K:   ' + str(avg_5k)[2:7]
print 'Boys Average 5K:   ' +  str(boys_avg_5k)[2:7]
print 'Girls Average 5K:   ' + str(girls_avg_5k)[2:7]
Average 5K:   20:18
Boys Average 5K:   18:34
Girls Average 5K:   22:29
In [108]:
# Were YOU a fast runner in HS??
df[df['Athlete/School'].str.contains('Devin Rourke')] # Nope
Out[108]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
In [110]:
df[df['Athlete/School'].str.contains('Elise Cranny')]  # YUP.
Out[110]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
122922 32 1135.0 Elise Cranny Niwot High School 2014 Colorado 4A Region 2 Cross Country 1st Oct 21... 2010 CO Girls 6800.0
123907 17 1114.0 Elise Cranny Niwot High School 2014 St. Vrain Cross Country Invitational 1st Sep ... 2011 CO Girls 6800.0
124901 10 1073.0 Elise Cranny Niwot High School 2014 Andy Myers Invitational 1st Oct 5, 2012 2012 CO Girls 6800.0
125892 1 1005.0 Elise Cranny Niwot High School 2014 Northern Conference Championship 1st Oct 5, 2013 2013 CO Girls 6800.0
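
One caveat with str.contains: it treats the pattern as a regular expression by default, so names containing characters such as parentheses (for example 'Riverbank High School (SJ)') may not match the way you expect. A minimal sketch on toy entries, using regex=False for a literal match and na=False so missing values count as non-matches:

```python
import pandas as pd

entries = pd.Series([
    'Elise Cranny Niwot High School',
    'German Fernandez Riverbank High School (SJ)',
    None,
])

# Literal substring search; missing entries are treated as "no match".
cranny = entries.str.contains('Elise Cranny', regex=False, na=False).tolist()
sj = entries.str.contains('(SJ)', regex=False, na=False).tolist()
```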

Plotting

In [114]:
## Function to label y-axis of plots with (mm:ss) instead of (sec)
def timeTicks(x, pos):
    d = dt.timedelta(seconds = x)
    return str(d)[2:7]  # slice 'h:mm:ss' down to 'mm:ss'
formatter = ticker.FuncFormatter(timeTicks)
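
The str(d)[2:7] slice works for anything under an hour, but it silently drops the hour digit beyond that (3600 seconds comes out as '00:00'). Since one of the plots below sets its y-limit at 60 minutes, here is a hedged alternative using divmod. The function name format_mmss is my own, not part of the original notebook.

```python
import datetime


def format_mmss(seconds):
    """Format a duration in seconds as mm:ss, or h:mm:ss from one hour up."""
    total = int(round(seconds))
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    if hours:
        return '%d:%02d:%02d' % (hours, minutes, secs)
    return '%02d:%02d' % (minutes, secs)


# What the slicing approach returns at exactly one hour, for comparison.
sliced_one_hour = str(datetime.timedelta(seconds=3600))[2:7]
```

It could be plugged into the same formatter with ticker.FuncFormatter(lambda x, pos: format_mmss(x)).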
In [215]:
f, (ax1, ax2) = plt.subplots(1, 2, sharey = True, figsize=(10, 5))
ax1.scatter(boys.Year, boys.Time, s = 2, marker = ".", color = 'b')
ax1.set_title('Boys')
ax1.set_xlim([1999, 2017])
ax1.set_ylim([10*60, 60*60])
ax1.set_ylabel('Time')

ax2.scatter(girls.Year, girls.Time, s = 2, marker = ".", color = 'r')
ax2.set_title('Girls')
ax2.set_xlim([1999, 2017])

ax1.yaxis.set_major_formatter(formatter)

plt.tight_layout()
plt.show()
In [216]:
import seaborn.apionly as sns

plt.figure(figsize=(10, 5))
ax = sns.violinplot(x=df.Year, y=df.Time, hue = df.Gender, cut = 0,
                    scale = "count", scale_hue = False, split = True,
                    palette = {"Boys": "blue", "Girls": "red"},
                    linewidth = 0, saturation = 0.9)
ax.set_ylim([15*60, 30*60])
ax.yaxis.set_major_formatter(formatter)
plt.title('Nationwide 5K times')
plt.xlabel('Year')
plt.ylabel('Time')

plt.show()

Completeness of the database

In [308]:
int_years = range(2000, 2017)  # seasons 2000 through 2016; avoids relying on 'year' from a later cell
ind = 0

for key in sorted(df.Time.groupby(df.State).groups.keys()):    
    ind = ind + 1
    ax = plt.subplot(6, 10, ind)
    plt.axis([2000, 2016, 0, 1000])
    plt.xticks([])
    frame1 = plt.gca()
    frame1.axes.get_xaxis().set_visible(False)
    df.Time.groupby([df.State, df.Year]).count().get(key).reindex(index=int_years).plot.bar(
        title = key, figsize = (15, 8), sharey = True, sharex = False, color = 'b') 
    plt.tight_layout()
    plt.xlabel('')
plt.show()
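
The same completeness check can also be read off as a table rather than a grid of bar charts. This is a sketch on a toy frame with made-up rows: count records per (State, Year), pivot the years into columns, and fill missing state/year combinations with zero.

```python
import pandas as pd

toy = pd.DataFrame({
    'State': ['AL', 'AL', 'AL', 'CO', 'CO'],
    'Year': [2001, 2001, 2002, 2002, 2002],
    'Time': [959.9, 1077.4, 1032.3, 943.5, 962.6],
})
years = list(range(2001, 2004))

# Rows are states, columns are years, values are record counts.
counts = (toy.groupby(['State', 'Year']).size()
             .unstack(fill_value=0)
             .reindex(columns=years, fill_value=0))
```

A zero in this table corresponds to an empty bar in the plot above: a state/year with no results in the database.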
In [172]:
## State Record holding Boys/Girls (only online records, only dating back to 2000)
boys.sort_values('Time').groupby('State').head(1)
# girls.sort_values('Time').groupby('State').head(1).sort_index()  #.sort_index() sorts by the groupby index, ie. 'State' in this case.
Out[172]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
418739 1 850.4 Dathan Ritzenhein Rockford 2001 MHSAA State Championships LP 1st Nov 4, 2000 2000 MI Boys 900.0
570173 1 858.7 Edward Cheserek St. Benedict's Prep 2013 Essex County Championships 1st Oct 26, 2012 2012 NJ Boys 250.0
867606 1 860.8 Andrew Hunter Loudoun Valley 2016 Third Battle Invitational 1st Oct 17, 2015 2015 VA Boys 950.0
78805 1 864.0 German Fernandez Riverbank High School (SJ) 2008 CIF State Cross Country Championships 1st Nov... 2007 CA Boys 2900.0
36452 1 867.2 Bernie Montoya Cibola High School 2013 Division 1 Section 1 1st Oct 26, 2012 2012 AZ Boys 4100.0
888984 1 871.7 Tanner Anderson North Central High School 2015 WIAA State Championship 1st Nov 8, 2014 2014 WA Boys 1700.0
798976 1 872.2 Brodey Hasty BrentwoodH High School 2018 Great American Cross Country Festival 1st Sep... 2016 TN Boys 900.0
755672 1 872.4 David Principe Jr. La Salle Academy 2017 Great American Cross Country Festival 2nd Sep... 2016 RI Boys 200.0
834704 1 872.5 Conner Mantz Sky View 2015 Utah Region 5 XC Championships 1st Oct 16, 2013 2013 UT Boys 6100.0
715918 1 872.7 Matthew Maton Summit 2015 George Fox Cross Country Classic 1st Oct 12, ... 2013 OR Boys 3300.0
814836 1 874.0 Craig Lutz Marcus 2011 Foot Locker South Regional 1st Nov 28, 2009 2009 TX Boys 1700.0
270612 1 875.0 Curtis Eckstein Oldenburg Academy 2017 IHSAA Cross Country Semi-State 2 - Shelbyville... 2016 IN Boys 700.0
931718 1 879.4 Finn Gessner Madison La Follette 2017 Nike Cross Nationals Heartland Regional 1st N... 2016 WI Boys 1050.0
520628 1 879.5 Seth Hirsch Millard West High School 2017 Nike Cross Nationals Heartland Regional 2nd N... 2016 NE Boys 2600.0
203462 1 881.0 Josh Brickell Peachtree Ridge High School 2013 Foot Locker South Regional 4th Nov 26, 2011 2011 GA Boys 600.0
323837 1 883.9 Jacob Thomson Holy Cross (Louisville) 2013 Great American Cross Country Festival 2nd Sep ... 2012 KY Boys 750.0
22427 1 883.9 Levi Thomet Kodiak High School 2015 George Fox Cross Country Classic 2nd Oct 12, ... 2013 AK Boys 1900.0
281009 1 884.0 Ellen Ries North-Linn High School 2005 Iowa State Cross Country Meet 1st Nov 1, 2003 2003 IA Boys 1100.0
642538 1 884.0 Ben Huffman Providence Day School 2014 Foot Locker South Regional 2nd Nov 30, 2013 2013 NC Boys 700.0
247153 1 884.2 Lukas Verzbicas Sandburg High School 2011 Foot Locker Midwest Regional 1st Nov 27, 2010 2010 IL Boys 600.0
678380 1 884.7 Andrew Jordan Watkins Memorial 2016 Galion Cross Country Festival 1st Sep 19, 2015 2015 OH Boys 850.0
113327 1 886.5 Cerake Geberkidane Denver East High School 2014 Arvada West Cross Country Invitational 1st Se... 2013 CO Boys 6800.0
766366 1 886.8 Brent Demarest Porter Gaud 2014 SCISA Championships 1st Oct 26, 2013 2013 SC Boys 350.0
617832 1 887.0 Jeriqho Gadway Plattsburgh HS 2015 Section 7 Championships 1st Oct 31, 2014 2014 NY Boys 1000.0
739802 1 887.0 Noah Affolder Carlisle 2017 24th Carlisle High School Invitational 1st Se... 2016 PA Boys 1100.0
136645 1 887.0 Alex Ostberg Darien High School 2015 FCIAC XC Championships 1st Oct 20, 2014 2014 CT Boys 500.0
700649 1 891.0 Ben Barrett Norman North High School 2015 Foot Locker South Regional 6th Nov 30, 2013 2013 OK Boys 1300.0
178506 1 891.0 Matt Mizereck Leon HS 2010 Foot Locker South Regional 4th Nov 28, 2009 2009 FL Boys 100.0
461397 1 894.1 Seth Eliason Hopkins High School 2017 Nike Cross Nationals Heartland Regional 4th N... 2016 MN Boys 1200.0
594997 1 894.4 Luis Martinez Sue Cleveland High School 2013 Nike Cross Regionals - Southwest 3rd Nov 17, ... 2012 NM Boys 5700.0
663319 1 896.0 Jake Leingang Bismarck High School 2013 Foot Locker Midwest Regional 1st Nov 24, 2012 2012 ND Boys 1900.0
787577 1 897.1 Derick Peters West Central High School 2018 Lennox Invitational 1st Sep 30, 2016 2016 SD Boys 2200.0
908753 1 898.0 Jacob Burcham Cabell Midland 2013 West Virginia State XC Championships 1st Oct ... 2012 WV Boys 1500.0
304929 1 898.3 Stuart Mcnutt Blue Valley West High School 2015 6A Regional - Olathe Northwest 1st Oct 26, 2013 2013 KS Boys 2000.0
533020 1 899.7 Henry Weisberg McQueen High School 2017 Stanford Invitational 4th Oct 1, 2016 2016 NV Boys 5500.0
488396 1 900.6 Caleb Hoover College Heights Christian 2011 Missouri Southern Stampede 1st Sep 18, 2010 2010 MO Boys 800.0
232316 1 901.0 Elijah Armstrong Pocatello High School 2015 Asics Clovis Invitational 1st Oct 11, 2014 2014 ID Boys 5000.0
349865 1 902.0 Ben True Greely High School 2004 Foot Locker Nationals 5th Dec 13, 2003 2003 ME Boys 600.0
60602 1 903.0 Jacob Shiohira Bentonville High School 2015 Foot Locker South Regional 13th Nov 30, 2013 2013 AR Boys 650.0
5752 1 903.3 Mac Macoy Vestavia Hills HS 2014 FSU Invitational (Pre-State) 2nd Oct 11, 2013... 2013 AL Boys 500.0
549475 1 903.4 Patrick O'brien Oyster River High School 2017 NH Meet of Champions 1st Nov 5, 2016 2016 NH Boys 1000.0
345960 1 903.8 Eric Coston St. Paul's School 2017 MC Watson Ford Invitational 1st Oct 8, 2016 2016 LA Boys 100.0
508744 1 904.4 Marshall Beatty Sentinel High School 2017 Missoula Coaches Meet 1st Sep 1, 2016 2016 MT Boys 3400.0
399220 1 905.2 John Murray Shrewsbury Senior High School 2011 Brown Invitational 2nd Oct 16, 2010 2010 MA Boys 500.0
163089 1 906.8 Kevin Murray Charter School of Wilmington 2016 Nike Cross Nationals Southeast Regional 2nd N... 2015 DE Boys 60.0
380593 1 907.9 Tyler Spear Loyola-Blakefield 2014 Nike Cross Nationals Southeast Regional 3rd N... 2013 MD Boys 350.0
148550 1 916.5 Mike Crozier Gonzaga College High School 2012 Nike Cross Nationals Southeast Regional 4th N... 2011 DC Boys 150.0
943190 1 917.5 Brody Smith Cody High School 2016 Nike Portland XC Invite 3rd Sep 26, 2015 2015 WY Boys 6700.0
471601 1 919.2 Max Holman Tupelo High School 2010 Trinity/Valkyrie Invitational 4th Sep 18, 200... 2009 MS Boys 300.0
221048 1 922.2 Kaeo Kruse Kamehameha Schools Oahu 2016 Hawaii HHSAA Cross Country Championships at CO... 2015 HI Boys 3030.0
853024 1 924.2 Tyler Marshall Champlain Valley Union High Sc... 2017 NVAC Championship 1st Oct 29, 2016 2016 VT Boys 1000.0
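
An equivalent way to pull one record per state, sketched on a toy frame with placeholder athlete names: ask each group for the index label of its fastest time with idxmin, then select those rows with .loc.

```python
import pandas as pd

toy_boys = pd.DataFrame({
    'Athlete': ['A', 'B', 'C', 'D'],   # placeholder names
    'State': ['MI', 'MI', 'NJ', 'NJ'],
    'Time': [850.4, 900.0, 870.0, 858.7],
})

# One index label per state, pointing at that state's fastest time.
fastest_idx = toy_boys.groupby('State').Time.idxmin()

# Select those rows and order them from fastest to slowest.
records = toy_boys.loc[fastest_idx].sort_values('Time')
```

Both approaches return a single row per state; with tied times they may pick different rows.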
In [134]:
boys.Time.groupby(df.Year).mean().plot(color = 'b')
girls.Time.groupby(df.Year).mean().plot(color = 'r')
plt.axis([1998, 2018, 15*60, 30*60])
plt.ylabel('Time')
plt.title('Nationwide average 5k time')
ax = plt.gca()
ax.yaxis.set_major_formatter(formatter)
plt.legend(['boys','girls'])
plt.show()
In [225]:
dfboys = boys.Time.groupby([df.State, df.Year]).mean()
dfgirls = girls.Time.groupby([df.State, df.Year]).mean()
years = ['2000','2001','2002','2003','2004','2005','2006','2007',
         '2008','2009','2010','2011','2012','2013','2014','2015','2016']
years = map(int, years)
ind = 0
for key in sorted(df.Time.groupby(df.State).groups.keys()):    
    ind = ind + 1 
    plt.subplot(6, 10, ind)
    plt.xticks([])
    frame1 = plt.gca()
    frame1.axes.get_xaxis().set_visible(False)
    dfboys[key].fillna(0).reindex(index = years).plot(title = key, figsize = (16, 9), yticks = [15*60, 20*60, 25*60],
                            color = 'b', sharey = True, sharex = False)
    dfgirls[key].fillna(0).reindex(index = years).plot(title = key, figsize = (16, 9), yticks = [15*60, 20*60, 25*60],
                            color = 'r', sharey = True, sharex = False)
    plt.axis([1998, 2018, 14*60, 30*60])
    ax = plt.gca()
    ax.yaxis.set_major_formatter(formatter)    
    plt.tight_layout()
    plt.xlabel('')
plt.show()
In [201]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
In [ ]:
year = ['2000','2001','2002','2003','2004','2005','2006','2007',
        '2008','2009','2010','2011','2012','2013','2014','2015','2016']

dfplotly = pd.read_csv('milesplit data/d3usa5kboys.csv')

scl = [[0.0, '#a80000'],[0.25, '#ff3f3f'],[0.5, '#ff7f7f'],
       [0.75, '#ffbfbf'],[1.0, '#ffffff']]

for yr in year:
  
    # df['text'] = df['state'] + '<br>' +\
    #     '2000 '+df['2000']+' 2001 '+df['2001']+'<br>'+\
    #     '2002 '+df['2002']+' 2003 ' + df['2003']+'<br>'+\
    #     '2004 '+df['2004']+' 2005 '+df['2005']

    data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = dfplotly['code'],
        z = dfplotly[yr].astype(float),
        zmax = 20.0,
        zmin = -20.0,
        zauto = False,
        locationmode = 'USA-states',
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 1
            ) ),
            colorbar = dict(
            title = "% Difference" 
        ))]

    layout = dict(
        title = 'Percent difference from yearly national average<br>Boys 5K<br><br>'+yr,
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa'),
            showlakes = False))
    
    fig = dict(data=data, layout=layout)
    py.plot(fig, filename='choropleth-map_' + yr, image_height = 600, image_width = 900,  image = 'png')
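
The notebook does not show how d3usa5kboys.csv was built. The sketch below is only my assumption of one way to get a "percent difference from yearly national average" table: average times per state and year, compare each to the national average for that year (taken here over all records, not over state means), and express the gap as a percentage. The data is made up.

```python
import pandas as pd

toy_boys = pd.DataFrame({
    'State': ['CO', 'CO', 'AL', 'AL'],
    'Year': [2015, 2016, 2015, 2016],
    'Time': [1100.0, 1000.0, 900.0, 1000.0],
})

# National average per year, over all records (an assumption).
national = toy_boys.groupby('Year').Time.mean()

# State averages: one row per state, one column per year.
state = toy_boys.groupby(['State', 'Year']).Time.mean().unstack('Year')

# Positive values mean slower than the national average that year.
pct_diff = state.sub(national, axis='columns').div(national, axis='columns') * 100
```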

In [228]:
dfCHS = df[df['Athlete/School'].str.contains("Centaurus High School")].sort_values(by=['Time', 'Athlete/School'])
dfCHS.head()
Out[228]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation
112338 12 943.5 Chandler Reid Centaurus High School 2013 Liberty Bell Cross Country Invitational 1st S... 2012 CO Boys 6800.0
112365 39 962.6 Jake Link Centaurus High School 2015 Liberty Bell Cross Country Invitational 3rd S... 2012 CO Boys 6800.0
111352 29 967.7 Chandler Reid Centaurus High School 2013 Pat Amato Classic 1st Oct 7, 2011 2011 CO Boys 6800.0
105896 11 969.6 Tim Gaskins Centaurus High School 2005 Colorado State Cross Country Championships 5t... 2004 CO Boys 6800.0
113378 52 973.0 Jake Link Centaurus High School 2015 Colorado 4A Region 3 Cross Country 3rd Oct 18... 2013 CO Boys 6800.0
In [234]:
# Centaurus School Records
dfCHS['Duration'] = pd.to_datetime(dfCHS['Time'], unit='s')
# Build the mask from dfCHS itself so it matches the frame being filtered
dfCHS[dfCHS.Year == dfCHS.Grade - 1].head(10) #senior class records
# dfCHS[dfCHS.Year == dfCHS.Grade - 2] #junior class records
# dfCHS[dfCHS.Year == dfCHS.Grade - 3] #sophomore class records
# dfCHS[dfCHS.Year == dfCHS.Grade - 4] #freshman class records
Out[234]:
Rank Time Athlete/School Grade Meet Year State Gender Elevation Duration
112338 12 943.5 Chandler Reid Centaurus High School 2013 Liberty Bell Cross Country Invitational 1st S... 2012 CO Boys 6800.0 1970-01-01 00:15:43.500
105896 11 969.6 Tim Gaskins Centaurus High School 2005 Colorado State Cross Country Championships 5t... 2004 CO Boys 6800.0 1970-01-01 00:16:09.600
114401 75 976.0 Jake Link Centaurus High School 2015 Colorado 4A Region 3 Cross Country 1st Oct 17... 2014 CO Boys 6800.0 1970-01-01 00:16:16.000
115416 90 988.0 Brooks Macdonald Centaurus High School 2016 Liberty Bell Cross Country Invitational 2nd S... 2015 CO Boys 6800.0 1970-01-01 00:16:28.000
111409 85 990.0 Jack Marshall Centaurus High School 2012 Liberty Bell Cross Country Invitational 8th S... 2011 CO Boys 6800.0 1970-01-01 00:16:30.000
114459 133 991.0 Ben Patzer Centaurus High School 2015 Colorado 4A Region 3 Cross Country 4th Oct 17... 2014 CO Boys 6800.0 1970-01-01 00:16:31.000
112512 186 1004.8 Sam Patzer Centaurus High School 2013 Pat Amato Classic 27th Oct 5, 2012 2012 CO Boys 6800.0 1970-01-01 00:16:44.800
107045 18 1012.6 Jesse Fassler Centaurus High School 2007 Colorado State Cross Country Championships 10... 2006 CO Boys 6800.0 1970-01-01 00:16:52.600
111495 171 1016.0 Logan Goodrich Centaurus High School 2012 Liberty Bell Cross Country Invitational 18th ... 2011 CO Boys 6800.0 1970-01-01 00:16:56.000
112603 277 1020.3 Tyler Menger Centaurus High School 2013 Liberty Bell Cross Country Invitational 26th ... 2012 CO Boys 6800.0 1970-01-01 00:17:00.300
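
The filter above relies on the Grade column appearing to hold the graduation year, so a senior's fall season is the year before it. A small helper makes that arithmetic explicit; class_year is a hypothetical name of mine, not something from the notebook.

```python
def class_year(season_year, grad_year):
    """Map a fall season and a graduation year to a class label.

    Returns None when the gap falls outside a four-year high school career.
    """
    labels = {1: 'senior', 2: 'junior', 3: 'sophomore', 4: 'freshman'}
    return labels.get(grad_year - season_year)


# Chandler Reid (class of 2013) racing in the 2012 season:
reid_2012 = class_year(2012, 2013)
# Jake Link (class of 2015) racing in the 2012 season:
link_2012 = class_year(2012, 2015)
```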
In [243]:
dfCHS = df[df['Athlete/School'].str.contains("Centaurus High School")]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.scatter(dfCHS.Year, dfCHS.Rank)
ax1.set_title('Centaurus "Top 1000" Performances')
ax1.set_xlabel('Year')
ax1.set_ylabel('Rank')
ax1.set_ylim([-100, 1100])


perf_counts = dfCHS.groupby('Year').size()  # performances per year, sorted by year
ax2.plot(perf_counts.index, perf_counts.values)
ax2.set_title('Number of Centaurus Performances in the Top 1000')
ax2.set_xlim([1999, 2017])
ax2.set_ylim([0, 25])

plt.tight_layout()
plt.show()

To Do

  • Get "Year = All" records
  • Integrate "course difficulty" ratings
  • Include track events and field events
  • Determine seconds / ft conversion
  • Machine learning, feature selection... predict times?