Calculating Race Handicaps using Data Science: The Hallam Chase¶

The Hallam Chase is the oldest continuously run fell race in the world. It takes place every year in Sheffield around the second bank holiday in May. As a Sheffield resident and (once) keen fell runner I have competed myself (even coming second in 2013). To compete in the race you must submit recent race times so you can be assigned an appropriate handicap. I love the idea of a handicapped race: if you run a good race you have a chance of winning overall -- but this relies on providing good quality handicaps to everyone. Competitors may not submit their best times or may seek to gain an advantage in some other way (to clarify, I am not suggesting this happens, merely ruminating). This got me thinking: we have great records of fell races in the region. Could we design a system that automatically calculates a handicap based on recent race performance? Would such a system be able to improve on the current, tried and tested method? What an interesting proposition! This notebook outlines the process I undertook to investigate the potential of an automated, data-based handicap system for the Hallam Chase.

Current Performance¶

The first thing to look at is the performance of the current handicapping. This isn't obvious: how could we say whether someone has been given too much, or too little, handicap time? One way to examine this is to imagine that a 'perfect race' involves everyone crossing the finish line at the same time! Of course, many may disagree, probably the person tasked with recording finish times. This isn't a realistic scenario but it serves as a starting point. If we assume that errors in handicapping are normally distributed around zero, we can use the average finish time to calculate handicap performance (or error).

Each racer is given a handicap as seconds from the slowest person. The first person to set off has a 0 seconds handicap. We would like to know how well a person's finish time was predicted by their handicap. To do this we need to reframe the handicap for each person as 'expected time from mean finish time'. In this way someone who runs the race slower than average has a negative handicap, while a faster runner has a positive handicap (it isn't hard to see why this isn't used for the actual race).

To rescale the handicap we take the mean handicap given in a race and subtract it from each original handicap value: $$H^{*}_i = H_i - \hat{H}$$ where $\hat{H}$ is the mean handicap.

We can calculate a runner's predicted finish time by using an individual's handicap and the mean finish time $\hat{t}$: $$t_i = \hat{t} - H^{*}_i$$
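To make the arithmetic concrete, here is a minimal sketch of the rescaling and prediction in pandas. The column names `handicap_s` and `finish_time_s` are placeholders of mine, not the fields in the real results data.

```python
import pandas as pd

def predict_times_from_handicaps(results: pd.DataFrame) -> pd.Series:
    """Predict each runner's time from their handicap alone.

    Implements H*_i = H_i - mean(H) and t_i = mean(t) - H*_i.
    """
    rescaled = results["handicap_s"] - results["handicap_s"].mean()
    return results["finish_time_s"].mean() - rescaled
```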

Summary Finish Times¶

It is useful to examine finishing times for Hallam Chase races. We will examine every race since 2017, looking at times spent running the course (course times) and finishing times which include the handicap (race times). You can observe that even though the spread of course times is relatively consistent, average handicaps can vary by several minutes between different editions of the race.

| Year | Fastest Course Time | Slowest Course Time | Average Course Time | Average Handicap | Fastest Race Time | Slowest Race Time |
|------|---------------------|---------------------|---------------------|------------------|-------------------|-------------------|
| 2017 | 22 m, 30.00 s | 47 m, 30.00 s | 32 m, 42.23 s | 09 m, 44.93 s | 38 m, 41.00 s | 50 m, 17.00 s |
| 2018 | 23 m, 56.00 s | 47 m, 24.00 s | 32 m, 14.83 s | 12 m, 2.41 s | 40 m, 19.00 s | 49 m, 24.00 s |
| 2019 | 25 m, 7.00 s | 47 m, 31.00 s | 32 m, 47.43 s | 11 m, 3.49 s | 38 m, 10.00 s | 52 m, 21.00 s |
| 2021 | 21 m, 46.00 s | 46 m, 1.00 s | 31 m, 47.44 s | 12 m, 14.41 s | 40 m, 5.00 s | 50 m, 50.00 s |
| 2022 | 24 m, 43.00 s | 45 m, 29.00 s | 32 m, 13.42 s | 14 m, 28.07 s | 41 m, 54.00 s | 53 m, 20.00 s |
| 2023 | 24 m, 53.00 s | 50 m, 49.00 s | 32 m, 25.94 s | 15 m, 33.39 s | 44 m, 35.00 s | 51 m, 52.00 s |

Quality of prediction¶

Using the equations above, we can calculate a 'predicted finishing time' for every competitor in the race and plot that against their actual finishing time. The root mean squared error (RMSE) indicates how far, on average, each prediction was from reality. The lower this value, the 'better' the prediction. Of course, it would be impossible to get the RMSE to zero, but generally a lower RMSE would lead to a 'tighter' race where runners reach the finishing line closer together than in a race with a high RMSE. The first and last place from each race are highlighted in green and red respectively. In each race these are the runners that are furthest away from the 'equality line' placed on the plot (the line where Predicted Time = Race Time). It is impossible to know whether this is due to an inappropriate handicap or an abnormal performance from each runner. Overall, I think it is impressive how well handicaps are assigned each year. The winner comes from across the whole spectrum of race times each year (arguably the very fastest runners have their work cut out, but there are other races for them).
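For reference, the RMSE itself is a short calculation; a sketch with NumPy (the plotting code is omitted):

```python
import numpy as np

def rmse(predicted, actual) -> float:
    """Root mean squared error between predicted and actual times, in seconds."""
    predicted, actual = np.asarray(predicted, dtype=float), np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))
```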

Does your Handicap predict performance?¶

Another metric to examine might be whether the handicap you're given is related to your overall position. In other words, do final race positions have any relationship to the handicap you were given? Do racers starting first (with a low handicap) finish in better positions than those starting later (with a high handicap)? A simple way to examine this is with Pearson's correlation ($r$). The $r$ value varies between -1 and 1 and indicates the strength of a (linear) relationship. In other words, if the r value is 1, then the person starting first finishes first, the person starting second finishes second and so on. If the r value is -1 then the person starting last finishes first, the person starting second to last finishes second, and so on. What we want is an r value of 0 -- this indicates that there is no relationship between handicap and finishing position. The table below shows that the handicap has done a great job of 'mixing up' the competitors: there are no strong relationships between handicap and finishing position. In some years the faster runners have a slight advantage, in other years the slower runners do, but the effect is very small.

| Year | Pearson's r |
|------|-------------|
| 2017 | -0.200918 |
| 2018 | -0.200616 |
| 2019 | 0.172618 |
| 2021 | 0.115444 |
| 2022 | 0.295902 |
| 2023 | 0.233319 |
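Each of these values can be computed with SciPy; a small sketch, where the arguments are one year's lists of handicaps and finishing positions (the function and argument names are mine):

```python
from scipy.stats import pearsonr

def handicap_position_correlation(handicaps, positions) -> float:
    """Pearson's r between the handicaps runners were given and their finishing positions."""
    r, _p_value = pearsonr(handicaps, positions)
    return float(r)
```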

Predicting Handicaps¶

So the question arises: could the handicap times be improved using racing data? Before we can answer this question, we have to gather data. This was done through a semi-automated process of scraping results tables from local races and collating all of this data into a single database.

No standards¶

The first obvious thing to notice when conducting this exercise was the variety of different data formats used when recording race data. This included tables on a webpage, JSON files gathered via an API, or an Excel file stored in the cloud somewhere. Different methods had to be created to deal with each of these formats. Secondly, I was keen to record as much rich data as I could, and that included the category of each runner. This is when I realised that while categories are broadly similar between races, there is no definitive set. While there was typically a male/men's category, the corresponding category could be female (F), women's (W) or ladies (L). Does the race recognise under 23? Under 21? Under 18? Is the main age range 'senior' or 'open'? Do age categories go up by 5 or 10 years? The answer is: all of these and many more. Even though I haven't used age categories in this analysis, I spent an inordinate amount of time coalescing all of these options into a finite set of categories.
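As a flavour of that coalescing step, here is a sketch of the kind of lookup I mean. The raw labels and canonical names shown are illustrative examples rather than the full mapping I actually used.

```python
# Illustrative mapping from raw results-sheet category strings to a canonical set.
CATEGORY_MAP = {
    "M": "Male Senior", "MSEN": "Male Senior", "MO": "Male Senior",
    "F": "Female Senior", "W": "Female Senior", "L": "Female Senior",
    "MV40": "Male Vet 40", "M40": "Male Vet 40",
    "FV40": "Female Vet 40", "W40": "Female Vet 40", "LV40": "Female Vet 40",
}

def normalise_category(raw: str) -> str:
    """Map a raw category label onto the canonical set, flagging anything unknown."""
    return CATEGORY_MAP.get(raw.strip().upper(), "UNKNOWN")
```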

Who's who?¶

Another common issue arises when trying to link lots of small sets of data (race results) from multiple different locations. Names. Although the fell racing scene has moved on in recent years, I still remember the years of writing my name in shaky biro on a tiny slip of paper in order to get my race number. As a result of my horrendous handwriting I often raced under a name which bore only a passing resemblance to that on my birth certificate. As you'd expect, there also isn't any other personal information given in race results which might be useful to confirm identity (date of birth, for example). As a result, it can often be difficult to work out whether two people with similar names are actually the same person or not. To try and combat this I meticulously went through the whole data set looking at potential close matches of names and assigning them the same identity, or not. This was mostly a matter of judgement. All I had to go off was the person's name and their race performances. If names were close enough and their race times were similar, they would often be combined. However, quite often it wasn't obvious whether the difference in name was a spelling mistake or sobriquet (Dave for David, for example), or a perfectly reasonable alternative belonging to a different person. For that reason I erred on the side of caution.
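I did this largely by eye, but a helper along these lines can at least surface candidate matches for review. This is a sketch only, and the 0.85 threshold is an arbitrary choice of mine.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity between two runner names: 0 is unrelated, 1 is identical."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()

def is_candidate_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag a pair of names for manual review rather than merging automatically."""
    return name_similarity(a, b) >= threshold
```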

The data¶

To make the data capture task manageable I focused on a specific set of races over the same time period as our results for the Chase (2017-2023). The plot below shows the data we have captured. It lists the races, showing the distribution of race times and the number of runners we have for each race (including all years/instances of the race we have captured).

Is that normal?¶

You may notice that the shape showing each of the race distributions has a thin tail trailing off towards the right (this is most obvious for the ParkRun, PR, data). This effect is caused by the nature of running events. There tends to be a 'hard' limit at one end of the results: you are generally limited in how fast you can run a race. However, there is (theoretically) no limit at the slower end; you can take as long as you like, provided that the person with the stopwatch agrees to wait for you (my thoughts here are drawn to the person completing the London Marathon in a suit of armour over several days). As a result, race results are not 'normal' distributions but 'log-normal'. Log-normal distributions are skewed in one direction. I would like to compare performance between races by taking a z-score. A z-score is a standard way of measuring a value from any normal distribution (a z-score of 0 means a value is equal to the mean/average value). In order to do this, I have to transform our race times into normal distributions. If I transform all the race times like so: $$t^{*}_i = \log(t_i)$$ the race times should become more normally distributed. A plot of each race now shows a much more symmetrical shape (I've removed axis labels as log(time) doesn't mean much).

With log-transformed times for all races (including the Chase) we can calculate z-scores and see how race times from different events compare to race times from the Hallam Chase!
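A sketch of that calculation, applied per race edition (the column names are illustrative):

```python
import numpy as np
import pandas as pd

def log_z_scores(times_s: pd.Series) -> pd.Series:
    """Z-scores of log-transformed times for a single edition of a race."""
    log_t = np.log(times_s)
    return (log_t - log_t.mean()) / log_t.std()

# Applied per race edition, e.g.
# results["z"] = results.groupby(["race", "year"])["time_s"].transform(log_z_scores)
```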

Why Z-Scores¶

I chose to use Z-scores rather than race times as they are a measure of relative performance. It doesn't matter how long the race is: Z-scores can be compared to determine whether one time is better than another. Another great advantage is that they are a better way of combining data over multiple years. Let us imagine that for a particular edition of a race the weather is particularly bad, howling wind and pummelling rain. The times are much slower compared to previous years. It would be bad practice to group all this data together: excellent performances would look mediocre compared to an edition of the race when the weather was fine. In order to combine the data we would have to standardise the times in some way -- Z-scores are a great way to do this.

Race vs Chase¶

If we want to predict Chase times based on other race performances, we need runners in our database that have competed in one of the races we've collected and in the Hallam Chase between 2017 and 2023. A quick database query can tell us how much of an overlap we have.
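Something along these lines does the job; the table and column names here are a guess at the schema rather than the real database layout.

```python
import sqlite3

OVERLAP_SQL = """
SELECT COUNT(DISTINCT runner_id)
FROM results
WHERE race_name != 'Hallam Chase'
  AND runner_id IN (SELECT runner_id FROM results WHERE race_name = 'Hallam Chase');
"""

with sqlite3.connect("races.db") as conn:  # hypothetical database file
    (n_overlap,) = conn.execute(OVERLAP_SQL).fetchone()
    print(f"There are {n_overlap} racers that have run in the Hallam Chase and one of our races.")
```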

There are 147 racers that have run in the Hallam Chase and one of our races.
They have run in 5450 races between them.

That is a lot of races overall! However, from interrogating the data, I know that most of these are from the two Parkruns that I gathered data from. The event is weekly, so if you're a regular runner it's possible to tot up a lot of races in a year!

Let's have a look at how the Z-Scores from one of the Parkruns compare to the Chase results. For each data point I'm comparing run results with Chase results from the same year. For example, your Z-Score from a parkrun in 2018 will be compared with your Hallam Chase result from 2018. If you didn't run the Chase that year, no data is shown. There are some subtleties here that I've chosen to ignore. The Chase is run in May, so it may be prudent to re-scale the year to run from May to May. For simplicity I haven't done that here, as I don't think the difference would be worth it. Let's look at the comparison between Endcliffe Parkrun and the Hallam Chase.

Here we are comparing Z-Scores: the lower the number, the faster the time. You may notice that some data points are clustered in horizontal lines. This happens when a runner has performed the event multiple times in a year and each run is assigned the same Chase performance value (as there is only one each year). Some of these horizontal clusters cover quite a range of performances -- some are a lot faster than others. This data reveals something quite tricky about Parkrun as a comparison event -- it's not a race! One week you may choose to go for a PB, the next have a nice chat with a friend. However, there may be useful information in there somewhere. Firstly, how do we deal with someone who has run the event many times in a single year? Which performance do we take? We could take the average, but this might lead us to use a value that is unrepresentative of reality (an equal number of 'runs' and 'walks' would put the average between the two). Instead I'll choose a compromise: the 75th percentile performance. In practice this would lead to the fastest of two runs, the second fastest of four, and so on. We're not just taking the fastest but choosing one that we hope represents a 'good effort'. Let's change our processing and have a look.
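A sketch of that aggregation with pandas; because lower z-scores are faster, the '75th percentile performance' corresponds to the 0.25 quantile of the z-scores. The column names (`runner_id`, `year`, `z`) are illustrative.

```python
import pandas as pd

def yearly_good_effort(parkrun: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeat runs to one 'good effort' z-score per runner per year."""
    return (parkrun
            .groupby(["runner_id", "year"])["z"]
            .quantile(0.25)          # lower z = faster, so this is the 75th percentile *performance*
            .reset_index(name="z_good_effort"))
```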

Things look a lot clearer now. A runner's performance in a single year is now represented by a single data point. While the data looks 'noisy', there does seem to be a nice linear relationship in there somewhere. There are good reasons why some data points might belong to a relationship while others are 'noisy'. As discussed, at a Parkrun there will be a lot of people attending for social reasons. A time does not necessarily represent someone's 'best effort'. However, for many people it will! Is there a way of excluding the social runs and just focusing on best efforts? There is a fantastic algorithm (if I was being rash, I might go as far as to say my favourite algorithm) called RANSAC (random sample consensus). It's a very versatile process that finds data points belonging to a particular relationship and excludes those that do not. I first came across it as a way to find shapes in 3D data (and I've loved it ever since). In the case of our run data, RANSAC will find a line of best fit that gets 'consensus', i.e. it's the 'best' line in our data. After that it can reject any data points that don't belong to this line of best fit, effectively picking out the data points that belong to 'best effort' runs. Let's apply RANSAC to this data and see what it looks like; I'll keep the removed data points in the plot but grey them out.
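Here is a sketch of how that can look with scikit-learn's RANSACRegressor; I'm using its default residual threshold rather than claiming any particular tuning.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

def fit_ransac_line(race_z, chase_z):
    """Fit a robust line through (race z-score, Chase z-score) pairs.

    Returns the fitted model and a boolean mask of the points RANSAC kept as
    inliers (the 'best effort' runs); everything else is treated as noise.
    """
    X = np.asarray(race_z, dtype=float).reshape(-1, 1)
    y = np.asarray(chase_z, dtype=float)
    model = RANSACRegressor(random_state=0)  # default base estimator is a linear regression
    model.fit(X, y)
    return model, model.inlier_mask_
```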

The RANSAC has cleaned up the plot considerably! It looks like this approach will give us some nice relationships that we can work with. Let's do the same for every race we have data for.

Creating our models¶

RANSAC has done a decent job of separating noisy data points for each of the races. Hopefully the linear relationships we have uncovered will be useful in predicting Hallam Chase race times. To do this fairly, we should follow the process below:

  1. Remove a year's worth of data -- this is the data we'll be predicting
  2. With the remaining data, calculate our linear relationships that link a race's time to the Hallam Chase times
  3. Use the race performances in the data from step (1) to predict Hallam Chase times
  4. Repeat for all years to see how well the method works!
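A skeleton of that loop; `fit_models` and `predict_chase_times` are placeholders standing in for steps 2 and 3, not real library functions.

```python
import pandas as pd

def leave_one_year_out(data: pd.DataFrame, fit_models, predict_chase_times) -> dict:
    """Hold out each year in turn, fit on the rest, and predict the held-out year."""
    predictions = {}
    for held_out in sorted(data["year"].unique()):
        train = data[data["year"] != held_out]                      # 1. remove a year's data
        models = fit_models(train)                                  # 2. fit race -> Chase relationships
        test = data[data["year"] == held_out]
        predictions[held_out] = predict_chase_times(models, test)   # 3. predict the held-out year
    return predictions                                              # 4. repeat for all years
```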

Predicting a time¶

The way we predict a Hallam Chase time is to use a linear relationship or 'model'. This model will be in the form: $$P_{Chase} = A \times P_{race} + B$$ In this case I've used $P$ to denote performance -- the Z-Scores we have calculated for each runner. To transform these performances back into race times we first convert the predicted Z-Score back into a log-transformed time (using the mean and standard deviation of that race) and then take the inverse log (exponential) to end up with the race time in seconds. Simple!
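A sketch of that back-transformation; the parameter names are mine, with the log-time mean and standard deviation being those used to build the Chase z-scores.

```python
import numpy as np

def predicted_chase_seconds(race_z: float, A: float, B: float,
                            chase_log_mean: float, chase_log_std: float) -> float:
    """Turn a race z-score into a predicted Hallam Chase time in seconds."""
    chase_z = A * race_z + B                              # P_Chase = A * P_race + B
    log_time = chase_z * chase_log_std + chase_log_mean   # undo the z-score
    return float(np.exp(log_time))                        # undo the log transform
```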

Let's do this: we'll make a similar plot to the ones we've seen before, but this time we'll be comparing the recorded Chase times to those we've predicted using the method above.

NB: you can see in the plots above that some races have very few data points; these will be removed from the analysis.

To make a prediction we will take the average of the predictions from all the races an individual has competed in.

Equivalent errors¶

We don't have predictions for all of the Hallam Chase runners; not everyone has run a race that we have captured. To compare things equivalently, we remove those people from our Hallam Chase data completely.

Let's plot the predicted times against the actual race times. In this way we can visualise the quality of our predictions. The closer the points are to the straight $x=y$ line, the better the predictions. We can also look at the RMSE for our 'new' and 'traditional' methods to see how they compare. The lower the RMSE, the better the predictions.

Both our new method and the traditional handicapping method follow a straight line well (the odd stray runner can be seen here and there). However, examining the RMSE values, it does seem that our new method is making improved predictions! It is difficult to visualise in the plot above though; instead, let's plot each runner according to the time they crossed the finish line.

We can see how the 'new' handicap system produces a much tighter field. Sometimes the winner is out in front, but the bulk of the runners are generally clustered quite tightly.

Mixing it up¶

Earlier we examined the relationship between the handicap a runner was given and their finishing position. If done effectively, a handicap should eliminate any advantage a runner gains from a higher running speed. In other words, there should be no relationship between handicap and finishing position -- something we measured using Pearson's r. Does our new handicap still 'mix' the runners up effectively? I've attempted a fancy plot for this. On the x-axis you can see the finishing position of each runner for each race. Data points are coloured according to the handicap they received. Runners given no, or very little, handicap are a dark purple and the colour transitions to bright yellow for the largest handicaps (our fastest runners). I should clarify that due to the differing number of runners we were able to include each year, earlier races won't contain as bright a yellow colour. The colour corresponds to the rank of the handicap within that year, not the absolute value.

I have included Pearson's correlation values for each race we have analysed. Generally the traditional method does a better job of mixing up our runners. However, we still have very low correlations with our new method, showing that it is doing a good enough job.

Conclusions¶

So, after this time intensive exercise, what have we learnt? Can we handicap effectively using a more data-based approach?

Firstly, a large chunk of this project is pretty hard to see: the hours spent corralling the results from many different websites into a single, (mostly) coherent database. I must acknowledge my friend and colleague Nick Hamilton for helping me put all that together.

Secondly, the pre-processing of the data is always important. Calculating Z-scores allowed us to group multiple editions of a race together. Clever algorithms meant we could capture only the race performances we thought were best efforts. This meant we had access to much more data than we might have had otherwise.

Third, we didn't use anything complicated to predict Hallam Chase performances. With all our effective pre-processing, a linear regression did the job!

Overall, I think we've demonstrated that we are able to handicap effectively using these data techniques. This new method has a lower RMSE when compared to equivalent data handicapped using the traditional method, and it still does a good job of 'mixing' the runners. However, let's not forget the maintenance and extra work required here. Every year we'd have to invest time to make sure our database is up-to-date.

Caveat¶

Please note, the times I've presented using this new method are simulated. I've taken the real times given using the traditional method and adjusted the handicap to alter the overall finishing time. In reality these runners would not have run the same time if given a different handicap. Runners will interact with each other and go slower or faster depending on specific circumstances. However, we cannot say whether this effect would make our method more or less effective than what we've observed in this analysis.