MegaTomBike 2021

Aftermath

Posted by Lense Swaenen on November 11, 2021 · 21 mins read

Well, that was a disappointing season of MegaTomBike…

This post is the successor to this one, posted end of March 2021, right when the deadline for team submission was for the 2021 edition of MegaTomBike. MegaTomBike is a local fantasy cycling competition. The team I had submitted was generated through a mathematical optimization approach, namely as a type of knapsack problem. A mathematical approach had proven succesful for me in previous competition, but this year alas…

The last race on the calendar was on October 9th. This post contains some aftermath. My team, Poeske Scherens Fanclub finished 62nd out of 87 teams, with a score of 1089 points. The final winning team was Hala Die Mannschaft with a score of 1564, winning out by a mere 4 points over the second placed team Bekke - Algemeen. The third team lagged quite a bit further behind. The composition of the winning team is:

Rider Points
Geraint Thomas 121
Egan Bernal 132
Tadej Pogačar 315
Mathieu van der Poel 160
Wout van Aert 316
Julian Alaphilippe 226
Quinn Simmons 1
Thomas Pidcock 86
Mark Cavendish 54
Remco Evenepoel 83
Tim Merlier 47
Gianni Vermeersch 19
Arjen Livyns 0
Mauri Vansevenant 3
Heinrich Haussler 1

Why did I not win?

The setup of MegaTomBike is to create a team of exactly 15 riders, with a cost not more than 1.750.000 euro. The team collects points during the season; the team with the most points wins.

The team I created was optimal in some mathematical sense, namely as the solution of a knapsack-type problem, where I searched for a team of 15 riders which satisfies the cost constraint, and maximizes the total number of points. The challenge of course is that the number of points collected by the end of the season is unknown, so one needs to make a prediction / use a proxy. The proxy I used were the UCI points from the season before. Now, we can conclude this was quite a poor proxy.

As was already discussed in the final section of the preceding blog post, the costs that the organizers of MegaTomBike have assigned to riders is not random, but actually also some kind rider level determined from cycling stats from the past few seasons. This in itself is a reasonable predictor, much like the weather of today is a pretty good predictor for the weather of tomorrow. Setting the costs like this, is exactly how a competition like MegaTomBike is made interesting. In the end, my way of team creation was kind of maximizing the disparity between both predictors.

Highest scoring teams

Now the season is over, we can revisit the knapsack approach and use the actual points to determine what the optimal team for this season would have been, and see how close the MegaTomBike winner managed to get. Many thanks to organizer Bob for sharing a csv file with final points and costs to facilitate this analysis (and not having to do all kinds of string matching like in the previous blog post).

Mathematical model

Let us repeat the mathematical model once more:

\[\begin{align*} \text{maximize}_{x_i} \quad & \sum_{i=1}^n p_i x_i \\ \text{subject to} \quad & \sum_{i=1}^n c_i x_i \leq C \\ & \sum_{i=1}^n x_i = N \\ & x_i \in \{0,1\} \end{align*}\]

where $p_i$ is the number of points collected by a rider, $c_i$ is the cost of a rider, $C$ is the total team cost limit (= 1.750.000), and $N$ is the total number of riders (= 15). $x_i$ is a binary variable which either selects or does not select rider $i$.

So, we are maximizing the total number of points, given all the constraints.

Constraint \(\sum_{i=1}^n x_i = N = 15\) is how this problem differs from the standard 0-1 knapsack problem.

Data

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
df = pd.read_csv(r"lense_riders.txt", delimiter=';')

df['name_'] = df['search_name'].map(lambda name: re.sub('\W+',' ', name)) #Fix Ben O'Connor error...

df = df[['name_', 'value', 'points']]

Let us look at the riders who scored the most points. Fellow countryman Wout van Aert did best! On the UCI ranking, he finished second after Tadej Pogacar, but obviously the MegaTomBike ranking is much more prestigious.

df.sort_values('points', ascending=False).head(15)
name_ value points
465 Wout van Aert 242000 316
549 Tadej Pogacar 324000 315
461 Primoz Roglic 400000 256
147 Julian Alaphilippe 291000 226
240 Richard Carapaz 156000 195
67 Sonny Colbrelli 140500 160
582 Mathieu van der Poel 249000 160
267 Adam Yates 143000 149
326 Michael Woods 145500 142
382 Alejandro Valverde 198500 136
239 Egan Bernal 174500 132
523 Jasper Stuyven 140000 130
468 Jonas Vingegaard 40000 126
78 Matej Mohoric 80500 126
256 Richie Porte 135000 124

Summing the values and points of the best 15 riders gives that scores 2693 points, but is way too expensive (2.85 million). But now we do know an upperbound of the maximum knapsack score: 2693 points.

df.sort_values('points', ascending=False).head(15).sum()
name_       Wout van AertTadej PogacarPrimoz RoglicJulian ...
value                                                 2859500
points                                                   2693
selected                                                  8.0
dtype: object

Solving the knapsack problem

Like before, we use Pyomo as the optimization modeling environment.

import pyomo.environ as pyo
points = {}
cost = {}
count = {}
for i, el in df.iterrows():
    points[el['name_']] = el['points']
    cost[el['name_']] = el['value']
    count[el['name_']] = 1.
cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=list(points.keys()))
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

Let’s run this thing!

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -2223; 4 nodes, 90 iterations, 0.054 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.1037588119506836
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

We got our solution with 979 variables in the blink of an eye: 0.054 seconds. Even though the knapsack problem is NP-hard, it is a common misconception that reasonably sized problems can not be solved to optimality pretty fast. More famous is the Traveling Salesman Problem of which the Traveling Sam Problem formulated by Nerdland is a fun instance with 24 variables. Even though there are $24! \approx 6.2 \times 10^{23}$ solutions, which is way too big to be brute forced, good algorithms exists which can solve a 24 city problem to (provable) optimality in less than a second. In fact, the current world record for a TSP having been solved to proven optimality stands at 57,912 stops and is a Dutch cycling route along national monuments.

So what does NP-hardness then mean? NP-hardness is a statement about asymptotic complexity behavior for the number of cities $N \to \infty$, and that in this limit, the computation times grow exponentially fast (kind of). But it might be that fast growing behavior only appears beyond $N > 10^{100}$. Not saying that this is the case for TSPs or knapsack problems, but the statement about reasonably sized problems being solvable in reasonable time stands.

Apologies for this intermezzo, as you are obviously more interested in the optimal team, so here it is:

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
name_ value points selected
465 Wout van Aert 242000 316 1.0
549 Tadej Pogacar 324000 315 1.0
147 Julian Alaphilippe 291000 226 1.0
240 Richard Carapaz 156000 195 1.0
67 Sonny Colbrelli 140500 160 1.0
267 Adam Yates 143000 149 1.0
78 Matej Mohoric 80500 126 1.0
468 Jonas Vingegaard 40000 126 1.0
150 Kasper Asgreen 114500 118 1.0
265 Dylan van Baarle 59000 103 1.0
18 Ben O Connor 34000 98 1.0
255 Thomas Pidcock 10500 86 1.0
66 Damiano Caruso 70000 81 1.0
199 Neilson Powless 29000 70 1.0
156 Mark Cavendish 2500 54 1.0
df[df['selected'] > 0.5].sort_values('points', ascending=False).sum()
name_       Wout van AertTadej PogacarJulian AlaphilippeRi...
value                                                 1736500
points                                                   2223
selected                                                 15.0
dtype: object

This team would have made 2223 points, which is 659 points more than the current winner!

Many of the big names and big scorers are selected. The highest scorers which are not selected are interestingly Primoz Roglic and Mathieu van der Poel.

df.sort_values('points', ascending=False).head(10)
name_ value points selected
465 Wout van Aert 242000 316 1.0
549 Tadej Pogacar 324000 315 1.0
461 Primoz Roglic 400000 256 0.0
147 Julian Alaphilippe 291000 226 1.0
240 Richard Carapaz 156000 195 1.0
67 Sonny Colbrelli 140500 160 1.0
582 Mathieu van der Poel 249000 160 0.0
267 Adam Yates 143000 149 1.0
326 Michael Woods 145500 142 0.0
382 Alejandro Valverde 198500 136 0.0
df['efficiency'] = 1000*df['points']/df['value']
df[df['selected']>0].sort_values('efficiency', ascending=False)
name_ value points selected efficiency
156 Mark Cavendish 2500 54 1.0 21.600000
255 Thomas Pidcock 10500 86 1.0 8.190476
468 Jonas Vingegaard 40000 126 1.0 3.150000
18 Ben O Connor 34000 98 1.0 2.882353
199 Neilson Powless 29000 70 1.0 2.413793
265 Dylan van Baarle 59000 103 1.0 1.745763
78 Matej Mohoric 80500 126 1.0 1.565217
465 Wout van Aert 242000 316 1.0 1.305785
240 Richard Carapaz 156000 195 1.0 1.250000
66 Damiano Caruso 70000 81 1.0 1.157143
67 Sonny Colbrelli 140500 160 1.0 1.138790
267 Adam Yates 143000 149 1.0 1.041958
150 Kasper Asgreen 114500 118 1.0 1.030568
549 Tadej Pogacar 324000 315 1.0 0.972222
147 Julian Alaphilippe 291000 226 1.0 0.776632

The following seven riders below proved to be among the most popular riders throughout the competition, and all followed through with excellent performances too. Five of them actually are in the optimal team too. Note by the way, how I closed my previous blog post with the observation that Mark Cavendish is a steal, but was not selected by my optimization due to him not having ridden and scored UCI points the year before. I decided not to swap him out for Dutch youngster Olav Kooij. Unfortunaly, Olav only scored 2 points, versus Cavendish’ 54 points.

popular = ['Wout van Aert', 'Mathieu van der Poel', 'Julian Alaphilippe',
           'Tadej Pogacar', 'Primoz Roglic',
           'Mark Cavendish', 'Thomas Pidcock']
sum(points[p] for p in popular)
1413
sum(cost[p] for p in popular)
1519000

A team consisting of all those riders would have already scored 1413 points, much better than I did, enough for a 15th spot in the ranking, and still leaving over 200.000 euro to spend. Let’s see how far we could get with a team having all these riders. In the code, we add equality constraints in which we set the binary selection variable for those riders to 1.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 1)

opt = pyo.SolverFactory("cbc.exe")

results = opt.solve(M)

print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -1957; 0 nodes, 22 iterations, 0.035 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.09378385543823242
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
name_ value points selected efficiency
465 Wout van Aert 242000 316 1.0 1.305785
549 Tadej Pogacar 324000 315 1.0 0.972222
461 Primoz Roglic 400000 256 1.0 0.640000
147 Julian Alaphilippe 291000 226 1.0 0.776632
582 Mathieu van der Poel 249000 160 1.0 0.642570
468 Jonas Vingegaard 40000 126 1.0 3.150000
265 Dylan van Baarle 59000 103 1.0 1.745763
18 Ben O Connor 34000 98 1.0 2.882353
255 Thomas Pidcock 10500 86 1.0 8.190476
199 Neilson Powless 29000 70 1.0 2.413793
156 Mark Cavendish 2500 54 1.0 21.600000
163 Mikkel Frolich Honore 25500 53 1.0 2.078431
76 Gino Mader 18500 46 1.0 2.486486
976 Gianni Marchand 2500 26 1.0 10.400000
111 Ide Schelling 14000 22 1.0 1.571429

With these 7 popular riders, the maximum achievable score is 1957 which is very good, but also quite a drop from 2223 points.

We can also ask the opposite question: What is the best team we could have created without any of these 7 popular riders. In the code, now the binary variables of these riders are put to 0.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 0)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -1940; 0 nodes, 6 iterations, 0.031 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.08577203750610352
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
name_ value points selected efficiency
240 Richard Carapaz 156000 195 1.0 1.250000
67 Sonny Colbrelli 140500 160 1.0 1.138790
267 Adam Yates 143000 149 1.0 1.041958
326 Michael Woods 145500 142 1.0 0.975945
239 Egan Bernal 174500 132 1.0 0.756447
523 Jasper Stuyven 140000 130 1.0 0.928571
78 Matej Mohoric 80500 126 1.0 1.565217
468 Jonas Vingegaard 40000 126 1.0 3.150000
256 Richie Porte 135000 124 1.0 0.918519
264 Geraint Thomas 133500 121 1.0 0.906367
215 David Gaudu 128500 120 1.0 0.933852
150 Kasper Asgreen 114500 118 1.0 1.030568
265 Dylan van Baarle 59000 103 1.0 1.745763
18 Ben O Connor 34000 98 1.0 2.882353
100 Wilco Kelderman 121000 96 1.0 0.793388

The maximum achievable score then is 1940, which is slightly less than the best team having all those riders.

Cheapest team that could have won

We can also turn the knapsack problem on its head: What is the cheapest team, that could have won (= total score of 1564 or more). In the code, the objective and the cost constraint need to be interchanged.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Constraint(expr=sum(points[i]*M.x[i] for i in M.ITEMS) >= 1564+1)
M.weight = pyo.Objective(expr=sum(cost[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 936500; 2 nodes, 47 iterations, 0.125 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.17966532707214355
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
name_ value points selected efficiency
465 Wout van Aert 242000 316 1.0 1.305785
240 Richard Carapaz 156000 195 1.0 1.250000
67 Sonny Colbrelli 140500 160 1.0 1.138790
78 Matej Mohoric 80500 126 1.0 1.565217
468 Jonas Vingegaard 40000 126 1.0 3.150000
265 Dylan van Baarle 59000 103 1.0 1.745763
18 Ben O Connor 34000 98 1.0 2.882353
202 Rigoberto Uran 93500 90 1.0 0.962567
255 Thomas Pidcock 10500 86 1.0 8.190476
199 Neilson Powless 29000 70 1.0 2.413793
156 Mark Cavendish 2500 54 1.0 21.600000
163 Mikkel Frolich Honore 25500 53 1.0 2.078431
76 Gino Mader 18500 46 1.0 2.486486
976 Gianni Marchand 2500 26 1.0 10.400000
977 Finn Fisher Black 2500 21 1.0 8.400000
df[df['selected'] > 0.5].sum()
name_         Ben O ConnorSonny ColbrelliGino MaderMatej Moh...
value                                                    936500
points                                                     1570
selected                                                   15.0
efficiency                                            69.569662
dtype: object

The cheapest team that could have won, costs 936.500 euro, which means that only 53 percent of the budget is used. There are a lot of riders shared with the optimal team, among who again Wout van Aert. What a guy!

Worst team

Finally, the worst team is a trivial problem: just select 15 (cheap?) riders with no points. One way to have a more interesting worst team, could be to set a lower limit on the team cost, e.g. 1.700.000 euro.

count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) >= 1700000)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 0)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 0; 0 nodes, 0 iterations, 0.013 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.058875083923339844
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
name_ value points selected efficiency
15 Bob Jungels 91500 0 1.0 0.0
94 Emanuel Buchmann 116000 0 1.0 0.0
226 Rudy Molard 67500 0 1.0 0.0
227 Thibaut Pinot 172500 0 1.0 0.0
284 Andrea Pasqualon 77000 0 1.0 0.0
308 Chris Froome 131500 0 1.0 0.0
333 John Degenkolb 99000 0 1.0 0.0
450 Dylan Groenewegen 135000 0 1.0 0.0
471 Fabio Aru 88000 0 1.0 0.0
515 Vincenzo Nibali 158000 0 1.0 0.0
550 Jan Polanc 68500 0 1.0 0.0
735 Eduard Prades 114500 0 1.0 0.0
819 Ilnur Zakarin 99500 0 1.0 0.0
879 Nairo Quintana 207500 0 1.0 0.0
925 Niki Terpstra 79000 0 1.0 0.0
df[df['selected'] > 0.5].sum()
name_         Bob JungelsEmanuel BuchmannRudy MolardThibaut ...
value                                                   1705000
points                                                        0
selected                                                   15.0
efficiency                                                  0.0
dtype: object

We managed to find a team that uses nearly all the money and still scores no points. This result could also have been achieved without a MILP solver, and by selection all riders with no points, and then sorting them from expensive to cheap and take the 15 most expensive.

Closing thoughts

As the bottleneck is in the quality of predicting scores (when combined with knapsack), perhaps I should consider building a machine-learning model to predict scores? Or perhaps apply a cool Bayesian rating system like Trueskill to the cycling data? Maybe this can be inspiration for extending Trueskill, which can handle both individual competition as well as team competition, but not a hybrid. The hybrid one is very typical in modern cycling where riders have both individual and team goals, which can sometimes be conflicting.


Want to leave a comment?

Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email.