Well, that was a disappointing season of MegaTomBike…

This post is the successor to this one, posted end of March 2021, right when the deadline for team submission was for the 2021 edition of MegaTomBike. MegaTomBike is a local fantasy cycling competition. The team I had submitted was generated through a mathematical optimization approach, namely as a type of knapsack problem. A mathematical approach had proven succesful for me in previous competition, but this year alas…

The last race on the calendar was on October 9th. This post contains some aftermath. My team, Poeske Scherens Fanclub finished 62nd out of 87 teams, with a score of 1089 points. The final winning team was Hala Die Mannschaft with a score of 1564, winning out by a mere 4 points over the second placed team Bekke - Algemeen. The third team lagged quite a bit further behind. The composition of the winning team is:

Rider	Points
Geraint Thomas	121
Egan Bernal	132
Tadej Pogačar	315
Mathieu van der Poel	160
Wout van Aert	316
Julian Alaphilippe	226
Quinn Simmons	1
Thomas Pidcock	86
Mark Cavendish	54
Remco Evenepoel	83
Tim Merlier	47
Gianni Vermeersch	19
Arjen Livyns	0
Mauri Vansevenant	3
Heinrich Haussler	1

Why did I not win?

The setup of MegaTomBike is to create a team of exactly 15 riders, with a cost not more than 1.750.000 euro. The team collects points during the season; the team with the most points wins.

The team I created was optimal in some mathematical sense, namely as the solution of a knapsack-type problem, where I searched for a team of 15 riders which satisfies the cost constraint, and maximizes the total number of points. The challenge of course is that the number of points collected by the end of the season is unknown, so one needs to make a prediction / use a proxy. The proxy I used were the UCI points from the season before. Now, we can conclude this was quite a poor proxy.

As was already discussed in the final section of the preceding blog post, the costs that the organizers of MegaTomBike have assigned to riders is not random, but actually also some kind rider level determined from cycling stats from the past few seasons. This in itself is a reasonable predictor, much like the weather of today is a pretty good predictor for the weather of tomorrow. Setting the costs like this, is exactly how a competition like MegaTomBike is made interesting. In the end, my way of team creation was kind of maximizing the disparity between both predictors.

Highest scoring teams

Now the season is over, we can revisit the knapsack approach and use the actual points to determine what the optimal team for this season would have been, and see how close the MegaTomBike winner managed to get. Many thanks to organizer Bob for sharing a csv file with final points and costs to facilitate this analysis (and not having to do all kinds of string matching like in the previous blog post).

Mathematical model

Let us repeat the mathematical model once more:

\[\begin{align*} \text{maximize}_{x_i} \quad & \sum_{i=1}^n p_i x_i \\ \text{subject to} \quad & \sum_{i=1}^n c_i x_i \leq C \\ & \sum_{i=1}^n x_i = N \\ & x_i \in \{0,1\} \end{align*}\]

where $p_i$ is the number of points collected by a rider, $c_i$ is the cost of a rider, $C$ is the total team cost limit (= 1.750.000), and $N$ is the total number of riders (= 15). $x_i$ is a binary variable which either selects or does not select rider $i$.

So, we are maximizing the total number of points, given all the constraints.

Constraint $\sum_{i=1}^n x_i = N = 15$ is how this problem differs from the standard 0-1 knapsack problem.

Data

%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import re

df = pd.read_csv(r"lense_riders.txt", delimiter=';')

df['name_'] = df['search_name'].map(lambda name: re.sub('\W+',' ', name)) #Fix Ben O'Connor error...

df = df[['name_', 'value', 'points']]

Let us look at the riders who scored the most points. Fellow countryman Wout van Aert did best! On the UCI ranking, he finished second after Tadej Pogacar, but obviously the MegaTomBike ranking is much more prestigious.

df.sort_values('points', ascending=False).head(15)

	name_	value	points
465	Wout van Aert	242000	316
549	Tadej Pogacar	324000	315
461	Primoz Roglic	400000	256
147	Julian Alaphilippe	291000	226
240	Richard Carapaz	156000	195
67	Sonny Colbrelli	140500	160
582	Mathieu van der Poel	249000	160
267	Adam Yates	143000	149
326	Michael Woods	145500	142
382	Alejandro Valverde	198500	136
239	Egan Bernal	174500	132
523	Jasper Stuyven	140000	130
468	Jonas Vingegaard	40000	126
78	Matej Mohoric	80500	126
256	Richie Porte	135000	124

Summing the values and points of the best 15 riders gives that scores 2693 points, but is way too expensive (2.85 million). But now we do know an upperbound of the maximum knapsack score: 2693 points.

df.sort_values('points', ascending=False).head(15).sum()

name_       Wout van AertTadej PogacarPrimoz RoglicJulian ...
value                                                 2859500
points                                                   2693
selected                                                  8.0
dtype: object

Solving the knapsack problem

Like before, we use Pyomo as the optimization modeling environment.

import pyomo.environ as pyo

points = {}
cost = {}
count = {}
for i, el in df.iterrows():
    points[el['name_']] = el['points']
    cost[el['name_']] = el['value']
    count[el['name_']] = 1.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=list(points.keys()))
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

Let’s run this thing!

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)

print(results)

Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -2223; 4 nodes, 90 iterations, 0.054 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.1037588119506836
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

We got our solution with 979 variables in the blink of an eye: 0.054 seconds. Even though the knapsack problem is NP-hard, it is a common misconception that reasonably sized problems can not be solved to optimality pretty fast. More famous is the Traveling Salesman Problem of which the Traveling Sam Problem formulated by Nerdland is a fun instance with 24 variables. Even though there are $24! \approx 6.2 \times 10^{23}$ solutions, which is way too big to be brute forced, good algorithms exists which can solve a 24 city problem to (provable) optimality in less than a second. In fact, the current world record for a TSP having been solved to proven optimality stands at 57,912 stops and is a Dutch cycling route along national monuments.

So what does NP-hardness then mean? NP-hardness is a statement about asymptotic complexity behavior for the number of cities $N \to \infty$, and that in this limit, the computation times grow exponentially fast (kind of). But it might be that fast growing behavior only appears beyond $N > 10^{100}$. Not saying that this is the case for TSPs or knapsack problems, but the statement about reasonably sized problems being solvable in reasonable time stands.

Apologies for this intermezzo, as you are obviously more interested in the optimal team, so here it is:

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)

	name_	value	points	selected
465	Wout van Aert	242000	316	1.0
549	Tadej Pogacar	324000	315	1.0
147	Julian Alaphilippe	291000	226	1.0
240	Richard Carapaz	156000	195	1.0
67	Sonny Colbrelli	140500	160	1.0
267	Adam Yates	143000	149	1.0
78	Matej Mohoric	80500	126	1.0
468	Jonas Vingegaard	40000	126	1.0
150	Kasper Asgreen	114500	118	1.0
265	Dylan van Baarle	59000	103	1.0
18	Ben O Connor	34000	98	1.0
255	Thomas Pidcock	10500	86	1.0
66	Damiano Caruso	70000	81	1.0
199	Neilson Powless	29000	70	1.0
156	Mark Cavendish	2500	54	1.0

df[df['selected'] > 0.5].sort_values('points', ascending=False).sum()

name_       Wout van AertTadej PogacarJulian AlaphilippeRi...
value                                                 1736500
points                                                   2223
selected                                                 15.0
dtype: object

This team would have made 2223 points, which is 659 points more than the current winner!

Many of the big names and big scorers are selected. The highest scorers which are not selected are interestingly Primoz Roglic and Mathieu van der Poel.

df.sort_values('points', ascending=False).head(10)

	name_	value	points	selected
465	Wout van Aert	242000	316	1.0
549	Tadej Pogacar	324000	315	1.0
461	Primoz Roglic	400000	256	0.0
147	Julian Alaphilippe	291000	226	1.0
240	Richard Carapaz	156000	195	1.0
67	Sonny Colbrelli	140500	160	1.0
582	Mathieu van der Poel	249000	160	0.0
267	Adam Yates	143000	149	1.0
326	Michael Woods	145500	142	0.0
382	Alejandro Valverde	198500	136	0.0

df['efficiency'] = 1000*df['points']/df['value']

df[df['selected']>0].sort_values('efficiency', ascending=False)

	name_	value	points	selected	efficiency
156	Mark Cavendish	2500	54	1.0	21.600000
255	Thomas Pidcock	10500	86	1.0	8.190476
468	Jonas Vingegaard	40000	126	1.0	3.150000
18	Ben O Connor	34000	98	1.0	2.882353
199	Neilson Powless	29000	70	1.0	2.413793
265	Dylan van Baarle	59000	103	1.0	1.745763
78	Matej Mohoric	80500	126	1.0	1.565217
465	Wout van Aert	242000	316	1.0	1.305785
240	Richard Carapaz	156000	195	1.0	1.250000
66	Damiano Caruso	70000	81	1.0	1.157143
67	Sonny Colbrelli	140500	160	1.0	1.138790
267	Adam Yates	143000	149	1.0	1.041958
150	Kasper Asgreen	114500	118	1.0	1.030568
549	Tadej Pogacar	324000	315	1.0	0.972222
147	Julian Alaphilippe	291000	226	1.0	0.776632

Popular riders

The following seven riders below proved to be among the most popular riders throughout the competition, and all followed through with excellent performances too. Five of them actually are in the optimal team too. Note by the way, how I closed my previous blog post with the observation that Mark Cavendish is a steal, but was not selected by my optimization due to him not having ridden and scored UCI points the year before. I decided not to swap him out for Dutch youngster Olav Kooij. Unfortunaly, Olav only scored 2 points, versus Cavendish’ 54 points.

popular = ['Wout van Aert', 'Mathieu van der Poel', 'Julian Alaphilippe',
           'Tadej Pogacar', 'Primoz Roglic',
           'Mark Cavendish', 'Thomas Pidcock']

sum(points[p] for p in popular)

sum(cost[p] for p in popular)

A team consisting of all those riders would have already scored 1413 points, much better than I did, enough for a 15th spot in the ranking, and still leaving over 200.000 euro to spend. Let’s see how far we could get with a team having all these riders. In the code, we add equality constraints in which we set the binary selection variable for those riders to 1.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 1)

opt = pyo.SolverFactory("cbc.exe")

results = opt.solve(M)

print(results)

Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -1957; 0 nodes, 22 iterations, 0.035 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.09378385543823242
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)

	name_	value	points	selected	efficiency
465	Wout van Aert	242000	316	1.0	1.305785
549	Tadej Pogacar	324000	315	1.0	0.972222
461	Primoz Roglic	400000	256	1.0	0.640000
147	Julian Alaphilippe	291000	226	1.0	0.776632
582	Mathieu van der Poel	249000	160	1.0	0.642570
468	Jonas Vingegaard	40000	126	1.0	3.150000
265	Dylan van Baarle	59000	103	1.0	1.745763
18	Ben O Connor	34000	98	1.0	2.882353
255	Thomas Pidcock	10500	86	1.0	8.190476
199	Neilson Powless	29000	70	1.0	2.413793
156	Mark Cavendish	2500	54	1.0	21.600000
163	Mikkel Frolich Honore	25500	53	1.0	2.078431
76	Gino Mader	18500	46	1.0	2.486486
976	Gianni Marchand	2500	26	1.0	10.400000
111	Ide Schelling	14000	22	1.0	1.571429

With these 7 popular riders, the maximum achievable score is 1957 which is very good, but also quite a drop from 2223 points.

Without any of the popular riders

We can also ask the opposite question: What is the best team we could have created without any of these 7 popular riders. In the code, now the binary variables of these riders are put to 0.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 0)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)

Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -1940; 0 nodes, 6 iterations, 0.031 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.08577203750610352
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)

	name_	value	points	selected	efficiency
240	Richard Carapaz	156000	195	1.0	1.250000
67	Sonny Colbrelli	140500	160	1.0	1.138790
267	Adam Yates	143000	149	1.0	1.041958
326	Michael Woods	145500	142	1.0	0.975945
239	Egan Bernal	174500	132	1.0	0.756447
523	Jasper Stuyven	140000	130	1.0	0.928571
78	Matej Mohoric	80500	126	1.0	1.565217
468	Jonas Vingegaard	40000	126	1.0	3.150000
256	Richie Porte	135000	124	1.0	0.918519
264	Geraint Thomas	133500	121	1.0	0.906367
215	David Gaudu	128500	120	1.0	0.933852
150	Kasper Asgreen	114500	118	1.0	1.030568
265	Dylan van Baarle	59000	103	1.0	1.745763
18	Ben O Connor	34000	98	1.0	2.882353
100	Wilco Kelderman	121000	96	1.0	0.793388

The maximum achievable score then is 1940, which is slightly less than the best team having all those riders.

Cheapest team that could have won

We can also turn the knapsack problem on its head: What is the cheapest team, that could have won (= total score of 1564 or more). In the code, the objective and the cost constraint need to be interchanged.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Constraint(expr=sum(points[i]*M.x[i] for i in M.ITEMS) >= 1564+1)
M.weight = pyo.Objective(expr=sum(cost[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)

Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 936500; 2 nodes, 47 iterations, 0.125 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.17966532707214355
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)

	name_	value	points	selected	efficiency
465	Wout van Aert	242000	316	1.0	1.305785
240	Richard Carapaz	156000	195	1.0	1.250000
67	Sonny Colbrelli	140500	160	1.0	1.138790
78	Matej Mohoric	80500	126	1.0	1.565217
468	Jonas Vingegaard	40000	126	1.0	3.150000
265	Dylan van Baarle	59000	103	1.0	1.745763
18	Ben O Connor	34000	98	1.0	2.882353
202	Rigoberto Uran	93500	90	1.0	0.962567
255	Thomas Pidcock	10500	86	1.0	8.190476
199	Neilson Powless	29000	70	1.0	2.413793
156	Mark Cavendish	2500	54	1.0	21.600000
163	Mikkel Frolich Honore	25500	53	1.0	2.078431
76	Gino Mader	18500	46	1.0	2.486486
976	Gianni Marchand	2500	26	1.0	10.400000
977	Finn Fisher Black	2500	21	1.0	8.400000

df[df['selected'] > 0.5].sum()

name_         Ben O ConnorSonny ColbrelliGino MaderMatej Moh...
value                                                    936500
points                                                     1570
selected                                                   15.0
efficiency                                            69.569662
dtype: object

The cheapest team that could have won, costs 936.500 euro, which means that only 53 percent of the budget is used. There are a lot of riders shared with the optimal team, among who again Wout van Aert. What a guy!

Worst team

Finally, the worst team is a trivial problem: just select 15 (cheap?) riders with no points. One way to have a more interesting worst team, could be to set a lower limit on the team cost, e.g. 1.700.000 euro.

count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) >= 1700000)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

M.popular = pyo.ConstraintList()
for p in popular:
    M.popular.add(M.x[p] == 0)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)

Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 9
  Number of variables: 979
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 0; 0 nodes, 0 iterations, 0.013 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.058875083923339844
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)

	name_	value	selected
15	Bob Jungels	91500	1.0
94	Emanuel Buchmann	116000	1.0
226	Rudy Molard	67500	1.0
227	Thibaut Pinot	172500	1.0
284	Andrea Pasqualon	77000	1.0
308	Chris Froome	131500	1.0
333	John Degenkolb	99000	1.0
450	Dylan Groenewegen	135000	1.0
471	Fabio Aru	88000	1.0
515	Vincenzo Nibali	158000	1.0
550	Jan Polanc	68500	1.0
735	Eduard Prades	114500	1.0
819	Ilnur Zakarin	99500	1.0
879	Nairo Quintana	207500	1.0
925	Niki Terpstra	79000	1.0

df[df['selected'] > 0.5].sum()

name_         Bob JungelsEmanuel BuchmannRudy MolardThibaut ...
value                                                   1705000
points                                                        0
selected                                                   15.0
efficiency                                                  0.0
dtype: object

We managed to find a team that uses nearly all the money and still scores no points. This result could also have been achieved without a MILP solver, and by selection all riders with no points, and then sorting them from expensive to cheap and take the 15 most expensive.

Closing thoughts

As the bottleneck is in the quality of predicting scores (when combined with knapsack), perhaps I should consider building a machine-learning model to predict scores? Or perhaps apply a cool Bayesian rating system like Trueskill to the cycling data? Maybe this can be inspiration for extending Trueskill, which can handle both individual competition as well as team competition, but not a hybrid. The hybrid one is very typical in modern cycling where riders have both individual and team goals, which can sometimes be conflicting.

MegaTomBike 2021 - team creation

Want to leave a comment?

Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email.

← Previous Post Next Post →