MegaTomBike 2022 Aftermath

and the new Scipy MILP solver

Posted by Lense Swaenen on January 04, 2023 · 19 mins read

Like the 2021 post, we compute what would have been the optimal team for Megatombike 2022, a local fantasy cycling competition.

This year, I didn’t get round to making a post on how I assembled my team. For the better, in hindsight, as it performed terribly. I hoped to get a team of overachievers as a combination of

  • up-and-coming talents who are still undervalued (like De Lie, Pidcock and Evenepoel)
  • former big riders who had a poor season last year (Froome, Dumoulin, …)

Especially the latter turned out to be a bad idea. Also, out of the expensive, high scoring riders, I kicked out Van Aert to make room for Alaphilippe at the last moment…

Anyways, on to more succesful teams. The winner of this year, once again, was Hala Die Mannschaft with a score of 2435, this year by a margin of almost 100 points over the second team. Very impressive!

Rider Points
LÓPEZ Miguel Ángel 74
JAKOBSEN Fabio 115
VAN AERT Wout 401
CAMPENAERTS Victor 74
VINE Jay 74
FOSS Tobias 45
POGAČAR Tadej 473
EVENEPOEL Remco 394
VALVERDE Alejandro 176
MOHORIČ Matej 125
LAMPAERT Yves 22
MARTÍNEZ Daniel Felipe 191
SHEFFIELD Magnus 80
VADER Milan 0
DE LIE Arnaud 191

Note that the overall scores this year are about 60% higher than last year, because of an extended calender incorporating many smaller races

The optimal team

I have copied some math and code from previous year to show the mathematical model and how to solve that using the combination of the Pyomo optimization modelling languange along with open-source MILP (mixed integer linear programming) solver CBC. However, we have a new kid on the block as well, stay tuned!

Mathematical model

Let us repeat the mathematical model once more:

\[\begin{align*} \text{maximize}_{x_i} \quad & \sum_{i=1}^n p_i x_i \\ \text{subject to} \quad & \sum_{i=1}^n c_i x_i \leq C \\ & \sum_{i=1}^n x_i = N \\ & x_i \in \{0,1\} \end{align*}\]

where $p_i$ is the number of points collected by a rider, $c_i$ is the cost of a rider, $C$ is the total team cost limit (= 1.750.000), and $N$ is the total number of riders (= 15). $x_i$ is a binary variable which either selects or does not select rider $i$.

So, we are maximizing the total number of points, given all the constraints.

Constraint \(\sum_{i=1}^n x_i = N = 15\) is how this problem differs from the standard 0-1 knapsack problem.

Data

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv(r"C:\Users\swaenenl\OneDrive - Sioux Group B.V\Documents\Personal\website\notebooks\megatombike\2022_results\2022_results\riders.csv", delimiter=';')

df.head(15)
id name price points id.1 name.1 selected
0 527 POGAČAR Tadej 437000 473 18 UAE Team Emirates 1.0
1 322 VAN AERT Wout 320000 401 11 Jumbo-Visma 1.0
2 395 EVENEPOEL Remco 152500 394 14 Quick-Step Alpha Vinyl Team 1.0
3 328 VINGEGAARD Jonas 144000 247 11 Jumbo-Visma 1.0
4 113 VLASOV Aleksandr 174000 213 4 BORA - hansgrohe 0.0
5 315 LAPORTE Christophe 96500 213 11 Jumbo-Visma 1.0
6 96 HIGUITA Sergio 87500 195 4 BORA - hansgrohe 1.0
7 561 VAN DER POEL Mathieu 260000 195 19 Alpecin-Fenix 0.0
8 219 MARTÍNEZ Daniel Felipe 75000 191 8 INEOS Grenadiers 1.0
9 335 DE LIE Arnaud 15000 191 12 Lotto Soudal 1.0
10 189 KÜNG Stefan 121500 187 7 Groupama - FDJ 1.0
11 61 BILBAO Pello 102500 187 3 Bahrain - Victorious 1.0
12 548 MERLIER Tim 144000 181 19 Alpecin-Fenix 0.0
13 383 VALVERDE Alejandro 159500 176 13 Movistar Team 0.0
14 434 MATTHEWS Michael 150000 153 15 Team BikeExchange - Jayco 0.0

The csv file, once again delivered to me by organizer Bob, has the riders already sorted according to their points. The table does have some duplicates which we need to fix. We see 2 Belgian riders in the top 3! I believe one of the reasons of some of my Megatombike successes around 2012 were due to me not having a bias towards Belgian riders unlike most MTB participants. Back then, that was an advantage. Nowadays, not so much!

df = df.drop_duplicates(['name'])
df.sort_values('points', ascending=False).head(15).sum()
id                                                       4826
name        POGAČAR TadejVAN AERT WoutEVENEPOEL RemcoVINGE...
price                                                 2439000
points                                                   3597
id.1                                                      169
name.1      UAE Team EmiratesJumbo-VismaQuick-Step Alpha V...
selected                                                  7.0
dtype: object

Summing the values and points of the best 15 riders gives a team that scores 3597 points, but is way too expensive (2.4 million). Still, this doesn’t exceed the 1.75 million budget as much as last year (then it was 2.85 million). And in fact, starting from the top, we would be able to buy the 9 highest scoring riders, for a point total of 2522. Enough to beat out Hala Die Mannschaft!

df.sort_values('points', ascending=False).head(9).sum()
id                                                       2876
name        POGAČAR TadejVAN AERT WoutEVENEPOEL RemcoVINGE...
price                                                 1746500
points                                                   2522
id.1                                                      100
name.1      UAE Team EmiratesJumbo-VismaQuick-Step Alpha V...
selected                                                  5.0
dtype: object

Pyomo + CBC solve

import pyomo.environ as pyo
points = {}
cost = {}
count = {}
dups = []
for i, el in df.iterrows():
    name = el['name']
    
    points[name] = el['points']
    cost[name] = el['price']
    count[name] = 1.
cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=list(points.keys()))
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

Let’s run this thing!

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 963
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -3263; 0 nodes, 19 iterations, 0.025 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.07400059700012207
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

It never ceases to amaze how fast MILP solvers can solve (with guaranteed optimality) these NP-hard problems. If we take the point of view that the number of candidate solutions is dictated by choosing a combination of 15 elements out of 963, we can use the scipy implementation of $N$-choose-$k$ to get an impression of the size of the solution space and the NP-hardness of it:

import scipy.special
scipy.special.comb(963, 15)
3.893159932009228e+32

The results structure output reports a computation time of ~65 milliseconds to find the optimal solution among $3.89 \times 10^{32}$ candidate solutions. Of course, it does so much faster than brute forcing its way through all those candidates.

%%timeit
opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
111 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If we do a custom timing of the solve, that computation time doubles to 120-130 milliseconds. CBC is included in Pyomo as an executable and the interfacing is file-system-based: Pyomo writes to files which get loaded by cbc.exe (which you have to put somewhere in the path, e.g. in the same directory as the Jupyter notebook I was using.

But you were interested in the optimal team, right?! Here goes:

df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
id name price points id.1 name.1 selected
0 527 POGAČAR Tadej 437000 473 18 UAE Team Emirates 1.0
1 322 VAN AERT Wout 320000 401 11 Jumbo-Visma 1.0
2 395 EVENEPOEL Remco 152500 394 14 Quick-Step Alpha Vinyl Team 1.0
3 328 VINGEGAARD Jonas 144000 247 11 Jumbo-Visma 1.0
5 315 LAPORTE Christophe 96500 213 11 Jumbo-Visma 1.0
6 96 HIGUITA Sergio 87500 195 4 BORA - hansgrohe 1.0
8 219 MARTÍNEZ Daniel Felipe 75000 191 8 INEOS Grenadiers 1.0
9 335 DE LIE Arnaud 15000 191 12 Lotto Soudal 1.0
10 189 KÜNG Stefan 121500 187 7 Groupama - FDJ 1.0
11 61 BILBAO Pello 102500 187 3 Bahrain - Victorious 1.0
21 445 ARENSMAN Thymen 22000 134 16 Team DSM 1.0
29 226 RODRIGUEZ Carlos 57500 120 8 INEOS Grenadiers 1.0
31 252 MEINTJES Louis 41500 117 9 Intermarché - Wanty - Gobert Matériaux 1.0
35 507 AYUSO Juan 40000 110 18 UAE Team Emirates 1.0
41 247 HERMANS Quinten 30000 103 9 Intermarché - Wanty - Gobert Matériaux 1.0
df[df['selected'] > 0.5].sum()
id                                                       4464
name        POGAČAR TadejVAN AERT WoutEVENEPOEL RemcoVINGE...
price                                                 1742500
points                                                   3263
id.1                                                      159
name.1      UAE Team EmiratesJumbo-VismaQuick-Step Alpha V...
selected                                                 15.0
dtype: object

This team would have scored a whopping 3263 points: more than 800 points more than the winner. Easier said than done, of course!

A little bit more analysis: We saw before how the total price of the top 15 scorers doesn’t exceed the budget as much as last year. Now we see that the optimal team has 9 out of the 11 highest scoring riders in it. Only numbers 4 and 7 are not selected: Vlasov and van der Poel.

These observations may indicate that riders are relatively cheap this year, and one can buy almost all of the highest scoring riders. Once you get to the point that you can buy all 15 highest scoring riders with the given budget, the knapsack aspect of the fantasy cycling competition becomes a bit moot, and it becomes only about predicting who will be the highest scoring riders.

However, if one looks at the 15 most expensive riders this season, only 2 of them make it to the optimal team, and there is quite a discrepancy between rider prices and scores. In fact, the 15 most expensive riders together would have scored in total 2226 points, which would not even have won the 2022 MTB competition!

Anyways, I’m not sure whether this all means I should recommend the organisers to make the budgets tighter again, or not…

df.sort_values(['price'], ascending=False).head(15)
id name price points id.1 name.1 selected
0 527 POGAČAR Tadej 437000 473 18 UAE Team Emirates 0.0
16 318 ROGLIČ Primož 377000 148 11 Jumbo-Visma 0.0
1 322 VAN AERT Wout 320000 401 11 Jumbo-Visma 0.0
147 385 ALAPHILIPPE Julian 280000 32 14 Quick-Step Alpha Vinyl Team 0.0
7 561 VAN DER POEL Mathieu 260000 195 19 Alpecin-Fenix 0.0
30 505 ALMEIDA João 219000 118 18 UAE Team Emirates 0.0
541 208 BERNAL Egan 193000 0 8 INEOS Grenadiers 0.0
22 209 CARAPAZ Richard 182500 133 8 INEOS Grenadiers 0.0
92 504 ACKERMANN Pascal 175500 47 18 UAE Team Emirates 0.0
4 113 VLASOV Aleksandr 174000 213 4 BORA - hansgrohe 0.0
141 64 COLBRELLI Sonny 161000 33 3 Bahrain - Victorious 0.0
20 237 YATES Adam 160500 135 8 INEOS Grenadiers 0.0
13 383 VALVERDE Alejandro 159500 176 13 Movistar Team 0.0
55 185 GAUDU David 154000 76 7 Groupama - FDJ 0.0
98 297 WOODS Michael 153000 46 10 Israel - Premier Tech 0.0
df.sort_values(['price'], ascending=False).head(15).sum()
id                                                       4818
name        POGAČAR TadejROGLIČ PrimožVAN AERT WoutALAPHIL...
price                                                 3406000
points                                                   2226
id.1                                                      170
name.1      UAE Team EmiratesJumbo-VismaJumbo-VismaQuick-S...
selected                                                  0.0
dtype: object

A final bit of analysis: In 2021, the highest team scored 1564 points, while the maximum achievable was 2223. The ratio of these two numbers is 70%. In 2022, we have a ratio of $2435/3263 \approx 75\%$. So there seems to be some consistency on how optimal you have to be to win MTB.

Auxiliary

Cheapest team that could have won

We can also turn the knapsack problem on its head: What is the cheapest team, that could have won (= total score of 2436 or more). In the code, an objective and constraint need to be swapped.

cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Constraint(expr=sum(points[i]*M.x[i] for i in M.ITEMS) >= 2436+1)
M.weight = pyo.Objective(expr=sum(cost[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 963
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 908000; 0 nodes, 50 iterations, 0.072 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.11903929710388184
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
id name price points id.1 name.1 selected
2 395 EVENEPOEL Remco 152500 394 14 Quick-Step Alpha Vinyl Team 1.0
3 328 VINGEGAARD Jonas 144000 247 11 Jumbo-Visma 1.0
5 315 LAPORTE Christophe 96500 213 11 Jumbo-Visma 1.0
6 96 HIGUITA Sergio 87500 195 4 BORA - hansgrohe 1.0
8 219 MARTÍNEZ Daniel Felipe 75000 191 8 INEOS Grenadiers 1.0
9 335 DE LIE Arnaud 15000 191 12 Lotto Soudal 1.0
11 61 BILBAO Pello 102500 187 3 Bahrain - Victorious 1.0
21 445 ARENSMAN Thymen 22000 134 16 Team DSM 1.0
29 226 RODRIGUEZ Carlos 57500 120 8 INEOS Grenadiers 1.0
31 252 MEINTJES Louis 41500 117 9 Intermarché - Wanty - Gobert Matériaux 1.0
35 507 AYUSO Juan 40000 110 18 UAE Team Emirates 1.0
36 312 KOOIJ Olav 41500 109 11 Jumbo-Visma 1.0
41 247 HERMANS Quinten 30000 103 9 Intermarché - Wanty - Gobert Matériaux 1.0
53 228 SHEFFIELD Magnus 2500 80 8 INEOS Grenadiers 1.0
78 980 POZZOVIVO Domenico 0 55 9 Intermarché - Wanty - Gobert Matériaux 1.0
df[df['selected'] > 0.5].sum()
id                                                       4946
name        EVENEPOEL RemcoVINGEGAARD JonasLAPORTE Christo...
price                                                  908000
points                                                   2446
id.1                                                      151
name.1      Quick-Step Alpha Vinyl TeamJumbo-VismaJumbo-Vi...
selected                                                 15.0
dtype: object

The cheapest team that could have won, costs 908.000 euro, which means that only about 50 percent of the budget is used! This team shares 12 riders with the optimal team, and swaps out

  • Pogacar
  • Van Aert
  • Kung

for

  • Kooij
  • Sheffield
  • Pozzovivo

This corroborates the discrepancy between prices and scores this year: Many riders that scored high were quite cheap. And vice versa many expensive riders scored below par. Only Pogacar and Van Aert lived up to the expectations.

Worst team

Finally, the worst team is a trivial problem: just select 15 (cheap?) riders with no points. One way to have a more interesting worst team, could be to set a lower limit on the team cost, e.g. 1.745.000 euro.

count_required = 15

M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.minimize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) >= cost_limit - 5000)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 963
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective 70; 0 nodes, 19 iterations, 0.031 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.0669410228729248
Solution: 
- number of solutions: 0
  number of solutions displayed: 0
df['selected'] = [v.value for v in M.component_data_objects(pyo.Var)]
df[df['selected'] > 0.5].sort_values('points', ascending=False)
id name price points id.1 name.1 selected
147 385 ALAPHILIPPE Julian 280000 32 14 Quick-Step Alpha Vinyl Team 1.0
232 235 VIVIANI Elia 146500 14 8 INEOS Grenadiers 1.0
245 396 HONORÉ Mikkel Frølich 115000 11 14 Quick-Step Alpha Vinyl Team 1.0
308 99 KELDERMAN Wilco 109500 6 4 BORA - hansgrohe 1.0
348 109 SCHACHMANN Maximilian 137500 4 4 BORA - hansgrohe 1.0
391 401 MASNADA Fausto 78500 1 14 Quick-Step Alpha Vinyl Team 1.0
413 305 DUMOULIN Tom 127500 1 11 Jumbo-Visma 1.0
419 17 PARET-PEINTRE Aurélien 88000 1 1 AG2R Citroën Team 1.0
541 208 BERNAL Egan 193000 0 8 INEOS Grenadiers 1.0
542 508 BENNETT George 86500 0 18 UAE Team Emirates 1.0
544 784 ZAKARIN Ilnur 71500 0 29 Gazprom - RusVelo 1.0
545 78 POELS Wout 80000 0 3 Bahrain - Victorious 1.0
546 389 CATTANEO Mattia 69500 0 14 Quick-Step Alpha Vinyl Team 1.0
548 47 MOSCON Gianni 85000 0 2 Astana Qazaqstan Team 1.0
605 167 PADUN Mark 80000 0 6 EF Education-EasyPost 1.0
df[df['selected'] > 0.5].sum()
id                                                       3902
name        ALAPHILIPPE JulianVIVIANI EliaHONORÉ Mikkel Fr...
price                                                 1759500
points                                                     71
id.1                                                      142
name.1      Quick-Step Alpha Vinyl TeamINEOS GrenadiersQui...
selected                                                 15.0
dtype: object

Last year, we could find a team that used all the money and scored no points. This year, if we spent nearly all the money, we’re still getting about 70 points. Julien Alaphilippe and Tom Dumoulin are two of the worst riders one could have chosen… which I did…

Scipy solution

This final section is for those interested in the mathematical optimization aspect.

I feel like there is quite a lot of movement in the MILP solver landscape. In the land of commercial solvers, some new, mostly Chinese, solvers are speeding their ways up to the tops of the benchmarks. And some other contenders are pulling out of the benchmarks due to some Gurobi scandal. The developers of the Chinese COPT seem to be confident that they will overtake Gurobi in the next years: https://www.youtube.com/watch?v=iqiBXoJQVD8&t=258s

In open source land, HiGHs is the big newcomer. A major development with respect to last years post, is that in the summer of 2022, with Scipy version 1.9.0, the scipy.optimize submodule got extended with a milp routine, which runs HiGHs in the backend. I think this is quite a big deal. Scipy is a much more common Python library/dependency than Pyomo. For example, the Anaconda conda-forge channel has (at the time of writing) 600k installs for Pyomo, while Scipy has 31M, which is 50 times more! Moreover you don’t need to install an open source MILP solver like CBC separately. This makes MILP solvers much more broadly accessible. The main current downside is that the interface is much more barebones, and you don’t get all the nice modeling tools (named variables etc.) that you get with Pyomo.

Below, we work out the MTB knapsack problem with scipy.optimize.milp. This is really easy to do, as the Scipy documentation already has a knapsack example: https://scipy.github.io/devdocs/tutorial/optimize.html#knapsack-problem-example All we need to add is the constraint that the number of riders needs to equal 15.

import scipy.optimize as sciopt
scipy.__version__
'1.9.3'
points_arr = np.array(list(points.values()))
cost_arr = np.array(list(cost.values()))
count_arr = np.array(list(count.values()))
bounds = sciopt.Bounds(0, 1)  # 0 <= x_i <= 1
integrality = np.full_like(points_arr, True)  # x_i are integers

cost_constraints = sciopt.LinearConstraint(A=cost_arr, lb=0, ub=cost_limit)
count_constraints = sciopt.LinearConstraint(A=count_arr, lb=count_required, ub=count_required)

res = sciopt.milp(c=-points_arr, constraints=[cost_constraints, count_constraints],
           integrality=integrality, bounds=bounds)
res.fun
-3263.0
sel = np.array(df['selected'])
np.max(np.abs(res.x - sel))

The solver finds the exact same solution with the same maximum point total of 3263.

%%timeit
res = sciopt.milp(c=-points_arr, constraints=[cost_constraints, count_constraints],
           integrality=integrality, bounds=bounds)
20.9 ms ± 716 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If we benchmark it, we see that on this problem Scipy+HiGHs is 3 - 6 times faster than the Pyomo+CBC combination (depending on which timing number you take)! Part of it is probably due to the solver being interfaced as a normal library rather than a seperate executable running in a different process.

SCIP

Finally, I’d like to report some related news on the SCIP MILP solver, which has always dwelled a bit in between open source and commercial solvers (both in license and in benchmark results). In November 2022, SCIP switched to the fully open source Apache 2.0 license. SCIP can be used from Python through the PySCIPOpt package. This has 130k downloads through conda-forge so seems the least popular option at the moment, though that may change due to the recent license changes. I haven’t gotten round to giving it a go. I wouldn’t be surprised if it turns out to be the fastest open source solver on this knapsack problem.


Want to leave a comment?

Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email.