This post describes how I generated my fantasy cycling team for the MegaTomBike 2021 competition. This is a local variant of the bigger Megabike competition, organized by my friend Bob and others.
The challenge is simple: Create a team of 15 riders, each having some price tag, such that the total team cost is below 1.750.000 euro. Throughout the 2021 cycling season, riders will collect points according to their performance in all competitions on the calendar (single day & multiple day events).
This is the first edition I’m participating in since my previous participations from 2011-2013. Though I’m far from a cycling enthusiast or expert, I’ve been quite succesful in those past competitions by simply taking a quantative approach. Mathematically, MegaTomBike can be considered a variant of a 0-1 Knapsack problem, but more on that later. In 2013 and before, I was not yet familiar with discrete optimization problems and integer programming (a significant gap in the KULeuven mathematical engineering curriculum imo). My approach back then consisted of playing around in Excel, sorting riders according to some efficiency metric, and manual puzzling my way to a decent team. This approach yielded multiple top-3 final rankings (out of ~100 participating teams). Key to the successes of those teams I believe were
This year, I am tackling the team creation as a 0-1 knapsack problem:
\[\begin{align*} \text{maximize}_{x_i} \quad & \sum_{i=1}^n p_i x_i \\ \text{subject to} \quad & \sum_{i=1}^n c_i x_i \leq C \\ & \sum_{i=1}^n x_i = N \\ & x_i \in \{0,1\} \end{align*}\]where $p_i$ is the number of points collected by a rider, $c_i$ is the cost of a rider, $C$ is the total team cost limit (= 1.750.000), and $N$ is the total number of riders (= 15). $x_i$ is a binary variable which either selects or does not select rider $i$.
So, we are maximizing the total number of points, given all the constraints.
Constraint
\[\sum_{i=1}^n x_i = N (= 15)\]is how this problem differs from the standard 0-1 knapsack problem.
We already know $C$ and $N$. The rider costs $c_i$ can luckily be downloaded from the MegaTomBike website in Excel format (column ‘Waarde’).
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.read_excel(r'renners.xls').head()
Renner | Waarde | |
---|---|---|
0 | BIDARD François | 14000 |
1 | BOUCHARD Geoffrey | 13500 |
2 | CALMEJANE Lilian | 66000 |
3 | CHAMPOUSSIN Clément | 17500 |
4 | CHEREL Mikaël | 15000 |
The hard one is $p_i$. All races are still to be ridden, so we will have to look for a proxy. I have chosen to use the individual UCI road ranking points from 2020. Luckily, also the UCI website allows exporting to Excel.
pd.read_excel(r'IndividualRanking.xlsx').head()
Rank | UCI ID | Name | Nationality | Team Code | Age | Points | |
---|---|---|---|---|---|---|---|
0 | 1 | 10008888921 | ROGLIČ Primož | SLOVENIA | TJV | 31 | 4237.0 |
1 | 2 | 10014972740 | POGAČAR Tadej | SLOVENIA | UAD | 22 | 3055.0 |
2 | 3 | 10007585986 | VAN AERT Wout | BELGIUM | TJV | 26 | 2700.0 |
3 | 4 | 10007946203 | VAN DER POEL Mathieu | NETHERLANDS | AFC | 25 | 2040.0 |
4 | 5 | 10003401246 | FUGLSANG Jakob | DENMARK | AST | 35 | 1961.0 |
The hard part now is actually matching those two Excel files… Both have a column with the rider name, but not all have the exact same spelling or diacritics, for example BERNAL Egan
vs BERNAL GOMEZ Egan Arley
. Season 2021 will also have some new young riders competing who did not compete in 2020.
Below is my best effort for matching the riders in the two tables, by ignoring the case, removing diacritics (with the unidecode
module), matching only parts of names, …
import unidecode #To remove diacritics
mtb = pd.read_excel(r'renners.xls')
mtb['Renner'] = [unidecode.unidecode(name.lower()) for name in list(mtb['Renner'])]
uci = pd.read_excel(r'IndividualRanking.xlsx')
uci['Name'] = [unidecode.unidecode(name.lower()) for name in list(uci['Name'])]
mtb.sort_values('Waarde', ascending=False).head()
Renner | Waarde | |
---|---|---|
462 | roglic primoz | 400000 |
550 | pogacar tadej | 324000 |
147 | alaphilippe julian | 291000 |
583 | van der poel mathieu | 249000 |
466 | van aert wout | 242000 |
uci.head(5)
Rank | UCI ID | Name | Nationality | Team Code | Age | Points | |
---|---|---|---|---|---|---|---|
0 | 1 | 10008888921 | roglic primoz | SLOVENIA | TJV | 31 | 4237.0 |
1 | 2 | 10014972740 | pogacar tadej | SLOVENIA | UAD | 22 | 3055.0 |
2 | 3 | 10007585986 | van aert wout | BELGIUM | TJV | 26 | 2700.0 |
3 | 4 | 10007946203 | van der poel mathieu | NETHERLANDS | AFC | 25 | 2040.0 |
4 | 5 | 10003401246 | fuglsang jakob | DENMARK | AST | 35 | 1961.0 |
mtb_names = list(mtb.sort_values('Waarde', ascending=False)['Renner'])
mtb_costs = list(mtb.sort_values('Waarde', ascending=False)['Waarde'])
uci_names = list(uci['Name'])
mtb_lasts = [mtb_name.split(' ')[0] for mtb_name in mtb_names]
mtb_firsts = [mtb_name.split(' ')[-1] for mtb_name in mtb_names]
uci_lasts = [uci_name.split(' ')[0] for uci_name in uci_names]
uci_firsts = [uci_name.split(' ')[-1] for uci_name in uci_names]
n_found = 0
n_missing = 0
missing = []
match_dict = {}
for i, mtb_name in enumerate(mtb_names): #mtb.sort_values('Waarde', ascending=False)['Renner']:
mtb_split = mtb_name.lower().split(' ')
n_matches = []
for j, uci_name in enumerate(uci_names):
n_match = 0
for jj, el in enumerate(mtb_split):
if el in uci_name.split(' '):
if jj == 0 and len(el) > 3:
n_match += 2 #More weight on first name match
else:
n_match += 1
n_matches.append(n_match)
k = n_matches.index(max(n_matches))
if n_matches[k] >= 2 and uci_names[k] not in match_dict.values():
n_found += 1
uci_name = uci_names[k]
match_dict[mtb_name] = uci_name
#print(mtb_name, ' ------ ', uci_name)
else:
n_missing += 1
print(mtb_name, mtb_costs[i])
print(n_found, n_missing)
martin dan 156500
hirschi marc 140000
froome chris 131500
izagirre gorka 82500
ghirmay hailu biniam 64500
...
cavendish mark 2500
...
812 165
To conclude the data alignment: The MTB rider list has 997 riders. We are able to find an equivalent one in the UCI table for 812 out of them. 165 ones we don’t find. In the output above, those are ranked from most expensive to least expensive. Some comments:
match_dict['martin dan'] = 'martin daniel'
mtb_points = []
for mtb_name in mtb['Renner']:
try:
uci_name = match_dict[mtb_name]
points = uci[uci['Name'] == uci_name].Points.mean()
mtb_points.append(points)
except:
mtb_points.append(0.)
mtb['Points'] = mtb_points
mtb.sort_values('Waarde', ascending=False).head(10)
Renner | Waarde | Points | Efficiency | |
---|---|---|---|---|
462 | roglic primoz | 400000 | 4237.00 | 0.010592 |
550 | pogacar tadej | 324000 | 3055.00 | 0.009429 |
147 | alaphilippe julian | 291000 | 1795.83 | 0.006171 |
583 | van der poel mathieu | 249000 | 2040.00 | 0.008193 |
466 | van aert wout | 242000 | 2700.00 | 0.011157 |
25 | van avermaet greg | 225000 | 650.00 | 0.002889 |
540 | kristoff alexander | 223000 | 857.00 | 0.003843 |
880 | quintana nairo | 207500 | 865.00 | 0.004169 |
382 | valverde alejandro | 198500 | 716.00 | 0.003607 |
89 | ackermann pascal | 197000 | 1248.00 | 0.006335 |
mtb.sort_values('Points', ascending=False).head(10)
Renner | Waarde | Points | Efficiency | |
---|---|---|---|---|
462 | roglic primoz | 400000 | 4237.00 | 0.010592 |
550 | pogacar tadej | 324000 | 3055.00 | 0.009429 |
466 | van aert wout | 242000 | 2700.00 | 0.011157 |
583 | van der poel mathieu | 249000 | 2040.00 | 0.008193 |
39 | fuglsang jakob | 178000 | 1961.00 | 0.011017 |
147 | alaphilippe julian | 291000 | 1795.83 | 0.006171 |
256 | porte richie | 135000 | 1733.00 | 0.012837 |
556 | ulissi diego | 170500 | 1671.00 | 0.009801 |
213 | demare arnaud | 172500 | 1550.00 | 0.008986 |
539 | hirschi marc | 140000 | 1430.00 | 0.010214 |
As we can see, these rankings match quite nicely. Bob told me they actually based the ‘Waarde’ values on ProCyclingStats data (looking further back than last year).
Now my old approach would have been to play around with ‘efficiency’.
mtb['Efficiency'] = mtb['Points']/mtb['Waarde']
mtb.sort_values('Waarde', ascending=False).head(20)
Renner | Waarde | Points | Efficiency | |
---|---|---|---|---|
462 | roglic primoz | 400000 | 4237.00 | 0.010592 |
550 | pogacar tadej | 324000 | 3055.00 | 0.009429 |
147 | alaphilippe julian | 291000 | 1795.83 | 0.006171 |
583 | van der poel mathieu | 249000 | 2040.00 | 0.008193 |
466 | van aert wout | 242000 | 2700.00 | 0.011157 |
25 | van avermaet greg | 225000 | 650.00 | 0.002889 |
540 | kristoff alexander | 223000 | 857.00 | 0.003843 |
880 | quintana nairo | 207500 | 865.00 | 0.004169 |
382 | valverde alejandro | 198500 | 716.00 | 0.003607 |
89 | ackermann pascal | 197000 | 1248.00 | 0.006335 |
334 | ewan caleb | 189000 | 971.00 | 0.005138 |
145 | viviani elia | 180000 | 254.00 | 0.001411 |
39 | fuglsang jakob | 178000 | 1961.00 | 0.011017 |
447 | dumoulin tom | 177500 | 700.00 | 0.003944 |
401 | matthews michael | 175000 | 1022.00 | 0.005840 |
160 | evenepoel remco | 175000 | 1193.00 | 0.006817 |
239 | bernal egan | 174500 | 425.50 | 0.002438 |
227 | pinot thibaut | 172500 | 863.00 | 0.005003 |
213 | demare arnaud | 172500 | 1550.00 | 0.008986 |
556 | ulissi diego | 170500 | 1671.00 | 0.009801 |
mtb.sort_values('Efficiency', ascending=False).head(20)
Renner | Waarde | Points | Efficiency | |
---|---|---|---|---|
777 | repa vojtech | 4000 | 175.00 | 0.043750 |
940 | johansen julius | 2500 | 108.43 | 0.043372 |
567 | meisen marcel | 2500 | 100.00 | 0.040000 |
939 | hvideberg jonas iversby | 6000 | 218.00 | 0.036333 |
933 | charmig anthon | 2500 | 86.00 | 0.034400 |
740 | zingle axel | 2500 | 75.00 | 0.030000 |
656 | aniolkowski stanislaw | 11000 | 328.00 | 0.029818 |
953 | urianstad martin | 2500 | 70.00 | 0.028000 |
405 | peak barnabas | 3500 | 90.00 | 0.025714 |
741 | albanese vincenzo | 2500 | 59.00 | 0.023600 |
602 | rumac josip | 7500 | 170.00 | 0.022667 |
712 | murguialday jokin | 2500 | 56.00 | 0.022400 |
877 | pajur markus | 6000 | 126.00 | 0.021000 |
454 | kooij olav | 14500 | 304.00 | 0.020966 |
54 | romo javier | 2500 | 50.00 | 0.020000 |
791 | cuadrado unai | 2500 | 50.00 | 0.020000 |
873 | louvel matis | 3000 | 56.00 | 0.018667 |
847 | marit arne | 2500 | 46.00 | 0.018400 |
857 | weemaes sasha | 3000 | 53.00 | 0.017667 |
158 | declercq tim | 23500 | 415.00 | 0.017660 |
The spread in efficiency among top riders is not huge: 0.005 - 0.01 UCI points per euro. The best ones go up to about 0.015.
We use Pyomo as the optimization modeling environment. The problem to solve is a MILP (actually a binary linear program). We will use open source solver CBC to do the actual solving. The code below is modified from the Pyomo knapsack example.
import pyomo.environ as pyo
points = {}
cost = {}
count = {}
for i, el in mtb.iterrows():
points[el['Renner']] = el['Points']
cost[el['Renner']] = el['Waarde']
count[el['Renner']] = 1.
cost_limit = 1750000
count_required = 15
M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)
M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)
Let’s run this thing!
opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem:
- Lower bound: -inf
Upper bound: inf
Number of objectives: 1
Number of constraints: 2
Number of variables: 976
Sense: unknown
Solver:
- Status: ok
Message: CBC 2.10.3 optimal, objective -20583.33; 12 nodes, 89 iterations, 0.202 seconds
Termination condition: optimal
Id: 0
Error rc: 0
Time: 0.2946770191192627
Solution:
- number of solutions: 0
number of solutions displayed: 0
We got our solution with 976 variables in less than a second. This team would have collected 20583.33 UCI points last season. Let’s look at the team:
for v in M.component_data_objects(pyo.Var):
if v.value > 0.5:
key = str(v)[2:-1]
name = key + (20 - len(key))*' '
print('%s \t %6i \t %7i' % (name, points[key], cost[key]))
fuglsang jakob 1961 178000
kelderman wilco 1328 121000
almeida joao 1370 121500
declercq tim 415 23500
senechal florian 1325 81500
carthy hugh 821 69000
geoghegan hart tao 1292 126000
porte richie 1733 135000
hindley jai 1155 100000
kooij olav 304 14500
roglic primoz 4237 400000
van aert wout 2700 242000
nizzolo giacomo 1115 93500
vermeersch gianni 499 31000
aniolkowski stanislaw 328 11000
Before we wrap up, one final detour we take is to see what solution we’d get if we’d Mark Cavendish some reasonable amount of points, say 400 i.s.o. the current 0, which would have placed him around spot 100-150 on the UCI ranking of 2020.
points['cavendish mark'] = 400
cost_limit = 1750000
count_required = 15
M = pyo.ConcreteModel()
M.ITEMS = pyo.Set(initialize=points.keys())
M.x = pyo.Var(M.ITEMS, within=pyo.Binary)
M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)
M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)
M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)
opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem:
- Lower bound: -inf
Upper bound: inf
Number of objectives: 1
Number of constraints: 2
Number of variables: 976
Sense: unknown
Solver:
- Status: ok
Message: CBC 2.10.3 optimal, objective -20817.33; 0 nodes, 14 iterations, 0.157 seconds
Termination condition: optimal
Id: 0
Error rc: 0
Time: 0.26456785202026367
Solution:
- number of solutions: 0
number of solutions displayed: 0
We observe an increase in total team score from 20583 to 20817, about 230 points. We know we could never gain more than 400 points. The resulting team is
for v in M.component_data_objects(pyo.Var):
if v.value > 0.5:
key = str(v)[2:-1]
name = key + (20 - len(key))*' '
print('%s \t %6i \t %7i' % (name, points[key], cost[key]))
fuglsang jakob 1961 178000
kelderman wilco 1328 121000
almeida joao 1370 121500
cavendish mark 400 2500
declercq tim 415 23500
senechal florian 1325 81500
carthy hugh 821 69000
porte richie 1733 135000
hindley jai 1155 100000
hirschi marc 1430 140000
roglic primoz 4237 400000
van aert wout 2700 242000
nizzolo giacomo 1115 93500
vermeersch gianni 499 31000
aniolkowski stanislaw 328 11000
This team only swaps out two riders with respect to our previous solution: Mark Cavendish replaces Olav Kooij (also a sprinter), and the few gained euros are used to replace Tao Geoghegan Hart with Marc Hirschi. We could have done a similar consideration for a rider like Mathieu van der Poel who had a poor start to his 2020 season, but who is looking very good for 2021, but I didn’t.
I finally decided to stick to the original team and give young Dutch sprinting talent Olav Kooij his shot at glory.
The final team, submitted as ‘Poeske Scherens Fanclub’ (a double entendre in Dutch), is:
Though completely unbiased towards Belgian riders, we still have three of them. The team quite clearly is oriented towards GC riders, in particular those who did well last year (Fuglsang, Kelderman, Almeida, Geoghegan Hart, Porte, Hindley, Roglic). One could argue that what we have basically done is find the most significant differences between the UCI ranking and ProCyclingStats. This team indicates that the UCI ranking is more favorable to GC riders than ProCyclingStats.
The real question is of course: Which one is the most representative for the MegaTomBike ranking? Or even more precise: Which past ranking (or complex model with multiple years and multiple features) is the best predictor for a future MegaTomBike ranking? We could then even consider past MegaTomBike scores, which I don’t have access to currently.
Those questions I’ll leave for some future post, for example when the 2021 season is over.
Until then… go team!
Update: A promising start! After a few weeks in, already in second place! Again not many GC oriented teams among the competitors.
Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email or reply to this tweet.