MegaTomBike 2021

Knapsack approach to a fantasy cycling team

Posted by Lense Swaenen on March 27, 2021 · 16 mins read

This post describes how I generated my fantasy cycling team for the MegaTomBike 2021 competition. This is a local variant of the bigger Megabike competition, organized by my friend Bob and others.

The challenge is simple: Create a team of 15 riders, each having some price tag, such that the total team cost is below 1.750.000 euro. Throughout the 2021 cycling season, riders will collect points according to their performance in all competitions on the calendar (single day & multiple day events).

Some history

This is the first edition I’m participating in since my previous participations from 2011-2013. Though I’m far from a cycling enthusiast or expert, I’ve been quite succesful in those past competitions by simply taking a quantative approach. Mathematically, MegaTomBike can be considered a variant of a 0-1 Knapsack problem, but more on that later. In 2013 and before, I was not yet familiar with discrete optimization problems and integer programming (a significant gap in the KULeuven mathematical engineering curriculum imo). My approach back then consisted of playing around in Excel, sorting riders according to some efficiency metric, and manual puzzling my way to a decent team. This approach yielded multiple top-3 final rankings (out of ~100 participating teams). Key to the successes of those teams I believe were

  • No bias in favor Belgian riders (as opposed to the other, man-made teams)
  • No bias against riders who are unpopular due to a doping history (e.g. Michele Scarponi), due to style (e.g. Cadel Evans) or due to old age (e.g. Voigt, Garzelli).
  • Preference for GC (general classification) riders over single day specialists. The former are more predictable and consistent, and many points are to be won in multiple day events. Recent Belgian cycling history has known more single day specialists than GC specialists, so the other teams were more biased towards that type of riders.

Creating the team for 2021

Mathematical model

This year, I am tackling the team creation as a 0-1 knapsack problem:

\[\begin{align*} \text{maximize}_{x_i} \quad & \sum_{i=1}^n p_i x_i \\ \text{subject to} \quad & \sum_{i=1}^n c_i x_i \leq C \\ & \sum_{i=1}^n x_i = N \\ & x_i \in \{0,1\} \end{align*}\]

where $p_i$ is the number of points collected by a rider, $c_i$ is the cost of a rider, $C$ is the total team cost limit (= 1.750.000), and $N$ is the total number of riders (= 15). $x_i$ is a binary variable which either selects or does not select rider $i$.

So, we are maximizing the total number of points, given all the constraints.

Constraint

\[\sum_{i=1}^n x_i = N (= 15)\]

is how this problem differs from the standard 0-1 knapsack problem.

Data

We already know $C$ and $N$. The rider costs $c_i$ can luckily be downloaded from the MegaTomBike website in Excel format (column ‘Waarde’).

%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.read_excel(r'renners.xls').head()
Renner Waarde
0 BIDARD François 14000
1 BOUCHARD Geoffrey 13500
2 CALMEJANE Lilian 66000
3 CHAMPOUSSIN Clément 17500
4 CHEREL Mikaël 15000

The hard one is $p_i$. All races are still to be ridden, so we will have to look for a proxy. I have chosen to use the individual UCI road ranking points from 2020. Luckily, also the UCI website allows exporting to Excel.

pd.read_excel(r'IndividualRanking.xlsx').head()
Rank UCI ID Name Nationality Team Code Age Points
0 1 10008888921 ROGLIČ Primož SLOVENIA TJV 31 4237.0
1 2 10014972740 POGAČAR Tadej SLOVENIA UAD 22 3055.0
2 3 10007585986 VAN AERT Wout BELGIUM TJV 26 2700.0
3 4 10007946203 VAN DER POEL Mathieu NETHERLANDS AFC 25 2040.0
4 5 10003401246 FUGLSANG Jakob DENMARK AST 35 1961.0

Matching names

The hard part now is actually matching those two Excel files… Both have a column with the rider name, but not all have the exact same spelling or diacritics, for example BERNAL Egan vs BERNAL GOMEZ Egan Arley. Season 2021 will also have some new young riders competing who did not compete in 2020.

Below is my best effort for matching the riders in the two tables, by ignoring the case, removing diacritics (with the unidecode module), matching only parts of names, …

import unidecode #To remove diacritics
mtb = pd.read_excel(r'renners.xls')
mtb['Renner'] = [unidecode.unidecode(name.lower()) for name in list(mtb['Renner'])]

uci = pd.read_excel(r'IndividualRanking.xlsx')
uci['Name'] = [unidecode.unidecode(name.lower()) for name in list(uci['Name'])]
mtb.sort_values('Waarde', ascending=False).head()
Renner Waarde
462 roglic primoz 400000
550 pogacar tadej 324000
147 alaphilippe julian 291000
583 van der poel mathieu 249000
466 van aert wout 242000
uci.head(5)
Rank UCI ID Name Nationality Team Code Age Points
0 1 10008888921 roglic primoz SLOVENIA TJV 31 4237.0
1 2 10014972740 pogacar tadej SLOVENIA UAD 22 3055.0
2 3 10007585986 van aert wout BELGIUM TJV 26 2700.0
3 4 10007946203 van der poel mathieu NETHERLANDS AFC 25 2040.0
4 5 10003401246 fuglsang jakob DENMARK AST 35 1961.0
mtb_names = list(mtb.sort_values('Waarde', ascending=False)['Renner'])
mtb_costs = list(mtb.sort_values('Waarde', ascending=False)['Waarde'])
uci_names = list(uci['Name'])
mtb_lasts = [mtb_name.split(' ')[0] for mtb_name in mtb_names]
mtb_firsts = [mtb_name.split(' ')[-1] for mtb_name in mtb_names]

uci_lasts = [uci_name.split(' ')[0] for uci_name in uci_names]
uci_firsts = [uci_name.split(' ')[-1] for uci_name in uci_names]
n_found = 0
n_missing = 0
missing = []
match_dict = {}
for i, mtb_name in enumerate(mtb_names): #mtb.sort_values('Waarde', ascending=False)['Renner']:
    mtb_split = mtb_name.lower().split(' ')
    
    n_matches = []
    for j, uci_name in enumerate(uci_names):
        n_match = 0
        for jj, el in enumerate(mtb_split):
            if el in uci_name.split(' '):
                if jj == 0 and len(el) > 3:
                    n_match += 2 #More weight on first name match
                else:
                    n_match += 1
        
        n_matches.append(n_match)
        
    k = n_matches.index(max(n_matches))
    if n_matches[k] >= 2 and uci_names[k] not in match_dict.values():
        n_found += 1
        
        uci_name = uci_names[k]
        match_dict[mtb_name] = uci_name
        #print(mtb_name, ' ------ ', uci_name)
    else:
        n_missing += 1
        print(mtb_name, mtb_costs[i])
    

print(n_found, n_missing)
martin dan 156500
hirschi marc 140000
froome chris 131500
izagirre gorka 82500
ghirmay hailu biniam 64500
...
cavendish mark 2500
...
812 165

To conclude the data alignment: The MTB rider list has 997 riders. We are able to find an equivalent one in the UCI table for 812 out of them. 165 ones we don’t find. In the output above, those are ranked from most expensive to least expensive. Some comments:

  • Martin Dan was not matched to Martin Daniel. We will override this manually
  • Hirschi Marc went fine, but is twice in the MTB table
  • Froome Chris did not collect any points in 2020!
  • Cavendish Marc is coming out of retirement so did not collect any points in 2020, but should be a steal at 2.500 euro. More on this later.
match_dict['martin dan'] = 'martin daniel'

Fill UCI points column

mtb_points = []
for mtb_name in mtb['Renner']:
    try:
        uci_name = match_dict[mtb_name]
        points = uci[uci['Name'] == uci_name].Points.mean()
        
        mtb_points.append(points)
    except:
        mtb_points.append(0.)
        

mtb['Points'] = mtb_points

mtb.sort_values('Waarde', ascending=False).head(10)
Renner Waarde Points Efficiency
462 roglic primoz 400000 4237.00 0.010592
550 pogacar tadej 324000 3055.00 0.009429
147 alaphilippe julian 291000 1795.83 0.006171
583 van der poel mathieu 249000 2040.00 0.008193
466 van aert wout 242000 2700.00 0.011157
25 van avermaet greg 225000 650.00 0.002889
540 kristoff alexander 223000 857.00 0.003843
880 quintana nairo 207500 865.00 0.004169
382 valverde alejandro 198500 716.00 0.003607
89 ackermann pascal 197000 1248.00 0.006335
mtb.sort_values('Points', ascending=False).head(10)
Renner Waarde Points Efficiency
462 roglic primoz 400000 4237.00 0.010592
550 pogacar tadej 324000 3055.00 0.009429
466 van aert wout 242000 2700.00 0.011157
583 van der poel mathieu 249000 2040.00 0.008193
39 fuglsang jakob 178000 1961.00 0.011017
147 alaphilippe julian 291000 1795.83 0.006171
256 porte richie 135000 1733.00 0.012837
556 ulissi diego 170500 1671.00 0.009801
213 demare arnaud 172500 1550.00 0.008986
539 hirschi marc 140000 1430.00 0.010214

As we can see, these rankings match quite nicely. Bob told me they actually based the ‘Waarde’ values on ProCyclingStats data (looking further back than last year).

Now my old approach would have been to play around with ‘efficiency’.

mtb['Efficiency'] = mtb['Points']/mtb['Waarde']
mtb.sort_values('Waarde', ascending=False).head(20)
Renner Waarde Points Efficiency
462 roglic primoz 400000 4237.00 0.010592
550 pogacar tadej 324000 3055.00 0.009429
147 alaphilippe julian 291000 1795.83 0.006171
583 van der poel mathieu 249000 2040.00 0.008193
466 van aert wout 242000 2700.00 0.011157
25 van avermaet greg 225000 650.00 0.002889
540 kristoff alexander 223000 857.00 0.003843
880 quintana nairo 207500 865.00 0.004169
382 valverde alejandro 198500 716.00 0.003607
89 ackermann pascal 197000 1248.00 0.006335
334 ewan caleb 189000 971.00 0.005138
145 viviani elia 180000 254.00 0.001411
39 fuglsang jakob 178000 1961.00 0.011017
447 dumoulin tom 177500 700.00 0.003944
401 matthews michael 175000 1022.00 0.005840
160 evenepoel remco 175000 1193.00 0.006817
239 bernal egan 174500 425.50 0.002438
227 pinot thibaut 172500 863.00 0.005003
213 demare arnaud 172500 1550.00 0.008986
556 ulissi diego 170500 1671.00 0.009801
mtb.sort_values('Efficiency', ascending=False).head(20)
Renner Waarde Points Efficiency
777 repa vojtech 4000 175.00 0.043750
940 johansen julius 2500 108.43 0.043372
567 meisen marcel 2500 100.00 0.040000
939 hvideberg jonas iversby 6000 218.00 0.036333
933 charmig anthon 2500 86.00 0.034400
740 zingle axel 2500 75.00 0.030000
656 aniolkowski stanislaw 11000 328.00 0.029818
953 urianstad martin 2500 70.00 0.028000
405 peak barnabas 3500 90.00 0.025714
741 albanese vincenzo 2500 59.00 0.023600
602 rumac josip 7500 170.00 0.022667
712 murguialday jokin 2500 56.00 0.022400
877 pajur markus 6000 126.00 0.021000
454 kooij olav 14500 304.00 0.020966
54 romo javier 2500 50.00 0.020000
791 cuadrado unai 2500 50.00 0.020000
873 louvel matis 3000 56.00 0.018667
847 marit arne 2500 46.00 0.018400
857 weemaes sasha 3000 53.00 0.017667
158 declercq tim 23500 415.00 0.017660

The spread in efficiency among top riders is not huge: 0.005 - 0.01 UCI points per euro. The best ones go up to about 0.015.

Solving the knapsack problem

We use Pyomo as the optimization modeling environment. The problem to solve is a MILP (actually a binary linear program). We will use open source solver CBC to do the actual solving. The code below is modified from the Pyomo knapsack example.

import pyomo.environ as pyo
points = {}
cost = {}
count = {}
for i, el in mtb.iterrows():
    points[el['Renner']] = el['Points']
    cost[el['Renner']] = el['Waarde']
    count[el['Renner']] = 1.
cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()

M.ITEMS = pyo.Set(initialize=points.keys())

M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)

M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)

M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

Let’s run this thing!

opt = pyo.SolverFactory("cbc.exe")
results = opt.solve(M)
print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 976
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -20583.33; 12 nodes, 89 iterations, 0.202 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.2946770191192627
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

We got our solution with 976 variables in less than a second. This team would have collected 20583.33 UCI points last season. Let’s look at the team:

for v in M.component_data_objects(pyo.Var):
    if v.value > 0.5:
        key = str(v)[2:-1]
        name = key + (20 - len(key))*' '
        print('%s \t %6i \t %7i' % (name, points[key], cost[key]))
fuglsang jakob       	   1961 	  178000
kelderman wilco      	   1328 	  121000
almeida joao         	   1370 	  121500
declercq tim         	    415 	   23500
senechal florian     	   1325 	   81500
carthy hugh          	    821 	   69000
geoghegan hart tao   	   1292 	  126000
porte richie         	   1733 	  135000
hindley jai          	   1155 	  100000
kooij olav           	    304 	   14500
roglic primoz        	   4237 	  400000
van aert wout        	   2700 	  242000
nizzolo giacomo      	   1115 	   93500
vermeersch gianni    	    499 	   31000
aniolkowski stanislaw 	    328 	   11000

Reconsidering Mark Cavendish

Before we wrap up, one final detour we take is to see what solution we’d get if we’d Mark Cavendish some reasonable amount of points, say 400 i.s.o. the current 0, which would have placed him around spot 100-150 on the UCI ranking of 2020.

points['cavendish mark'] = 400
cost_limit = 1750000
count_required = 15

M = pyo.ConcreteModel()

M.ITEMS = pyo.Set(initialize=points.keys())

M.x = pyo.Var(M.ITEMS, within=pyo.Binary)

M.points = pyo.Objective(expr=sum(points[i]*M.x[i] for i in M.ITEMS), sense=pyo.maximize)

M.weight = pyo.Constraint(expr=sum(cost[i]*M.x[i] for i in M.ITEMS) <= cost_limit)

M.count = pyo.Constraint(expr=sum(count[i]*M.x[i] for i in M.ITEMS) == count_required)

opt = pyo.SolverFactory("cbc.exe")

results = opt.solve(M)

print(results)
Problem: 
- Lower bound: -inf
  Upper bound: inf
  Number of objectives: 1
  Number of constraints: 2
  Number of variables: 976
  Sense: unknown
Solver: 
- Status: ok
  Message: CBC 2.10.3 optimal, objective -20817.33; 0 nodes, 14 iterations, 0.157 seconds
  Termination condition: optimal
  Id: 0
  Error rc: 0
  Time: 0.26456785202026367
Solution: 
- number of solutions: 0
  number of solutions displayed: 0

We observe an increase in total team score from 20583 to 20817, about 230 points. We know we could never gain more than 400 points. The resulting team is

for v in M.component_data_objects(pyo.Var):
    if v.value > 0.5:
        key = str(v)[2:-1]
        name = key + (20 - len(key))*' '
        print('%s \t %6i \t %7i' % (name, points[key], cost[key]))
fuglsang jakob       	   1961 	  178000
kelderman wilco      	   1328 	  121000
almeida joao         	   1370 	  121500
cavendish mark       	    400 	    2500
declercq tim         	    415 	   23500
senechal florian     	   1325 	   81500
carthy hugh          	    821 	   69000
porte richie         	   1733 	  135000
hindley jai          	   1155 	  100000
hirschi marc         	   1430 	  140000
roglic primoz        	   4237 	  400000
van aert wout        	   2700 	  242000
nizzolo giacomo      	   1115 	   93500
vermeersch gianni    	    499 	   31000
aniolkowski stanislaw 	    328 	   11000

This team only swaps out two riders with respect to our previous solution: Mark Cavendish replaces Olav Kooij (also a sprinter), and the few gained euros are used to replace Tao Geoghegan Hart with Marc Hirschi. We could have done a similar consideration for a rider like Mathieu van der Poel who had a poor start to his 2020 season, but who is looking very good for 2021, but I didn’t.

I finally decided to stick to the original team and give young Dutch sprinting talent Olav Kooij his shot at glory.

Closing thoughts

The final team, submitted as ‘Poeske Scherens Fanclub’ (a double entendre in Dutch), is:

  • Fuglsang Jakob
  • Kelderman Wilco
  • Almeida Joao
  • Declercq Tim
  • Senechal Florian
  • Carthy Hugh
  • Geoghegan Hart Tao
  • Porte Richie
  • Hindley Jai
  • Kooij Olav
  • Roglic Primoz
  • Van Aert Wout
  • Nizzolo Giacomo
  • Vermeersch Gianni
  • Aniolkowski Stanislaw

Though completely unbiased towards Belgian riders, we still have three of them. The team quite clearly is oriented towards GC riders, in particular those who did well last year (Fuglsang, Kelderman, Almeida, Geoghegan Hart, Porte, Hindley, Roglic). One could argue that what we have basically done is find the most significant differences between the UCI ranking and ProCyclingStats. This team indicates that the UCI ranking is more favorable to GC riders than ProCyclingStats.

The real question is of course: Which one is the most representative for the MegaTomBike ranking? Or even more precise: Which past ranking (or complex model with multiple years and multiple features) is the best predictor for a future MegaTomBike ranking? We could then even consider past MegaTomBike scores, which I don’t have access to currently.

Those questions I’ll leave for some future post, for example when the 2021 season is over.

Until then… go team!

Update: A promising start! After a few weeks in, already in second place! Again not many GC oriented teams among the competitors.


Want to leave a comment?

Very interested in your comments but still figuring out the most suited approach to this. For now, feel free to send me an email or reply to this tweet.