Better Meta Analysis: Using Wilson Score intervals to evaluate win rates at the 2017 World Championship

With Game 5 of Team WE vs. Cloud9 complete, the Quarterfinals stage of the 2017 World Championship is over. The tournament clocks in at 107 games so far, and it’s clear which champion is strongest: Kalista is the only champion picked or banned in every single game thus far, a feat achieved by only one champion in each of the World Championships since 2013. But there are twenty champions in each game – ten picked and ten banned – and correctly choosing the other nineteen goes a long way towards winning a game.

While many analysts, broadcasters, and statistics websites use statistics like presence (pick+ban%), games played, and win rate to rank champions, none of these measurements truly captures the power level of a champion. As an alternative, we can use a binomial proportion Wilson score interval to evaluate and adjust win rates in order to find the “best” champions at Worlds 2017.

Binomial Proportion Wilson Score Interval

For those familiar with statistics, the terms “normal distribution” and “confidence intervals” should ring a bell. These estimators are used to find what the true average might be. A binomial proportion interval is similar, but it is based not on the normal distribution but on the binomial distribution, where only two outcomes are possible for each item.

In League of Legends, a champion can only win or lose. There are no ties, no 1.5 wins, and no two losses in a game. That said, champions can also be banned or not even picked at all. It is because champions can be banned or not even picked that we should judge the value of a champion based on its Wilson score instead of its presence, games played, or win rate.

Wilson score intervals provide a range within which the true value (what we could call “expected win rate”) is likely to fall, because they account for sample size, observed average, and desired confidence level. For our purposes the Wilson score interval behaves like a familiar confidence interval, with two useful features:

  1. Bias towards 50%: the Wilson score interval adjusts every binomial average towards 50%, sometimes to a fault. This works to our advantage since LoL is a zero-sum game: for every game a team wins, another team must lose.
  2. Range from 0% to 100%: an interval built on the normal distribution can return a value above 100% or below 0% if given a large enough z-score*, whereas Wilson score intervals will always lie between 0% and 100%. Since a champion cannot have a win rate below 0% or above 100%, higher confidence levels can be used without adjustments.

*The z-score is the number of standard deviations that corresponds to the chosen confidence level.
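To illustrate the second point, here is a minimal sketch that compares a normal-approximation interval with a Wilson score interval for a small sample. The function names are hypothetical, z = 1.96 corresponds to a 95% confidence level, and the record of 4 wins in 5 games is inferred from Kalista's 80.0% win rate over 5 picks.

```python
import math

def wald_interval(wins, games, z=1.96):
    """Normal-approximation interval; it can leave the 0..1 range."""
    p = wins / games
    half_width = z * math.sqrt(p * (1 - p) / games)
    return p - half_width, p + half_width

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval; it always stays within 0..1."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half_width = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half_width, center + half_width

# Assumed record: 4 wins in 5 picked games (80.0%).
print(wald_interval(4, 5))    # the upper bound exceeds 1.0
print(wilson_interval(4, 5))  # roughly (0.376, 0.964)
```

The normal-approximation upper bound lands above 100%, which is not a meaningful win rate, while the Wilson bounds stay inside the valid range.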

For this article, a confidence level of 95% was chosen, which equates to a z-score of 1.96.
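For reference, the standard Wilson score interval can be written as below. The symbols are notation chosen for this sketch rather than the article's own: \hat{p} is the observed win rate, n is the sample size, and z is the z-score (1.96 here).

```latex
\[
\frac{\hat{p} + \dfrac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
\]
```

The minus sign gives the lower bound and the plus sign gives the upper bound.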

Building a Model

The next thing to do is compare multiple methods of applying Wilson score intervals to the 2017 World Championship pick and ban rates to see which approach produces the most accurate ranking of Champions.

Method 1: Picks Only

In this approach we calculate the lower and upper bounds of our Wilson score interval using only actual win rate and number of games played. The table is sorted in descending order by lower bound.

Champion | Picks | Win rate | Lower Bound | Upper Bound
Janna | 54 | 72.2% | 59.1% | 82.4%
Galio | 32 | 71.9% | 54.6% | 84.4%
Gragas | 51 | 62.8% | 49.0% | 74.7%
Twitch | 20 | 70.0% | 48.1% | 85.5%
Jarvan IV | 48 | 58.3% | 44.3% | 71.2%
Gnar | 13 | 69.2% | 42.4% | 87.3%
Rakan | 30 | 60.0% | 42.3% | 75.4%
Tristana | 63 | 54.0% | 41.8% | 65.7%
Taliyah | 17 | 64.7% | 41.3% | 82.7%
LeBlanc | 8 | 75.0% | 40.9% | 92.9%
Jayce | 16 | 62.5% | 38.6% | 81.5%
Kalista | 5 | 80.0% | 37.6% | 96.4%
Malzahar | 5 | 80.0% | 37.6% | 96.4%
Cho’Gath | 54 | 48.2% | 35.4% | 61.2%
Singed | 2 | 100.0% | 34.2% | 100.0%
Sejuani | 57 | 45.6% | 33.4% | 58.4%
Xayah | 42 | 47.6% | 33.4% | 62.3%
Cassiopeia | 14 | 57.1% | 32.6% | 78.6%
Trundle | 10 | 60.0% | 31.3% | 83.2%
Syndra | 40 | 45.0% | 30.7% | 60.2%

This approach immediately fails the eye test: Kalista is ranked 12th, tied with Malzahar. Kalista is a must-ban champion on Red Side and there is no way she should be ranked at the same level as Malzahar.

This happens because, when we only use picks, fewer games mean a wider range in which the true win rate might lie, and therefore a smaller lower bound. Note that every champion rated above Kalista has a win rate lower than hers. Their larger sample sizes make the estimated range tighter, and thus their lower bounds are higher than hers.
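The picks-only ranking can be sketched as follows. This is an illustration under stated assumptions, not the author's actual code: the function name is hypothetical, and the win counts are inferred from picks times win rate in the table above.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Return (lower, upper) Wilson score bounds for a win rate of wins/games."""
    p = wins / games
    denom = 1 + z**2 / games
    center = (p + z**2 / (2 * games)) / denom
    half_width = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return center - half_width, center + half_width

# (wins, picks); wins are inferred from the table, not stated in the article.
records = {"Janna": (39, 54), "Tristana": (34, 63), "Kalista": (4, 5)}

# Sort by lower bound, descending, as in the table.
ranking = sorted(records, key=lambda c: wilson_interval(*records[c])[0], reverse=True)
for champion in ranking:
    wins, picks = records[champion]
    lower, upper = wilson_interval(wins, picks)
    print(f"{champion:<10}{picks:>4}{wins / picks:>8.1%}{lower:>8.1%}{upper:>8.1%}")

# Hypothetical records: the same 80% win rate over larger samples.
for wins, games in [(4, 5), (16, 20), (40, 50)]:
    print(games, f"{wilson_interval(wins, games)[0]:.1%}")
```

The second loop holds the win rate at 80% and only grows the sample, which shows the lower bound rising as the interval tightens.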

To rectify this, we should try including bans in the formula.

Note: From here on out, we will only be interested in the lower bound, since high win rates produce upper bounds that are very close to 100%, which doesn’t tell us much. We will no longer point out that as sample size increases, the range between the upper and lower bounds becomes tighter.

Method 2: Picks and Bans

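In this approach the win rate still comes from picked games only, but the sample size is the number of games in which the champion was either picked or banned.

Champion | Picks+Bans | Win rate | Lower Bound
Kalista | 107 | 80.0% | 71.4%
Galio | 97 | 71.9% | 62.2%
Janna | 83 | 72.2% | 61.8%
LeBlanc | 44 | 75.0% | 60.6%
Gnar | 37 | 69.2% | 53.2%
Twitch | 29 | 70.0% | 51.8%
Gragas | 71 | 62.8% | 51.1%
Singed | 4 | 100.0% | 51.0%
Taliyah | 51 | 64.7% | 51.0%
Malzahar | 10 | 80.0% | 49.0%
Jayce | 51 | 62.5% | 48.8%
Jarvan IV | 105 | 58.3% | 48.8%
Rakan | 74 | 60.0% | 48.6%
Tristana | 82 | 54.0% | 43.3%
Cassiopeia | 37 | 57.1% | 41.3%
Trundle | 25 | 60.0% | 40.7%
Cho’Gath | 83 | 48.2% | 37.7%
Xayah | 91 | 47.6% | 37.7%
Kha’Zix | 6 | 75.0% | 36.5%
Sejuani | 104 | 45.6% | 36.4%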

When the champions’ bans are accounted for, Kalista jumps to the top. Perfect.

Singed, however, jumps to 8th just by adding two more bans, and Trundle sits 16th with fifteen bans. At such a low sample size, increasing the sample by a small amount significantly increases the lower bound. Singed’s lower bound leaped from 34.2% to 51.0%. Similarly, Trundle’s lower bound increased from 31.3% to 40.7%.
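A short sketch of this sensitivity, using the Singed, Trundle, and Kalista rows from the two tables. The function name is hypothetical; the win rate is held fixed and only the sample size changes from picks to picks plus bans.

```python
import math

def wilson_lower(win_rate, n, z=1.96):
    """Lower bound of the Wilson score interval for a given win rate and sample size."""
    denom = 1 + z**2 / n
    center = (win_rate + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(win_rate * (1 - win_rate) / n + z**2 / (4 * n**2)) / denom
    return center - half_width

# champion: (win rate, picks, picks + bans), taken from the Method 1 and Method 2 tables
rows = {"Singed": (1.0, 2, 4), "Trundle": (0.6, 10, 25), "Kalista": (0.8, 5, 107)}

for champion, (win_rate, picks, presence) in rows.items():
    before = wilson_lower(win_rate, picks)
    after = wilson_lower(win_rate, presence)
    print(f"{champion:<8} picks only: {before:.1%}  picks+bans: {after:.1%}")
```

Two extra games are enough to move Singed's lower bound by about 17 percentage points, which is why very small samples are hard to trust under this method.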

This highlights a big problem: selection bias.

Only twenty of the strongest champions are either picked or banned each game. This model implicitly claims that any champion that is not picked or banned is not felt to be strong enough by either team and should be penalized accordingly. That is not necessarily the case, since picks and bans are very contextual, and depend on what has happened in the draft up until that point.

It should also be noted that the ratio of picks to bans varies wildly from champion to champion. Jarvan IV has been picked 48 times and banned 57 times, while Twitch has been picked 20 times and banned only 9 times. Thus Twitch’s higher win rate does not provide full context. In most scenarios where Twitch is considered, he is more likely to be picked than banned, while Jarvan IV warrants bans at a much higher rate than he is picked. Thus, it makes sense to reward champions for being picked more than their win rate might suggest they should. Bans should carry more weight than picks, as picks are the result of specific champions being banned beforehand.

Method 3: Adjusted Winrates with Picks and Bans

Here we introduce adjusted win rate, which is a custom formula based on the number of games in which a champion was picked, banned, or neither picked nor banned.

The formula for adjusted win rate makes the following assumptions:

  1. A champion is banned when it is expected to perform better than its current win rate. It is a pretty safe assumption that champions are banned in a draft when they are expected to outperform their normal expectations. Thus the average of the current win rate and 100% is assigned as a theoretical win rate for games in which a champion is banned.
  2. A champion is not picked or banned when it is expected to perform worse than its current win rate. As discussed under selection bias earlier, a champion’s theoretical win rate should be lower in games where it is not considered. We assume that a champion with a 100% win rate is likely to have a 40% win rate in an ill-fitting scenario, and by simple linear interpolation we arrive at dividing the current win rate by 2.5 for those games.
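One way to write these two assumptions as a single formula is sketched below. This is a reconstruction from the assumptions above, not a formula quoted from the article, and the symbols are chosen here: W is the raw win rate, P the number of games picked, B the number of games banned, and N the number of games neither picked nor banned, so that P + B + N = 107.

```latex
\[
W_{\text{adj}} = \frac{P \cdot W \;+\; B \cdot \dfrac{W + 1}{2} \;+\; N \cdot \dfrac{W}{2.5}}{P + B + N}
\]
```

As a check, Kalista's numbers (W = 0.8, P = 5, B = 102, N = 0) give (5 · 0.8 + 102 · 0.9) / 107 = 95.8 / 107 ≈ 89.5%, which matches her adjusted win rate in the table.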

With the new adjusted win rates, and using total number of picks and bans for each champion as the sample size, the final result is below.

Champion | Picks+Bans | True Win rate | Adjusted WR | Lower Bound
Kalista | 107 | 80.0% | 89.5% | 82.3%
Galio | 97 | 71.9% | 76.4% | 67.0%
Jarvan IV | 105 | 58.3% | 68.8% | 59.4%
Janna | 83 | 72.2% | 66.3% | 55.6%
Sejuani | 104 | 45.6% | 56.8% | 47.2%
Rakan | 74 | 60.0% | 57.1% | 45.8%
Xayah | 91 | 47.6% | 55.3% | 45.1%
Gragas | 71 | 62.8% | 53.6% | 42.1%
Tristana | 82 | 54.0% | 50.5% | 39.9%
Syndra | 83 | 45.0% | 50.0% | 39.5%
LeBlanc | 44 | 75.0% | 52.7% | 38.3%
Cho’Gath | 83 | 48.2% | 48.7% | 38.2%
Taliyah | 51 | 64.7% | 50.0% | 36.8%
Jayce | 51 | 62.5% | 49.0% | 35.9%
Lulu | 83 | 38.2% | 41.1% | 31.2%
Gnar | 37 | 69.2% | 45.5% | 30.7%
Shen | 63 | 42.9% | 41.6% | 30.3%
Kog’Maw | 63 | 44.1% | 40.8% | 29.5%
Orianna | 51 | 45.5% | 38.6% | 26.5%
Cassiopeia | 37 | 57.1% | 39.3% | 25.3%

To put it simply, if Janna, picked 54 times and banned 29 times, were to be picked every single game, we could expect her win rate to be between 55.6% and 75.5%, 95 out of 100 times, in a tournament similar to the 2017 World Championship.
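The Janna example can be reproduced with a minimal sketch like the one below. It assumes the adjusted win rate is the game-weighted average implied by the two assumptions listed earlier, that Janna won 39 of her 54 picked games (72.2%), and that the Wilson interval uses picks plus bans as the sample size. The function names are hypothetical.

```python
import math

TOTAL_GAMES = 107  # games played at the tournament so far

def adjusted_win_rate(win_rate, picks, bans, total_games=TOTAL_GAMES):
    """Weighted average of the raw win rate and the theoretical win rates
    assumed for banned games and for games where the champion was ignored."""
    neither = total_games - picks - bans
    banned_wr = (win_rate + 1) / 2   # average of the current win rate and 100%
    ignored_wr = win_rate / 2.5      # 100% maps to 40% in an ill-fitting game
    return (picks * win_rate + bans * banned_wr + neither * ignored_wr) / total_games

def wilson_interval(p, n, z=1.96):
    """Return (lower, upper) Wilson score bounds for proportion p over n games."""
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

janna_raw = 39 / 54  # assumed record behind the 72.2% win rate
janna_adj = adjusted_win_rate(janna_raw, picks=54, bans=29)
lower, upper = wilson_interval(janna_adj, n=54 + 29)
print(f"adjusted {janna_adj:.1%}, interval {lower:.1%} to {upper:.1%}")
# adjusted 66.3%, interval 55.6% to 75.5%
```

Under the same assumptions, `adjusted_win_rate(0.8, picks=5, bans=102)` gives about 89.5% for Kalista, in line with the table.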

This list looks more in line with what teams consider to be the strongest champions in the game. Every single meta champion is well represented here and pocket picks with high win rates have dropped considerably in rank. Outliers dropped appropriately, like Aurelion Sol (from 13th to 30th) and Trundle (from 16th to 24th).

Final Thoughts

Wilson score intervals provide a way to condense win rate, pick rate, and ban rate into one value that can rank champions by power level and be easily interpreted.

Of course, while we can look at Wilson score intervals to determine individual champions’ success, simply picking the strongest champions in each role will not necessarily result in a better team composition. Individual skill, team coordination, and champion synergy are all important parts of the draft. Furthermore, some teams succeed more than others with similar drafts. More teams also participated in Worlds this year, with the introduction of the play-in stage, so the difference in skill between the best and worst teams is larger than ever. While we could attempt to value some wins more than others based on teams’ performances, each team only played between four and fifteen games, which is not enough to truly rank the teams.

On top of this, selection bias assumes that teams know the power levels of champions beforehand. However, the draft is a learning process and no team truly masters it. Teams’ priorities change as the tournament progresses: some teams adapt to new strategies while those who cannot are eliminated.

Wilson score intervals help to assess champion strength, but there is a subtle game of rock, paper, scissors in drafting a team composition, and the optimal strategy on paper is not always the winning one. I encourage everyone to simulate drafts with the Wilson score intervals in mind and learn for themselves that the whole is greater than the sum of its parts.


Dan is currently an analyst for Team Vitality in the EU LCS. His background is in Applied Mathematics and Economics, as well as software development. Follow him on Twitter.

4 thoughts on “Better Meta Analysis: Using Wilson Score intervals to evaluate win rates at the 2017 World Championship”

  1. Hi there,

    I really appreciate your article as I am a fan of league analytics as well. In my opinion your model is really useful for presenting a true assessment of the meta picks. Still, I think I have some constructive criticism for your proposal.

    I am not really familiar with higher mathematics/statistics but I think I got everything quite clear. And here I have one question that I would like to ask: is it really necessary to use those Wilson score intervals? In method 3 you just made up a great ratio (it's very subjective, but I can't find anything to complain about) that solves the problem just as well.

    I will use an example to picture what I mean: from your final table, Syndra and LeBlanc, both bursty midlaners and both really close in values. I pulled out some of the numbers and I got: Syndra was played 40 times and banned 43 times (total pick & ban 83) with a winrate of 45%, which gives an adj winrate of 50%, whereas LeBlanc was picked 8 times and banned 36 times (total 44) with a winrate of 75% and an adj winrate of 52,7%.

    I experimented a little bit with adj winrate and my conclusions were that it is just fine. LeBlanc's 75% winrate goes down because she is not contested at all in lots of games. A 75% winrate is not her true overall meta power whereas 52,7% might be. There is one thing that really was a hard nut to crack for me and it is the order. LeBlanc is ahead of Syndra in adj winrate whereas she is behind in the lower bound of the Wilson score interval. I did some math and if it's right the higher bound for Syndra is ~61,0% and for LeBlanc ~67,9%, therefore if ordered by higher bound LeBlanc would be leading.

    This being said, my opinion is that showing only the lower bound, and especially using it to order, is not correct. The example highlights that the higher bound should not be ignored, as, if I interpret the results correctly: if LeBlanc was played each game in a tournament similar to the 2017 World Championship, 95 out of 100 times we could expect her winrate to be between 38,3 and 67,9%, whereas Syndra's would be between 39,5 and 61,0%. Is 35 to 45 better than 30 to 50? In your model yes was the answer and I disagree.

    Final conclusion after all this: in my opinion adj winrate serves the same and is easier to interpret as it is a single value and wilson score intervals are two borders that might be tricky to compare.

    Lastly, I really want to thank you for the article. I am really excited for all the analytical content for league of legends that is being created. Keep up the good work!

    Looking forward to your response,
    Patryk

    1. I’m glad you enjoyed it. Wilson score intervals are still used with adjusted win rates because of the volatility of the raw win rate. Since every part of adjusted win rate depends on the raw win rate, the values for champions with small sample sizes will vary wildly. Aurelion Sol dropped from 30th to 37th with C9 losing one game against WE. If we didn’t use Wilson score intervals, the fall would’ve been much larger.

      As for using the lower bound only, it’s because we are looking at the best champions. The “minimum threshold” matters more when the win rates are higher, since upper bounds are closer to 100% and are less meaningful to compare. If I were looking at the bottom of the list, the upper bound would provide more insight than the lower bound.

  2. Hi Dan,
    I also appreciate your method for ranking the champions in the world championship and I think it is very useful for me. Thanks a lot. Besides, could you please refer me to some ideas or papers on how to value or rank team composition? I get the data from the Riot API, where there are ordinal data such as damage for each champion. There are also ordinal data like utility, mobility, and crowd control. Do you think it makes sense to combine this kind of data with the champion ranking list you just applied, to evaluate the team draft?

    Looking forward to your reply, thanks
    Yinan

    1. I think there are way too many variables to consider. You have to consider synergy of two or more picks, marginal value of individual traits you mentioned (mobility, CC, tankiness, DPS potential, engage, follow-up, disengage, etc.) and how good each champion is in that regard. I think using qualified opinion instead of data is better for these other factors. Just get pro players/coaches/etc. and see if you can assign numbers to them.
