Find out more about the methodology used to calculate the scores
Sheep Stats and Sheep Esports are introducing a new scoring system to measure player performance over a series of matches (referred to as a BO, for best-of, in what follows). The aim of this more scientific article is to explain how this score is calculated, so as to be completely transparent about the method used.
The article is divided into two parts. First, the method used is explained in a neutral manner, with arguments based on theory and concrete examples. The second part criticises this same method, with the aim of improving it and keeping a critical mindset.
Method used
Foreword: Definition of the Pearson correlation coefficient.
The Pearson correlation coefficient examines the relationship between two variables and measures how strongly one varies with the other (Voxco definition). Its general formula is:

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
The coefficient r varies between -1 and 1, and its value indicates a stronger or weaker correlation, according to the scale below:
In practice, this coefficient only captures a linear dependency between the two variables: even if it is close to 0, another type of relationship may still exist. We will therefore also compute the Spearman correlation coefficient, which detects a potential monotonic relationship between the two variables, based on their ranks.
In order to give a score to the players, we will use 11 main metrics as a basis:
We have deliberately chosen to rely on these metrics because they have little or no correlation with the length of a game. While this assumption seems immediate for the metrics measured at 14 minutes, it is less obvious for the others.
Indeed, one might expect certain metrics to grow with the length of the game (KP% or damage/gold, for example). We therefore first checked that each metric is independent of game duration by calculating the Pearson correlation coefficient between the two. In every case, the coefficient reveals a very weak correlation (below 0.3), which confirms our choice of metrics and allows us to treat them as independent of game duration.
In addition, Spearman's correlation coefficient likewise indicates the absence of any monotonic correlation.
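As an illustration, here is a minimal sketch of this check using scipy; the metric values and game durations are placeholder data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Placeholder data: one value per game for a given metric (e.g. KP%),
# and the corresponding game durations in minutes.
metric_values = np.array([62.0, 55.5, 71.2, 48.9, 66.3, 58.7])
game_durations = np.array([28.4, 31.0, 25.7, 39.2, 33.5, 29.8])

pearson_r, _ = pearsonr(metric_values, game_durations)
spearman_rho, _ = spearmanr(metric_values, game_durations)

# A metric is kept only if both coefficients indicate a weak relationship
# with game duration (|r| below 0.3 in every case, as stated above).
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```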
Calculating and taking into account every one of these metrics is the ideal. In practice, however, the technical limitations and the complexity of the game mean that some metrics may be missing from the calculation. First, from a technical point of view, some leagues do not give us full access to all the statistics of a game.
With regard to the complexity of the game, it's clear, for example, that we cannot take into account a player's "CC/min" metric if the champion's spell kit does not include any CC.
In view of these difficulties, we still ensure that at least 7 metrics are taken into account in our algorithm:
Considering each match of a BO
In this approach, we will compare the metrics of each of the players in our sample with those of players playing the same role and the same champion (this includes all matches from the last 4 years, for the LFL, LVP, LCS, LEC, LCK, LPL, MSI and Worlds).
Inevitably, the size of our comparison sample depends heavily on the popularity of the champion being played, which is why we decided to take this into account later on. Once our sample is determined, we want to find the probability distribution associated with it.
We therefore use a fitter to find the best possible distribution. The candidate distributions are gamma, lognorm, beta, burr and norm. For each sample, these five distributions are fitted, and the one with the smallest error is kept, the error being measured as a sum of squared errors.
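As a minimal sketch of this step, here is how it could be done with the Python fitter package (which wraps the scipy distributions); the sample below is synthetic placeholder data:

```python
import numpy as np
from fitter import Fitter

# Placeholder sample: e.g. DMG/min values for one role/champion pair.
sample = np.random.gamma(shape=9.0, scale=70.0, size=1000)

# Fit the five candidate distributions and keep the one with the smallest
# sum of squared errors between the fitted PDF and the data histogram.
f = Fitter(sample, distributions=["gamma", "lognorm", "beta", "burr", "norm"])
f.fit()
best = f.get_best(method="sumsquare_error")  # e.g. {"gamma": {fitted parameters}}
print(best)
```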
Here we have the gamma distribution, which best fits the distribution of our data sample.
We then use this fitted distribution as the basis for our scores. Let's take the standard normal distribution as an example:
Here, we want the cumulative distribution function (CDF), which gives the proportion of the sample below a given value. In other words, it can be read as "this score is better than 60% of the sample."
This is the basis on which we establish a score: for a CDF value of 80%, the player receives a score of 8/10 on this metric. Let's take the case of Caps playing Orianna, and assume a comparison sample of 1,000 matches with a midlaner playing Orianna.
Let's assume this sample follows a normal distribution with mean 600 and standard deviation 200. If Caps does 600 DMG/min, then Caps gets a score of 5/10 on this metric. The same pattern is repeated for every metric.
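As a minimal sketch of this step, using the hypothetical numbers above:

```python
from scipy.stats import norm

# Hypothetical fitted distribution for DMG/min of midlane Orianna.
dist = norm(loc=600, scale=200)

def metric_score(value: float) -> float:
    """Score out of 10 = CDF of the player's value in the fitted distribution."""
    return 10 * dist.cdf(value)

print(metric_score(600))  # 5.0 : exactly at the mean, better than 50% of the sample
print(metric_score(800))  # ~8.4 : one standard deviation above, better than ~84%
```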
However, we would like to come back to a few important points. First, our sample may be subject to drift over time. Concretely, if Draven's damage is buffed, players who pick Draven will directly obtain higher scores than they should.
A player doing 800 DMG/min on Draven should therefore be rated slightly lower than his raw numbers suggest, since those 800 DMG/min are explained by his performance but also by the champion's buff.
Secondly, we need to restrict our sample further for champions like Senna and the supports she is paired with. Concretely, Senna can farm or not farm, so she must only be compared with Sennas that made the same choice. To determine whether a given Senna is farming, we simply compare her CS/min with her support's. Here is an example of the CS/min of Senna played as ADC before this adjustment; the data is characterised by a mixture distribution:
The different cases can be clearly distinguished. On the left, we have the case of Senna ADCs that don't farm, and on the right those that do.
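A minimal sketch of such a filter, assuming hypothetical per-game records containing the CS/min of Senna and of her lane partner (the field names are placeholders):

```python
# Hypothetical per-game records for Senna played in the bot lane.
games = [
    {"game_id": 1, "senna_cs_min": 8.1, "partner_cs_min": 1.2},  # farming Senna
    {"game_id": 2, "senna_cs_min": 1.5, "partner_cs_min": 7.9},  # non-farming Senna
    {"game_id": 3, "senna_cs_min": 7.4, "partner_cs_min": 1.0},
]

def is_farming_senna(game: dict) -> bool:
    # Senna is the farming member of the duo if her CS/min exceeds her partner's.
    return game["senna_cs_min"] > game["partner_cs_min"]

farming_sample = [g for g in games if is_farming_senna(g)]
non_farming_sample = [g for g in games if not is_farming_senna(g)]
```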
Here's what you get when you apply the filter separating the two cases:
Finally, in some cases the sample is too small to yield a reliable distribution or, worse, it does not follow any of the usual distributions (a rare phenomenon).
We therefore need an alternative. We define an error threshold for the fit: if the fitting error exceeds this threshold, we no longer rely on the fitted distribution but on the player's rank within the whole sample. This gives a ratio between 0 and 1, which we multiply by 10 to obtain a score.
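A minimal sketch of this fallback; the error threshold is a hypothetical value:

```python
import numpy as np

FIT_ERROR_THRESHOLD = 0.05  # hypothetical threshold on the distribution-fitting error

def rank_score(sample: np.ndarray, player_value: float) -> float:
    """Fallback score: fraction of the sample at or below the player's value, times 10."""
    return 10 * float(np.sum(sample <= player_value)) / len(sample)

# Used only when the best fitted distribution is not reliable enough:
# if fit_error > FIT_ERROR_THRESHOLD: score = rank_score(sample, player_value)
```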
This choice has been made because, at these lower levels, we compensate with an attenuating divisor introduced below. So, we are well on the way to our first score. However, a number of contextual factors need to be taken into account to fine-tune our scores and bring them as close as possible to reality and to the audience's perception.
So, we'll be asking ourselves some key questions:
- Did the player win the game?
- How long did the game last?
- Was the player innovative in his choice of champion?
To take these contextual elements into account, we introduce an attenuating divisor denoted d, initialised to 1 (d = 1).
We then adjust the value of this divisor as follows, depending on the answers to the previous questions (a code sketch of these adjustments follows the list):
- If the match lasted less than 25 minutes:
  - If the player's team won:
    - If the player played a rarely played champion (fewer than 100 samples): d += 0.3
    - Otherwise: d += 0.25
  - Otherwise:
    - If the player played a rarely played champion (fewer than 100 samples): d += 0.05
    - Otherwise: d += 0
- Otherwise:
  - If the player's team won:
    - If the player played a rarely played champion (fewer than 100 samples): d += 0.25
    - Otherwise: d += 0.20
  - Otherwise:
    - If the player played a rarely played champion (fewer than 100 samples): d += 0.15
    - Otherwise: d += 0.10
(nb: d += x means d = d+x)
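Here is a minimal sketch of these adjustments; the function and argument names are placeholders:

```python
def context_divisor(game_minutes: float, won: bool, rare_pick: bool) -> float:
    """Attenuating divisor d adjusted by the three contextual questions above.
    rare_pick means the champion has fewer than 100 games in the comparison sample."""
    d = 1.0
    if game_minutes < 25:
        if won:
            d += 0.3 if rare_pick else 0.25
        else:
            d += 0.05 if rare_pick else 0.0
    else:
        if won:
            d += 0.25 if rare_pick else 0.20
        else:
            d += 0.15 if rare_pick else 0.10
    return d

print(context_divisor(23.5, won=True, rare_pick=False))   # 1.25
print(context_divisor(32.0, won=False, rare_pick=True))   # 1.15
```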
We are now well on the way to a first scoring idea. However, we cannot rely solely on a player's own metrics to evaluate him. That is why we also compare a few metrics with those of his direct opponent in each match. We introduce an additional metric here: the player's KDA. The ratio of a player's KDA to his opponent's is easy to compute: we simply divide the player's value by his opponent's.
It is also for these comparisons that we chose to add this particular metric. Other data could be taken into account but would risk distorting the reality of a game: damage per minute, for example, would almost always favour a bruiser over a tank, which does not mean that the tank lost his lane.
So, let's look at our KDA ratios. Once again, we adjust our divisor d according to the KDA: d += ratioKDA × 0.1. Note that we cap the KDA ratio at 4, to prevent the attenuating divisor from altering the score too much. We also add the player's solo kills: d += solokills × 0.0, with the increase capped at 4 solo kills.
Finally, we add pentakills: d += 0.1 if the player scored a pentakill. Note that we cap the divisor at 1.8. This is a high but justified ceiling, which would only be reached if the player was very dominant in the game, and it would guarantee a minimum score of 8/10. This approach makes it possible to balance the different scores calculated earlier. For instance, with a divisor of 1.25 we obtain the following bounds:
In other words, a player who is rated very poorly on a specific metric sees his score raised more than if he had been rated highly: the divisor shrinks the gap between the metric score and the maximum of 10. If we are at 0 on a metric, we rise to 2, whereas if we are at 8, we rise to 8.4. The calculation is as follows: score' = 10 − (10 − score) / d.
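A minimal sketch of this attenuation; the formula is the one reconstructed from the two example values above:

```python
def attenuate(metric_score: float, d: float) -> float:
    # The divisor d shrinks the gap between the metric score and the maximum
    # of 10, so low scores are lifted more than high ones.
    return 10 - (10 - metric_score) / d

print(attenuate(0, 1.25))  # 2.0
print(attenuate(8, 1.25))  # 8.4
```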
We can then establish a score for each metric and compute their weighted average to obtain the score for a single game. We carry out this calculation for each game of the BO, then compute the weighted average of these game scores to obtain the final score of the BO.
The weighted average depends on the player's role: the coefficients are set according to how important a given metric is for the role in question. For example, CS/min is an important metric for an ADC, whereas vision score/min is more representative for supports.
Final score
The final score is the average (weighted according to the player's role) of the scores of each metric used, for each match played. To summarise, the following scheme is applied to each player: for every match, we establish a score that is the weighted average (according to the player's role) of the scores of all measured metrics; the scores of the matches making up the BO are then averaged.
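To summarise this aggregation in a short sketch; the metric names, role weights and scores below are placeholders:

```python
# Hypothetical role weights for an ADC: metric name -> weight.
ADC_WEIGHTS = {"dmg_min": 3.0, "cs_min": 2.5, "kp": 2.0, "vision_min": 0.5}

def game_score(metric_scores: dict, weights: dict) -> float:
    """Weighted average of the per-metric scores (out of 10) for one game."""
    used = [m for m in metric_scores if m in weights]
    total_weight = sum(weights[m] for m in used)
    return sum(metric_scores[m] * weights[m] for m in used) / total_weight

def bo_score(per_game_scores: list, weights: dict) -> float:
    """Final score: average of the game scores across the whole BO."""
    scores = [game_score(ms, weights) for ms in per_game_scores]
    return sum(scores) / len(scores)

# Example: a two-game BO for an ADC.
bo = [
    {"dmg_min": 7.1, "cs_min": 8.2, "kp": 6.5, "vision_min": 5.0},
    {"dmg_min": 5.4, "cs_min": 7.8, "kp": 7.0, "vision_min": 4.5},
]
print(round(bo_score(bo, ADC_WEIGHTS), 2))
```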
You can find the final scores for each BO since 2023, for example for the final of the LEC 2024 winter split. Additionally, you can find the rankings based on player scores for each league. Click here for an example of the LEC 2024 summer split. Finally, this score is also available on the players' profiles, for example for Caps.
Critical reviews
Because we are always looking for ways to improve, and because it is important for us to take a critical look at our work, we set out below a few points that could be improved and explain some of the choices we made.
We have limited ourselves to the data returned for each match, and the metrics were chosen according to the criteria explained earlier. If other metrics make sense, we are more than open to discussion, and we encourage you to share your ideas, misgivings or criticisms, with the aim of improving this system.
Regarding the metrics chosen (or not), neutral objectives, and in particular the neutral-objective ratio, could have been used. However, the Pearson correlation coefficient between the objective ratio and victory is 0.89, which indicates a strong correlation. Taking this criterion into account would therefore be equivalent to taking victory into account, which is already done.
Still on the subject of ratios (player versus direct opponent), other ratios were considered, such as the ratio of kills, assists, deaths, gold or damage. However, it is important to keep in mind that some match-ups mean that a high-performing player does not always have comparable stats (the tank and bruiser example illustrates this well). The KDA ratio is therefore the most general.
Exotic picks, in other words those for which a role has only a small sample of matches, are difficult to measure. For the exotic-pick factor we chose to increase the attenuating divisor in order to reduce the impact of punitive metrics and reward the risk taken. However, as time goes on and our database grows, larger samples will consolidate the statistics produced for these picks.
Once again, we encourage you to give us as much feedback as possible on this work and our method. We're only too happy to improve and enhance the quality of our work. Criticism, courteous and constructive, can only help us move forward. Thank you for taking the time to read us.
- Antonio Pellini -