Thursday 23 August 2012

Why Sample Size Matters.

Easily the biggest improvement that can be made in the analysis of player-related football data lies in how sample size is incorporated into the process. It is universally recognized that the outcome of a single trial or event adds little to our knowledge of a player's real ability, and when playing talent is judged solely on data collected over a single season, it has become customary to omit data that originates from a small number of games.

If the only information we have about a player is that he has an extraordinary strike rate from a limited number of shots, many will balk at placing him at the top of the scoring charts, choosing instead to omit his figures until more become available.

However, this approach, as well as being unfair to an unknown striker off to a hot streak, doesn't entirely eliminate the problem of a player's recorded stats being a mere sample of his true ability. Even if we limit our study to players who have recorded a minimum, but arbitrarily chosen, number of attempts, we haven't removed the problem of unrepresentative, randomly driven outcomes overrating or underrating a player's real long-term ability. We are at best simply reducing the effect.

As many are aware, Manchester City, in conjunction with Opta, have released a huge CSV file of individual player data for the English Premiership during 2011/12. The range and detail of the data has been discussed on various blogs and the reaction has been, quite rightly, predominantly positive. Data collection for hobbyists can be extremely time consuming, even assuming that the data is available to collect. So Opta and Gavin Fleig's groundbreaking and bold decision to turn over such a large amount of data to the analytical community for free, with the promise of more to come, is to be welcomed.

This outpouring of data provides an opportunity to develop new, useful and predictive metrics. But it is also very likely that we will see extravagant claims being made about what these new metrics tell us about the perceived talent of individuals, mostly through neglecting to address sample size issues.

The ultimate strength of advanced sporting analysis comes about through predicting and explaining sporting contests, and the raw data is always going to be the building block for this aim. However, presenting newly sourced raw data, even with the benefit of visualization software and corrected by appearance or opportunity, is merely a different way of describing what occurred on the field of play over a series of sample size restricted events.

As a descriptive archive it is valid and valuable, but once such statistics alone are used to claim knowledge of a player's real talent and potential, their usefulness becomes overstretched.

Advanced analytics only comes of age when practitioners fully acknowledge the differences between descriptive numbers and the predictive metrics that are subsequently derived from such raw data, hopefully stripped of as much random baggage as is possible. Editorial over :-).

In recent posts concerning crossing and goalkeeping stats, I've illustrated how the overall predictive properties of such numbers improve once sample size is addressed and particularly extreme outliers are reined in towards the group average. So for this post I will outline how I've collected and used aggregated data to evaluate how Premiership defences dealt with both passes made into the final third and crosses made into the box during 2011/12. I will also look at how many sample repetitions we may reasonably need to see before we can begin to form an opinion about talent levels across different teams, and outline some of the assumptions that are required when using aggregated data.

Aggregated data for successes and failures from final third passes and crosses is available at such sites as the Opta reseller, EPLIndex. However, in common with many sports, it is only provided from the viewpoint of the attack. If you want to know how Stoke deal with crosses, you need to manually record the data from each game at such Opta driven apps as FourFourTwo's Stats Zone. Recording 380 games, each involving two sides, takes around seven hours for one season.

At the end you are left with summarized data for over 100,000 passes that were made into the final third of the pitch by the attacking sides over a season, of which around 70,000 were deemed successful, including over 400 occasions when the recipient of the pass went on to score. Similarly for crosses, over 16,000 were attempted, just under 4,000 reached a teammate and over 200 led directly to goals.

Our first assumption is therefore that sheer weight of numbers makes the quality of crosses or passes comparable for all sides. A full and complete calendar of games ensures that strength of schedule is almost identical for all 20 teams; the only difference is that they obviously don't play themselves. But we also need to believe that each team is defending a similar proportion of hopeful long balls and delicate, defence splitting passes in and around the area.

Last season, Chelsea faced 4300 passes into its final third of the pitch and 2737 of those attempts successfully reached an opponent, so they allowed a 63.5% success rate. Relegated Blackburn faced 6020 such passes and allowed a success rate of 71.3%, with 4290 finding their intended target. Overall, the raw efficiency range between best and worst runs from just over 59% to Blackburn's 71.3%. Remember that we are looking at this from a defensive perspective, so the lower the opponents' completion rate, the better the job the defence and midfield are doing of limiting final third completions.

Surely with trials running into the thousands we can take the efficiency rates at face value? The large spread in efficiency ratings over such a number of repetitions indicates that causes other than random variation are certainly behind the numbers. These are likely to be partly skill driven and partly tactical. But even in such a large data collection, an improved efficiency figure results from pulling extreme values towards the mean. The adjustments to the rates of final third pass completions allowed are small, but even with very large samples there is a case for making them.
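
One way to check that the spread is more than luck is to compare the spread of raw completion rates across the 20 teams with the spread that chance alone would produce if every defence were equally good. The sketch below uses the raw pass completion rates listed in the table in this post. The figure of 5,000 passes faced per team is an assumption (roughly 100,000 passes shared between 20 sides), not a recorded count, and the function names are purely illustrative.

```python
import math
import statistics

# Raw pass completion rates allowed by each defence in 2011/12,
# copied from the table in this post.
RAW_COMPLETION_RATES = [
    0.634, 0.651, 0.665, 0.651, 0.660, 0.612, 0.638, 0.629, 0.680, 0.650,
    0.688, 0.635, 0.678, 0.673, 0.686, 0.675, 0.593, 0.713, 0.641, 0.681,
]

# Assumption: every team faced about 5,000 final third passes.
ASSUMED_PASSES_PER_TEAM = 5000


def observed_sd(rates):
    """Standard deviation of the rates the teams actually recorded."""
    return statistics.pstdev(rates)


def chance_only_sd(rates, attempts):
    """Binomial standard deviation if every team shared the same true rate."""
    p = statistics.mean(rates)
    return math.sqrt(p * (1 - p) / attempts)


spread_seen = observed_sd(RAW_COMPLETION_RATES)
spread_from_luck = chance_only_sd(RAW_COMPLETION_RATES, ASSUMED_PASSES_PER_TEAM)

print(f"observed spread:    {spread_seen:.4f}")
print(f"chance-only spread: {spread_from_luck:.4f}")
```

If the observed spread is several times the chance-only spread, most of the difference between teams is real and only a small pull towards the mean is needed.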

To further illustrate the need to regress raw efficiency rates, we can take a different criterion for success and look at a defence's ability to prevent final third passes being converted into a goal by the player who received the pass. Successes for the opposition are naturally much less frequent in this case. Wolves, for example, allowed 38 goals from 5156 final third passes for an (in)efficiency rate of 0.74%, or a goal every 135 such passes, compared to Manchester City, who succumbed once every 500 passes.

On this occasion there is still team input into the differing efficiency rates recorded by different defences, but that input is less pronounced and random luck is more of a factor than when mere pass completion is used as the defining factor. You could begin to evaluate a team's ability to prevent final third pass completions after a game or two, but accurately evaluating ability based on goals allowed from the same type of pass would require almost a third of a season.
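
A common way to carry out this kind of regression is to add a fixed number of league average trials to each team's record. The sketch below illustrates that idea and is not necessarily the exact method behind the figures in this post. The names `regress_rate`, `league_rate` and `padding` are hypothetical, and the league rates and padding values in the usage lines are assumed round numbers chosen only to show the effect. The padding is the number of trials at which a team's own record and the league average carry equal weight, so a luck-heavy measure such as goals allowed needs far more padding than pass completion.

```python
def regress_rate(successes, attempts, league_rate, padding):
    """Shrink a raw rate towards the league rate by adding `padding` average trials."""
    if attempts < 0 or padding < 0 or attempts + padding == 0:
        raise ValueError("attempts and padding must be non-negative and not both zero")
    return (successes + league_rate * padding) / (attempts + padding)


# Wolves allowed 38 goals from 5156 final third passes (figures from this post).
wolves_raw = 38 / 5156
# Assumed for illustration: a league goal rate of 0.4% and padding of about
# a third of a season's worth of passes faced.
wolves_regressed = regress_rate(38, 5156, league_rate=0.004, padding=1700)

# Blackburn allowed 4290 completions from 6020 passes (figures from this post).
blackburn_raw = 4290 / 6020
# Assumed for illustration: a league completion rate of 66% and padding of
# about a game or two's worth of passes faced.
blackburn_regressed = regress_rate(4290, 6020, league_rate=0.66, padding=250)

print(f"Wolves goals allowed:     raw {wolves_raw:.5f}, regressed {wolves_regressed:.5f}")
print(f"Blackburn completions:    raw {blackburn_raw:.3f}, regressed {blackburn_regressed:.3f}")
```

With these assumed inputs the goals figure moves a long way towards the league rate, while the completion figure barely changes, which is the pattern described above.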

Below I've listed the regressed and raw success rates for both categories of outcome from final third passes allowed by Premiership teams during last season. The regressed numbers are the best guess of how teams will perform in 2012/13 based on knowledge from only 2011/12, and an undoubted improvement on the raw efficiency figures. Again, low efficiency figures are preferred because defences don't want their opponents scoring from, or maintaining possession of, passes played into the final third.

Regressed & Raw Rates At Which Defences Allowed Pass Completions or Goals From Final Third Passes In The EPL 2011/12. (Blue are the Top Five Defences, Red are the Bottom Five).

Team        Regressed (Completion)  Raw (Completion)  Regressed (Goals)  Raw (Goals)
Man City    0.635                   0.634             0.00270            0.00205
Stoke       0.651                   0.651             0.00301            0.00260
A Villa     0.665                   0.665             0.00312            0.00275
Man Utd     0.651                   0.651             0.00313            0.00267
Sunderland  0.660                   0.660             0.00314            0.00280
Liverpool   0.614                   0.612             0.00342            0.00310
Everton     0.639                   0.638             0.00381            0.00367
Spurs       0.630                   0.629             0.00382            0.00368
WBA         0.680                   0.680             0.00395            0.00387
Newcastle   0.650                   0.650             0.00396            0.00388
Fulham      0.687                   0.688             0.00429            0.00434
Chelsea     0.636                   0.635             0.00433            0.00441
Swansea     0.677                   0.678             0.00452            0.00465
Wigan       0.672                   0.673             0.00452            0.00466
Norwich     0.685                   0.686             0.00454            0.00466
QPR         0.674                   0.675             0.00464            0.00480
Arsenal     0.596                   0.593             0.00500            0.00536
Blackburn   0.711                   0.713             0.00516            0.00548
Bolton      0.642                   0.641             0.00548            0.00600
Wolves      0.680                   0.681             0.00649            0.00737

Briefly digesting the figures: the regressed rates in the first and third columns are much more likely to be the kind of rates each team records this season, while the second and fourth columns are the rates that each team actually recorded during 2011/12.

There's mixed news for Arsenal, who were the most impressive team at denying passes reaching their intended target, but better only than the three relegated sides at preventing received passes turning almost instantly into a goal. They are likely to post similar figures to last year in the first category, but should show natural improvement when trying to deny teams a goal. Stoke, the weekend opponents of The Gunners, share the honours with Manchester City in goal prevention efficiency terms, reinforcing their commitment to denying opponents opportunities in the face of little desire for ball retention.

Regressed & Raw Rates At Which Defences Allowed Completions or Goals From Crosses In The EPL 2011/12. (Blue are the Top Five Defences, Red are the Bottom Five).

Team        Regressed (Completion)  Raw (Completion)  Regressed (Goals)  Raw (Goals)
Man City    0.218                   0.200             0.0132             0.0110
Stoke       0.220                   0.207             0.0131             0.0109
Arsenal     0.223                   0.210             0.0128             0.0087
Everton     0.225                   0.215             0.0132             0.0115
Chelsea     0.227                   0.218             0.0136             0.0134
WBA         0.229                   0.224             0.0121             0.0073
Sunderland  0.231                   0.228             0.0136             0.0136
Norwich     0.233                   0.228             0.0132             0.0118
Liverpool   0.233                   0.231             0.0141             0.0165
Newcastle   0.233                   0.232             0.0133             0.0118
Fulham      0.235                   0.233             0.0126             0.0088
A Villa     0.237                   0.236             0.0139             0.0146
Spurs       0.240                   0.239             0.0142             0.0163
Swansea     0.241                   0.246             0.0130             0.0107
Man Utd     0.242                   0.250             0.0138             0.0142
QPR         0.242                   0.249             0.0146             0.0177
Blackburn   0.242                   0.248             0.0141             0.0153
Wolves      0.243                   0.250             0.0156             0.0225
Bolton      0.247                   0.260             0.0155             0.0221
Wigan       0.247                   0.260             0.0138             0.0143

The same methodology can be used in relation to the rate at which defences allow crosses to find an attacking player and how often they are converted. Once again, raw rates describe exactly what happened over a series of matches, but regressed figures will be more predictive in future seasons. Manchester City allowed only 1 in 5 crosses to succeed in 2011/12, but a slightly less impressive 2 in 9 wouldn't surprise this term. Similarly, WBA's defensive set up weathered an average of 137 crosses before giving up a goal directly from the cross, but under more neutrally lucky conditions opponents can expect to need slightly more than 80 crosses to score during the current season.
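
The "1 in N" figures quoted here are simply the reciprocals of the rates in the crossing table. A minimal sketch of that arithmetic follows; the function name `trials_per_success` is a hypothetical helper, and the rates are copied from the table.

```python
def trials_per_success(rate):
    """Average number of attempts needed per success at a given rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1 / rate


# WBA, goals allowed per cross: raw 0.0073, regressed 0.0121.
wba_raw = trials_per_success(0.0073)
wba_regressed = trials_per_success(0.0121)

# Manchester City, cross completions allowed: raw 0.200, regressed 0.218.
city_raw = trials_per_success(0.200)
city_regressed = trials_per_success(0.218)

print(f"WBA: a goal every {wba_raw:.0f} crosses raw, every {wba_regressed:.0f} regressed")
print(f"Man City: 1 in {city_raw:.1f} crosses raw, 1 in {city_regressed:.1f} regressed")
```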

Stoke's Defence Prevent Yet Another Cross From Reaching its Intended Target.

These numbers indicate that even large team attempt totals do not guarantee that simple raw rates can be taken at face value. Individual player statistics, drawn from far fewer attempts, are therefore bound to need an even larger share of the group average rate added to their numbers. Gradually football is realising that it must follow other sports and regress its raw stats to add mightily to their value, and while interactive, state of the art presentation of newly released data is to be welcomed, we must not complacently allow it to become the orthodoxy for evaluating actual team or player talent.

To register to download the EPL data from Manchester City, Opta and Gavin Fleig click on this link
