It is heartening to see interest in statistically understanding the game of football starting to gather momentum, even if it is still very much in its infancy. Data-driven analysis, both for its own sake and to challenge some of the more entrenched views of a particular game, is often a way to capture and delight new audiences. However, there are also many pitfalls along the way, and there is a responsibility to ensure that one flawed but heartfelt view of a sporting contest is not replaced by an equally incomplete and entrenched one.
Data can be, and is, used to support almost any preconceived notion, and the weapon of choice is almost always the small sample size. Cherry-picking limited data, usually relating to one particular team and spread over just one season, should be a clear warning sign. The data collection process can be laborious and time consuming, and so the temptation to publish a headline-grabbing post based on limited, unusual, but almost certainly random fluctuations from the norm is often too strong to resist.
To illustrate this point, here's a slightly contrived example. Last season, Team B had a longest run of two consecutive away wins, while Team M's longest sequence lasted just one game. Team B's sequence of away wins was 30% above the league average and Team M's was 40% below it. Notice how the use of percentages enhances the (non) effect. B were Blackpool and M were Manchester United. That's not to say that amusing footnotes to an EPL season such as this are of no interest, but they are trivia, not repeatable, lasting trends.
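A quick simulation makes the point that such streak "records" arise by chance alone. This is a minimal sketch, not an analysis of real EPL data: the 30% away-win probability, 19 away games per season and 20-team league are illustrative assumptions, and every simulated team is given identical ability.

```python
import random

def longest_streak(results):
    """Length of the longest run of consecutive wins (True values)."""
    best = cur = 0
    for won in results:
        cur = cur + 1 if won else 0
        best = max(best, cur)
    return best

def mean_streak_spread(win_prob=0.3, n_games=19, n_teams=20,
                       n_seasons=1000, seed=42):
    """Simulate seasons of away results for identical teams and return
    the average gap between the league's best and worst longest streak."""
    rng = random.Random(seed)
    spreads = []
    for _ in range(n_seasons):
        streaks = [
            longest_streak(rng.random() < win_prob for _ in range(n_games))
            for _ in range(n_teams)
        ]
        spreads.append(max(streaks) - min(streaks))
    return sum(spreads) / n_seasons

# Even though every simulated team has exactly the same 30% away win
# chance, the best and worst longest streaks in a league typically
# differ by two or more games purely by chance.
print(mean_streak_spread())
```

Under these assumptions, some team will almost always look like a streak specialist and some other team like a streak-proof side, with no difference in underlying ability at all.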
Randomness exists, and there's no reason to expect it to be absent from sporting events, no matter how strongly a pundit or analyst wants to be able to fully explain what occurs on the field. Players and ex-players, who tend to make up the majority of football pundits in the UK, can be excused for erring on the side of talent as the only factor that goes towards deciding a football match. It must be difficult enough to stay focussed while plying your trade in front of 50,000 passionate fans, and it would hardly help if you allowed the nagging possibility into your head that luck will play a part in the game's outcome. Talent plays its part, and that can be measured reasonably well, but chance also contributes, and that has to be accounted for.
So are larger sample sizes the way to go? Probably. But here too we must tread carefully. Here's another (even more contrived) example. You find an obscure league for which you can trace no record of match results or league tables, but you do find shot, corner and possession data, and you use this to construct a predictive algorithm and subsequently to predict the results of many games. A friend who is familiar with the league then provides you with many seasons' worth of results. So you batch up every game where you predicted that the home team had a 40% chance of winning (there are conveniently exactly 100 games in the sample). To your delight you find that your predictions match reality extremely closely: 40% of the home teams actually won.
However, your friend quickly points out that half of your 100 games involve very poor teams hosting very good teams (he estimates the home sides have about a 10% chance of winning), and the other 50 games involve very good teams (with a 70% win chance) hosting poor sides. As you knew nothing of the league, none of these facts was apparent to you, but the poor home sides won on average 5 of their 50 games and the good home sides won 35 of theirs. Combined, they gave you your predicted 40% strike rate, despite your wildly overestimating the ability of half of the group and wildly underestimating that of the other half.
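The arithmetic of the cancellation is easy to verify, and a proper scoring rule, such as the Brier score, is one standard way to expose the miscalibration that the pooled strike rate hides. The 50/50 split and the 10% and 70% true win probabilities below come from the example above; the Brier comparison is an added diagnostic, not something the original scenario relied on.

```python
def pooled_rate(groups):
    """groups: list of (n_games, true_win_prob). Expected pooled win rate."""
    games = sum(n for n, _ in groups)
    wins = sum(n * p for n, p in groups)
    return wins / games

def expected_brier(forecast, true_prob):
    """Expected Brier score of a probability forecast against the truth
    (lower is better; a perfect forecast of p scores p * (1 - p))."""
    return (true_prob * (1 - forecast) ** 2
            + (1 - true_prob) * forecast ** 2)

groups = [(50, 0.10), (50, 0.70)]  # the two hidden subgroups

# The pooled win rate matches the flat 40% forecast exactly...
print(round(pooled_rate(groups), 3))  # 0.4

# ...but scoring each game shows the flat forecast doing clearly worse
# than forecasts that knew the true 10%/70% split.
flat = sum(n * expected_brier(0.40, p) for n, p in groups) / 100
informed = sum(n * expected_brier(p, p) for n, p in groups) / 100
print(round(flat, 2), round(informed, 2))  # 0.24 0.15
```

The lesson is that checking a model only against aggregate hit rates rewards forecasts that are wrong in offsetting directions; scoring game by game, or splitting the sample by the hidden subgroups, reveals the problem immediately.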
The above scenario has been exaggerated for effect, but large samples can suffer from the same problem. Hidden errors or uncertainties exist and can conspire to make a sample of games appear perfectly consistent with our preferred view of the world, simply by cancelling each other out.
Independent analysis of sport has arisen in the past through dissatisfaction with more traditional, subjective viewpoints, and so there lies a responsibility to investigate our methods fully for any unseen systematic bias in the data, to reject spurious sample data no matter how seductive the conclusion may be, and generally to resist the temptation to love our models too much.
In the next couple of posts I'll show how deeper scrutiny of large samples of games can reveal competing factors that could easily be missed, but which, once revealed, give us a better insight into how in-running win probabilities change depending upon the current game situation.