Wednesday 29 November 2017

Over Performers Aren't Always Just Lucky.

Firstly, this isn't another post about whether Burnley are good at blocking shots because "yes they are".

Instead it's about applying some kind of context to the levels of over or under performance in a side's data, and attempting to attribute how much is the result of the ever present random variation in inevitably small samples and how much is perhaps due to a tactical wrinkle and/or differing levels of skill.

Random variation termed as "luck" is probably the reddest of rags to a casual fan or pundit who is uninterested in, or outwardly hostile to, the use of stats to help describe their beautiful game.

It's the equivalent for anyone with a passing interest in football analytics of "clinical" being used ad nauseam, all the way to the mute button by Owen Hargreaves.

Neither of these two catch-all, polar opposite terms, used in isolation, is particularly helpful. Most footballing events are an ever shifting, complex mixture of the two.

I first started writing about football analytics through being more than mildly annoyed that TSR (or Total Shot Ratio, look it up) and its supporters constantly branded Stoke as being that offensive mix of "rubbish at Premier League football" and constantly lucky enough to survive season after season.

And then choosing the Potters as the trendy stats pick for relegation in the next campaign as their "luck" came deservedly tumbling down.

It never did.

Anyone bothered enough to actually watch some of their games could fairly quickly see that through the necessity of accidentally getting promoted with a rump of Championship quality players, Stoke or more correctly Tony Pulis, were using defensive shapes and long ball football to subvert both the beautiful game and the conclusions of the helpful, but deeply flawed and data poor, TSR stat.

There weren't any public xG models around in 2008. To build one meant sacrificing most of Monday collecting the data by hand and Thursday as well when midweek games were played.

But, shot data was readily available, hence TSR.

At its most pernicious, TSR assumed an equality of chance quality.

So getting out-shot, as Stoke's setup virtually guaranteed they would be every single season, was a cast iron guarantee of relegation once your luck ran out in this narrow definition of "advanced stats".

Quantifying chance quality in public was a few years down the road, but even with simple shot numbers, luck could be readily assigned another constant bedfellow in something we'll call "skill".

There comes a time when a side's conversion rate on both sides of the ball is so far removed from the league average rates that TSR relied upon that you had to conclude that something (your model) was badly broken when applied to a small number of teams.

We don't need to build an xB model to see Burnley as being quite good at blocking shots, just as we didn't need a laboriously constructed expected goals model to show that Stoke's conversion disconnects were down to them taking fewer, good quality chances and allowing many more, poorer quality ones back in 2008.

Last season, the league average rate at which open play attempts were blocked was 28%. Burnley faced 482 such attempts and blocked 162, or 34%.

A league average team would have only blocked 137 attempts under a naive, know nothing but the league average, model.

Liverpool had the lowest success rate under this assumption that every team has the same inbuilt blocking intent/ability. They successfully blocked just 21% of the 197 opportunities they had to put their bodies on the line.

You're going to get variation in blocking rate, even if each team has the same inbuilt blocking ability and the likelihood of a chance being blocked evens out over the season.

But you're unlikely to get the extremes of success rates epitomized by Burnley and Liverpool last season.
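As a rough check on how unlikely those extremes are, here's a minimal sketch that treats every attempt as an independent trial blocked at the league rate. It uses the rounded 28% figure, so Burnley's expected count comes out nearer 135 than 137, presumably because the unrounded league rate is a touch higher. Liverpool's block count isn't quoted above, so the 41 used here is inferred from 21% of 197 and is an assumption.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def prob_at_most(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

LEAGUE_RATE = 0.28  # rounded league average quoted in the post

# Burnley: 162 blocks from 482 attempts faced (figures from the post).
burnley_expected = 482 * LEAGUE_RATE
burnley_tail = prob_at_least(162, 482, LEAGUE_RATE)

# Liverpool: the post gives 21% of 197; 41 blocks is an inferred count.
liverpool_expected = 197 * LEAGUE_RATE
liverpool_tail = prob_at_most(41, 197, LEAGUE_RATE)

print(f"Burnley: expected {burnley_expected:.1f}, P(162 or more) = {burnley_tail:.4f}")
print(f"Liverpool: expected {liverpool_expected:.1f}, P(41 or fewer) = {liverpool_tail:.4f}")
```

Both tail probabilities come out small under the "every team is league average" assumption, which is the numerical version of saying the naive model struggles with these two sides.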

You'll improve this cheap and cheerful, TSR type blocking model for predictive purposes by regressing the observed blocking rates of both Liverpool and Burnley towards the mean.

You'll need to regress Liverpool's more because they faced many fewer attempts, but the Reds will still register as below average and the Claret and Blues above.
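One simple way to do that regression, sketched here under assumptions, is to pad each side's record with a fixed number of league average attempts. The `PRIOR_WEIGHT` of 300 is a purely hypothetical value chosen for illustration, not a fitted one, and Liverpool's 41 blocks is again inferred from 21% of 197.

```python
def regressed_rate(blocks, attempts, league_rate, prior_weight):
    """Blend an observed rate with the league rate by adding
    `prior_weight` imaginary attempts blocked at the league rate."""
    return (blocks + prior_weight * league_rate) / (attempts + prior_weight)

LEAGUE_RATE = 0.28   # rounded league average from the post
PRIOR_WEIGHT = 300   # hypothetical strength of the pull towards the mean

burnley_raw = 162 / 482
liverpool_raw = 41 / 197  # 41 is inferred, not quoted in the post

burnley_reg = regressed_rate(162, 482, LEAGUE_RATE, PRIOR_WEIGHT)
liverpool_reg = regressed_rate(41, 197, LEAGUE_RATE, PRIOR_WEIGHT)

# Share of the gap to the league average that the regression removes.
burnley_shrink = (burnley_raw - burnley_reg) / (burnley_raw - LEAGUE_RATE)
liverpool_shrink = (liverpool_reg - liverpool_raw) / (LEAGUE_RATE - liverpool_raw)

print(f"Burnley: {burnley_raw:.3f} -> {burnley_reg:.3f} (gap closed {burnley_shrink:.0%})")
print(f"Liverpool: {liverpool_raw:.3f} -> {liverpool_reg:.3f} (gap closed {liverpool_shrink:.0%})")
```

With any positive weight, the side with fewer attempts faced gets pulled proportionally further towards the league rate, but neither side crosses it.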

In short, you can just use counts and success rates to analyse blocking in the same way as TSR looked at goals, but you can also surmise that the range and difference in blocking ability that you observe may be down to a bit of tactical tinkering/skillsets as well as randomness in limited trials.

In the real world, teams will face widely differing volumes, the "blockability" of attempts will vary and perhaps not even out for all sides and some managers will commit more potential blockers, rather than sending attack minded players to create havoc at the other end of the field.

With more data, and I'm lucky to have access to it in my job, you can easily construct an xB model. And some teams will out perform it (Burnley). But rather than playing the "luck" card you can stress test your model against these outliers.

There's around a 4% chance that a model populated with basic location/shot type/attack type parameters adequately describes Burnley's blocking returns since 2014.
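How that 4% was arrived at isn't spelled out here, but one plausible way to stress test a model like this is to take its block probability for every attempt faced, build the exact distribution of total blocks, and ask how often the model produces at least the observed count. The sketch below does that with entirely hypothetical probabilities and a made-up observed total; none of it is real Burnley data.

```python
def block_count_distribution(block_probs):
    """Exact distribution of total blocks when attempt i is blocked
    independently with probability block_probs[i]."""
    dist = [1.0]  # dist[k] = probability of k blocks so far
    for p in block_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this attempt is not blocked
            nxt[k + 1] += mass * p     # this attempt is blocked
        dist = nxt
    return dist

def prob_at_least(dist, observed):
    """Chance the model yields `observed` blocks or more."""
    return sum(dist[observed:])

# Hypothetical xB model outputs for 100 attempts faced.
model_probs = [0.15] * 40 + [0.30] * 40 + [0.50] * 20
observed_blocks = 36  # hypothetical observed total

dist = block_count_distribution(model_probs)
expected_blocks = sum(model_probs)
tail = prob_at_least(dist, observed_blocks)

print(f"Model expects {expected_blocks:.1f} blocks; "
      f"P({observed_blocks} or more) = {tail:.3f}")
```

A small tail probability says the model, as parameterised, rarely generates returns as good as the ones observed, which is the cue to go looking for what the model is missing.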

That's perhaps a clue that Burnley are a bit different and not just "Stoke" lucky.

The biggest over-performing disconnect is among opponent attempts that Burnley faced that were quite likely to be blocked in the first place. So that's the place to begin looking.

And as blocking ability above and beyond inevitably feeds through into Burnley's likelihood of conceding actual goals, you've got a piece of evidence that may implicate Burnley as being a more acceptable face of over-performance in the wider realms of xG for the enlightened analytical crowd to stomach than Stoke were a decade ago.

Wednesday 22 November 2017

An xG Timeline for Sevilla 3 Liverpool 3.

Expected goals is the most visible public manifestation of a data driven approach to analyzing a variety of footballing scenarios.

As with any metric (or subjective assessment, so beloved of Soccer Saturday) it is certainly flawed, but useful. It can be applied at a player or team level and can be used as the building block to both explain past performance or track and predict future levels of attainment.

Expected goals is at its most helpful when aggregated over a longer period of time to identify the quality of a side's process, and may more accurately predict the course of future outcomes than relying on the more statistically noisy conclusions that arise from simply taking scorelines at face value.

However, it is understandable that xG is also frequently used to give a more nuanced view of a single game, despite the intrusion of heaps of randomness and the frequent tactical revisions that occur because of the state of the game.

Simple addition of the xG values for each goal attempt readily provides a process driven comparison against a final score, but this too has obvious, if easily mitigated flaws.

Two high quality chances, within seconds of each other can hardly be seen as independent events, although a simple summation of xG values will fail to make the distinction.

There were two prime examples from Liverpool's entertaining 3-3 draw in Sevilla, last night.

Both Firmino goals followed on within seconds of another relatively high quality chance, the first falling to Wijnaldum, the second to Mane.

Liverpool may have been overwhelming their hosts in the first half hour, and they were alert enough to have Firmino on hand to pick up the pieces from two high quality failed chances, but a simple summation of these highly related chances must overstate Liverpool's dominance to a degree.

The easy way around this problem is to simulate highly dependent scoring events as such, to prevent two goals occurring from two chances separated by one or two seconds.
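A minimal sketch of that idea: treat the linked attempts as one chain that can produce at most one goal, so the second chance only matters if the first fails. The xG values below are hypothetical and aren't the actual figures for the Wijnaldum and Firmino attempts.

```python
import random

def chain_goal_probability(xg_values):
    """Chance that a chain of linked attempts yields a goal:
    one minus the chance that every attempt in it fails."""
    p_no_goal = 1.0
    for xg in xg_values:
        p_no_goal *= 1 - xg
    return 1 - p_no_goal

def simulate_chain(xg_values, rng):
    """Play the attempts in order and stop at the first goal,
    so the chain can never return more than one goal."""
    for xg in xg_values:
        if rng.random() < xg:
            return 1
    return 0

linked = [0.30, 0.40]  # hypothetical xG for two attempts seconds apart

naive_sum = sum(linked)
as_one_event = chain_goal_probability(linked)

rng = random.Random(2017)
trials = 10_000
simulated = sum(simulate_chain(linked, rng) for _ in range(trials)) / trials

print(f"Summed xG: {naive_sum:.2f}")
print(f"Linked event, exact: {as_one_event:.2f}, simulated: {simulated:.2f}")
```

The summed figure is always at least as large as the linked one, which is the overstatement that a simple xG total bakes in.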

It's also become commonplace to expand on the information provided by the cumulative xG "scoreline" by simulating all attempts in a game, with due allowance for connected events, to quote how frequently each team wins an iteration of this shooting contest and how often the game ends stalemated.

Here's the xG shot map and cumulative totals from last night's match from the InfoGolApp.

There's a lot of useful information in the graphic. Liverpool outscored Sevilla in xG, they had over half a dozen high quality chances, some connected, compared to a single penalty and other, lower quality efforts for the hosts.

Once each attempt is simulated and the possible outcomes summed, Liverpool win just under 60% of these shooting contests, Sevilla 18%, with the remainder drawn.
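For anyone wanting to reproduce this kind of figure, here's a hedged sketch of the shooting contest. Each side's attempts are grouped into chains of connected chances, each chain can score at most once, and the match is replayed many times. The xG values are invented for illustration and are not the InfoGol numbers for this game, so the output won't match the percentages quoted above.

```python
import random

def simulate_goals(chains, rng):
    """Goals for one side in one replay; a chain of linked
    attempts contributes at most one goal."""
    goals = 0
    for chain in chains:
        if any(rng.random() < xg for xg in chain):
            goals += 1
    return goals

def simulate_match(side_a, side_b, iterations=10_000, seed=0):
    """Return (P(a wins), P(draw), P(b wins)) over many replays."""
    rng = random.Random(seed)
    a_wins = draws = b_wins = 0
    for _ in range(iterations):
        a = simulate_goals(side_a, rng)
        b = simulate_goals(side_b, rng)
        if a > b:
            a_wins += 1
        elif a == b:
            draws += 1
        else:
            b_wins += 1
    return a_wins / iterations, draws / iterations, b_wins / iterations

# Hypothetical attempts: inner lists are chances linked by seconds.
visitors = [[0.30, 0.40], [0.10], [0.45, 0.35], [0.50], [0.08], [0.35]]
hosts = [[0.76], [0.05], [0.07], [0.10], [0.12], [0.04], [0.06]]

p_visitors, p_draw, p_hosts = simulate_match(visitors, hosts)
print(f"Visitors win {p_visitors:.1%}, draw {p_draw:.1%}, hosts win {p_hosts:.1%}")
```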

Simulation is an alternative way of presenting xG outputs rather than as totals that accounts for connected events, the variance inherent in lots of lower quality attempts compared to fewer, better chances and also describes most likely match outcomes in a probabilistic way that some may be more comfortable with.

Liverpool "winning" 2.95-1.82 xG may be a more intuitive piece of information for some (although as we've seen it may be flawed by failing to adequately describe distributions and multiple, common events), compared to Liverpool "winning" nearly 6 out of ten such contests.

None of this is ground breaking; I've been blogging about this type of application for xG figures for years. But there's no real reason why we need to wait until the final whistle to run such simulations of the attempts created in a game.

xG timelines have been used to show the accumulation of xG by each team as the game progresses, but suffer particularly from a failure to highlight connected chances.

In a simulation based alternative, I've run 10,000 attempt simulations of all attempts that had been taken up to a particular stage in last night's game.

I've then plotted the likelihood that either Liverpool or Sevilla would be leading, or that the game would be level, based on the outcome of those attempt simulations.
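The mechanics can be sketched roughly as below: keep only the events up to a given minute, replay them 10,000 times and record how often each side leads. The event list, minutes and xG values are all hypothetical stand-ins, and the "home"/"away" labels are just placeholders rather than the real match data.

```python
import random

# (minute, side, xG values of the linked attempts) -- all hypothetical
EVENTS = [
    (1, "away", [0.30, 0.40]),
    (17, "home", [0.05]),
    (19, "home", [0.08]),
    (22, "away", [0.45, 0.35]),
    (30, "away", [0.50]),
    (51, "home", [0.12]),
    (60, "home", [0.76]),
    (75, "home", [0.10]),
    (88, "away", [0.08]),
]

def state_probabilities(events, up_to_minute, iterations=10_000, seed=0):
    """Chance each side leads, or the game is level, using only the
    attempts taken up to and including `up_to_minute`."""
    rng = random.Random(seed)
    played = [e for e in events if e[0] <= up_to_minute]
    counts = {"home": 0, "level": 0, "away": 0}
    for _ in range(iterations):
        goals = {"home": 0, "away": 0}
        for _, side, chain in played:
            # A chain of linked attempts scores at most once.
            if any(rng.random() < xg for xg in chain):
                goals[side] += 1
        if goals["home"] > goals["away"]:
            counts["home"] += 1
        elif goals["home"] < goals["away"]:
            counts["away"] += 1
        else:
            counts["level"] += 1
    return {state: n / iterations for state, n in counts.items()}

timeline = {m: state_probabilities(EVENTS, m) for m in (1, 20, 45, 90)}
for minute, probs in timeline.items():
    print(minute, {state: round(p, 3) for state, p in probs.items()})
```

Plotting the three probabilities against the minute gives the simulation based timeline, with a side that has yet to attempt anything sitting at exactly zero.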

Liverpool's first dual attempt event came in the first minute. Wijnaldum's misplaced near post header, immediately followed by Firmino's far post shot.

Simulated as a single event, there's around a 45% chance Liverpool lead, 55% chance the game is still level and (not having had an attempt yet) a 0% chance Sevilla are ahead.

If you re-run the now four attempt simulation following Nolito's & Ben Yedder's efforts after 19 minutes, a draw is marginally the most likely current state of the game, followed by a lead for either team.

A flurry of high quality chances then makes the Reds near 90% likely to reach half time with a lead, enabling the halftime question as to whether Liverpool are deservedly leading to be answered with a near emphatic, yes.

Sevilla's spirited, if generally low quality second half comeback does eat into Liverpool's likelihood of leading throughout the second half, but it was still a match that the visitors should have returned from with an average of around two UCL points.
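The two point figure follows from weighting three points for a win and one for a draw by the full match simulation percentages. A quick check, using the rounded values quoted earlier (just under 60% for Liverpool, 18% for Sevilla, the rest drawn), so the result is approximate:

```python
def expected_points(p_win, p_draw):
    """Average points per game: three for a win, one for a draw."""
    return 3 * p_win + 1 * p_draw

p_liverpool = 0.60  # rounded from "just under 60%"
p_sevilla = 0.18
p_draw = 1 - p_liverpool - p_sevilla

liverpool_points = expected_points(p_liverpool, p_draw)
sevilla_points = expected_points(p_sevilla, p_draw)

print(f"Liverpool: {liverpool_points:.2f} expected points")
print(f"Sevilla: {sevilla_points:.2f} expected points")
```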