Wednesday 23 August 2017

Chance Quality From 1999.

Back in the late 90's, when Gazza's career was on the wane and what would become football analytics was mainly done in public on gambling newsgroups, shot numbers were the new big thing.

"Goal expectation", calculated as a weighted and smoothed average of a side's actual number of goals from their last x matches, was often the raw material used to work out the chances of Premier League high flyers Leeds beating mid table Tottenham.

Shot numbers (which included headers) then became the new ingredient to throw into the mix and a team's shooting efficiency quickly became a go to stat.

Multi stage precursors to goal expectation models were further developed when shot data became available that was broken down into blocks, misses and on target attempts.

To score, a side had to avoid having their shots blocked, then get them on target and finally beat David James.

This new data allowed you to attach team specific probabilities to each stage of progression towards a goal and arrive at a probabilistic estimate of a team's conversion rate per attempt.

Unlike today's xG number, the figure told you nothing specific about a single shot, nor was it particularly useful in helping to describe the outcome of a single game, even with double digit attempts.

Aggregated over a larger series of matches by necessity, this nuanced conversion rate, that included information about a side's ability to avoid blocks, get their efforts on target and thereafter into the goal, allowed you to deduce something about a side's preferred attacking and defensive style.

Also, if that preference persisted over seasons, this team specific conversion rate could be used alongside each team's raw shot count in the recent past to create a novel, up to date and hopefully predictive set of defensive and attacking performance ratings.

Paper and pencil only lasts slightly longer than today's hard drive, so unfortunately I don't have any "goal expectation" figures for Liverpool circa 2002.

However, with the additional, detailed data from 2017, I decided to re-run these turn of the century, slightly flawed goal expectation models to see if these old school, team specific conversion rates offer anything in today's more data rich climate.

To distinguish them from today's xG I've re named the output as "chance quality".

Chance quality is an averaged likelihood that a side would negotiate the three stages needed to score.
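As a rough sketch of that three stage calculation, each stage can be treated as a conditional probability, with chance quality as their product. The counts below are invented for illustration and the class and function names are only placeholders.

```python
from dataclasses import dataclass


@dataclass
class ShotCounts:
    """Season totals for one team (all figures used below are hypothetical)."""
    attempts: int   # all goal attempts, headers included
    blocked: int    # attempts that were blocked
    on_target: int  # unblocked attempts that were on target
    goals: int      # on target attempts that were scored


def stage_rates(c: ShotCounts) -> tuple[float, float, float]:
    """Return P(not blocked), P(on target | not blocked), P(goal | on target)."""
    unblocked = c.attempts - c.blocked
    return (unblocked / c.attempts,
            c.on_target / unblocked,
            c.goals / c.on_target)


def chance_quality(c: ShotCounts) -> float:
    """Probability that an average attempt clears all three stages."""
    p_unblocked, p_on_target, p_goal = stage_rates(c)
    return p_unblocked * p_on_target * p_goal


example = ShotCounts(attempts=500, blocked=125, on_target=180, goals=60)
print(stage_rates(example))
print(round(chance_quality(example), 3))
```

With raw single season counts the product simply collapses to goals per attempt, so the stylistic information sits in the three stage rates themselves. How each stage was smoothed or regressed at the time is not reconstructed here.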

Arsenal had the highest average chance quality per attempt in 2015/16.

The Gunners were amongst the most likely to avoid having their attempts blocked, those that weren't blocked were most likely to be on target and those that were on target were most likely to result in a goal.

Leicester, in their title winning season, also created high quality chances per attempt, but Tottenham appeared to opt for quantity versus quality. They were mid table for avoiding blocks and finding their target, but their on target attempts were, on average, among the least likely to result in a goal.

Only Palace of the surviving sides were less likely to score with an on target attempt than Spurs.


Here's the same chance quality per attempt, but for attempts allowed, rather than created by the non relegated teams from the 2015/16 season.

The final two columns compare the estimated goal totals for each team, calculated from their shot count in that season and their conversion rate (chance quality) from the previous year, to their actual values.

The thinking back in 2000 was that conversion rate from a previous season remained fairly consistent into the next season and so multiplying a side's chance quality by the number of shots they subsequently took or allowed would give a less statistically noisy estimate of their true scoring abilities.
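The arithmetic behind that estimate is a single multiplication. A minimal sketch, with a made up chance quality and shot count rather than any team's real figures:

```python
def estimated_goals(prior_chance_quality: float, attempts: int) -> float:
    """Goals implied by last season's chance quality and this season's attempts."""
    if not 0.0 <= prior_chance_quality <= 1.0:
        raise ValueError("chance quality is a probability per attempt")
    return prior_chance_quality * attempts


# Hypothetical side: 0.10 goals per attempt last season, 520 attempts this season.
print(round(estimated_goals(0.10, 520), 1))
```

The same function works for the defensive side by passing chance quality allowed and attempts conceded.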

Here's the correlation between the estimated and actual totals using chance quality from 2015/16 and shot numbers from 2016/17 to predict actual goals from 2016/17.


There does appear to be a correlation between average chance quality in a previous year, attempts made the next season and actual goals scored or allowed.
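For anyone wanting to repeat the check, a plain Pearson correlation is one reasonable measure. Which statistic sat behind the original plots is not stated, so treat this as an assumption, and note that the totals below are placeholders, not real team numbers.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between estimated and actual goal totals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal length series with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder totals for five hypothetical sides.
estimated = [52.0, 61.5, 44.0, 70.2, 39.8]
actual = [55, 58, 41, 77, 43]
print(round(pearson_r(estimated, actual), 2))
```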

The correlation is stronger on the defensive side of the ball, perhaps suggesting less tinkering with the back 3, 4 or 5.

With full match video extremely rare in 2000, it might have been tempting to assume chance quality had remained relatively similar for most sides and any discrepancy between actual and predicted was largely a product of randomness.

Fortunately, greater access to granular data, the availability of extensive match highlights and Pulisball, as a primitive benchmark for tactical extremes, have made it easier to recognise that tactical approaches and chance quality often vary, particularly if there is managerial change.

In this post I compared the distribution of xG for Stoke under Pulis' iron grip (fewer, but high chance quality attempts) and his successor Mark Hughes (higher attempt volumes, but lower quality attempts).

Subsequently, under Hughes, Stoke have tended to morph towards the Hughes ideal and away from Pulis' more occasional six yard box offensive free for all.

So a change of manager could lead to a genuine increase or decrease in average chance quality, which in turn might well alter a side's number of attempts. Any use of an updated version of chance quality should come with this important caveat.

For anyone who wants to party like it's 1999, here's the average chance quality per attempt from the 2016/17 season using this pre-Twitter methodology allied to present day location and shot type information.

Use them as a decent multiplier along with shot counts to produce a proxy for the more detailed cumulative xG now available during the upcoming season or as a new data point to assist in describing a side's tactical evolution across seasons.

In 2016/17, Crystal Palace improved their chance quality compared to 2015/16 with half a season of Allardyce and Arsenal maintained their reputation for trying to walk the ball into the net.

All data is from infogolApp, where 2017 expected goals are used to predict and rate the performance of teams in a variety of leagues and competitions.

Monday 14 August 2017

Liverpool's Split Personality

Everyone likes a good mystery and Constantinos Chappas provided the raw material for a great one when he posted this breakdown of Liverpool's points per game performance in 2016/17 against the six teams from Everton and above and against the remaining 13 sides.

It's a great piece of work from Constantinos and Liverpool's split personality when playing very well against title contenders and Everton compared to when they do less well against lower class teams has generated much speculation.

These have generally fallen into two mutually exclusive groups, either narrative based tactical flaws of Klopp and Liverpool or odds based simulations that attempt to explain away the split as mere randomness.

It is unlikely that either approach will wholly account for Liverpool's apparent failure to dispatch mid and lower table teams with the authority they appeared to preserve for the league's stronger sides.

Football is awash with randomness as well as tactical nuances, so it seems much more likely that a combination of factors will have contributed to the 2016/17 season.

It's a simple task to simulate multiple seasons, often using bookmakers' odds as a proxy for team strength, to arrive at the chances that a side, not necessarily Liverpool, might exhibit a split personality.

However, it's a stretch to then conclude that either chance was the overriding factor or it can be excluded as a cause merely because this likelihood falls above or below an arbitrary level of certainty.

There is so much data swirling around football at the moment, particularly ExpG, that it seems helpful to use these numbers to shed some light on Constantinos' intriguing observation.

Rather than a pregame bookmaker's estimate of a side's chances, we have access to ExpG figures for all of Liverpool's 2016/17 matches.

ExpG figures arise from the tactical and talent based interactions that took place on the field and, spread over the 90+ minutes of all 38 games, they perhaps provide a larger sample of events with which to explain a series of game outcomes than 38 individual sets of match odds, however skillfully assembled.

One aspect of a low scoring sport, such as football, where ExpG struggles is how teams adopt different approaches to achieve the aim of winning the most available number of points.

A side may take a fairly comfortable lead early in a contest and then choose to commit more to defence against a weaker or numerically deficient opponent.

An extreme case was Burnley's win over Chelsea, where early actual goals allowed the visitors to concede large amounts of ExpG and just few enough actual ones to handsomely lose the ExpG contest, but win the match.

ExpG figures are inevitably tainted by actual events, such as goals and red cards, but they are still at their most useful when used in conjunction with simulations to attempt to describe the range and likelihood of particular events occurring.

Scoring first (and 2nd and 3rd, along with Chelsea going down to 10 men) was a big assistance to Burnley and Andrew Beasley has written about the importance of the first goal here, for Pinnacle.

If we look at the size of the ExpG figures for all goal attempts in a game and the order in which they arrived, there may be enough data that is not distorted by actual events to estimate which side was most likely to open the scoring, allowing them then to be able to more readily dictate how the game evolves.
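One way to turn an ordered list of attempts into a first goal estimate is to assume each attempt is an independent event whose ExpG is its scoring probability. That is a simplification, and the attempt list below is invented. The chance that a given attempt produces the opening goal is then its ExpG multiplied by the chance that every earlier attempt failed.

```python
def first_goal_probabilities(attempts):
    """attempts: (team, expg) pairs in the order they occurred.

    Returns a dict of team -> probability of scoring first, together with
    the probability that no attempt is scored at all.
    """
    still_goalless = 1.0
    probs = {}
    for team, expg in attempts:
        # This attempt opens the scoring only if nothing earlier went in.
        probs[team] = probs.get(team, 0.0) + still_goalless * expg
        still_goalless *= 1.0 - expg
    return probs, still_goalless


# Hypothetical match: attempts listed chronologically.
match = [("Liverpool", 0.08), ("Opponent", 0.30), ("Liverpool", 0.45),
         ("Liverpool", 0.05), ("Opponent", 0.10)]
probs, goalless = first_goal_probabilities(match)
print({team: round(p, 3) for team, p in probs.items()}, round(goalless, 3))
```

Summing a side's first goal probability over a set of fixtures gives a rough expected number of opening goals to set against the actual count.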

In games against the 13 lowest finishing teams, Liverpool took the initial lead 16 times, compared to a most likely figure of 15.

With the interaction of attempts allowed and taken, Liverpool ended up 1-0 to the good or bad or goalless throughout about as often as their process deserved.

They fared much better against the top teams.

In those 12 games Liverpool took the 1-0 lead nine times compared to a most likely expectation of just six based on the ExpG in their games.

There is only around a 7% chance that an average team would repeat this, given the chances that Liverpool carved out and allowed in those games.
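A tail probability of this kind can be computed exactly once each game has its own chance of the side scoring first. The sketch below builds the distribution of "games in which we scored first" one fixture at a time. The flat 0.5 per game is purely an assumption standing in for the real, game specific ExpG derived values.

```python
def prob_at_least(per_game_probs, k):
    """P(at least k successes) when game i succeeds with probability p_i."""
    dist = [1.0]  # dist[j] = probability of exactly j successes so far
    for p in per_game_probs:
        new = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            new[j] += mass * (1.0 - p)   # did not score first in this game
            new[j + 1] += mass * p       # did score first in this game
        dist = new
    return sum(dist[k:])


# Assumed: a 50% chance of scoring first in each of 12 games (expected count 6).
print(round(prob_at_least([0.5] * 12, 9), 3))
```

Under that flat assumption, nine or more opening goals from 12 games comes out at 299/4096, or about 7.3%, which is at least in the same region as the figure quoted above.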

It's understandable to look to the heights that may be achieved, rather than the lowly foothills left behind.

But based on Liverpool's 2016/17 process from an ExpG and first goal perspective, perhaps their relatively disappointing record against lower grade sides is not the outlier, but rather their exceptional top 6 results are.

Scoring fewer first goals than they actually did in these top of the table clashes would likely decrease their ppg in these games, while inevitably increasing those of their six challengers.

This would shift the top six group gradually to the right in the initial plot and Liverpool slightly more substantially to the left until they perhaps formed a more homogenous group with no outlier.

It's traditional to wind up with "nothing to see, randomness wins again", particularly when one set of data is taken from a small, extreme inducing sample of just 12 inter connected matches per team.

But we now have the data, a place to look and video to see if there is some on pitch, if possibly transient, cause to the effect of Liverpool finding the net first in big games or if the usual suspect in Constantinos' mystery does indeed turn out to be the major guilty party.

All data from @InfoGolApp

Tuesday 8 August 2017

"It's All about The Distribution Part 2"

First the disclaimer, this isn't a "smart after the event" explanation for Leicester's title season.

It is a list of the occasional, nasty or pleasant surprises that can occur and the limitations of trying to second guess these when using a linear, ratings based model.

Models built around numbers and averages do work extremely well for the majority of teams in the majority of seasons.

But as the financial world found to the cost of others, neglecting distributions, especially ones that appear normal, but hide fatter than usual tails can leave you unprepared for the once in a lifetime event.

The previous post looked at a hypothetical five team scenario, where the lowest rated, but under exposed side had a much better chance of winning a contest than implied by the respective ratings, simply because the distribution of potential ratings was markedly different for this side.

Again, full disclosure, this model wasn't from football, it was a five runner race run at Uttoxeter and Team 5 was actually a very lightly raced horse against exposed rivals.

I assumed that the idea that distributions of potential performance sometimes matters also carries over into football and the obvious example of an unconsidered team taking a league by storm was Leicester's 2015/16 title winning season.

I went back to 2014/15 and produced some very simple expected goals ratings for all 20 sides going into the 2015/16 season.

I also looked at how diverse and spread out the performance ratings from 2014/15 were for each side.

The three teams whose performances had fluctuated most, and who might therefore be considered to have a bit more meat in their distribution tails and be less likely to adhere to their "average" expectations, were champions Chelsea, West Ham and Leicester.

I then set up a distribution for each team based around their average rating and the standard deviation from their individual game by game performances in 2014/15.

I then drew from these tailored distributions as a basis to simulate each game in the 2015/16 season, Leicester's winning season.
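How a pair of draws became a match result is not spelled out above, so the sketch below makes its own assumptions: each team's performance is drawn from a normal distribution with its 2014/15 mean and standard deviation, the higher draw wins, draws closer than a hypothetical margin count as a drawn game, and there is no home advantage. The ratings are invented.

```python
import random
from itertools import permutations


def simulate_season(teams, rng, draw_margin=0.25):
    """teams: name -> (mean rating, standard deviation). Returns name -> points."""
    points = {name: 0 for name in teams}
    for home, away in permutations(teams, 2):  # every pair meets twice
        h = rng.gauss(*teams[home])
        a = rng.gauss(*teams[away])
        if abs(h - a) < draw_margin:           # assumed rule for a drawn match
            points[home] += 1
            points[away] += 1
        elif h > a:
            points[home] += 3
        else:
            points[away] += 3
    return points


def finishing_positions(teams, n_seasons=1000, seed=1):
    """Count how often each team finishes in each league position."""
    rng = random.Random(seed)
    counts = {name: [0] * len(teams) for name in teams}
    for _ in range(n_seasons):
        pts = simulate_season(teams, rng)
        # Ties on points are broken at random.
        table = sorted(teams, key=lambda t: (pts[t], rng.random()), reverse=True)
        for position, name in enumerate(table):
            counts[name][position] += 1
    return counts


# Invented ratings: "Erratic" has an average mean but a wide spread.
demo = {"Strong": (1.0, 0.4), "Steady": (0.0, 0.4),
        "Erratic": (0.0, 1.2), "Weak": (-1.0, 0.4)}
print(finishing_positions(demo, n_seasons=500))
```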

And this is how the Foxes and their fellow in and out teams fared in simulations that take from a distribution, rather than a rating.


Leicester project as a top half team, who were as likely to finish in the top two as they were to be relegated and West Ham put themselves about all over the place, but predominately in the top half, which is where they ended up.

Chelsea have a minute chance of ending up tenth, so kudos to Mourinho for breaking this particular model.

There are some really interesting figures emerging today, both for teams and players and usually it's fine to run with the average.

But these averages live in distributions and when these distributions throw up something inevitable, if unexpected, as the bankers found out, someone has to pay.

"It's All About The Distribution".

You've got five teams.

One is consistently the best team. Their recruitment is spot on, with a steady stream of younger replacements ready and able to take over when their stars peak and wane.

Then we've got two slightly inferior challengers, again the model of consistency, with few surprises, either good or bad.

The lowest two rated teams complete the group of five.

The marginally superior of these also turns in performances that only waver slightly from their baseline average.

For the final team, however we have very limited information about their abilities, partly due to a constantly changing line up and new acquisitions.

The current team has been assembled from a variety of unfashionable leagues and we only have a handful of results by which to judge them.

So we group together the initial results of similarly, newly assembled teams to create a larger sample size to describe what we might get from such a team.

Instead of a distribution that resembles the four, more established teams, we get one that is much more inconsistent. Some such teams did well, others very badly.

The distribution of performances for the first four sides is typical of teams from this mini league, whereas the distribution we have chosen to represent the potential upside and downside of this unexposed side is not.

Team 5's distribution has a flatter peak and fatter tails, both good and bad.

The average "ratings" of the five teams are shown below.

Team 5 has the lowest average rating, but by far the largest standard deviation based on the individual ratings of the particular cohort of sides we have chosen to represent them.

As Team 5 is the lowest rated, they're obviously going to finish bottom of the table, a lot, but just to confirm things we could run a simulation based on the distribution of performances for all five teams.

First we need to produce a distribution that mimics the range of performances for the 5 teams and we'll draw a random number from that distribution to decide the outcome of a series of contests.

The highest performance number drawn takes the spoils.

Run 10,000 simulated contests and Team 5 does come last more frequently than any other side, roughly half the tournaments finish with Team 5 in last position.

However, because their profiled performances are inconsistent and populated by a few very good performances, they actually come first more frequently than might be expected from their average performance rating.

In 10,000 simulations, Team 5 comes first 22% of the time, bettered only by Team 1, whose random draw of ratings based on their more conventional distribution of potential performances grants them victory 36% of the time.

Not really what you'd expect simply from eyeballing the raw ratings.
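The contest simulation described above can be sketched in a few lines. The ratings and standard deviations below are hypothetical, since the original table is not reproduced here, and a wide normal distribution stands in for Team 5's flatter, fatter tailed profile, so the percentages it prints will not match the ones quoted.

```python
import random

# Hypothetical (mean rating, standard deviation) for each side.
TEAMS = {
    "Team 1": (100.0, 3.0),
    "Team 2": (97.0, 3.0),
    "Team 3": (96.0, 3.0),
    "Team 4": (93.0, 3.0),
    "Team 5": (90.0, 12.0),  # lowest average, far larger spread
}


def simulate_contests(teams, n_sims=10000, seed=1):
    """Draw one performance per team per contest; highest draw takes the spoils."""
    rng = random.Random(seed)
    firsts = {name: 0 for name in teams}
    lasts = {name: 0 for name in teams}
    for _ in range(n_sims):
        draws = {name: rng.gauss(mean, sd) for name, (mean, sd) in teams.items()}
        firsts[max(draws, key=draws.get)] += 1
        lasts[min(draws, key=draws.get)] += 1
    return firsts, lasts


firsts, lasts = simulate_contests(TEAMS)
for name in TEAMS:
    print(name, f"first {firsts[name] / 10000:.1%}", f"last {lasts[name] / 10000:.1%}")
```

Shrinking Team 5's standard deviation to match the others is a quick way to see how much of their winning chance comes from the spread rather than the average.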

Team 5, based on the accumulated record of teams that have similar limited data, are likely to be sometimes very bad, but occasionally they can produce excellent results.

Such as Leicester when they were transitioning into a title winning team?

As someone once said at an OptaProForum.......

"It's all about the distribution"

......and simple averages can sometimes miss sub populations that could be almost anything.

Straight line assumptions, extrapolated from mere averages will always omit the inevitable uncertainty that surrounds such teams or players, where data is scarce and distribution tails might be fatter than normal.

Friday 4 August 2017

What Might Leicester Get from Kelechi Iheanacho?

Hidden behind Neymar's unveiling in Paris was Kelechi Iheanacho's departure from Manchester City to last season's Champions League quarter finalists, Leicester City.

There's probably no need to measure the height of Iheanacho's transfer fee in piles of tenners, but it does amount to a substantial investment in young talent for the East Midlands side and an opportunity for Kelechi to gain larger amounts of playing time, especially from kick off.

His stats are impressive for a young player.

Any playing time at such a raw age, particularly at a regular title contender is impressive and during his 1275 minutes he's scored 12 from 50 shots (24% conversion rate, without the need for a calculator) and provided 4 assists.

Many appearances have been from the subs bench and it is well known that scoring generally accelerates as the game progresses, so he'll have had a slight boost from that.

He's not really been thrown in solely against the Premier League minnows.

The weighted expected goals conceded by the teams he has faced is only slightly above the league average and he's scored against teams such as Stoke, Spurs, Stoke, Manchester United, Bournemouth, Stoke, Swansea and Southampton.

Nothing too much to worry about him being a flat track bully, although he does quite like Stoke.

In simpler, pre expected goals times, you would take his 24% conversion rate and regress it fairly heavily towards the league average rate to get a more realistic future expectation.

Devoid of any shot location context, Iheanacho's conversion rate since 2015/16 is second only to Llorente at Swansea, another 50 odd attempt player and just ahead of renowned goalscorer, Gary Cahill.

Small samples often lead to unrepresentative extremes and if any media outlet is still quoting raw conversion rates in this enlightened era, they'll probably be disappointed in the long run.

Higher volume shooters over the two seasons Iheanacho's been around in the Premier League are peaking at around 18% conversion rates and, as a group, players with 40 or more attempts are converting around one in ten.

Regressing his 24% rate by around 50% wouldn't have been out of order and back in the day you would probably pitch him in at around a 17% conversion rate, which is still elite, and wait for more data.
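The back of the envelope version of that regression is a weighted blend of the observed rate and the group average. In this sketch the 10% baseline is the "around one in ten" rate for players with 40 or more attempts, and the 50% weight is the rough figure suggested above rather than anything fitted to data.

```python
def regress_to_mean(observed: float, baseline: float, weight: float) -> float:
    """Pull an observed rate towards a baseline; weight=1.0 returns the baseline."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie between 0 and 1")
    return observed + weight * (baseline - observed)


observed_rate = 12 / 50  # 12 goals from 50 attempts
print(round(regress_to_mean(observed_rate, baseline=0.10, weight=0.5), 2))
```

Halfway between 24% and 10% lands on the 17% mentioned above.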

Nowadays, lots of Heisenberg expG models are attempting to extract the truth from lots of noisy data produced by players whose fitness peaks and troughs, along with their team mates and opponents.

Most will put Iheanacho's cumulative expected goals from his 50 attempts at around 9 expG compared to his actual total of 12 goals.

Actual > ExpG, case solved, he's an above average finishing capture.

But this doesn't account for natural randomness in a process or outrageous good fortune (such as the ball hitting you on the back and looping into the net against Swansea in December 2015).

Here's the range of simulated successful outcomes for an average finisher, assuming he could have got onto the end of Iheanacho's 50 attempts.

There's roughly a 14% chance an average Premier League finisher scores as many or more goals than the 12 that Leicester's new signing managed at Manchester City and his highlighted 24% strike rate slightly pales under the scrutiny of shot type and location.
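A simulation of this sort treats each attempt as a weighted coin toss, with the expG as the chance of a goal, and counts how often the total reaches the target. The 50 attempt values below are invented so that they sum to 9 expG. They are not Iheanacho's real chances, so the output will not reproduce the 14% exactly.

```python
import random


def prob_at_least_goals(xg_values, target, n_sims=20000, seed=1):
    """Share of simulations in which an average finisher scores >= target."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        goals = sum(1 for xg in xg_values if rng.random() < xg)
        if goals >= target:
            hits += 1
    return hits / n_sims


# Invented profile: ten big chances plus forty lower quality attempts.
hypothetical_xg = [0.45] * 10 + [0.1125] * 40
print(round(sum(hypothetical_xg), 2))            # cumulative expG of the profile
print(prob_at_least_goals(hypothetical_xg, 12))  # estimated P(12 or more goals)
```

Different mixes of chances with the same cumulative expG give different spreads, which is one reason the attempt by attempt values matter more than the total.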

It's also wise to see if your Heisenberg model at least roughly matches the actual distribution of output from the many guinea pigs who are run through it.... and Iheanacho is initially a pretty poor fit.

The chance that his actual distribution of goals from his attempts is consistent with the model used in the simulations, is only around 1 in 1000.

In these cases it is well worth looking at each attempt, the outcome and the attached expG value.

The problem with Iheanacho fitting the model is that two of his goals come from very low probability chances (including the aforementioned back deflected goal at Swansea) and the remaining ten come from virtually the ten most likely goal scoring opportunities he received.

He's scored one long range shot against Southampton, one with his back against the Swans and then nails almost every high quality chance with an expG above 0.4 that he's presented with.

Mitigate for the fluke and the model fit becomes more forgiving.

Delving into the attempts, looking at the outcomes and seeing where the (imperfect) model breaks down can tell us a lot more about Leicester's £25 million purchase than merely saying "he over-performs his ExpG".

He may thrive on quality chances, he certainly has done in his short time in the Premier League.

Over the previous two campaigns, Manchester City created the second highest proportion of the high quality chances that Iheanacho excels at converting.

Around 7% of Manchester City's created attempts have an ExpG in excess of 0.4 in my model.

Leicester are third in this list over the last two seasons, also with around 7% of their chances being high quality ones, suggesting he's a decent fit for the Foxes.

However, numerically, Manchester City are much more prolific both overall and in this creative area. Their play makers carve out five such highest quality chances every four games, compared to just three for Leicester.

Iheanacho may be able to bridge that gap between the two Cities by his positional nous and undoubted pace, but he'll also be competing with Leicester's main beneficiary of these high quality chances, a quarter of which fell to Jamie Vardy.

In short, just a few caveats to one of the upcoming season's major purchases by a team outside the top six.