Thursday 29 June 2017

Big Chance or No Big Chance.

There has been a fair bit of comment recently around big chances and their inclusion or not in shot based expected goals models.

Big chances are, as the name suggests, a partly subjective addition to the Opta data feed which describes a goal attempt.

Along with objective parameters, such as shot location, shot type and pre-shot build-up details, the big chance flag attempts to add information such as the level of defensive pressure or the positioning of the keeper.

While such information may enhance any conclusion about the quality of an individual chance, and assist in converting a purely outcome based approach to team evaluation into a more probabilistic, process based one, it may also fall prey to cognitive biases, such as outcome bias.

I thought I'd quickly build two models, using the Opta data feed we use to power the Infogol app, and see how each performs when put to some of the common uses of an ExpG model.

One model uses big chances (BC), whilst the other does not (NBC).

Such models are primarily used either to describe past matches or to predict future performances, and often both.

Typically, pre-shot data is collected from a previous season or number of seasons, and the relationship between this data and a discrete outcome, such as whether a goal is scored, is found using logistic regression.

We can then use the results of the previously modelled regression to assign the probability that any future chance will result in a goal based on recent historical precedent.
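As a rough illustration of that fit-then-apply workflow, here is a minimal sketch in plain Python. It is not the model built for this post: the two features (a scaled shot distance and a header flag), the synthetic training data and the function names `fit_logistic` and `predict` are all assumptions made up for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, goals, lr=0.3, epochs=1500):
    """Fit logistic regression weights (intercept first) by batch gradient descent."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, goals):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = p - y
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def predict(w, x):
    """Probability that a shot with features x results in a goal."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

# Synthetic "previous season": [distance in tens of yards, header flag].
# The generating coefficients are invented purely for illustration.
rng = random.Random(1)
shots, goals = [], []
for _ in range(500):
    d = rng.uniform(0.3, 3.5)
    h = 1 if rng.random() < 0.2 else 0
    p_true = sigmoid(0.5 - 1.5 * d - 0.7 * h)
    shots.append([d, h])
    goals.append(1 if rng.random() < p_true else 0)

w = fit_logistic(shots, goals)

# Apply the fitted model to "future" chances and sum for a cumulative ExpG.
new_shots = [[0.6, 0], [1.8, 1], [3.0, 0]]
xg = [predict(w, s) for s in new_shots]
total = sum(xg)
print([round(p, 3) for p in xg], round(total, 3))
```

A big chance model would simply add one more column (a 0/1 big chance flag) to each row before fitting.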

The advantage of using ExpG models is that shots are much more numerous than goals, and hopefully the process of chance creation, with an attached probabilistic measurement of success, will better describe a side's underlying abilities than actual goals, which are perhaps more prone to random streaks.

                     Cumulative ExpG Totals for 2015/16 Modelled from 2014/15 Opta Data.

Here are the cumulative ExpG totals for the 2015/16 Premier League, modelled using data from the previous season. These types of figures are often used as a basis to predict the future performance of a side.

The top model doesn't use big chances as a parameter, but the second does. While there is some variation between the models, the correlation between the two, measured in Exp GD, is strong.

For those wishing to use an ExpG approach to produce a probabilistic estimation of team quality, there seems little difference in larger sample sizes between a big or non big chance based model.

It would appear that, in the long term at least, chance quality information is also retrieved from non big chance Opta parameters and more importantly is distributed to individual teams in a similar way to a big chance model.

In short, both models give Exp GD of similar values for most sides.

However, cumulative totals can give near identical values, but be very different at the granular level.

Model BC may assign a much bigger probability to excellent opportunities and smaller ones to weaker opportunities, while model NBC may do the polar opposite and the errors in the latter may fortuitously balance out to give near equal cumulative totals.

The first model would describe future reality better than the second.
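A toy calculation makes the point concrete. The ten shots and both sets of probabilities below are invented for illustration and are not taken from either model in this post. Both hypothetical models sum to the same ExpG, but they disagree about which shots carry it, and a per-shot score such as log loss separates them once a goal comes from one of the better chances.

```python
import math

# Ten hypothetical shots: two good chances followed by eight weak ones.
# One of the good chances is scored.
outcomes = [1, 0] + [0] * 8

# Model A rates the good chances highly, model B does the opposite.
model_a = [0.5, 0.5] + [0.05] * 8
model_b = [0.1, 0.1] + [0.15] * 8

def log_loss(probs, goals):
    """Total negative log likelihood of the outcomes (lower is better)."""
    return -sum(math.log(p if g else 1 - p) for p, g in zip(probs, goals))

total_a = sum(model_a)   # cumulative ExpG under model A
total_b = sum(model_b)   # cumulative ExpG under model B
loss_a = log_loss(model_a, outcomes)
loss_b = log_loss(model_b, outcomes)

print(round(total_a, 2), round(total_b, 2))
print(round(loss_a, 3), round(loss_b, 3))
```

The cumulative totals match, yet the model that puts its probability on the right shots scores better shot by shot.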

To test both models, I arranged the goal attempts for each of the 20 teams in ascending order of chance quality, divided these into groups and then compared the actual number of goals scored in each of these subsets to the number predicted by each model.

                      How Well Does the Predicted Distribution of Outcomes Match Reality.

(Green = acceptable match, brown = poor match).

The results of this goodness of fit test are shown above.

Where the probabilistic model prediction for each subset largely agrees with the actual distribution of outcomes for 2015/16, we get a large p value. There's a decent chance that the variation we see between prediction and reality is just down to chance.
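For anyone wanting to try this kind of check, here is a minimal sketch. The exact test used for the table isn't spelled out here, so this is an assumption: a Hosmer-Lemeshow style grouped statistic, with the p value estimated by simulating seasons from the model's own probabilities rather than read from a chi-square table. The function names, the five groups and the example data are all hypothetical.

```python
import random

def grouped_statistic(probs, goals, n_groups=5):
    """Sort shots by model probability, split them into groups and sum
    (observed - expected)^2 / variance over the groups.
    Assumes every probability is strictly between 0 and 1."""
    n = len(probs)
    order = sorted(range(n), key=lambda i: probs[i])
    stat = 0.0
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        expected = sum(probs[i] for i in idx)
        variance = sum(probs[i] * (1 - probs[i]) for i in idx)
        observed = sum(goals[i] for i in idx)
        stat += (observed - expected) ** 2 / variance
    return stat

def simulated_p_value(probs, goals, n_groups=5, n_sims=1000, seed=0):
    """Share of simulated seasons whose mismatch with the model is at
    least as large as the one actually observed."""
    rng = random.Random(seed)
    actual = grouped_statistic(probs, goals, n_groups)
    hits = 0
    for _ in range(n_sims):
        fake = [1 if rng.random() < p else 0 for p in probs]
        if grouped_statistic(probs, fake, n_groups) >= actual:
            hits += 1
    return hits / n_sims

# Invented example: 400 shots for one team, with outcomes drawn from the
# model itself, so a small p value should be the exception, not the rule.
rng = random.Random(42)
team_probs = [rng.uniform(0.02, 0.45) for _ in range(400)]
team_goals = [1 if rng.random() < p else 0 for p in team_probs]
p_value = simulated_p_value(team_probs, team_goals)
print(round(p_value, 3))
```

A small p value says the observed goals in the groups sit further from the model's predictions than most simulated seasons do.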

Using the usual 5% threshold, there are two teams from the model constructed without big chances where the actual distribution of outcomes is so far removed from the predictions that chance may be largely ruled out as the cause.

In this case, Liverpool and Stoke.

The model constructed with big chances included as a variable has three teams where chance looks an unlikely candidate for the variation seen in the two distributions. Liverpool (again), Everton and Swansea.

So while cumulative ExpG values tend to show only small variations between a BC and a non BC model, differences do emerge at a more granular level, and these differences, for this season and these two models, do not appear to be systematically in favour of either the BC or the non BC model.

In short, ExpG is a product of a model and all models vary. These differences, and the conclusions we draw from them, may be most evident in smaller shot samples.
