Saturday, 13 December 2014

How Well Might Total Shot Ratio in Football Travel?

One of the most familiar of the crossover stats from hockey to football is total shot ratio (TSR). It is often used as a proxy for "dominance" of possession and supposes that a side that constantly out-shoots, or is constantly out-shot by, its opponents will eventually reap its just rewards.
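For anyone new to the stat, TSR is usually calculated as a side's share of all the shots taken in its matches. A minimal sketch follows; the function name and the shot counts are made up for illustration.

```python
def total_shot_ratio(shots_for: int, shots_against: int) -> float:
    """Share of all shots in a side's matches that the side itself took."""
    total = shots_for + shots_against
    if total == 0:
        raise ValueError("no shots recorded")
    return shots_for / total


# Made-up example: 300 shots taken, 200 conceded.
print(total_shot_ratio(300, 200))  # 0.6
```

A side that matches its opponents shot for shot sits at 0.5; anything above that means it is out-shooting them.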

The majority of the football based research has been focused quite naturally on the English Premier League, most notably by James Grayson, who demonstrates both the repeatability and predictive nature of the stat for the EPL.

Very little of the detailed validation that James has undertaken for the EPL has been done on any of the other world leagues, but this hasn't prevented a combination of a respectable TSR and a relatively low league position earning a non-EPL side an "unlucky" tag.

If we first consider the case of possession, it has largely been discarded as a credible measure of team strength. Barcelona's use of the tactic, with varying degrees of success, suffered at the feet of Bayern Munich in the 2013 ECL semi final.

Spain's finest held the majority of the ball, but fell to a 7-0 two-legged thrashing. And in more humble surroundings, Stoke, mostly under Pulis, graciously allowed themselves to be continually "dominated" on all fronts, even at home, but still won enough games to safely avoid the drop each season.

Stoke Shrug & Say No to TSR or Possession Stats.

However, despite possession's decidedly mixed record of success, it is possible to construct a scatter plot of possession against match outcome with a pleasingly angled line of best fit, which appears to show more possession positively linked to more success on the field.

It is only when the top three or four teams are removed from the plot that the correlation for the remaining teams becomes largely random.

The constant presence of a handful of sides that dominate a league, not only in wins but also in possession and shots, simply by outclassing the majority of their opponents, may therefore create a misleadingly strong correlation where only a weak one exists for the majority of teams.

So might the illusion of possession being generally correlated to results be repeated, to some degree and in some leagues, for TSR?

The Scottish Premiership has largely been a two horse race between Rangers (until their demotion following their liquidation) and Celtic. The league has an unusual format, comprising 12 sides, and the table splits in two once each team has played three matches against each of their rivals, enabling a 38 game season to be played out.

In common with the EPL, a Scottish Premiership side's TSR in the first half of the season appears to show a strong correlation to goal difference in the remainder. Therefore, TSR appears to do a good job of predicting a useful future quality of teams in Scotland's top flight.
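The split itself is simple enough to sketch. The snippet below is only an illustration under assumed inputs: the tuple layout, the function name and the four-match "season" are all invented, not the data behind the figures in this post.

```python
def split_season(matches, split_at):
    """Return (TSR over the first `split_at` matches, goal difference over the rest).

    Each match is a hypothetical tuple:
    (shots_for, shots_against, goals_for, goals_against), in playing order.
    """
    first, rest = matches[:split_at], matches[split_at:]
    shots_for = sum(m[0] for m in first)
    shots_against = sum(m[1] for m in first)
    tsr = shots_for / (shots_for + shots_against)
    goal_diff = sum(m[2] - m[3] for m in rest)
    return tsr, goal_diff


# Made-up four-match "season", split after two matches.
season = [(12, 8, 2, 1), (10, 10, 0, 0), (15, 5, 3, 0), (7, 13, 0, 2)]
tsr, gd = split_season(season, 2)
print(tsr, gd)  # 0.55 1
```

Doing this for every side in every season gives one (TSR, future goal difference) point per team-season, which is what goes into the scatter plot.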

The coefficient of correlation is a very healthy 0.71 for the four seasons from 2010/11 onwards. However, the scatterplot is fairly lopsided. There is very nearly daylight between the six points in the top right hand corner (four Celtic seasons and the only two from Rangers) and the rest.

As with possession, where a strong correlation is inferred because of the constant presence of atypical sides which partly outclass the rest, have Celtic and Rangers exaggerated the implied predictive power of TSR for the remainder of the teams, outside of the "Old Firm" by helping to create an impressive r value?

If we crudely remove the six "Old Firm" points from the plot, the scatter plot and the coefficient of correlation between past TSR and future goal difference for the remaining ten Scottish sides per season alter dramatically.

Without Rangers and Celtic, r falls to 0.14, and the predictive power of the relationship falls with it. TSR may still tell us something about the remainder of the season for these non-elite sides, but without the highly likely combination of being regularly out-shot and beaten by either Celtic or Rangers, our conclusions become much less dramatic.
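The effect is easy to reproduce with a toy example. The sketch below uses entirely invented team-seasons (not the Scottish data) and a hand-rolled Pearson correlation; the "elite" and "rest" labels are assumptions made purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented team-seasons: (label, first-half TSR, second-half goal difference).
# "elite" marks two runaway sides; none of these are real figures.
points = [
    ("elite", 0.70, 30), ("elite", 0.68, 28),
    ("rest", 0.52, -5), ("rest", 0.48, 2), ("rest", 0.50, -8),
    ("rest", 0.46, -3), ("rest", 0.51, 1), ("rest", 0.45, -6),
]


def r_for(rows):
    """Correlation between TSR and goal difference for the given rows."""
    return pearson_r([tsr for _, tsr, _ in rows], [gd for _, _, gd in rows])


r_all = r_for(points)
r_rest = r_for([p for p in points if p[0] == "rest"])
print(f"all sides: r = {r_all:.2f}; without elite: r = {r_rest:.2f}")
```

With these made-up numbers the full sample gives a very high r, while the six remaining sides on their own show almost no relationship at all, which is the same pattern in miniature.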

The separation of abilities in the EPL is perhaps less stark than in Scotland's top flight, but a division along the lines of Simon Gleave's "Superior seven and threatened thirteen" has largely existed over the last decade.

Once again, the coefficient of correlation is high if we include all Premiership sides and although the clustering of teams at the middle and bottom isn't as pronounced as in Scotland, there does appear to be an effect.

If we remove every game involving at least one of Simon's superior seven, we get a truncated season in which the threatened thirteen only contest home and away matches against each of the twelve remaining teams from this group.
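As a rough sketch of that filtering step, with placeholder labels rather than real clubs and an invented fixture list:

```python
# Placeholder labels, not real clubs: "S*" stands in for a "superior" side,
# "T*" for a "threatened" side.
superior = {"S1", "S2"}

# Hypothetical fixtures as (home, away) pairs.
fixtures = [("S1", "T1"), ("T1", "T2"), ("T2", "S2"), ("T2", "T1"), ("S1", "S2")]


def drop_games_involving(fixtures, excluded):
    """Keep only fixtures in which neither side belongs to `excluded`."""
    return [(h, a) for h, a in fixtures if h not in excluded and a not in excluded]


truncated = drop_games_involving(fixtures, superior)
print(truncated)  # [('T1', 'T2'), ('T2', 'T1')]

# With 13 sides left, each plays the other 12 home and away.
games_per_side = 2 * (13 - 1)
print(games_per_side)  # 24
```

So each threatened side is left with a 24 game mini-season, split into two halves of 12 for the regression.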

If we now split the season in two and regress a threatened thirteen side's TSR in the first half of the season against their goal difference in the remainder of the season, the coefficient of correlation again falls, this time from 0.73 to 0.29.

Admittedly, the sample size has also fallen in both cases when the very best sides are removed, but the confidence with which TSR is used indiscriminately across, and perhaps within, leagues to evaluate certain types of sides should be questioned.

Just as the past possession record of sides is largely irrelevant to many future outcomes, (even when we can easily produce a scatter plot between possession and outcome to create the illusion of a strong, universal relationship), previous team TSR might also be much less of a factor in the prediction of some types of games.

In conclusion, we seem to get high r values for the relationship between TSR at half season and future goal difference in the remaining games, in leagues where there are a couple of consistently dominant teams, in every aspect of play.

Barca and Real Madrid in Spain, Bayern Munich in Germany, Celtic and a soon to be reunited Rangers in Scotland, the superior seven in the EPL.

The coefficients of correlation for the previous four full seasons for the EPL, Spain, Germany and Scotland are, respectively, 0.77, 0.73, 0.68 and 0.71.

In France, where dominant sides are fewer in number and also less dominant, r is 0.51. In Italy it is 0.53.

And most pertinently perhaps, in the bottom tier of the English football league, where runaway winners are rare, r for TSR to future goal difference from the mid season point over the last four seasons is barely 0.4.

So at the very least, the clear assertion that TSR is a good indicator of likely future performance, in every league and within every league at every level, ranging from the English Championship to the MLS and the Egyptian league to the Australian professional league, should also be backed up by the kind of thorough and rigorous validation that James has carried out for the EPL.

Without that, the strength of any possible correlation can only be guessed at.

Monday, 8 December 2014

Weighted Shots v Unweighted Shots As A Predictor of Future Goal Difference in the EPL.

Tom Tango has recently presented an alternative to Corsi in hockey that weights shots differently depending on whether they resulted in goals, saves, misses or blocks.

One of the logical tests of the new metric is to see how well it correlates to useful team information, such as future goal difference, compared to projecting from previously used metrics, such as unweighted shot differential or ratios.

The expectation voiced in many hockey circles was that because the "Tango" correlated almost perfectly to the traditional Corsi metric, the added information hoped for by weighting different types of shots would be negligible, at best.

In a typically concise and insightful post, here, Tango addresses the issue of the virtually perfect correlation between the two metrics, pointing out that using basic shot data from identical samples to test the correlation to out of sample data, such as future goal difference, gave different coefficients of correlation depending on whether the Corsi or Tango was used.

In short, weighted shots showed higher r values, despite the strong correlation between the two metrics.

r Values for Weighted & Unweighted Shot Differential and Ratios when Correlating to Future Premiership Goal Difference.

After X Games   r for Total Shot Ratio   r for Shot Differential   r for Weighted Shot Differential
2               0.49                     0.51                      0.57
6               0.70                     0.71                      0.77
10              0.70                     0.71                      0.76
15              0.74                     0.74                      0.80
18              0.73                     0.74                      0.80
20              0.73                     0.74                      0.79
24              0.72                     0.73                      0.77
30              0.65                     0.66                      0.69
34              0.55                     0.55                      0.56

Tango's defence of his new metric can be summed up in this extract from the linked post.

"But more amazing is that even though the correlation of Corsi to Tango (both based on the same samples) was close to r=1, when we correlate each to out-of-sample data (in this case, goal differential from OTHER games), Tango correlated at r=.50, while Corsi was r=.44.  Or if you prefer r-squared, it’s .25 to .19, respectively."

I have therefore repeated the exercise for the Premiership, using three flavours of shot based metrics in one part of the season and testing the correlation between these at an individual team level and goal difference for teams in the remainder of the season.

And the weighting of shots also appears to make a difference in soccer as well as in hockey. Correlation peaks around mid-season, but at every stage, weighted shots correlated more strongly with goal difference in the remainder of the season than unweighted shots did.

It also makes intuitive sense to reflect the extra information present in a goal compared to just a shot.

Saturday, 6 December 2014

Using Weighted Shots To Predict Goal Difference in Subsequent Games.

As a follow up to the previous post, and as suggested in the comments by Tango, here is the changing relationship between a side's differentials compared to their opponents (for goals, shots that went wide and shots that were saved) after a certain number of matches and their goal difference in the remainder.

The EPL has 38 games, so, for example, the projection for the goal difference in games 3 to 38 inclusive is found by multiplying the overall GD in games 1 and 2 by 2.9, adding the differential of shots that went wide after two games multiplied by 0.43 and adding the save differential after two games multiplied by 1.24. The constant is universally close to zero.
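As a worked illustration of that after-two-games projection, here is a minimal sketch. The function name is hypothetical, the constant is assumed to be exactly zero (the post only says it is close to zero), and the input differentials are made up.

```python
def project_remaining_gd(goal_diff, wide_diff, saved_diff,
                         coef_gd=2.9, coef_wide=0.43, coef_saved=1.24,
                         constant=0.0):
    """Projected goal difference over the remaining games.

    Default coefficients are the after-two-games values quoted in the text;
    the constant is assumed to be zero.
    """
    return (constant
            + coef_gd * goal_diff
            + coef_wide * wide_diff
            + coef_saved * saved_diff)


# Made-up side after two games: +2 goals, +3 shots wide, +4 shots saved.
print(round(project_remaining_gd(2, 3, 4), 2))  # 12.05
```

That is 2 x 2.9 + 3 x 0.43 + 4 x 1.24 = 5.8 + 1.29 + 4.96. For any other stage of the season, swap in the coefficients from the relevant row of the table.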

r for each regression after each number of games is in the final column and peaks at mid season.

Correlation Between Shot Type Differential in Previous Games & Goal Difference in Remaining Games.

After x   GD            Shots Wide          Shots Saved         Relative Weight   Relative Weight   Relative Weight   r
Games     Coefficient   Diff. Coefficient   Diff. Coefficient   for GD            for Wide Shots    for Saved Shots
2         2.9           0.43                1.24                1                 0.1               0.4               0.57
3         2.5           0.59                0.92                1                 0.2               0.4               0.67
4         2.3           0.40                0.71                1                 0.2               0.3               0.72
5         1.9           0.38                0.57                1                 0.2               0.3               0.76
6         1.8           0.33                0.45                1                 0.2               0.3               0.77
7         1.6           0.25                0.37                1                 0.2               0.2               0.79
8         1.4           0.24                0.30                1                 0.2               0.2               0.79
9         1.2           0.22                0.27                1                 0.2               0.2               0.76
10        1.1           0.17                0.24                1                 0.2               0.2               0.76
11        1.05          0.15                0.19                1                 0.1               0.2               0.77
12        0.99          0.15                0.16                1                 0.1               0.2               0.78
13        0.91          0.13                0.14                1                 0.1               0.2               0.81
14        0.82          0.11                0.14                1                 0.1               0.2               0.81
15        0.75          0.09                0.14                1                 0.1               0.2               0.80
16        0.72          0.09                0.12                1                 0.1               0.2               0.81
17        0.6           0.06                0.14                1                 0.1               0.2               0.81
18        0.55          0.02                0.15                1                 0                 0.3               0.80
19        0.46          0.02                0.14                1                 0                 0.3               0.78
20        0.43          0.01                0.13                1                 0                 0.3               0.79
21        0.39          0.01                0.12                1                 0                 0.3               0.78
22        0.36          0.01                0.10                1                 0                 0.3               0.79
23        0.31          0                   0.09                1                 0                 0.3               0.76
24        0.28          0                   0.09                1                 0                 0.3               0.77
25        0.25          0.01                0.08                1                 0                 0.3               0.75
26        0.20          0.01                0.08                1                 0                 0.4               0.74
27        0.17          0.01                0.07                1                 0                 0.4               0.72
28        0.15          0                   0.07                1                 0                 0.4               0.70
29        0.13          0.01                0.05                1                 0.1               0.4               0.69
30        0.13          0.01                0.04                1                 0.1               0.3               0.69
31        0.11          0.01                0.03                1                 0.1               0.3               0.66
32        0.08          0.01                0.03                1                 0.1               0.3               0.64
33        0.06          0.01                0.02                1                 0.2               0.4               0.60
34        0.05          0.01                0.02                1                 0.1               0.5               0.56
35        0.04          0                   0.02                1                 0                 0.5               0.48
36        0.02          0                   0.02                1                 0                 0.7               0.40
37        0.02          0                   0                   1                 0                 0.2               0.32
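
The relative weight columns appear to be each coefficient divided by the goal difference coefficient, rounded to one decimal place. That reading is an inference from the early rows rather than something stated in the post, and some late-season rows don't reproduce exactly from the rounded coefficients shown. A quick sketch, with a hypothetical function name:

```python
def relative_weights(coef_gd, coef_wide, coef_saved):
    """Express each coefficient relative to the goal difference coefficient.

    Assumes the table's relative weights are simple ratios rounded to one
    decimal place; this is inferred from the table, not stated in the post.
    """
    return (1.0,
            round(coef_wide / coef_gd, 1),
            round(coef_saved / coef_gd, 1))


print(relative_weights(2.9, 0.43, 1.24))  # (1.0, 0.1, 0.4)
print(relative_weights(2.5, 0.59, 0.92))  # (1.0, 0.2, 0.4)
```

Both outputs match the rows for two and three games.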