Saturday 13 December 2014

How Well Might Total Shot Ratio in Football Travel?

One of the most familiar of the crossover stats from hockey to football is total shot ratio. It is often used as a proxy for "dominance" of possession and supposes that a side that constantly out-shoots, or is constantly out-shot by, its opponents will eventually reap its just rewards.

The majority of the football based research has been focused quite naturally on the English Premier League, most notably by James Grayson, who demonstrates both the repeatability and predictive nature of the stat for the EPL.

Very little of the detailed validation that James has undertaken for the EPL has been done on any of the other world leagues, but this hasn't prevented a combination of a respectable TSR and a relatively low league position earning a non-EPL side an "unlucky" tag.

If we first consider the case of possession, it has largely been discarded as a credible measure of team strength. Barcelona's use of the tactic, with varying degrees of success, suffered at the feet of Bayern Munich in the 2013 ECL semi-final.

Spain's finest held the majority of the ball, but fell to a 7-0 two-legged thrashing. And in more humble surroundings, Stoke, mostly under Pulis, graciously allowed themselves to be continually "dominated" on all fronts, even at home, but still won enough games to safely avoid the drop each season.

Stoke Shrug & Say No to TSR or Possession Stats.
However, despite possession's decidedly mixed record of success, it is possible to construct a scatter plot of possession against match outcome with a pleasingly angled line of best fit that appears to show more possession positively linked to more success on the field.

It is only when the top three or four teams are removed from the plot that the correlation for the remaining teams becomes largely random.

The constant presence of a handful of sides which dominate in a league, not only in wins, but in possession and shots by simply outclassing the majority of their opponents, therefore, may potentially create a misleading correlation, where only a weak one exists for the majority of teams.

So might the illusion of possession being generally correlated to results be repeated, to some degree and in some leagues, for TSR?

The Scottish Premiership has largely been a two horse race between Rangers (until their demotion following their liquidation) and Celtic. The league has an unusual format: it comprises 12 sides, and the table splits in two once each team has played three matches against each of its rivals, enabling a 38 game season to be played out.

In common with the EPL, a Scottish Premiership side's TSR in the first half of the season appears to show a strong correlation to goal difference in the remainder. Therefore, TSR appears to do a good job of predicting a useful future quality of teams in Scotland's top flight.
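As a rough illustration of the two quantities being compared, here is a minimal sketch that assumes TSR is the usual shots for divided by total shots in a team's matches. The `Match` record, the function names and the four-match "season" are all invented for the example and are not taken from the data behind this post.

```python
from typing import NamedTuple


class Match(NamedTuple):
    # One team's view of a single match (hypothetical record layout).
    shots_for: int
    shots_against: int
    goals_for: int
    goals_against: int


def total_shot_ratio(matches):
    """Share of all shots in these matches that were taken by the team."""
    shots_for = sum(m.shots_for for m in matches)
    shots_against = sum(m.shots_against for m in matches)
    return shots_for / (shots_for + shots_against)


def goal_difference(matches):
    """Goals scored minus goals conceded over these matches."""
    return sum(m.goals_for - m.goals_against for m in matches)


def split_season(matches, split_at):
    """Return (TSR over the first split_at matches, goal difference over the rest)."""
    return total_shot_ratio(matches[:split_at]), goal_difference(matches[split_at:])


# Invented four-match season, split after two matches.
season = [Match(12, 8, 2, 1), Match(9, 11, 0, 0), Match(15, 5, 3, 0), Match(7, 13, 1, 2)]
early_tsr, later_gd = split_season(season, 2)
print(early_tsr, later_gd)
```

One such pair per team per season gives the points for the scatter plot, and the correlation is then taken across those points.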

The coefficient of correlation is a very healthy 0.71 for the four seasons from 2010/11 onwards. However, the scatterplot is fairly lopsided: there is very nearly daylight between the six points in the top right-hand corner (four Celtic seasons and the only two from Rangers) and the rest.

As with possession, where a strong correlation is inferred because of the constant presence of atypical sides which partly outclass the rest, have Celtic and Rangers exaggerated the implied predictive power of TSR for the remainder of the teams, outside of the "Old Firm" by helping to create an impressive r value?

If we crudely remove the six "Old Firm" points from the plot, the scatter plot and the coefficient of correlation between past TSR and future goal difference for the remaining ten Scottish sides per season alter dramatically.

If we remove Rangers and Celtic, r falls to 0.14, as does the predictive power of the relationship. TSR may still tell us something about the remainder of the season for these non-elite sides, but without the near-certain combination of being regularly out-shot and beaten by Celtic or Rangers, our conclusions become much less dramatic.
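The effect is easy to reproduce with made-up numbers. The sketch below uses invented (first-half TSR, later goal difference) pairs, not the Scottish data, with ten closely matched sides and two dominant ones. It computes a plain Pearson r with and without the dominant pair.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def r_for(points):
    """Pearson r for a list of (tsr, goal_difference) pairs."""
    xs, ys = zip(*points)
    return pearson_r(xs, ys)


# All values below are invented purely to illustrate the outlier effect.
rest = [(0.44, -3), (0.46, 6), (0.47, -8), (0.48, 2), (0.49, -10),
        (0.50, 4), (0.51, -1), (0.52, -6), (0.53, 8), (0.55, -2)]
elite = [(0.68, 35), (0.70, 40)]

r_all = r_for(rest + elite)   # strong: driven by the two dominant points
r_rest = r_for(rest)          # weak: little relationship among the others
print(round(r_all, 2), round(r_rest, 2))
```

Two points far from the cloud are enough to produce an impressive r for the whole group, even though TSR says very little about the other ten sides.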

The separation of abilities in the EPL is perhaps less stark than in Scotland's top flight, but a division along the lines of Simon Gleave's "Superior seven and threatened thirteen" has largely existed over the last decade.

Once again, the coefficient of correlation is high if we include all Premiership sides and although the clustering of teams at the middle and bottom isn't as pronounced as in Scotland, there does appear to be an effect.

If we remove every game involving at least one of Simon's superior seven, we get a truncated season in which each of the threatened thirteen only contests home and away matches against the other twelve teams from this group.

If we now split the season in two and regress a threatened thirteen side's TSR in the first half of the season to their goal difference in the remainder of the season, the coefficient of correlation again falls, this time from 0.73 to 0.29.
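The truncation step amounts to a simple filter on the fixture list. This is only a sketch: the team names, the dictionary layout and the `elite` set are placeholders, not the actual superior seven or the data used here.

```python
def drop_elite_matches(matches, elite):
    """Keep only matches in which neither side belongs to the elite group."""
    return [m for m in matches if m["home"] not in elite and m["away"] not in elite]


# Hypothetical fixtures; "Team A" and "Team B" stand in for the elite sides.
elite = {"Team A", "Team B"}
matches = [
    {"home": "Team A", "away": "Team C"},
    {"home": "Team C", "away": "Team D"},
    {"home": "Team D", "away": "Team B"},
    {"home": "Team E", "away": "Team C"},
]
kept = drop_elite_matches(matches, elite)

# With thirteen non-elite sides, each plays the other twelve home and away.
games_per_team = 2 * (13 - 1)
print(len(kept), games_per_team)
```

Each threatened thirteen side is therefore left with a 24 game season, which is part of why the sample shrinks.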

Admittedly, the sample size has also fallen in both cases when the very best sides are removed, but the confidence with which TSR is used indiscriminately across and perhaps within leagues to evaluate certain types of sides perhaps should be questioned.

Just as the past possession record of sides is largely irrelevant to many future outcomes (even when we can easily produce a scatter plot between possession and outcome to create the illusion of a strong, universal relationship), previous team TSR might also be much less of a factor in the prediction of some types of games.

In conclusion, we seem to get high r values for the relationship between TSR at half season and future goal difference in the remaining games, in leagues where there are a couple of consistently dominant teams, in every aspect of play.

Barca and Real Madrid in Spain, Bayern Munich in Germany, Celtic and a soon to be reunited Rangers in Scotland, the superior seven in the EPL.

The coefficients of correlation for the previous four full seasons for the EPL, Spain, Germany and Scotland are, respectively, 0.77, 0.73, 0.68 and 0.71.

In France, where dominant sides are fewer in number and also less dominant, r is 0.51; in Italy it is 0.53.

And most pertinently perhaps, in the bottom tier of the English football league, where runaway winners are rare, r for TSR to future goal difference from the mid season point over the last four seasons is barely 0.4.

So at the very least, the clear assertion that TSR is a good indicator of likely future performance, in every league and within every league at every level, ranging from the English Championship to the MLS and the Egyptian league to the Australian professional league, should also be backed up by the kind of thorough and rigorous validation that James has carried out for the EPL.

Without that, the strength of any possible correlation can only be guessed at.

Monday 8 December 2014

Weighted Shots v Unweighted Shots As A Predictor of Future Goal Difference in the EPL.

Tom Tango has recently presented an alternative to Corsi in hockey that weights shots differently depending on whether they resulted in goals, saves, misses or blocks.

One of the logical tests of the new metric is to see how well it correlates to useful team information, such as future goal difference, compared to projecting from previously used metrics, such as unweighted shot differential or ratios.

The expectation voiced in many hockey circles was that because the "Tango" correlated almost perfectly to the traditional Corsi metric, the added information hoped for by weighting different types of shots would be negligible, at best.

In a typically concise and insightful post, here, Tango addresses the issue of the virtually perfect correlation between both metrics, pointing out that using basic shot data from identical samples to test the correlation to out-of-sample data, such as future goal difference, gave different coefficients of correlation depending on whether Corsi or Tango was used.

In short, weighted shots showed higher r values, despite the strong correlation between the two metrics.

r Values for Weighted & Unweighted Shot Differential and Ratios when Correlating to Future Premiership Goal Difference.

After X Games   r for Total Shot Ratio   r for Shot Differential   r for Weighted Shot Differential
2               0.49                     0.51                      0.57
6               0.70                     0.71                      0.77
10              0.70                     0.71                      0.76
15              0.74                     0.74                      0.80
18              0.73                     0.74                      0.80
20              0.73                     0.74                      0.79
24              0.72                     0.73                      0.77
30              0.65                     0.66                      0.69
34              0.55                     0.55                      0.56

Tango's defence of his new metric can be summed up in this extract from the linked post.

"But more amazing is that even though the correlation of Corsi to Tango (both based on the same samples) was close to r=1, when we correlate each to out-of-sample data (in this case, goal differential from OTHER games), Tango correlated at r=.50, while Corsi was r=.44.  Or if you prefer r-squared, it’s .25 to .19, respectively."

I have therefore repeated the exercise for the Premiership, using three flavours of shot based metrics in one part of the season and testing the correlation between these at an individual team level and goal difference for teams in the remainder of the season.

And the weighting of shots also appears to make a difference in soccer as well as in hockey. Correlation peaks around mid-season, but at every stage weighted shots showed a stronger correlation to goal difference in the remainder of the season than unweighted shots.

It also makes intuitive sense to reflect the extra information present in a goal compared to just a shot.

Saturday 6 December 2014

Using Weighted Shots To Predict Goal Difference in Subsequent Games.

As a follow-up to the previous post, and as suggested in the comments by Tango, here is the changing relationship between a side's weighted shot differentials (for goals, shots that went wide and shots that were saved) after a certain number of matches and its goal difference in the remainder of the season.

The EPL season has 38 games. So, for example, the projection for the goal difference in games 3 to 38 inclusive is found by multiplying the overall GD in games 1 and 2 by 2.9, adding the differential of shots that went wide after two games multiplied by 0.43 and adding the save differential after two games multiplied by 1.24. The constant is universally close to zero.
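The projection described above can be sketched in a few lines. The coefficients are copied from three rows of the table in this post. The function name, the zero default for the constant and the example differentials are my own assumptions for illustration.

```python
# games played -> (GD coefficient, wide-shot coefficient, saved-shot coefficient)
# Values copied from the table in this post for 2, 6 and 10 games played.
COEFFICIENTS = {
    2: (2.9, 0.43, 1.24),
    6: (1.8, 0.33, 0.45),
    10: (1.1, 0.17, 0.24),
}


def project_remaining_gd(games_played, gd, wide_diff, saved_diff, intercept=0.0):
    """Projected goal difference over the rest of the season.

    The intercept defaults to zero because the post only says the constant
    is close to zero; its exact value is not given.
    """
    b_gd, b_wide, b_saved = COEFFICIENTS[games_played]
    return intercept + b_gd * gd + b_wide * wide_diff + b_saved * saved_diff


# Invented side: +1 GD, +3 wide-shot differential, +2 saved-shot differential after two games.
example = project_remaining_gd(2, gd=1, wide_diff=3, saved_diff=2)
print(round(example, 2))  # 2.9*1 + 0.43*3 + 1.24*2 = 6.67
```

The same function with the ten-game row shrinks every coefficient, reflecting the smaller number of games left to project.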

r for each regression after each number of games is in the final column and peaks at mid season.

Correlation Between Shot Type Differential in Previous Games & Goal Difference in Remaining Games.

After x Games  GD Coeff.  Wide Coeff.  Saved Coeff.  Rel. Weight GD  Rel. Weight Wide  Rel. Weight Saved  r
2              2.9        0.43         1.24          1               0.1               0.4                0.57
3              2.5        0.59         0.92          1               0.2               0.4                0.67
4              2.3        0.40         0.71          1               0.2               0.3                0.72
5              1.9        0.38         0.57          1               0.2               0.3                0.76
6              1.8        0.33         0.45          1               0.2               0.3                0.77
7              1.6        0.25         0.37          1               0.2               0.2                0.79
8              1.4        0.24         0.30          1               0.2               0.2                0.79
9              1.2        0.22         0.27          1               0.2               0.2                0.76
10             1.1        0.17         0.24          1               0.2               0.2                0.76
11             1.05       0.15         0.19          1               0.1               0.2                0.77
12             0.99       0.15         0.16          1               0.1               0.2                0.78
13             0.91       0.13         0.14          1               0.1               0.2                0.81
14             0.82       0.11         0.14          1               0.1               0.2                0.81
15             0.75       0.09         0.14          1               0.1               0.2                0.80
16             0.72       0.09         0.12          1               0.1               0.2                0.81
17             0.6        0.06         0.14          1               0.1               0.2                0.81
18             0.55       0.02         0.15          1               0                 0.3                0.80
19             0.46       0.02         0.14          1               0                 0.3                0.78
20             0.43       0.01         0.13          1               0                 0.3                0.79
21             0.39       0.01         0.12          1               0                 0.3                0.78
22             0.36       0.01         0.10          1               0                 0.3                0.79
23             0.31       0            0.09          1               0                 0.3                0.76
24             0.28       0            0.09          1               0                 0.3                0.77
25             0.25       0.01         0.08          1               0                 0.3                0.75
26             0.20       0.01         0.08          1               0                 0.4                0.74
27             0.17       0.01         0.07          1               0                 0.4                0.72
28             0.15       0            0.07          1               0                 0.4                0.70
29             0.13       0.01         0.05          1               0.1               0.4                0.69
30             0.13       0.01         0.04          1               0.1               0.3                0.69
31             0.11       0.01         0.03          1               0.1               0.3                0.66
32             0.08       0.01         0.03          1               0.1               0.3                0.64
33             0.06       0.01         0.02          1               0.2               0.4                0.60
34             0.05       0.01         0.02          1               0.1               0.5                0.56
35             0.04       0            0.02          1               0                 0.5                0.48
36             0.02       0            0.02          1               0                 0.7                0.40
37             0.02       0            0             1               0                 0.2                0.32

Thursday 4 December 2014

The Weighting Game.

Shot counts versus goal counts as a predictor of future performance is a debate that is being fought out not only in football, but also in hockey. Sample size is at the heart of the issue. Goals are obviously more important in terms of who wins the match, but they are relatively rare events. Shots, by contrast, accumulate at a faster rate, building up sample size, but play only an intermediate role in deciding the outcome.

It is perhaps unfortunate that a distinction has arisen between shots and goals because they are merely classifications of a single larger group. Namely, they are all goal attempts, but with different actual outcomes.

Goals are shots (or headers) that result in a goal, saves are on target shots that are saved and misses are shots that go high or wide of the target.

The most recent rumblings from hockey arise from renowned sabermetrician Tom Tango's use of different types of shots (he also includes blocked efforts) from the first half of a season to predict goals, or specifically goal differential, from the second half.

He uses the different types of shot differential, with appropriate weightings in the first half of a season to predict goal differential in the second. The post can be found here and the application to football is obvious.

I have therefore adapted a similar approach using data from Joe B's football data site (so there is no blocked shot data as a separate category). The aim was slightly different. I set out to determine the final goal difference for EPL teams, based on their goal difference and shooting differential at various times during the season.

Final goal difference is strongly correlated to finishing position and with the odd exception finishing position is also related to team strength. And knowing where a side is likely to finish well before they actually arrive at that position is an obvious advantage if we wish to know how they are likely to perform during that 38 game journey.

I therefore split goal attempts into shots that went into the net (or goals), attempts that were saved and off target attempts and totaled the cumulative differential between each Premiership team and their opponents from the second match of the season until the penultimate game.

For example, after 14 games, Arsenal currently have a +7 goal difference, a +91 differential in shots that went wide and +34 differential in shots that were saved.

I then regressed these differentials against the final goal difference after the 38th game to get the changing relationship between the three variables and the side's ultimate goal difference after two games all the way up to 37 games played.

All three types of shots are important in predicting the final goal difference of teams. But the relative importance of shots that are saved or go wide in predicting future goal difference declines in relation to the importance of shots that result in a goal (or goals for short) as the number of games increases.

In addition, the values of the coefficients are also dependent upon how many matches are in the sample. The respective goals, wide shots/headers and saved shots/headers coefficients are 3.91, 0.43 and 1.24 when calculated after just two matches and, as you would probably expect, 1.02, 0 and 0 after 37.

So far this season each team has played 14 matches and the coefficients for current goals, wide shots and save differentials when used to predict future final goal difference are respectively 1.91, 0.13 and 0.14. If we use these figures for each team, the final projected league goal difference for each side is as shown below.

Projected Final Goal Difference Using Shot Differentials After 14 Games.

[Table: team against projected GD. Only the team labels survive: Man City, Man Utd, West Ham, West Brom, C Palace, Aston Villa.]
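To make the arithmetic behind one projection concrete, here is a short sketch using the coefficients quoted above (1.91, 0.13 and 0.14) and the Arsenal differentials quoted earlier (+7 goals, +91 wide, +34 saved). The function name is hypothetical, and the regression constant is assumed to be zero because it isn't quoted, so the result need not match the published projection exactly.

```python
def project_final_gd(gd, wide_diff, saved_diff, coefs=(1.91, 0.13, 0.14), intercept=0.0):
    """Projected final goal difference from the current differentials.

    coefs holds the goal, wide-shot and saved-shot coefficients quoted in
    the post; intercept=0.0 is an assumption, not a figure from the post.
    """
    b_gd, b_wide, b_saved = coefs
    return intercept + b_gd * gd + b_wide * wide_diff + b_saved * saved_diff


# Arsenal after 14 games, using the differentials quoted in the post.
arsenal = project_final_gd(gd=7, wide_diff=91, saved_diff=34)
print(round(arsenal, 2))  # 13.37 + 11.83 + 4.76 = 29.96
```

Roughly 13 of those projected goals come from the current goal difference and the rest from the two shot differentials, which is how a side with a modest goal difference can still project well.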

At the moment these figures are merely another rating system, albeit one that appears to reasonably predict the likely quality of the current side. Villa, for example, appear to have been fortunate in the way in which they have won numerous single goal victories, and a wider appraisal incorporating extra shot information reduces their rating compared to their current league position.

To illustrate how the projections have fluctuated for a single team, here's how the projected final goal difference has varied for Arsenal using the updated coefficients after each game week of the 2014/15 campaign to date.

Projected Final Goal Difference For Arsenal from Shot Differentials Updated Weekly.

Games Played by Arsenal   Final GD Projection
2                         +13
3                         +23
4                         +16
5                         +24
6                         +25
7                         +17
8                         +22
9                         +24
10                        +31
11                        +27
12                        +26
13                        +27
14                        +28

We can demonstrate the use of such ratings and perhaps their predictive potential by converting these weighted shot derived ratings into match odds and comparing them with a reliable benchmark, such as the current bookmaking odds.

Stoke entertain Arsenal on Saturday. Arsenal's projected final goal difference is a rounded up +28, Stoke's is an also rounded up -4. Or +0.74 and -0.1 per game, conveniently in the currency of goals.

Home field is running at 0.38 of a goal. Arsenal rate 0.84 of a goal per game better than Stoke (0.74 plus 0.10), so they are 0.84 - 0.38 = 0.46 of a goal superior away to Stoke.

Arsenal should be, based on our projections, 0.46 of a goal superior, on average at the Britannia. If we run this figure through a Poisson, we might get a 47% chance of an Arsenal win, 26% for Stoke and 27% the draw. Best odds, as of Thursday night are 50%, 23% and 27%.
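For anyone wanting to try the Poisson step, here is a minimal sketch. It assumes two independent Poisson scores and an average of 2.6 total goals per match. Both are my assumptions rather than details given in the post, so the output need not reproduce the percentages quoted above exactly.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k goals for a Poisson with the given mean."""
    return exp(-mean) * mean ** k / factorial(k)


def match_odds(away_supremacy, total_goals=2.6, max_goals=10):
    """Return (home win, away win, draw) probabilities.

    total_goals=2.6 is an assumed match average; scores above max_goals
    are ignored, which loses a negligible amount of probability.
    """
    away_mean = (total_goals + away_supremacy) / 2
    home_mean = (total_goals - away_supremacy) / 2
    home = away = draw = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_mean) * poisson_pmf(a, away_mean)
            if h > a:
                home += p
            elif a > h:
                away += p
            else:
                draw += p
    return home, away, draw


# Per-game ratings from the projected final goal differences over 38 games.
arsenal_rating = 28 / 38
stoke_rating = -4 / 38
home_advantage = 0.38
supremacy = arsenal_rating - stoke_rating - home_advantage  # about 0.46

stoke_win, arsenal_win, draw = match_odds(supremacy)
print(round(stoke_win, 2), round(arsenal_win, 2), round(draw, 2))
```

A different total goals figure, or a model that inflates draws, would shift the three numbers a little, which is one reason to compare against the market rather than treat the output as exact.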

So the projections broadly agree with a robust business model, at least for the Potters, and below I've applied the method to the remaining games this weekend.

Odds Derived from Shot Differentials and Final Goal Difference Projections for Week 15.

Game (Home Team First!)   Home Win % (Predicted/Best Price)   Away Win %   Draw %
Man City v Everton        67/63                               14/16        19/21
Liverpool v Sunderland    62/64                               15/14        23/22
Newcastle v Chelsea       19/13                               57/63        24/24
Spurs v C Palace          53/63                               21/14        26/23
Stoke v Arsenal           26/23                               47/50        27/27
QPR v Burnley             47/47                               26/25        27/28
WHU v Swansea             49/41                               24/30        27/29
A Villa v Leicester       45/43                               28/28        27/29
Southampton v Man U       48/36                               25/37        27/27
Hull v WBA                39/40                               33/31        28/29

The majority of the odds fall within touching distance of those available to bet on and those few that don't do so for rational reasons, such as Manchester United's chaotic "getting to know you" phase, combined with Southampton's recent injuries.

By weighting shot types and applying coefficients appropriate to the number of matches played, it appears possible to project team strength with sufficient accuracy to mimic the bookmakers' appraisal of Premiership teams.

Tuesday 2 December 2014

Dixie Dean. Head and Shoulders Above the Rest.

In this post I looked at the goal scoring record of Wayne Rooney compared to other England international strikers and particularly the difficulty of comparing scoring feats spread across very different eras when scoring environments varied.

The game has undergone many fundamental changes since its inception in the late 19th century, either tactical or through tinkering with the laws of the game, most notably the offside rule. And the effects of these changes can be seen in the average number of goals that were scored in total in league matches contested in the top flight of English football.

The early matches were particularly goal laden, briefly averaging just over 4.5 goals per match, but with noticeable post war peaks occasionally arresting the decline, the average has settled at a level just above 2.5 goals per match.

Therefore, the number of goals a top striker might expect to claim in a season was largely dependent upon the goal environment when he played and the maximum number of games he could and did actually play.

Of course, the record for the most league goals scored in a division one season lies with Everton legend Dixie Dean, who scored 60 goals in the 1927-28 season. However, not only was that season played in the midst of a post Great War scoring peak following the relaxation of the offside rule (an average of 3.82 league goals were scored per match), but the season consisted of a maximum of 42 games.

Uneven numbers of games can be partly accounted for by taking individual goals per game. Dean, as far as I can find, appeared to play in only 39 of the possible 42, although his scoring record was bolstered by penalty kicks, a luxury not available to the very earliest players.

Even with the occasional spot kick, Dean's record of 1.54 goals per game is astonishing. But to attempt to level the playing field further we can use the general relationship between the goals per game rate of scoring of the division's top scorers and the goal environment that prevailed at the time, denoted by the average goals scored per game.

For simplicity, the regression indicates that the top goal scorers across the eras have scored their goals per game at a quarter the rate of the average total goals per match during that season.

So Dean, playing when a match might average nearly four goals in total, would have been expected to score a goal a game, were he to follow the habit of the league's leading scorers across the ages.

Leading Scorers Relative to the Goal Environment at the Time

From the table above, even though Dean was playing when goals were more common and the top tier was an expanded version of its current form, his record is unsurpassed. No player gets close to his over performance compared to the expected rate based on the goal environment in 1927.

If Dean had performed to the typical level of a league leading scorer, he would have been expected to claim 39 goals in 1927-28, rather than his actual total of 60.
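The adjustment is simple enough to check by hand. The sketch below applies the simplified quarter-rate rule to the 1927-28 figures quoted above. The function name is hypothetical. Note that the strict quarter rule gives an expectation of about 37 goals, slightly below the 39 quoted, which appears to come from rounding the expected rate up to a goal a game.

```python
def expected_top_scorer_goals(league_goals_per_match, games_played, rate=0.25):
    """Expected season total for a typical leading scorer.

    rate=0.25 is the simplified 'quarter of the league average' rule of
    thumb from the post, not the fitted regression itself.
    """
    return rate * league_goals_per_match * games_played


dean_goals, dean_games = 60, 39
dean_rate = dean_goals / dean_games                      # about 1.54 goals per game
expected = expected_top_scorer_goals(3.82, dean_games)   # about 37 goals
over_performance = dean_goals / expected                 # actual relative to expected
print(round(dean_rate, 2), round(expected, 1), round(over_performance, 2))
```

Running the same calculation for any other leading scorer only needs that season's league average and the player's games and goals.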

Virtually every decade of the last two centuries is represented, and there is a mixture of familiar and not so familiar names in the list of the top division's most formidable scorers.