Thursday 29 June 2017

Big Chance or No Big Chance.

There has been a fair bit of comment recently around big chances and their inclusion or not in shot based expected goals models.

Big chances are, as the name suggests, a partly subjective addition to the Opta data feed which describes a goal attempt.

Along with undeniable parameters, such as shot location, type and pre-shot build-up details, the big chance attempts to add information, such as the level of defensive pressure or the positioning of the keeper.

While such information may enhance any conclusion about the quality of an individual chance and assist in converting a purely outcome based approach to team evaluation into a more probabilistic, process based one, it may become prey to cognitive biases, such as outcome bias.

I thought I'd quickly build two models, using the Opta data feed we use to power the Infogol app and see how each performs when put to some of the common uses of an ExpG model.

One model uses big chances (BC), whilst the other does not (NBC).

Such models are primarily used either as descriptive of past matches and/or predictive of future performances.

Typically, pre-shot data is collected from a previous season or number of seasons and the relationship between this data and a discrete outcome, such as whether a goal is scored, is found using logistic regression.

We can then use the results of the previously modelled regression to assign the probability that any future chance will result in a goal based on recent historical precedent.
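As a rough illustration of that last step, here is a minimal sketch of how a fitted logistic regression turns pre-shot data into a goal probability. The coefficient values and feature names are invented for illustration only; they are not the parameters of the models discussed in this post.

```python
import math

# Hypothetical coefficients, invented for illustration. A real model would
# estimate these by logistic regression on a previous season's shots.
NBC_COEFFS = {"intercept": -0.4, "distance_yards": -0.11, "header": -0.9}
BC_COEFFS = {"intercept": -0.9, "distance_yards": -0.08, "header": -0.7,
             "big_chance": 1.8}


def goal_probability(shot, coeffs):
    """Apply the logistic function to a linear combination of shot features."""
    linear = coeffs["intercept"]
    for name, weight in coeffs.items():
        if name != "intercept":
            linear += weight * shot.get(name, 0.0)
    return 1.0 / (1.0 + math.exp(-linear))


# A close-range footed shot that the data collector flagged as a big chance.
shot = {"distance_yards": 8.0, "header": 0, "big_chance": 1}
nbc_xg = goal_probability(shot, NBC_COEFFS)  # has no big chance term to use
bc_xg = goal_probability(shot, BC_COEFFS)    # uses the big chance flag
```

With these made-up numbers the BC model rates the chance more highly than the NBC model, which is the kind of shot-level disagreement that cumulative totals can hide.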

The advantage of using ExpG models is that shots are much more numerous than goals and hopefully the process of chance creation, with an attached probabilistic measurement of success, will better describe a side's underlying abilities compared to actual goals, which are perhaps more prone to random streaks.

                     Cumulative ExpG Totals for 2015/16 Modelled from 2014/15 Opta Data.



Here are the cumulative ExpG totals for the 2015/16 Premier League, modelled using data from the previous season. These types of figures are often used as a basis to predict the future performance of a side.

The top model doesn't use big chances as a parameter, but the second does and while there is some variation between models, the correlation measured in Exp GD is strong between the two models.


For those wishing to use an ExpG approach to produce a probabilistic estimation of team quality, there seems little difference in larger sample sizes between a big or non big chance based model.

It would appear that, in the long term at least, chance quality information is also retrieved from non big chance Opta parameters and more importantly is distributed to individual teams in a similar way to a big chance model.

In short, both models give Exp GD of similar values for most sides.

However, cumulative totals can give near identical values, but be very different at the granular level.

Model BC may assign a much bigger probability to excellent opportunities and smaller ones to weaker opportunities, while model NBC may do the polar opposite and the errors in the latter may fortuitously balance out to give near equal cumulative totals.

In that case, the first model would describe future reality better than the second.

To test both models, I arranged the goal attempts for all 20 teams in ascending chance quality, divided these into groups and then compared the actual number of goals scored in each of these subsets to the number predicted by each model.

                      How Well Does the Predicted Distribution of Outcomes Match Reality.



(Green = acceptable match, brown = poor match).

The results of this goodness of fit test are shown above.

Where the probabilistic model prediction for each subset largely agrees with the actual distribution of outcomes for 2015/16, we get a large p value. There's a decent chance that the variation we see between prediction and reality is just down to chance.
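The grouping procedure can be sketched as below. The post does not say which test statistic was used, so the Pearson-style chi-square here is an assumption, as are the number of groups and the toy shot data.

```python
def grouped_fit(shots, n_groups=5):
    """Sort shots by modelled probability, split them into groups and
    compare expected with actual goals in each group.

    shots: list of (xg, scored) pairs, where scored is 1 for a goal, else 0.
    Returns the per-group (expected, actual, count) rows and a
    Pearson-style chi-square statistic.
    """
    ordered = sorted(shots, key=lambda s: s[0])
    size = len(ordered) / n_groups
    rows = []
    chi_square = 0.0
    for g in range(n_groups):
        group = ordered[round(g * size):round((g + 1) * size)]
        if not group:
            continue
        expected = sum(xg for xg, _ in group)
        actual = sum(scored for _, scored in group)
        n = len(group)
        rows.append((expected, actual, n))
        # Compare both goals and non-goals with their expected counts.
        if 0 < expected < n:
            chi_square += (actual - expected) ** 2 / expected
            chi_square += (actual - expected) ** 2 / (n - expected)
    return rows, chi_square


# Toy data: ten long shots at 0.1 each and ten good chances at 0.5 each.
shots = [(0.1, 0)] * 9 + [(0.1, 1)] + [(0.5, 1)] * 5 + [(0.5, 0)] * 5
rows, chi_square = grouped_fit(shots, n_groups=2)
```

A small statistic means prediction and reality agree closely in every group. Comparing it against a chi-square distribution gives the p value, with large p values corresponding to the green cells.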

Using the usual 5% threshold, there are two teams from the model constructed without big chances where the actual distribution of outcomes is so far removed from the predictions that chance may be largely ruled out as the cause.

In this case, Liverpool and Stoke.

The model constructed with big chances included as a variable has three teams where chance looks an unlikely candidate for the variation seen in the two distributions. Liverpool (again), Everton and Swansea.

So while cumulative ExpG values tend to show only small variations between a BC and a non BC model, differences do emerge at a more granular level and these differences, for this season and these two models, do not appear to be systematically in favour of the BC or non BC model.

In short, ExpG is the product of a model and all models vary. These differences, and the conclusions we draw from them, may be most evident in smaller shot samples.

Saturday 24 June 2017

You Don't Need Goals to Change Game State

I’ve written previously about the concept of game state and how a side prioritises their attacking and defensive resources.

It is well known that trailing sides often increase their attacking output when they are behind compared to when they were either level or ahead and this in turn impacts on the amount of defending their opponents are obliged to do.

Dependent upon the relative abilities of the two competing teams, a side seeking to get back on level terms often takes more shots and also accrues more products of attacking play, such as corners than was previously the case.

However, game state, when defined simply as the current scoreline, does seem limiting and I’ve previously quoted the example of a top side playing out a goalless draw with a lesser team.

While the level scoreline would be increasingly welcome to the lower rated team as the game progressed, the opposite would apply for the better side in the matchup.

Therefore, quantifying “game state” should perhaps be done in terms that include the changing expectations of each team due to the passage of time and scoreline, rather than simply the scoreline.

I’ve suggested using the expected points each side would get on average from a match as a suitable baseline with which to begin measuring the evolving state of the game.

Here’s an example.

Chelsea entertains Everton and based on pregame home win/draw/away win estimations, Chelsea would expect to average 2.1 points compared to around 0.71 points for the visitors from the fixture.

40 minutes into a still goalless game and these numbers have respectively fallen to 1.9 and risen to 0.81. After 67 minutes and still no goal and Chelsea are faring even less well (1.66) and Everton are up to an average expectation of 0.90 points.

There have been no goals, but the state of the game is constantly drifting away from Chelsea’s expectations and surpassing Everton’s “par for the course”.
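Expected points are simply three times the win probability plus the draw probability. One way to see how they drift during a goalless game is the sketch below, which assumes goals arrive as independent Poisson processes at a constant rate. The goal expectations of 1.9 and 0.75 are hypothetical and are not the inputs behind the Chelsea and Everton figures above.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals follow a Poisson law."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def expected_points(home_rate, away_rate, minute, home_goals=0, away_goals=0,
                    max_goals=15):
    """Expected points for (home, away) given the score after `minute`.

    home_rate and away_rate are assumed full-match goal expectations,
    scaled down by the share of the 90 minutes still to play.
    """
    remaining = max(0.0, (90 - minute) / 90)
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = (poisson_pmf(h, home_rate * remaining)
                 * poisson_pmf(a, away_rate * remaining))
            final_diff = (home_goals + h) - (away_goals + a)
            if final_diff > 0:
                p_home += p
            elif final_diff < 0:
                p_away += p
            else:
                p_draw += p
    return 3 * p_home + p_draw, 3 * p_away + p_draw


# Hypothetical goal expectations for a strong home side, score still 0-0.
for minute in (0, 40, 67):
    home_xpts, away_xpts = expected_points(1.9, 0.75, minute)
    print(minute, round(home_xpts, 2), round(away_xpts, 2))
```

Under these assumptions the favourite's expectation falls and the underdog's rises with every goalless minute, until both reach one point at full time.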

Chelsea's game state environment is gradually becoming less palatable to them and Everton's more so, simply through the passage of time and if this feeds through into the relative approaches of the sides, it should be seen in the match data.

Here’s a memorable 0-0 from 2016/17 when Burnley took a point in a stalemate at Old Trafford.

The hosts’ average expected points total started at around 2.3 points at kick-off compared to 0.55 points for the visitors, but it had fallen by over 10% when half time failed to see a score. So a gradual erosion of expectations, rather than a precipitous decline.

Burnley’s modest expectation was up by over 50% on their original with 20 minutes remaining and, with United’s having now tumbled by nearly a quarter compared to kick-off, the hosts’ shot count began increasing as Burnley’s stalled.

        How Manchester United Piled on the Attempts as Burnley Frustrated them at OT.


This switch towards a more overtly attacking stance from the side leaking initial expectation as time elapses in a level match forces their opponent to adopt a more defensive outlook and appears to be mirrored, on average, in all such matches from the 2016/17 Premier League season.

72% of the goal attempts taken when the scoreline was level in 2016/17 were taken by the side whose expected points had slipped below their pregame estimation. Perhaps an important consideration when nearly half of all goal attempts from 2016/17 came while the scores were level.

Across all score lines, the inferior team in a match who had managed to improve their pre-game position, either through remaining level or taking a lead, attempted 31% of shots while that position persisted, but such sides upped this to nearly 46% against superior opponents when their current points expectation fell below their initial expectation.

These figures tally with intuition about how games develop, even in the absence of goals.

Therefore, the amount of change in a team’s pregame expectation may be a viable extension to the more commonly applied mere scoreline when assessing game state, particularly when we are still awaiting an initial goal.

For example, it is commonly assumed that increased shot volume from a side that finds themselves in a disadvantageous game state is partially balanced by a more packed defence.

This may lead to the expected goals from identical pitch locations being lower when defensive pressure is greater.

To try to test this I included a variable for game state within an expected goal model for the 2016/17 Premier League, based around this continuous, time elapsed and score dependent calculation, rather than merely using the current scoreline.

Overall, a team playing with a current expected points total that had dipped well below their pre-game expectations, converted chances at a lower rate than identical chances where game state was much less of a factor.

In addition, as teams played with a poorer game state, their goal attempts were also more likely to be blocked by defenders than in similar situations when their game state environment wasn't as dire.

As an example, a side who had improved their position compared to pre-game by around 40% of their initial points expectation might convert a decent shot from the heart of the penalty area around 44% of the time.

But when faced with the same chance when their points expectation had fallen by a similarly large amount, they appear to only convert the opportunity 37% of the time.

This may be due to fewer defenders being around in the first instance as their opponents perhaps chased a goal of their own compared to the second situation when defence might be a higher priority for their opponents.

Thursday 15 June 2017

Early Season Strength of Schedule

With the major European leagues currently enjoying their summer holidays, it is left to a handful of competitions to provide club based action until early August.

One such league is Brazil's Serie A, a fascinating mix of player and managerial churn, exciting skillful youngsters, paired with former internationals, slowly winding down their illustrious careers and lots of shooting from distance.

Tonight sees the completion of week seven of the twenty team league, so while we have accumulated some new information about the 2017/18 version of teams such as Santos, Sao Paulo, Corinthians and lesser known sides, such as Gremio and Bahia, that information comes courtesy of an unbalanced schedule.

Prior to week seven, Flamengo had played three of the current bottom four and no side from the top half of the table, whereas Vasco da Gama had faced the current top two and only two sides outside the top ten.

The challenges faced by these two sides were likely to vary in their degree of difficulty.

Delving deeper into each side's most recent games, including matches from 2016/17 may be a more reliable indicator of their respective future prospects, but it is understandable that a six game season to date also invites comment in isolation.

Predicting the future arc of a team's season is always welcome, but celebrating achievement over a shorter time frame, even if some of it has come from a sprinkling of unsustainable randomness also deserves attention.

How can advanced stats and strength of schedule adjustments assist?

It's natural to look firstly at the record of the side in question, but it is their opponents that possess the richest seam of data from 2017/18's fledgling season.

Vasco has played Palmeiras, Bahia, Sport, Fluminense, Corinthians and Gremio prior to last night and in turn each of their opponents has also played five other opponents in addition to Vasco.

Combined, Vasco's opponents have played 36 games, nearly a full season and have played every side in Serie A at least once, bar Corinthians.

We have a ton of accumulated data from goals to expected goals for Vasco's opponents, but only six games of data for Vasco themselves and the same is true for the remaining 19 teams.

It's natural to expect that even this limited, if recent, achievement contains some signal relating to future performance. Ben Cronin over at Pinnacle has written this article about the correlations between Premier League position after six games and final position, and the FT's John Burn-Murdoch also tweeted this excellent visualisation correlating current league position during the 2013/14 season with finishing position in May.

To adjust for strength of schedule, we might take expected goal differential, rather than league position, as the performance related output for each team and utilise the interrelated collateral form lines that are created after a few weeks of the season.

Team A may not have played team B yet, but they may have played team C, who have played team B.

We are left with 20 simultaneous equations, with a side's opponents on one side and their actual expected goal differential output on the other. Solve these and we have new expected goal differentials that more fully represent the difficulty of each team's schedule.
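The post does not spell out the equations, so the sketch below assumes one common formulation: a side's rating equals its average expected goal difference per game plus the average rating of the opponents it has faced. The system is solved by repeated substitution and the four-team fixture list is invented.

```python
def schedule_adjusted_ratings(games, iterations=200):
    """Solve rating = average margin + average opponent rating by iteration.

    games: list of (team, opponent, margin) with one row per team per match,
    where margin is that team's expected goal difference in the match.
    """
    teams = sorted({team for team, _, _ in games})
    played = {t: [(opp, m) for team, opp, m in games if team == t]
              for t in teams}
    ratings = {t: 0.0 for t in teams}
    for _ in range(iterations):
        updated = {}
        for t in teams:
            avg_margin = sum(m for _, m in played[t]) / len(played[t])
            avg_opp = sum(ratings[o] for o, _ in played[t]) / len(played[t])
            updated[t] = avg_margin + avg_opp
        # Re-centre so the league average rating stays at zero.
        mean = sum(updated.values()) / len(updated)
        ratings = {t: r - mean for t, r in updated.items()}
    return ratings


# Invented example: D has only met C, so D's raw figure flatters nobody.
games = [
    ("A", "B", 1.0), ("B", "A", -1.0),
    ("A", "C", 0.2), ("C", "A", -0.2),
    ("B", "C", -0.4), ("C", "B", 0.4),
    ("C", "D", 1.5), ("D", "C", -1.5),
]
ratings = schedule_adjusted_ratings(games)
```

The iteration settles only when the fixture list links every team and is not a strict two-group split, which a league will usually satisfy after a few weeks. A linear algebra solver would be the more direct route for a real 20 team league.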

In short, it is the basis for so called power ratings.



Here's how Serie A teams were ranked by expected goals differential prior to week seven and how that ranking changed when we allowed for the sometimes heavily unbalanced schedules played.

Vasco were ranked 13th on expected goal differential, but jumped into the top 10 to 9th when their harsh early schedule was applied.

Ponte Preta dropped four places to 15th in view of an apparently benign group of initial opponents.

In theory this seems fine, but does schedule strength add anything to our knowledge of a side going forward if we choose to limit ourselves to data from just this single season?

As Ben and John have admirably demonstrated, there is a correlation between league position at various stages of the season and finishing position.

Here's a limited (due to workload) example from a previous Premier League season using simply goal differential rather than expected goals.

13 games into the 2013/14 season, Spurs were ranked 13th by goal difference, 10th when strength of previous schedule was applied and 9th in the actual table. They finished 6th.

Their position in the table after 13 games best predicted their finishing spot, followed by strength of schedule adjusted goal difference and lastly actual goal difference.

Across the league as a whole, though, strength of schedule adjusted goal difference from week 13 did best of the three. The rank correlations were 0.77 for both league position and actual goal difference after 13 games, rising to 0.80 when strength of schedule corrections were applied and the teams re-ranked after 13 matches each.
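For anyone wanting to repeat this kind of check, a rank correlation can be computed as the ordinary correlation between two sets of ranks (Spearman's method, assumed here because the post does not name the measure used). The positions in the usage example are invented.

```python
def ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(x, y):
    """Spearman correlation: Pearson correlation of the two sets of ranks."""
    rx, ry = ranks(x), ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented positions for five teams: after 13 games and at the season's end.
week_13_position = [1, 2, 3, 4, 5]
final_position = [2, 1, 3, 5, 4]
rho = rank_correlation(week_13_position, final_position)
```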

In short, there is signal in limited early season data and as a means of predicting final finishing position there may be some improvement if we rank by a schedule adjusted performance indicator.

All Brazilian data from InfoGolApp

Sunday 11 June 2017

Take On Me

A quick data viz spin through some of the less readily available attacking stats from the 2016/17 Premier League.

Aside from a penalty kick, the take on is the contest in a football game that most directly pits together the attacking and defensive attributes of individuals.

The ability to break apart a defensive structure by beating an opponent in a one on one contest is a hugely valuable asset, particularly if it takes place deep into opposition territory as demonstrated by England's opening goal against Scotland.

Similarly, conceding possession from an attacking move can also leave a side vulnerable to counters.

So who's perpetually trying to be creative in the opposition box and who might leave his side vulnerable to a costly turnover in less advanced areas of the field?

Here are the plots for the Top Six. The left hand side of the plot is closest to the opponent's goal and players who have played few minutes have been omitted.







Data from InfoGolApp

Friday 9 June 2017

Visualising Premier League Defence

A quick follow up to the last post on the defensive actions of players in the 2016/17 Premier League.

Numerical values, of course, are the mainstay of any attempt at a deeper analysis of the defensive side of football, but it is also useful to have a visualisation of the data from which to derive a quick overview and comparison of different players.

The previous post looked to quantify the number of defensive actions particular positions were responsible for and where on the pitch they took place.

This post looks at individual players and both the amount of defensive actions they partake in, corrected to per 90 minutes and also whether these occur closer to their own goal or higher up the field.



Here are the plots for the three main challengers to Chelsea from 2016/17.

The pitch has been split into ten equal portions, sorted by distance to the centre of the defending team's goal line, and the volume of defensive actions has been counted in each of the ten sectors.

The right hand end of the spark line plot is the nearest sector to the team's own goal and the vertical line denotes half way.
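The counting behind such a plot can be sketched as follows. The maximum distance of 115 yards, the clamping of longer diagonal distances into the last band and the sample distances are all assumptions made for illustration.

```python
def actions_per_90_by_sector(distances, minutes_played, max_distance=115.0,
                             n_sectors=10):
    """Count defensive actions in equal-width bands of distance from the
    centre of a player's own goal line, scaled to a per 90 minute rate.

    Index 0 of the result is the band nearest to the player's own goal.
    """
    width = max_distance / n_sectors
    counts = [0] * n_sectors
    for d in distances:
        # Clamp so actions at or beyond the far end fall in the last band.
        sector = min(int(d // width), n_sectors - 1)
        counts[sector] += 1
    return [c * 90 / minutes_played for c in counts]


# Invented distances (in yards) for a player with 180 minutes on the field.
per_90 = actions_per_90_by_sector([5.0, 8.0, 30.0, 114.0, 130.0], 180)
```

Reversing the returned list puts the sector nearest the player's own goal at the right hand end, matching the orientation of the spark lines.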

The plot shows where and how often, either through instruction or necessity, a player is involved in the defensive efforts of his side and who is given free rein to concentrate on other aspects of team play.

All data from @InfoGolApp

Tuesday 6 June 2017

All For One.....Defensive Lines in the Premier League.

While the attacking side of football was always going to be the focus of advanced analytics, it is perhaps surprising that defensive metrics have received so little attention.

Aside from team wide expected goals allowed, more granular defensive metrics have barely progressed beyond mere counting of defensive actions such as tackles and challenges (player on player) and interceptions and clearances (player on ball).

There are exceptions, such as the universally excellent Colin Trainor here, and there are excuses, particularly the scant availability of data relating to defensive actions.

Defence is also more overtly a team responsibility and whereas heroic last ditch tackles do occur and prevent a chance from turning into a shot, it is the overall structure and ability to create pressure on the team in possession that also exerts a great deal of influence.

So off the ball events are likely more important in defining an excellent defence than say decoy runs are to adding information to the attacking process, where shots, headers and key passes are more intuitively useful as an indication of repeatable process.

However, it can still be useful to add descriptive context to the defensive actions that are beginning to become available, such as interceptions, tackles and ball recoveries.

A simple division of how these defensive actions are shared out amongst the different playing positions and where on average on the field these actions are happening may add flesh to what has previously been dry bones.

There are problems, especially the diversity of team formations. 17 different ones were employed in the 2016/17 Premier League, with 4231 proving most popular and 3142 the least, and the definitive classification of positions also becomes less certain.

We can begin to look at the share of defensive duties undertaken by a designated position, both on average across the league and particularly within a team, along with the average area of the field where these actions occur.

These may then be a useful guide as to where, either by choice or force, a side defends its goal.


Firstly, here's a summary of the average distance from the centre of the goal where a defensive action occurred for designated positions during the 2016/17 Premier League season.

As you'd expect, strikers and attacking players carry out their defensive duties the furthest away from their own goal. Defensive midfielders creep closer to their own goal and defenders more so.


Now here's the share of defensive duties undertaken by the most commonly defined playing positions. Again there are no surprises, defensive positions are responsible for the lion's share of the recorded defensive events, but they do set baselines from which we can compare different teams to begin to tease out deviations from the norm.



Here's the average position from a side's own goal where the designated playing positions are taking part in a defensive action.

Usually strikers are involved in the defensive actions that take place highest up the field and central defenders are the group of playing positions who are mixing it nearest to their own goal.

The final column simply subtracts the first distance from the second to hopefully quantify the area within which most of a side's defensive actions are occurring.

Burnley were the most compressed, defensively in 2016/17, requiring their designated strikers to help out in their own half, on average 42 yards from their own goal, while holding one of the deepest defensive lines in the league just 27 yards from goal, on average.

The majority of Burnley's defensive actions took place in a 15 yard perpendicular distance between these two lines of defensive action.

Leicester's defensive efforts, in contrast, were the most spread out, with their strikers' contribution spilling out into the opponents' half of the pitch and their defence holding the deepest line of the 20 sides.

They perhaps needed a midfielder who could do the work of two.

Liverpool's high press is evident, with the average position for defensive actions from their strikers falling just inside their opponents' half of the field, and they also contribute the highest proportion of defensive actions in comparison to the attackers from other teams.

Part of this inflated striking defensive contribution will be down to the Reds utilising above average numbers of strikers, but it does seem that being part of such an attacking set up requires a spirited contribution towards the defensive cause as well.

All data is taken from the InfogolApp

Saturday 3 June 2017

Francesco Totti's Ageing Curve

40 year old Francesco Totti ended his 25 year association with AS Roma when he appeared for the final half hour of last weekend's game with Genoa.

Totti has played over 600 Serie A matches, clocking up over 47,000 minutes of playing time, while scoring 250 league goals, although 71 of those have come from 12 yards. Over that period, Roma has enjoyed consistent success, rarely dropping out of the top four positions.

As league careers go, Totti's has therefore been played at a very similar level, where Roma has been regularly amongst the best club sides in Italy and he has largely avoided injury.

Between 1994-95 and 2014-15 he has played at least 1,000 minutes in each and every season, peaking in 2006-07 when he managed 3,034 on the field minutes.

As such he is an ideal subject to see where his performance levels stopped improving and began that inevitable, age related decline, albeit from a very high level.

Quantifying the performance achieved by players over the course of their careers is problematic. Playing time can often be used as a proxy, but goal output is perhaps the most easily accessible benchmark for an attacking player's current and previous level of play.

Here's the inevitably noisy plot of how Totti's non penalty goals per 90 have changed from one season to the next over his long career.

The trend line indicates that improvement is replaced by decline when the horizontal axis is breached by the trend and this occurred when Totti was just over 28 and a half.
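A minimal sketch of that calculation follows, assuming a straight trend line fitted by least squares (the post does not say what form the trend takes). The career used in the example is synthetic and is not Totti's record.

```python
def peak_age(ages, rates):
    """Estimate the age at which season-on-season change turns negative.

    ages: age at the start of each season, in order.
    rates: non penalty goals per 90 in the matching season.
    Fits a straight line to the change in rate against the midpoint age
    and returns the age where that line crosses zero.
    """
    xs = [(ages[i] + ages[i + 1]) / 2 for i in range(len(ages) - 1)]
    ys = [rates[i + 1] - rates[i] for i in range(len(rates) - 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return -intercept / slope


# Synthetic career: scoring rate follows a smooth curve that peaks at 28.
ages = list(range(20, 36))
rates = [0.5 - 0.002 * (age - 28) ** 2 for age in ages]
estimated_peak = peak_age(ages, rates)
```

Real season-to-season changes are far noisier than this synthetic curve, so the crossing point is best read as an approximate age rather than a precise one.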

This doesn't of course mean that he suddenly became a poor player, merely that his best years, on average and from a scoring perspective, were most likely behind him. Although as he subsequently demonstrated, he was still capable of contributing to Roma, perhaps in a slightly different role.

So footballers are all prey to ageing, although some have such high levels of innate talent that they can, like Totti prolong their time spent at the highest level because their aged talents are still above those peak years of less talented contemporaries.

Which brings us to tonight's Champions League final, featuring Ronaldo, a player who has had a more varied league career, spanning Portugal, England and Spain, but who, judged against his own highest standards, has himself been in decline since just prior to his 28th birthday.

Thursday 1 June 2017

Charting Liverpool's Expected Goal Surge Under Jurgen Klopp

Everyone with a passing interest in the developing football analytics movement will by now have heard of expected goals.

While far from perfect, in common with most models, it does do an excellent job of examining the process behind the creation and attempted execution of goal scoring opportunities in a sport, such as football, which has relatively few actual scoring events.

Much of the progress in recent years has revolved around improving both the descriptive and predictive qualities of the metric by incorporating firstly the shot type as well as location and also other pre-shot information, such as how the attack developed, often used as a proxy for defensive pressure.

Less attention has been paid to how the values of expected goals are presented for individual sides or players, with often a simple cumulative addition of the expected goals created and conceded being deemed sufficient for individual matches or seasons.

Simulating each individual attempt using the expected goal value associated with that shot or header is an easy alternative, but this also converts the raw granular data into the different currency of win probability when used on a single game, or expected position or league points won if applied over a larger number of matches.

Retaining information about the distribution of the quality of the chances created, rather than simply taking a summation of the individual elements, is useful because of the way such distributions contribute towards the final range of possible outcomes.

Spreading your cumulative expected goals over a few shots compared to many has a different potential payoff.

In the former, you are forgoing the potential for an occasional bumper score line for the increased likelihood that you may be lucky and good enough to score at least one, which often yields some kind of return in a low score environment.

I first wrote about this here in 2014.

Here's an extreme example.

Would you rather have a penalty kick, with an ExpG value of 0.8, or eight shots, each with an ExpG value of 0.1?

The cumulative ExpG is 0.8 in both cases, but if the range of outcomes were combined in a match scenario, the lone penalty would win 35% of such games and the more frequent, but less likely attempts would win just 28% of the contests despite also summing to 0.8 ExpG.
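The comparison can also be worked through exactly rather than by simulation, as in the sketch below. It assumes every attempt is independent, which real attempts within a match may not be.

```python
def goal_distribution(xg_values):
    """Probability of scoring 0, 1, 2, ... goals from independent attempts."""
    dist = [1.0]
    for p in xg_values:
        new = [0.0] * (len(dist) + 1)
        for goals, prob in enumerate(dist):
            new[goals] += prob * (1 - p)      # attempt missed
            new[goals + 1] += prob * p        # attempt scored
        dist = new
    return dist


def match_outcomes(xg_a, xg_b):
    """Return (P(A wins), P(draw), P(B wins)) for two lists of attempts."""
    dist_a, dist_b = goal_distribution(xg_a), goal_distribution(xg_b)
    win_a = draw = win_b = 0.0
    for a, pa in enumerate(dist_a):
        for b, pb in enumerate(dist_b):
            if a > b:
                win_a += pa * pb
            elif a == b:
                draw += pa * pb
            else:
                win_b += pa * pb
    return win_a, draw, win_b


penalty_win, draw, many_shots_win = match_outcomes([0.8], [0.1] * 8)
```

Under the independence assumption the exact figures come out at roughly 34% for the lone penalty and 26% for the eight shots, with the remainder drawn. That is in the same region as the figures quoted above, which may have come from a simulation.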

Therefore, ExpG distribution matters.



Here's the distribution of the ExpG chances created by Brendan Rodgers' and Jurgen Klopp's Liverpool over their most recent 48 game span.

The opportunities have been grouped and counted by increasing ExpG per attempt and compared to the league average for quality and quantity, adjusted to a 48 game sequence.

The majority of chances created by a side have a relatively low expectation of scoring, ranging from near zero up to around a 15% chance.

Attempts with higher ExpG values are much less numerous, ranging up to so-called big chances, where historically a team has been more likely to score than not.
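A grouping of this kind might be sketched as follows. The band width of 0.05 and the sample values are assumptions, since the exact groups are only visible in the plot.

```python
import math


def chances_by_band(xg_values, games_played, band_width=0.05, span=48):
    """Count attempts in bands of increasing ExpG, scaled to `span` games."""
    n_bands = math.ceil(1 / band_width - 1e-9)
    counts = [0] * n_bands
    for xg in xg_values:
        # The small offset guards against floating point error at band edges.
        band = min(int(xg / band_width + 1e-9), n_bands - 1)
        counts[band] += 1
    return [c * span / games_played for c in counts]


# Invented attempts from a 24 game sample, rescaled to a 48 game sequence.
scaled_counts = chances_by_band([0.02, 0.04, 0.12, 0.37, 0.8], 24)
```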

Therefore, a secondary axis has been used to produce definition on these much rarer groups of bigger chances.

There's not much between the current Klopp managed Liverpool and the man he replaced, Rodgers in the lowest expectation region of chances created.

Klopp's side is above the average, volume-wise, for attempts in the three initial groups that are quantified by the left hand axis, ranging from 0-0.15 ExpG.

Rodgers edges ahead in the volume of chances created with a grouped ExpG of between 0.2-0.25, the counts for which are shown on the right hand axis.

Once we encounter chances with a historical likelihood of 35% or greater, the present Liverpool set up dominates both the league standard and Rodgers' Reds.

No penalty kicks have been included.

Data from @Infogol