Monday 24 December 2012

How Real Is Manchester City's Late Scoring Spree ?

As well as providing the analytics community with a rich vein of raw data, Manchester City are also themselves the subject of many of the sound bite stats that appear on a daily basis. Unsurprisingly for a highly successful side, many of these statistical nuggets of information revolve around goal scoring exploits.

Last month Dzeko was a super sub. But a single strike, accounting for only 20% of the goals scored while he has been on the pitch as a replacement since early November, has seen the "super sub" tag largely disappear from print, and Dzeko's scoring rate as a late entrant has fallen closer to earth.

Gareth Barry's winning goal against Reading in the 93rd minute of the match appeared to merely reinforce City's apparent ability to score late goals at will and none were more important than the two stoppage time strikes in the final match of last season, when victory over ten man QPR secured the title. From the beginning of the 2011/12 campaign up to and including Saturday's last gasp win over the bottom club, City have scored 24 goals in the 85th minute or later, twice the number of their nearest rivals, United.

Reported in isolation the sound bite stat appears impressive. The narrative is clearly intended to portray City as an incredibly dangerous attacking side late in matches, certainly much more potent than their neighbours. With the backing of data going back over 56 previous matches, we appear to be looking at a cast iron case.

Since 2011/12, City have arguably been the best side in the Premiership, scoring 127 goals over those 56 games. So the first legitimate question is how many goals would such a team expect to score over that run of matches after the 85th minute ?

We can describe each of those games in terms of the initial goal expectancy that City would expect to record over numerous repetitions of the 56 games. Goal times appear to be rounded up, as two of City's goals, recorded as 85 minute strikes, were scored after 84 minutes and 30 seconds but before the 85th minute was reached. If we allow for the actual amount of "normal" time played, the actual stoppage time played, Manchester City's initial goal expectation in each match and the gradual increase in scoring rate which occurs for all teams as the contest progresses, we can calculate the goal expectation for City for every minute described in the recently circulated "85 minute" stat.

City are a top side and their average goal expectancy per game since August 2011 from the 85th minute and beyond is just over three tenths of a goal. Therefore, they would have expected to score 17 goals over that time span and their actual total of 24 is an impressive 40% higher. By delving deeper into the numbers, we appear to be confirming the validity of the sound bite.
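As a rough illustration of the calculation described above, the sketch below spreads a match goal expectation across every minute played, lets the scoring rate drift upwards as the game goes on, and sums the minutes from the 85th onwards. The linear drift, the `slope` value, the five minutes of stoppage time and the use of 127 goals in 56 matches as a stand-in for City's pregame expectation are all assumptions made for illustration, not the inputs behind the figures quoted in this post.

```python
def late_goal_expectancy(match_expectancy, first_late_minute=85,
                         normal_minutes=90, stoppage_minutes=5, slope=0.5):
    """Share out a match goal expectancy by minute and sum the late minutes.

    Assumption: the scoring rate rises linearly, so the final minute is
    (1 + slope) times as productive as the first.
    """
    total_minutes = normal_minutes + stoppage_minutes
    # Relative scoring weight for minute 1, 2, ..., total_minutes.
    weights = [1 + slope * (m - 1) / (total_minutes - 1)
               for m in range(1, total_minutes + 1)]
    late = sum(weights[first_late_minute - 1:])
    return match_expectancy * late / sum(weights)


per_match = late_goal_expectancy(127 / 56)  # hypothetical average City game
over_run = 56 * per_match                   # expected late goals in 56 games
print(round(per_match, 2), round(over_run, 1))
```

Under these assumed inputs the per match figure lands close to the three tenths of a goal mentioned above, but a proper version would use each match's own expectancy and actual stoppage time.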

However, there are more obstacles to overcome. 56 games appears impressive, but we have looked at on average just the final ten minutes of playing time for each game. In reality our sample size is only 567 minutes of actual playing time or the near equivalent of only six completed games. Manchester City in a run of six actual matches has produced an equivalent of 24 goals once over the last two seasons. As have United. If City (and United) can score at such impressive rates over a six match run selected from a season and a half of games, it shouldn't be a great surprise that they can do likewise in a non random selection of 56 end games.

It's understandable that stats like this appear once such an event as a late goal has occurred, but as with super subs, this virtually guarantees a biased sample. The first of City's games in the 56 match run was a 4-0 win over Swansea on the opening day of the 2011/12 season, which saw Aguero scoring in injury time. The last game in the sequence was Reading on Saturday, when Barry did likewise. Probably inadvertently, we again have selective cut off points which start and finish with the attribute we are trying to measure in our sample.

If we extend the sample to include 2010/11, when City were still good enough to win the Cup and finish third with the same manager and many of the same players, we find that the gap between City and their nearest rivals over the period in the late goal stakes has shrunk from 12 goals to just 3. If we insist on comparing United and City over just the last two seasons we can also close the gap again to three by choosing the 81st minute instead of the by no means special 85th minute. The more you try to be fair and unbiased, the more biased cutoff points appear in the data.

City score lots of late goals because they are one of the two best teams currently in the Premiership and we can manipulate the apparent size of this advantage by taking results from different, but similar samples. If you dissect games into bite sized chunks of time, patterns will emerge that are neither representative of a team's real ability, nor repeatable in future contests. Quotes based on such inadvertently manipulated data are enticing, but have the power to mislead. As part of a wider picture such dicing of data may be useful, but in isolation they are prone to large corrections.

Poisson simulations using City's actual goal expectations from the 85th minute onwards in their last 56 games saw 24 or more late goals arriving in around 7% of the trials. City's 24 actual goals is an impressive achievement, but as larger samples and different timescales appear to indicate, a lower long-term figure should be expected, with little to choose between the Manchester clubs in this particular talent, whatever the misleading sound bites might say.
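For anyone wanting to try this kind of check, here is a minimal sketch. It assumes every one of the 56 matches carries the same late goal expectation (17 goals shared across 56 games), whereas the simulations above used a separate expectation for each match, so the answer will only be in the same region as the quoted figure rather than a reproduction of it. The function names are my own for this example.

```python
import math
import random


def poisson_tail(mean, threshold):
    """Exact P(X >= threshold) for a Poisson variable with the given mean."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(threshold))
    return 1 - below


def poisson_draw(mean, rng):
    """One Poisson draw using Knuth's multiplication method."""
    limit, product, count = math.exp(-mean), rng.random(), 0
    while product > limit:
        product *= rng.random()
        count += 1
    return count


def simulate_late_goals(per_match_means, threshold, trials=10000, seed=1):
    """Fraction of simulated runs with at least `threshold` late goals."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = sum(poisson_draw(m, rng) for m in per_match_means)
        hits += total >= threshold
    return hits / trials


# Assumption: the same late expectancy in every match, 17 goals / 56 games.
late_means = [17 / 56] * 56
exact = poisson_tail(sum(late_means), 24)
simulated = simulate_late_goals(late_means, 24)
print(f"exact {exact:.3f}, simulated {simulated:.3f}")
```

Because a sum of independent Poisson totals is itself Poisson, the exact tail and the simulated frequency should agree closely; the simulation earns its keep once the per match expectations differ or other distributions are swapped in.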

Thursday 20 December 2012

Red Cards and Goals in the EPL.

The goal scoring expectation of a side is reduced if they are shown a red card and they also become likely to concede goals in the remainder of the match at a higher rate than previously. But are more talented sides more able to overcome a numerical disadvantage than less talented ones, and how does this impact on the number of match goals we should expect to see in a contest in which a red card is shown ?

In my guest post I look at every red card match in the EPL since 2005, comparing the goal expectancy values immediately prior to the dismissal with the reality of how 10 actually fared against 11.

To read the rest of the post follow this link

Tuesday 11 December 2012

Home Sweet Home.

Over the last couple of decades almost a third of the current Football League and Premiership clubs have upped sticks and left their previous homes for pastures new. Evocative names such as Saltergate, Layer Road, Leeds Road, The Goldstone Ground and The Victoria Ground have passed into history, to be replaced by more impersonal monikers such as The Ricoh Arena, The King Power (formerly the Walkers) Stadium, The Reebok, The Britannia and the imaginatively named Cardiff City Stadium.

Many of the bulldozed grounds had housed the local football team for generations and often the proposed move was met with a lukewarm response from many lifelong fans. Much improved facilities and the increasingly impractical locations of many original grounds meant that some of the early fan resistance has evaporated, but it is equally true that the new grounds needed time, coupled with some memorable results before they were fully accepted as home.

Home advantage is an almost constant feature of professional sports and is especially prevalent in team based sports. In football the home side's improved level of performance compared to their road game can be easily quantified. Although still a major component in driving the result of a sporting contest, home field advantage appears to be steadily declining in English football. Below I've plotted the average difference per game in goals scored by the hosts and those scored by the visitors combined for each of the four English leagues. Towards the end of the 80's a host side scored on average half a goal more than did their opponents, winning around 46% of such matches in the process. By last season, with an occasional blip such as 2009/10, HFA has shown a steady decline, including a couple of seasons when it has dipped below three tenths of a goal.
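The measure of home field advantage used here, the average per game gap between goals scored by the hosts and by the visitors, can be sketched in a couple of lines. The scorelines below are made up purely to show the calculation.

```python
def home_field_advantage(results):
    """Mean of (home goals - away goals) over a list of (home, away) scores."""
    if not results:
        raise ValueError("need at least one result")
    return sum(home - away for home, away in results) / len(results)


# Hypothetical scorelines, written as (home goals, away goals).
sample_results = [(2, 1), (0, 0), (1, 2), (3, 1), (1, 1), (2, 0)]
print(round(home_field_advantage(sample_results), 2))
```

Run over every match in a division for a season, the same function would give one point on the kind of plot described above.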

Many theories exist to explain this constant feature of the footballing landscape. Crowd pressure may unduly influence the referee to favour the team with the most fans in the ground, almost always the home side. Studies have been carried out showing that incidents shown to referees with and without crowd noise result in different decisions being made, but all of the refs in the study were amateurs from the Staffordshire FA. Hardly representative of a Premiership or Football League official. The constant decline over the last two decades would also appear to indicate that some factors are still evolving, with tactical formations being a likely and under investigated cause of the away side's increasing ability to compete.

The most widely accepted traditional explanation for home field advantage usually involves familiarity with your surroundings and a desire to protect territory. The latter is said to increase testosterone levels, enabling the home side to enjoy a marginal advantage in the many individual contests which occur during a game, leading to an accumulated advantage at home compared to away fixtures. We can use the recent glut of relocations to try to test the familiarity claims.

A side beginning life in a new ground will be no more familiar with the surroundings than many of their opponents. Often the first opportunity a team will have to play on their new ground will be the opening league fixture and in the case of Stoke in 1997, The Potters had already played four away fixtures before their new ground was ready to host a match.

Summer 1997 & The Brit is well on the way to being just a few weeks late.

If home field advantage is partly driven by familiar surroundings, a team may sacrifice part of that advantage when they move homes and a decreased home premium may be present in the results. One season's worth of results for a single team is a noisy sample and may throw up occasional home and away splits that aren't representative of larger samples. Home or away specialists can be seen every year in every division, but few if any retain the trait over larger numbers of trials. So instead of looking at the 27 individual cases where a side has played at a new ground I have aggregated all the cases.

We also need to account for the different home field environments in which each of the 27 individual seasons were played. Walsall left Fellows Park for The Bescot Stadium in 1990, when English football sides were enjoying a home advantage of almost half a goal. In 2005/06 when Swansea moved to The Liberty it had fallen to around 0.35 of a goal.

How Levels of Home Field Advantage Change With A Change Of Ground.

Season                       Average HFA (in goals)    Weighted Average HFA    Team HFA as a %
                             for Relocating Teams      for All Seasons         of League Average
3 Years Before Move          0.40                      0.40                    100
2 Years Before Move          0.40                      0.40                    100
Last Year In Old Ground      0.50                      0.41                    122
First Year In New Ground     0.34                      0.39                     88
Second Year In New Ground    0.34                      0.37                     92
Third Year In New Ground     0.37                      0.39                     95

Above I've charted the average HFA for all 27 sides in the three seasons prior to and the three seasons after their move. I've compared this figure to the average weighted home field advantage for English football as a whole in the relevant years. In short a figure of 100% in the final column shows that the group of 27 were delivering the league average for home field advantage. Above 100 indicates they enjoyed an enhanced HFA compared to the league as a whole and figures below 100 indicate a reduced comparable HFA.
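The final column is simply the relocating group's HFA divided by the league figure, scaled so that 100 means league average. The sketch below recomputes it from the two decimal place values shown in the table, so a last digit may differ from the published column (the first year in a new ground is one such case), presumably because the original used unrounded inputs.

```python
# (relocating teams' HFA, league weighted HFA), both in goals per game,
# copied from the table at the two decimal places published.
hfa_by_season = {
    "3 years before move": (0.40, 0.40),
    "2 years before move": (0.40, 0.40),
    "Last year in old ground": (0.50, 0.41),
    "First year in new ground": (0.34, 0.39),
    "Second year in new ground": (0.34, 0.37),
    "Third year in new ground": (0.37, 0.39),
}


def hfa_index(team_hfa, league_hfa):
    """Team HFA as a percentage of the league average (100 = average)."""
    return 100 * team_hfa / league_hfa


for season, (team, league) in hfa_by_season.items():
    print(f"{season}: {hfa_index(team, league):.0f}")
```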

The 27 teams were split fairly evenly across all four divisions, some were successful, while others celebrated their relocation by getting relegated just before their move. There was no reason to suppose that the group were consistently more proficient than usual at home (if such a species exists) and in the two years prior to their last season at their old home, the group produced dead eyed average HFA in each season.

In their final year they performed markedly better at home and while 27 completed seasons is a large sample size, I am reluctant to dive in and state that the players, fans and opponents are intent on giving the old ground a send off to remember. Of more interest is the dip in HFA that occurs in the first year at a new venue and only gradually recovers towards league expectations over successive seasons. Overall the 27 teams struggled to give the levels of home performance that their efforts on the road suggested they should have been capable of.

Some teams managed the transition effortlessly. Chesterfield had a great season at home, but Colchester struggled and overall more teams mirrored Colchester's experience, possibly indicating that home advantage had become more of a neutral experience that only improved with the passage of time.

A new ground gives the opportunity to examine results where one of the most cited causes of home advantage may be reduced. While competing and opposite factors may also be present after such ground hops, such as increased, if not always utilised, extra capacity, there seem to be reasonable grounds to support the theory that familiarity breeds more points.

Saturday 8 December 2012

The Trouble With Pythagoras.

In this recent post I looked at how the Pythagorean approach to converting runs scored and conceded in baseball or points totals in American Football can be utilized to give a more representative win and loss record over a season long timescale. A large number of narrow victories can inflate a side's final league record, but much of this success may be down to randomly fluctuating fortunes and there is no guarantee that these will be repeated in the future. A team's scoring record can partly capture such bouts of good or bad fortune and the luck bearing contribution to league points can be identified using the Pythagorean method, once draws and scoring environments have been accounted for.

Simply eyeballing a side's goal difference can also achieve the same aim and Newcastle were the poster side for over achievement last season when they claimed 5th spot with a goal difference of +5. The competitive balance within the current Premiership is relatively fixed from year to year and Newcastle's goal difference would usually have only been good enough for seventh spot, if not lower. They were out of place, probably by a couple of spots. The case of Newcastle has been extensively covered and eight wins by a one goal margin, coupled with three reasonably heavy defeats, were the main factors behind their depressed goal difference and elevated finishing position.

Pythagorean expectation captured Newcastle's atypical season, but so did anyone who took a passing interest in the table or results. So how can this cross over from Sabermetrics begin to be used beyond spotting transparent outliers ?

Much of the effort in transferring Pythag to football has revolved around reducing the error associated with predicted final points totals and actual totals in the same campaign. A Premiership season of 38 games is usually sufficient for skill to begin to overwhelm randomness and the best teams invariably rise to the top. Therefore trying to match your improved model of reality with actual reality is a reasonable aim. A team's true worth is often hidden, but the distortion is reduced in sports such as football where skill is a considerable factor. However, care should be taken not to overfit a Pythag model of reality to the random elements that occur in matches over the season.

Taking data such as goals scored and conceded over a season to create a model and then fitting that model to those very same matches runs the very real risk of forcing your creation to conform to random noise as well as signal. Once let loose on new data any predictive qualities may well be compromised as solid patterns reveal themselves as little more than randomness. Extensive, out of sample testing is much the way to go in attempting to validate a model based conclusion.

A second stumbling block is the aggregation of data. A glut of narrow wins or defeats may show up in a 38 game season's worth of scoring events. But a hefty, often red card assisted defeat can hang heavy over a side's goal difference as a result of the low scoring in football.

Manchester United 8 Arsenal 2 and United 1 City 6, with a couple of red cards, had the capacity to play havoc with a carefully tended Pythag brought up in the USA, where individuals are ejected, but teams are often allowed to remain at full strength. As luck would have it both of United's results eventually cancelled each other out, although the Arsenal result hung heavy on each side's goal difference early in the season. Data aggregation has its benefits, but one unusually high scoring game can also be smeared over a whole group of games, resulting in a distorted representation of what actually occurred.

There was a time in the top flight when 1-0 wins on the back of an impressive defensive display were widely admired and some present day sides still possess the quality of defenders and tactical nous to engineer such results as part of their normal matchday experience. Five 1-0 victories coupled to a 6-1 defeat accrues 15 actual points, a goal for and against tally of 6-6 and a reputation for being fortunate over achievers. Six 1-1 draws gets a team just 6 points, the same goal tally and an "unlucky" tag. But Pythagoras treats both teams the same and gives them each a "true" expected points total of around 8 for those six games.
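To see why the aggregated approach cannot tell these two records apart, here is a minimal sketch of a season style Pythagorean calculation. The `exponent` and `points_scale` values are placeholders chosen for illustration (the scale is set so that level goal tallies are worth about 1.35 points per game); they are not the fitted values used in my earlier post.

```python
def pythag_points(goals_for, goals_against, games,
                  exponent=1.3, points_scale=2.7):
    """Aggregated Pythagorean points from goal totals.

    Assumptions: `exponent` and `points_scale` are illustrative values only.
    """
    share = goals_for ** exponent / (
        goals_for ** exponent + goals_against ** exponent)
    return games * points_scale * share


lucky = [(1, 0)] * 5 + [(1, 6)]   # five 1-0 wins and a 6-1 defeat
unlucky = [(1, 1)] * 6            # six 1-1 draws

for results in (lucky, unlucky):
    gf = sum(f for f, _ in results)
    ga = sum(a for _, a in results)
    actual = sum(3 if f > a else 1 if f == a else 0 for f, a in results)
    print(actual, round(pythag_points(gf, ga, len(results)), 1))
```

Both records collapse to six goals for and six against, so any formula fed only those totals must return the same expected points for each, whatever exponent is chosen.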

Real life examples will rarely be as extreme, but if we know the actual, individual results, we should try to use that information. One way around this Pythagorean "draining the detail from the data" problem is to treat each match individually and then aggregate the expected points. Thus, a team which managed to run up the score in a single match wouldn't be credited with the ability to be equally threatening in front of goal under more competitive conditions.

A 7-0 win would tend towards three expected Pythag points for that match and a narrow 1-0 win would lead to a Pythagorean contribution that was nearer to two league points, acknowledging the range of outcomes that may occur when defending a narrow lead. The attractive concept of downgrading teams succeeding on the back of winning a lot of close matches would be retained, without the season wide points inflation for a side enjoying "one of those days" and winning a game or two by a wide margin.
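One way to get per match behaviour of this kind is sketched below. To be clear, this is a stand-in technique and not necessarily the calculation behind my figures: it treats the goals each side actually scored as the means of two independent Poisson distributions and asks how many league points the side would average if the match were replayed many times.

```python
import math


def poisson_pmf(mean, k):
    """Probability of exactly k goals for a Poisson scorer with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def match_expected_points(goals_for, goals_against, max_goals=25):
    """Expected league points if the match were replayed with both sides
    scoring at Poisson rates equal to the goals they actually scored."""
    win = draw = 0.0
    for f in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(goals_for, f) * poisson_pmf(goals_against, a)
            if f > a:
                win += p
            elif f == a:
                draw += p
    return 3 * win + draw


for score in [(7, 0), (1, 0), (1, 1), (1, 6)]:
    print(score, round(match_expected_points(*score), 2))

lucky_total = sum(match_expected_points(f, a) for f, a in [(1, 0)] * 5 + [(1, 6)])
unlucky_total = sum(match_expected_points(f, a) for f, a in [(1, 1)] * 6)
print(round(lucky_total, 1), round(unlucky_total, 1))
```

Under this assumed model a thumping win is worth very nearly three points and a 1-0 win rather less, and the two six game records with identical goal totals no longer receive the same expected points once each match is scored separately.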

If the Pythagorean method is to have any use over and above the many similar techniques that already exist for football, it has to be prepared to look at matches on a game by game basis to maximize its unique selling point, namely the ability to begin to identify some of the randomness that is incorporated into a team's actual record. In my previous post I looked at the predictive power of the Pythagorean league points totals from one season to the next using aggregated scoring data. Repeating the exercise, but on an individual match by match basis and then summing the expected league points, leads to an improved correlation between "true" Pythagorean points totals in season N-1 and a team's actual points haul in season N.

Identifying attributes that contribute to a side's success is an important aim for analysis and one way to test if a model has achieved this aim is to see if it has predictive qualities. Pythag appears to be reasonably predictive of future performance and applying the method to individual matches also opens the way for a predictive Pythag for yet to be played, single matches rather than merely confining its use to seasonal points totals.

However, it is competing in a crowded, well tested market, where tools already exist to duplicate its output. There is scope and a requirement for much further development.

Check out Martin Eastwood's Blog for an Excellent Pythag Primer.

Friday 7 December 2012

What the Bookmaker's Prices Told Us About Shakhtar v Juventus.

Much of the pregame chatter surrounding the Shakhtar versus Juventus UEFA Champions League tie concerned the mutually beneficial outcome should the teams draw their final group game. Shakhtar were already assured of qualification to the knockout stages, but a point would secure them top spot in the group and eliminate the possibility that they would face likely fellow table toppers, Barcelona, in the first round of ties. Juventus were on much more precarious ground. A draw would also ensure that they progressed as runners up, but defeat and the highly likely win for Chelsea against Nordsjaelland would see them eliminated.

Draws are the one result that can be predicted with the least certainty. They are most likely to occur when two teams have a similar chance of winning the match outright, most usually when an inferior team, boosted by home field advantage entertains a marginally superior side. A prolonged propensity for both sides to participate in relatively low scoring matches also helps to increase the chances of a stalemate. However, even if these requirements are fulfilled the probability of deadlock rarely rises much above 30%.

Mid table teams are more likely to be involved in draws than title contenders or relegation candidates, but the discrepancy is much less extreme than it is for wins or losses. The partly random way in which draws materialise is further illustrated by the lack of season on season correlation. A team which draws extensively one season isn't guaranteed to carry this tendency onto future campaigns.

Rarely will a draw be offered at prices shorter than 2/1 or 3.0 in decimal terms, yet by mid afternoon of the Shakhtar/Juventus game the price had contracted to around 2.18, indicating a likely chance of around 45%.

Many have begun to realize the value of using bookmakers' prices as a free and valuable resource. An accurate assessment of the true probability of a sporting outcome occurring is essential to any successful bookmaker, although weight of money can sometimes distort prices. We can therefore use these prices to try to piece together how this match was perceived by a combination of expert and mass market opinion.

Juventus had previously entertained Shakhtar in matchday two on October the 2nd. So early in the competition there would likely be no extraneous factors which would distort the price of the match. Prices would have reflected the relative abilities of each side along with home field advantage for the Italians. Juventus were priced up at a best price 1.61 or around 62% with the draw at 3.9 or 26%. If we use these prices from October to project a price for the rematch in December, assuming relative stability of each team's ability, we would make Juventus marginal favourites once the venues were flipped and more importantly the draw would be pitched as a 27% chance.

So we had a price for the draw set at mid afternoon of the final group match at 45% when everything pointed to an expected price for a "normal" contest being in the region of 27%. The assumption across the net was that a draw was assured as it ensured a best case outcome for Shakhtar and a second best case outcome for Juve. And as that assumption gained credence, weight of money dragged the price to even higher levels of certainty. By kickoff the odds of a draw had further contracted to 10/11 or 52%.
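The percentages quoted alongside the prices in this post come from the standard conversion of odds to implied probability, sketched below for both decimal and fractional formats. The function names are my own, and no attempt is made here to strip out the bookmaker's margin, so these are raw implied figures rather than true chances.

```python
from fractions import Fraction


def decimal_to_probability(decimal_odds):
    """Implied probability of a decimal price (stake included in the return)."""
    return 1 / decimal_odds


def fractional_to_probability(fractional_odds):
    """Implied probability of a fractional price such as '10/11'."""
    profit_per_unit = Fraction(fractional_odds)
    return float(1 / (1 + profit_per_unit))


# Prices mentioned in this post.
for price in (3.0, 2.18, 1.61, 3.9):
    print(price, f"{decimal_to_probability(price):.1%}")
for price in ("2/1", "10/11"):
    print(price, f"{fractional_to_probability(price):.1%}")
```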

Given the sometimes tainted history of some European club sides, were the odds telling us that the draw had been agreed or were they telling us something about the way in which the game was expected to be played out ?

The concluding matches of many competitions are atypical of what has gone before. May in the Premiership sees more goals than mid season, fewer cards and a collection of meaningless and meaningful matches. Stoke have faced a seemingly endless stream of late season matches where team priorities are mixed. None more so than a final game visit of Reading, already locked into a Premiership playoff position, early in Pulis' initial tenure, where Stoke required a win to ensure another season of Championship football. An initially enthusiastic Reading tamely folded to a characteristic single goal defeat. In short, the Royals did what was expected of them by teams embroiled in a relegation scrap with Stoke, but ultimately they took a more relaxed stance and prepared for future matches.

A more contemporary example was seen on Wednesday night when the team selection and subsequent performance of an already qualified Manchester United saw them taste defeat at the hands of a committed Cluj, who won and narrowly failed to progress. Differing priorities, rather than collusion, led to defeats for Reading and Manchester United.

Draw prices of 45 or even 50% are virtually unheard of pregame, but they do exist in running and the goal expectancy of the Shakhtar/Juve match would decay by an amount corresponding to a 45% draw probability after around 50 minutes of initial stalemate.
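The decay idea can be sketched with a simple Poisson model. The assumptions here are mine: two evenly matched sides each expecting 1.25 goals before kick-off (which gives a pregame draw chance close to the 27% discussed earlier), and a scoring rate that is flat across the 90 minutes, so the remaining expectancy shrinks in proportion to the time left while the game stays level.

```python
import math


def draw_probability(home_mean, away_mean, max_goals=15):
    """P(both sides score the same number of further goals), Poisson model."""
    return sum(
        math.exp(-home_mean) * home_mean ** k / math.factorial(k)
        * math.exp(-away_mean) * away_mean ** k / math.factorial(k)
        for k in range(max_goals + 1))


def in_running_draw(home_mean, away_mean, minutes_played, match_minutes=90):
    """Draw chance for a level game, scaling expectancy by time remaining."""
    remaining = max(match_minutes - minutes_played, 0) / match_minutes
    return draw_probability(home_mean * remaining, away_mean * remaining)


# Assumption: evenly matched sides, 1.25 expected goals each before kick-off.
for minute in (0, 25, 50, 75):
    print(minute, round(in_running_draw(1.25, 1.25, minute), 2))
```

With these inputs the draw chance for a still level game climbs into the mid forties percent at around the 50 minute mark, which is the sense in which a 45% pregame price resembles a game that is already 50 minutes old.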

The net predictably abounded with conspiracy theories regarding the match, complicated by the Russian connection between Shakhtar and Chelsea, the fall guys in any agreed draw. But more experienced opinions appeared to be quantifying the chances that the game would exist as a true contest for around 50 minutes, at which point fair play would be satisfied and the game could then be allowed to peter out to the most likely outcome under that scenario, namely a draw.

The ebb and flow of the game indicates that Juventus out shot Shakhtar by two to one until their fortuitous winner around the hour and then the hosts out shot their visitors by the same amount. So the flow of the game appeared to see Shakhtar content to protect their top slot, while their visitors attempted to claim it from them, and then the hosts attempting to reclaim their prize once the own goal gifted it to Juve. Without the goal, the expected truce may have been forthcoming.

Fixed results are thankfully rare in football and when agreed draws do occur, the odds on offer are a lot shorter than a shade of even money, with odds of 1.2 not uncommon. What we saw initially on Wednesday was almost certainly an experienced odds maker deducing that an evenly matched pair of teams, on the night would play out a game that could become uncompetitive should it remain stalemated relatively early in the second half and even if a goal was scored, the game would contain many persuasive routes to an ultimate draw. And that opinion was reflected in the initial, mid afternoon prices for the draw, an unusual pricing for an uncommon set of circumstances.

* For anyone confused about converting the variety of different odds commonly seen to probabilities, an excellent primer can be found here.

Thursday 6 December 2012

The FA Cup in an Era of Premiership Dominance.

The FA Cup has been dominated by the top flight teams since the start of the Eighties, with the very best Premiership outfits inevitably to the fore in the most recent past. The only real crumb of comfort for teams outside of the top flight is the random nature of the draw that can see Premiership teams eliminating each other. Four or more all Premiership 3rd round ties have led to late stage involvement for Championship sides over the last decade. Check out my guest post here for much more detail.

Sunday 2 December 2012

Expected Points Graph For Reading 3 Manchester United 4.

Reading 3 Manchester United 4.

Stoke City have become the benchmark for a team attempting to make the not inconsiderable leap from Championship high flier to Premiership survivor. Defensive resilience and maximizing scoring opportunities, especially from set plays, are two of the founding principles of Pulis ball and while there was evidence on Saturday night that Reading have embraced the latter, they were sadly lacking in the former.


1-0, Robson-Kanu, 8'
1-1, Anderson, 13'
1-2, Rooney (pen), 16'
2-2, Fondre, 19'
3-2, Morrison, 23'
3-3, Rooney, 30'
3-4, v Persie, 34'

Attempting to outscore Premiership opponents, especially the very best, is rarely a profitable approach for EPL newcomers. The more goals there are in a game, the more likely it is that the better team will score the lion's share. Reading's matches were averaging three goals per game prior to the visit of Manchester United on Saturday evening, above the league average and well in excess of Stoke's recent survival years back in the top flight. Stoke matches have averaged a low of 2.2 and a high of 2.4 goals per game since their return to the Premiership and are currently averaging below two goals so far in 2012/13. If Premiership strugglers are going to upset the very best it's more likely to occur in a low scoring game. Of the 50 defeats suffered by United to teams outside the rest of the big four, half have been single goal defeats.
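The link between total goals and the favourite's chances can be illustrated with a small Poisson sketch. It assumes the better side expects a fixed 70% share of whatever goals are scored (a made up figure) and that each side's goals are independent, then looks at how often the favourite outscores the underdog as the expected match total rises.

```python
import math


def poisson_pmf(mean, k):
    """Probability of exactly k goals for a Poisson scorer with this mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def favourite_win_probability(total_goals, favourite_share=0.7, max_goals=20):
    """P(favourite outscores underdog) when the favourite expects a fixed
    share of the total goals in the match (the share is an assumption)."""
    fav_mean = total_goals * favourite_share
    dog_mean = total_goals - fav_mean
    return sum(poisson_pmf(fav_mean, f) * poisson_pmf(dog_mean, d)
               for f in range(max_goals + 1) for d in range(f))


for total in (2.0, 2.5, 3.0, 3.5, 4.0):
    print(total, round(favourite_win_probability(total), 3))
```

In this toy model the favourite's win probability rises steadily with the expected total, leaving the underdog less room for an upset, win or draw, in a more open game.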

Entertaining as the open spectacle was, an ultimate United win was hardly a surprise. Only briefly, when they led 3-2 on the half hour did The Royals claim favouritism in the match and a United win would have been even more assured if v Persie's "fifth" goal had stood.

The first half did illustrate the potency of corner kicks, as described here, especially when the ball is delivered to the correct areas and attackers work hard at freeing themselves from defensive attention. Reading's second and third goals were text book examples of the art of scoring from a set piece.

The seven goal, first half scoring spree was undoubtedly unusual, even for two sides whose matches are likely to contain more than the average number of goals this season. One such half of football per hundred Premiership seasons would still be an optimistic expected rate for a 3-4 half time scoreline. Inevitably, pundits were predicting more of the same in the second period, but pregame estimated scoring rates are more often a better predictor of what may occur than is a single 45 minutes of action. Viewers were primed for the Premiership total goals record of eleven for one match to be threatened, if not breached, but no more scores were forthcoming.

A goalless second period was around a 20% chance and it was an 80% chance that there would be no more than two goals scored in the last 45 minutes. The chances of Reading and Manchester United combining to provide the five or more additional goals required to breach the single game record could be measured at below 2%.
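Figures of this kind can be roughed out with a single Poisson distribution for second half goals. In the sketch below the mean is backed out from the 20% goalless chance quoted above, which is my own shortcut; the numbers in the post came from separate scoring rates for each side, so the tails will only be in the same ballpark.

```python
import math


def poisson_cdf(mean, k):
    """P(X <= k) for a Poisson variable with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i)
               for i in range(k + 1))


# Assumption: choose the mean so that P(no second half goals) is 20%.
second_half_mean = -math.log(0.20)

p_goalless = poisson_cdf(second_half_mean, 0)
p_two_or_fewer = poisson_cdf(second_half_mean, 2)
p_five_or_more = 1 - poisson_cdf(second_half_mean, 4)

print(f"mean {second_half_mean:.2f}")
print(f"goalless {p_goalless:.1%}, two or fewer {p_two_or_fewer:.1%}, "
      f"five or more {p_five_or_more:.1%}")
```

Even on this crude basis, another five goals in 45 minutes comes out as a small single digit percentage chance, which is why the first half was a poor guide to the second.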