Predicting Football Results with Statistical Modelling


Football (or soccer to my American readers) is full of clichés: “It’s a game of two halves”, “taking it one game at a time” and “Liverpool have failed to win the Premier League”. You’re less likely to hear “Treating the number of goals scored by each team as independent Poisson processes, statistical modelling suggests that the home team have a 60% chance of winning today”. But this is actually a bit of a cliché too (it has been discussed here, here, here, here and particularly well here). As we’ll discover, a simple Poisson model is, well, overly simplistic. But it’s a good starting point and a nice intuitive way to learn about statistical modelling. So, if you came here looking to make money, I hear this guy makes £5000 per month without leaving the house.

Poisson Distribution

The model is founded on the number of goals scored/conceded by each team. Teams that have been higher scorers in the past have a greater likelihood of scoring goals in the future. We’ll import all match results from the recently concluded Premier League (2016/17) season. There are various sources for this data out there (kaggle, github, API). I built an R wrapper for that API, but I’ll go the csv route this time around.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from scipy.stats import poisson,skellam

epl_1617 = pd.read_csv("")
epl_1617 = epl_1617[['HomeTeam','AwayTeam','FTHG','FTAG']]
epl_1617 = epl_1617.rename(columns={'FTHG':  'HomeGoals', 'FTAG':  'AwayGoals'})
epl_1617.head()
   HomeTeam        AwayTeam    HomeGoals  AwayGoals
0  Burnley         Swansea     0          1
1  Crystal Palace  West Brom   0          1
2  Everton         Tottenham   1          1
3  Hull            Leicester   2          1
4  Man City        Sunderland  2          1

We imported a csv as a pandas dataframe, which contains various information for each of the 380 games in the 2016-17 English Premier League season. We restricted the dataframe to the columns in which we’re interested (specifically, team names and number of goals scored by each team). I’ll omit most of the code that produces the graphs in this post. But don’t worry, you can find that code on my github page. Our task is to model the final round of fixtures in the season, so we must remove the last 10 rows (each gameweek consists of 10 matches).

epl_1617 = epl_1617[:-10]
epl_1617.mean(numeric_only=True)
HomeGoals    1.591892
AwayGoals    1.183784
dtype: float64

You’ll notice that, on average, the home team scores more goals than the away team. This is the so called ‘home (field) advantage’ (discussed here) and isn’t specific to soccer. This is a convenient time to introduce the Poisson distribution. It’s a discrete probability distribution that describes the probability of the number of events within a specific time period (e.g. 90 mins) with a known average rate of occurrence. A key assumption is that the number of events is independent of time. In our context, this means that goals don’t become more/less likely by the number of goals already scored in the match. Instead, the number of goals is expressed purely as a function of an average rate of goals. If that was unclear, maybe this mathematical formulation will make it clearer:

$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \quad \lambda > 0$$

$\lambda$ represents the average rate (e.g. average number of goals, average number of letters you receive, etc.). So, we can treat the number of goals scored by the home and away team as two independent Poisson distributions. The plot below shows the proportion of goals scored compared to the number of goals estimated by the corresponding Poisson distributions.
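
To make the formula concrete, here is a minimal sketch that evaluates the Poisson probabilities by hand for the season averages printed above. The helper name poisson_pmf is my own choice for illustration; it should agree with scipy.stats.poisson.pmf, which is what the rest of the post relies on.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# Season averages copied from the mean goals output above
home_avg, away_avg = 1.591892, 1.183784

# Probability of the home / away team scoring 0, 1, 2, 3 or 4 goals
for goals in range(5):
    print(goals,
          round(poisson_pmf(goals, home_avg), 3),
          round(poisson_pmf(goals, away_avg), 3))
```

Each column sums to (very nearly) one once enough goal counts are included, which is a quick sanity check that the rates were plugged in correctly.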

We can use this statistical model to estimate the probability of specific events.

The probability of a draw is simply the sum of the events where the two teams score the same number of goals.

$$P(\text{draw}) = \sum_{k=0}^{\infty} P(k \mid \text{Home}) \, P(k \mid \text{Away})$$

Note that we consider the number of goals scored by each team to be independent events (i.e. P(A ∩ B) = P(A) P(B)). The difference of two Poisson distributions is actually called a Skellam distribution. So we can calculate the probability of a draw by inputting the mean goal values into this distribution.

# probability of a draw between the home and away team
skellam.pmf(0.0, epl_1617['HomeGoals'].mean(), epl_1617['AwayGoals'].mean())
# probability of the home team winning by one goal
skellam.pmf(1, epl_1617['HomeGoals'].mean(), epl_1617['AwayGoals'].mean())
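
The Skellam shortcut can be cross-checked by summing the products of the two Poisson distributions directly. The sketch below is standard-library only and assumes the season averages shown earlier; the cut-off of 20 goals is an arbitrary truncation I picked, well beyond any plausible scoreline.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

home_avg, away_avg = 1.591892, 1.183784
max_goals = 20  # truncation point (assumption); higher scores are vanishingly unlikely

# Draw: both teams score k goals, summed over k
draw = sum(poisson_pmf(k, home_avg) * poisson_pmf(k, away_avg)
           for k in range(max_goals + 1))

# Home win by exactly one: home scores k + 1, away scores k
home_by_one = sum(poisson_pmf(k + 1, home_avg) * poisson_pmf(k, away_avg)
                  for k in range(max_goals + 1))

print(round(draw, 3), round(home_by_one, 3))
```

Both numbers should match the skellam.pmf calls to several decimal places, since the Skellam distribution is exactly this sum written in closed form.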

So, hopefully you can see how we can adapt this approach to model specific matches. We just need to know the average number of goals scored by each team and feed this data into a Poisson model. Let’s have a look at the distribution of goals scored by Chelsea and Sunderland (teams who finished 1st and last, respectively).

Building A Model

You should now be convinced that the number of goals scored by each team can be approximated by a Poisson distribution. Due to a relatively small sample size (each team plays at most 19 home/away games), the accuracy of this approximation can vary significantly (especially earlier in the season when teams have played fewer games). Similar to before, we could now calculate the probability of various events in this Chelsea Sunderland match. But rather than treat each match separately, we’ll build a more general Poisson regression model (what is that?).

# importing the tools required for the Poisson regression model
import statsmodels.api as sm
import statsmodels.formula.api as smf

# one row per team per match: goals scored by `team` against `opponent`, with a home flag
goal_model_data = pd.concat([epl_1617[['HomeTeam','AwayTeam','HomeGoals']].assign(home=1).rename(
            columns={'HomeTeam': 'team', 'AwayTeam': 'opponent','HomeGoals': 'goals'}),
           epl_1617[['AwayTeam','HomeTeam','AwayGoals']].assign(home=0).rename(
            columns={'AwayTeam': 'team', 'HomeTeam': 'opponent','AwayGoals': 'goals'})])

poisson_model = smf.glm(formula="goals ~ home + team + opponent", data=goal_model_data, 
                        family=sm.families.Poisson()).fit()
poisson_model.summary()
Generalized Linear Model Regression Results
Dep. Variable: goals No. Observations: 740
Model: GLM Df Residuals: 700
Model Family: Poisson Df Model: 39
Link Function: log Scale: 1.0
Method: IRLS Log-Likelihood: -1042.4
Date: Sat, 10 Jun 2017 Deviance: 776.11
Time: 11:17:38 Pearson chi2: 659.
No. Iterations: 8
coef std err z P>|z| [95.0% Conf. Int.]
Intercept 0.3725 0.198 1.880 0.060 -0.016 0.761
team[T.Bournemouth] -0.2891 0.179 -1.612 0.107 -0.641 0.062
team[T.Burnley] -0.6458 0.200 -3.230 0.001 -1.038 -0.254
team[T.Chelsea] 0.0789 0.162 0.488 0.626 -0.238 0.396
team[T.Crystal Palace] -0.3865 0.183 -2.107 0.035 -0.746 -0.027
team[T.Everton] -0.2008 0.173 -1.161 0.246 -0.540 0.138
team[T.Hull] -0.7006 0.204 -3.441 0.001 -1.100 -0.302
team[T.Leicester] -0.4204 0.187 -2.249 0.025 -0.787 -0.054
team[T.Liverpool] 0.0162 0.164 0.099 0.921 -0.306 0.338
team[T.Man City] 0.0117 0.164 0.072 0.943 -0.310 0.334
team[T.Man United] -0.3572 0.181 -1.971 0.049 -0.713 -0.002
team[T.Middlesbrough] -1.0087 0.225 -4.481 0.000 -1.450 -0.568
team[T.Southampton] -0.5804 0.195 -2.976 0.003 -0.963 -0.198
team[T.Stoke] -0.6082 0.197 -3.094 0.002 -0.994 -0.223
team[T.Sunderland] -0.9619 0.222 -4.329 0.000 -1.397 -0.526
team[T.Swansea] -0.5136 0.192 -2.673 0.008 -0.890 -0.137
team[T.Tottenham] 0.0532 0.162 0.328 0.743 -0.265 0.371
team[T.Watford] -0.5969 0.197 -3.035 0.002 -0.982 -0.211
team[T.West Brom] -0.5567 0.194 -2.876 0.004 -0.936 -0.177
team[T.West Ham] -0.4802 0.189 -2.535 0.011 -0.851 -0.109
opponent[T.Bournemouth] 0.4109 0.196 2.092 0.036 0.026 0.796
opponent[T.Burnley] 0.1657 0.206 0.806 0.420 -0.237 0.569
opponent[T.Chelsea] -0.3036 0.234 -1.298 0.194 -0.762 0.155
opponent[T.Crystal Palace] 0.3287 0.200 1.647 0.100 -0.062 0.720
opponent[T.Everton] -0.0442 0.218 -0.202 0.840 -0.472 0.384
opponent[T.Hull] 0.4979 0.193 2.585 0.010 0.120 0.875
opponent[T.Leicester] 0.3369 0.199 1.694 0.090 -0.053 0.727
opponent[T.Liverpool] -0.0374 0.217 -0.172 0.863 -0.463 0.389
opponent[T.Man City] -0.0993 0.222 -0.448 0.654 -0.534 0.335
opponent[T.Man United] -0.4220 0.241 -1.754 0.079 -0.894 0.050
opponent[T.Middlesbrough] 0.1196 0.208 0.574 0.566 -0.289 0.528
opponent[T.Southampton] 0.0458 0.211 0.217 0.828 -0.369 0.460
opponent[T.Stoke] 0.2266 0.203 1.115 0.265 -0.172 0.625
opponent[T.Sunderland] 0.3707 0.198 1.876 0.061 -0.017 0.758
opponent[T.Swansea] 0.4336 0.195 2.227 0.026 0.052 0.815
opponent[T.Tottenham] -0.5431 0.252 -2.156 0.031 -1.037 -0.049
opponent[T.Watford] 0.3533 0.198 1.782 0.075 -0.035 0.742
opponent[T.West Brom] 0.0970 0.209 0.463 0.643 -0.313 0.507
opponent[T.West Ham] 0.3485 0.198 1.758 0.079 -0.040 0.737
home 0.2969 0.063 4.702 0.000 0.173 0.421

If you’re curious about the smf.glm(...) part, you can find more information here (edit: earlier versions of this post had erroneously employed a Generalised Estimating Equation (GEE)- what’s the difference?). I’m more interested in the values presented in the coef column in the model summary table, which are analogous to the slopes in linear regression. Similar to logistic regression, we take the exponent of the parameter values. A positive value implies more goals ($e^{x} > 1$ for all $x > 0$), while values closer to zero represent more neutral effects ($e^{0} = 1$). Towards the bottom of the table you might notice that home has a coef of 0.2969. This captures the fact that home teams generally score more goals than the away team (specifically, $e^{0.2969}$ = 1.35 times more likely). But not all teams are created equal. Chelsea has a coef of 0.0789, while the corresponding value for Sunderland is -0.9619 (sort of saying Chelsea (Sunderland) are better (much worse!) scorers than average). Finally, the opponent* values penalize/reward teams based on the quality of the opposition. This reflects the defensive strength of each team (Chelsea: -0.3036; Sunderland: 0.3707). In other words, you’re less likely to score against Chelsea. Hopefully, that all makes both statistical and intuitive sense.
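
To see how the coefficients combine, here is a small sketch that rebuilds an expected goal count by hand: add up the relevant coefficients on the log scale, then exponentiate. The numbers are copied from the summary table above; the function and dictionary names are mine and purely illustrative.

```python
from math import exp

# Coefficients copied from the model summary above
intercept = 0.3725
home_adv = 0.2969
scoring = {"Chelsea": 0.0789, "Sunderland": -0.9619}      # effect of being the scoring side
opposition = {"Chelsea": -0.3036, "Sunderland": 0.3707}   # effect of facing this side

def expected_goals(team, opponent, at_home):
    """Expected goals for `team` under the log-linear Poisson model."""
    log_rate = intercept + scoring[team] + opposition[opponent]
    if at_home:
        log_rate += home_adv
    return exp(log_rate)

chelsea_xg = expected_goals("Chelsea", "Sunderland", at_home=True)
sunderland_xg = expected_goals("Sunderland", "Chelsea", at_home=False)
print(round(chelsea_xg, 2), round(sunderland_xg, 2))
```

Because the table only shows four decimal places, the results will differ very slightly from what the fitted model returns, but they should land at roughly three goals for Chelsea and under half a goal for Sunderland.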

Let’s start making some predictions for the upcoming matches. We simply pass our teams into poisson_model and it’ll return the expected average number of goals for that team (we need to run it twice- we calculate the expected average number of goals for each team separately). So let’s see how many goals we expect Chelsea and Sunderland to score.

# expected goals for Chelsea at home to Sunderland
poisson_model.predict(pd.DataFrame(data={'team': 'Chelsea', 'opponent': 'Sunderland',
                                       'home': 1}, index=[1]))
# expected goals for Sunderland away at Chelsea
poisson_model.predict(pd.DataFrame(data={'team': 'Sunderland', 'opponent': 'Chelsea',
                                       'home': 0}, index=[1]))

Just like before, we have two Poisson distributions. From this, we can calculate the probability of various events. I’ll wrap this in a simulate_match function.

def simulate_match(foot_model, homeTeam, awayTeam, max_goals=10):
    # expected goals for each side from the fitted Poisson regression
    home_goals_avg = foot_model.predict(pd.DataFrame(data={'team': homeTeam,
                                                            'opponent': awayTeam, 'home': 1},
                                                      index=[1])).values[0]
    away_goals_avg = foot_model.predict(pd.DataFrame(data={'team': awayTeam,
                                                            'opponent': homeTeam, 'home': 0},
                                                      index=[1])).values[0]
    # P(0 goals), P(1 goal), ..., P(max_goals) for each team
    team_pred = [[poisson.pmf(i, team_avg) for i in range(0, max_goals+1)] for team_avg in [home_goals_avg, away_goals_avg]]
    # outer product: entry [i, j] is P(home scores i) * P(away scores j)
    return(np.outer(np.array(team_pred[0]), np.array(team_pred[1])))
simulate_match(poisson_model, 'Chelsea', 'Sunderland', max_goals=3)
array([[ 0.03108485,  0.01272529,  0.00260469,  0.00035543],
       [ 0.0951713 ,  0.03896054,  0.00797469,  0.00108821],
       [ 0.14569118,  0.059642  ,  0.01220791,  0.00166586],
       [ 0.14868571,  0.06086788,  0.01245883,  0.0017001 ]])

This matrix simply shows the probability of Chelsea (rows of the matrix) and Sunderland (matrix columns) scoring a specific number of goals. For example, along the diagonal, both teams score the same number of goals (e.g. P(0-0)=0.031). So, you can calculate the odds of a draw by summing all the diagonal entries. Everything below the diagonal represents a Chelsea victory (e.g. P(3-0)=0.149). If you prefer Over/Under markets, you can estimate P(Under 2.5 goals) by summing the entries where the sum of the column number and row number (both starting at zero) is less than 3 (i.e. the 6 values that form the upper left triangle). Luckily, we can use basic matrix manipulation functions to perform these calculations.

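chel_sun = simulate_match(poisson_model, "Chelsea", "Sunderland", max_goals=10)
# chelsea win (entries below the diagonal)
np.sum(np.tril(chel_sun, -1))
# draw (entries on the diagonal)
np.sum(np.diag(chel_sun))
# sunderland win (entries above the diagonal)
np.sum(np.triu(chel_sun, 1))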
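
The Over/Under calculation described above can be sketched in the same spirit. This version uses plain Python lists and the 4x4 matrix printed earlier (the max_goals=3 run); the helper sum_where is a name I made up for illustration. Truncating at three goals does not affect P(Under 2.5 goals), because every scoreline with fewer than three total goals is already in the matrix.

```python
# Score matrix copied from the max_goals=3 output above
# (rows: Chelsea goals, columns: Sunderland goals)
score_matrix = [
    [0.03108485, 0.01272529, 0.00260469, 0.00035543],
    [0.0951713, 0.03896054, 0.00797469, 0.00108821],
    [0.14569118, 0.059642, 0.01220791, 0.00166586],
    [0.14868571, 0.06086788, 0.01245883, 0.0017001],
]

def sum_where(matrix, condition):
    """Sum the entries whose (row, column) indices satisfy `condition`."""
    return sum(p for i, row in enumerate(matrix)
               for j, p in enumerate(row) if condition(i, j))

# Under 2.5 goals: row index + column index (total goals) is less than 3
under_2_5 = sum_where(score_matrix, lambda i, j: i + j < 3)
print(round(under_2_5, 3))
```

The win/draw/loss sums, by contrast, do need the larger max_goals=10 matrix, since a 5-0 Chelsea win is far from negligible here.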

Hmm, our model gives Sunderland a 2.7% chance of winning. But is that right? To assess the accuracy of the predictions, we’ll compare the probabilities returned by our model against the odds offered by the Betfair exchange.

Sports Betting/Trading

Unlike traditional bookmakers, on betting exchanges (and Betfair isn’t the only one- it’s just the biggest), you bet against other people (with Betfair taking a commission on winnings). It acts as a sort of stock market for sports events. And, like a stock market, due to the efficient market hypothesis, the prices available at Betfair reflect the true price/odds of those events happening (in theory anyway). Below, I’ve posted a screenshot of the Betfair exchange on Sunday 21st May (a few hours before those matches started).

The numbers inside the boxes represent the best available prices and the amount available at those prices. The blue boxes signify back bets (i.e. betting that an event will happen- going long using stock market terminology), while the red boxes represent lay bets (i.e. betting that something won’t happen- i.e. shorting). For example, if we were to bet £100 on Chelsea to win, we would receive the original amount plus 100*(1.13-1) = £13 should they win (of course, we would lose our £100 if they didn’t win). Now, how can we compare these prices to the probabilities returned by our model? Well, decimal odds can be converted to probabilities quite easily: it’s simply the inverse of the decimal odds. For example, the implied probability of Chelsea winning is 1/1.13 (=0.885- our model put the probability at 0.889). I’m focusing on decimal odds, but you might also be familiar with Moneyline (American) Odds (e.g. +200) and fractional odds (e.g. 2/1). The relationship between decimal odds, moneyline and probability is illustrated in the table below. I’ll stick with decimal odds because the alternatives are either unfamiliar to me (Moneyline) or just stupid (fractional odds).
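
For anyone who wants to move between the three odds formats, the conversions are short enough to write out. This is a minimal sketch using the standard definitions; the function names are my own and it ignores Betfair’s commission.

```python
def decimal_to_probability(decimal_odds):
    """Implied probability is the inverse of the decimal odds."""
    return 1.0 / decimal_odds

def moneyline_to_decimal(moneyline):
    """+200 means win 200 per 100 staked; -200 means stake 200 to win 100."""
    if moneyline > 0:
        return 1.0 + moneyline / 100.0
    return 1.0 + 100.0 / abs(moneyline)

def fractional_to_decimal(numerator, denominator):
    """2/1 means win 2 for every 1 staked, plus the stake back."""
    return 1.0 + numerator / denominator

def back_bet_profit(stake, decimal_odds):
    """Profit (excluding the returned stake) if a back bet wins."""
    return stake * (decimal_odds - 1.0)

print(round(decimal_to_probability(1.13), 3))   # Chelsea at 1.13
print(moneyline_to_decimal(200), fractional_to_decimal(2, 1))
print(round(back_bet_profit(100, 1.13), 2))
```

Note that +200 and 2/1 both correspond to decimal odds of 3.0, i.e. an implied probability of one in three.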

Probability of Occurrence (EPL Fixtures 21st May 2017)

Source: Betfair Exchange
Match                       Home      Draw      Away
Arsenal v Everton           71.4 %    17.5 %    11.6 %
Burnley v West Ham          42 %      27.8 %    30.8 %
Chelsea v Sunderland        88.5 %    8.7 %     3.4 %
Hull v Tottenham            10.9 %    17.2 %    71.9 %
Leicester v Bournemouth     53.5 %    24.4 %    23.3 %
Liverpool v Middlesbrough   87.7 %    9.5 %     3.6 %
Man Utd v C Palace          41.7 %    29 %      29.9 %
Southampton v Stoke         57.1 %    24.4 %    19.2 %
Swansea v West Brom         43.1 %    28.6 %    29 %
Watford v Man City          5.1 %     10.2 %    85.5 %

So, we have our model probabilities and (if we trust the exchange) we know the true probabilities of each event happening. Ideally, our model would identify situations where the market has underestimated the chances of an event occurring (or not occurring in the case of lay bets). For example, in a simple coin toss game, imagine if you were offered $2 for every $1 wagered (plus your stake), if you guessed correctly. The implied probability is 0.333, but any valid model would return a probability of 0.5. The odds returned by our model and the Betfair exchange are compared in the table below.
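
The coin toss example can be turned into an expected profit calculation, which is really what “the market has underestimated the chances” means in money terms. This is a sketch under simple assumptions (a single back bet, no commission); expected_profit is a hypothetical helper name.

```python
def expected_profit(model_probability, decimal_odds, stake=1.0):
    """Average profit per bet if the model probability is the true one."""
    win = model_probability * stake * (decimal_odds - 1.0)
    lose = (1.0 - model_probability) * stake
    return win - lose

# Coin toss: $2 profit per $1 staked is decimal odds of 3.0
fair_coin = expected_profit(0.5, 3.0)
# If the true probability matched the implied 1/3, the bet would break even
break_even = expected_profit(1.0 / 3.0, 3.0)
print(fair_coin, round(break_even, 10))
```

A positive value means the bet is worth taking according to the model; zero means the price is fair.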

Match                                    Home      Draw      Away
Arsenal v Everton           Betfair      0.714     0.175     0.116
                            Predicted    0.533     0.226     0.241
                            Difference   0.181     -0.051    -0.125
Burnley v West Ham          Betfair      0.42      0.278     0.308
                            Predicted    0.461     0.263     0.276
                            Difference   -0.041    0.015     0.032
Chelsea v Sunderland        Betfair      0.885     0.087     0.034
                            Predicted    0.889     0.084     0.027
                            Difference   -0.004    0.003     0.007
Hull v Tottenham            Betfair      0.109     0.172     0.719
                            Predicted    0.063     0.138     0.799
                            Difference   0.046     0.034     -0.08
Leicester v Bournemouth     Betfair      0.535     0.244     0.233
                            Predicted    0.475     0.22      0.306
                            Difference   0.06      0.024     -0.073
Liverpool v Middlesbrough   Betfair      0.877     0.095     0.036
                            Predicted    0.77      0.161     0.069
                            Difference   0.107     -0.066    -0.033
Man Utd v C Palace          Betfair      0.417     0.29      0.299
                            Predicted    0.672     0.209     0.119
                            Difference   -0.255    0.081     0.18
Southampton v Stoke         Betfair      0.571     0.244     0.192
                            Predicted    0.496     0.277     0.226
                            Difference   0.075     -0.033    -0.034
Swansea v West Brom         Betfair      0.431     0.286     0.29
                            Predicted    0.368     0.266     0.366
                            Difference   0.063     0.02      -0.076
Watford v Man City          Betfair      0.051     0.102     0.855
                            Predicted    0.167     0.203     0.631
                            Difference   -0.116    -0.101    0.224

Green cells illustrate opportunities to make profitable bets, according to our model (the opacity of the cell is determined by the implied difference). I’ve highlighted the difference between the model and Betfair in absolute terms (the relative difference may be more relevant for any trading strategy). Transparent cells indicate situations where the exchange and our model are in broad agreement. Strong colours imply that either our model is wrong or the exchange is wrong. Given the simplicity of our model, I’d lean towards the former.

Something’s Poissony

So should we bet the house on Manchester United? Probably not (though they did win!). There are some non-statistical reasons to resist backing them. Keen football fans would notice that these matches represent the final gameweek of the season. Most teams have very little to play for, meaning that the matches are less predictable (especially when they involve unmotivated ‘bigger’ teams). Compounding that, Man United were set to play Ajax in the Europa League Final three days later. Man United manager, Jose Mourinho, had even confirmed that he would rest the first team, saving them for the much more important final. In a similar fashion, injuries/suspensions to key players or managerial sackings would render our model inaccurate. Never underestimate the importance of domain knowledge in statistical modelling/machine learning! We could also think of improvements to the model that would incorporate time when considering previous matches (i.e. more recent matches should be weighted more strongly).
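
Here is a rough sketch of how the absolute and relative differences could be computed for a few of the fixtures in the table. The sign convention (Betfair minus Predicted) matches the table; expressing the relative difference as a fraction of the Betfair probability is my own assumption, as is the helper name.

```python
# Home / Draw / Away probabilities copied from the comparison table above
fixtures = {
    "Chelsea v Sunderland": {"betfair": (0.885, 0.087, 0.034),
                             "predicted": (0.889, 0.084, 0.027)},
    "Man Utd v C Palace": {"betfair": (0.417, 0.29, 0.299),
                           "predicted": (0.672, 0.209, 0.119)},
    "Watford v Man City": {"betfair": (0.051, 0.102, 0.855),
                           "predicted": (0.167, 0.203, 0.631)},
}

def differences(betfair, predicted):
    """Absolute (Betfair - Predicted) and relative (as a share of Betfair) gaps."""
    absolute = [round(b - p, 3) for b, p in zip(betfair, predicted)]
    relative = [round((b - p) / b, 3) for b, p in zip(betfair, predicted)]
    return absolute, relative

for match, probs in fixtures.items():
    absolute, relative = differences(probs["betfair"], probs["predicted"])
    print(match, absolute, relative)
```

A negative difference means the model rates that outcome as more likely than the exchange does. Small probabilities produce large relative gaps even when the absolute gap looks modest, as the Watford home win shows.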

Statistically speaking, is a Poisson distribution even appropriate? Our model was founded on the belief that the number of goals can be accurately expressed as a Poisson distribution. If that assumption is misguided, then the model outputs will be unreliable. Given a Poisson distribution with mean $\lambda$, the number of events in half that time period follows a Poisson distribution with mean $\lambda/2$. In football terms, according to our Poisson model, there should be an equal number of goals in the first and second halves. Unfortunately, that doesn’t appear to hold true.

epl_1617_halves = pd.read_csv("")
epl_1617_halves = epl_1617_halves[['FTHG', 'FTAG', 'HTHG', 'HTAG']]
# first half goals: home + away goals at half time
epl_1617_halves['FHgoals'] = epl_1617_halves['HTHG'] + epl_1617_halves['HTAG']
# second half goals: full time total minus first half total
epl_1617_halves['SHgoals'] = epl_1617_halves['FTHG'] + epl_1617_halves['FTAG'] - \
                             epl_1617_halves['FHgoals']
epl_1617_halves = epl_1617_halves[['FHgoals', 'SHgoals']]
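
As a side check on the halving property itself (not on the data), the sketch below confirms numerically that two independent halves, each Poisson with mean λ/2, add up to a full match that is Poisson with mean λ. The value of lam is a made-up figure for illustration, and the helper names are mine.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with average rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.8  # hypothetical average goals per match, for illustration only

def two_halves_pmf(total, lam):
    """P(total goals) when each half is an independent Poisson(lam / 2)."""
    return sum(poisson_pmf(first, lam / 2) * poisson_pmf(total - first, lam / 2)
               for first in range(total + 1))

for total in range(6):
    print(total,
          round(poisson_pmf(total, lam), 4),
          round(two_halves_pmf(total, lam), 4))
```

The two columns agree, so under the model each half should contribute the same expected number of goals, which is exactly what the first half/second half split above puts to the test.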

We have irrefutable evidence that violates a fundamental assumption of our model, rendering this whole post as pointless as Sunderland!!! Or we can build on our crude first attempt. Rather than a simple univariate Poisson model, we might have more success with a bivariate Poisson distribution. The Weibull distribution has also been proposed as a viable alternative. These might be topics for future blog posts.


We built a simple Poisson model to predict the results of English Premier League matches. Despite its inherent flaws, it recreates several features that would be a necessity for any predictive football model (home advantage, varying offensive strengths and opposition quality). In conclusion, don’t wager the rent money, but it’s a good starting point for more sophisticated realistic models. Thanks for reading!
