The aim of this post is to replicate the main results of research recently published in the following paper, which we will refer to by the name of its principal author: Heinrich R. Grieve (HRG).
We did not study the whole paper, and in particular not the detailed results, but focused on testing the main proposition of the research as set out in its Abstract:
“There is an old soccer wisdom that a goal scored just before halftime has greater value than other goals. Many dismiss this old wisdom as just another myth waiting to be busted. To test which is right we have analysed the final score difference through linear regression and outcome (win, draw, loss) through logistic regression. We use games from many leagues, control for the halftime score, comparing games in which a goal was scored after 1 minute remained of regulation time with games in which it was scored before the 44th minute.”
Hypothesis and testing
The main hypothesis we set out to test is, in HRG's words:
“Goals scored just before halftime have greater value than other goals.”
We interpret ‘greater value’ as winning proportionally more matches when such goals are scored than when they are not. We define ‘before halftime’ in two ways: the last five minutes of the first half (41-45) or the last two (44-45). We therefore test two hypotheses, Hyp_1 (41-45) and Hyp_2 (44-45), to discover whether either or both are true.
Please note that, although this is not explicitly stated by the authors, we assume that goals scored in first-half stoppage time are included in the 41-45 and 44-45 intervals.
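As a minimal sketch of how the two intervals might be encoded, assume each match records the minute of the last first-half goal, and that stoppage-time goals are stored as minutes above 45. Both are our assumptions for illustration; the function name and the minute convention are hypothetical and not taken from the HRG datasets.

```python
def late_goal_flags(last_goal_minute):
    """Return (hyp1_flag, hyp2_flag) as 'Yes'/'No' for one match.

    last_goal_minute: minute of the last first-half goal (1 or greater).
    Assumption: first-half stoppage-time goals are recorded as 46, 47, ...
    and therefore count towards both late intervals.
    """
    if last_goal_minute < 1:
        raise ValueError("minute must be 1 or greater")
    in_last_five = last_goal_minute >= 41   # Hyp_1 interval (41-45 plus stoppage)
    in_last_two = last_goal_minute >= 44    # Hyp_2 interval (44-45 plus stoppage)
    return ("Yes" if in_last_five else "No", "Yes" if in_last_two else "No")


print(late_goal_flags(42))  # inside the 41-45 interval only
```

If a dataset instead caps stoppage-time goals at minute 45, the same thresholds still apply.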
Method of testing the hypotheses
We test each hypothesis by checking whether the categorical variable outcome changes significantly when goals are scored as specified in Hyp_1 and Hyp_2. Outcome takes one of the three results a match can have: AW = Away Win, D = Draw and HW = Home Win. Two conditions must be satisfied to accept either Hyp_1 or Hyp_2: first, the proportion of matches won must be greater when the late goal is scored; second, the increase must be statistically significant (P < 0.05).
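The two conditions can be written down as a small decision rule. This is only a sketch: the function name is ours, and the win shares and P values in the usage lines are invented for illustration rather than taken from the results.

```python
ALPHA = 0.05  # significance threshold used throughout this post


def hypothesis_supported(win_share_late, win_share_other, p_value, alpha=ALPHA):
    """Return True only if both conditions hold.

    1. The share of matches won is greater when the late goal is scored.
    2. The difference is statistically significant (P < alpha).
    """
    more_wins = win_share_late > win_share_other
    significant = p_value < alpha
    return more_wins and significant


# Invented values, for illustration only.
print(hypothesis_supported(74.0, 70.0, 0.01))  # more wins and significant
print(hypothesis_supported(74.0, 70.0, 0.08))  # more wins but not significant
print(hypothesis_supported(60.0, 70.0, 0.03))  # significant but fewer wins
```

Note that a significant decrease in wins fails the rule just as a non-significant increase does.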
Techniques of analysis
While logistic regression would be the method of choice for many and was that used by HRG, we will use instead CHAID Decision Tree (https://www.statisticssolutions.com/non-parametric-analysis-chaid/) as implemented in the software KnowledgeSEEKER (Angoss). We are very familiar with this software and believe it to be more suitable for this type of problem. It also has the advantage, in contrast with other software tools, of providing for a transparent process of analysis, and easy to understand results.
A CHAID classification analysis with KnowledgeSEEKER requires that a variable with two or more categories be specified as dependent, with all others in the dataset taken as independent. For our analysis we have taken outcome (AW, D, HW) as the dependent categorical variable.
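CHAID works from cross-tabulations of the dependent variable against each candidate independent variable. The sketch below builds one such table in plain Python. The records and the field name `late_goal` are invented for illustration, and the code does not reproduce KnowledgeSEEKER's own implementation.

```python
from collections import Counter

OUTCOMES = ("AW", "D", "HW")


def cross_tabulate(records, predictor, dependent="outcome"):
    """Count outcomes for each category of one predictor.

    Returns {predictor_value: [count_AW, count_D, count_HW]}.
    """
    counts = Counter((rec[predictor], rec[dependent]) for rec in records)
    categories = sorted({rec[predictor] for rec in records})
    return {cat: [counts[(cat, out)] for out in OUTCOMES] for cat in categories}


# Invented records; field names are hypothetical.
matches = [
    {"outcome": "HW", "late_goal": "Yes"},
    {"outcome": "D", "late_goal": "Yes"},
    {"outcome": "AW", "late_goal": "No"},
    {"outcome": "HW", "late_goal": "No"},
    {"outcome": "HW", "late_goal": "No"},
]
table = cross_tabulate(matches, "late_goal")
print(table)
```

Each row of the resulting table corresponds to one child node of a split in the decision tree.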
A number of datasets were used by HRG; details can be found in their paper. We are very thankful that they kindly shared them with us. We started with a small one (WF_BA), made up of matches in the Champions League and Europa League (seasons 1998-2014). The objective was to observe and quickly validate our chosen method of analysis. Once happy with the results, we tackled the larger dataset WF_WFC, which contains matches from many top football leagues worldwide.
We split the dataset WF_BA into two datasets, Champions League and Europa League matches, which we analysed separately. We also did not use the whole of the WF_WFC dataset (results of top world leagues, 1998-2014 seasons) as HRG did, but selected only matches from the following leagues: Premier League (England), Serie A (Italy), Bundesliga (Germany), Ligue 1 (France) and the Primera (Spain). These leagues were chosen because they are widely regarded as the most competitive in the world. We also considered that the variance in strength among teams in these leagues is probably smaller than in other leagues. Matches are therefore likely to be more balanced and to include fewer extreme results (outliers), which should give more balanced results.
Much time was spent preparing the data to fit KnowledgeSEEKER. We also deleted all variables pertaining to the second half, as these are not needed for the analysis. The variables included in our final datasets are listed in the table below. The shaded ones are not strictly necessary to prove or disprove the hypotheses but were included to see (for future research) whether any of them also influences outcome.
The datasets analysed, with the number of matches in brackets, are: Champions League (505), Europa League (538), Top 5 Euro Leagues (16,321). Please note that the number of matches for the Top 5 leagues found in the dataset WF_WFC does not correspond to the figure indicated in the S1 Appendix by HRG; the numbers we used are shown in the graphic tree below:
For each dataset we applied the procedure below:
- Input the data into the analysis software KnowledgeSEEKER
- Select outcome as the dependent variable
- Build decision trees (graphic) to test the two hypotheses: Hyp_1 and Hyp_2
- Observe the results and decide whether these hypotheses have been proved or disproved
- Comment briefly on the result
The analysis, that is the building of the decision tree, starts by splitting outcome according to whether the Home or the Away team scored the last goal before halftime. It ends by comparing the outcomes according to when that last goal was scored: in the last 5 min vs the first 40 min. The Chi-square statistical test is used to decide whether these outcomes differ significantly (P < 0.05).
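As a rough sketch of the test applied at each split, assume it is a plain Pearson chi-square test on a 2 x 3 table (late goal Yes/No by AW, D, HW) with no further adjustment. This may differ from what KnowledgeSEEKER does internally. The function name and the counts are ours, invented for illustration.

```python
import math


def chi_square_2x3(table):
    """Pearson chi-square test of independence for a 2 x 3 table of counts.

    table: two rows (e.g. late goal Yes / No), three columns (AW, D, HW).
    Returns (statistic, p_value). A 2 x 3 table has 2 degrees of freedom,
    and for 2 degrees of freedom the chi-square tail probability has the
    closed form exp(-x / 2), so no statistics library is needed.
    """
    if len(table) != 2 or any(len(row) != 3 for row in table):
        raise ValueError("expected a 2 x 3 table")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one match")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, math.exp(-statistic / 2)


# Invented counts, for illustration only (not from the HRG datasets).
statistic, p_value = chi_square_2x3([[10, 20, 30], [30, 20, 10]])
print(statistic, p_value, p_value < 0.05)
```

The closed form only holds for 2 degrees of freedom; a table of any other shape would need a general chi-square distribution function.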
CHAID graphic trees explained
The results of the process of analysis by KnowledgeSEEKER are given in the form of graphic trees. Since we are aware that many readers may not be familiar with this representation we explain it below by showing a step-by-step analysis of the Top 5 European Leagues.
- We start with outcome, the dependent variable, which forms the top node of our tree (Graphic_1) and shows the results (stats) of all matches. From the top down we have 4,572 AW (Away Wins), or 28.01% of all matches; 3,357 D (Draws), or 20.57%; and 8,392 HW (Home Wins), or 51.42%.
- We then split outcome by the binary variable last_goalbefHT, which records which team scored the last goal before halftime (HT). This gives (Graphic_2) the outcomes (highlighted) when the goal was scored by the A(way) and the H(ome) teams, which won 55.95% and 73.56% of their matches respectively.
- We are now ready to test Hyp_1 and Hyp_2 (Graphic_3). First, we duplicate the tree in Graphic_2 so that we can test both hypotheses. We then split the H(ome) and A(way) outcomes of each tree by the binary variables LastG41_45 (Hyp_1) and LastG44_45 (Hyp_2), which specify when the last goal was scored: in the last 5 or 2 min (Yes) or earlier in the half (No). The results of these tests are summarised as follows:
Hyp_1 (left tree). Away teams win fewer matches (53.77% vs 56.75%) and Home teams win more (74.05% vs 73.41%) when scoring in the last 5 min. Both results are close to, though short of, the 0.05 threshold (P = 0.08 and 0.06); we read this as evidence that Hyp_1 holds for H but not for A.
Hyp_2 (right tree). Both Away and Home teams win fewer matches when scoring in the last 2 min; the result is almost significant for Home (P = 0.07), less so for Away (P = 0.17). We therefore have to reject Hyp_2 in both cases.
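The top-node percentages quoted in the walkthrough above can be checked directly from the match counts. The short sketch below does so; the variable names are ours.

```python
# Outcome counts for the Top 5 European leagues, as shown in the top node.
counts = {"AW": 4572, "D": 3357, "HW": 8392}

total = sum(counts.values())
shares = {outcome: round(100 * n / total, 2) for outcome, n in counts.items()}

print(total)   # number of matches in the dataset
print(shares)  # percentage of all matches per outcome
```

The total agrees with the 16,321 matches reported for the Top 5 dataset, and the shares agree with the percentages in the top node.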
All results have been obtained following the process of analysis just shown above. For some we will publish their graphic trees, while for others only summary tables. For the sake of transparency and understanding, where summary tables are shown the corresponding graphic trees will be given in the Appendix to this report.
Analysis 1 – All results (Home and Away)
The first analysis aims to prove or disprove Hyp_1 and Hyp_2 at a general level; that is without discriminating between Home and Away results.
Hyp_1 – Last goal (41-45)
We can see that scoring goals in the last five minutes does not bring a significant advantage (all P > 0.05) in any of the competitions. However, in the Champions League such a goal is associated with a large increase in AW(ins), from 26.45% to 34.51%, which suggests some support for Hyp_1; the relatively low P = 0.16, though not significant, lends it some weight.
Hyp_2 – Last goal (44-45)
This analysis shows only small gains or losses in matches won when a goal is scored in the last 2 min. None of the results is even close to being significant, and we must therefore reject Hyp_2.
Analysis 2 – Home and Away results
Hyp_1 – Last goal (41-45)
Europe Top 5 – Scoring in the last 5 min is bad news for the Away team: they win fewer games (-2.88%). Home teams, however, gain a little from such goals (+0.64%). Both results are very close to being significant, and we therefore take Hyp_1 to hold for Home teams but not for Away ones.
Champions – A goal in the last 5 min does bring an advantage (+7.77%) to Away teams and a small disadvantage (-0.36%) to Home teams. However, neither change is significant.
Europe – Surprisingly, scoring in the last 5 min means fewer wins (-12.56%) for Away teams, a significant result (P = 0.03). Home wins are down too, but the loss is small and not significant. The same applies to goals in the last 2 min.
Hyp_2 – Last goal (44-45)
Europe Top 5 – For Away teams the results are similar to those for goals scored in the last 5 min: they win fewer matches (-3.30%). Home teams also win slightly fewer (-0.13%). Neither result reaches significance (P = 0.17 for Away, 0.07 for Home), and therefore we must reject Hyp_2.
Champions – A goal in the last 2 min wins more games (+2.37%) for Away teams but slightly fewer (-0.10%) for Home ones. Neither result is significant, so they do not support Hyp_2.
Europe – Not surprisingly, the results are similar to those for the last 5 min and show significantly fewer wins (-14.71%) for Away teams. Home wins are down too, but the loss is small (-1.21%) and not significant.
Summary of results
The results for all the datasets tested are summarised in the following table:
NS = difference not significant at the 95% confidence level (P ≥ 0.05)
So, is it true that scoring a goal in the last 5 minutes gives an advantage? The results give a very mixed picture.
The short answer is: “It depends”. It depends on the particular competition, and on whether a team is playing Home or Away. Only in the Champions League do such goals lead to an advantage, and then only for Away teams and without reaching significance. In contrast, there is a narrow, near-significant gain for Home teams in the Top 5. Overall, however, goals in the last 5 minutes mostly do not affect outcomes, and when they do, the result is more often negative than positive.
The results for goals in the last 2 min are similar: their influence on the outcome of a match is more often negative than positive. This is particularly true for the Europa League, where such goals are associated with a significant loss for Away teams, in contrast to a small advantage in the Champions League.
Our analyses therefore lead us to reject both hypotheses. As a general rule, goals scored in the last 5 or 2 min of the first half of a match do not give an advantage. They may do so, however, in some competitions and for a particular venue (H/A).
The lack of a general rule, where many assume there is one, suggests there is an argument against using a dataset combining many diverse leagues for such research; the results may apply to some but not others. We suppose that the reason is that ‘local’ factors influence the reaction of teams to events in a match. For such research to be reliable it is also crucial that a competition is made up of teams of similar strength, or at least teams that do not vary greatly in strength. While this is true for the top European competitions and football leagues, we doubt that it applies to leagues in other continents or nations.
Additional comments – Follow up research
Note that this research did not report or investigate other conditions which, combined with late first-half goals, could predict a more favourable outcome. It was not meant to! Examples of such conditions are the score or the goal difference before a late goal. We encountered these and other such conditions during our research but did not properly investigate them. It is something we hope to do in follow-up research.