Since first writing about Oxford's search for a Right Back in January I've had a feeling that whilst the steps are logical, they could probably be better. Ram Srinivas outlined a framework on the Purefitbaw podcast that led to me creating the first piece, but also to thinking about how it could be improved.
I started researching further and found this piece - it relates to the NBA, but why not adapt it and see if it can be applied to football too? On first reading I didn't have a clue what was going on, so I attempted to break down each element and see if it had a logical football implication. This led me to Will Gurpinar-Morgan's 2+2=11 blog; he initially presented a process at Opta in 2015 and presented further this year. (You should follow his work and watch his presentations!)
This could provide a blueprint within recruitment when sourcing players of a specific skillset to fulfil a specific role within the squad.
In Oxford's example Chris Cadden was a creative Right Back who was an important part of chance creation in the squad. His alternative, Sam Long, whilst being a decent squad player, had differing attributes and as such did (does) not represent a like-for-like replacement.
****I should add, I have been teaching myself R for the last few months and this seemed like a fun project to get involved with whilst learning some more programming. As with all my stuff, it could be good, it could be awful, but I learnt stuff along the way and it was a fun process****
Anyway - the interesting stuff.
Principal Component Analysis (PCA)
I won't go into the mechanics as most probably don't have a huge interest, but in short, it allows patterns within a data set to be found, whilst allowing us to take a variety of dimensions and show them on a single plot. Traditionally I have plotted multiple scatter plots.
Taking the 80 players that have played >450 minutes in League One so far this season at LB/RB we get the basic plot. Firstly, the red arrows appear grouped: Aerial Duels, Shots blocked etc. are in one grouping, with the passing metrics grouped in another. This provides a decent starting point, but to take it to the next level it would be useful to establish the differing styles of players within the data.
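For anyone who does want a peek at the mechanics, here is a minimal Python sketch of the idea (not the R code behind the plot). The metric names and numbers are made up purely for illustration. It standardises each metric, runs PCA via a singular value decomposition, and prints the loadings, which are what the arrows on a biplot represent.

```python
import numpy as np

# Hypothetical per-90 metrics for six full-backs (made-up numbers).
metrics = ["aerial_duels", "shots_blocked", "passes", "key_passes"]
X = np.array([
    [6.1, 0.9, 28.0, 0.4],
    [5.4, 0.8, 31.0, 0.6],
    [2.2, 0.3, 47.0, 1.5],
    [2.9, 0.4, 44.0, 1.2],
    [3.8, 0.5, 39.0, 0.9],
    [6.5, 1.0, 26.0, 0.3],
])

# Standardise each metric so that no single scale dominates.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via singular value decomposition of the standardised matrix.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
explained = S**2 / np.sum(S**2)   # share of variance per component

scores = Z @ Vt[:2].T     # player coordinates on PC1 and PC2
loadings = Vt[:2].T       # one row per metric: the biplot arrows

for name, (pc1, pc2) in zip(metrics, loadings):
    print(f"{name:14s} PC1={pc1:+.2f} PC2={pc2:+.2f}")
print("variance explained by PC1, PC2:", np.round(explained[:2], 2))
```

Metrics whose arrows point the same way tend to rise and fall together across players, which is why the defensive numbers and the passing numbers end up in separate groupings.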
Cluster Analysis
Again, I won't go over the boring stuff but essentially from assessing the data there were 3 primary clusters.
Cluster 1 contains 31 players
Cluster 2 contains 28 players
Cluster 3 contains 21 players
On the above, Cadden is #8 with Sam Long being #59. First observation: they both fall into cluster 3 (yet to know what that is!). This is where team style plays a role and should be a discriminatory factor from the start (highlighted in the first blog!). We are now starting to get a little closer to establishing some player similarity, with an understanding of their playing style.
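To give a feel for what a clustering step can look like, here is a small Python sketch. It assumes k-means, which is only one of several options and not necessarily the method behind the plot above. The data is synthetic (three well-separated blobs), and the `kmeans` helper is written just for this example.

```python
import numpy as np

def kmeans(Z, k, n_iter=50):
    """Plain k-means with a deterministic farthest-point start (illustrative only)."""
    centroids = [Z[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen centroid.
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centroids], axis=0)
        centroids.append(Z[np.argmax(d)])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Assign each player to the nearest centroid.
        dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (keep it if empty).
        new = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Synthetic stand-in for player coordinates on two principal components.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
sizes = [10, 8, 6]
Z = np.vstack([c + 0.5 * rng.standard_normal((n, 2)) for c, n in zip(centres, sizes)])

labels, centroids = kmeans(Z, k=3)
counts = np.bincount(labels, minlength=3)
print("players per cluster:", counts)
```

On real data the clusters overlap far more than these blobs do, so the choice of 3 clusters would normally be checked rather than assumed.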
The next phase, establishing the dominant features of each cluster.
Cluster 1 - primary features = Shots blocked // Aerial Duels // Interceptions // Long Passes
Cluster 2 - primary features = Dribbles // Shots // Touches in Box // xG // Crosses
Cluster 3 - primary features = Passes // short/medium/forward passes // shot assists (key passes) // crosses // crosses to box // through balls
With the above information, we can now describe:
Cluster 1 = Defensive
Cluster 2 = Attacking/goal threat
Cluster 3 = Creative
Marvellous, now through the above process we have some clear playing styles whilst attributing players to those styles.
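One way to pull out the dominant features of each cluster is to average the standardised metrics of its members and see which come out highest. A rough Python sketch under that assumption, using hypothetical values and a hypothetical `dominant_features` helper:

```python
import numpy as np

metrics = ["shots_blocked", "aerial_duels", "dribbles",
           "touches_in_box", "key_passes", "through_balls"]

# Hypothetical standardised values for six players (rows) and six metrics.
Z = np.array([
    [ 1.2,  1.0, -0.6, -0.7, -0.5, -0.4],
    [ 0.9,  1.3, -0.4, -0.5, -0.8, -0.6],
    [-0.5, -0.6,  1.1,  1.4, -0.2, -0.3],
    [-0.7, -0.4,  1.3,  0.9, -0.1, -0.5],
    [-0.4, -0.8, -0.7, -0.6,  1.0,  1.2],
    [-0.5, -0.5, -0.7, -0.5,  0.6,  0.6],
])
labels = np.array([0, 0, 1, 1, 2, 2])  # cluster assigned to each player

def dominant_features(Z, labels, metrics, top_n=2):
    """Return the top_n metrics with the highest mean value in each cluster."""
    out = {}
    for cluster in np.unique(labels):
        centre = Z[labels == cluster].mean(axis=0)
        top = np.argsort(centre)[::-1][:top_n]
        out[int(cluster)] = [metrics[i] for i in top]
    return out

for cluster, feats in dominant_features(Z, labels, metrics).items():
    print(f"Cluster {cluster}: {' // '.join(feats)}")
```

Naming the clusters (Defensive, Attacking, Creative) is still a judgement call made by reading those top features.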
The next challenge was to visualise the individual players against one another. This still needs further work, however I have normalised all values against those within the data set, which allows a quick overview comparison (check MrktInsights for a more polished version!)
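As a rough illustration of normalising against the data set, here is a Python sketch that assumes simple min-max scaling, so that the lowest value for each metric becomes 0 and the highest becomes 1. Percentile ranks would be an equally reasonable choice. The numbers are made up.

```python
import numpy as np

def min_max_normalise(X):
    """Scale each column to the 0-1 range relative to the players in X."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid dividing by zero on constant columns
    return (X - lo) / span

# Hypothetical raw metrics: crosses, passes, key passes per 90.
X = np.array([
    [2.0, 40.0, 0.5],
    [4.0, 30.0, 1.5],
    [6.0, 50.0, 1.0],
    [3.0, 35.0, 0.5],
])

profile = min_max_normalise(X)
print(np.round(profile, 2))
```

Each row can then be drawn as a bar or radar profile, with every metric on the same 0 to 1 scale.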
To take a player from each cluster:
Defensive
Harry Brockbank:
Attacking
Ryan Giles:
Creative
Brandon Haunstrup:
From the above, having a quick scan, the initial clustering seems to be pretty spot on. It obviously needs further investigation, but I will move beyond this and start looking into Cadden replacements by implementing the above process (using League 1, League 2 and Scotland data).
For reference, the profile of Cadden:
Now the fun bit....finding players in the same cluster as Cadden (have already filtered out players such as Perry Ng that would cost serious £££)
Callum Brittain - MK Dons:
Stephen O'Donnell - Kilmarnock:
Lewie Coyle - Fleetwood:
Brad Halliday - Doncaster:
Tom James - Hibernian:
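A possible refinement, sketched in Python below, is to rank the players within the target's cluster by how close their standardised profile is to his. This is an assumption about how a shortlist could be ordered, not a description of how the list above was produced. The names, values and the `nearest_in_cluster` helper are all hypothetical.

```python
import numpy as np

def nearest_in_cluster(names, Z, labels, target, top_n=3):
    """Rank same-cluster players by Euclidean distance to the target player."""
    idx = names.index(target)
    same = [i for i in range(len(names)) if labels[i] == labels[idx] and i != idx]
    dist = {names[i]: float(np.linalg.norm(Z[i] - Z[idx])) for i in same}
    return sorted(dist.items(), key=lambda kv: kv[1])[:top_n]

# Placeholder players with made-up standardised values on two metrics.
names = ["Target", "Player A", "Player B", "Player C", "Player D"]
Z = np.array([
    [ 1.0,  1.0],
    [ 1.2,  0.9],
    [ 0.2,  1.8],
    [-1.0, -1.0],
    [ 0.9,  1.4],
])
labels = [2, 2, 2, 0, 2]  # Player C sits in a different cluster

shortlist = nearest_in_cluster(names, Z, labels, "Target")
for name, d in shortlist:
    print(f"{name}: distance {d:.2f}")
```

Filters such as likely cost would still be applied by hand before or after this step.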
As always the above comes with the caveat that this is a filtering process that would lead to further traditional scouting processes. Interestingly, in my first piece Coyle was flagged for similarity...that obviously still holds having gone through the above process. Another interesting addition is that of Stephen O'Donnell who Oxford extensively attempted to bring in during the January window - a clear indication that the recruitment team and I are on the same page ;)
There will possibly be a part 3 to this using event data, going beyond the initial numbers and drilling down into the location of crosses, shots, shot assists etc (will just have to use StatsBomb WSL data instead!)
Hi Mark,
Great post - I am starting out my football analytics journey with a keen interest in the EFL. Would you be able to share which data source you use for the EFL and/or in this post specifically?
Also do you have any recommendations for event data in the EFL?
Keep up with the posts, they are really interesting!
Thanks
Ryan
Hi Ryan,
I got the above from Wyscout but event data for the EFL is pretty tricky to come by!
I would suggest manually plotting data to start using: https://torvaney.github.io/projects/tracker
I would initially start with shots but it gives you data to play with! Alternatively you can use the free StatsBomb data.
Thanks,
Mark