Application of PCA in data driven recruitment

When I first started to learn R, after four or five weeks I decided to answer a recruitment-based question concerning Oxford United and the right-back position. This led me to create a piece using Principal Component Analysis (PCA) at a very basic level, to see if there is a quick and efficient way to categorise and analyse player styles.

Can this then form the basis of an indicator highlighting players with similar playing styles and, as such, play a role in replacing players or finding players to fit a specific system?

My original piece is here. It's always weird to read stuff back, but I will try to build on this! There is a quick and brief explanation of PCA there, along with a few other links to PCA within football.

Since I produced the above, Mark Carey has done some great work applying PCA to midfielders in the top 5 leagues. This is an area that has always intrigued me, and after some limited work in professional football I'm certain PCA can play a large role in guiding recruitment (certainly as an initial step!).

To check my process I thought it would be best to replicate some of Mark's work - similar output = I'm doing something broadly right!

Sorry for the basic Excel table, but I wanted to get this down quickly. Similarly to Mark, I found 5 components for central midfielders who have played >500 minutes in 2019/2020. The naming of these components could be altered, however from a quick overview there is some good crossover with Mark's findings. At this point, it's probably handy to point out that I have just thrown in Wyscout data, whilst Mark used the StatsBomb-powered fbref.




As my process has done a pretty reasonable job, I will look to apply it to the second tiers of the top 5 leagues in Europe. This could be applied to any league where data is available, and to any position on the pitch: the process that follows could easily be applied to wingers or defenders.

For the purpose of this blog the leagues included were:
- England - Championship
- Spain - Segunda
- France - Ligue 2
- Italy - Serie B
- Germany - Buli.2

The above amounts to 632 players. These players satisfy the criteria:
- Position is Centre Midfield (as deemed by Wyscout) - there will be some positional anomalies
- Have played >500 minutes in 2019/2020
- Play in the above leagues

The metrics used (again, sorry for the basic table!):



All performance metrics are per 90 minutes (P90). The first column is made up of filters that will be applied later in the process. I avoided xA/xG etc. as I wanted to find player style and felt these are more of a by-product (am happy to be corrected though!)

Lovely job, away we go. I have scaled all the data prior to performing the PCA so that metrics with naturally bigger numbers (passes, say) don't get a weird weighting over the rest...let's check the result:




Ah look, another Excel table. Anyway, from this we can establish which features are dominant within each style.
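For anyone wanting to see the mechanics, here is a rough Python sketch of the scale-then-PCA step (it is not the R code behind the tables in this post). It z-scores each metric, builds the correlation matrix and pulls out the leading components by power iteration with deflation, which is a simple standard-library stand-in for a proper PCA routine. The six players and their numbers are made up purely for illustration.

```python
import math
import statistics


def standardise(columns):
    """Z-score each metric: subtract its mean, divide by its sample SD."""
    scaled = []
    for col in columns:
        mean, sd = statistics.fmean(col), statistics.stdev(col)
        scaled.append([(x - mean) / sd for x in col])
    return scaled


def correlation_matrix(scaled):
    """Covariance of z-scored metrics, i.e. their correlation matrix."""
    n = len(scaled[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in scaled]
            for ci in scaled]


def leading_components(matrix, k, iterations=500):
    """First k (variance, loadings) pairs, by power iteration with deflation."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    found = []
    for _component in range(k):
        vec = [1.0 / (i + 1) for i in range(size)]  # arbitrary non-zero start
        for _step in range(iterations):
            nxt = [sum(w * v for w, v in zip(row, vec)) for row in work]
            norm = math.sqrt(sum(x * x for x in nxt))
            vec = [x / norm for x in nxt]
        # Rayleigh quotient: the variance captured along this direction.
        variance = sum(v * sum(w * u for w, u in zip(row, vec))
                       for v, row in zip(vec, work))
        found.append((variance, vec))
        # Deflate: remove the component just found before looking for the next.
        work = [[work[i][j] - variance * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return found


# Hypothetical per-90 figures for six players (not from the real data set).
metrics = ["passes", "key_passes", "aerial_duels"]
columns = [
    [30, 45, 50, 38, 60, 42],
    [0.5, 1.2, 1.4, 0.8, 1.9, 1.0],
    [4.0, 1.5, 2.0, 5.0, 1.0, 3.0],
]

scaled = standardise(columns)
components = leading_components(correlation_matrix(scaled), k=2)
for variance, loadings in components:
    share = variance / len(metrics)  # each scaled metric has variance 1
    print(round(share, 2), [round(x, 2) for x in loadings])
```

Each printed row is the share of total variance a component explains, followed by its loadings on passes, key passes and aerial duels. The sign of a component is arbitrary, so only the relative signs within a row mean anything.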

Primary metrics within each style:

Creator:
Passes to penalty area
Deep completions
Key passes
Through passes
Crosses

Engine:
Passes
Short/medium passes
Lateral passes
Forward passes
Progressive passes

Carrier:
Offensive duels
Dribbles
Progressive runs

Playmaker (* = weaker influence):
Defensive duels
Interceptions
*Long passes
*Forward passes
*Progressive passes

Defensive:
Aerial duels
Shots blocked
Interceptions

The 4th style, playmaker, is a weird mix that probably needs further investigation: only defensive duels have a really strong influence, with expansive passing correlating only partly. I will leave it as is and see what happens!
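One simple way to pick out the primary metrics for a style is to sort that component's loadings by absolute size and keep the top few. A minimal Python sketch, assuming the loadings for one component are held in a dictionary; the function name and the numbers are hypothetical and not taken from the actual results:

```python
def dominant_metrics(loadings, top_n=3):
    """Return the top_n (metric, loading) pairs by absolute loading, largest first."""
    ranked = sorted(loadings.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:top_n]


# Hypothetical loadings for one component, for illustration only.
carrier_loadings = {
    "passes": 0.05,
    "dribbles": 0.52,
    "aerial_duels": -0.12,
    "progressive_runs": 0.48,
    "offensive_duels": 0.45,
}

for metric, loading in dominant_metrics(carrier_loadings):
    print(f"{metric}: {loading:+.2f}")
# dribbles: +0.52
# progressive_runs: +0.48
# offensive_duels: +0.45
```

Sorting on the absolute value matters because a strongly negative loading shapes a style just as much as a strongly positive one. A mixed bag like the playmaker component would show up here as one large loading followed by several middling ones.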

Now we have the 5 styles and know the metrics that contribute, we can investigate which style players fall into. Assessing all players, the top 20 in each category:
















A few notes on the above. I have ranked the players based on the 600+ players in the data set - therefore Hernandez is the top creator and midfield engine (passer) compared to all other players across the 5 leagues analysed. If a player crops up in two styles they are probably worth looking at!

The above essentially creates a 20-man shortlist to look into if you are after a specific style. Obviously this is only an initial filter, but it gives a good indication of a player's style and strengths.
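The shortlisting step can be sketched in a few lines of Python. This assumes each player already has a score on each style (their position along that component) and that a higher score means more of that style; since PCA signs are arbitrary, a component may need flipping first. Player names, scores and function names are all hypothetical.

```python
from collections import Counter


def top_n_per_style(scores, n=20):
    """Map each style to its n highest-scoring players, best first."""
    return {
        style: sorted(by_player, key=by_player.get, reverse=True)[:n]
        for style, by_player in scores.items()
    }


def multi_style_players(shortlists):
    """Players who appear in at least two of the style shortlists."""
    counts = Counter(p for players in shortlists.values() for p in players)
    return sorted(p for p, c in counts.items() if c >= 2)


# Hypothetical component scores; assumes higher = more of that style.
scores = {
    "creator": {"Player A": 2.1, "Player B": 0.4, "Player C": 1.3, "Player D": -0.8},
    "engine": {"Player A": 1.7, "Player B": 1.9, "Player C": -0.2, "Player D": 0.1},
}

shortlists = top_n_per_style(scores, n=2)
print(shortlists)
print(multi_style_players(shortlists))
```

With the toy numbers, Player A makes both shortlists, which is the "crops up in two styles" flag described above.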

To take this further, a data-driven club will probably be looking for players with resale value. As such, the above lists can be filtered by age, market value and contract expiry. I will simply filter by age...let's go 24 or under, with over 900 minutes played (this reduces the number of players to 195):

















We are starting to pick up some decent young talent here including:

- D'Arpino (crops up in 3 of the styles!)
- Frattesi linked with Everton
- Gueye who appears to be moving to Watford
- Fein looks to be returning to the Bayern first team this summer
- Julien Ponceau linked with Swansea and Sevilla
- Samuele Ricci currently linked with Napoli

The PCA appears to have picked out some players of pedigree...always a good sign!
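The age-and-minutes filter behind these shortlists is simple enough to sketch in Python. The `Player` class, its field names and the example players are assumptions for illustration; market value or contract expiry could be added as further conditions in the same way.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    age: int
    minutes: int


def resale_candidates(players, max_age=24, min_minutes=900):
    """Keep players aged max_age or under who have played more than min_minutes."""
    return [p for p in players if p.age <= max_age and p.minutes > min_minutes]


# Hypothetical players, for illustration only.
players = [
    Player("Player A", 22, 1850),
    Player("Player B", 27, 2400),  # too old
    Player("Player C", 24, 901),
    Player("Player D", 19, 640),   # not enough minutes
    Player("Player E", 24, 900),   # exactly 900 is not "over 900"
]

young = resale_candidates(players)
print([p.name for p in young])  # ['Player A', 'Player C']
```

The style rankings can then either be re-computed within this smaller group or simply read off the overall rankings, depending on whether you want players judged against their peers or against everyone.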

The playmaker dimension is a strange one, with Krystian Bielik cropping up. Looking back at the contributory factors, defensive duels and interceptions have a reasonable influence, along with progressive passing metrics (to a lesser extent!). Whilst PCA does a good job of assigning players to a playing style, some will be misplaced.

As a final step, we can validate some of the findings by looking into the individual metrics of a player, creating a (very basic!) dashboard.

For example, Tommaso Pobega ranks second in the Defensive style overall, and first amongst those 24 and under with over 900 minutes. Looking back, we would expect Pobega to rank highly in aerial duels, interceptions and blocked shots (as these are strongly correlated with the playing style)...




Nice. Turns out Pobega is on loan from AC Milan...here he is with some lo-fi backing track - https://www.youtube.com/watch?v=butbGPKJAa0

To double check this we can look at other styles, such as creator. We would expect to see high rankings for:
Passes to penalty area
Deep completions
Key passes
Through passes
Crosses

An example: Lilian Egloff (ranks 4th overall and 2nd among the 24-and-unders):



In all of the influential creative metrics Egloff ranks in the top 25%...pretty encouraging for a 17-year-old! Looking into this further, Wyscout has used Egloff's Stuttgart U19 minutes alongside his 25 minutes in Buli. 2, therefore the majority of this is based on youth football. Oh. Nonetheless, probably one to keep an eye on!
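A "top 25%" check like this boils down to ranking one player's value for a metric against everyone else's. A minimal Python sketch, with made-up key pass figures and hypothetical function names; it is one plausible way to build the ranking behind a dashboard, not necessarily how the charts in this post were produced:

```python
def rank_position(value, population):
    """1-based rank of value within population, where 1 is the highest."""
    return 1 + sum(1 for x in population if x > value)


def in_top_share(value, population, share=0.25):
    """True if value ranks within the top `share` of the population."""
    return rank_position(value, population) <= share * len(population)


# Hypothetical key passes per 90 for eight midfielders.
key_passes = [0.4, 1.1, 0.9, 2.3, 1.6, 0.7, 1.9, 1.2]

print(rank_position(1.9, key_passes))  # 2
print(in_top_share(1.9, key_passes))   # True: 2nd of 8 is inside the top 25%
print(in_top_share(1.6, key_passes))   # False: 3rd of 8 is just outside
```

Running the same check over each of a style's primary metrics gives the kind of profile shown above. As the Egloff case shows, it is only as good as the minutes that sit behind the numbers.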

We can take this several steps further but I will leave it there!

As always, this makes up just a small part of recruitment, but at the very least can inform decisions as to the style of player you are scouting. This can be used to validate video and live scouting, or to flag players that weren't otherwise on the radar. By applying a variety of filters you can find those players that fit within the club model, limiting errors in the transfer market, potentially finding talent early. Match the above with team playing styles (you could perform PCA on clubs) and this forms the base of a powerful recruitment tool. This can be applied to all positions - I will look to perform a similar analysis solving a specific recruitment problem.

If you have any feedback, just let me know! Am always happy to help/receive guidance or criticism!









Comments

  1. Hi Mark,

    Fantastic analysis!

    How did you generate the ranking value for the x axis?

    Many thanks.

