Space Recruits, Part 2: Data Scouting on a Budget

How far can public stats get you toward a good scouting list?

You’re reading Space Recruits, a special series on recruiting made possible by space space space’s paid members. Please consider becoming a subscriber to read the full archive and get more letters.

The problem with soccer data is that it’s expensive. Not expensive compared to dropping millions on an agent recommendation who turns out to be a bust. Probably not expensive compared to running an old-fashioned scouting network. But in plain old dollars or euros or Brazilian reais, buying data subscriptions and hiring people with the expertise to turn them into recruiting models isn’t cheap. Clubs — especially smaller clubs — want cheap.

But what if they didn’t have to buy raw data to do data scouting? As Jan Van Haaren put it in Part 1:

Often in recruitment, the metrics are only used to find players. You want players who performed well in a certain area to pop up at the top of your list, and the actual number doesn’t matter that much. You obviously want your metrics to be as accurate, reliable, and robust as possible, but it really depends on the task at hand how reliable and robust they need to be.

If the goal of your analytics operation is to produce a list of names for your scouts to go watch, it’s the names that matter, not how you got there.

Estimating On-Ball Value

Last fall, a couple of American Soccer Analysis contributors named Mike Imburgio and Sam Goldberg decided to see if they could get there in a way that would help clubs on a budget. After months of trial and error and hundreds of emails back and forth, they introduced a framework called DAVIES (one of those reverse-engineered sports acronyms where even the people who came up with it don’t really care what it stands for). It’s a model of a model. The idea is to approximate goals added (g+), American Soccer Analysis’s action value model, which assigns a goal value to thousands of individual events each game. But unlike g+, DAVIES is built on aggregate season stats from FBref’s StatsBomb data. That means it’s totally free not just for MLS but also for Europe’s top five leagues.
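For a sense of what a “model of a model” can look like in practice, here is a minimal sketch, not the authors' actual method. It fits an ordinary least squares regression that estimates a player's g+ from season aggregates. The two features, the made-up numbers, and the helper names (`solve`, `fit_ols`, `predict`) are all assumptions for illustration; the piece doesn't say which stats or which fitting method DAVIES really uses.

```python
# Hypothetical sketch of a surrogate model: learn to estimate an event-level
# metric (g+) from aggregate season stats. All data below is invented.

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, targets):
    """Return [intercept, coef_1, ...] via the normal equations."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Estimate the target for one player's aggregate stats."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Assumed features per 90 minutes: (progressive passes, shots).
rows = [(4.0, 1.0), (6.0, 0.5), (2.0, 3.0), (5.0, 2.0), (3.0, 1.5)]
# Invented g+ per 90 values the surrogate should learn to reproduce.
targets = [0.02, 0.025, 0.06, 0.06, 0.025]

coefs = fit_ols(rows, targets)
estimate = predict(coefs, (4.5, 1.2))  # a player with no event-level data
```

The point of the design is that the expensive event data is only needed once, to produce the training targets. After that, the fitted coefficients can score any player in a league where only season totals are public.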

“It's a player evaluation metric that accounts for a player's age and their style of play,” Goldberg told Ryan O’Hanlon’s No Grass in the Clouds newsletter. “So it predicts a metric called goals added, which is an overall value of how many goals a player adds to their team over the course of a season. And then adjusts it based on similar players by their playstyle and their age.”

The metric is named for Alphonso Davies, who put up eye-watering g+ as a 17-year-old in MLS before Bayern Munich signed him. And yeah, DAVIES would have had him at the top of a scouting list too. “I didn’t think we were going to get as close as we did to a metric that I think is as good as goals added,” Imburgio told me. “Goals added can see things we can’t see with DAVIES, but the potential to do something like that without event-level data, which I think can be pretty expensive, makes it so much more applicable.”

Now, at this point in the letter I could just pull up some guys who score high in DAVIES and let you decide for yourself if it’s finding prospects worth watching. We’ll get to that in a second. If you’re in a hurry, the data is all freely available in an online app that you can sort and filter and even download. But the fun part, in my admittedly weird opinion, is understanding where the numbers come from.

Adjusting for Playstyle

The first step in turning a raw goals added estimate into a DAVIES value is comparing players in similar roles. Before you can tell if a player is good at his job, you need to know what his job is. First, all outfield players are sorted into a few main position groups according to generic indicators like touches by third. Then they’re clustered again using more specific stats to split each position group into playstyles. Instead of lumping all attackers together, DAVIES calls some “Dribblers,” others “Playmakers,” and a third group “Finishers” depending on what they do with the ball. (There are nine playstyles in all.)
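The two-stage sorting described above can be sketched roughly like this. Treat it as an illustration under stated assumptions: the mapping from thirds to position groups, the three stats, and the centroid values are all hypothetical, and a nearest-centroid lookup stands in for whatever clustering algorithm the authors actually ran.

```python
import math

# Assumed mapping from the third where a player touches the ball most
# to a coarse position group (stage 1).
THIRD_TO_GROUP = {
    "defensive": "defender",
    "middle": "midfielder",
    "attacking": "attacker",
}

def position_group(touches_by_third):
    """Stage 1: pick the group for the third with the most touches."""
    busiest_third = max(touches_by_third, key=touches_by_third.get)
    return THIRD_TO_GROUP[busiest_third]

# Hypothetical playstyle centroids for attackers, expressed as the share
# of a player's on-ball actions that fall in each category.
ATTACKER_STYLES = {
    "Dribbler": {"dribbles": 0.6, "key_passes": 0.2, "shots": 0.2},
    "Playmaker": {"dribbles": 0.2, "key_passes": 0.6, "shots": 0.2},
    "Finisher": {"dribbles": 0.1, "key_passes": 0.1, "shots": 0.8},
}

def nearest_style(profile, centroids):
    """Stage 2: assign the playstyle whose centroid is closest."""
    def gap(centroid):
        keys = sorted(centroid)
        return math.dist([profile[k] for k in keys], [centroid[k] for k in keys])
    return min(centroids, key=lambda name: gap(centroids[name]))

# An invented winger: most touches up the pitch, dribble-heavy action mix.
touches = {"defensive": 100, "middle": 400, "attacking": 900}
action_mix = {"dribbles": 0.5, "key_passes": 0.2, "shots": 0.3}

group = position_group(touches)
style = nearest_style(action_mix, ATTACKER_STYLES) if group == "attacker" else None
```

Splitting the job in two keeps the second pass honest: a centre back and a striker never compete for the same playstyle label, so the finer stats only have to separate players who already share a part of the pitch.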

One challenge in building the model was doing playstyle clustering in a way that didn’t confuse style with quality. “You could very easily do a clustering that just gives you all the best players. For the purpose of DAVIES, that’s terrible,” Imburgio said. “We don’t want to just compare the best players to the best players. So I started normalizing by touch.” Dividing each stat by a player’s touch count turns raw volume into a rate, so a high-usage star and a seldom-used squad player with the same habits can land in the same cluster.
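Here's a small sketch of why normalizing by touch matters, using two invented players who do the same things with the ball at very different volumes. The stat names, counts, and helper functions are assumptions for illustration only.

```python
import math

def per_touch(counts, touches):
    """Turn raw season counts into rates per touch."""
    return {stat: n / touches for stat, n in counts.items()}

def distance(a, b):
    """Euclidean distance between two stat profiles with the same keys."""
    keys = sorted(a)
    return math.dist([a[k] for k in keys], [b[k] for k in keys])

# Invented season totals: a star with 3000 touches and a squad player
# with 1000 touches who uses the ball in exactly the same proportions.
star_counts = {"dribbles": 120, "key_passes": 60, "shots": 90}
squad_counts = {"dribbles": 40, "key_passes": 20, "shots": 30}

# On raw counts the two look far apart, which is volume, not style.
raw_gap = distance(star_counts, squad_counts)

# On per-touch rates their profiles are identical.
rate_gap = distance(per_touch(star_counts, 3000), per_touch(squad_counts, 1000))
```

A clustering run on the raw counts would tend to separate these two, while one run on the per-touch rates would treat them as the same kind of player and leave the question of who is better to the value model.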

DAVIES clusters players into nine playstyles.

Playstyles are DAVIES’s way of comparing apples to apples when calculating player values, but they also double as a useful filter for scouting purposes. Say you’re Ed Woodward and you’re getting a little nervous about Paul Pogba running out his contract. You could start the search for an understudy by looking for players at the same position, but positions are a mess. FBref’s midfield-slash-forward heap includes not only Pogba but also guys like Marco Reus (a Dribbler, according to DAVIES) and Neymar (a Playmaker). If you start with the Attacking Central Progressor style instead, you see names like Frenkie de Jong and Sergej Milinković-Savić. Goldberg and Imburgio describe this group as “players who play box-to-box, often carry the ball forward, play progressive passes and sometimes shoot or play balls into the box themselves.” Now we’re at least in the right ballpark.

Pogba's Attacking Central Progressor cluster is for box-to-box types.

Adjusting for Age

The second comparison baked into a DAVIES value is by age group, which goes back to how the project got its start. “It was originally not meant to be a player value model. We set out to try to build a player forecasting model, to predict future goals added,” Imburgio said. “We found that we were pretty good at getting player value-type numbers that made sense, and we were really bad at forecasting.” Instead of trying to predict the future, Goldberg and Imburgio settled for adjusting the present by comparing players’ contribution to guys in a similar role in one of five age bands: youth, rising to prime, prime, falling from prime, or veteran.
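To make the age adjustment concrete, here's a rough sketch under loud assumptions. The five band names come from the description above, but the age cutoffs are hypothetical, and so is the adjustment itself: subtracting the average raw value of a player's peers in the same playstyle and age band is one simple way to compare against a peer group, not necessarily how DAVIES does it. The players and values are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical upper age limits (inclusive) for the first four bands;
# anyone older falls into "veteran".
AGE_BANDS = [
    (21, "youth"),
    (24, "rising to prime"),
    (28, "prime"),
    (31, "falling from prime"),
]

def age_band(age):
    """Map an age to one of the five bands."""
    for upper, label in AGE_BANDS:
        if age <= upper:
            return label
    return "veteran"

def adjust(players):
    """Return each player's raw value minus the mean of their peer group."""
    groups = defaultdict(list)
    for p in players:
        groups[(p["style"], age_band(p["age"]))].append(p["raw"])
    return {
        p["name"]: p["raw"] - mean(groups[(p["style"], age_band(p["age"]))])
        for p in players
    }

# Invented players who all share one playstyle.
style = "Attacking Central Progressor"
players = [
    {"name": "Prime A", "style": style, "age": 27, "raw": 4.0},
    {"name": "Prime B", "style": style, "age": 26, "raw": 2.0},
    {"name": "Youth C", "style": style, "age": 19, "raw": 2.0},
    {"name": "Youth D", "style": style, "age": 20, "raw": 0.0},
]

adjusted = adjust(players)
```

In this toy data, Prime A doubles Youth C's raw output, but both sit one goal above their own peer group, so they come out level after the adjustment.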

If we narrow the Pogba search by age and DAVIES value, we’ll see numbers that have already been adjusted. So while Pogba generates more expected goals added according to the base model than Bordeaux’s Yacine Adli or Barcelona’s Pedri, they come out roughly even in DAVIES against their respective age groups. It’s not fancy as far as age curve modeling goes, but nothing about DAVIES is meant to be precise — it only has to work. One of the first checks Goldberg and Imburgio ran when they were building the model was to go back to the first season in their data and see how the top prospects’ careers had developed since then. “When we looked at young players from a few years ago and it ended up being a very good list,” Imburgio said, “that’s when I started believing these values were useful.”

The list of Attacking Central Progressors age 24 and under with comparable DAVIES to Pogba includes some famous names and some less-famous ones.

The list of Attacking Central Progressors age 24 and under with a comparable DAVIES to Pogba looks promising. The highest-scoring players, Frenkie de Jong, Lucas Paquetá, and Nicolò Barella, had high-profile summers with three of the world’s best national teams. The youngest, Pedri and Eduardo Camavinga, are two of the game’s most coveted prospects. These guys don’t all have exactly the same profile, but that’s not necessarily a bad thing. DAVIES isn’t trying to do player similarity rankings. Its broad playstyles make sense given the blurriness of season-level stats for players like Pogba, who plays multiple positions, and the uncertainty of recruiting. You want a list of guys who might fit; the rest is up to judgment.

But good luck buying de Jong or Barella right now, let alone convincing them to be anyone’s understudy. The reason I picked a hypo about scouting a backup is that it’s more fun to filter the list one more time to guys with a Transfermarkt value under $25 million and look for players who aren’t famous yet but might be one day. “The gold standard for a scouting model is to be able to tell the future,” Imburgio told me. “Everybody wants a diamond in the rough.”
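The filtering in this search boils down to a few stacked conditions, sketched below. Everything here is a placeholder: the `Player` fields, the `shortlist` helper, the DAVIES threshold, and the five anonymous players are invented for illustration and don't come from the DAVIES app or Transfermarkt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Player:
    name: str
    playstyle: str
    age: int
    davies: float
    market_value_musd: float  # market value in millions of dollars

def shortlist(players, playstyle, max_age, min_davies, max_value_musd):
    """Keep players who match every filter, best DAVIES first."""
    hits = [
        p for p in players
        if p.playstyle == playstyle
        and p.age <= max_age
        and p.davies >= min_davies
        and p.market_value_musd < max_value_musd
    ]
    return sorted(hits, key=lambda p: p.davies, reverse=True)

ACP = "Attacking Central Progressor"

# Invented pool; each player fails or passes for a different reason.
pool = [
    Player("Player A", ACP, 22, 1.8, 18.0),         # passes
    Player("Player B", ACP, 23, 2.4, 60.0),         # too expensive
    Player("Player C", ACP, 29, 2.0, 12.0),         # too old
    Player("Player D", "Dribbler", 21, 2.2, 10.0),  # wrong playstyle
    Player("Player E", ACP, 19, 2.1, 24.0),         # passes
]

targets = shortlist(pool, ACP, max_age=24, min_davies=1.5, max_value_musd=25.0)
names = [p.name for p in targets]
```

Sorting by value at the end keeps the output in the shape a scout wants: a short, ranked list of names to go watch.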

Seven players in our search have a Transfermarkt value under $25 million.

Does any of the seven players left on the list have the potential to step in for Pogba? Maybe not, in this price band, but that’s for scouts to figure out. We’ll get to some video work later on in Space Recruits. What I can tell you is that, while I generally don’t recommend watching Bordeaux games, Adli usually makes it worthwhile.

Speaking of diamonds in the rough, DAVIES helped get its creators noticed. Goldberg got hired as a data scientist for the New York Red Bulls. Imburgio built a new model similar to DAVIES that covers more leagues, and clubs in Austria and Germany have been trying it out. “The ratings have gotten pretty good feedback from scouts out there,” he said. “That helped validate for me that doing this is useful for clubs that don’t have event-level data for the leagues they want to scout.” ❧

Image: Georges Méliès, A Trip to the Moon