An Introduction to Soccer Analytics

Some numbers-free background on how soccer uses data.

If you subscribe to space space space, first of all let me just say you’re a discerning and cultured and probably very good looking person; also you may have noticed that the letters talk a lot about analytics. Some of you have careers in the field but for others it’s new and kind of confusing. I hope this intro will give everyone something to think about.

What are analytics? The dictionary tells us the word entered English in the late sixteenth century from Aristotle’s analytiká, an Ancient Greek root meaning “sports talk for nerds who’ve never won a tackle.” Since formal syllogisms aren’t much help in scouting left backs, in soccer we usually reserve the term for data analytics—you know, stats. Tables and figures. Messi vizzes that’ll do numbers on Twitter. Anything that sets out to ruin the beautiful game by turning it into math class, that’s analytics.

The best reason to try to measure the sport is the same reason people used to say it couldn’t be measured: soccer is hard. Even coaches and analysts and scouts who’ve spent their lives learning to watch it won’t see games quite the same way. There are too many moving parts, too many possibilities to hold in your head at once. Had we but world enough and time, you might rewatch each match over and over to pause and study it and it’d still be impossible to see and remember it all. And if you have to do that for an opponent’s entire season, or a continent you’re charged with scouting? Pretty soon it starts to look like the nerds might be onto something.

Data scales. It doesn’t sleep or steal your food out of the office fridge. It’s objective-ish and consistent-ish, depending on what exactly it measures and how, and honestly sometimes even sketchy data is better than the stories we reduce soccer to without it. The hard part of analytics comes when we have to boil down the data to stories, too, to make it useful. How do we know what it means, and what we’re missing?

Where do soccer analytics come from?

The story of soccer analytics usually starts with a cautionary tale. At 3:50 p.m. on March 18, 1950—his notes were very precise—a retired Royal Air Force officer named Charles Reep, who had trained as an accountant, began systematically recording events at a Swindon Town match. Reep didn’t just want a detailed record of what happened. Like any good analyst, he wanted to know why it happened and what teams could learn from the data to improve their play. His pet finding over his decades of logging matches was that most goals resulted from possessions of three passes or fewer, which he took to mean that teams should simplify their tactics to get to goal faster, with less of the namby-pamby possession stuff. In an era when Hungary’s “pattern-weaving” passing style was the toast of the world game, Reep was writing articles with headlines like This Pattern-Weaving Talk is All Bunk!

The problem wasn’t Reep’s diligent data collection—it was his analysis. Most possessions are short, especially if you define them strictly enough to ensure that result; if there are a lot more short sequences, it’s no surprise that more goals result from them. A less polemical analyst might have inquired about the rate of goalscoring on possessions of various lengths, or how exactly those three-pass possessions developed. Reep’s own work showed that “60 per cent of all goalscoring moves begin 35 yards from an opponent’s goal,” which might make you wonder whether pattern-weaving was perhaps a good way to get close to goal in the first place. Not Reep, who worked with the great Wolves manager Stan Cullis to implement a style based on the “wholly English” principles of “direct passing.” Wolverhampton’s success in the fifties, including a dramatic upset of the Hungarian champions Honved, was seen as proof of concept and an affirmation of the long ball game. By his own account, Reep gathered data to “provide a counter to reliance upon memory, tradition and personal impressions that led to speculation and soccer ideologies.” But the result was more ideology, now with the false certainty of science.
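To make the gap between "share of goals" and "rate of goalscoring" concrete, here is a minimal sketch in Python. The possession counts are invented for illustration and are not Reep's figures; the three-pass cutoff and the `summarize` helper are likewise assumptions chosen to mirror the argument above.

```python
# Toy illustration: most goals can come from short possessions
# even when long possessions score at a higher rate.
# All numbers below are hypothetical, not Reep's data.

def summarize(possessions, max_short_passes=3):
    """Return share of goals and goals per possession for short vs long moves."""
    stats = {"short": {"n": 0, "goals": 0}, "long": {"n": 0, "goals": 0}}
    for passes, scored in possessions:
        bucket = "short" if passes <= max_short_passes else "long"
        stats[bucket]["n"] += 1
        stats[bucket]["goals"] += int(scored)
    total_goals = sum(b["goals"] for b in stats.values())
    return {
        name: {
            "share_of_goals": b["goals"] / total_goals if total_goals else 0.0,
            "goals_per_possession": b["goals"] / b["n"] if b["n"] else 0.0,
        }
        for name, b in stats.items()
    }

# Each tuple is (passes in the possession, whether it ended in a goal).
possessions = (
    [(2, False)] * 891 + [(2, True)] * 9      # 900 short possessions, 9 goals
    + [(6, False)] * 97 + [(6, True)] * 3     # 100 long possessions, 3 goals
)

result = summarize(possessions)
for name, values in result.items():
    print(name, values)
```

In this made-up sample, short possessions account for 75 percent of the goals but score on only 1 percent of attempts, while long possessions score on 3 percent. Counting goals alone would point you in exactly the wrong direction.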

Contrary to what you’ll sometimes read, Reep wasn’t the first pioneer of soccer analytics. His own interest had been piqued by an account of Herbert Chapman’s statistical approach to coaching Arsenal in the early thirties. Just this year, the internet turned up some stunningly modern-looking data vizzes drawn by hand for individual matches in 1920s Hungary, Reep’s bête noire. A Budapest grad student named Attila Bátorfy discovered that dozens of the charts had run on the front page of a daily newspaper dating back to 1922, and were popular enough to be imitated by sports dailies in Italy and Sweden. It’s not hard to imagine that as long as players have been passing and shooting, there’s been a weirdo standing on the sideline somewhere trying to jot it all down.

These days soccer data collection is big business. Companies like Opta and Statsbomb use human coders and computer vision to log information about every event that happens on the ball: passes, dribbles, shots, headers, fouls, tackles, interceptions, blocks, clearances, claims, and saves. Analysts can count these raw events, divvy them up or plot them out, derive second-order metrics like possession percentages, or use them to build sophisticated models. There’s a growing emphasis on linking data to video, which is undergoing a revolution of its own thanks to global providers like Wyscout. The juiciest stuff on the market is tracking data, which uses cameras or other tech to trace every player’s movements so that analysts can see off-ball patterns and soccer’s most prized commodity, space. You can join tracking data to event data for a more complete picture of what’s happening, but even that won’t tell you which way players are looking or how their bodies are arranged, so some researchers go even further and collect gaze and pose data. Because tracking data is expensive to collect and impossible to come by for, say, youth prospects in the Bolivian second division, analysts sometimes try to extract insights from it and apply them back to event data, which you can buy for just about any league with a TV contract or even a decent camera in the stands following the ball. But even when you’ve got information on what happened in a game, Reep’s story serves as a reminder that the challenge is what you do with it.
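For a sense of what counting raw events and deriving a second-order metric looks like, here is a small sketch. The event log is a hypothetical two-field toy (real provider feeds carry far more detail), and treating possession as each team's share of total passes is just one common approximation, assumed here for simplicity.

```python
from collections import Counter

# Hypothetical event log: (team, event_type). Invented for illustration.
events = [
    ("Home", "pass"), ("Home", "pass"), ("Home", "shot"),
    ("Away", "pass"), ("Away", "tackle"), ("Home", "pass"),
    ("Away", "pass"), ("Away", "clearance"), ("Home", "pass"),
    ("Home", "pass"),
]

def count_events(events):
    """Count how many times each (team, event_type) pair occurs."""
    return Counter(events)

def pass_share(events):
    """Approximate possession as each team's share of all passes (an assumption)."""
    passes = Counter(team for team, kind in events if kind == "pass")
    total = sum(passes.values())
    return {team: n / total for team, n in passes.items()} if total else {}

counts = count_events(events)
share = pass_share(events)
print(counts[("Home", "pass")], counts[("Away", "pass")])
print({team: round(value, 3) for team, value in share.items()})
```

Swap in a different definition of possession (time on the ball, say) and the number changes, which is part of why "objective-ish" is the right word.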

What is analytics good for?

I threw this list together last night after a couple beers, and maybe practitioners will have different ideas, but the way I see it the stuff you can measure with game data falls into five categories, from easiest to hardest:

  1. Contribution
  2. Style
  3. Skill
  4. Potential
  5. Big questions

Contribution is the most straightforward, since it’s about measuring who did what (although when you try to put a value on those contributions things get more complicated). Style separates what happened from how and why. Skill, at the player level, tries to give a more nuanced—and speculative—picture of contribution by accounting for circumstances like role, tactics, and team strength. Potential strays even further from what we know into what we wish we knew, trying to project how a team or player or even a particular pattern of play would fare under circumstances different from in the past, like how a young player might grow if he transferred to a team with an unfamiliar playstyle in a tougher league. And then there’s a catchall for research questions that sprawl across or outside the other categories. This might include evaluations of a game model or of specific tactics, like a story I once read about a team of analysts who spent weeks studying their team’s transition patterns so a coach could explain something to players by drawing a single line on a whiteboard.

You can do all of this without data—they’re really just kinds of questions you can ask about soccer, not inherently quantitative problems—and a lot of coaches and sporting directors would rather trust analysts’ eyes than whatever some computer spits out. But the supposed dichotomy between data and video, or data and scouting, or data and knowing anything at all about the game, is transparently phony. Good analytics are always informed by what the nerds call “domain knowledge”—the expertise of players and staff. If you do it right, you might even get some benefits flowing the other direction, too. The best analytics work makes us better at seeing and thinking about the game.

Leading data analysts from clubs including Liverpool and Barcelona describe the state of the field as part of this year's excellent Friends of Tracking series.

If you’ve heard of sports analytics, you’ve probably also heard of Moneyball, Michael Lewis’s book about how the Oakland A’s front office turned a low-budget baseball team into a powerhouse by using stats to recruit players with undervalued skillsets. The story hooked not only fans but also business travelers killing time in the airport bookshop during an era when an explosion of data had everyone scrambling to extract valuable analytical insights from it. Lewis’s protagonist, Billy Beane, became a cult hero and, I feel pretty confident saying without checking IMDB, the first stats nerd to be played by Brad Pitt.

Not long after Moneyball was published in 2003, Beane was already turning his attention to soccer, starting with Oakland’s neighboring San Jose Earthquakes. But applying analytics to soccer, as people with a penchant for the obvious never get tired of pointing out, is harder than in baseball. There’s more you can potentially do with data in a fluid sport than scout high-OBP catchers or plot an infield shift, but faced with the challenges of getting there, as well as the equally daunting job of convincing decisionmakers that what you’ve found is useful and not another Reepian mistake, not many organizations have been in a hurry to go all in.

Most clubs at least do some data scouting. The generally accepted best practice is you use stats to pull together a list of some good prospects who fit the profile you’re looking for, doing a much wider and faster first pass than a scouting network could; then you watch a bunch of video to get a more complete picture of the players and narrow things down; and finally you might send out a scout to get to know your favorite guys in their environments and maybe watch a few live games just for kicks. Done right, this can make the recruiting process more efficient and with any luck find you better players for cheaper than the old system of binoculars in the stands and agents hawking their guys on the phone. But while pretty much every club has bought into the idea that they should be doing something with data, you’ll still hear an alarming number of sporting directors brag about scouting a player they loved and then checking his numbers right before the signing, you know, just in case.
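The stats-first pass in that workflow might look something like the sketch below. Everything in it is hypothetical: the players, the `PlayerSeason` fields, the progressive-passes metric, and the thresholds stand in for whatever profile a club actually cares about. The video and live-scouting steps happen after this, on the names it returns.

```python
from dataclasses import dataclass

@dataclass
class PlayerSeason:
    name: str
    position: str
    age: int
    minutes: int
    progressive_passes: int  # hypothetical season total

def per90(total, minutes):
    """Scale a season total to a per-90-minutes rate."""
    return total * 90 / minutes

def shortlist(players, position, max_age, min_minutes, min_prog_p90, top_n=5):
    """Filter to the target profile, then rank by progressive passes per 90."""
    candidates = [
        p for p in players
        if p.position == position
        and p.age <= max_age
        and p.minutes >= min_minutes  # drop tiny samples before computing rates
        and per90(p.progressive_passes, p.minutes) >= min_prog_p90
    ]
    candidates.sort(key=lambda p: per90(p.progressive_passes, p.minutes), reverse=True)
    return candidates[:top_n]

# Invented players for illustration only.
players = [
    PlayerSeason("Player A", "LB", 21, 1800, 120),  # 6.0 per 90
    PlayerSeason("Player B", "LB", 23, 900, 80),    # 8.0 per 90
    PlayerSeason("Player C", "LB", 29, 2700, 240),  # too old for this profile
    PlayerSeason("Player D", "CB", 20, 2000, 200),  # wrong position
    PlayerSeason("Player E", "LB", 22, 400, 50),    # too few minutes
    PlayerSeason("Player F", "LB", 19, 1350, 60),   # 4.0 per 90, below threshold
]

picks = shortlist(players, "LB", max_age=24, min_minutes=900, min_prog_p90=5.0)
print([p.name for p in picks])
```

The minutes filter matters as much as the metric: Player E has the best rate in the sample, but on 400 minutes it would be reckless to trust it.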

While recruiting probably offers the biggest bang for your statistical buck (ask Liverpool, whose elite analytics operation helped them build a squad that became champions of everything on a comparatively reasonable budget), it’s not the only thing data is good for. A lot of clubs also use analytics as part of their opposition scouting. Again, not hard to see why. When Marcelo Bielsa got caught spying on Derby County and responded in the most Bielsa way possible by convening a press conference to lecture reporters on every detail of his weekly game prep, the biggest takeaway from his mountains of binders and endless slides was that his staff was sacrificing sleep to compile information that could have been compiled at the push of a button. Analyzing a soccer team is hard, and there are definitely parts that are better done with video, but a lot of clubs integrate their data and video operations to make each other’s lives easier. If nothing else, analytics can be the quickest way of pulling up the right clips for analysts to study.

And then there are what you might call the big questions. Billy Beane’s staff was focused on buying cheap wins, but they took their inspiration from Bill James, an idealistic fan prone to writing things like, “I do not start with the numbers any more than a mechanic starts with a monkey wrench. I start with the game, with the things that I see there and the things that people say there. And I ask: Is it true? Can you validate it? Can you measure it? How does it fit with the rest of the machinery?”

The best possible use of analytics isn’t just to measure the game but to understand it. That’s what smart clubs do when they use data to ask questions about how they play, or better yet how they should play. It’s what academic researchers and curious fans do with whatever data they can get their hands on. It’s what Charles Reep and Herbert Chapman and those Hungarian newspaper readers were after, even if it didn’t always work out. When space space space covers analytics, we’ll be asking the same questions James asked (is it true? can you measure it? how does it all fit together?), but about soccer’s more intricate, more beautiful, and altogether extraordinary machine. ❧

Image: Albrecht Dürer, Treatise on Measurement
