Soccer Analytics with R Using worldfootballR for Data-Driven Football Insights

A practical, reproducible walkthrough to pull open football data, build tidy datasets, and produce actionable xG-based insights and visuals — all in R with worldfootballR.

  1. Why this guide
  2. Setup: packages & data sources
  3. Download league shots (Understat via worldfootballR)
  4. Team-level insights: xG for/against & efficiency
  5. Shot maps in R (ggplot2/ggsoccer)
  6. Match-level xG and simple power ratings
  7. Player analytics: xG per 90 & shot quality
  8. Production tips & reproducibility
  9. FAQ

Why this guide

Football data is abundant, but turning it into clear, reproducible insights is the real edge. In this tutorial you will:

  • Fetch open shot-level data (with xG) for a full league season using worldfootballR.
  • Build tidy pipelines to analyze teams and players with dplyr.
  • Create publication-ready visuals (league tables, shot maps) with ggplot2.
  • Compute simple, interpretable metrics: xG for/against, xG difference, and per-90 rates.

You can adapt the same steps to other leagues and seasons with a couple of parameter changes.

Setup: packages & data sources

We’ll use the excellent worldfootballR package to access open football data sources (e.g. Understat, FBref). Understat provides shot-level events with expected goals (xG), perfect for quick, practical modeling.

# Install once (uncomment if needed)
# janitor and slider are used later in the guide, so install them too
# install.packages(c("worldfootballR", "dplyr", "tidyr", "ggplot2", "stringr", "lubridate", "janitor", "slider"))
# install.packages("ggsoccer")   # convenient soccer pitch layer

library(worldfootballR)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggsoccer)
library(stringr)
library(lubridate)

Note: Function names and data fields can change between package versions. If a call fails, browse the function index with help(package = "worldfootballR") or check the package README.
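
One way to act on that note is to list the functions a package actually exports and search them by name. The sketch below uses only base R; find_exports is a hypothetical helper name introduced here, not part of worldfootballR.

```r
# Hypothetical helper: list a package's exported objects whose names match a pattern
find_exports <- function(pkg, pattern) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("Package '", pkg, "' is not installed.")
  }
  sort(grep(pattern, getNamespaceExports(pkg), value = TRUE))
}

# Example with a base package so the snippet runs anywhere
find_exports("stats", "^median")

# For this guide (requires worldfootballR to be installed):
# find_exports("worldfootballR", "understat")
```

Running the commented line shows which Understat-related functions your installed version provides, which is a quick first check when a call from this guide is not found.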

Download league shots (Understat via worldfootballR)

Let’s pull a season’s worth of shot events (with xG) from Understat. We’ll use the English top flight as an example. Replace the league/season with your target competition.

# Parameters
# Understat league names as used by worldfootballR (note the spaces):
# "EPL", "La liga", "Bundesliga", "Serie A", "Ligue 1"
league_code <- "EPL"
season_start <- 2023           # 2023-24 season (start year)

# Fetch shot-level data (Understat).
# In recent worldfootballR releases this is understat_league_season_shots();
# names have changed across versions, so confirm against your installed package.
shots_raw <- understat_league_season_shots(
  league = league_code,
  season_start_year = season_start
)

# Inspect fields
dplyr::glimpse(shots_raw)

Typical columns include: match ID/date, home/away teams, shooter, minute, shot coordinates, and xG per shot. We’ll tidy names and keep the essentials.

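shots <- shots_raw |>
  janitor::clean_names() |>
  # Harmonize names to those used below. The raw names are assumptions
  # (home_team, away_team, and x_g after clean_names()); confirm with glimpse().
  rename(any_of(c(h_team = "home_team", a_team = "away_team", xg = "x_g"))) |>
  mutate(
    match_date = as.Date(date),
    across(c(xg, x, y, minute), as.numeric),
    is_goal = if_else(result == "Goal", 1, 0),
    # h_a flags the shooting side: "h" = home, "a" = away
    team = if_else(h_a == "h", h_team, a_team)
  ) |>
  select(match_id, match_date, h_team, a_team, team, player, minute, xg, is_goal, x, y)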
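
To see the home/away labelling in isolation, here is a toy sketch. The three rows are made-up values, and the h_a flag ("h" for a home shot, "a" for an away shot) is an assumption about the data layout that you should confirm against your own glimpse() output.

```r
library(dplyr)

# Made-up shots: two by the home side, one by the away side
toy_shots <- tibble(
  match_id = c(1, 1, 1),
  h_team   = "Arsenal",
  a_team   = "Chelsea",
  h_a      = c("h", "a", "h"),
  result   = c("Goal", "SavedShot", "MissedShots"),
  xg       = c(0.45, 0.08, 0.12)
)

toy_tidy <- toy_shots |>
  mutate(
    team     = if_else(h_a == "h", h_team, a_team),  # shooting team
    opponent = if_else(h_a == "h", a_team, h_team),  # defending team
    is_goal  = if_else(result == "Goal", 1, 0)
  )

toy_tidy |> select(team, opponent, xg, is_goal)
```

Deriving both team and opponent from the same flag keeps the for/against aggregations later in the guide consistent with each other.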

Team-level insights: xG for/against & efficiency

Aggregate by team to get season summaries: shots, goals, xG (“expected goals”), and simple conversion efficiency (goals − xG).

team_summary <- shots |>
  group_by(team) |>
  summarise(
    matches = n_distinct(match_id),
    shots_for = n(),
    goals_for = sum(is_goal, na.rm = TRUE),
    xg_for = sum(xg, na.rm = TRUE),
    avg_shot_xg = mean(xg, na.rm = TRUE),
    .groups = "drop"
  )

# Compute xG against from the opposing shots
team_against <- shots |>
  mutate(opponent = if_else(team == h_team, a_team, h_team)) |>
  group_by(opponent) |>
  summarise(
    xg_against = sum(xg, na.rm = TRUE),
    goals_against = sum(is_goal, na.rm = TRUE),
    .groups = "drop"
  ) |>
  rename(team = opponent)

league_table <- team_summary |>
  left_join(team_against, by = "team") |>
  mutate(
    xg_diff = xg_for - xg_against,
    goals_diff = goals_for - goals_against,
    xg_efficiency = goals_for - xg_for  # >0 = finishing above expectation
  ) |>
  arrange(desc(xg_diff))

league_table |> head(10)

Tip: Sort by xg_diff for a shot-quality view of team strength independent of finishing variance.
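
As a small worked check of the for/against logic, the sketch below runs the same aggregation on a single made-up match between two hypothetical teams, A and B. It also shows a useful sanity check: across a closed set of matches, xg_diff should sum to approximately zero.

```r
library(dplyr)

# One made-up match: A (home) takes two shots, B (away) takes one
toy <- tibble(
  match_id = 1,
  h_team = "A",
  a_team = "B",
  team = c("A", "A", "B"),
  xg = c(0.5, 0.25, 0.25)
)

toy_for <- toy |>
  group_by(team) |>
  summarise(xg_for = sum(xg), .groups = "drop")

# Each shot counts against the side that did not take it
toy_against <- toy |>
  mutate(opponent = if_else(team == h_team, a_team, h_team)) |>
  group_by(team = opponent) |>
  summarise(xg_against = sum(xg), .groups = "drop")

toy_table <- toy_for |>
  left_join(toy_against, by = "team") |>
  mutate(xg_diff = xg_for - xg_against)

toy_table

# Sanity check: differences cancel out across the league
sum(toy_table$xg_diff)
```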

Shot maps in R (ggplot2/ggsoccer)

Visuals make insights stick. Let’s draw a basic shot map for a single team: goals in solid shapes, misses as hollow points, with size proportional to xG.

focus_team <- league_table$team[1]        # top xG_diff team as an example
team_shots <- shots |> filter(team == focus_team)

ggplot() +
  annotate_pitch(colour = "grey70", fill = "white") +
  theme_pitch() +
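  geom_point(
    data = team_shots,
    # Understat coordinates are typically fractions of the pitch (0 to 1), while
    # ggsoccer's default pitch runs 0 to 100, so rescale before plotting.
    # Check range(team_shots$x) first if you are unsure of the units.
    aes(x = 100 * x, y = 100 * y, size = xg, shape = factor(is_goal)),
    alpha = 0.7
  ) +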
  scale_shape_manual(values = c(`0` = 1, `1` = 16), name = "Goal") +
  scale_size_continuous(name = "xG") +
  labs(
    title = paste("Shot Map —", focus_team),
    subtitle = paste0(league_code, " ", season_start, "/", season_start + 1),
    caption = "Data: Understat via worldfootballR • Plot: ggsoccer/ggplot2"
  )

You can facet by outcome, half, or shot type, or flip to defensive maps by plotting the opponent’s shots.
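
Faceting by half, for example, might look like the sketch below. It uses six made-up shots already expressed on a 0 to 100 pitch, and splits halves at minute 45, which is a rough assumption because first-half stoppage time is not distinguished.

```r
library(dplyr)
library(ggplot2)
library(ggsoccer)

# Made-up shots on a 0-100 pitch (ggsoccer's default dimensions)
toy_map <- tibble(
  x = c(88, 92, 79, 95, 84, 90),
  y = c(45, 55, 38, 50, 62, 48),
  xg = c(0.10, 0.35, 0.04, 0.60, 0.07, 0.22),
  is_goal = c(0, 1, 0, 1, 0, 0),
  minute = c(12, 30, 44, 51, 67, 88)
) |>
  mutate(half = if_else(minute <= 45, "First half", "Second half"))

p_half <- ggplot(toy_map) +
  annotate_pitch(colour = "grey70", fill = "white") +
  theme_pitch() +
  geom_point(aes(x = x, y = y, size = xg, shape = factor(is_goal)), alpha = 0.7) +
  scale_shape_manual(values = c(`0` = 1, `1` = 16), name = "Goal") +
  coord_cartesian(xlim = c(50, 100)) +  # zoom in on the attacking half
  facet_wrap(~ half)

p_half
```

Swapping the facet variable for an outcome or shot-type column follows the same pattern.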

Match-level xG and simple power ratings

Roll up to matches to compare performance per game. A smoothed xG difference offers a fast, interpretable power rating.

match_xg <- shots |>
  # Keep match_date so matches can be ordered chronologically later
  group_by(match_id, match_date, h_team, a_team) |>
  summarise(
    h_xg = sum(if_else(team == h_team, xg, 0), na.rm = TRUE),
    a_xg = sum(if_else(team == a_team, xg, 0), na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(xg_diff = h_xg - a_xg)

# Team-level rolling xG differential (simple power rating proxy)
library(slider)

# One row per team per match: stack the home view and the away view
team_match_xg <- bind_rows(
  match_xg |>
    transmute(match_id, match_date, team = h_team, team_xg = h_xg, opp_xg = a_xg),
  match_xg |>
    transmute(match_id, match_date, team = a_team, team_xg = a_xg, opp_xg = h_xg)
) |>
  # Order by date rather than match_id, which is not guaranteed to be chronological
  arrange(team, match_date) |>
  group_by(team) |>
  mutate(
    xg_diff = team_xg - opp_xg,
    # Mean of the current and two previous matches; NA until three are available
    xg_diff_roll3 = slide_dbl(xg_diff, mean, .before = 2, .complete = TRUE)
  ) |>
  ungroup()

Plot xg_diff_roll3 over time to spot momentum or tactical shifts.
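
A minimal sketch of such a plot follows. It uses made-up per-match xG differences for one hypothetical team ("Example FC") so that it runs on its own; the column names mirror those used in this section.

```r
library(dplyr)
library(ggplot2)
library(slider)

# Made-up weekly xG differences for one hypothetical team
toy_form <- tibble(
  team = "Example FC",
  match_date = as.Date("2023-08-12") + 7 * 0:5,
  xg_diff = c(0.6, -0.3, 0.9, 1.2, -0.6, 0.3)
) |>
  arrange(team, match_date) |>
  group_by(team) |>
  mutate(xg_diff_roll3 = slide_dbl(xg_diff, mean, .before = 2, .complete = TRUE)) |>
  ungroup()

# The first two matches have no complete 3-match window, so they are NA
p_form <- ggplot(toy_form, aes(x = match_date, y = xg_diff_roll3)) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  geom_line(na.rm = TRUE) +
  geom_point(na.rm = TRUE) +
  labs(x = NULL, y = "Rolling 3-match xG difference", title = "Example FC form")

p_form
```

With real data, adding facet_wrap(~ team) or colouring by team lets you compare several sides on the same scale.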

Player analytics: xG per 90 & shot quality

With shot-level data we can compute per-90 metrics and rank players by volume and efficiency.

# Assume minutes by player are available from a squad/appearances endpoint; if not,
# you can approximate using match sheets and sub events.

player_summary <- shots |>
  group_by(player, team) |>
  summarise(
    shots = n(),
    goals = sum(is_goal, na.rm = TRUE),
    xg = sum(xg, na.rm = TRUE),
    avg_shot_xg = mean(xg, na.rm = TRUE),
    matches = n_distinct(match_id),
    .groups = "drop"
  )

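# If you have minutes, compute per-90. Here we mock minutes = matches * 70 (placeholder).
# Caution: `matches` only counts games in which the player took a shot, so this
# understates real minutes and inflates per-90 rates. Replace it with real minutes.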
player_summary <- player_summary |>
  mutate(
    minutes = matches * 70,
    shots_p90 = 90 * shots / pmax(minutes, 1),
    xg_p90 = 90 * xg / pmax(minutes, 1),
    goals_minus_xg = goals - xg
  ) |>
  arrange(desc(xg_p90))

player_summary |> head(15)

Tip: Split by position and minimum minutes to get fairer leaderboards.
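
The minimum-minutes part of that tip can be sketched as follows. The players, totals, and the 900-minute cut-off are all made-up values for illustration; position is left out because the shot data used here does not carry it.

```r
library(dplyr)

# Made-up player totals; in practice minutes should come from real appearance data
toy_players <- tibble(
  player  = c("Striker A", "Winger B", "Sub C"),
  minutes = c(2700, 1800, 90),
  shots   = c(90, 40, 3),
  xg      = c(15.0, 6.0, 0.9)
)

min_minutes <- 900  # hypothetical cut-off, roughly ten full matches

leaderboard <- toy_players |>
  filter(minutes >= min_minutes) |>
  mutate(
    shots_p90 = 90 * shots / minutes,
    xg_p90 = 90 * xg / minutes
  ) |>
  arrange(desc(xg_p90))

leaderboard
```

Without the filter, "Sub C" would top the table on the strength of a single 90-minute sample, which is the distortion the threshold is meant to remove.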

Production tips & reproducibility

  • Parameterize league and season at the top; knit as a report per competition.
  • Cache raw pulls to disk (RDS/CSV) to avoid re-hitting sources on every run.
  • Document data provenance and respect the source Terms of Use.
  • Automate weekly refresh with targets or a simple cron/Rscript job.
  • Validate with sanity checks (e.g., xG totals vs. trusted dashboards).
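
The caching tip above can be sketched with base R alone. cached_pull is a hypothetical helper name, and the demo uses a stand-in fetch function so that it runs offline.

```r
# Hypothetical helper: read a cached RDS file if present, otherwise fetch and save
cached_pull <- function(path, fetch_fn, refresh = FALSE) {
  if (file.exists(path) && !refresh) {
    return(readRDS(path))
  }
  data <- fetch_fn()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Demo with a stand-in fetcher that counts how often it is called
demo_path <- file.path(tempdir(), "cache", "demo_shots.rds")
unlink(demo_path)  # start from a clean slate
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(match_id = 1:3, xg = c(0.1, 0.4, 0.05))
}

first  <- cached_pull(demo_path, fake_fetch)  # fetches and writes the file
second <- cached_pull(demo_path, fake_fetch)  # reads from disk, no new fetch
calls
```

In the guide's pipeline you would pass a function that wraps your shot-download call, and set refresh = TRUE when you want to update the file deliberately.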

FAQ

Which leagues can I fetch?

Understat covers major European leagues. With worldfootballR you can also access FBref, Transfermarkt and others. Check the package reference for the latest coverage and endpoints.

Do I need to build my own xG model?

No. Understat supplies shot-level xG you can use immediately. If you prefer to learn modeling, you can fit your own logistic model on shot features as an extension exercise.

How do I adapt this to La Liga or Serie A?

Change league_code and the season_start year. Everything downstream stays the same.
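
On the earlier question about building your own xG model, the sketch below shows the general shape of a logistic regression fit with base R's glm(). The data are simulated, and the features (distance, header) and coefficients are invented for illustration; they are not Understat's model or features.

```r
set.seed(42)

# Simulated shots: distance to goal (metres) and a header flag (invented features)
n <- 2000
sim <- data.frame(
  distance = runif(n, 5, 35),
  header   = rbinom(n, 1, 0.15)
)

# Invented "true" relationship used only to generate outcomes
true_logit <- 0.5 - 0.12 * sim$distance - 0.6 * sim$header
sim$goal <- rbinom(n, 1, plogis(true_logit))

# Logistic regression: P(goal) as a function of shot features
fit <- glm(goal ~ distance + header, data = sim, family = binomial())

# Predicted xG for two hypothetical shots taken with the foot
new_shots <- data.frame(distance = c(8, 25), header = c(0, 0))
new_shots$xg_hat <- predict(fit, newdata = new_shots, type = "response")
new_shots
```

With real shots you would replace the simulated columns with features derived from the shot coordinates and context, and evaluate the fit on held-out matches.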


