A practical, reproducible walkthrough for pulling open football data, building tidy datasets, and producing actionable xG-based insights and visuals, all in R with worldfootballR.
- Why this guide
- Setup: packages & data sources
- Download league shots (Understat via worldfootballR)
- Team-level insights: xG for/against & efficiency
- Shot maps in R (ggplot2/ggsoccer)
- Match-level xG and simple power ratings
- Player analytics: xG per 90 & shot quality
- Production tips & reproducibility
- FAQ
Why this guide
Football data is abundant, but turning it into clear, reproducible insights is the real edge. In this tutorial you will:
- Fetch open shot-level data (with xG) for a full league season using worldfootballR.
- Build tidy pipelines to analyze teams and players with dplyr.
- Create publication-ready visuals (league tables, shot maps) with ggplot2.
- Compute simple, interpretable metrics: xG for/against, xG difference, and per-90 rates.
You can adapt the same steps to other leagues and seasons with a couple of parameter changes.
Setup: packages & data sources
We’ll use the excellent worldfootballR package to access open football data sources (e.g. Understat, FBref). Understat provides shot-level events with expected goals (xG), which suits quick, practical modeling.
# Install once (uncomment if needed)
# install.packages(c("worldfootballR", "dplyr", "tidyr", "ggplot2", "stringr", "lubridate", "janitor", "slider"))
# install.packages("ggsoccer") # convenient soccer pitch layer
library(worldfootballR)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggsoccer)
library(stringr)
library(lubridate)
Note: Function names and data fields can evolve over time. If a call changes, check the package reference with help(package = "worldfootballR") or the package README.
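One quick way to see which Understat helpers your installed version exposes is to list the package's exported names. This is a small sketch using base R only; the "understat_" prefix is an assumption based on the package's current naming, so the result may be empty on other versions.

```r
library(worldfootballR)

# Version of the package installed on this machine
packageVersion("worldfootballR")

# Exported names that start with "understat_" (the prefix is an assumption)
understat_fns <- grep("^understat_", ls("package:worldfootballR"), value = TRUE)
print(understat_fns)
```

If the function used in the next section is missing from this list, pick the closest league-level shots function it shows and check its help page for the argument names.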
Download league shots (Understat via worldfootballR)
Let’s pull a season’s worth of shot events (with xG) from Understat. We’ll use the English top flight as an example. Replace the league/season with your target competition.
# Parameters
league_code <- "EPL" # e.g., "EPL", "La liga", "Bundesliga", "Serie A", "Ligue 1" (check the help page for accepted spellings)
season_start <- 2023 # 2023-24 season (start year)
# Fetch shot-level data (Understat). The function name may vary slightly by version;
# recent releases export understat_league_season_shots().
shots_raw <- understat_league_season_shots(
  league = league_code,
  season_start_year = season_start
)
# Inspect fields
dplyr::glimpse(shots_raw)
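Typical columns include match ID and date, home and away teams, a home/away flag for the shooting side, shooter, minute, shot result, shot coordinates, and xG per shot. Exact names depend on the package version, so compare them against the glimpse() output. We’ll tidy names and keep the essentials.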
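shots <- shots_raw |>
  janitor::clean_names() |>
  # Harmonize names across versions: only columns that exist are renamed
  rename(any_of(c(h_team = "home_team", a_team = "away_team", h_a = "home_away", xg = "x_g"))) |>
  mutate(
    match_date = as.Date(date),
    is_goal = if_else(result == "Goal", 1, 0),
    # Understat flags the shooting side as "h" (home) or "a" (away)
    team = if_else(h_a == "h", h_team, a_team)
  ) |>
  select(match_id, match_date, h_team, a_team, team, player, minute, xg, is_goal, x, y)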
Team-level insights: xG for/against & efficiency
Aggregate by team to get season summaries: shots, goals, xG (“expected goals”), and simple conversion efficiency (goals − xG).
team_summary <- shots |>
  group_by(team) |>
  summarise(
    matches = n_distinct(match_id),
    shots_for = n(),
    goals_for = sum(is_goal, na.rm = TRUE),
    xg_for = sum(xg, na.rm = TRUE),
    avg_shot_xg = mean(xg, na.rm = TRUE),
    .groups = "drop"
  )
# Compute xG against from the opposing shots
team_against <- shots |>
  mutate(opponent = if_else(team == h_team, a_team, h_team)) |>
  group_by(opponent) |>
  summarise(
    xg_against = sum(xg, na.rm = TRUE),
    goals_against = sum(is_goal, na.rm = TRUE),
    .groups = "drop"
  ) |>
  rename(team = opponent)
league_table <- team_summary |>
  left_join(team_against, by = "team") |>
  mutate(
    xg_diff = xg_for - xg_against,
    goals_diff = goals_for - goals_against,
    xg_efficiency = goals_for - xg_for # >0 = finishing above expectation
  ) |>
  arrange(desc(xg_diff))
league_table |> head(10)
Tip: Sort by xg_diff for a view of team strength based on chance volume and quality, which is less sensitive to finishing variance than goal difference.
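To turn the sorted table into a chart, a horizontal bar plot of xG difference is one simple option. The sketch below uses a small synthetic table (team names and numbers are made up for illustration) so it runs on its own; the object names toy_table and xg_diff_plot are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Synthetic stand-in for `league_table` (illustrative numbers only)
toy_table <- tibble(
  team = c("Team A", "Team B", "Team C", "Team D"),
  xg_for = c(72.4, 60.1, 48.9, 39.5),
  xg_against = c(35.2, 44.8, 50.3, 66.0)
) |>
  mutate(xg_diff = xg_for - xg_against)

xg_diff_plot <- toy_table |>
  mutate(team = reorder(team, xg_diff)) |> # order bars by xG difference
  ggplot(aes(x = xg_diff, y = team, fill = xg_diff > 0)) +
  geom_col(show.legend = FALSE) +
  geom_vline(xintercept = 0, colour = "grey40") +
  labs(
    title = "Season xG difference by team",
    x = "xG for minus xG against",
    y = NULL
  ) +
  theme_minimal()

xg_diff_plot
```

Swapping toy_table for league_table should give the same chart for a real season, provided the team and xg_diff columns are present.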
Shot maps in R (ggplot2/ggsoccer)
Visuals make insights stick. Let’s draw a basic shot map for a single team: goals in solid shapes, misses as hollow points, with size proportional to xG.
focus_team <- league_table$team[1] # top xG_diff team as an example
# Understat coordinates run 0-1, while ggsoccer's default pitch runs 0-100, so rescale.
# Skip the rescale if glimpse() shows your x/y are already on a 0-100 scale.
team_shots <- shots |>
  filter(team == focus_team) |>
  mutate(x = 100 * x, y = 100 * y)
ggplot() +
  annotate_pitch(colour = "grey70", fill = "white") +
  theme_pitch() +
  geom_point(
    data = team_shots,
    aes(x = x, y = y, size = xg, shape = factor(is_goal)),
    alpha = 0.7
  ) +
  scale_shape_manual(values = c(`0` = 1, `1` = 16), name = "Goal") +
  scale_size_continuous(name = "xG") +
  labs(
    title = paste("Shot Map:", focus_team),
    subtitle = paste0(league_code, " ", season_start, "/", season_start + 1),
    caption = "Data: Understat via worldfootballR • Plot: ggsoccer/ggplot2"
  )
You can facet by outcome, half, or shot type, or flip to defensive maps by plotting the opponent’s shots.
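As a sketch of the defensive variant, the example below keeps the focus team's matches, drops its own shots, and facets what remains by half. The data are synthetic and already on a 0-100 pitch; the names toy_shots, conceded, and defensive_map are hypothetical, and splitting halves at minute 45 is a rough assumption that ignores how stoppage time is recorded.

```r
library(dplyr)
library(ggplot2)
library(ggsoccer)

# Synthetic shots on a 0-100 pitch (illustrative values only)
toy_shots <- tibble(
  h_team = "Team A",
  a_team = "Team B",
  team = c("Team A", "Team B", "Team B", "Team A", "Team B"),
  minute = c(12, 30, 51, 67, 88),
  x = c(88, 91, 79, 94, 85),
  y = c(45, 52, 38, 50, 61),
  xg = c(0.08, 0.35, 0.04, 0.52, 0.11),
  is_goal = c(0, 1, 0, 1, 0)
)

focus <- "Team A"

# Shots conceded by the focus team: keep its matches, drop its own shots
conceded <- toy_shots |>
  filter(h_team == focus | a_team == focus, team != focus) |>
  mutate(half = if_else(minute <= 45, "First half", "Second half"))

defensive_map <- ggplot(conceded) +
  annotate_pitch(colour = "grey70", fill = "white") +
  theme_pitch() +
  geom_point(aes(x = x, y = y, size = xg, shape = factor(is_goal)), alpha = 0.7) +
  scale_shape_manual(values = c(`0` = 1, `1` = 16), name = "Goal") +
  scale_size_continuous(name = "xG") +
  facet_wrap(~half) +
  labs(title = paste("Shots conceded:", focus))

defensive_map
```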
Match-level xG and simple power ratings
Roll up to matches to compare performance per game. A smoothed xG difference offers a fast, interpretable power rating.
match_xg <- shots |>
  group_by(match_id, match_date, h_team, a_team) |>
  summarise(
    h_xg = sum(if_else(team == h_team, xg, 0), na.rm = TRUE),
    a_xg = sum(if_else(team == a_team, xg, 0), na.rm = TRUE),
    .groups = "drop"
  ) |>
  mutate(xg_diff = h_xg - a_xg)
# Team-level rolling xG differential (simple power rating proxy)
library(slider)
# One row per team per match: stack the home view and the away view
team_match_xg <- bind_rows(
  match_xg |> transmute(match_id, match_date, team = h_team, team_xg = h_xg, opp_xg = a_xg),
  match_xg |> transmute(match_id, match_date, team = a_team, team_xg = a_xg, opp_xg = h_xg)
) |>
  # Order by date so the rolling window follows the fixture sequence
  arrange(team, match_date, match_id) |>
  group_by(team) |>
  mutate(
    xg_diff = team_xg - opp_xg,
    # Mean of the current and two previous matches; NA until three matches exist
    xg_diff_roll3 = slide_dbl(xg_diff, mean, .before = 2, .complete = TRUE)
  ) |>
  ungroup()
Plot xg_diff_roll3 over time to spot momentum or tactical shifts.
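A line chart is one way to draw that rolling series. The sketch below builds a small synthetic table (two made-up teams, six weekly matches each) and applies the same three-match window; toy_form and form_plot are hypothetical names.

```r
library(dplyr)
library(ggplot2)
library(slider)

# Synthetic per-match xG differences for two teams (illustrative only)
toy_form <- tibble(
  team = rep(c("Team A", "Team B"), each = 6),
  match_date = rep(as.Date("2023-08-12") + 7 * 0:5, times = 2),
  xg_diff = c(0.4, 1.1, -0.2, 0.9, 1.5, 0.3, -0.6, 0.2, -1.0, 0.1, -0.4, 0.8)
) |>
  group_by(team) |>
  arrange(match_date, .by_group = TRUE) |>
  mutate(xg_diff_roll3 = slide_dbl(xg_diff, mean, .before = 2, .complete = TRUE)) |>
  ungroup()

form_plot <- toy_form |>
  filter(!is.na(xg_diff_roll3)) |> # the first two matches have no full window
  ggplot(aes(x = match_date, y = xg_diff_roll3, colour = team)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_line() +
  geom_point() +
  labs(
    title = "Rolling 3-match xG difference",
    x = NULL,
    y = "xG difference (3-match mean)",
    colour = "Team"
  ) +
  theme_minimal()

form_plot
```

A three-match window reacts quickly but is noisy; widening .before trades responsiveness for stability.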
Player analytics: xG per 90 & shot quality
With shot-level data we can compute per-90 metrics and rank players by volume and efficiency.
# Assume minutes by player are available from a squad/appearances endpoint; if not,
# you can approximate using match sheets and sub events.
player_summary <- shots |>
  group_by(player, team) |>
  summarise(
    shots = n(),
    goals = sum(is_goal, na.rm = TRUE),
    # Compute the per-shot mean before `xg` is overwritten by its sum below
    avg_shot_xg = mean(xg, na.rm = TRUE),
    xg = sum(xg, na.rm = TRUE),
    # Counts only matches in which the player took at least one shot
    matches = n_distinct(match_id),
    .groups = "drop"
  )
# If you have minutes, compute per-90. Here we mock minutes = matches * 70 (placeholder);
# replace it with real minutes before drawing conclusions.
player_summary <- player_summary |>
  mutate(
    minutes = matches * 70,
    shots_p90 = 90 * shots / pmax(minutes, 1),
    xg_p90 = 90 * xg / pmax(minutes, 1),
    goals_minus_xg = goals - xg
  ) |>
  arrange(desc(xg_p90))
player_summary |> head(15)
Tip: Split by position and minimum minutes to get fairer leaderboards.
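A minimal sketch of that filter follows, using a synthetic player table. The position and minutes columns are hypothetical here (the shot data above does not supply them), and the 900-minute threshold is only an example value.

```r
library(dplyr)

# Synthetic player table; `position` and `minutes` are hypothetical columns
toy_players <- tibble(
  player = c("Player 1", "Player 2", "Player 3", "Player 4", "Player 5"),
  position = c("FW", "FW", "MF", "MF", "FW"),
  minutes = c(2700, 450, 1980, 2340, 1200),
  xg = c(18.0, 4.0, 5.5, 3.9, 6.0)
)

min_minutes <- 900 # example threshold, roughly ten full matches

leaderboard <- toy_players |>
  filter(minutes >= min_minutes) |> # drop small samples
  mutate(xg_p90 = 90 * xg / minutes) |>
  group_by(position) |>
  arrange(desc(xg_p90), .by_group = TRUE) |>
  mutate(rank_in_position = row_number()) |>
  ungroup()

leaderboard
```

Without the threshold, Player 2 would top the forwards on 450 minutes, which is the kind of small-sample artifact the filter is meant to remove.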
Production tips & reproducibility
- Parameterize league and season at the top; knit as a report per competition.
- Cache raw pulls to disk (RDS/CSV) to avoid re-hitting sources on every run.
- Document data provenance and respect the source Terms of Use.
- Automate weekly refresh with targets or a simple cron/Rscript job.
- Validate with sanity checks (e.g., xG totals vs. trusted dashboards).
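The caching and sanity-check tips can be sketched with base R alone. Both helpers below, cached_pull() and check_shots(), are hypothetical names written for this guide, not part of worldfootballR; the demo uses a stand-in fetch function so that it runs offline.

```r
# Hypothetical helper: run `fetch_fn()` once, then reuse the saved RDS file
cached_pull <- function(path, fetch_fn, refresh = FALSE) {
  if (file.exists(path) && !refresh) {
    return(readRDS(path))
  }
  data <- fetch_fn()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Hypothetical helper: basic checks on a shots table (column names follow this guide)
check_shots <- function(shots) {
  stopifnot(
    nrow(shots) > 0,
    all(c("match_id", "team", "xg", "is_goal") %in% names(shots)),
    all(shots$xg >= 0 & shots$xg <= 1, na.rm = TRUE),
    all(shots$is_goal %in% c(0, 1))
  )
  invisible(shots)
}

# Demo with a stand-in fetch function instead of a real download
cache_file <- file.path(tempdir(), "shots_demo.rds")
fake_fetch <- function() {
  data.frame(match_id = 1, team = "Team A", xg = 0.12, is_goal = 0)
}
shots_demo <- check_shots(cached_pull(cache_file, fake_fetch))
```

In a real run, fake_fetch would be replaced by a function that wraps the Understat call, and refresh = TRUE would force a new pull after each matchweek.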
FAQ
Which leagues can I fetch?
Understat covers major European leagues. With worldfootballR you can also access FBref, Transfermarkt and others. Check the package reference for the latest coverage and endpoints.
Do I need to build my own xG model?
No. Understat supplies shot-level xG you can use immediately. If you prefer to learn modeling, you can fit your own logistic model on shot features as an extension exercise.
How do I adapt this to La Liga or Serie A?
Change league_code and the season_start year. Everything downstream stays the same.
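For the modeling extension mentioned in the FAQ, the sketch below fits a logistic regression with base R's glm() on simulated shots. The features (distance and angle), the coefficients used to simulate goals, and all object names are assumptions made for illustration; this is not how Understat computes its xG.

```r
set.seed(42)

# Simulated shots: distance (metres) and angle (radians) are hypothetical features
n <- 2000
sim_shots <- data.frame(
  distance = runif(n, 5, 35),
  angle = runif(n, 0.1, 1.4)
)

# Goals are drawn from a made-up relationship, for demonstration only
true_p <- plogis(-0.5 - 0.12 * sim_shots$distance + 1.2 * sim_shots$angle)
sim_shots$is_goal <- rbinom(n, size = 1, prob = true_p)

# Logistic regression: log-odds of scoring as a linear function of the features
xg_model <- glm(is_goal ~ distance + angle, data = sim_shots, family = binomial())

# Predicted scoring probability per shot, i.e. this model's xG
sim_shots$xg_hat <- predict(xg_model, type = "response")

# Compare total goals with total predicted xG
c(goals = sum(sim_shots$is_goal), xg = sum(sim_shots$xg_hat))
```

In-sample, the two totals agree by construction for a logistic model with an intercept, so a meaningful calibration check needs held-out matches.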
Enjoyed this tutorial? Explore more R sports analytics books and practical guides at RProgrammingBooks.com.