
How to Transform Football Data with R: ggsoccer & worldfootballR Guide

A practical, code-first walkthrough for fetching rich football data, building models, and producing clear visuals entirely in R.

What You’ll Learn

  • Pull player, team, and match data from FBref and Transfermarkt using worldfootballR.
  • Create football visuals (shot maps, pass maps, heatmaps) with ggsoccer and ggplot2.
  • Build a baseline expected goals (xG) model with logistic regression.
  • Assemble player ratings with tidy feature engineering and scaling.
  • Prototype match outcome prediction with tidymodels.
  • Ship reproducible projects via renv, targets, and Quarto.

The snippets below are copy-ready R. Blocks that are commented out need your own data frame first; uncomment them once it exists. Adjust team/season names to your league of interest.

Setup

# install.packages(c("worldfootballR", "ggsoccer", "tidyverse", "tidymodels", "janitor"))
library(worldfootballR)
library(ggplot2)
library(ggsoccer)
library(tidyverse)
library(tidymodels)

Get Data with worldfootballR

worldfootballR provides friendly functions to pull leagues, teams, players, and match-level tables from FBref and Transfermarkt.

League & Team URLs

# Example: Premier League 2023–24 (adjust season_end_year and tier as needed)
league_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2024, tier = "1st")
team_urls   <- fb_teams_urls(league_urls[1])  # reuse the season URL so the teams match 2023–24
head(league_urls); head(team_urls)

Team & Player Season Stats

# Team season stats (shooting)
team_shoot <- fb_season_team_stats(country = "ENG", gender = "M",
                                   season_end_year = 2024, tier = "1st",
                                   stat_type = "shooting")

# Player season stats (passing)
player_pass <- fb_league_stats(country = "ENG", gender = "M",
                               season_end_year = 2024, tier = "1st",
                               stat_type = "passing", team_or_player = "player")

glimpse(team_shoot); glimpse(player_pass)

Match Results & Events

# Full season match results
epl_results <- fb_match_results(country = "ENG", gender = "M",
                                season_end_year = 2024, tier = "1st")

# Single-match events (shooting / chance creation)
match_url   <- "https://fbref.com/en/matches/CHANGE-THIS-TO-A-REAL-ID"
# shots_one <- fb_match_shooting(match_url = match_url)
# summary_one <- fb_match_summary(match_url = match_url)
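Each of the calls above goes out over the network, so repeated runs can be slow and may hit rate limits. One common workaround is to cache every result on disk. The helper below is a minimal sketch of that idea; `cache_rds()` and the example file path are hypothetical names, not part of worldfootballR.

```r
# Hypothetical helper: evaluate `expr` once, then reuse the saved copy.
cache_rds <- function(path, expr) {
  if (file.exists(path)) {
    return(readRDS(path))            # cache hit: skip the expensive call
  }
  value <- expr                      # lazy evaluation: the call only runs here
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(value, path)
  value
}

# Usage (uncomment once the packages from Setup are loaded):
# epl_results <- cache_rds(
#   "data/epl_results_2024.rds",
#   fb_match_results(country = "ENG", gender = "M",
#                    season_end_year = 2024, tier = "1st")
# )
```

Delete the `.rds` file whenever you want a fresh pull.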

Transfer & Valuation (Transfermarkt)

# Example: player dictionary (FBref <--> Transfermarkt) for easier joins
map_tbl <- player_dictionary_mapping()
head(map_tbl)
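To show how a mapping table bridges the two sources, here is a toy join. Everything in it is made up for illustration: the URLs, the values, and the stand-in tables. The column names `PlayerFBref`, `UrlFBref`, and `UrlTmarkt` are an assumption about what `player_dictionary_mapping()` returns, so check `names(map_tbl)` before reusing the pattern.

```r
library(dplyr)

# Toy stand-in for the mapping table (column names are assumed).
map_toy <- data.frame(
  PlayerFBref = c("Player A", "Player B"),
  UrlFBref    = c("https://fbref.com/en/players/aaa", "https://fbref.com/en/players/bbb"),
  UrlTmarkt   = c("https://www.transfermarkt.com/a/1", "https://www.transfermarkt.com/b/2")
)

# Toy performance table keyed by FBref URL.
fbref_toy <- data.frame(
  url_fbref = c("https://fbref.com/en/players/aaa", "https://fbref.com/en/players/bbb"),
  goals     = c(10, 4)
)

# Toy valuation table keyed by Transfermarkt URL.
tm_toy <- data.frame(
  url_tmarkt = c("https://www.transfermarkt.com/b/2", "https://www.transfermarkt.com/a/1"),
  value_eur  = c(1.2e7, 5e7)
)

# FBref stats -> mapping -> Transfermarkt values
joined <- fbref_toy %>%
  left_join(map_toy, by = c("url_fbref" = "UrlFBref")) %>%
  left_join(tm_toy,  by = c("UrlTmarkt" = "url_tmarkt"))

joined
```

Joining on URLs rather than player names avoids mismatches from accents and spelling variants.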

Visuals with ggsoccer

ggsoccer adds a soccer pitch as a ggplot2 layer so you can plot events cleanly.

Basic Pitch

ggplot() +
  annotate_pitch() +
  theme_pitch() +
  ggtitle("Standard Pitch")

Shot Map (toy example)

# Fake shots; replace with real shot coordinates (often 0–100 scale)
shots <- tibble(
  x = c(88,91,78,70,60,92,84,80),
  y = c(52,54,40,63,58,47,36,60),
  outcome = c("Goal","Goal","Miss","Miss","Miss","Goal","Miss","Miss")
)

ggplot(shots, aes(x = x, y = y)) +
  annotate_pitch() +
  geom_point(aes(shape = outcome), size = 3, alpha = .9) +
  theme_pitch() +
  coord_flip(xlim = c(49, 101)) +  # show attacking half
  scale_y_reverse() +
  ggtitle("Shot Map with ggsoccer")
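The same layering idea extends to pass maps and heatmaps. The sketch below uses invented coordinates on the 0-100 scale; the pass map follows the pattern shown in the ggsoccer README, and the heatmap simply bins random touch locations with `geom_bin_2d()`, which is one of several reasonable choices.

```r
library(ggplot2)
library(ggsoccer)

# Made-up passes: (x, y) is the origin, (x2, y2) the target.
passes <- data.frame(
  x  = c(24, 18, 64, 78, 53),
  y  = c(43, 55, 88, 18, 44),
  x2 = c(34, 44, 81, 85, 64),
  y2 = c(40, 62, 89, 44, 28)
)

pass_map <- ggplot(passes) +
  annotate_pitch() +
  geom_segment(aes(x = x, y = y, xend = x2, yend = y2),
               arrow = arrow(length = unit(0.25, "cm"), type = "closed")) +
  theme_pitch() +
  direction_label() +
  ggtitle("Pass Map with ggsoccer")

# Made-up touch locations for a binned heatmap.
set.seed(42)
touches <- data.frame(x = runif(300, 0, 100), y = runif(300, 0, 100))

heat_map <- ggplot(touches, aes(x = x, y = y)) +
  annotate_pitch() +
  geom_bin_2d(binwidth = c(10, 10), alpha = 0.7) +   # 10 x 10 unit cells
  scale_fill_gradient(low = "white", high = "firebrick") +
  theme_pitch() +
  ggtitle("Touch Heatmap with ggsoccer")

pass_map
heat_map
```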

Build a Minimal Expected Goals (xG) Model

Start with a simple logistic regression on shot distance and shot angle, then extend it later with body part, game situation, whether the shot was assisted, and similar features.

# Input data should have columns: x (0-100), y (0-100), outcome ("Goal" or other)
# Coordinates attack toward x = 100; the goal centre sits at x = 100, y = 50
# Helper features
distance_m <- function(x, y) {
  x_m <- (100 - x) * 105/100      # distance to the goal line (pitch approx 105 m long)
  y_m <- abs(50 - y) * 68/100     # lateral offset from goal centre (pitch approx 68 m wide)
  sqrt(x_m^2 + y_m^2)
}
angle_deg <- function(x, y) {
  # Angle subtended by the goal mouth (7.32 m wide) at the shot location
  x_m <- (100 - x) * 105/100
  y_m <- (50 - y) * 68/100
  # atan2 returns a value in [0, pi] here, so no sign correction is needed
  atan2(7.32 * x_m, x_m^2 + y_m^2 - (7.32/2)^2) * 180/pi
}

# Example pipeline (replace your_shots_data with your real data)
# parsnip classification needs a factor outcome; put the event level ("1") first
# shots_df <- your_shots_data %>%
#   mutate(goal = factor(as.integer(outcome == "Goal"), levels = c(1, 0)),
#          distance = distance_m(x, y),
#          angle = angle_deg(x, y)) %>%
#   drop_na(distance, angle, goal)

# set.seed(1)
# split  <- initial_split(shots_df, prop = 0.8, strata = goal)
# train  <- training(split)
# test   <- testing(split)

# recipe
# rec <- recipe(goal ~ distance + angle, data = train)

# model
# mod <- logistic_reg() %>% set_engine("glm")

# workflow & fit (named xg_fit so it does not mask the fit() function)
# wf     <- workflow() %>% add_model(mod) %>% add_recipe(rec)
# xg_fit <- wf %>% fit(data = train)

# predict and assess (.pred_1 is the predicted probability of a goal)
# preds <- predict(xg_fit, test, type = "prob") %>% bind_cols(test)
# auc   <- yardstick::roc_auc(preds, truth = goal, .pred_1)
# auc

Tip: add features (shot type, body part, situation), interactions, and calibration checks. Consider gradient-boosted trees once you move beyond the baseline.
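Because the pipeline above needs real shot data, here is a self-contained sketch that runs end to end on simulated shots, using base R's `glm()` instead of tidymodels. The "true" scoring process is invented purely to generate labels, so the numbers say nothing about real football. It only illustrates the mechanics of fitting, computing log loss against a constant baseline, and building a simple calibration table.

```r
# Simulated shots on the 0-100 scale, attacking toward x = 100.
set.seed(1)
n   <- 20000
sim <- data.frame(x = runif(n, 65, 99), y = runif(n, 20, 80))

# Same geometry as the helpers above (105 m x 68 m pitch, 7.32 m goal).
x_m <- (100 - sim$x) * 105 / 100
y_m <- (50 - sim$y) * 68 / 100
sim$distance <- sqrt(x_m^2 + y_m^2)
sim$angle    <- atan2(7.32 * x_m, x_m^2 + y_m^2 - (7.32 / 2)^2) * 180 / pi

# ASSUMPTION: an invented scoring process, used only to create labels.
p_true   <- plogis(-1.2 - 0.11 * sim$distance + 0.03 * sim$angle)
sim$goal <- rbinom(n, 1, p_true)

# 80/20 train/test split.
train_idx <- sample(n, size = 0.8 * n)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

fit_xg  <- glm(goal ~ distance + angle, data = train, family = binomial())
test$xg <- predict(fit_xg, newdata = test, type = "response")

# Log loss (lower is better), clipped to avoid log(0).
log_loss_fn <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
log_loss <- log_loss_fn(test$goal, test$xg)
baseline <- log_loss_fn(test$goal, rep(mean(train$goal), nrow(test)))

# Calibration: mean predicted xG vs observed goal rate per xG quintile.
test$bin <- cut(test$xg,
                breaks = quantile(test$xg, probs = seq(0, 1, 0.2)),
                include.lowest = TRUE)
calib <- aggregate(cbind(xg, goal) ~ bin, data = test, FUN = mean)

c(model = log_loss, baseline = baseline)
calib
```

In a well-calibrated model the `xg` and `goal` columns of `calib` track each other closely in every bin.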

Player Ratings (Composite Indices)

Combine standardized per-90 features into role-aware indices. Keep it transparent and reproducible.

# Example: build a simple outfield index from per-90 stats
# player_per90 is a placeholder table you assemble yourself, e.g. by joining several
# fb_league_stats() stat types. Assumed columns after clean_names():
# player, squad, position, minutes, passes_completed, prog_carries, shots, tackles_int

# Min-max scale to [0, 1]; a constant column maps to 0 instead of dividing by zero
scale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))
  (x - rng[1]) / diff(rng)
}

stat_cols <- c("passes_completed", "prog_carries", "shots", "tackles_int")

# ratings <- player_per90 %>%
#   janitor::clean_names() %>%
#   mutate(across(all_of(stat_cols), ~ replace_na(as.numeric(.x), 0))) %>%
#   mutate(across(all_of(stat_cols), scale01)) %>%
#   mutate(
#     build_up   = 0.45 * passes_completed + 0.55 * prog_carries,
#     defending  = tackles_int,
#     finishing  = shots,
#     overall    = 0.4 * build_up + 0.2 * defending + 0.4 * finishing
#   ) %>%
#   arrange(desc(overall)) %>%
#   select(player, squad, position, overall, build_up, defending, finishing)

# head(ratings)

Refine by position groups, strength-of-schedule, possession effects, and multi-season stabilization.
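As one illustration of the position-group refinement, the sketch below standardizes each stat within its position group, so a defender is compared with other defenders rather than with forwards. The table, player names, and values are invented, and z-scoring is only one of several possible scaling choices.

```r
library(dplyr)

# Toy per-90 table; names and numbers are made up.
toy <- data.frame(
  player      = paste("Player", LETTERS[1:6]),
  position    = c("DF", "DF", "DF", "FW", "FW", "FW"),
  shots       = c(0.4, 0.8, 0.6, 2.5, 3.5, 3.0),
  tackles_int = c(3.0, 5.0, 4.0, 0.5, 1.5, 1.0)
)

# z-score each stat within its position group; new columns get a _z suffix.
z_by_position <- toy %>%
  group_by(position) %>%
  mutate(across(c(shots, tackles_int),
                ~ (.x - mean(.x)) / sd(.x),
                .names = "{.col}_z")) %>%
  ungroup()

z_by_position
```

After this step a forward with 2.5 shots per 90 can rank below a defender with 0.8, because each is measured against their own group.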

Team Style & Clustering

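Summarize team styles with pace, possession, pressing, directness, and field tilt, then cluster to identify tactical archetypes. The sketch below uses only shot volume, shot accuracy, and possession; add the other dimensions as you source them.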

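# NOTE: mp, sh, sot, and possession are placeholder column names. Check
# names(janitor::clean_names(team_shoot)) and rename to match. Possession is
# usually not part of the shooting table and has to be joined from
# stat_type = "possession". If the table also holds opponent rows, filter them out first.
style_df <- team_shoot %>%
  janitor::clean_names() %>%
  select(squad, mp, sh, sot, possession) %>%
  mutate(across(where(is.numeric), ~ replace_na(as.numeric(.x), 0))) %>%
  mutate(
    shots_per_match = sh / pmax(mp, 1),   # pmax() guards against division by zero
    sot_rate        = sot / pmax(sh, 1)
  ) %>%
  select(squad, shots_per_match, sot_rate, possession) %>%
  drop_na()

set.seed(7)
# Scale features so no single unit dominates the distance metric
km <- kmeans(scale(style_df %>% select(-squad)), centers = 4, nstart = 50)
style_df$cluster <- factor(km$cluster)
style_df %>% arrange(cluster) %>% head()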
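The choice of four clusters above is arbitrary. A quick way to sanity-check it is an elbow scan of the total within-cluster sum of squares across several values of k. The sketch below runs on synthetic data with three built-in groups, so the feature matrix is invented and only the procedure carries over.

```r
set.seed(7)

# Synthetic "teams": 3 groups of 7, each scattered around its own centre.
centers <- rbind(c(0, 0, 0), c(4, 4, 0), c(0, 4, 4))
X <- do.call(rbind, lapply(1:3, function(i) {
  noise <- matrix(rnorm(7 * 3, sd = 0.5), ncol = 3)
  noise + matrix(centers[i, ], nrow = 7, ncol = 3, byrow = TRUE)
}))
colnames(X) <- c("shots_per_match", "sot_rate", "possession")
X <- scale(X)

# Total within-cluster sum of squares for k = 1..6.
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

elbow <- data.frame(k = ks, wss = round(wss, 2))
elbow
```

Look for the k after which `wss` stops dropping sharply. With real team data the bend is usually less clean, so treat it as a guide rather than a rule.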

Match Outcome Prediction (Sketch)

Engineer features like team strength (rolling/xG), rest days, home/away, travel distance, injuries (if available), and betting odds.
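A few of these features can be sketched with dplyr alone. The fixture list below is invented, and `implied_probs()` is a hypothetical helper that removes the bookmaker margin by proportional normalization, which is one simple convention among several.

```r
library(dplyr)

# Toy fixture list for one team; dates and xG values are made up.
team_matches <- data.frame(
  team = "Team A",
  date = as.Date(c("2024-08-17", "2024-08-24", "2024-08-28", "2024-09-14")),
  xg   = c(1.2, 0.8, 2.1, 1.5)
)

features <- team_matches %>%
  arrange(team, date) %>%
  group_by(team) %>%
  mutate(
    # Days since the team's previous match (NA for the first match).
    rest_days = as.numeric(date - lag(date)),
    # Mean xG over up to 3 previous matches; the current match is excluded
    # so the feature does not leak the result being predicted.
    form_xg = sapply(seq_along(xg), function(i) {
      if (i == 1) NA_real_ else mean(xg[max(1, i - 3):(i - 1)])
    })
  ) %>%
  ungroup()

# Hypothetical helper: decimal odds -> implied probabilities that sum to 1.
implied_probs <- function(odds_home, odds_draw, odds_away) {
  raw <- cbind(home = 1 / odds_home, draw = 1 / odds_draw, away = 1 / odds_away)
  raw / rowSums(raw)
}

features
implied_probs(2.0, 3.5, 4.0)
```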

# Toy example: classify home win/draw/away
# matches_df with columns: result (factor: H/D/A), home_xg, away_xg, home_form, away_form, odds_home, odds_draw, odds_away

# set.seed(123)
# split <- initial_split(matches_df, prop = 0.8, strata = result)
# train <- training(split); test <- testing(split)

# rec <- recipe(result ~ home_xg + away_xg + home_form + away_form + odds_home + odds_draw + odds_away, data = train) %>%
#        step_zv(all_predictors()) %>% step_normalize(all_numeric_predictors())

# mod <- multinom_reg() %>% set_engine("nnet") %>% set_mode("classification")
# wf  <- workflow() %>% add_recipe(rec) %>% add_model(mod)
# outcome_fit <- fit(wf, train)   # named outcome_fit so it does not mask fit()

# preds <- predict(outcome_fit, test) %>% bind_cols(test)
# metrics(preds, truth = result, estimate = .pred_class)

Reproducible Workflow

Environment

Freeze dependencies with renv::init(); capture R and package versions for long-term stability.
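A typical renv session looks roughly like the sketch below. The renv calls are left commented out because they modify the project directory and are meant to be run interactively from the project root. The last two lines are an optional extra that saves `sessionInfo()` to a text file; the temp-dir location is just a placeholder.

```r
# install.packages("renv")
# renv::init()       # create renv.lock and a project-private library
# renv::snapshot()   # record newly installed or upgraded packages
# renv::status()     # check whether library and lockfile are in sync
# renv::restore()    # on another machine: reinstall the locked versions

# Optional: keep a human-readable record of the R and package versions.
session_file <- file.path(tempdir(), "session-info.txt")  # placeholder location
writeLines(capture.output(sessionInfo()), session_file)
```

Commit `renv.lock` to version control so collaborators restore the same package versions.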

Pipelines

Use targets to orchestrate data pulls, cleaning, modeling, and report generation with proper caching.
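A pipeline definition lives in a `_targets.R` file at the project root. The sketch below is only a starting point: the target names and the `clean_results()` helper are hypothetical, and a real pipeline would add modeling and report targets.

```r
# _targets.R (sketch; target names and clean_results() are hypothetical)
library(targets)

# Packages each target needs when it runs.
tar_option_set(packages = c("worldfootballR", "dplyr"))

# Hypothetical cleaning step: drop exact duplicate rows.
clean_results <- function(df) {
  dplyr::distinct(df)
}

list(
  # Pull raw match results; rerun only when this command changes.
  tar_target(
    raw_results,
    fb_match_results(country = "ENG", gender = "M",
                     season_end_year = 2024, tier = "1st")
  ),
  # Downstream target; reruns only when raw_results or clean_results() change.
  tar_target(results_clean, clean_results(raw_results))
)
```

Run `targets::tar_make()` to build the pipeline and `targets::tar_read(results_clean)` to load a finished target.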

Publishing

Create Quarto notebooks and dashboards; render HTML/PDF; share parameterized analyses per team/season.
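Parameterized rendering can be driven from R with the quarto package, which needs the Quarto CLI installed. In the sketch below, `render_team_report()`, `report_filename()`, and `report.qmd` are hypothetical names. The `.qmd` file is assumed to declare `team` and `season_end_year` under `params:` in its YAML header.

```r
# Hypothetical helper: build a file-system-friendly output name.
report_filename <- function(team, season_end_year) {
  slug <- gsub("[^a-z0-9]+", "-", tolower(team))
  paste0(slug, "-", season_end_year, ".html")
}

# Hypothetical wrapper: render one report per team/season.
render_team_report <- function(team, season_end_year, input = "report.qmd") {
  quarto::quarto_render(
    input          = input,
    output_file    = report_filename(team, season_end_year),
    execute_params = list(team = team, season_end_year = season_end_year)
  )
}

# Usage (needs report.qmd in the working directory):
# render_team_report("Manchester City", 2024)
```

Looping the wrapper over a vector of team names produces one report per team from a single template.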


FAQs

Do I need prior R experience?

No. Start by running the setup cell and copy the snippets section by section.

Which data sources are supported?

FBref for detailed team/player tables, Transfermarkt for transfers/valuations, plus optional Understat/Fotmob endpoints exposed in worldfootballR.

Will the xG model generalize?

Only partly. A distance-plus-angle baseline is a reasonable starting point, not a finished model. Add more features, evaluate with ROC AUC, log loss, and calibration, and validate on held-out seasons and leagues before relying on it elsewhere.

How do I make nicer visuals?

Leverage ggsoccer with ggplot2 scales, facets (by player/zone), and team colors. Combine pass networks, shot maps, and heatmaps.

© — Built with R, worldfootballR, ggsoccer, tidyverse, and tidymodels.
