A practical, code-first walkthrough for fetching rich football data, building models, and producing clear visuals entirely in R.
What You’ll Learn
- Pull player, team, and match data from FBref and Transfermarkt using worldfootballR.
- Create football visuals (shot maps, pass maps, heatmaps) with ggsoccer and ggplot2.
- Build a baseline expected goals (xG) model with logistic regression.
- Assemble player ratings with tidy feature engineering and scaling.
- Prototype match outcome prediction with tidymodels.
- Ship reproducible projects via renv, targets, and Quarto.
Everything below is copy-ready R. Adjust team/season names to your league of interest.
Setup
# install.packages(c("worldfootballR", "ggsoccer", "tidyverse", "tidymodels", "janitor"))
library(worldfootballR)
library(ggplot2)
library(ggsoccer)
library(tidyverse)
library(tidymodels)
Get Data with worldfootballR
worldfootballR provides friendly functions to pull league, team, player, and match-level tables from FBref and Transfermarkt.
League & Team URLs
# Example: Premier League 2023–24 (adjust season_end_year and tier as needed)
league_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2024, tier = "1st")
team_urls <- fb_teams_urls(league_urls[1]) # reuse the season URL found above so teams match the chosen season
head(league_urls); head(team_urls)
Team & Player Season Stats
# Team season stats (shooting)
team_shoot <- fb_season_team_stats(country = "ENG", gender = "M",
                                   season_end_year = 2024, tier = "1st",
                                   stat_type = "shooting")
# Player season stats (passing)
player_pass <- fb_league_stats(country = "ENG", gender = "M",
                               season_end_year = 2024, tier = "1st",
                               stat_type = "passing", team_or_player = "player")
glimpse(team_shoot); glimpse(player_pass)
Match Results & Events
# Full season match results
epl_results <- fb_match_results(country = "ENG", gender = "M",
                                season_end_year = 2024, tier = "1st")
# Single-match events (shooting / chance creation)
match_url <- "https://fbref.com/en/matches/CHANGE-THIS-TO-A-REAL-ID"
# shots_one <- fb_match_shooting(match_url = match_url)
# summary_one <- fb_match_summary(match_url = match_url)
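Once the season results are in a data frame, rebuilding the league table is a quick sanity check on the pull. The sketch below runs on a hand-made four-match table in base R. The column names Home, Away, HomeGoals, and AwayGoals are an assumption about what fb_match_results() returns, so confirm them with names(epl_results) first; the teams and scores are invented.

```r
# Toy results table; column names are assumed to mirror fb_match_results() output
results <- data.frame(
  Home      = c("A", "B", "C", "A"),
  Away      = c("B", "C", "A", "C"),
  HomeGoals = c(2, 1, 0, 3),
  AwayGoals = c(1, 1, 2, 0)
)

# One row per team per match, seen from that team's side
long <- rbind(
  data.frame(team = results$Home, gf = results$HomeGoals, ga = results$AwayGoals),
  data.frame(team = results$Away, gf = results$AwayGoals, ga = results$HomeGoals)
)
long$played <- 1
long$pts <- ifelse(long$gf > long$ga, 3, ifelse(long$gf == long$ga, 1, 0))

# Sum per team, then sort by points and goal difference
league_table <- aggregate(cbind(played, gf, ga, pts) ~ team, data = long, FUN = sum)
league_table$gd <- league_table$gf - league_table$ga
league_table <- league_table[order(-league_table$pts, -league_table$gd), ]
league_table
```

If the rebuilt table disagrees with the published standings, the usual suspects are unplayed fixtures with missing scores or duplicated rows.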
Transfer & Valuation (Transfermarkt)
# Example: player dictionary (FBref <--> Transfermarkt) for easier joins
map_tbl <- player_dictionary_mapping()
head(map_tbl)
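The mapping table earns its keep when you join FBref stats to Transfermarkt valuations. Here is a minimal base R sketch on invented data. The column names PlayerFBref, UrlFBref, and UrlTmarkt are assumed to match what player_dictionary_mapping() returns (check names(map_tbl)); the stats and valuation tables, their columns, and all URLs are hypothetical stand-ins.

```r
# Toy stand-ins; all URLs and values are invented for illustration
map_toy <- data.frame(
  PlayerFBref = c("Player One", "Player Two", "Player Three"),
  UrlFBref    = c("fbref/p1", "fbref/p2", "fbref/p3"),
  UrlTmarkt   = c("tm/p1", "tm/p2", NA)
)
fbref_stats <- data.frame(UrlFBref = c("fbref/p1", "fbref/p2", "fbref/p3"),
                          goals = c(12, 4, 7))
tm_values <- data.frame(UrlTmarkt = c("tm/p1", "tm/p2"),
                        value_eur = c(40e6, 15e6))

# FBref stats -> mapping -> Transfermarkt values; left joins keep unmapped players
joined <- merge(fbref_stats, map_toy, by = "UrlFBref", all.x = TRUE)
joined <- merge(joined, tm_values, by = "UrlTmarkt", all.x = TRUE)

# Players without a Transfermarkt match surface as NA instead of silently vanishing
unmatched <- joined$PlayerFBref[is.na(joined$value_eur)]
unmatched
```

Joining on URLs rather than names sidesteps spelling and accent differences between the two sites.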
Visuals with ggsoccer
ggsoccer adds a soccer pitch as a ggplot2 layer so you can plot events cleanly.
Basic Pitch
ggplot() +
  annotate_pitch() +
  theme_pitch() +
  ggtitle("Standard Pitch")
Shot Map (toy example)
# Fake shots; replace with real shot coordinates (often 0–100 scale)
shots <- tibble(
  x = c(88, 91, 78, 70, 60, 92, 84, 80),
  y = c(52, 54, 40, 63, 58, 47, 36, 60),
  outcome = c("Goal", "Goal", "Miss", "Miss", "Miss", "Goal", "Miss", "Miss")
)
ggplot(shots, aes(x = x, y = y)) +
  annotate_pitch() +
  geom_point(aes(shape = outcome), size = 3, alpha = .9) +
  theme_pitch() +
  coord_flip(xlim = c(49, 101)) + # show attacking half
  scale_y_reverse() +
  ggtitle("Shot Map with ggsoccer")
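Real shot coordinates do not always arrive on a 0-100 scale, so it can help to rescale them before plotting. The helper below is a hypothetical convenience function, not part of ggsoccer, and the 120 x 80 source grid is only an assumed example; substitute the dimensions your provider documents.

```r
# Rescale pitch coordinates from a provider's grid to the 0-100 scale
# (rescale_pitch is a hypothetical helper; x_max / y_max are the provider's pitch extents)
rescale_pitch <- function(x, y, x_max, y_max) {
  data.frame(x = 100 * x / x_max, y = 100 * y / y_max)
}

# Invented raw shots on an assumed 120 x 80 grid
raw <- data.frame(x = c(108, 96, 60), y = c(40, 20, 80))
shots_100 <- rescale_pitch(raw$x, raw$y, x_max = 120, y_max = 80)
shots_100
```

Some providers also flip the y-axis or the direction of attack, which a linear rescale does not correct; check a known event such as a penalty before trusting the plot.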
Build a Minimal Expected Goals (xG) Model
Start with a simple logistic regression using shot distance and angle (extend later with body part, situation, assisted, etc.).
# shots_df should have columns: x (0-100), y (0-100), goal (0/1)
# Helper features
distance_m <- function(x, y) {
  x_m <- (100 - x) * 105/100 # approx pitch length in meters
  y_m <- abs(50 - y) * 68/100 # approx pitch width in meters
  sqrt(x_m^2 + y_m^2)
}
angle_deg <- function(x, y) {
  x_m <- (100 - x) * 105/100
  y_m <- (50 - y) * 68/100
  ang <- atan((7.32 * x_m) / (x_m^2 + y_m^2 - (7.32/2)^2))
  ifelse(ang < 0, ang + pi, ang) * 180/pi
}
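A quick way to trust the helpers is to test them on a shot whose geometry is known. The sketch below repeats the two functions so it runs on its own, then evaluates them at the penalty spot (11 m out, central) on the assumed 105 x 68 m pitch. For a central shot, the angle subtended by a 7.32 m goal has the closed form 2 * atan(3.66 / d), which gives an independent check.

```r
distance_m <- function(x, y) {
  x_m <- (100 - x) * 105/100
  y_m <- abs(50 - y) * 68/100
  sqrt(x_m^2 + y_m^2)
}
angle_deg <- function(x, y) {
  x_m <- (100 - x) * 105/100
  y_m <- (50 - y) * 68/100
  ang <- atan((7.32 * x_m) / (x_m^2 + y_m^2 - (7.32/2)^2))
  ifelse(ang < 0, ang + pi, ang) * 180/pi
}

# Penalty spot on the 0-100 scale: 11 m from the goal line, centered
x_pen <- 100 - 11 / 1.05
d <- distance_m(x_pen, 50)
a <- angle_deg(x_pen, 50)

# Independent check for a central shot
a_check <- 2 * atan((7.32 / 2) / d) * 180 / pi
c(distance = d, angle = a, check = a_check)

# Same depth but wide of goal: the visible goal mouth shrinks
a_wide <- angle_deg(x_pen, 20)
```

The angle at the penalty spot comes out a little under 37 degrees, and it drops sharply for the wide shot, which is the behavior the model relies on.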
# Example pipeline (replace your_shots_data with your real data)
# shots_df <- your_shots_data %>%
#   # classification needs a factor outcome; "1" is listed first so yardstick treats it as the event
#   mutate(goal = factor(as.integer(outcome == "Goal"), levels = c(1, 0)),
#          distance = distance_m(x, y),
#          angle = angle_deg(x, y)) %>%
#   drop_na(distance, angle, goal)
# set.seed(1)
# split <- initial_split(shots_df, prop = 0.8, strata = goal)
# train <- training(split)
# test <- testing(split)
# recipe
# rec <- recipe(goal ~ distance + angle, data = train)
# model
# mod <- logistic_reg() %>% set_engine("glm")
# workflow & fit (named xg_fit so it does not mask the fit() function)
# wf <- workflow() %>% add_model(mod) %>% add_recipe(rec)
# xg_fit <- wf %>% fit(data = train)
# predict and assess
# preds <- predict(xg_fit, test, type = "prob") %>% bind_cols(test)
# auc <- yardstick::roc_auc(preds, truth = goal, .pred_1)
# auc
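If you want to see the whole loop run before wiring in real shots, the following base R sketch fits the same distance-plus-angle logistic regression with glm() on simulated data. It is a stand-in for the tidymodels workflow, not a replacement: the data-generating coefficients are invented, and the auc() helper is a hypothetical function that uses the rank-sum identity.

```r
set.seed(1)
n <- 2000
shots_sim <- data.frame(
  distance = runif(n, 5, 35),  # meters
  angle    = runif(n, 5, 60)   # degrees
)
# Invented data-generating process: closer, wider-angle shots score more often
p_true <- plogis(-1 - 0.12 * shots_sim$distance + 0.03 * shots_sim$angle)
shots_sim$goal <- rbinom(n, 1, p_true)

# 80/20 train-test split
train_idx <- sample(n, size = 0.8 * n)
train <- shots_sim[train_idx, ]
test  <- shots_sim[-train_idx, ]

# Logistic regression; predictions on the response scale are xG values
xg_fit <- glm(goal ~ distance + angle, data = train, family = binomial())
test$xg <- predict(xg_fit, newdata = test, type = "response")

# AUC via the rank-sum (Mann-Whitney) identity
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
test_auc <- auc(test$xg, test$goal)
coef(xg_fit)
test_auc
```

On real data, expect the distance coefficient to be negative; a positive sign usually points to flipped coordinates.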
Tip: add features (shot type, body part, situation), interactions, and calibration checks. Consider GBMs/trees when you move beyond baseline.
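A calibration check asks whether shots given an xG of about 0.1 really go in about 10% of the time. The sketch below shows one simple way to tabulate that, alongside log loss, using base R. The predictions are simulated and calibrated by construction, so the two columns should roughly agree; the bin edges are an arbitrary choice, and log_loss() is a hypothetical helper.

```r
set.seed(42)
n <- 5000
xg   <- rbeta(n, 1, 8)     # invented predicted probabilities
goal <- rbinom(n, 1, xg)   # outcomes drawn from those probabilities

# Log loss (lower is better); clip probabilities to avoid log(0)
log_loss <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
ll_model    <- log_loss(xg, goal)
ll_baseline <- log_loss(rep(mean(goal), n), goal)  # always predict the base rate

# Calibration table: bin by predicted xG, compare mean prediction with observed goal rate
bins <- cut(xg, breaks = c(0, 0.05, 0.1, 0.2, 0.4, 1), include.lowest = TRUE)
calib <- data.frame(
  bin       = levels(bins),
  shots     = as.vector(table(bins)),
  mean_xg   = as.vector(tapply(xg, bins, mean)),
  goal_rate = as.vector(tapply(goal, bins, mean))
)
calib
c(model = ll_model, baseline = ll_baseline)
```

With a real model, large gaps between mean_xg and goal_rate in well-populated bins are the signal to look at, and the base-rate log loss gives a floor the model should beat.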
Player Ratings (Composite Indices)
Combine standardized per-90 features into role-aware indices. Keep it transparent and reproducible.
# Example: build a simple outfield index from FBref per-90 stats
# Assume player_per90 is a table you have assembled (e.g. by joining several stat_type pulls) with columns:
# player, squad, position, minutes, passes_completed, carries, shots_total, tackles_interceptions
# Min-max scale to [0, 1]; returns 0 for a constant column to avoid dividing by zero
scale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  if (diff(rng) == 0) return(rep(0, length(x)))
  (x - rng[1]) / diff(rng)
}
metric_cols <- c("passes_completed", "carries", "shots_total", "tackles_interceptions")
ratings <- player_per90 %>%
  janitor::clean_names() %>%
  mutate(across(all_of(metric_cols), ~ replace_na(as.numeric(.), 0))) %>%
  mutate(across(all_of(metric_cols), scale01)) %>%
  mutate(
    build_up = 0.45 * passes_completed + 0.55 * carries,
    defending = tackles_interceptions,
    finishing = shots_total,
    overall = 0.4 * build_up + 0.2 * defending + 0.4 * finishing
  ) %>%
  arrange(desc(overall)) %>%
  select(player, squad, position, overall, build_up, defending, finishing)
head(ratings)
Refine by position groups, strength-of-schedule, possession effects, and multi-season stabilization.
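As a first step toward position-aware ratings, the sketch below converts a raw total to a per-90 rate, drops small samples, and z-scores within position groups so forwards are compared with forwards. It is base R on invented data; the 900-minute cut-off and the column names are illustrative assumptions.

```r
# Invented player totals
players <- data.frame(
  player   = c("P1", "P2", "P3", "P4", "P5", "P6"),
  position = c("FW", "FW", "FW", "DF", "DF", "DF"),
  minutes  = c(1800, 900, 2700, 2700, 1800, 450),
  shots    = c(60, 20, 60, 15, 5, 1)
)

# Convert a raw total to a per-90 rate
players$shots_p90 <- players$shots / players$minutes * 90

# Drop small samples before scaling; 900 minutes is an arbitrary threshold
eligible <- players[players$minutes >= 900, ]

# z-score within each position group
z <- function(v) (v - mean(v)) / sd(v)
eligible$shots_z <- ave(eligible$shots_p90, eligible$position, FUN = z)
eligible
```

Filtering before scaling matters: a substitute with one shot in twenty minutes would otherwise stretch the scale for everyone else.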
Team Style & Clustering
Summarize team styles with pace, possession, pressing, directness, and field tilt; cluster to identify tactical archetypes.
# Column names below (mp, sh, sot, possession) are placeholders: check names() after clean_names(),
# and join a possession table first if your shooting pull does not include possession
style_df <- team_shoot %>%
  janitor::clean_names() %>%
  select(squad, mp, sh, sot, possession) %>%
  mutate(across(where(is.numeric), ~ replace_na(as.numeric(.), 0))) %>%
  mutate(
    shots_per_match = sh / pmax(mp, 1),
    sot_rate = sot / pmax(sh, 1)
  ) %>%
  select(squad, shots_per_match, sot_rate, possession) %>%
  drop_na()
set.seed(7)
# Scale features first so no single variable dominates the distance metric
km <- kmeans(scale(style_df %>% select(-squad)), centers = 4, nstart = 50)
style_df$cluster <- factor(km$cluster)
style_df %>% arrange(cluster) %>% head()
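The choice of four clusters above is a judgment call. One common heuristic, shown here as a rough sketch on invented data, is to plot the total within-cluster sum of squares against k and look for the point where the curve flattens (the "elbow"). The toy features and group structure are made up for illustration.

```r
set.seed(7)
# Invented team-style features: three loose groups of teams
toy_style <- data.frame(
  shots_per_match = c(rnorm(7, 10, 1), rnorm(7, 13, 1), rnorm(6, 16, 1)),
  possession      = c(rnorm(7, 42, 2), rnorm(7, 50, 2), rnorm(6, 60, 2))
)
X <- scale(toy_style)

# Total within-cluster sum of squares for k = 1..6
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
data.frame(k = ks, wss = round(wss, 2))

# plot(ks, wss, type = "b") shows the elbow visually
```

With only 20 teams in a league the elbow is often ambiguous, so treat it as a guide and check that the resulting clusters make football sense.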
Match Outcome Prediction (Sketch)
Engineer features like team strength (rolling/xG), rest days, home/away, travel distance, injuries (if available), and betting odds.
# Toy example: classify home win/draw/away
# matches_df with columns: result (factor: H/D/A), home_xg, away_xg, home_form, away_form, odds_home, odds_draw, odds_away
# set.seed(123)
# split <- initial_split(matches_df, prop = 0.8, strata = result)
# train <- training(split); test <- testing(split)
# rec <- recipe(result ~ home_xg + away_xg + home_form + away_form + odds_home + odds_draw + odds_away, data = train) %>%
#   step_zv(all_predictors()) %>% step_normalize(all_numeric_predictors())
# mod <- multinom_reg() %>% set_engine("nnet") %>% set_mode("classification")
# wf <- workflow() %>% add_recipe(rec) %>% add_model(mod)
# match_fit <- fit(wf, train) # named match_fit so it does not mask the fit() function
# preds <- predict(match_fit, test) %>% bind_cols(test)
# metrics(preds, truth = result, estimate = .pred_class)
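Raw decimal odds are awkward predictors because they include the bookmaker's margin. A common preprocessing step, sketched here in base R, converts them to implied probabilities and rescales each row to sum to 1 (proportional normalization, the simplest of several margin-removal methods). The implied_probs() helper is hypothetical and the odds are invented.

```r
# Convert decimal odds to implied probabilities and remove the bookmaker margin
implied_probs <- function(odds_home, odds_draw, odds_away) {
  raw <- cbind(home = 1 / odds_home, draw = 1 / odds_draw, away = 1 / odds_away)
  overround <- rowSums(raw)             # exceeds 1 by the bookmaker's margin
  cbind(raw / overround, overround = overround)
}

# Invented odds for two matches
odds <- data.frame(odds_home = c(2.0, 1.5),
                   odds_draw = c(3.5, 4.2),
                   odds_away = c(4.0, 7.0))
probs <- implied_probs(odds$odds_home, odds$odds_draw, odds$odds_away)
round(probs, 3)
```

The normalized probabilities also make a strong benchmark: a model that cannot match their log loss is not adding information beyond the market.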
Reproducible Workflow
Environment
Initialize a project-local library with renv::init(), then record exact package versions (and the R version) in renv.lock with renv::snapshot() for long-term stability.
Pipelines
Use targets to orchestrate data pulls, cleaning, modeling, and report generation with proper caching.
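As a rough illustration, a minimal _targets.R file for this kind of project might look like the sketch below. The function names pull_results() and count_matches() are hypothetical placeholders, and the pipeline is deliberately tiny; it assumes the targets and worldfootballR packages are installed.

```r
# _targets.R (sketch): pull_results() and count_matches() are hypothetical placeholders
library(targets)

# Packages each target needs when it runs
tar_option_set(packages = c("worldfootballR"))

pull_results <- function(season_end_year) {
  worldfootballR::fb_match_results(country = "ENG", gender = "M",
                                   season_end_year = season_end_year, tier = "1st")
}
count_matches <- function(results) nrow(results)

# Each target reruns only when its code or upstream targets change
list(
  tar_target(season, 2024),
  tar_target(results, pull_results(season)),
  tar_target(n_matches, count_matches(results))
)
```

Running targets::tar_make() from the project root builds the pipeline, and targets::tar_read(n_matches) retrieves a stored result; the cached scrape is only repeated when season or pull_results() changes.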
Publishing
Create Quarto notebooks and dashboards; render HTML/PDF; share parameterized analyses per team/season.
FAQs
Do I need prior R experience?
No. Start by running the setup cell, then copy the snippets section by section.
Which data sources are supported?
FBref for detailed team/player tables, Transfermarkt for transfers/valuations, plus optional Understat/Fotmob endpoints exposed in worldfootballR.
Will the xG model generalize?
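Partly. A distance-plus-angle baseline is a reasonable starting point, but it ignores context such as body part and game situation. Add more features and evaluate with ROC AUC, log loss, and calibration, and validate on seasons and leagues the model was not trained on.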
How do I make nicer visuals?
Leverage ggsoccer with ggplot2 scales, facets (by player/zone), and team colors. Combine pass networks, shot maps, and heatmaps.