A practical, code-first guide to collecting soccer data, building tidy datasets, and producing shareable analysis in R using worldfootballR.
Want the full end-to-end project template (clean folders, reusable functions, and advanced modeling examples)?
Mastering Football Data with worldfootballR: practical pipelines with examples from FBref, Transfermarkt, and Understat.
What is worldfootballR?
worldfootballR is an open-source R package that helps you collect and analyze soccer data from popular sources like FBref, Transfermarkt, and Understat. Instead of manually copying tables or scraping pages by hand, you can use a consistent set of functions to pull structured data and immediately start analysis.
If you're doing sports analytics in R (player scouting, team profiling, match modeling, xG-based analysis), this package can save you hours.
Install & Setup
You can install from CRAN (stable) or GitHub (latest development version).
Install from CRAN
install.packages("worldfootballR")
Install the latest version from GitHub
install.packages("devtools")
devtools::install_github("JaseZiv/worldfootballR")
Load the package
library(worldfootballR)
Tip: If you're starting fresh, install a small "analytics toolkit" too:
install.packages(c("dplyr","tidyr","purrr","stringr","readr","ggplot2","janitor"))
Data Sources: FBref vs Transfermarkt vs Understat
worldfootballR supports different sources. Here's how to think about them:
- FBref: rich tables for teams/players, season stats, match logs, advanced metrics.
- Transfermarkt: squads, player market values, transfer histories, staff details.
- Understat: expected goals (xG), shot-level data, shot locations, league and team data.
Most practical workflows combine sources: FBref for structured season stats, Understat for xG/shot detail, and Transfermarkt for market context.
A Reproducible Workflow (Recommended Project Structure)
The biggest difference between "a script that works once" and "analysis you can trust" is a reproducible workflow. Here's a simple structure you can use:
soccer-project/
  data_raw/
  data_clean/
  R/
    01_download.R
    02_clean.R
    03_visualize.R
  outputs/
  README.md
The key ideas: cache data, clean consistently, and separate download from analysis.
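One way to set up that layout from R is sketched below. It uses base R only and builds the tree under a temporary directory so it is safe to run anywhere; the `root` variable is an illustrative name, and you would point it at your real project folder instead.

```r
# Create the folder layout under a root directory.
# tempdir() is used here only so the sketch has no side effects.
root <- file.path(tempdir(), "soccer-project")
dirs <- c("data_raw", "data_clean", "R", "outputs")

for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

# Empty placeholder scripts, one per pipeline stage.
scripts <- file.path(root, "R", c("01_download.R", "02_clean.R", "03_visualize.R"))
invisible(file.create(scripts[!file.exists(scripts)]))

readme <- file.path(root, "README.md")
if (!file.exists(readme)) invisible(file.create(readme))

# Show what was created.
print(list.files(root, recursive = TRUE, include.dirs = TRUE))
```

Keeping the numbered scripts separate means a failed download never forces you to rerun cleaning or plotting code, and the reverse.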
FBref: Team & Player Stats
Example: download Premier League team shooting stats for the 2024/25 season (season_end_year = 2025).
library(worldfootballR)
library(dplyr)
epl_shooting <- fb_season_team_stats(
  country = "ENG",
  gender = "M",
  season_end_year = 2025,
  tier = "1st",
  stat_type = "shooting"
)
dplyr::glimpse(epl_shooting)
head(epl_shooting)
FBref outputs are usually already table-like, but you'll still want to standardize names and types before modeling or plotting. We'll do that in the Clean & Tidy section.
Transfermarkt: Squads & Transfers
Example: get player URLs for a team and fetch transfer history.
library(worldfootballR)
team_players <- tm_team_player_urls(
  "https://www.transfermarkt.com/fc-bayern-munchen/startseite/verein/27"
)
transfers <- tm_player_transfer_history(player_urls = team_players)
head(transfers)
Transfermarkt data is great for contextual features: player age, market value, transfer fees (where available), squad churn, and career moves that might affect performance.
Understat: xG, Shots, and Match Data
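Example: get league shots for the EPL 2024/25 season (season_start_year = 2024).
library(worldfootballR)
# understat_league_season_shots() pulls a single season.
# load_understat_league_shots(league = "EPL") instead loads pre-scraped shots
# for all available seasons and, as far as the package docs show, takes no
# season argument (check ?load_understat_league_shots in your installed version).
epl_shots <- understat_league_season_shots(
  league = "EPL",
  season_start_year = 2024
)
dplyr::glimpse(epl_shots)
head(epl_shots)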
Shot-level data enables stronger analysis: xG trends, shot maps, finishing vs expected, and features for match prediction.
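As a rough illustration of "finishing vs expected", the sketch below aggregates a toy shot table per team. The data is made up, and the column names (team, xg, is_goal) are hypothetical stand-ins; run names() on your real Understat pull and rename accordingly. It uses base R only so it runs without extra packages.

```r
# Toy shot table standing in for real shot-level output.
# Column names and values are illustrative, not the real Understat schema.
shots <- data.frame(
  team    = c("Arsenal", "Arsenal", "Arsenal", "Chelsea", "Chelsea"),
  xg      = c(0.10, 0.45, 0.05, 0.30, 0.20),
  is_goal = c(0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# One row per shot, so a column of ones sums to the shot count.
shots$n_shots <- 1

# Sum shots, xG, and goals per team.
team_xg <- aggregate(cbind(n_shots, xg, is_goal) ~ team, data = shots, FUN = sum)
names(team_xg)[names(team_xg) == "is_goal"] <- "goals"

# Positive values mean more goals than the shots were "worth".
team_xg$goals_minus_xg <- team_xg$goals - team_xg$xg

print(team_xg[order(-team_xg$goals_minus_xg), ])
```

Over a handful of shots this difference is mostly noise; it becomes more informative across a full season or several seasons.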
Clean & Tidy Your Data
Even when data looks "ready," it's worth standardizing: consistent names, numeric types, missing values, and team identifiers.
library(dplyr)
library(janitor)
epl_shooting_clean <- epl_shooting %>%
  janitor::clean_names() %>%
  mutate(across(where(is.character), ~ trimws(.)))
If you plan to join multiple datasets (FBref + Understat + Transfermarkt), decide early how you will match teams and players (names, IDs, URLs). Consistent keys prevent headaches later.
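A minimal sketch of one matching approach: normalize team names into a key, apply a small manual alias table for names that normalization cannot reconcile, then join and inspect what failed to match. The helper names (make_team_key, resolve_alias), the alias entries, and the toy values are all hypothetical. The normalization only handles ASCII, so accented names would need extra care.

```r
# Normalize a team name into a join key: lower-case, replace punctuation
# with spaces, drop common club affixes, collapse whitespace.
make_team_key <- function(x) {
  key <- tolower(trimws(x))
  key <- gsub("[^a-z0-9 ]", " ", key)
  key <- gsub("\\b(fc|afc|cf)\\b", " ", key, perl = TRUE)
  key <- gsub("\\s+", " ", key)
  trimws(key)
}

# Manual overrides, maintained by hand as you discover mismatches.
aliases <- c("manchester utd" = "manchester united",
             "man utd"        = "manchester united")

resolve_alias <- function(key) {
  hit <- key %in% names(aliases)
  key[hit] <- unname(aliases[key[hit]])
  key
}

# Toy tables with made-up numbers, named after the sources they imitate.
fbref_like     <- data.frame(squad = c("Arsenal", "Manchester Utd"),
                             shots = c(600, 520))
understat_like <- data.frame(team = c("Arsenal FC", "Manchester United"),
                             xg   = c(75.2, 58.1))

fbref_like$team_key     <- resolve_alias(make_team_key(fbref_like$squad))
understat_like$team_key <- resolve_alias(make_team_key(understat_like$team))

# Full join so unmatched names show up instead of silently disappearing.
joined <- merge(fbref_like, understat_like, by = "team_key", all = TRUE)

# Rows with NA on either side are names that still need an alias.
unmatched <- joined[is.na(joined$squad) | is.na(joined$team), ]
print(joined)
```

Checking `unmatched` after every join is the important habit here: an inner join would drop those rows without any warning.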
Visualize Insights (ggplot2 + optional ggsoccer)
Example: a simple team ranking plot (adjust column names to your table).
library(ggplot2)
library(dplyr)
# Replace 'shots_total' with a numeric column that exists in your table:
# names(epl_shooting_clean)
# ggplot(epl_shooting_clean, aes(x = reorder(squad, shots_total), y = shots_total)) +
# geom_col() +
# coord_flip() +
# labs(title = "Top Teams by Shots (EPL 2024/25)", x = "", y = "Shots")
ggsoccer is optional here. If you want pitch plots such as shot maps, add it on top of ggplot2; everything else in this guide works without it.
Mini Project: Build an EPL Team Snapshot
This mini workflow downloads FBref team shooting stats, cleans them, and creates a compact "team snapshot" table. It's a realistic deliverable for analysts.
library(worldfootballR)
library(dplyr)
library(janitor)
# 1) Download
epl_shooting <- fb_season_team_stats(
  country = "ENG",
  gender = "M",
  season_end_year = 2025,
  tier = "1st",
  stat_type = "shooting"
)
# 2) Clean
df <- epl_shooting %>%
  clean_names() %>%
  mutate(across(where(is.character), ~ trimws(.)))
# 3) Snapshot (adapt columns to what you have)
# Run names(df) first. clean_names() can split mixed-case names, so FBref
# columns may appear as e.g. x_g_expected or so_t_standard rather than xg.
# If the table has a team_or_opponent column, filter to the "team" rows first.
snapshot <- df %>%
  select(squad, matches("shots|^sh_|so_t|shots_on_target|xg|x_g|npxg|npx_g|goals|gls")) %>%
  head(10)
snapshot
From here you can extend into rolling trends, opponent-adjusted metrics, and match prediction pipelines.
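As a starting point for rolling trends, here is a small sketch of a trailing rolling mean over per-match xG. The rolling_mean helper and the match values are hypothetical; in practice the input would be one team's matches sorted by date.

```r
# Trailing rolling mean over the last k values; NA until k values exist.
rolling_mean <- function(x, k) {
  out <- rep(NA_real_, length(x))
  if (length(x) < k) return(out)
  for (i in k:length(x)) {
    out[i] <- mean(x[(i - k + 1):i])
  }
  out
}

# Made-up per-match xG for one team, already in match order.
match_xg <- data.frame(
  match_no = 1:6,
  xg       = c(1.2, 0.8, 2.1, 1.5, 0.9, 1.8)
)

match_xg$xg_roll3 <- rolling_mean(match_xg$xg, k = 3)
print(match_xg)
```

A trailing window only looks backwards, which matters if the rolling value later becomes a model feature: a centered window would leak information from future matches.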
Best Practices (Caching, Rate Limits, Reliability)
- Respect rate limits: add pauses between requests; avoid aggressive scraping.
- Cache locally: save raw pulls to disk so you don't re-download every time.
- Expect HTML changes: scraping-based tools can break when sites change layouts.
- Separate download vs analysis: it makes debugging and reproducibility easier.
- Document your versions: keep session info and package versions for long projects.
# Simple caching idea:
# saveRDS(epl_shooting, "data_raw/epl_shooting_2024_25.rds")
# epl_shooting <- readRDS("data_raw/epl_shooting_2024_25.rds")
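The caching idea can be wrapped in a small helper so every download goes through the same path. This is a sketch, not part of worldfootballR: cached_pull and its arguments are hypothetical names, and a stand-in fetch function replaces the real network call so the example runs offline.

```r
# Return cached data if the file exists; otherwise run fetch(), save, return.
# Set refresh = TRUE to force a new download.
cached_pull <- function(path, fetch, refresh = FALSE) {
  if (file.exists(path) && !refresh) {
    return(readRDS(path))
  }
  data <- fetch()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(data, path)
  data
}

# Stand-in for a real download; counts how often it is actually called.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  data.frame(squad = c("Arsenal", "Chelsea"), shots = c(10, 8))
}

cache_file <- file.path(tempdir(), "data_raw", "epl_shooting_demo.rds")
unlink(cache_file)  # start the demo from a clean state

first  <- cached_pull(cache_file, fake_fetch)  # runs fake_fetch and saves
second <- cached_pull(cache_file, fake_fetch)  # reads from disk
message("fetch was called ", calls, " time(s)")
```

In a real script the fetch argument would be something like function() fb_season_team_stats(...), and the path would live under data_raw/.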
Troubleshooting
1) "Function not found" or errors after install
- Restart the R session, then run library(worldfootballR) again.
- Update packages: update.packages(ask = FALSE).
- Try the GitHub version if CRAN is behind.
2) Empty outputs / missing seasons
- Double-check season_end_year and league tier.
- Try a known completed season first.
- Sources sometimes change table availability mid-season.
3) Rate limit / blocked requests
- Slow down requests and cache results.
- Avoid large loops without delays.
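For loops over many URLs or seasons, a wrapper that pauses between calls and retries failures with a growing wait is one way to stay polite. The sketch below is an assumption about how you might structure it: polite_map, pause, and max_tries are hypothetical names, and a simulated flaky function stands in for a real request.

```r
# Call f on each input with a pause between calls. Failed calls are retried
# with an increasing wait. Inputs that never succeed are left as NULL.
polite_map <- function(inputs, f, pause = 3, max_tries = 3) {
  results <- vector("list", length(inputs))
  for (i in seq_along(inputs)) {
    for (attempt in seq_len(max_tries)) {
      res <- tryCatch(f(inputs[[i]]), error = function(e) e)
      if (!inherits(res, "error")) {
        results[i] <- list(res)
        break
      }
      message("Attempt ", attempt, " failed for input ", i, ": ",
              conditionMessage(res))
      if (attempt < max_tries) Sys.sleep(pause * attempt)
    }
    if (i < length(inputs)) Sys.sleep(pause)
  }
  results
}

# Simulated request that fails once for input "b", then succeeds.
attempts <- 0
flaky <- function(x) {
  attempts <<- attempts + 1
  if (x == "b" && attempts < 3) stop("simulated timeout")
  toupper(x)
}

out <- polite_map(c("a", "b"), flaky, pause = 0.1)
print(out)
```

The short pause here only keeps the demo fast; for real sites use a pause of several seconds and combine this with caching so each page is requested once.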
FAQ
Is worldfootballR on CRAN?
Yes. You can install worldfootballR from CRAN. For the newest features, use the GitHub version.
What's the best data source for modeling?
FBref is great for structured season stats; Understat is best for xG and shot-level detail. Many projects combine both.
Can I use this for betting models?
You can use the data to build predictive models, but outcomes are uncertain and data sources can change. Focus on reproducible evaluation (backtesting), and respect each site's terms.
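To make "backtesting" concrete, here is a minimal walk-forward sketch: each match is predicted using only matches that came before it, and two naive predictors are compared by mean absolute error. The data is made up, and walk_forward_mae is a hypothetical helper, not a package function.

```r
# Made-up match history for one team, in chronological order.
matches <- data.frame(
  matchday = 1:8,
  goals    = c(1, 0, 3, 1, 2, 1, 1, 2),
  xg       = c(1.4, 0.9, 2.0, 1.1, 1.7, 0.8, 1.5, 1.2)
)

# Walk forward: for each match after a warm-up period, ask predict_next(i)
# for a prediction of match i, then score against what actually happened.
# predict_next must only use rows before i, otherwise the test leaks.
walk_forward_mae <- function(actual, predict_next, min_train = 4) {
  idx   <- (min_train + 1):length(actual)
  preds <- vapply(idx, predict_next, numeric(1))
  mean(abs(actual[idx] - preds))
}

# Predictor A: average goals scored in all previous matches.
mae_goals <- walk_forward_mae(
  matches$goals,
  function(i) mean(matches$goals[seq_len(i - 1)])
)

# Predictor B: average xG created in all previous matches.
mae_xg <- walk_forward_mae(
  matches$goals,
  function(i) mean(matches$xg[seq_len(i - 1)])
)

print(c(past_goals = mae_goals, past_xg = mae_xg))
```

With eight matches the comparison says nothing reliable about which predictor is better; the point is the evaluation shape, which stays the same when you swap in a real model and a full season of data.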
How do I match teams/players across sources?
Create consistent keys early (team names + season + league) and be careful with naming variations. When possible, use stable identifiers such as URLs.
Next Steps
If you want to go beyond "data pulling" into real analytics projects, do this next:
- Build a reproducible pipeline (raw → clean → outputs).
- Create 2-3 reusable functions (download, clean, plot).
- Add one simple baseline model and evaluate it properly.
- Then iterate: features, priors (Bayesian), and backtesting.

