
How to Collect and Clean Sports Data in R (FBref, Sports-Reference, ESPN + rvest Tutorial)

This hands-on tutorial walks through acquiring sports data for analysis in R, either by downloading CSVs from public sources or by pulling data programmatically with APIs and web scraping. You’ll learn how to load raw data, deal with common “messiness” (types, missing values, inconsistent formats), and produce a tidy dataset that’s ready for analysis and visualization.

What you’ll learn

  • Where to find structured sports data (FBref, Sports-Reference, ESPN, and league sites).
  • How to import CSVs and spreadsheets into R with reliable types.
  • How to scrape public tables with rvest (ethically and robustly).
  • How to standardize team/player names, dates, and scores.
  • How to handle missing values and outliers thoughtfully.
  • How to validate, document, and save your cleaned dataset.

Prerequisites

Make sure you have R (≥ 4.2), an IDE such as RStudio (or Posit Cloud), and the following packages:

install.packages(c(
  "tidyverse",   # readr, dplyr, tidyr, ggplot2, stringr, lubridate
  "janitor",     # clean_names(), tabyl()
  "rvest",       # scraping HTML
  "httr2",       # optional: working with HTTP APIs
  "jsonlite",    # JSON parsing
  "fs",          # file/path helpers (used for caching)
  "skimr",       # quick summary statistics for validation
  "arrow",       # optional: Parquet/Feather for fast IO
  "vroom",       # optional: fast CSV ingest
  "readxl"       # optional: Excel files
))

Data sources (overview)

  • FBref (soccer): detailed tables for leagues, players, and matches.
  • Sports-Reference (basketball-reference, baseball-reference, etc.): season totals, game logs, and advanced stats.
  • ESPN: schedules, box scores, standings; often JSON-backed endpoints.
  • League/Team sites: often provide schedules and basic stats in CSV or JSON formats.

Note: Always review each site’s terms of use. Prefer official downloads or documented APIs. Cache raw files locally to avoid repeated requests.

Project structure

.
├── data_raw/        # unmodified downloads (CSV/JSON/HTML)
├── data_interim/    # lightly processed, standardized types
├── data_clean/      # fully cleaned, tidy tables
├── R/               # functions (R scripts)
└── notebooks/       # exploratory scripts / Rmd
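
You can create this layout from R and pair it with a small helper that downloads a raw file only if it is not already cached locally. This is a minimal sketch; the URL and filename below are placeholders:

library(fs)

# Create the project folders (no-op if they already exist)
dir_create(c("data_raw", "data_interim", "data_clean", "R", "notebooks"))

# Download a raw file into data_raw/ only once, then reuse the local copy
cache_download <- function(url, dest_name) {
  dest <- path("data_raw", dest_name)
  if (!file_exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Hypothetical source URL; the result is the same file used in Method 1 below
cache_download(
  "https://example.com/stats/epl_matches_2023_24.csv",
  "epl_matches_2023_24.csv"
)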

Method 1 — Downloading CSVs and spreadsheets

If a site provides CSV/Excel downloads, prefer these—they’re stable and reproducible.

Example: Importing CSV with type hints

library(tidyverse)
library(lubridate)
library(janitor)

# Suppose you downloaded a match stats CSV to data_raw/
raw_path <- "data_raw/epl_matches_2023_24.csv"

# Define column types explicitly when possible (prevents accidental character types)
colspec <- cols(
  match_id = col_character(),
  date = col_date(format = ""),   # "" uses readr's flexible YMD parser; pass e.g. "%d/%m/%Y" for other layouts
  home_team = col_character(),
  away_team = col_character(),
  home_goals = col_integer(),
  away_goals = col_integer(),
  attendance = col_double(),
  venue = col_character()
)

matches_raw <- readr::read_csv(raw_path, col_types = colspec) |>
  janitor::clean_names()

glimpse(matches_raw)

Handling common CSV quirks

  • Thousand separators: use locale(grouping_mark = ",") in readr.
  • Non-UTF-8 encodings: use locale(encoding = "latin1") or "UTF-8".
  • Dates as text: parse with lubridate::ymd(), mdy(), etc.
  • “NA” strings: specify them with na = c("", "NA", "N/A", "-") in read_csv() (a combined example of these options follows).
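
Putting several of these options together in one call (the exact arguments depend on your file, so treat this as a sketch with a placeholder path):

readr::read_csv(
  "data_raw/some_messy_file.csv",          # placeholder path
  na = c("", "NA", "N/A", "-"),            # strings to treat as missing
  locale = readr::locale(
    grouping_mark = ",",                   # "12,345" -> 12345
    encoding = "latin1",                   # only if the file is not UTF-8
    date_format = "%d/%m/%Y"               # day/month/year date columns
  )
) |>
  janitor::clean_names()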

Method 2 — Scraping public tables with rvest

Use scraping only when data isn’t provided as a download or API. Be respectful: add delays, identify your user agent, and cache results.

Basic table scrape

library(rvest)
library(tidyverse)
library(janitor)

url <- "https://example.com/league/table"   # replace with a real table URL
html <- read_html(url)

# Many stat sites use HTML tables
tbl <- html |>
  html_element("table") |>     # or a more specific CSS selector
  html_table() |>
  as_tibble() |>
  janitor::clean_names()

# Example: rename ambiguous columns and coerce types
league_tbl <- tbl |>
  rename(
    team = 1,  # first column is team name
    matches = matches_played,
    gd = goal_difference
  ) |>
  mutate(
    # html_table() may return these as text; coerce safely either way
    across(c(matches, gd, points), ~ readr::parse_integer(as.character(.x)))
  )

league_tbl |> slice(1:10)
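
The snippet above fetches the page with rvest's defaults; to identify yourself, as suggested earlier, one option is to fetch the HTML with httr2 (setting a user agent) and hand the text to rvest. A sketch, not the only way to do it:

library(httr2)

page_html <- request(url) |>
  req_user_agent("your-email@example.com (sports data research)") |>  # identify yourself
  req_perform() |>
  resp_body_string() |>
  read_html()

# From here, extract and clean the table exactly as above
tbl <- page_html |>
  html_element("table") |>
  html_table() |>
  as_tibble() |>
  janitor::clean_names()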

Multiple pages, polite delays, caching

library(purrr)
library(fs)

dir_create("data_raw/cache")

fetch_table <- function(page_url, cache_name, sleep_sec = 2) {
  cache_path <- file.path("data_raw/cache", cache_name)
  if (file_exists(cache_path)) {
    return(readr::read_rds(cache_path))
  }
  # Polite delay
  Sys.sleep(sleep_sec)
  message("Fetching: ", page_url)
  out <- read_html(page_url) |>
    html_element("table") |>
    html_table() |>
    as_tibble() |>
    janitor::clean_names()
  readr::write_rds(out, cache_path)
  out
}

pages <- paste0("https://example.com/matches?page=", 1:5)
tables <- map2(pages, paste0("matches_", 1:5, ".rds"), fetch_table)
matches_raw <- bind_rows(tables)

Tip: If the site loads data via JavaScript, inspect network requests in your browser developer tools—there may be a JSON endpoint you can use instead of parsing HTML.

Method 3 — Working with JSON/undocumented APIs

Some sites expose JSON endpoints used by their frontends. When allowed by terms, you can request those directly. Handle headers and retries; store responses locally.

library(httr2)
library(jsonlite)
library(tidyverse)

endpoint <- "https://example.com/api/matches?season=2023"

req <- request(endpoint) |>
  req_user_agent("your-email@example.com data research") |>  # be transparent
  req_headers("Accept" = "application/json") |>
  req_retry(max_tries = 3)                                    # retry transient failures

resp <- req |> req_perform()   # req_perform() errors automatically on HTTP 4xx/5xx

json <- resp |> resp_body_string()
dat <- jsonlite::fromJSON(json, flatten = TRUE) |> as_tibble()

dat |> glimpse()
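
To store the response locally, as recommended above, write the raw JSON to data_raw/ before parsing. The filename is just an example:

# Save the raw JSON so later runs can skip the network request
writeLines(json, "data_raw/matches_2023.json")

# Re-read later without re-querying the API
dat <- jsonlite::fromJSON("data_raw/matches_2023.json", flatten = TRUE) |>
  as_tibble()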

Cleaning messy sports data

Raw sports data is notoriously inconsistent: names, team codes, dates, scores, and missing values vary by source. Build a small toolbox of cleaning steps and apply them consistently.

1) Standardize column names

library(janitor)
cleaned <- matches_raw |> janitor::clean_names()
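
A quick illustration of what clean_names() does to typical stat-site headers (the column names here are invented):

messy <- tibble(`Team Name` = "Arsenal", `Goals For` = 88, `Matches Played` = 38)
janitor::clean_names(messy) |> names()
# "team_name"  "goals_for"  "matches_played"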

2) Convert scores and stats to numeric

cleaned <- cleaned |>
  mutate(
    # parse_* functions expect text; skip this step if col_types already set numeric types
    home_goals = parse_integer(as.character(home_goals)),
    away_goals = parse_integer(as.character(away_goals)),
    attendance = parse_number(as.character(attendance))
  )

3) Parse dates and times

cleaned <- cleaned |>
  mutate(date = lubridate::ymd(date))
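
If a source mixes date formats (say, 2023-08-12 in some rows and 12/08/2023 in others), ymd() alone will produce NAs. A more forgiving alternative to the step above is lubridate::parse_date_time() with several candidate orders; the orders below are an assumption about what the data contains:

cleaned <- cleaned |>
  mutate(
    date = lubridate::as_date(
      lubridate::parse_date_time(date, orders = c("ymd", "dmy"))
    )
  )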

4) Handle missing values and outliers

# Count missing values
sapply(cleaned, function(x) sum(is.na(x)))

# Option 1: drop rows with NA in critical columns
cleaned <- cleaned |> filter(!is.na(home_goals), !is.na(away_goals))

# Option 2: impute or replace with defaults (context-specific)
cleaned <- cleaned |> mutate(attendance = replace_na(attendance, 0))
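
Outliers deserve the same care: flag implausible values for review before deciding whether to drop or correct them. The thresholds below are illustrative, not rules:

# Rows worth a manual look: impossible scorelines or implausible attendance
suspect <- cleaned |>
  filter(home_goals > 10 | away_goals > 10 | attendance > 120000)

suspect |>
  select(match_id, date, home_team, away_team, home_goals, away_goals, attendance)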

5) Resolve duplicates

cleaned <- cleaned |> distinct(match_id, .keep_all = TRUE)
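
Duplicates and failed joins often trace back to inconsistent team or player names across sources (“Man Utd” vs “Manchester United”). A small lookup table of canonical names is a lightweight fix; the mappings below are examples:

# Map source spellings to one canonical name per team (extend as you find variants)
team_lookup <- tribble(
  ~raw_name,        ~canonical,
  "Man Utd",        "Manchester United",
  "Manchester Utd", "Manchester United",
  "Spurs",          "Tottenham Hotspur"
)

cleaned <- cleaned |>
  left_join(team_lookup, by = c("home_team" = "raw_name")) |>
  mutate(home_team = coalesce(canonical, home_team)) |>
  select(-canonical)
# Repeat the join for away_team (or write a helper that handles both columns)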

6) Validate and save clean data

skimr::skim(cleaned)     # overview stats
readr::write_csv(cleaned, "data_clean/epl_matches_2023_24.csv")
arrow::write_parquet(cleaned, "data_clean/epl_matches_2023_24.parquet")
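
Documenting the result is the last part of this step: a simple data dictionary saved next to the clean files is usually enough. The description column is left for you to fill in:

# Minimal data dictionary: one row per column of the clean table
dictionary <- tibble(
  column      = names(cleaned),
  type        = purrr::map_chr(cleaned, ~ class(.x)[1]),
  description = ""   # fill in by hand: meaning, units, source, caveats
)

readr::write_csv(dictionary, "data_clean/epl_matches_2023_24_dictionary.csv")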

Next steps

With your data standardized and cleaned, you can now move on to descriptive analytics (summaries, plots), advanced modeling, or dashboarding. Clean input is the foundation of all reliable sports analytics.
