This hands-on tutorial walks through acquiring sports data for analysis in R—either by downloading CSVs from public sources or by using R-based APIs and web scraping. You’ll learn how to load raw data, deal with common “messiness” (types, missing values, inconsistent formats), and produce a tidy dataset that’s ready for analysis and visualization.
What you’ll learn
- Where to find structured sports data (FBref, Sports-Reference, ESPN, and league sites).
- How to import CSVs and spreadsheets into R with reliable types.
- How to scrape public tables with rvest (ethically and robustly).
- How to standardize team/player names, dates, and scores.
- How to handle missing values and outliers thoughtfully.
- How to validate, document, and save your cleaned dataset.
Prerequisites
Make sure you have R (≥ 4.2), an IDE such as RStudio (from Posit), and the following packages:
install.packages(c(
"tidyverse", # readr, dplyr, tidyr, ggplot2, stringr, lubridate, purrr
"janitor", # clean_names(), tabyl(), get_dupes()
"rvest", # scraping HTML
"httr2", # optional: working with HTTP APIs
"jsonlite", # JSON parsing
"fs", # file-system helpers (used below for caching)
"skimr", # quick data summaries (used below for validation)
"arrow", # optional: Parquet/Feather for fast IO
"vroom", # optional: fast CSV ingest
"readxl" # optional: Excel files
))
Data sources (overview)
- FBref (soccer): detailed tables for leagues, players, and matches.
- Sports-Reference (basketball-reference, baseball-reference, etc.): season totals, game logs, and advanced stats.
- ESPN: schedules, box scores, standings; often JSON-backed endpoints.
- League/Team sites: often provide schedules and basic stats in CSV or JSON formats.
Note: Always review each site’s terms of use. Prefer official downloads or documented APIs. Cache raw files locally to avoid repeated requests.
Project structure
.
├── data_raw/ # unmodified downloads (CSV/JSON/HTML)
├── data_interim/ # lightly processed, standardized types
├── data_clean/ # fully cleaned, tidy tables
├── R/ # functions (R scripts)
└── notebooks/ # exploratory scripts / Rmd
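To create this layout from R, one option is the sketch below (fs is only a convenience; base dir.create() with recursive = TRUE works too):

fs::dir_create(c("data_raw", "data_interim", "data_clean", "R", "notebooks"))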
Method 1 — Downloading CSVs and spreadsheets
If a site provides CSV/Excel downloads, prefer these—they’re stable and reproducible.
Example: Importing CSV with type hints
library(tidyverse)
library(lubridate)
library(janitor)
# Suppose you downloaded a match stats CSV to data_raw/
raw_path <- "data_raw/epl_matches_2023_24.csv"
# Define column types explicitly when possible (prevents accidental character types)
colspec <- cols(
match_id = col_character(),
date = col_date(format = ""),
home_team = col_character(),
away_team = col_character(),
home_goals = col_integer(),
away_goals = col_integer(),
attendance = col_double(),
venue = col_character()
)
matches_raw <- readr::read_csv(raw_path, col_types = colspec) |>
janitor::clean_names()
glimpse(matches_raw)
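Some sources publish spreadsheets rather than CSVs. A minimal sketch with readxl (the file name and sheet are hypothetical):

shots_raw <- readxl::read_excel("data_raw/shots_2023_24.xlsx", sheet = 1) |>
  janitor::clean_names()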
Handling common CSV quirks
- Thousand separators: use locale(grouping_mark = ",") in readr.
- Non-UTF-8 encodings: use locale(encoding = "latin1") or "UTF-8".
- Dates as text: parse with lubridate::ymd(), mdy(), etc.
- "NA" strings: identify with na = c("", "NA", "N/A", "-") in read_csv().
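A minimal sketch combining several of these options (the file path and column names are hypothetical):

stats_raw <- readr::read_csv(
  "data_raw/player_stats.csv",
  na = c("", "NA", "N/A", "-"),                              # treat these strings as missing
  col_types = readr::cols(attendance = readr::col_number()), # col_number() respects the grouping mark
  locale = readr::locale(grouping_mark = ",", encoding = "latin1")
)

# Dates stored as text such as "14/08/2023" can then be parsed explicitly
stats <- stats_raw |>
  dplyr::mutate(match_date = lubridate::dmy(match_date))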
Method 2 — Scraping public tables with rvest
Use scraping only when data isn’t provided as a download or API. Be respectful: add delays, identify your user agent, and cache results.
Basic table scrape
library(rvest)
library(tidyverse)
library(janitor)
url <- "https://example.com/league/table" # replace with a real table URL
html <- read_html(url)
# Many stat sites use HTML tables
tbl <- html |>
html_element("table") |> # or a more specific CSS selector
html_table() |>
as_tibble() |>
janitor::clean_names()
# Example: rename ambiguous columns and coerce types
league_tbl <- tbl |>
rename(
team = 1, # first column is team name
matches = matches_played,
gd = goal_difference
) |>
mutate(
# html_table() may already have converted some columns; as.character() makes parsing safe either way
matches = readr::parse_integer(as.character(matches)),
gd = readr::parse_integer(as.character(gd)),
points = readr::parse_integer(as.character(points))
)
league_tbl |> slice(1:10)
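If you want to identify yourself with a custom user agent, as suggested above, one option is to fetch the page with httr2 and hand the body to rvest. A minimal sketch using the same placeholder URL:

library(httr2)

resp <- request(url) |>
  req_user_agent("your-email@example.com data research") |> # identify yourself
  req_perform()

league_html <- resp |> resp_body_html() # parse the response body as HTML
league_raw <- league_html |>
  html_element("table") |>
  html_table() |>
  as_tibble() |>
  janitor::clean_names()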
Multiple pages, polite delays, caching
library(purrr)
library(fs)
dir_create("data_raw/cache")
fetch_table <- function(page_url, cache_name, sleep_sec = 2) {
cache_path <- file.path("data_raw/cache", cache_name)
if (file_exists(cache_path)) {
return(readr::read_rds(cache_path))
}
# Polite delay
Sys.sleep(sleep_sec)
message("Fetching: ", page_url)
out <- read_html(page_url) |>
html_element("table") |>
html_table() |>
as_tibble() |>
janitor::clean_names()
readr::write_rds(out, cache_path)
out
}
pages <- paste0("https://example.com/matches?page=", 1:5)
tables <- map2(pages, paste0("matches_", 1:5, ".rds"), fetch_table)
matches_raw <- bind_rows(tables)
Tip: If the site loads data via JavaScript, inspect network requests in your browser developer tools—there may be a JSON endpoint you can use instead of parsing HTML.
Method 3 — Working with JSON/undocumented APIs
Some sites expose JSON endpoints used by their frontends. When allowed by terms, you can request those directly. Handle headers and retries; store responses locally.
library(httr2)
library(jsonlite)
library(tidyverse)
endpoint <- "https://example.com/api/matches?season=2023"
req <- request(endpoint) |>
req_user_agent("your-email@example.com data research") |> # be transparent
req_headers("Accept" = "application/json")
resp <- req |> req_perform() # httr2 errors automatically on HTTP failures; resp_check_status() can re-check if needed
json <- resp |> resp_body_string()
dat <- jsonlite::fromJSON(json, flatten = TRUE) |> as_tibble() # assumes the payload is an array of match records
dat |> glimpse()
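For flaky endpoints, httr2 can retry transient failures, and keeping the raw response on disk avoids repeated requests. A minimal sketch (the output path is an assumption for this project layout):

resp <- request(endpoint) |>
  req_user_agent("your-email@example.com data research") |>
  req_headers("Accept" = "application/json") |>
  req_retry(max_tries = 3) |> # retry transient failures (e.g. 429, 503)
  req_perform()

# Keep the raw JSON with the other unmodified downloads
writeLines(resp_body_string(resp), "data_raw/matches_2023.json")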
Cleaning messy sports data
Raw sports data is notoriously inconsistent: names, team codes, dates, scores, and missing values vary by source. Build a small toolbox of cleaning steps and apply them consistently.
1) Standardize column names
library(janitor)
cleaned <- matches_raw |> janitor::clean_names()
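Team and player names also vary by source ("Man Utd" vs "Manchester United"). A small lookup table applied consistently handles most aliases; a minimal sketch (the alias map below is hypothetical):

team_lookup <- c(
  "Man Utd"    = "Manchester United",
  "Man United" = "Manchester United",
  "Spurs"      = "Tottenham Hotspur",
  "Wolves"     = "Wolverhampton Wanderers"
)

standardize_team <- function(x) {
  x <- stringr::str_squish(x)                # trim stray whitespace
  unname(dplyr::coalesce(team_lookup[x], x)) # map known aliases, keep everything else as-is
}

cleaned <- cleaned |>
  mutate(
    home_team = standardize_team(home_team),
    away_team = standardize_team(away_team)
  )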
2) Convert scores and stats to numeric
cleaned <- cleaned |>
mutate(
# scraped columns often arrive as character; as.character() guards against types already converted
home_goals = parse_integer(as.character(home_goals)),
away_goals = parse_integer(as.character(away_goals)),
attendance = parse_number(as.character(attendance)) # parse_number() also handles "12,345"
)
3) Parse dates and times
cleaned <- cleaned |>
mutate(date = lubridate::ymd(date))
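If dates arrive in mixed formats across sources, lubridate::parse_date_time() can try several orders instead of a single ymd() call. A minimal sketch:

cleaned <- cleaned |>
  mutate(date = as.Date(lubridate::parse_date_time(date, orders = c("ymd", "dmy", "mdy"))))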
4) Handle missing values
# Count missing values
sapply(cleaned, function(x) sum(is.na(x)))
# Option 1: drop rows with NA in critical columns
cleaned <- cleaned |> filter(!is.na(home_goals), !is.na(away_goals))
# Option 2: impute or replace with defaults (context-specific)
cleaned <- cleaned |> mutate(attendance = replace_na(attendance, 0))
5) Resolve duplicates
cleaned <- cleaned |> distinct(match_id, .keep_all = TRUE)
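To inspect duplicates before discarding them, janitor's get_dupes() lists every row that shares a key:

cleaned |> janitor::get_dupes(match_id) # review duplicated match_id rows first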
6) Validate and save clean data
skimr::skim(cleaned) # overview stats
readr::write_csv(cleaned, "data_clean/epl_matches_2023_24.csv")
arrow::write_parquet(cleaned, "data_clean/epl_matches_2023_24.parquet")
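Alongside skim(), a few explicit assertions (run before writing) catch problems loudly in scripts. A minimal sketch, with rules that are assumptions for this dataset:

stopifnot(
  !any(duplicated(cleaned$match_id)),                     # one row per match
  all(cleaned$home_goals >= 0 & cleaned$away_goals >= 0), # no negative scores
  all(!is.na(cleaned$date))                               # every match has a date
)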
Next steps
With your data standardized and cleaned, you can now move on to descriptive analytics (summaries, plots), advanced modeling, or dashboarding. Clean input is the foundation of all reliable sports analytics.