This post accompanies and expands on the book listed here: Basketball Analytics with R (Product Page)
1) Setup & Data Ingestion
We will use hoopR (part of the SportsDataverse) to access ESPN-based schedules, box scores, and play-by-play (pbp) data, along with tidyverse for wrangling and ggplot2 for visuals.
# One-time installs (uncomment as needed)
# install.packages(c("tidyverse", "lubridate", "ggplot2", "ggforce", "gt"))
# install.packages("hoopR") # or: remotes::install_github("sportsdataverse/hoopR")
library(tidyverse)
library(lubridate)
library(ggplot2)
library(ggforce)
library(gt)
library(hoopR)
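As a rough sketch of the ingestion step, the helper below pulls one season of play-by-play, team box scores, and the schedule through hoopR's `load_mbb_*` loaders. It assumes men's college basketball (the NBA equivalents are `load_nba_pbp()` and friends), and `load_season()` is a hypothetical wrapper name, not part of hoopR.

```r
# Sketch only: requires the hoopR package and a network connection.
# load_season() is a hypothetical convenience wrapper, not a hoopR function.
load_season <- function(season = 2025) {
  if (!requireNamespace("hoopR", quietly = TRUE)) {
    stop("Install hoopR first: install.packages('hoopR')")
  }
  list(
    pbp      = hoopR::load_mbb_pbp(seasons = season),       # play-by-play
    team_box = hoopR::load_mbb_team_box(seasons = season),  # team box scores
    schedule = hoopR::load_mbb_schedule(seasons = season)   # schedule
  )
}

# Usage (not run here because it downloads data):
# season_data <- load_season(2025)
# dplyr::glimpse(season_data$pbp)
```

Column names in the returned tables follow ESPN's feed, so inspect them with `glimpse()` before wiring them into the later steps.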
2) Cleaning and Tidy Structures
Play-by-play streams often require standardization: consistent team labels, seconds remaining, possession inference, and categorical event tags.
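clean_pbp <- function(df){
  # Assumes df has clock_display_value ("MM:SS") and an event_type column
  # coded as "made2", "made3", "miss2", "miss3"; adapt the codes to your feed.
  df %>%
    mutate(
      # Seconds left in the current period; NA when the clock is not "MM:SS"
      sec_left = ifelse(grepl(":", clock_display_value),
                        as.integer(sub(":.*", "", clock_display_value)) * 60 +
                          as.integer(sub(".*:", "", clock_display_value)),
                        NA_integer_),
      # Categorical event tags
      is_shot  = event_type %in% c("made2", "made3", "miss2", "miss3"),
      is_make  = event_type %in% c("made2", "made3"),
      is_three = event_type %in% c("made3", "miss3"),
      # Points scored on the event (only field goals are tagged here)
      value = dplyr::case_when(
        event_type == "made3" ~ 3L,
        event_type == "made2" ~ 2L,
        TRUE ~ 0L
      )
    )
}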
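To check the clock parsing and event tagging on something small, here is a minimal sketch on a three-row toy table. `parse_clock()` is a hypothetical helper that mirrors the "MM:SS" logic above, and the toy rows are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: "MM:SS" clock string -> seconds left in the period.
# Strings without a colon (for example "45.3") become NA.
parse_clock <- function(clock) {
  has_colon <- grepl(":", clock, fixed = TRUE)
  mins <- suppressWarnings(as.integer(sub(":.*", "", clock)))
  secs <- suppressWarnings(as.integer(sub(".*:", "", clock)))
  ifelse(has_colon, mins * 60L + secs, NA_integer_)
}

# Made-up rows, not real play-by-play
toy_pbp <- tibble(
  clock_display_value = c("12:34", "0:05", "45.3"),
  event_type          = c("made3", "miss2", "made2")
)

toy_clean <- toy_pbp %>%
  mutate(
    sec_left = parse_clock(clock_display_value),
    is_make  = event_type %in% c("made2", "made3"),
    value    = case_when(
      event_type == "made3" ~ 3L,
      event_type == "made2" ~ 2L,
      TRUE ~ 0L
    )
  )

toy_clean
```

The third row shows why the NA branch matters: sub-minute clocks reported as decimals need their own handling before they feed a model.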
3) Lineups and On/Off Impact
To study lineup effectiveness, aggregate play-by-play into stints (a stint is a continuous stretch with the same five players on the floor). Then compute net rating: points scored minus points allowed, per 100 possessions.
# Example outline (build_stints_from_subs() is not defined in this post)
# stints <- build_stints_from_subs(pbp_clean)
# ratings <- stints %>%
#   group_by(team_id, lineup_id) %>%
#   summarise(pts_for     = sum(value_for),
#             pts_against = sum(value_against),
#             poss        = sum(possessions),
#             .groups     = "drop") %>%
#   mutate(net_rating = 100 * (pts_for - pts_against) / poss)
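The aggregation step can be tried without a stint builder. The sketch below uses a made-up stint table with the same column names as the outline (`value_for`, `value_against`, `possessions`); the numbers are illustrative only.

```r
library(dplyr)

# Made-up stints: lineup "B" appears in two separate stints
toy_stints <- tibble(
  team_id       = c(1, 1, 1),
  lineup_id     = c("A", "B", "B"),
  value_for     = c(30, 10, 12),
  value_against = c(20, 15, 10),
  possessions   = c(25, 10, 10)
)

toy_ratings <- toy_stints %>%
  group_by(team_id, lineup_id) %>%
  summarise(
    pts_for     = sum(value_for),
    pts_against = sum(value_against),
    poss        = sum(possessions),
    .groups     = "drop"
  ) %>%
  # Net rating: point differential per 100 possessions
  mutate(net_rating = 100 * (pts_for - pts_against) / poss)

toy_ratings
```

Lineup A outscores opponents by 10 over 25 possessions (net rating 40), while lineup B is minus 3 over 20 possessions (net rating -15). With samples this small, real lineup ratings are very noisy, so a minimum-possessions filter is usually worth adding.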
4) Shot Charts and Efficiency Maps
With xy-coordinates relative to the basket, you can render shot charts and hex-efficiency maps in ggplot2.
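draw_half_court <- function(){
  # Dimensions in feet, origin at the midpoint of the baseline.
  # annotate() and inherit.aes = FALSE draw each shape once instead of once per shot.
  list(
    annotate("rect", xmin = -25, xmax = 25, ymin = 0, ymax = 47,
             fill = NA, color = "#2dd4bf"),                      # half-court outline
    geom_circle(aes(x0 = x0, y0 = y0, r = r),
                data = data.frame(x0 = 0, y0 = 5.25, r = 0.75),
                inherit.aes = FALSE, color = "#2dd4bf"),         # rim
    annotate("segment", x = -3, xend = 3, y = 4, yend = 4,
             color = "#2dd4bf")                                  # backboard
  )
}
# pbp_shots is assumed to hold one row per shot with columns x, y, shot_made
ggplot(pbp_shots, aes(x = x, y = y)) +
  draw_half_court() +
  geom_point(aes(color = factor(shot_made))) +
  coord_fixed()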
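For the hex-efficiency map mentioned above, one possible approach is ggplot2's `stat_summary_hex()`, which averages a `z` value inside each hexagon. The sketch below runs on simulated shots (not real data), assumes a 0/1 `shot_made` column, and needs the hexbin package installed to render.

```r
library(ggplot2)

# Simulated shots for illustration only: make probability falls with distance
set.seed(1)
n <- 400
toy_shots <- data.frame(x = runif(n, -25, 25), y = runif(n, 0, 35))
shot_dist <- sqrt(toy_shots$x^2 + (toy_shots$y - 5.25)^2)  # feet from the rim
toy_shots$shot_made <- rbinom(n, 1, prob = plogis(1 - 0.08 * shot_dist))

# Mean of shot_made per hexagon = field-goal percentage in that cell
hex_map <- ggplot(toy_shots, aes(x = x, y = y, z = shot_made)) +
  stat_summary_hex(fun = mean, bins = 12) +
  scale_fill_viridis_c(name = "FG%", labels = scales::label_percent()) +
  coord_fixed()

# print(hex_map)  # rendering requires the hexbin package
```

Cells with only a handful of attempts will show extreme percentages, so in practice it helps to drop or fade hexagons below some attempt threshold.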
5) Pace, Four Factors, and Possessions
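four_factors <- function(team_box){
  # Assumes one row per team-game with team and opponent ("Opp") box totals.
  # MinutesPlayed is assumed to be game length in minutes (40 in regulation),
  # not the sum of player minutes.
  team_box %>%
    mutate(
      # Possessions: average of the team and opponent estimates
      poss = 0.5 * ((FGA + 0.44 * FTA + TOV - ORB) +
                    (OppFGA + 0.44 * OppFTA + OppTOV - OppORB)),
      pace = 40 * poss / MinutesPlayed,      # possessions per 40 minutes
      eFG  = (FGM + 0.5 * `3PM`) / FGA,      # effective field-goal percentage
      TOVR = TOV / poss,                     # turnovers per possession
      ORR  = ORB / (ORB + OppDRB),           # offensive rebound rate
      FTR  = FTA / FGA                       # free-throw rate
    )
}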
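A worked example helps sanity-check these formulas. The box-score line below is invented for illustration, and the block repeats the relevant formulas inline so it runs on its own.

```r
library(dplyr)

# Invented single-game box score (team and opponent totals)
toy_box <- tibble(
  FGA = 60, FGM = 27, `3PM` = 8, FTA = 20, TOV = 12, ORB = 10,
  OppFGA = 58, OppFTA = 25, OppTOV = 14, OppORB = 8, OppDRB = 26,
  MinutesPlayed = 40
)

toy_ff <- toy_box %>%
  mutate(
    # Team estimate: 60 + 8.8 + 12 - 10 = 70.8; opponent: 58 + 11 + 14 - 8 = 75
    poss = 0.5 * ((FGA + 0.44 * FTA + TOV - ORB) +
                  (OppFGA + 0.44 * OppFTA + OppTOV - OppORB)),
    pace = 40 * poss / MinutesPlayed,
    eFG  = (FGM + 0.5 * `3PM`) / FGA,
    ORR  = ORB / (ORB + OppDRB)
  )

toy_ff %>% select(poss, pace, eFG, ORR)
```

Here the two possession estimates average to 72.9, which is also the pace because the game lasted exactly 40 minutes; eFG is 31/60 and ORR is 10/36.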
6) Win Probability and In-Game Leverage
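# states: assumed to hold one row per game state with win (0/1), margin,
# possession, and sec_left. poly() rejects missing values, so drop rows
# with NA sec_left before fitting.
wp_fit <- glm(win ~ margin + possession + poly(sec_left, 2),
              data = states, family = binomial())
states <- states %>%
  mutate(wp = predict(wp_fit, newdata = states, type = "response"))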
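To see the model run end to end, the sketch below fits the same formula to simulated game states. Everything here is an assumption for illustration: the data are random draws, `sec_left` is treated as seconds left in a 40-minute game, and the simulated relationship between margin and winning is invented, not estimated from real games.

```r
# Simulated game states (illustrative only, not real data)
set.seed(42)
n <- 500
toy_states <- data.frame(
  margin     = round(rnorm(n, mean = 0, sd = 8)),  # score margin for the team
  possession = rbinom(n, 1, 0.5),                  # 1 if the team has the ball
  sec_left   = sample(0:2400, n, replace = TRUE)   # seconds left in the game
)

# Invented data-generating process: a lead matters more as time runs out
time_weight <- 0.05 + 0.15 * (1 - toy_states$sec_left / 2400)
p_win <- plogis(toy_states$margin * time_weight + 0.1 * toy_states$possession)
toy_states$win <- rbinom(n, 1, p_win)

# Same model form as above: logistic regression on margin, possession, and time
toy_fit <- glm(win ~ margin + possession + poly(sec_left, 2),
               data = toy_states, family = binomial())
toy_states$wp <- predict(toy_fit, newdata = toy_states, type = "response")

# Compare a 5-point lead with a 5-point deficit at the same clock
scenario <- data.frame(margin = c(5, -5), possession = 1, sec_left = 600)
scenario$wp <- predict(toy_fit, newdata = scenario, type = "response")
scenario
```

Note that this additive formula cannot capture the way a lead grows in value late in the game; an interaction such as `margin:sec_left` is one possible extension.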
7) Scouting Reports with Reproducible R Markdown
# Example YAML header (Quarto)
# ---
# title: "Opponent Scouting Report"
# params:
#   team: "Example University"
#   date_from: "2025-01-01"
#   date_to: "2025-02-01"
# format: pdf
# ---
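Inside the report, the values declared under `params` are available to R chunks as a list named `params` (with the knitr engine). The sketch below imitates that list by hand and filters a made-up game log to the scouted team and date window; the `games` table and its columns are hypothetical.

```r
library(dplyr)

# Stand-in for the `params` list that knitr builds from the YAML header
params <- list(
  team      = "Example University",
  date_from = "2025-01-01",
  date_to   = "2025-02-01"
)

# Hypothetical game log
games <- tibble(
  team      = c("Example University", "Example University", "Other College"),
  game_date = as.Date(c("2025-01-15", "2025-03-01", "2025-01-20")),
  pts       = c(71, 80, 65)
)

# Keep only the scouted team's games inside the requested window
report_games <- games %>%
  filter(
    team == params$team,
    game_date >= as.Date(params$date_from),
    game_date <= as.Date(params$date_to)
  )

report_games

# Rendering with different parameters from the command line
# (scouting.qmd is a hypothetical file name):
#   quarto render scouting.qmd -P team:"Other College"
```

Keeping the team and dates in `params` means the same document can be re-rendered for each opponent without editing any code.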
8) Where to Go Next
If you enjoyed this tutorial, check out the full resource here:
Basketball Analytics with R: From hoopR Data to Winning Insights
