Football Analytics with R NFL Data Science using nflfastR and the nflverse

Football Analytics with R: NFL Data Science using nflfastR and the nflverse

A hands-on, reproducible guide to loading NFL play-by-play, computing EPA/success, win probability, and fourth-down decisions — all with tidy R workflows.

  1. Why this guide
  2. Setup: packages & data
  3. Load play-by-play (multiple seasons)
  4. Team analytics: EPA/play & success rate
  5. QB analytics: EPA/play vs CPOE
  6. Game win-probability charts
  7. Fourth-down decisions (go/punt/FG) — a simple EV model
  8. Production tips & reproducibility
  9. FAQ

Why this guide

NFL play-by-play (PBP) data enables powerful, interpretable metrics: Expected Points Added (EPA), success rate, win probability (WP), and fourth-down decision models. This tutorial shows a clean path to:

  • Load several seasons of PBP into a tidy tibble.
  • Compute team- and player-level efficiency (EPA/play, success%).
  • Plot win-probability timelines for any game.
  • Build a lightweight EV framework for fourth-down “go vs punt vs FG”.

Setup: packages & data

We’ll use the nflverse ecosystem. Recent workflows favor nflreadr for loading PBP, while historical code may use nflfastR. We’ll show both where useful.

# Install once (uncomment if needed)
# install.packages(c("tidyverse","nflreadr","nflplotR","lubridate","gt","ggrepel"))
# install.packages("nflfastR")  # optional; many helpers live here historically

library(tidyverse)
library(nflreadr)  # preferred loader for PBP, rosters, schedules
library(nflplotR)  # handy NFL logos/colors ggplot helpers
library(lubridate)
library(ggrepel)
# library(nflfastR) # optional; keep if you need legacy helpers

Tip: If you hit a column name mismatch across seasons, standardize with janitor::clean_names().

Load play-by-play (multiple seasons)

Pull two recent regular seasons as an example. Adjust the vector for your analysis window.

seasons <- 2023:2024  # adjust as needed
pbp <- nflreadr::load_pbp(seasons)

# quick sanity: core fields we will use
pbp %>%
  select(season, week, game_id, posteam, defteam, play_type, epa, success, wp, yardline_100,
         down, ydstogo, qtr, time_remaining = half_seconds_remaining) %>%
  glimpse()

PBP rows include EPA and success flags precomputed for most plays, plus WP on many snapshots. We’ll filter out non-plays and penalties for clean rate stats.

plays <- pbp %>%
  filter(play_type %in% c("run","pass")) %>%
  filter(!is.na(epa))  # drop rows without EPA

Team analytics: EPA/play & success rate

Compute season-level offensive efficiency (EPA/play) and success rate by team, plus pass/run splits.

team_off <- plays %>%
  group_by(season, posteam) %>%
  summarise(
    n_plays = n(),
    epa_per_play = mean(epa, na.rm = TRUE),
    success_rate = mean(success, na.rm = TRUE),
    pass_rate = mean(play_type == "pass"),
    .groups = "drop"
  ) %>%
  arrange(desc(epa_per_play))

head(team_off)

Visualize league-wide EPA/play by offense. We add logos via nflplotR::geom_nfl_logos().

library(ggplot2)
latest <- max(team_off$season)

p <- team_off %>%
  filter(season == latest) %>%
  ggplot(aes(x = reorder(posteam, epa_per_play), y = epa_per_play)) +
  geom_col(fill = "grey85") +
  nflplotR::geom_nfl_logos(aes(team_abbr = posteam), width = 0.06) +
  coord_flip() +
  labs(title = glue::glue("Offensive EPA/play — {latest}"),
       x = NULL, y = "EPA per play") +
  theme_minimal(base_size = 12)

p

QB analytics: EPA/play vs CPOE

A common scatter shows passing EPA/play vs. CPOE (Completion Percentage Over Expectation). Many PBP tables include cpoe on pass plays. We’ll aggregate by passer.

qb_pass <- pbp %>%
  filter(pass == 1, !is.na(passer_player_name)) %>%
  group_by(season, passer_player_name, posteam) %>%
  summarise(
    dropbacks = n(),
    epa_per_db = mean(epa, na.rm = TRUE),
    cpoe = mean(cpoe, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(dropbacks >= 200)  # sample size floor

qb_year <- qb_pass %>% filter(season == latest)

ggplot(qb_year, aes(cpoe, epa_per_db, label = passer_player_name)) +
  geom_hline(yintercept = 0, linetype = 2, color = "grey70") +
  geom_vline(xintercept = 0, linetype = 2, color = "grey70") +
  geom_point(size = 2.5) +
  ggrepel::geom_text_repel(size = 3, max.overlaps = 20) +
  labs(
    title = glue::glue("QB efficiency — {latest}"),
    x = "CPOE (avg)", y = "EPA per dropback"
  ) +
  theme_minimal(base_size = 12)

Quadrants: top-right = accurate & efficient; bottom-left = concerns in both accuracy and value.

Game win-probability charts

Plot WP over game time for a selected matchup. Columns vary by season; if home_wp exists, prefer it. Otherwise fall back to wp and label carefully.

# pick a game id (e.g., first game of the latest season)
gid <- pbp %>% filter(season == latest) %>% pull(game_id) %>% .[1]

g <- pbp %>% filter(game_id == gid)

# choose a WP column
wp_col <- if ("home_wp" %in% names(g)) "home_wp" else "wp"

g %>%
  mutate(elapsed = 3600 - game_seconds_remaining,
         wp_use = .data[[wp_col]]) %>%
  ggplot(aes(elapsed/60, wp_use)) +
  geom_line(size = 1) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = paste0("Win probability — game ", gid),
    x = "Minutes elapsed", y = "Win probability"
  ) +
  theme_minimal(base_size = 12)

Enrich with scoring annotations (filter scoring plays and use geom_label()).

Fourth-down decisions (go/punt/FG) — a simple EV model

Build a first-pass expected value (EV) comparison among go for it, field goal, and punt. We’ll estimate:

  1. Go EV = P(convert) × EP(after first down) + (1 − P) × EP(opponent after turnover on downs)
  2. FG EV = P(FG good) × 3 + (1 − P) × EP(opponent after miss)
  3. Punt EV = EP(opponent after net punt)

We’ll fit quick empirical models for P(convert) and P(FG good) by distance/yardline.

# 1) Estimate 4th-down conversion probability (historical)
fourth_hist <- pbp %>%
  filter(down == 4, ydstogo > 0, play_type %in% c("run","pass")) %>%
  mutate(conv = if_else(first_down == 1, 1, 0)) %>%
  select(yardline_100, ydstogo, conv)

fit_go <- glm(conv ~ yardline_100 + ydstogo, data = fourth_hist,
              family = binomial())

# 2) Estimate field goal make probability by distance
fg_hist <- pbp %>%
  filter(play_type == "field_goal", !is.na(kick_distance)) %>%
  mutate(make = if_else(field_goal_result == "made", 1, 0)) %>%
  select(kick_distance, make)

fit_fg <- glm(make ~ splines::ns(kick_distance, df = 3),
              data = fg_hist, family = binomial())

# Helper: expected points (EP) by yardline and offense/defense
ep_table <- pbp %>%
  filter(!is.na(epa), !is.na(yardline_100), down %in% 1:4) %>%
  group_by(yardline_100, posteam) %>%
  summarise(ep_mean = mean(ep, na.rm = TRUE), .groups = "drop") %>%
  group_by(yardline_100) %>% summarise(ep_offense = mean(ep_mean), .groups = "drop")

# Quick function to get EP after change of possession (approximate)
get_ep_defense <- function(yardline_100_after_change) {
  # offense flips; defense's yardline ~ 100 - yardline
  y <- pmax(pmin(100 - yardline_100_after_change, 100), 1)
  approx(ep_table$yardline_100, ep_table$ep_offense, xout = y, rule = 2)$y
}

# Evaluate a specific 4th down situation:
situation <- tibble(yardline_100 = 45, ydstogo = 2)  # e.g., opp 45, 4th & 2

p_go <- predict(fit_go, situation, type = "response")
p_fg <- predict(fit_fg, tibble(kick_distance = 63 - situation$yardline_100), type = "response")  # rough: LOS+~18

# EP after 1st down (advance ~ ydstogo + 1 as a crude proxy)
ep_after_first <- approx(ep_table$yardline_100, ep_table$ep_offense,
                         xout = pmax(situation$yardline_100 - (situation$ydstogo + 1), 1), rule = 2)$y
# EP to opponent after turnover on downs (ball near current LOS)
ep_after_fail_def <- get_ep_defense(situation$yardline_100)

# EP after FG miss (touchback-ish proxy)
ep_after_fg_miss_def <- get_ep_defense(pmin(situation$yardline_100 + 8, 80))

# EP after punt (net ~40 yards proxy)
ep_after_punt_def <- get_ep_defense(pmin(situation$yardline_100 + 40, 90))

ev_go  <- p_go * ep_after_first + (1 - p_go) * ep_after_fail_def
ev_fg  <- p_fg * 3 + (1 - p_fg) * ep_after_fg_miss_def
ev_pnt <- ep_after_punt_def

tibble(
  choice = c("Go for it","Field Goal","Punt"),
  EV = c(ev_go, ev_fg, ev_pnt)
) %>%
  arrange(desc(EV))

This is intentionally simple (and fast). Improve by conditioning EP on down/distance, using more features for conversion/make models, and accounting for time/score leverage.

Production tips & reproducibility

  • Cache raw pulls (arrow parquet or RDS) by season to speed iteration.
  • Parameterize season(s) and team; render multiple reports via quarto or targets.
  • Sanity checks: compare team EPA/play with trusted dashboards each week.
  • Document data provenance and version your code (Git + a public repo for credibility).

FAQ

Do I need both nflreadr and nflfastR?

nflreadr is the modern loader for PBP, schedules, and rosters. You can keep nflfastR if you rely on legacy helpers — both work together. What seasons are available?

Most recent seasons have full PBP coverage with EPA/WP. Earlier seasons may have differences; test small pulls first. How do I adapt this to college football?

Look for college-oriented loaders within the nflverse ecosystem or compatible packages; the workflow is similar once PBP is tidy.


Want more? Find practical R e-books and notebooks for sports analytics at RProgrammingBooks.com.

Leave a Comment

Your email address will not be published. Required fields are marked *