Setup
This is a fully reproducible, code-heavy walkthrough for building a football betting model in R:
data → features → model → probabilities → value bets → bankroll rules → backtest.
If you’re new to worldfootballR, start with Install & Use worldfootballR and keep the worldfootballR Guide open as a reference.
# Core packages
install.packages(c(
"dplyr","tidyr","purrr","stringr","lubridate",
"readr","ggplot2","tibble","janitor","glue"
))
# Modeling + evaluation
install.packages(c("broom","rsample","yardstick","scoringRules","pROC"))
# Optional (for speed / nicer tables)
install.packages(c("data.table","DT"))
# Football data
# worldfootballR is usually installed from GitHub
# See: https://rprogrammingbooks.com/install-use-worldfootballr/
# If needed:
# install.packages("remotes")
# remotes::install_github("JaseZiv/worldfootballR")
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(lubridate)
library(readr)
library(ggplot2)
library(janitor)
library(glue)
# worldfootballR (uncomment after install)
# library(worldfootballR)
set.seed(2026)
Get match data
The easiest path is to pull historical match results from public sources.
With worldfootballR, you can often scrape league seasons from sources like FBref.
The specific function names can vary by package version/source, so below you’ll see:
- Option A: Pull data with worldfootballR directly (recommended).
- Option B: Use your own CSV export if you already have data.
Option A — Pull league season data with worldfootballR
If you need help with installation and authentication quirks, see: Install & Use worldfootballR.
# --- Option A (worldfootballR) ---
# The exact workflow depends on the data source (FBref / other).
# Typical pattern:
# 1) Get competition URLs
# 2) Pull match results for seasons
#
# PSEUDOCODE (adjust per your worldfootballR version):
#
# comp_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2026)
# epl_url <- comp_urls %>% filter(str_detect(Competition_Name, "Premier League")) %>% pull(Seasons_Urls) %>% .[1]
#
# matches <- fb_match_results(season_url = epl_url) %>%
# janitor::clean_names()
#
# head(matches)
Option B — Use a CSV export (works everywhere)
Your data needs (minimum): date, home_team, away_team, home_goals, away_goals.
Save it as matches.csv.
# --- Option B (CSV) ---
matches <- readr::read_csv("matches.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
filter(!is.na(home_goals), !is.na(away_goals)) %>%
arrange(date)
dplyr::glimpse(matches)
Standardize columns
# Make sure we have standardized column names
matches <- matches %>%
transmute(
date = as.Date(date),
season = if_else(month(date) >= 7, year(date) + 1L, year(date)), # football season heuristic
home = as.character(home_team),
away = as.character(away_team),
hg = as.integer(home_goals),
ag = as.integer(away_goals)
) %>%
filter(!is.na(date), !is.na(home), !is.na(away), !is.na(hg), !is.na(ag)) %>%
arrange(date)
# Basic sanity checks
stopifnot(all(matches$hg >= 0), all(matches$ag >= 0))
matches %>% count(season) %>% arrange(desc(season)) %>% print(n = 50)
Feature engineering
For a baseline Poisson model, we’ll estimate team strength with two parameters per team — attack and defense — plus a single home-advantage term. We’ll also build rolling form features as optional enhancements.
Long format for modeling
# pivot_longer() consumes the home/away columns, so keep copies first —
# otherwise the opp lookup below would reference columns that no longer exist
long <- matches %>%
  mutate(match_id = row_number(), home_team = home, away_team = away) %>%
  tidyr::pivot_longer(
    cols = c(home, away),
    names_to = "side",
    values_to = "team"
  ) %>%
  mutate(
    opp      = if_else(side == "home", away_team, home_team),
    goals    = if_else(side == "home", hg, ag),
    conceded = if_else(side == "home", ag, hg),
    is_home  = as.integer(side == "home")
  ) %>%
  select(match_id, date, season, team, opp, is_home, goals, conceded)
head(long)
Optional: rolling “form” features
These can help a bit, but don’t overfit. Keep them simple and always validate out-of-sample.
# Rolling averages of goals scored/conceded over the previous N matches (per team).
# Note the lag(): the rolling window must end BEFORE the current match,
# otherwise the feature leaks the result it is meant to help predict.
add_form_features <- function(df, n = 5) {
  df %>%
    arrange(team, date, match_id) %>%
    group_by(team) %>%
    mutate(
      gf_roll = lag(zoo::rollapplyr(goals,    width = n, FUN = mean, fill = NA, partial = TRUE)),
      ga_roll = lag(zoo::rollapplyr(conceded, width = n, FUN = mean, fill = NA, partial = TRUE))
    ) %>%
    ungroup()
}
# install.packages("zoo") if needed
# library(zoo)
# long <- add_form_features(long, n = 5)
Model 1: Poisson goals (baseline)
We’ll fit two Poisson regressions: one for home goals and one for away goals, with team attack/defense effects and a home advantage term. A standard approach is:
home_goals ~ home_adv + attack(home) + defense(away)
away_goals ~ attack(away) + defense(home)
To avoid identifiability issues, we’ll set a baseline team as reference via factor levels.
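To see concretely what the reference level does (the team names here are made up for illustration), R's default treatment contrasts drop the first factor level:

```r
# Treatment contrasts: the first level becomes the baseline with coefficient 0
teams_demo <- factor(c("Arsenal", "Chelsea", "Leeds"))
contrasts(teams_demo)
# The "Arsenal" row is all zeros: Arsenal is the reference team, and every
# other attack/defense coefficient is interpreted relative to it
```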
Train/test split by time (realistic for betting)
# Time-based split (e.g., last 20% of matches as test)
n_total <- nrow(matches)
cut_idx <- floor(n_total * 0.80)
train <- matches %>% slice(1:cut_idx)
test <- matches %>% slice((cut_idx + 1):n_total)
# Ensure consistent factor levels
teams <- sort(unique(c(matches$home, matches$away)))
train <- train %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
test <- test %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
# Fit models
home_mod <- glm(hg ~ 1 + home + away, data = train, family = poisson())
away_mod <- glm(ag ~ 1 + away + home, data = train, family = poisson())
summary(home_mod)
summary(away_mod)
Attack/Defense parameterization (more interpretable)
# Build a modeling frame in the classic attack/defense form
# We model home goals:
# log(lambda_home) = home_adv + attack_home - defense_away
# And away goals:
# log(lambda_away) = attack_away - defense_home
# Create team factors
train2 <- train %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
# We'll encode attack and defense as separate factors by prefixing labels
mk_attack <- function(team) factor(paste0("att_", team), levels = paste0("att_", teams))
mk_def <- function(team) factor(paste0("def_", team), levels = paste0("def_", teams))
train_home <- train2 %>%
transmute(
goals = hg,
is_home = 1L,
att = mk_attack(home),
def = mk_def(away)
)
train_away <- train2 %>%
transmute(
goals = ag,
is_home = 0L,
att = mk_attack(away),
def = mk_def(home)
)
train_long <- bind_rows(train_home, train_away)
# Fit a single Poisson model with:
# goals ~ is_home + att + def
# Note: to reflect "- defense" we can include def and allow coefficients to learn direction;
# For stricter structure you can re-code defense sign, but this works well in practice.
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
summary(ad_mod)
Predict expected goals (lambda) for each match
predict_lambdas <- function(df, model, teams) {
df2 <- df %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams),
att_home = factor(paste0("att_", home), levels = paste0("att_", teams)),
def_away = factor(paste0("def_", away), levels = paste0("def_", teams)),
att_away = factor(paste0("att_", away), levels = paste0("att_", teams)),
def_home = factor(paste0("def_", home), levels = paste0("def_", teams))
)
# home lambda
new_home <- df2 %>%
transmute(is_home = 1L, att = att_home, def = def_away)
# away lambda
new_away <- df2 %>%
transmute(is_home = 0L, att = att_away, def = def_home)
lam_home <- predict(model, newdata = new_home, type = "response")
lam_away <- predict(model, newdata = new_away, type = "response")
df2 %>%
mutate(lambda_home = lam_home, lambda_away = lam_away)
}
test_pred <- predict_lambdas(test, ad_mod, teams)
head(test_pred)
Model 2: Dixon–Coles adjustment (improves low scores)
The independent Poisson assumption can under/overestimate probabilities for low scores (0–0, 1–0, 0–1, 1–1). Dixon–Coles introduces a small correction factor. Below is a clean implementation.
# Dixon-Coles tau adjustment for low-score dependence
tau_dc <- function(x, y, lam_x, lam_y, rho) {
# x = home goals, y = away goals
# rho is the dependence parameter
if (x == 0 && y == 0) return(1 - (lam_x * lam_y * rho))
if (x == 0 && y == 1) return(1 + (lam_x * rho))
if (x == 1 && y == 0) return(1 + (lam_y * rho))
if (x == 1 && y == 1) return(1 - rho)
return(1)
}
# Scoreline probability matrix up to max_goals
score_matrix <- function(lam_h, lam_a, rho = 0, max_goals = 10) {
xs <- 0:max_goals
ys <- 0:max_goals
ph <- dpois(xs, lam_h)
pa <- dpois(ys, lam_a)
# outer product for independent probabilities
P <- outer(ph, pa)
# apply DC tau correction
for (i in seq_along(xs)) {
for (j in seq_along(ys)) {
P[i, j] <- P[i, j] * tau_dc(xs[i], ys[j], lam_h, lam_a, rho)
}
}
# renormalize
P / sum(P)
}
# Example
P_ex <- score_matrix(lam_h = 1.4, lam_a = 1.1, rho = 0.05, max_goals = 8)
round(P_ex[1:5,1:5], 4)
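As a hand-check of the correction (same assumed lambdas as the example above), the two shrunk cells can be computed directly from the tau definition:

```r
# tau for the 0-0 and 1-1 cells at lam_h = 1.4, lam_a = 1.1, rho = 0.05
lam_h <- 1.4; lam_a <- 1.1; rho <- 0.05
tau_00 <- 1 - lam_h * lam_a * rho  # 0.923: 0-0 made slightly less likely
tau_11 <- 1 - rho                  # 0.950: 1-1 shrunk as well
c(tau_00 = tau_00, tau_11 = tau_11)
```

With a positive rho the mass removed from 0-0 and 1-1 shifts toward 1-0 and 0-1.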
How do we choose rho? You can estimate it by maximizing likelihood on training data.
Here’s a lightweight optimizer:
# Estimate rho by maximizing log-likelihood on train set given lambdas
train_pred <- predict_lambdas(train, ad_mod, teams)
dc_loglik <- function(rho, df, max_goals = 10) {
# clamp rho to a reasonable range to avoid numerical issues
rho <- max(min(rho, 0.3), -0.3)
ll <- 0
for (k in seq_len(nrow(df))) {
lam_h <- df$lambda_home[k]
lam_a <- df$lambda_away[k]
hg <- df$hg[k]
ag <- df$ag[k]
P <- score_matrix(lam_h, lam_a, rho = rho, max_goals = max_goals)
# if score exceeds max_goals, treat as tiny prob (or increase max_goals)
if (hg > max_goals || ag > max_goals) {
ll <- ll + log(1e-12)
} else {
ll <- ll + log(P[hg + 1, ag + 1] + 1e-15)
}
}
ll
}
opt <- optimize(
f = function(r) -dc_loglik(r, train_pred, max_goals = 10),
interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
rho_hat
From scorelines to 1X2 probabilities
Once you have a scoreline probability matrix P, compute:
- Home win: sum of probabilities where home goals > away goals
- Draw: sum of diagonal
- Away win: sum where home goals < away goals
p1x2_from_matrix <- function(P) {
  # Rows index home goals (0..max_g), columns away goals, so a home win is
  # everything below the diagonal and an away win everything above it
  tibble(
    p_home = sum(P[lower.tri(P)]),
    p_draw = sum(diag(P)),
    p_away = sum(P[upper.tri(P)])
  )
}
predict_1x2 <- function(df, rho = 0, max_goals = 10) {
out <- vector("list", nrow(df))
for (k in seq_len(nrow(df))) {
P <- score_matrix(df$lambda_home[k], df$lambda_away[k], rho = rho, max_goals = max_goals)
out[[k]] <- p1x2_from_matrix(P)
}
bind_rows(out)
}
test_1x2 <- bind_cols(
test_pred,
predict_1x2(test_pred, rho = rho_hat, max_goals = 10)
)
test_1x2 %>%
select(date, home, away, hg, ag, lambda_home, lambda_away, p_home, p_draw, p_away) %>%
head(10)
Odds, implied probabilities & value
Betting decisions should be driven by expected value (EV).
If the market offers decimal odds O and your model probability is p, then:
- Expected value per unit stake: EV = p*(O - 1) - (1 - p)
- Value condition: p > 1/O (equivalently, EV > 0)
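A quick worked example with assumed numbers — a 45% model probability against decimal odds of 2.40:

```r
p <- 0.45   # model probability
o <- 2.40   # decimal odds on offer
ev <- p * (o - 1) - (1 - p)  # 0.45 * 1.40 - 0.55 = 0.08
p > 1 / o                    # TRUE: 0.45 > 0.4167, so this is a value bet
```

An 8% edge per unit staked — comfortably above the 2% buffer used in the bet-selection filters later on.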
You’ll typically have an odds feed. Below we assume you have a file odds.csv:
date, home, away, odds_home, odds_draw, odds_away.
odds <- readr::read_csv("odds.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(
date,
home = as.character(home),
away = as.character(away),
o_home = as.numeric(odds_home),
o_draw = as.numeric(odds_draw),
o_away = as.numeric(odds_away)
)
df <- test_1x2 %>%
mutate(home = as.character(home), away = as.character(away)) %>%
left_join(odds, by = c("date","home","away"))
# Implied probs (no vig removal yet)
df <- df %>%
mutate(
imp_home = 1 / o_home,
imp_draw = 1 / o_draw,
imp_away = 1 / o_away,
overround = imp_home + imp_draw + imp_away
)
# Simple vig removal by normalization
df <- df %>%
mutate(
mkt_home = imp_home / overround,
mkt_draw = imp_draw / overround,
mkt_away = imp_away / overround
)
# EV per 1 unit stake
ev <- function(p, o) p*(o - 1) - (1 - p)
df <- df %>%
mutate(
ev_home = ev(p_home, o_home),
ev_draw = ev(p_draw, o_draw),
ev_away = ev(p_away, o_away)
)
df %>%
select(date, home, away, p_home, p_draw, p_away, o_home, o_draw, o_away, ev_home, ev_draw, ev_away) %>%
head(10)
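A quick numeric check of the vig-removal step, with assumed odds of 2.10 / 3.40 / 3.60:

```r
imp <- 1 / c(2.10, 3.40, 3.60)  # implied probabilities
sum(imp)                        # ~1.048: the ~4.8% overround is the bookmaker margin
imp / sum(imp)                  # normalized "fair" probabilities summing to 1
```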
Pick bets with thresholds (avoid noise)
# Practical filters: require at least some edge and avoid tiny probabilities
EDGE_MIN <- 0.02 # 2% EV edge
P_MIN <- 0.05 # avoid extreme longshots unless you model them well
df_bets <- df %>%
mutate(
pick = case_when(
ev_home == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "H",
ev_draw == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "D",
TRUE ~ "A"
),
p_pick = case_when(pick == "H" ~ p_home, pick == "D" ~ p_draw, TRUE ~ p_away),
o_pick = case_when(pick == "H" ~ o_home, pick == "D" ~ o_draw, TRUE ~ o_away),
ev_pick = case_when(pick == "H" ~ ev_home, pick == "D" ~ ev_draw, TRUE ~ ev_away)
) %>%
filter(!is.na(o_pick)) %>%
filter(p_pick >= P_MIN, ev_pick >= EDGE_MIN)
df_bets %>% count(pick)
Backtest: flat stake vs Kelly
Backtesting is where most people fool themselves. Use a time-based split, realistic bet selection rules, and conservative bankroll sizing.
Compute bet results
# Outcome label
df_bets <- df_bets %>%
mutate(
result = case_when(
hg > ag ~ "H",
hg == ag ~ "D",
TRUE ~ "A"
),
win = as.integer(pick == result)
)
# Profit per 1 unit stake
df_bets <- df_bets %>%
mutate(
profit_flat = if_else(win == 1, o_pick - 1, -1)
)
df_bets %>% summarise(
n_bets = n(),
hit_rate = mean(win),
avg_odds = mean(o_pick),
roi_flat = mean(profit_flat)
)
Kelly staking (fractional Kelly recommended)
Full Kelly is often too aggressive. Use fractional Kelly (e.g., 0.25 Kelly). For a deeper treatment of Kelly and Bayesian uncertainty, see Bayesian Sports Analytics.
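Before the implementation, a hand calculation with assumed numbers — a 55% model probability at even money (decimal odds 2.0):

```r
p <- 0.55; o <- 2.0
b <- o - 1                       # net odds
f_full <- (b * p - (1 - p)) / b  # full Kelly: 0.10 (10% of bankroll)
f_quarter <- 0.25 * f_full       # quarter Kelly: 0.025 (2.5% of bankroll)
```

Even a healthy 5% edge at even money only justifies 2.5% of bankroll at quarter Kelly — a useful intuition for how conservative sensible staking looks.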
kelly_fraction <- function(p, o) {
# Decimal odds o. Net odds b = o - 1
b <- o - 1
q <- 1 - p
f <- (b*p - q) / b
pmax(0, f)
}
KELLY_MULT <- 0.25 # fractional Kelly
df_bets <- df_bets %>%
mutate(
f_kelly = kelly_fraction(p_pick, o_pick),
stake_kelly = KELLY_MULT * f_kelly
)
# Simulate bankroll
simulate_bankroll <- function(df, start_bankroll = 100, stake_col = "stake_kelly") {
br <- start_bankroll
path <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
stake <- df[[stake_col]][i]
stake_amt <- br * stake
# Profit = stake_amt*(o-1) if win else -stake_amt
prof <- if (df$win[i] == 1) stake_amt*(df$o_pick[i] - 1) else -stake_amt
br <- br + prof
path[i] <- br
}
path
}
df_bets <- df_bets %>% arrange(date)
# "Flat" staking here means a constant 1% of the CURRENT bankroll per bet
# (the simulator stakes br * stake, so this is proportional, not fixed-unit)
df_bets <- df_bets %>% mutate(stake_flat = 0.01)
df_bets$br_flat <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_flat")
df_bets$br_kelly <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_kelly")
tail(df_bets %>% select(date, home, away, pick, o_pick, win, br_flat, br_kelly), 10)
Plot bankroll curves
plot_df <- df_bets %>%
select(date, br_flat, br_kelly) %>%
pivot_longer(cols = c(br_flat, br_kelly), names_to = "strategy", values_to = "bankroll")
ggplot(plot_df, aes(x = date, y = bankroll, group = strategy)) +
geom_line() +
labs(x = NULL, y = "Bankroll", title = "Backtest Bankroll: Flat vs Fractional Kelly")
Calibration diagnostics
Profit is noisy. Calibration tells you if probabilities are sensible. A great proper scoring rule for 1X2 is the multi-class log loss. We’ll compute log loss for your 1X2 probabilities on the test set.
# Create a probability matrix and truth labels
df_eval <- df %>%
filter(!is.na(p_home), !is.na(p_draw), !is.na(p_away)) %>%
mutate(
truth = case_when(hg > ag ~ "H", hg == ag ~ "D", TRUE ~ "A"),
truth = factor(truth, levels = c("H","D","A"))
)
# Log loss (manual)
log_loss_1x2 <- function(pH, pD, pA, y) {
eps <- 1e-15
p <- ifelse(y=="H", pH, ifelse(y=="D", pD, pA))
-mean(log(pmax(p, eps)))
}
ll <- log_loss_1x2(df_eval$p_home, df_eval$p_draw, df_eval$p_away, df_eval$truth)
ll
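For context, a hypothetical yardstick (the frequencies here are assumed, not derived from this data): a constant forecast at typical league-wide rates of roughly 45% home / 27% draw / 28% away. If outcomes arrived at exactly those rates, its expected log loss would be:

```r
base <- c(H = 0.45, D = 0.27, A = 0.28)  # assumed long-run 1X2 frequencies
ll_baseline <- -sum(base * log(base))    # ~1.069
ll_baseline
```

A model whose test-set log loss beats this constant baseline is adding information beyond the base rates.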
Reliability plot (binning)
# Example: calibration for HOME-win probability
calib_home <- df_eval %>%
mutate(bin = ntile(p_home, 10)) %>%
group_by(bin) %>%
summarise(
p_mean = mean(p_home),
freq = mean(truth == "H"),
n = n(),
.groups = "drop"
)
ggplot(calib_home, aes(x = p_mean, y = freq)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Predicted P(Home win)", y = "Observed frequency", title = "Calibration (Home win)")
Production: weekly pipeline
Here’s a practical skeleton you can run weekly: fetch latest matches → refit/refresh → generate next fixtures probabilities → compare with odds.
# 1) Load historical matches (from worldfootballR or your CSV)
matches <- readr::read_csv("matches.csv") %>% janitor::clean_names()
# 2) Train up to cutoff date (e.g., yesterday)
cutoff_date <- Sys.Date() - 1
hist <- matches %>%
mutate(date = as.Date(date)) %>%
filter(date <= cutoff_date) %>%
transmute(date, home = home_team, away = away_team, hg = home_goals, ag = away_goals) %>%
arrange(date)
teams <- sort(unique(c(hist$home, hist$away)))
# 3) Fit attack/defense model
train_long <- bind_rows(
hist %>%
transmute(goals = hg, is_home = 1L,
att = factor(paste0("att_", home), levels = paste0("att_", teams)),
def = factor(paste0("def_", away), levels = paste0("def_", teams))),
hist %>%
transmute(goals = ag, is_home = 0L,
att = factor(paste0("att_", away), levels = paste0("att_", teams)),
def = factor(paste0("def_", home), levels = paste0("def_", teams)))
)
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
# 4) Predict for upcoming fixtures (you need a fixtures table)
fixtures <- readr::read_csv("fixtures.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(date, home = home_team, away = away_team)
# Compute lambdas
fixtures2 <- fixtures %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
fixtures_pred <- predict_lambdas(
df = fixtures2 %>% mutate(hg = 0L, ag = 0L), # placeholders
model = ad_mod,
teams = teams
)
# 5) Estimate rho (optional) on recent history only (faster)
hist_pred <- hist %>% mutate(home = factor(home, levels=teams), away=factor(away, levels=teams))
hist_pred <- predict_lambdas(hist_pred, ad_mod, teams)
opt <- optimize(
  f = function(r) -dc_loglik(r, hist_pred, max_goals = 10),
  interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
# 6) Convert to 1X2
fixtures_1x2 <- bind_cols(
fixtures_pred,
predict_1x2(fixtures_pred, rho = rho_hat, max_goals = 10)
) %>%
select(date, home, away, lambda_home, lambda_away, p_home, p_draw, p_away)
write_csv(fixtures_1x2, "model_probs.csv")
At this point you have model_probs.csv ready to merge with bookmaker odds and produce a bet shortlist.
FAQ
Is a Poisson model “good enough” for football betting?
It’s a strong baseline. It captures team strength and home advantage with minimal complexity. Many upgrades (xG, time decay, Bayesian partial pooling) improve robustness, but the baseline can already be useful.
How do I avoid overfitting?
Use time-based validation, keep features simple, and prioritize calibration and log loss. Don’t tune thresholds using the same data you evaluate on.
What’s the simplest value-bet rule?
Bet only when p_model > p_implied and you have a buffer (e.g. EV > 2%),
then stake conservatively (flat 0.5%–1% bankroll or fractional Kelly).
Where do I learn more advanced Bayesian sports models in R?
If you want Bayesian approaches, uncertainty-aware staking, and a deeper treatment of the Kelly criterion, see Bayesian Sports Analytics.
Further reading
- Install & Use worldfootballR (setup + troubleshooting)
- worldfootballR Guide (scraping + workflows)
- Sports Analytics with R (methods hub)
- Bayesian Sports Analytics (advanced modeling + Kelly)

