Setup
This is a fully reproducible, code-heavy walkthrough for building a football betting model in R:
data → features → model → probabilities → value bets → bankroll rules → backtest.
If you’re new to worldfootballR, start with Install & Use worldfootballR and keep the worldfootballR Guide open as a reference.
# Core packages
install.packages(c(
"dplyr","tidyr","purrr","stringr","lubridate",
"readr","ggplot2","tibble","janitor","glue"
))
# Modeling + evaluation
install.packages(c("broom","rsample","yardstick","scoringRules","pROC"))
# Optional (for speed / nicer tables)
install.packages(c("data.table","DT"))
# Football data
# worldfootballR is usually installed from GitHub
# See: https://rprogrammingbooks.com/install-use-worldfootballr/
# If needed:
# install.packages("remotes")
# remotes::install_github("JaseZiv/worldfootballR")
library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(lubridate)
library(readr)
library(ggplot2)
library(janitor)
library(glue)
# worldfootballR (uncomment after install)
# library(worldfootballR)
set.seed(2026)
Get match data
The easiest path is to pull historical match results from public sources.
With worldfootballR, you can often scrape league seasons from sources like FBref.
The specific function names can vary by package version/source, so below you’ll see:
- Option A: Pull data with worldfootballR directly (recommended).
- Option B: Use your own CSV export if you already have data.
Option A — Pull league season data with worldfootballR
If you need help with installation and authentication quirks, see: Install & Use worldfootballR.
# --- Option A (worldfootballR) ---
# The exact workflow depends on the data source (FBref / other).
# Typical pattern:
# 1) Get competition URLs
# 2) Pull match results for seasons
#
# PSEUDOCODE (adjust per your worldfootballR version):
#
# comp_urls <- fb_league_urls(country = "ENG", gender = "M", season_end_year = 2026)
# epl_url <- comp_urls %>% filter(str_detect(Competition_Name, "Premier League")) %>% pull(Seasons_Urls) %>% .[1]
#
# matches <- fb_match_results(season_url = epl_url) %>%
# janitor::clean_names()
#
# head(matches)
Option B — Use a CSV export (works everywhere)
Your data needs (minimum): date, home_team, away_team, home_goals, away_goals.
Save it as matches.csv.
# --- Option B (CSV) ---
matches <- readr::read_csv("matches.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
filter(!is.na(home_goals), !is.na(away_goals)) %>%
arrange(date)
dplyr::glimpse(matches)
Standardize columns
# Make sure we have standardized column names
matches <- matches %>%
transmute(
date = as.Date(date),
season = if_else(month(date) >= 7, year(date) + 1L, year(date)), # football season heuristic
home = as.character(home_team),
away = as.character(away_team),
hg = as.integer(home_goals),
ag = as.integer(away_goals)
) %>%
filter(!is.na(date), !is.na(home), !is.na(away), !is.na(hg), !is.na(ag)) %>%
arrange(date)
# Basic sanity checks
stopifnot(all(matches$hg >= 0), all(matches$ag >= 0))
matches %>% count(season) %>% arrange(desc(season)) %>% print(n = 50)
Feature engineering
For a baseline Poisson model, we’ll estimate team strength with two parameters per team — attack and defense — plus a single home-advantage term. We’ll also build rolling form features as optional enhancements.
Long format for modeling
# pivot_longer() consumes the home/away columns, so keep copies first —
# otherwise the opp lookup below would reference columns that no longer exist
long <- matches %>%
  mutate(match_id = row_number(), home_team = home, away_team = away) %>%
  tidyr::pivot_longer(
    cols = c(home, away),
    names_to = "side",
    values_to = "team"
  ) %>%
  mutate(
    opp      = if_else(side == "home", away_team, home_team),
    goals    = if_else(side == "home", hg, ag),
    conceded = if_else(side == "home", ag, hg),
    is_home  = as.integer(side == "home")
  ) %>%
  select(match_id, date, season, team, opp, is_home, goals, conceded)
head(long)
Optional: rolling “form” features
These can help a bit, but don’t overfit. Keep them simple and always validate out-of-sample.
# Rolling averages of goals scored/conceded over the previous N matches (per team).
# Note the lag(): the rolling window must end BEFORE the current match,
# otherwise the feature leaks the result it is meant to help predict.
add_form_features <- function(df, n = 5) {
  df %>%
    arrange(team, date, match_id) %>%
    group_by(team) %>%
    mutate(
      gf_roll = lag(zoo::rollapplyr(goals,    width = n, FUN = mean, fill = NA, partial = TRUE)),
      ga_roll = lag(zoo::rollapplyr(conceded, width = n, FUN = mean, fill = NA, partial = TRUE))
    ) %>%
    ungroup()
}
# install.packages("zoo") if needed
# library(zoo)
# long <- add_form_features(long, n = 5)
Model 1: Poisson goals (baseline)
We’ll fit two Poisson regressions: one for home goals and one for away goals, with team attack/defense effects and a home advantage term. A standard approach is:
home_goals ~ home_adv + attack(home) + defense(away)
away_goals ~ attack(away) + defense(home)
To avoid identifiability issues, we’ll set a baseline team as reference via factor levels.
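To see concretely what the reference level does (the team names here are made up for illustration), R's default treatment contrasts drop the first factor level:

```r
# Treatment contrasts: the first level becomes the baseline with coefficient 0
teams_demo <- factor(c("Arsenal", "Chelsea", "Leeds"))
contrasts(teams_demo)
# The "Arsenal" row is all zeros: Arsenal is the reference team, and every
# other attack/defense coefficient is interpreted relative to it
```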
Train/test split by time (realistic for betting)
# Time-based split (e.g., last 20% of matches as test)
n_total <- nrow(matches)
cut_idx <- floor(n_total * 0.80)
train <- matches %>% slice(1:cut_idx)
test <- matches %>% slice((cut_idx + 1):n_total)
# Ensure consistent factor levels
teams <- sort(unique(c(matches$home, matches$away)))
train <- train %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
test <- test %>% mutate(home = factor(home, levels = teams), away = factor(away, levels = teams))
# Fit models
home_mod <- glm(hg ~ 1 + home + away, data = train, family = poisson())
away_mod <- glm(ag ~ 1 + away + home, data = train, family = poisson())
summary(home_mod)
summary(away_mod)
Attack/Defense parameterization (more interpretable)
# Build a modeling frame in the classic attack/defense form
# We model home goals:
# log(lambda_home) = home_adv + attack_home - defense_away
# And away goals:
# log(lambda_away) = attack_away - defense_home
# Create team factors
train2 <- train %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
# We'll encode attack and defense as separate factors by prefixing labels
mk_attack <- function(team) factor(paste0("att_", team), levels = paste0("att_", teams))
mk_def <- function(team) factor(paste0("def_", team), levels = paste0("def_", teams))
train_home <- train2 %>%
transmute(
goals = hg,
is_home = 1L,
att = mk_attack(home),
def = mk_def(away)
)
train_away <- train2 %>%
transmute(
goals = ag,
is_home = 0L,
att = mk_attack(away),
def = mk_def(home)
)
train_long <- bind_rows(train_home, train_away)
# Fit a single Poisson model with:
# goals ~ is_home + att + def
# Note: to reflect "- defense" we can include def and allow coefficients to learn direction;
# For stricter structure you can re-code defense sign, but this works well in practice.
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
summary(ad_mod)
Predict expected goals (lambda) for each match
predict_lambdas <- function(df, model, teams) {
df2 <- df %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams),
att_home = factor(paste0("att_", home), levels = paste0("att_", teams)),
def_away = factor(paste0("def_", away), levels = paste0("def_", teams)),
att_away = factor(paste0("att_", away), levels = paste0("att_", teams)),
def_home = factor(paste0("def_", home), levels = paste0("def_", teams))
)
# home lambda
new_home <- df2 %>%
transmute(is_home = 1L, att = att_home, def = def_away)
# away lambda
new_away <- df2 %>%
transmute(is_home = 0L, att = att_away, def = def_home)
lam_home <- predict(model, newdata = new_home, type = "response")
lam_away <- predict(model, newdata = new_away, type = "response")
df2 %>%
mutate(lambda_home = lam_home, lambda_away = lam_away)
}
test_pred <- predict_lambdas(test, ad_mod, teams)
head(test_pred)
Model 2: Dixon–Coles adjustment (improves low scores)
The independent Poisson assumption can under/overestimate probabilities for low scores (0–0, 1–0, 0–1, 1–1). Dixon–Coles introduces a small correction factor. Below is a clean implementation.
# Dixon-Coles tau adjustment for low-score dependence
tau_dc <- function(x, y, lam_x, lam_y, rho) {
# x = home goals, y = away goals
# rho is the dependence parameter
if (x == 0 && y == 0) return(1 - (lam_x * lam_y * rho))
if (x == 0 && y == 1) return(1 + (lam_x * rho))
if (x == 1 && y == 0) return(1 + (lam_y * rho))
if (x == 1 && y == 1) return(1 - rho)
return(1)
}
# Scoreline probability matrix up to max_goals
score_matrix <- function(lam_h, lam_a, rho = 0, max_goals = 10) {
xs <- 0:max_goals
ys <- 0:max_goals
ph <- dpois(xs, lam_h)
pa <- dpois(ys, lam_a)
# outer product for independent probabilities
P <- outer(ph, pa)
# apply DC tau correction
for (i in seq_along(xs)) {
for (j in seq_along(ys)) {
P[i, j] <- P[i, j] * tau_dc(xs[i], ys[j], lam_h, lam_a, rho)
}
}
# renormalize
P / sum(P)
}
# Example
P_ex <- score_matrix(lam_h = 1.4, lam_a = 1.1, rho = 0.05, max_goals = 8)
round(P_ex[1:5,1:5], 4)
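As a hand-check of the correction (same assumed lambdas as the example above), the two shrunk cells can be computed directly from the tau definition:

```r
# tau for the 0-0 and 1-1 cells at lam_h = 1.4, lam_a = 1.1, rho = 0.05
lam_h <- 1.4; lam_a <- 1.1; rho <- 0.05
tau_00 <- 1 - lam_h * lam_a * rho  # 0.923: 0-0 made slightly less likely
tau_11 <- 1 - rho                  # 0.950: 1-1 shrunk as well
c(tau_00 = tau_00, tau_11 = tau_11)
```

With a positive rho the mass removed from 0-0 and 1-1 shifts toward 1-0 and 0-1.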
How do we choose rho? You can estimate it by maximizing likelihood on training data.
Here’s a lightweight optimizer:
# Estimate rho by maximizing log-likelihood on train set given lambdas
train_pred <- predict_lambdas(train, ad_mod, teams)
dc_loglik <- function(rho, df, max_goals = 10) {
# clamp rho to a reasonable range to avoid numerical issues
rho <- max(min(rho, 0.3), -0.3)
ll <- 0
for (k in seq_len(nrow(df))) {
lam_h <- df$lambda_home[k]
lam_a <- df$lambda_away[k]
hg <- df$hg[k]
ag <- df$ag[k]
P <- score_matrix(lam_h, lam_a, rho = rho, max_goals = max_goals)
# if score exceeds max_goals, treat as tiny prob (or increase max_goals)
if (hg > max_goals || ag > max_goals) {
ll <- ll + log(1e-12)
} else {
ll <- ll + log(P[hg + 1, ag + 1] + 1e-15)
}
}
ll
}
opt <- optimize(
f = function(r) -dc_loglik(r, train_pred, max_goals = 10),
interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
rho_hat
From scorelines to 1X2 probabilities
Once you have a scoreline probability matrix P, compute:
- Home win: sum of probabilities where home goals > away goals
- Draw: sum of diagonal
- Away win: sum where home goals < away goals
p1x2_from_matrix <- function(P) {
  # Rows index home goals (0..max_g), columns away goals, so a home win is
  # everything below the diagonal and an away win everything above it
  tibble(
    p_home = sum(P[lower.tri(P)]),
    p_draw = sum(diag(P)),
    p_away = sum(P[upper.tri(P)])
  )
}
predict_1x2 <- function(df, rho = 0, max_goals = 10) {
out <- vector("list", nrow(df))
for (k in seq_len(nrow(df))) {
P <- score_matrix(df$lambda_home[k], df$lambda_away[k], rho = rho, max_goals = max_goals)
out[[k]] <- p1x2_from_matrix(P)
}
bind_rows(out)
}
test_1x2 <- bind_cols(
test_pred,
predict_1x2(test_pred, rho = rho_hat, max_goals = 10)
)
test_1x2 %>%
select(date, home, away, hg, ag, lambda_home, lambda_away, p_home, p_draw, p_away) %>%
head(10)
Odds, implied probabilities & value
Betting decisions should be driven by expected value (EV).
If the market offers decimal odds O and your model probability is p, then:
- Expected value per unit stake: EV = p*(O - 1) - (1 - p)
- Value condition: p > 1/O (equivalently, EV > 0)
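A quick worked example with assumed numbers — a 45% model probability against decimal odds of 2.40:

```r
p <- 0.45   # model probability
o <- 2.40   # decimal odds on offer
ev <- p * (o - 1) - (1 - p)  # 0.45 * 1.40 - 0.55 = 0.08
p > 1 / o                    # TRUE: 0.45 > 0.4167, so this is a value bet
```

An 8% edge per unit staked — comfortably above the 2% buffer used in the bet-selection filters later on.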
You’ll typically have an odds feed. Below we assume you have a file odds.csv:
date, home, away, odds_home, odds_draw, odds_away.
odds <- readr::read_csv("odds.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(
date,
home = as.character(home),
away = as.character(away),
o_home = as.numeric(odds_home),
o_draw = as.numeric(odds_draw),
o_away = as.numeric(odds_away)
)
df <- test_1x2 %>%
mutate(home = as.character(home), away = as.character(away)) %>%
left_join(odds, by = c("date","home","away"))
# Implied probs (no vig removal yet)
df <- df %>%
mutate(
imp_home = 1 / o_home,
imp_draw = 1 / o_draw,
imp_away = 1 / o_away,
overround = imp_home + imp_draw + imp_away
)
# Simple vig removal by normalization
df <- df %>%
mutate(
mkt_home = imp_home / overround,
mkt_draw = imp_draw / overround,
mkt_away = imp_away / overround
)
# EV per 1 unit stake
ev <- function(p, o) p*(o - 1) - (1 - p)
df <- df %>%
mutate(
ev_home = ev(p_home, o_home),
ev_draw = ev(p_draw, o_draw),
ev_away = ev(p_away, o_away)
)
df %>%
select(date, home, away, p_home, p_draw, p_away, o_home, o_draw, o_away, ev_home, ev_draw, ev_away) %>%
head(10)
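A quick numeric check of the vig-removal step, with assumed odds of 2.10 / 3.40 / 3.60:

```r
imp <- 1 / c(2.10, 3.40, 3.60)  # implied probabilities
sum(imp)                        # ~1.048: the ~4.8% overround is the bookmaker margin
imp / sum(imp)                  # normalized "fair" probabilities summing to 1
```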
Pick bets with thresholds (avoid noise)
# Practical filters: require at least some edge and avoid tiny probabilities
EDGE_MIN <- 0.02 # 2% EV edge
P_MIN <- 0.05 # avoid extreme longshots unless you model them well
df_bets <- df %>%
mutate(
pick = case_when(
ev_home == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "H",
ev_draw == pmax(ev_home, ev_draw, ev_away, na.rm = TRUE) ~ "D",
TRUE ~ "A"
),
p_pick = case_when(pick == "H" ~ p_home, pick == "D" ~ p_draw, TRUE ~ p_away),
o_pick = case_when(pick == "H" ~ o_home, pick == "D" ~ o_draw, TRUE ~ o_away),
ev_pick = case_when(pick == "H" ~ ev_home, pick == "D" ~ ev_draw, TRUE ~ ev_away)
) %>%
filter(!is.na(o_pick)) %>%
filter(p_pick >= P_MIN, ev_pick >= EDGE_MIN)
df_bets %>% count(pick)
Backtest: flat stake vs Kelly
Backtesting is where most people fool themselves. Use a time-based split, realistic bet selection rules, and conservative bankroll sizing.
Compute bet results
# Outcome label
df_bets <- df_bets %>%
mutate(
result = case_when(
hg > ag ~ "H",
hg == ag ~ "D",
TRUE ~ "A"
),
win = as.integer(pick == result)
)
# Profit per 1 unit stake
df_bets <- df_bets %>%
mutate(
profit_flat = if_else(win == 1, o_pick - 1, -1)
)
df_bets %>% summarise(
n_bets = n(),
hit_rate = mean(win),
avg_odds = mean(o_pick),
roi_flat = mean(profit_flat)
)
Kelly staking (fractional Kelly recommended)
Full Kelly is often too aggressive. Use fractional Kelly (e.g., 0.25 Kelly). For a deeper treatment of Kelly and Bayesian uncertainty, see Bayesian Sports Analytics.
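Before the implementation, a hand calculation with assumed numbers — a 55% model probability at even money (decimal odds 2.0):

```r
p <- 0.55; o <- 2.0
b <- o - 1                       # net odds
f_full <- (b * p - (1 - p)) / b  # full Kelly: 0.10 (10% of bankroll)
f_quarter <- 0.25 * f_full       # quarter Kelly: 0.025 (2.5% of bankroll)
```

Even a healthy 5% edge at even money only justifies 2.5% of bankroll at quarter Kelly — a useful intuition for how conservative sensible staking looks.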
kelly_fraction <- function(p, o) {
# Decimal odds o. Net odds b = o - 1
b <- o - 1
q <- 1 - p
f <- (b*p - q) / b
pmax(0, f)
}
KELLY_MULT <- 0.25 # fractional Kelly
df_bets <- df_bets %>%
mutate(
f_kelly = kelly_fraction(p_pick, o_pick),
stake_kelly = KELLY_MULT * f_kelly
)
# Simulate bankroll
simulate_bankroll <- function(df, start_bankroll = 100, stake_col = "stake_kelly") {
br <- start_bankroll
path <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
stake <- df[[stake_col]][i]
stake_amt <- br * stake
# Profit = stake_amt*(o-1) if win else -stake_amt
prof <- if (df$win[i] == 1) stake_amt*(df$o_pick[i] - 1) else -stake_amt
br <- br + prof
path[i] <- br
}
path
}
df_bets <- df_bets %>% arrange(date)
# "Flat" staking here means a constant 1% of the CURRENT bankroll per bet
# (the simulator stakes br * stake, so this is proportional, not fixed-unit)
df_bets <- df_bets %>% mutate(stake_flat = 0.01)
df_bets$br_flat <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_flat")
df_bets$br_kelly <- simulate_bankroll(df_bets, start_bankroll = 100, stake_col = "stake_kelly")
tail(df_bets %>% select(date, home, away, pick, o_pick, win, br_flat, br_kelly), 10)
Plot bankroll curves
plot_df <- df_bets %>%
select(date, br_flat, br_kelly) %>%
pivot_longer(cols = c(br_flat, br_kelly), names_to = "strategy", values_to = "bankroll")
ggplot(plot_df, aes(x = date, y = bankroll, group = strategy)) +
geom_line() +
labs(x = NULL, y = "Bankroll", title = "Backtest Bankroll: Flat vs Fractional Kelly")
Calibration diagnostics
Profit is noisy. Calibration tells you if probabilities are sensible. A great proper scoring rule for 1X2 is the multi-class log loss. We’ll compute log loss for your 1X2 probabilities on the test set.
# Create a probability matrix and truth labels
df_eval <- df %>%
filter(!is.na(p_home), !is.na(p_draw), !is.na(p_away)) %>%
mutate(
truth = case_when(hg > ag ~ "H", hg == ag ~ "D", TRUE ~ "A"),
truth = factor(truth, levels = c("H","D","A"))
)
# Log loss (manual)
log_loss_1x2 <- function(pH, pD, pA, y) {
eps <- 1e-15
p <- ifelse(y=="H", pH, ifelse(y=="D", pD, pA))
-mean(log(pmax(p, eps)))
}
ll <- log_loss_1x2(df_eval$p_home, df_eval$p_draw, df_eval$p_away, df_eval$truth)
ll
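For context, a hypothetical yardstick (the frequencies here are assumed, not derived from this data): a constant forecast at typical league-wide rates of roughly 45% home / 27% draw / 28% away. If outcomes arrived at exactly those rates, its expected log loss would be:

```r
base <- c(H = 0.45, D = 0.27, A = 0.28)  # assumed long-run 1X2 frequencies
ll_baseline <- -sum(base * log(base))    # ~1.069
ll_baseline
```

A model whose test-set log loss beats this constant baseline is adding information beyond the base rates.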
Reliability plot (binning)
# Example: calibration for HOME-win probability
calib_home <- df_eval %>%
mutate(bin = ntile(p_home, 10)) %>%
group_by(bin) %>%
summarise(
p_mean = mean(p_home),
freq = mean(truth == "H"),
n = n(),
.groups = "drop"
)
ggplot(calib_home, aes(x = p_mean, y = freq)) +
geom_point() +
geom_abline(slope = 1, intercept = 0) +
labs(x = "Predicted P(Home win)", y = "Observed frequency", title = "Calibration (Home win)")
Production: weekly pipeline
Here’s a practical skeleton you can run weekly: fetch latest matches → refit/refresh → generate next fixtures probabilities → compare with odds.
# 1) Load historical matches (from worldfootballR or your CSV)
matches <- readr::read_csv("matches.csv") %>% janitor::clean_names()
# 2) Train up to cutoff date (e.g., yesterday)
cutoff_date <- Sys.Date() - 1
hist <- matches %>%
mutate(date = as.Date(date)) %>%
filter(date <= cutoff_date) %>%
transmute(date, home = home_team, away = away_team, hg = home_goals, ag = away_goals) %>%
arrange(date)
teams <- sort(unique(c(hist$home, hist$away)))
# 3) Fit attack/defense model
train_long <- bind_rows(
hist %>%
transmute(goals = hg, is_home = 1L,
att = factor(paste0("att_", home), levels = paste0("att_", teams)),
def = factor(paste0("def_", away), levels = paste0("def_", teams))),
hist %>%
transmute(goals = ag, is_home = 0L,
att = factor(paste0("att_", away), levels = paste0("att_", teams)),
def = factor(paste0("def_", home), levels = paste0("def_", teams)))
)
ad_mod <- glm(goals ~ is_home + att + def, data = train_long, family = poisson())
# 4) Predict for upcoming fixtures (you need a fixtures table)
fixtures <- readr::read_csv("fixtures.csv") %>%
janitor::clean_names() %>%
mutate(date = as.Date(date)) %>%
transmute(date, home = home_team, away = away_team)
# Compute lambdas
fixtures2 <- fixtures %>%
mutate(
home = factor(home, levels = teams),
away = factor(away, levels = teams)
)
fixtures_pred <- predict_lambdas(
df = fixtures2 %>% mutate(hg = 0L, ag = 0L), # placeholders
model = ad_mod,
teams = teams
)
# 5) Estimate rho (optional) on recent history only (faster)
hist_pred <- hist %>% mutate(home = factor(home, levels=teams), away=factor(away, levels=teams))
hist_pred <- predict_lambdas(hist_pred, ad_mod, teams)
opt <- optimize(
  f = function(r) -dc_loglik(r, hist_pred, max_goals = 10),
  interval = c(-0.2, 0.2)
)
rho_hat <- opt$minimum
# 6) Convert to 1X2
fixtures_1x2 <- bind_cols(
fixtures_pred,
predict_1x2(fixtures_pred, rho = rho_hat, max_goals = 10)
) %>%
select(date, home, away, lambda_home, lambda_away, p_home, p_draw, p_away)
write_csv(fixtures_1x2, "model_probs.csv")
At this point you have model_probs.csv ready to merge with bookmaker odds and produce a bet shortlist.
FAQ
Is a Poisson model “good enough” for football betting?
It’s a strong baseline. It captures team strength and home advantage with minimal complexity. Many upgrades (xG, time decay, Bayesian partial pooling) improve robustness, but the baseline can already be useful.
How do I avoid overfitting?
Use time-based validation, keep features simple, and prioritize calibration and log loss. Don’t tune thresholds using the same data you evaluate on.
What’s the simplest value-bet rule?
Bet only when p_model > p_implied and you have a buffer (e.g. EV > 2%),
then stake conservatively (flat 0.5%–1% bankroll or fractional Kelly).
Where do I learn more advanced Bayesian sports models in R?
If you want Bayesian approaches, uncertainty-aware staking, and a deeper treatment of the Kelly criterion, see Bayesian Sports Analytics.
Further reading
- Install & Use worldfootballR (setup + troubleshooting)
- worldfootballR Guide (scraping + workflows)
- Sports Analytics with R (methods hub)
- Bayesian Sports Analytics (advanced modeling + Kelly)

