eSports analytics is still an underexplored area in the R ecosystem, which makes it a great niche for practical, original work. While football, basketball, and betting models already have strong communities, competitive games such as Dota 2 and Counter-Strike offer rich event data, fast feedback loops, and interesting prediction problems. In this post, I will show how R can be used to extract match data, engineer useful features, classify players or teams, and build models to predict match outcomes.

The main idea is simple: treat eSports matches like any other structured competition dataset. We can collect historical match information, transform it into team-level or player-level predictors, and then train machine learning models that estimate the probability of victory. For Dota 2, the OpenDota ecosystem is especially useful because it exposes public match and player data through an API that can be accessed from R.

Why eSports analytics is a strong fit for R

R is particularly well suited for eSports analytics because it combines data collection, cleaning, visualization, modeling, and reporting in a single workflow. With packages from the tidyverse, tidymodels, and API tools such as httr and jsonlite, it becomes straightforward to move from raw match endpoints to a predictive pipeline.

This is also one of the reasons the topic stands out. Compared with mainstream sports, eSports still has much less mature R coverage, so a post focused on predicting Dota 2 matches in R feels fresh. It is practical, technically interesting, and relevant to analysts who want to work on non-traditional sports datasets.

Typical analytics questions in Dota 2 or CS-style games

Once match data is available, several interesting problems appear naturally:

  • Which team features are most associated with winning?
  • Can we predict the outcome of a match before it starts?
  • Which players outperform their role or bracket expectations?
  • Do certain heroes, maps, or compositions create measurable edges?
  • How stable are team ratings over time?

Some of these are classification tasks, others are ranking or regression problems, and several can benefit from time-aware modeling. If you enjoy probabilistic approaches, a Bayesian sports analytics book in R can be a useful complement when you want to move from point predictions to uncertainty-aware forecasts.

Data collection in R with OpenDota

A practical starting point is Dota 2 match data from the OpenDota API. In R, you can work either with a dedicated wrapper such as ROpenDota, when it is available in your environment, or call the API directly with httr2 (or the older httr) together with jsonlite. I often prefer direct API calls because they make the data flow more transparent and easier to debug.

The example below shows a simple way to retrieve recent professional matches and convert them into a tidy tibble.

library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tibble)

# the proMatches endpoint returns the most recent professional matches
base_url <- "https://api.opendota.com/api/proMatches"

resp <- request(base_url) |>
  req_perform()

# parse the JSON body into a flat data frame, then into a tibble
pro_matches <- resp |>
  resp_body_string() |>
  fromJSON(flatten = TRUE) |>
  as_tibble()

glimpse(pro_matches)

At this stage, the key goal is not modeling yet. It is understanding what the dataset contains. You want to inspect variables such as match identifiers, start times, radiant and dire team names, duration, league information, and the final winner. Once the structure is clear, the next step is to collect richer match-level or player-level details.
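To make that inspection step concrete, here is a sketch that selects the orientation columns and converts the raw unix timestamps. It runs on a hand-built stand-in tibble rather than a live API response; the column names mirror what the proMatches endpoint returned at the time of writing, but you should confirm them against your own glimpse() output.

```r
library(dplyr)
library(tibble)

# stand-in for the real pro_matches tibble, with the columns we care about
pro_matches_demo <- tibble(
  match_id = c(7000000001, 7000000002),
  start_time = c(1700000000, 1700003600),   # unix seconds
  radiant_name = c("Team A", "Team B"),
  dire_name = c("Team C", "Team D"),
  league_name = c("Example League", "Example League"),
  duration = c(2100, 2750),                 # match length in seconds
  radiant_win = c(TRUE, FALSE)
)

matches_clean <- pro_matches_demo |>
  mutate(
    start_time = as.POSIXct(start_time, origin = "1970-01-01", tz = "UTC"),
    duration_min = duration / 60,
    winner = if_else(radiant_win, radiant_name, dire_name)
  ) |>
  select(match_id, start_time, radiant_name, dire_name,
         winner, duration_min, league_name)

matches_clean
```

The same mutate/select pattern applies unchanged to the real pro_matches tibble once you have verified its column names.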

Downloading detailed match records

Predictive models usually need more than a top-level match result. We often want per-match detail: kills, deaths, assists, gold per minute, experience per minute, hero picks, bans, lobby type, patch information, and team-level aggregates. A common workflow is to fetch a set of match IDs and then loop through the detailed endpoint for each match.

library(httr2)
library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

get_match_details <- function(match_id) {
  url <- paste0("https://api.opendota.com/api/matches/", match_id)

  # return NULL on failure so one bad request does not stop the whole loop
  tryCatch({
    request(url) |>
      req_perform() |>
      resp_body_string() |>
      fromJSON(flatten = TRUE)
  }, error = function(e) {
    NULL
  })
}

sample_ids <- pro_matches |>
  slice_head(n = 50) |>
  pull(match_id)

# a short pause per request keeps us under the OpenDota rate limit
match_details_raw <- map(sample_ids, function(id) {
  Sys.sleep(1)
  get_match_details(id)
})
match_details_raw <- compact(match_details_raw)

This gives us a list of match records. From there, we can create a team-level modeling table. For predictive work, that usually means one row per team per match, along with a target variable indicating whether that team won.

Feature engineering for match prediction

Feature engineering is where most of the value is created. A model rarely becomes useful because of the algorithm alone; it becomes useful because the input variables capture something meaningful about team quality, momentum, or composition.

Some strong candidate features include:

  • Recent win rate over the last 5 or 10 matches
  • Average team KDA from recent games
  • Average gold per minute and experience per minute
  • Hero-pool diversity
  • Patch-specific performance
  • Opponent strength proxies
  • Side indicator such as Radiant vs Dire
  • Time since the team last played

A basic team-level engineering pipeline in R might look like this:

library(dplyr)
library(purrr)
library(tidyr)
library(stringr)

team_rows <- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  players <- as_tibble(m$players)

  players <- players |>
    mutate(
      # player_slot values below 128 belong to the Radiant side
      side = if_else(player_slot < 128, "radiant", "dire")
    )

  team_summary <- players |>
    group_by(side) |>
    summarise(
      team_kills = sum(kills, na.rm = TRUE),
      team_deaths = sum(deaths, na.rm = TRUE),
      team_assists = sum(assists, na.rm = TRUE),
      avg_gpm = mean(gold_per_min, na.rm = TRUE),
      avg_xpm = mean(xp_per_min, na.rm = TRUE),
      hero_diversity = n_distinct(hero_id),
      .groups = "drop"
    ) |>
    mutate(
      match_id = m$match_id,
      duration = m$duration,
      radiant_win = m$radiant_win,
      win = if_else(
        (side == "radiant" & radiant_win) | (side == "dire" & !radiant_win),
        1, 0
      )
    )

  team_summary
})

team_rows

This table is already enough for a first classification model. It is not perfect, and it does not yet include pre-match-only features, but it is ideal for prototyping. For real forecasting, we should be careful not to leak post-match information into the predictors. For example, final kills and average GPM are fine for explanatory analysis but not for true pre-match forecasting.
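One low-tech safeguard is to keep an explicit list of post-match columns and drop them whenever you assemble a forecasting table, so leaky variables cannot slip in by accident. A sketch on toy data, with hypothetical column names:

```r
library(dplyr)
library(tibble)

# toy modeling table: some columns are only known after the match ends
team_rows_demo <- tibble(
  match_id = c(1, 1, 2, 2),
  side = c("radiant", "dire", "radiant", "dire"),
  team_kills = c(30, 18, 22, 41),           # post-match: leaks the result
  avg_gpm = c(520, 430, 460, 580),          # post-match: leaks the result
  recent_win_rate = c(0.6, 0.4, 0.5, 0.7),  # pre-match: rolling history
  win = c(1, 0, 0, 1)
)

post_match_cols <- c("team_kills", "avg_gpm")

# explanatory table keeps everything; the forecasting table drops leaky columns
forecast_table <- team_rows_demo |>
  select(-all_of(post_match_cols))

names(forecast_table)
```

Keeping that vector in one place makes it easy to audit which variables a "pre-match" model actually sees.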

Building a proper pre-match dataset

If the goal is to predict the winner before the game begins, then every feature must be available before the first second of the match. That means historical rolling summaries are usually better than in-match totals. A cleaner setup is:

  1. Sort matches chronologically
  2. Create one row per team per match
  3. Compute rolling features from previous matches only
  4. Join the two competing teams into a head-to-head row
  5. Train a binary classifier on the winner

Here is a simplified example of rolling team form:

library(dplyr)
library(slider)

team_history <- team_rows |>
  arrange(match_id) |>
  group_by(side) |>
  mutate(
    # lag() ensures only earlier matches enter the window, so the current
    # result never leaks into its own "recent form"; the window covers the
    # five previous matches
    recent_win_rate = slide_dbl(lag(win), ~ mean(.x, na.rm = TRUE), .before = 4),
    recent_avg_kills = slide_dbl(lag(team_kills), ~ mean(.x, na.rm = TRUE), .before = 4),
    recent_avg_deaths = slide_dbl(lag(team_deaths), ~ mean(.x, na.rm = TRUE), .before = 4)
  ) |>
  ungroup()

In a more complete dataset, you would calculate these rolling statistics by actual team identity rather than by side alone. That produces a much more realistic team-strength signal.
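As a sketch of that idea, assuming we have attached a team_name (or team ID) column to each row, rolling form per team could look like this on toy data:

```r
library(dplyr)
library(slider)
library(tibble)

# toy long table: one row per team per match, already sorted by time
team_games <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3),
  team_name = c("A", "B", "A", "C", "A", "B"),
  win = c(1, 0, 0, 1, 1, 0)
)

team_form <- team_games |>
  arrange(match_id) |>
  group_by(team_name) |>
  mutate(
    # lag() ensures only earlier matches enter the window (no leakage);
    # a team's first match has no history, so the value is NaN there
    recent_win_rate = slide_dbl(lag(win), ~ mean(.x, na.rm = TRUE), .before = 4)
  ) |>
  ungroup()

team_form
```

Team A's three rows illustrate the behavior: no history before match 1, a 100% rate before match 2 (one prior win), and 50% before match 3 (one win, one loss).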

Predicting match outcomes with tidymodels

Once a clean modeling table is ready, tidymodels provides an elegant framework for splitting data, preprocessing predictors, training models, and evaluating performance. Logistic regression is a strong baseline because it is interpretable and fast. After that, tree-based methods such as random forests or gradient boosting can be tested.

library(tidymodels)

model_data <- team_rows |>
  select(win, team_kills, team_deaths, team_assists, avg_gpm, avg_xpm, hero_diversity, duration) |>
  mutate(win = factor(win, levels = c(0, 1)))

set.seed(123)

split_obj <- initial_split(model_data, prop = 0.8, strata = win)
train_data <- training(split_obj)
test_data  <- testing(split_obj)

rec <- recipe(win ~ ., data = train_data) |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

log_spec <- logistic_reg() |>
  set_engine("glm")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(log_spec)

fit_log <- fit(wf, data = train_data)

preds <- predict(fit_log, test_data, type = "prob") |>
  bind_cols(predict(fit_log, test_data)) |>
  bind_cols(test_data)

# with factor levels c(0, 1), the class of interest ("1") is the second level
roc_auc(preds, truth = win, .pred_1, event_level = "second")
accuracy(preds, truth = win, .pred_class)

The first model is rarely the final model, but it gives us a baseline. If performance is weak, that usually means the issue is in the feature set rather than the modeling syntax. Better historical variables, better team identifiers, and better patch-aware data often matter more than switching algorithms immediately.

Moving beyond logistic regression

After a baseline, several improvements are possible. Random forests can capture nonlinear relationships. Gradient boosting often performs well when feature interactions matter. Bayesian models can be especially attractive when sample sizes are uneven or when you want probability distributions instead of single-point estimates. For readers interested in probabilistic thinking and predictive uncertainty, a resource on Bayesian sports betting with R can help connect model outputs with practical decision-making.

rf_spec <- rand_forest(
  trees = 500,
  min_n = 5
) |>
  # importance = "impurity" stores variable importance scores at fit time
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec)

fit_rf <- fit(rf_wf, data = train_data)

rf_preds <- predict(fit_rf, test_data, type = "prob") |>
  bind_cols(predict(fit_rf, test_data)) |>
  bind_cols(test_data)

roc_auc(rf_preds, truth = win, .pred_1, event_level = "second")
accuracy(rf_preds, truth = win, .pred_class)

A good post does not need to claim perfect predictive power. In fact, readers usually trust the analysis more when you clearly explain the constraints. Team rosters change, patches alter the meta, public data can be incomplete, and many matches are influenced by contextual factors that are difficult to encode numerically.

Player classification and rating ideas

Match prediction is only one angle. Another strong direction is player classification. For example, we can cluster players based on aggression, farming style, support contribution, and efficiency. This is particularly interesting because eSports roles are both strategic and behavioral.

A simple unsupervised workflow could include:

  • K-means clustering on player performance metrics
  • PCA for dimensionality reduction and visualization
  • Role classification using labeled examples
  • Elo-style or Glicko-style rating systems for evolving skill estimates

As a starting point, the match details we already downloaded can be reduced to one row per player per match:

library(dplyr)
library(purrr)

player_data <- map_dfr(match_details_raw, function(m) {
  if (is.null(m$players) || length(m$players) == 0) return(NULL)

  as_tibble(m$players) |>
    transmute(
      match_id = m$match_id,
      account_id = account_id,
      hero_id = hero_id,
      kills = kills,
      deaths = deaths,
      assists = assists,
      gpm = gold_per_min,
      xpm = xp_per_min,
      last_hits = last_hits
    )
}) |>
  filter(!is.na(account_id))

player_summary <- player_data |>
  group_by(account_id) |>
  summarise(
    avg_kills = mean(kills, na.rm = TRUE),
    avg_deaths = mean(deaths, na.rm = TRUE),
    avg_assists = mean(assists, na.rm = TRUE),
    avg_gpm = mean(gpm, na.rm = TRUE),
    avg_xpm = mean(xpm, na.rm = TRUE),
    avg_last_hits = mean(last_hits, na.rm = TRUE),
    matches = n(),
    .groups = "drop"
  ) |>
  filter(matches >= 10)

From there, clustering or supervised classification becomes straightforward. This is the kind of section that makes an eSports article feel broader than a simple API tutorial.
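For instance, here is a minimal k-means sketch. It runs on simulated player summaries standing in for the player_summary table above, with two groups deliberately constructed to look like cores and supports, so the clusters are easy to recover:

```r
library(dplyr)
library(tibble)

set.seed(42)

# simulated player summaries: first 20 rows look like cores, last 20 like supports
demo_players <- tibble(
  avg_gpm = c(rnorm(20, 600, 30), rnorm(20, 320, 30)),
  avg_assists = c(rnorm(20, 8, 2), rnorm(20, 18, 2)),
  avg_last_hits = c(rnorm(20, 280, 25), rnorm(20, 60, 15))
)

# scale the features so no single metric dominates the distance calculation
scaled <- scale(demo_players)

km <- kmeans(scaled, centers = 2, nstart = 25)

demo_players |>
  mutate(cluster = km$cluster) |>
  count(cluster)
```

On real data you would not know the number of clusters in advance, so an elbow plot or silhouette scores would guide the choice of centers.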

Visualization ideas that make the post stronger

Visuals can turn a technical post into a memorable one. In eSports, a few plots are especially effective:

  • Win probability calibration plots
  • Rolling team form charts
  • Hero usage and win-rate heatmaps
  • Player cluster scatterplots from PCA
  • Feature importance plots for tree models

For example, here is a simple variable importance chart after fitting a random forest:

library(vip)

# requires the ranger engine to be fitted with importance enabled,
# e.g. set_engine("ranger", importance = "impurity")
fit_rf |>
  extract_fit_parsnip() |>
  vip()

The purpose of these plots is not just decoration. They help answer the analytical question visually: what actually drives team success, and which signals seem stable across matches?
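As an example of the first idea, a binned calibration plot compares predicted win probabilities against observed win rates. Here is a sketch on simulated predictions standing in for the preds table from the modeling section; the outcomes are drawn from the stated probabilities, so the points should hug the diagonal:

```r
library(dplyr)
library(ggplot2)
library(tibble)

set.seed(1)

# simulated predicted probabilities and outcomes generated from them
calib_demo <- tibble(
  .pred_1 = runif(2000),
  win = rbinom(2000, 1, .pred_1)
)

# bin predictions into 10% bands and compare mean prediction to observed rate
calib_bins <- calib_demo |>
  mutate(bin = cut(.pred_1, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) |>
  group_by(bin) |>
  summarise(
    mean_pred = mean(.pred_1),
    observed_rate = mean(win),
    n = n(),
    .groups = "drop"
  )

ggplot(calib_bins, aes(mean_pred, observed_rate)) +
  geom_abline(linetype = "dashed") +
  geom_point(aes(size = n)) +
  labs(x = "Mean predicted win probability", y = "Observed win rate")
```

With a real model, systematic deviation from the dashed diagonal signals over- or under-confident probabilities, which matters more than raw accuracy for betting-style decisions.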

What about Counter-Strike or other eSports titles?

The same workflow generalizes well. Even if package support is less standardized than in Dota 2, the modeling logic remains the same:

  • Collect historical match data
  • Build team and player features
  • Use rolling windows to represent recent form
  • Train classification or rating models
  • Evaluate probabilities, not just hard predictions

In Counter-Strike-style datasets, likely features include map win rates, side-specific strength, recent kill differential, roster stability, and head-to-head history. In that sense, the sport changes, but the R workflow does not.
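To make the rating-model idea concrete, here is a minimal Elo-style update that works for any head-to-head eSport. The scale constant 400 and k = 32 are standard chess defaults, not values tuned for eSports:

```r
# expected score for player/team A against B under the Elo model
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10 ^ ((r_b - r_a) / 400))
}

# update both ratings after one match; a_won is TRUE if A won
elo_update <- function(r_a, r_b, a_won, k = 32) {
  e_a <- elo_expected(r_a, r_b)
  s_a <- as.numeric(a_won)
  c(
    a = r_a + k * (s_a - e_a),
    b = r_b + k * ((1 - s_a) - (1 - e_a))
  )
}

# two equally rated teams: the winner gains exactly what the loser drops
elo_update(1500, 1500, a_won = TRUE)
# → a = 1516, b = 1484
```

Folding this update over chronologically sorted matches yields an evolving team-strength column that can feed directly into the pre-match feature set described earlier.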

Why this kind of post can stand out

A post on eSports analytics in R stands out because it sits at the intersection of data science novelty and practical modeling. It is specific enough to be useful, but unusual enough to attract readers who are tired of the same repeated examples from traditional sports. A title built around predicting Dota 2 matches is especially effective because it immediately communicates a concrete deliverable.

It also fits naturally into a broader sports analytics learning path. Readers who discover this topic through eSports may later want to explore work in football, soccer, or multi-sport modeling, where books such as Football Analytics with R, Mastering Sports Analytics with R: Soccer, or Sports Analytics with R across multiple sports can expand the same analytical mindset into other domains.

Final thoughts

eSports analytics deserves more attention in R, and Dota 2 is one of the best places to start. With API access, tidy data workflows, and flexible modeling tools, it is possible to go from raw public match records to meaningful predictive systems entirely in R. Even a simple first version can teach a lot about data engineering, feature design, classification, and evaluation.

The real opportunity is not only to predict winners, but to build a reproducible framework for understanding team performance, player styles, and competitive dynamics in games that are becoming more important every year. That combination of novelty, data richness, and analytical depth is exactly what makes eSports such a compelling subject for an R post.