Tom LEFRERE · Data Scientist

Raw data. A signal.

· css · data · html · jupyter · numpy · pandas · playwright · plotly · python

Esport Stats, a small data investigation on pro scenes

A personal project starting from a silly question (who are the youngest? who are the oldest?) that ended as a small editorial report with Python, Plotly, a printable PDF, and a few methodological traps along the way. Here's how it went.

A data project done over a weekend, starting from a silly question about pro players’ ages, that ended as a 15-page editorial report. One of those things you start on a Saturday evening thinking “OK, quick one”, and of course it turns out to be nothing of the sort.

So it all started with a bit of curiosity one evening. I was wondering whether it was true, as we often hear without checking, that FPS scenes end careers early and that Dota 2 is full of veterans who have been around for ten years. I wanted a clean, quantified answer, and while I was at it, something not too ugly to look at. So I opened a notebook, told myself I’d wrap it up in two hours, and of course it took a bit more.

The starting point

The initial idea was to scrape Liquipedia, output a bar chart, move on. I quickly changed my mind when I found PandaScore, a well-made esport API, especially for anyone who wants structured data without parsing unstable HTML. 1,000 requests per hour on the free tier, clean endpoints for players, teams, tournaments. Plenty for this kind of exercise.

I picked six games: League of Legends, Counter-Strike, Dota 2, Valorant, Rainbow 6 Siege and Overwatch. The plan was to only look at tier S, the elite (LCK, LEC, VCT Masters, BLAST Major, The International, that kind of thing), and only active players. No semi-pro, no retired players still lingering in the DB. A 2026 snapshot, not a census.

First roadblocks

The first hours were a little tour of classic traps, the kind that make you doubt the API before you realize you just fooled yourself.

  • The birthday field seemingly missing from responses, when actually PandaScore just omits null fields. Faker has his birthdate, an anonymous tier-3 player doesn’t, which is normal. Lost an hour before I got that.

  • API slugs don’t always match the game names. In the list it’s cs-go, but in the URL you have to write /csgo/. Same for dota-2, which becomes /dota2/. Small detail, but it costs you three 404s before it clicks.

  • There is no tier field on leagues, but there is one on tournaments. So the useful filter is filter[tier]=s applied to tournaments, not leagues. Once you get that, everything falls into place.

  • Rate limits obviously, which burn fast when you explore by trial and error. I blew through the 1,000 calls in one session, which was the perfect moment to add a JSON disk cache on every request. Re-runs are now instant, and it’s clearly five minutes of code that saved me a ton of time later.
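
The disk cache from the last point can be sketched in a few lines. This is a minimal version under stated assumptions, not the code from etl.py: the cached_get name, the .cache directory and the injected fetch callable are all hypothetical, chosen so the cache logic stays independent of the HTTP client.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".cache")  # hypothetical location for cached responses


def cached_get(url, params, fetch, cache_dir=CACHE_DIR):
    """Return the JSON payload for (url, params), calling fetch only on a cache miss."""
    # The cache key is a hash of the URL plus its sorted query parameters.
    raw_key = json.dumps([url, params], sort_keys=True)
    path = cache_dir / (hashlib.sha256(raw_key.encode("utf-8")).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(url, params)  # the only place a real API call happens
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

With requests, fetch could be something like lambda url, params: requests.get(url, params=params, headers=auth_headers, timeout=30).json(), where auth_headers is whatever carries your token. The cache never expires, which suits a one-off snapshot but not live data.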

The pipeline

The final architecture fits in three Python scripts that chain together without any fuss. The first, etl.py, hits PandaScore and outputs four CSVs: players_top.csv for unique active players, tournaments.csv for tournament metadata (with the winner), participants.csv linking tournaments to teams, and above all rosters.csv, the real keystone, which contains historical rosters, i.e. who played in what, and when. More on that below: this file is what let me fix the methodological mistake I mention later.

The second, viz.py, turns the CSVs into an editorial HTML report. That’s where I tried to step out of “utility dashboard” territory toward something more readable. Serif typeface for titles (Cormorant Garamond), sans-serif for body (Inter), a cream and terracotta palette, italic captions. Plotly themed to stay consistent with the rest. Eleven charts, a sortable DataTables table, and auto-generated insights at the top (“Dota 2 is the most mature scene, 2.5 years older than Overwatch”).
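
To give an idea of how an auto-generated insight like the one quoted above can be produced, here is a minimal sketch. The function name is hypothetical and the input is limited to the two averages given in this article.

```python
def age_gap_insight(mean_age):
    """Build a one-line insight from a {scene: average age} mapping."""
    oldest = max(mean_age, key=mean_age.get)
    youngest = min(mean_age, key=mean_age.get)
    gap = mean_age[oldest] - mean_age[youngest]
    return f"{oldest} is the most mature scene, {gap:.1f} years older than {youngest}"


# Only the two averages quoted in this article; the other scenes are omitted.
sentence = age_gap_insight({"Dota 2": 26.8, "Overwatch": 24.3})
```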

The third, pdf.py, uses Playwright to load the HTML into headless Chromium, wait for Plotly to finish rendering, apply the @media print CSS, and output a 15-page PDF. Not the most trivial part, more on that too.

The methodological trap to spot

This is the part I’m most glad I caught, because honestly it could have slipped through. The first version of the report had a chart of median age year over year, very clean visually. Except it was linear by construction: I was taking the current cohort and walking it back in time with year - birthdate. Since every player ages by exactly one year per year, the median shifted mechanically by one year too. Zero real information, but it looks like it says a lot.

What saved me was a remark from the “client” (me, re-reading cold) that it looked a little too good. I looked again, and yes, it was a pure mathematical artifact. An important reflex, and one we don’t always have: ask yourself whether a chart that feels telling could have come out that way without reflecting anything true.
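
The artifact is easy to reproduce. In the sketch below the birth years are made up for illustration; whatever values you pick, the walked-back median moves by exactly one year per year.

```python
from statistics import median

# Made-up birth years for a fixed, present-day cohort (illustration only).
birth_years = [1996, 1998, 1999, 2001, 2003, 2004]


def walked_back_median(year):
    """Median age of the SAME cohort, evaluated as of a given year."""
    return median(year - born for born in birth_years)


series = {year: walked_back_median(year) for year in range(2020, 2027)}
# Year-over-year change of the median: always 1, whatever the cohort.
steps = [series[year + 1] - series[year] for year in range(2020, 2026)]
```

Subtracting the same constant from every age shifts the median by that constant, so the slope says nothing about the scene.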

The fix came from a small API discovery. The /tournaments/{id}/teams endpoint doesn’t return current rosters of teams that played, it returns rosters at the time of the tournament, which changes everything. By saving a rosters.csv table with one row per (tournament, team, player), we can compute for each year who actually played and what age they were then. Each year gets its own cohort, newcomers enter, veterans leave, and the chart becomes genuinely informative.
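
A minimal sketch of the per-year cohort computation, using only the standard library so it runs anywhere; the same logic maps onto a pandas groupby. The row layout and the sample values are hypothetical, and age is approximated as year minus birth year.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows in the spirit of rosters.csv:
# (tournament_id, year, team, player, birth_year)
rosters = [
    ("t1", 2020, "Team A", "p1", 1998),
    ("t2", 2020, "Team A", "p1", 1998),  # same player, second tournament
    ("t1", 2020, "Team A", "p2", 1996),
    ("t1", 2020, "Team B", "p3", 2000),
    ("t3", 2021, "Team A", "p1", 1998),
    ("t3", 2021, "Team B", "p3", 2000),
    ("t3", 2021, "Team B", "p4", 2003),  # newcomer; p2 has left
]


def median_age_by_year(rows):
    """Median age of the players who actually played each year."""
    ages = defaultdict(dict)  # year -> {player: age}, so each player counts once
    for _tournament, year, _team, player, birth_year in rows:
        ages[year][player] = year - birth_year
    return {year: median(by_player.values()) for year, by_player in sorted(ages.items())}


medians = median_age_by_year(rosters)
```

In this toy data the median drops from 2020 to 2021 even though every remaining player got a year older, which is exactly the kind of movement the walked-back version could never show.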

The result: since 2020, Dota 2’s median age has risen by about 2 years and R6 Siege’s by 3, while Counter-Strike and Valorant are stable. That matches community intuition, which is reassuring after a methodology bug of that size. Even so, it should be taken with caution: not all years are covered at the same depth, especially the oldest tournaments.

What the report reveals

A few results I find interesting, though obviously it’s a point-in-time snapshot and not absolute truth.

  • Dota 2 is the most mature scene with an average age of 26.8, and Overwatch the youngest at 24.3. 2.5 years apart, which is significant at the scale of a pro career, and roughly tracks the lifespans of games in the ecosystem (Dota 2 running since 2013, Overwatch restarted more recently with OW2).

  • The most international scene by Shannon’s index is Counter-Strike (0.82). The most homogeneous is Overwatch (0.63), heavily dominated by Korea and the US. Not surprising for anyone following the scene, but still nice to have it quantified.

  • 62% of tier-S Counter-Strike players come from EMEA. The record regional concentration in the dataset.

  • In League of Legends, winning rosters are on average 0.4 years younger than the scene. A small but measurable gap. Not enough to build a grand theory, but it’s a small signal.

  • The oldest is TaZ (39, Counter-Strike), the youngest TaiLung (15, Dota 2). A 24-year spread within the same pro ecosystem.
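
For the diversity figures in the list above, here is a minimal sketch of a Shannon index over player nationalities. The values 0.82 and 0.63 sit between 0 and 1, which suggests a normalized index; the report does not say which normalization it uses, so dividing by the log of the number of countries is an assumption here.

```python
import math
from collections import Counter


def normalized_shannon(labels):
    """Shannon entropy of the label shares, divided by its maximum, ln(number of labels)."""
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0  # a single country means no diversity at all
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return entropy / math.log(len(counts))


# Made-up nationality lists: one evenly spread, one dominated by a single country.
even = normalized_shannon(["KR", "US", "FR", "DK"])
skewed = normalized_shannon(["KR", "KR", "KR", "KR", "KR", "US", "FR", "DK"])
```

An even spread scores close to 1 and a scene dominated by one country scores lower, whatever the number of countries involved.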

The visualizations

The report is structured in five parts plus an annex. Part one on pro ages, with a bar chart of averages, a ridgeline plot of distributions (more editorial than a classic violin plot, small detail but it changes the read), and a table of the oldest and youngest. Part two on scene evolution, with stacked birth cohorts and the fixed time curve.

Part three is the densest: geography. A world choropleth, top 15 countries, regional shares per game (EMEA, Americas, Asia, Oceania), small multiples for the top 6 countries per discipline, a Sankey of migrations (player born in one country, signed by a team in another), Shannon’s diversity index, and independent mini-maps per game. Maybe a bit much, but it was hard to pick.

Part four is on performance. Average age of winning rosters vs. the whole scene, and a top 3 of teams per discipline, normalized by the number of tournaments in the game. Normalizing is obvious in hindsight, but at first I had a raw ranking, which made Counter-Strike crush everything just because it has three times more tournaments than LoL. The typical kind of bias you catch by looking at the chart and thinking “wait, that doesn’t tell the story I want”.
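
The normalization can be sketched as a win share per game. The rows below are made up, and the function name is hypothetical; the point is only to show how a raw count and a normalized share can rank teams differently.

```python
from collections import Counter

# Made-up (game, winning team) rows, one per tournament.
winners = [
    ("cs", "Team X"), ("cs", "Team X"), ("cs", "Team X"),
    ("cs", "Team Y"), ("cs", "Team Y"), ("cs", "Team Z"),
    ("lol", "Team Q"), ("lol", "Team Q"),
]


def win_shares(rows):
    """Wins per (game, team) divided by the number of tournaments in that game."""
    tournaments_per_game = Counter(game for game, _team in rows)
    wins = Counter(rows)
    return {key: n / tournaments_per_game[key[0]] for key, n in wins.items()}


shares = win_shares(winners)
```

Team X has more raw wins than Team Q (3 vs. 2), but only because its game has three times as many tournaments; the shares put Team Q ahead.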

Part five is one tile per game with key numbers and role composition. And in the annex, a small comparison with traditional sports: NBA 26.4, Premier League 27.1, NFL 26.6, ATP 27.3, esport 25.4. Elite esport rosters are younger than most pro sports, but the gap is less dramatic than you’d think (one to two years on these averages), especially if you isolate scenes like Dota 2, which sit at about Premier League level.

The PDF render

Getting a clean PDF from Plotly is not at all trivial, something I had underestimated going in. Charts are interactive SVGs, sized at first render in the browser. So if you just flip to print, the layout breaks, because the chart keeps its original width (typically that of a 1,200px HTML viewport) while the A4 page is 794px wide.

The recipe that works with Playwright: load the HTML, wait for every .js-plotly-plot to have its .main-svg, then call page.emulate_media(media="print") to activate the print CSS, shrink the viewport to A4 width (794px), and force a Plotly.Plots.resize() on every chart so it recomputes its size. Wait a second and a half for the relayout to stabilize, then call page.pdf(). A bit cobbled together, but the output is clean.
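
Here is a sketch of that recipe with Playwright’s sync API. It follows the steps above but is not the actual pdf.py: the function names, the report.html and report.pdf paths, the 1123px viewport height and the print_background flag are assumptions.

```python
# Both snippets run inside the page; Plotly is the global loaded by the report.
PLOTS_READY_JS = """() => {
  const plots = [...document.querySelectorAll('.js-plotly-plot')];
  return plots.length > 0 && plots.every(p => p.querySelector('.main-svg'));
}"""
RESIZE_JS = """() => Promise.all(
  [...document.querySelectorAll('.js-plotly-plot')].map(p => Plotly.Plots.resize(p))
).then(() => null)"""

A4_WIDTH_PX, A4_HEIGHT_PX = 794, 1123  # A4 at 96 DPI; the height is an assumption


def render_pdf(page, html_url, pdf_path):
    """Drive an already-open Playwright page through the print recipe."""
    page.goto(html_url)
    page.wait_for_function(PLOTS_READY_JS)  # every chart has its .main-svg
    page.emulate_media(media="print")       # activate the @media print rules
    page.set_viewport_size({"width": A4_WIDTH_PX, "height": A4_HEIGHT_PX})
    page.evaluate(RESIZE_JS)                # let Plotly recompute chart sizes
    page.wait_for_timeout(1500)             # give the relayout time to settle
    page.pdf(path=pdf_path, format="A4", print_background=True)


def main():
    # Imported here so render_pdf can be exercised without Playwright installed.
    from pathlib import Path
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        page = browser.new_page()
        render_pdf(page, Path("report.html").resolve().as_uri(), "report.pdf")
        browser.close()
```

The fixed 1.5 second wait is the fragile part; a stricter version would wait on a condition, but the timeout matches the description above.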

Print CSS has its own rules: collapse two-column to single-column (charts span full width), hide the DataTables chrome that makes no sense offline, impose clean page breaks between major parts, shrink heights so a chart fits with its caption and methodological note on a single page. Many small adjustments before you get a PDF you actually want to open all the way through.

What’s left for V2

Two tracks need more API compute time and will move to V2. First, career longevity, which requires hitting /players/{id}/tournaments for every player. That’s 2,600 requests, or about three hours of fetching within the rate limit. Hypothesis: careers are very short on FPS, much longer on MOBA. Would be interesting to quantify.
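
The three-hour estimate is simple arithmetic on the free-tier limit. As a sketch, assuming calls are paced evenly across the hour (the function name is hypothetical):

```python
def fetch_budget(n_requests, limit_per_hour=1000):
    """Return (seconds between calls, total hours) for an evenly paced run."""
    interval_s = 3600 / limit_per_hour
    total_hours = n_requests * interval_s / 3600
    return interval_s, total_hours


# 2,600 player calls on the 1,000-per-hour free tier.
interval_s, total_hours = fetch_budget(2600)
```

That works out to one call roughly every 3.6 seconds and about 2.6 hours in total, hence "about three hours" once retries and other calls are counted.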

Second, roster rotation, the percentage of players who change teams year over year. A metric often commented on by the community, especially around off-season transfers, but rarely properly measured. Should be doable without many extra calls since we already have rosters.csv.
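
One possible way to measure it, as a sketch only. The definition is an assumption: among players seen in two consecutive years, a player counts as having moved if their set of teams differs between the two years. The rows and names are made up.

```python
from collections import defaultdict

# Made-up (year, team, player) rows in the spirit of rosters.csv.
rows = [
    (2024, "Team A", "p1"), (2024, "Team A", "p2"),
    (2024, "Team B", "p3"), (2024, "Team B", "p4"),
    (2025, "Team A", "p1"), (2025, "Team B", "p2"),
    (2025, "Team B", "p3"), (2025, "Team A", "p5"),
]


def rotation_rate(rows, year):
    """Share of players present in both year and year + 1 whose teams changed."""
    teams = defaultdict(set)  # (year, player) -> teams played for that year
    for y, team, player in rows:
        teams[(y, player)].add(team)
    present_both = [p for (y, p) in teams if y == year and (year + 1, p) in teams]
    if not present_both:
        return None  # nothing to compare
    moved = sum(teams[(year, p)] != teams[(year + 1, p)] for p in present_both)
    return moved / len(present_both)


rate = rotation_rate(rows, 2024)
```

Here p1, p2 and p3 appear in both years and only p2 changed teams, so the rate is one third. Mid-season transfers would need a finer rule than comparing yearly sets.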

Tech stack

Nothing exotic. Python 3.13 with requests, pandas, numpy, pycountry, python-dotenv. Plotly for all charts, with a custom simple_white theme. DataTables.js for the interactive table (lightweight, does the job). Playwright plus headless Chromium for faithful PDF rendering. And PandaScore as the single data source.

What this project taught me

Three things struck me, more or less connected to the subject itself.

First, a chart that looks clean can hide a major methodological bias. The reflex of asking yourself “could this line have come out this way without reflecting anything true?” is worth its weight in gold. My linear chart could have passed for insight when it was a pure artifact. So I try more and more to do this kind of cold re-read, ideally the next day, because fresh off the build we’re a bit too proud of the result to critique it honestly.

Second, the difference between a utility dashboard and an editorial report. It comes down to very little, actually: a serif typeface for titles, a muted palette, italic captions, polished page breaks. But it completely flips the reader’s perception. A detail of form that shifts the mental category the document gets filed under.

Third, more banal but worth noting: there’s always a rate limit somewhere. Adding a disk cache at the start of exploration is five minutes of code that saves fifty. I know it, I try to do it, and I still forget one time out of two.

The full editorial report (15 pages) esport-stats-report.pdf · 2.8 MB

Project wrapped in an evening and a few extra hours. With PandaScore’s paid API other horizons open up (more years of history, more calls for longevity), but even on the free tier, it’s a solid base for anyone who wants to play with esport data. And honestly, if you’re looking for a data project idea that’s rich, feasible, and original enough to fit on a CV, this is typically the kind of subject I’d recommend. There you go.