PDF-derived affiliation dataset for all 5,356 ICLR 2026 accepted papers, with the full scrape-to-treemap pipeline and CSV/XLSX downloads.
Key Takeaways
Affiliations pulled from paper title-block PDFs (94% success), not OpenReview profiles, avoiding profile drift, where an author's current employer overwrites historical affiliations.
Dataset columns include canonical institution names (250+ normalization rules), country, region, primary area, keywords, abstract, and OpenReview URL.
Three counting methods provided: unique-per-paper, first-author-only, fractional 1/N; sensitivity CSV shows top-50 institutions are stable across all three.
Full pipeline is reproducible for other conferences: scrape OpenReview, bulk-download PDFs (~5 GB), parse, canonicalize, render treemap.
Hong Kong institutions counted separately from mainland China, matching QS/THE ranking conventions.
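The three counting methods named above (unique-per-paper, first-author-only, fractional 1/N) can be illustrated with a minimal sketch. This is not the dataset's own code: the function name `count_affiliations`, the input layout (one canonical institution per author, in author order), and the sample papers are all hypothetical. It also assumes N is the number of authors on a paper, so each author contributes 1/N to their institution; the dataset may define N differently, for example as the number of distinct institutions.

```python
from collections import defaultdict


def count_affiliations(papers):
    """Tally institutions under three counting schemes.

    `papers` is a list of papers; each paper is a list with one canonical
    institution name per author, in author order (hypothetical layout).
    """
    unique = defaultdict(int)        # each institution counted once per paper
    first = defaultdict(int)         # only the first author's institution
    fractional = defaultdict(float)  # each author adds 1/N (N = author count, assumed)

    for authors in papers:
        if not authors:
            continue  # e.g. a paper whose affiliations could not be parsed
        for inst in set(authors):
            unique[inst] += 1
        first[authors[0]] += 1
        share = 1 / len(authors)
        for inst in authors:
            fractional[inst] += share

    return dict(unique), dict(first), dict(fractional)


# Made-up sample data, for illustration only.
papers = [
    ["MIT", "MIT", "Stanford"],
    ["Stanford"],
    ["HKUST", "MIT"],
]
unique, first, fractional = count_affiliations(papers)
```

Under fractional counting the per-institution totals sum to the number of papers, which makes it a convenient cross-check against the other two methods when comparing rankings.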