Scraping editorial board data from academic publisher websites.
OpenEditors collects structured data about journal editors (names, roles, affiliations, countries, and sometimes ORCID IDs) from the websites of academic publishers. It matches raw affiliation data against the ROR API to obtain harmonized affiliation and country names. It also tracks changes over time by recording when editors are added or removed.
(Note that to avoid false positives caused by scraping failures or infinite-scroll timeouts, a confirmation window is used: an editor is only recorded as added or removed after the same outcome is observed across three consecutive scrape runs.)
The resulting dataset is publicly accessible via a dashboard at openeditors.ooir.org, and a snapshot from March 2026 can be downloaded as a CSV file from Zenodo: https://doi.org/10.5281/zenodo.19108866. Earlier versions (from 2021 and 2022) are also archived on Zenodo: https://doi.org/10.5281/zenodo.4619374.
Andreas Nishikawa-Pacher, Tamara Heck, Kerstin Schoch, Open Editors: A dataset of scholarly journals' editorial board positions, Research Evaluation, Volume 32, Issue 2, April 2023, Pages 228–243. https://doi.org/10.1093/reseval/rvac037
.
├── 00_create_tables.py # Creates all required MySQL tables
├── 01_scrape_journals.py # Scrapes journal catalogues per publisher
├── 02_scrape_editors.py # Scrapes editorial boards & tracks changes
├── 03_postprocessing_relocatecountry.py # Extracts country from affiliation string
├── 04_postprocessing_inferror.py # Infers ROR IDs via the ROR affiliation API
├── 05_postprocessing_countrynames.py # Harmonises country name variants
├── fetch_browser.py # Playwright-based browser fetcher (human-like)
├── parsers/
│   ├── __init__.py # Re-exports all strategies & run_* functions
│   ├── sage.py
│   ├── frontiers.py
│   ├── elife.py
│   └── ... # One file per publisher
└── publisher_configs/
    ├── sage.yml
    ├── frontiers.yml
    └── ... # One YAML file per publisher
00_create_tables.py creates the MySQL schema, including:
- publishers: publisher names
- journals: journal catalogue (title, URL, publisher key)
- editors: editor records with timestamps (first_seen_at, last_seen_at, removed_at)
- scrape_runs: audit log of every scrape run
- pending_changes: staging table for the "confirmation window"
01_scrape_journals.py scrapes the journal catalogue for a given publisher and populates the journals table. Pagination, field selectors, and URL patterns are all defined in the publisher's YAML config.
python 01_scrape_journals.py --publisher sage
02_scrape_editors.py is the main scraping script. For each journal, it fetches the editorial board page, parses the HTML, and reconciles the result against the current database state.
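The reconciliation step can be pictured as a set comparison between what was just scraped and what is already stored. The following is a minimal sketch, not the code in 02_scrape_editors.py: the function name reconcile, the (full_name, role) key, and the sample editors are all assumptions made for illustration.

```python
from __future__ import annotations


def reconcile(
    scraped: set[tuple[str, str]],
    current: set[tuple[str, str]],
) -> dict[str, set[tuple[str, str]]]:
    """Compare freshly scraped (full_name, role) pairs with the stored ones."""
    return {
        "added": scraped - current,      # seen now, not yet in the database
        "removed": current - scraped,    # in the database, missing from this scrape
        "unchanged": scraped & current,  # present on both sides
    }


# Made-up sample data standing in for one journal's editorial board.
current = {("Ada Example", "Editor-in-Chief"), ("Ben Sample", "Associate Editor")}
scraped = {("Ada Example", "Editor-in-Chief"), ("Cy Placeholder", "Associate Editor")}

diff = reconcile(scraped, current)
```

In this sketch, the "added" and "removed" sets are only candidates; they would still have to pass the confirmation window before being committed.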
python 02_scrape_editors.py --publisher frontiers
python 02_scrape_editors.py --publisher sage --limit 50 --startingjournal 10
python 02_scrape_editors.py --publisher imedpub --debug
Options:

| Flag | Description |
|---|---|
| --publisher | Publisher key (required); must match a file in publisher_configs/ |
| --limit | Maximum number of journals to process |
| --startingjournal | 1-based index to start from (useful for resuming) |
| --debug | Enables debug logging and saves screenshots when no editors are found |
Editors are not immediately marked as added or removed on a single scrape. Instead, candidate changes are written to a pending_changes staging table and only committed to editors after appearing consistently in three consecutive runs for the same journal. This prevents transient network failures or incomplete infinite-scroll loads from generating spurious churn in the data.
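The streak logic described above can be sketched in a few lines. This is an illustrative in-memory model only: the class PendingChanges, its method observe, and the sample journal and editor names are hypothetical, and the real project keeps this state in the pending_changes MySQL table rather than in a dictionary.

```python
from __future__ import annotations

from collections import defaultdict

CONFIRMATION_STREAK = 3  # same value as the default described above


class PendingChanges:
    """In-memory stand-in for the pending_changes staging table (hypothetical)."""

    def __init__(self, streak: int = CONFIRMATION_STREAK) -> None:
        self.streak = streak
        # (journal, editor_name, change_type) -> number of consecutive runs seen
        self.counts: dict[tuple[str, str, str], int] = defaultdict(int)

    def observe(self, journal: str, candidates: set[tuple[str, str]]) -> set[tuple[str, str]]:
        """Record one scrape run for a journal and return the confirmed changes.

        candidates holds the (editor_name, change_type) pairs seen in this run.
        """
        # A pending change that did not reappear in this run loses its streak.
        stale = [
            key for key in self.counts
            if key[0] == journal and (key[1], key[2]) not in candidates
        ]
        for key in stale:
            del self.counts[key]

        confirmed = set()
        for name, change in candidates:
            key = (journal, name, change)
            self.counts[key] += 1
            if self.counts[key] >= self.streak:
                confirmed.add((name, change))
                del self.counts[key]  # committed, so no longer pending
        return confirmed


pending = PendingChanges()
run = {("Cy Placeholder", "added")}
results = [pending.observe("journal-a", run) for _ in range(3)]
```

Here the first two runs return an empty set and only the third returns the change. A run in which the candidate is missing resets its counter, which is what keeps a single failed page load from producing a spurious removal.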
The threshold is configurable per publisher via the YAML:
editors:
  confirmation_streak: 3 # default; override per publisher if needed
For editors where country IS NULL, 03_postprocessing_relocatecountry.py extracts the substring after the last comma in the affiliation field and writes it to country. (It targets the following publishers: APA, CUP, iMedPub, Longdom, SAGE, SCIRP, SciTechnol.)
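The "substring after the last comma" rule can be illustrated as follows. This is a minimal sketch with a hypothetical function name (country_from_affiliation) and made-up affiliation strings; the actual script works on database rows and may handle edge cases differently.

```python
from __future__ import annotations


def country_from_affiliation(affiliation: str | None) -> str | None:
    """Return the text after the last comma, or None if nothing usable is there."""
    if not affiliation or "," not in affiliation:
        return None
    # rsplit with maxsplit=1 cuts only at the last comma.
    candidate = affiliation.rsplit(",", 1)[1].strip()
    return candidate or None


examples = [
    "Department of Sociology, University of Vienna, Austria",
    "Example Institute",       # no comma: nothing to extract
    "Example Institute, ",     # trailing comma: empty candidate
]
extracted = [country_from_affiliation(text) for text in examples]
```

The heuristic assumes the affiliation ends with a country. Where it ends with something else, such as a city or a state abbreviation, the extracted value would not be a country, which is presumably why the step is restricted to selected publishers.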
python 03_postprocessing_relocatecountry.py
04_postprocessing_inferror.py calls the ROR affiliation API on raw affiliation strings to infer ROR IDs and canonical institution names.
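As a rough sketch of what such a lookup involves, the snippet below builds an affiliation query URL and picks the match that ROR flags as chosen. It makes several assumptions: the v2 endpoint URL, the helper names build_affiliation_url and chosen_ror_id, and a response in which each entry of items carries a chosen flag and an organization object with an id. The sample response is invented, its ROR IDs are placeholders, and no network request is made.

```python
from __future__ import annotations

from urllib.parse import urlencode

# Assumed endpoint; check the ROR documentation for the version in use.
ROR_AFFILIATION_ENDPOINT = "https://api.ror.org/v2/organizations"


def build_affiliation_url(affiliation: str) -> str:
    """Build the query URL for a raw affiliation string."""
    return f"{ROR_AFFILIATION_ENDPOINT}?{urlencode({'affiliation': affiliation})}"


def chosen_ror_id(response: dict) -> str | None:
    """Return the ROR ID of the match flagged as chosen, if there is one."""
    for item in response.get("items", []):
        if item.get("chosen"):
            return item.get("organization", {}).get("id")
    return None


# Invented response fragment with placeholder IDs, for illustration only.
sample_response = {
    "items": [
        {"chosen": False, "organization": {"id": "https://ror.org/placeholder-a"}},
        {"chosen": True, "organization": {"id": "https://ror.org/placeholder-b"}},
    ]
}

url = build_affiliation_url("University of Vienna, Austria")
ror_id = chosen_ror_id(sample_response)
```

Returning None when no match is flagged as chosen leaves ambiguous affiliations unresolved instead of guessing among the candidates.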
python 04_postprocessing_inferror.py
05_postprocessing_countrynames.py harmonises country name variants (e.g. "USA" → "United States", "UK" → "United Kingdom") to a consistent controlled vocabulary.
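A lookup table is one simple way to implement this kind of harmonisation. The sketch below uses only the two variants mentioned above; the names COUNTRY_VARIANTS and harmonise_country are hypothetical, and the real script presumably covers many more variants and operates on the database.

```python
from __future__ import annotations

# Variant (lower-cased) -> canonical name. Only the two examples from the text.
COUNTRY_VARIANTS = {
    "usa": "United States",
    "uk": "United Kingdom",
}


def harmonise_country(raw: str | None) -> str | None:
    """Map a known variant to its canonical name; leave other values unchanged."""
    if raw is None:
        return None
    cleaned = raw.strip()
    return COUNTRY_VARIANTS.get(cleaned.casefold(), cleaned)


harmonised = [harmonise_country(value) for value in ["USA", " uk ", "Austria", None]]
```

Unknown values pass through untouched apart from whitespace trimming, so the mapping can be extended incrementally without risking the loss of data.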
python 05_postprocessing_countrynames.py
Each publisher has a YAML file in publisher_configs/ and a Python parser in parsers/. The YAML controls journal catalogue scraping, editorial board URL patterns, HTML selectors, fetch mode, rate limiting, and pagination strategy. The parser contains the publisher-specific HTML parsing logic.
Example: publisher_configs/sage.yml (excerpt):
publisher:
  key: sage
  name: SAGE Journals
editors:
  url_template: "https://journals.sagepub.com/editorial-board/{journal_slug}"
parsing:
  strategy: sage
fetch:
  mode: browser
rate_limit:
  min_seconds_per_request: 5
Example: parsers/sage.py (excerpt):
def parse_sage(container_html: str) -> list[dict]:
    # returns a list of dicts with keys:
    # full_name, role, affiliation, country, raw_text
    ...
To add a new publisher, create both files and register the parser in parsers/__init__.py.
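To make the parser contract more concrete, here is a hedged sketch of a parser for an invented publisher together with one possible way of registering it. Everything in it is an assumption: the publisher key examplepress, the markup it parses, the STRATEGIES dictionary, and the register and run_strategy helpers. The actual registration mechanism in parsers/__init__.py may look quite different; only the five output keys are taken from the excerpt above.

```python
from __future__ import annotations

import re
from typing import Callable

REQUIRED_KEYS = ("full_name", "role", "affiliation", "country", "raw_text")

# Hypothetical registry: strategy name -> parser function.
STRATEGIES: dict[str, Callable[[str], list[dict]]] = {}


def register(name: str):
    """Decorator that stores a parser under its strategy name."""
    def decorator(func):
        STRATEGIES[name] = func
        return func
    return decorator


LIST_ITEM = re.compile(r"<li>(.*?)</li>", re.S)


@register("examplepress")
def parse_examplepress(container_html: str) -> list[dict]:
    """Parse invented markup of the form <li>Name | Role | Affiliation</li>."""
    editors = []
    for raw in LIST_ITEM.findall(container_html):
        parts = [part.strip() for part in raw.split("|")]
        if len(parts) != 3:
            continue  # skip items that do not follow the assumed layout
        name, role, affiliation = parts
        editors.append({
            "full_name": name,
            "role": role,
            "affiliation": affiliation,
            "country": None,  # left for the post-processing steps
            "raw_text": raw.strip(),
        })
    return editors


def run_strategy(name: str, container_html: str) -> list[dict]:
    """Run a registered parser and check that every record has the expected keys."""
    editors = STRATEGIES[name](container_html)  # KeyError if not registered
    for editor in editors:
        missing = [key for key in REQUIRED_KEYS if key not in editor]
        if missing:
            raise ValueError(f"{name}: record is missing keys {missing}")
    return editors


sample_html = "<ul><li>Ada Example | Editor-in-Chief | Example University</li><li>n/a</li></ul>"
editors = run_strategy("examplepress", sample_html)
```

Checking the keys in one shared place would let the rest of the pipeline rely on a uniform record shape regardless of which publisher-specific parser produced it.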
- Python 3.11+
- MySQL 8+
- Playwright (for browser-mode fetching)
Create a .env file in the project root:
DB_NAME=openeditors
DB_HOST=localhost
DB_PORT=3306
DB_USER=your_db_user
DB_PASSWORD=your_db_password
ROR_CLIENT_ID=your_ror_api_client_id
Then create the database tables:
python 00_create_tables.py
The following publishers are currently supported. Each has a corresponding YAML config and parser:
- Allied Academies (probably predatory)
- APA
- ASCE
- CUP (Cambridge University Press)
- eLife
- Emerald
- Frontiers
- iMedPub (probably predatory)
- Inderscience
- Karger
- Longdom (probably predatory)
- OMICS (probably predatory)
- PeerJ
- Pleiades
- PLOS
- RSC
- SAGE
- SCIRP (probably predatory)
- SciTechnol (probably predatory)
- Springer Nature
Claude (Sonnet 4.6) helped a lot in writing the code and in drafting this README file.
Andreas Nishikawa-Pacher · andreas.pacher@da-vienna.at