
Scholar API

This project runs a cron job on a home server that scrapes Google Scholar data via the scholarly Python package, and uses Google Sheets to look up each journal's impact factor (IF) and each publication's DOI.

The resulting JSON can then be uploaded to a publicly accessible server (e.g. via scp or rsync), where a Flask application serves it.

Meltwater is a news-gathering tool used by some universities; see also Isentia Mediaportal. See also Zotero, an open-source citation manager.


Installation

Prerequisites

  • Python 3.10+ (required by Scrapling)
  • Flask
  • scholarly
  • wikipedia

Setup

  1. Clone this repository and change into it.

  2. Create a virtual environment and install the required packages.

    git clone https://github.com/Luen/scholarly-api
    cd scholarly-api
    python -m venv scholar
    source scholar/bin/activate
    pip install -r requirements.txt

  3. Test run.

    python main.py ynWS968AAAAJ

Docker

The stack includes:

  • web – Flask API serving scholar data
  • cron – Runs the main scraper and DOI metrics revalidation on a schedule

Cron schedule (in cron/Dockerfile): main scholar pipeline at 00:00 every 14 days; DOI metrics revalidation (Crossref / Altmetric / Google Scholar cache) at 02:00 daily.

Build the base image first (web and cron use it; the base container exits immediately):

docker compose build base

Then start all services (or build and start in one go):

docker compose up -d
# or, to (re)build everything: docker compose build base && docker compose up -d --build

For browser-based DOI fetching on sites that block plain HTTP, the project uses Scrapling. Install browser dependencies with scrapling install if you use that path. News aggregation uses RSS, NewsAPI, the Guardian API, Newspaper4k, and other sources.

Caching

HTTP responses are cached with requests-cache in cache/ (SQLite). This includes:

  • Scholarly (Google Scholar) requests
  • DOI API requests (doi.org, shortdoi.org)
  • Scrapling (browser-fetched HTML for DOI extraction)
  • Web page fetches for DOI extraction

Set CACHE_DIR to change the cache location and CACHE_EXPIRE_SECONDS (default: 30 days) to control expiry.
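The project uses requests-cache for this; the behavior it provides can be illustrated with a stdlib-only stand-in (an SQLite file under CACHE_DIR whose entries expire after CACHE_EXPIRE_SECONDS — a sketch, not the actual caching code):

```python
import os
import sqlite3
import time

CACHE_DIR = os.environ.get("CACHE_DIR", "cache")
CACHE_EXPIRE_SECONDS = int(os.environ.get("CACHE_EXPIRE_SECONDS", 30 * 24 * 3600))

def _conn():
    # One SQLite file holds all cached responses, keyed by URL.
    os.makedirs(CACHE_DIR, exist_ok=True)
    conn = sqlite3.connect(os.path.join(CACHE_DIR, "http_cache.sqlite"))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (url TEXT PRIMARY KEY, body TEXT, fetched_at REAL)"
    )
    return conn

def cache_get(url):
    """Return the cached body for url, or None if absent or expired."""
    row = _conn().execute(
        "SELECT body, fetched_at FROM cache WHERE url = ?", (url,)
    ).fetchone()
    if row is None or time.time() - row[1] > CACHE_EXPIRE_SECONDS:
        return None
    return row[0]

def cache_set(url, body):
    """Store (or refresh) a response body for url."""
    conn = _conn()
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)", (url, body, time.time()))
    conn.commit()
```

requests-cache wraps this pattern transparently around requests, so scholarly, DOI, and Scrapling fetches hit the network only on a cache miss or after expiry.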

Wait for the containers to be ready (check status with docker compose ps), then run the script manually:

# First check if containers are ready
docker compose ps

# If containers are running, execute the script
docker compose exec cron python main.py ynWS968AAAAJ

# If you get a "container is restarting" error, check logs
docker compose logs web

Testing SOCKS5 proxies

If you use SOCKS5_PROXIES (see .env.template), you can test each proxy from inside the web container:

docker exec scholar_web python -c "
import os, requests
from urllib.parse import quote
raw = os.environ.get('SOCKS5_PROXIES', '').strip()
if not raw:
    print('No SOCKS5_PROXIES set'); exit(0)
for i, entry in enumerate([p.strip() for p in raw.replace(';', chr(10)).splitlines() if p.strip()]):
    parts = entry.split('|', 2)
    if len(parts) < 3:
        print(f'Proxy {i+1}: invalid format'); continue
    host_port, user, passw = parts[0].strip(), parts[1].strip(), parts[2].strip()
    url = 'socks5://' + quote(user, safe='') + ':' + quote(passw, safe='') + '@' + host_port
    try:
        r = requests.get('https://api.altmetric.com/v1/doi/10.1038/nature.2014.14950', proxies={'http': url, 'https': url}, timeout=15)
        r.raise_for_status()
        print(f'Proxy {i+1} ({host_port}): OK status={r.status_code}')
    except Exception as e:
        print(f'Proxy {i+1} ({host_port}): FAIL - {e}')
"
To run the same check against the Altmetric details page instead of the API:

docker exec scholar_web python -c "
import os, requests
from urllib.parse import quote
raw = os.environ.get('SOCKS5_PROXIES', '').strip()
if not raw:
    print('No SOCKS5_PROXIES set'); exit(0)
for i, entry in enumerate([p.strip() for p in raw.replace(';', chr(10)).splitlines() if p.strip()]):
    parts = entry.split('|', 2)
    if len(parts) < 3:
        print(f'Proxy {i+1}: invalid format'); continue
    host_port, user, passw = parts[0].strip(), parts[1].strip(), parts[2].strip()
    url = 'socks5://' + quote(user, safe='') + ':' + quote(passw, safe='') + '@' + host_port
    try:
        r = requests.get('https://www.altmetric.com/details/doi/10.1038/s41586-025-08665-0', proxies={'http': url, 'https': url}, timeout=30)
        print(f'Proxy {i+1} ({host_port}): OK status={r.status_code}')
    except Exception as e:
        print(f'Proxy {i+1} ({host_port}): FAIL - {e}')
"

Test whether each proxy is blocked by Google Scholar (CAPTCHA / "unusual traffic"). This script only uses SOCKS5_PROXIES (it does not use Tor). The revalidation app tries Tor first, then SOCKS5, so if Tor is blocked you will see "blocked on all proxies" even when the SOCKS5 proxies work in this test. To use SOCKS5 only in the app, unset TOR_PROXY in .env.

docker exec scholar_web python -c "
import os, requests
from urllib.parse import quote
raw = os.environ.get('SOCKS5_PROXIES', '').strip()
if not raw:
    print('No SOCKS5_PROXIES set'); exit(0)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://scholar.google.com/',
}
block_signals = ('captcha', 'recaptcha', 'unusual traffic', 'automated queries', 'our systems have detected', 'sorry, we have detected')
scholar_url = 'https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=' + quote('10.1111/1365-2435.70147') + '&btnG='
for i, entry in enumerate([p.strip() for p in raw.replace(';', chr(10)).splitlines() if p.strip()]):
    parts = entry.split('|', 2)
    if len(parts) < 3:
        print(f'Proxy {i+1}: invalid format'); continue
    host_port, user, passw = parts[0].strip(), parts[1].strip(), parts[2].strip()
    url = 'socks5://' + quote(user, safe='') + ':' + quote(passw, safe='') + '@' + host_port
    try:
        r = requests.get(scholar_url, headers=headers, proxies={'http': url, 'https': url}, timeout=30)
        lower = r.text.lower()
        blocked = any(s in lower for s in block_signals) or 'scholar.google.com' not in (r.url or '')
        if blocked:
            print(f'Proxy {i+1} ({host_port}): BLOCKED (CAPTCHA or rate limit)')
        else:
            print(f'Proxy {i+1} ({host_port}): OK (not blocked) status={r.status_code}')
    except Exception as e:
        print(f'Proxy {i+1} ({host_port}): FAIL - {e}')
"

The app tries TOR_PROXY first (up to 3 attempts), then each SOCKS5 proxy in order. If Tor is blocked (common), you will see three blocked warnings followed by one more per SOCKS5 proxy; if the SOCKS5 proxies also return blocked (e.g. a different query or rate limit), you will see "blocked on all 7" (with four SOCKS5 proxies). To try SOCKS5 only: unset TOR_PROXY in .env and re-run revalidation. To see why a response was treated as blocked, run with LOG_LEVEL=DEBUG and check logs for "Blocked response: url=...".
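The attempt order described above (Tor first, then each SOCKS5 proxy) can be sketched as a generator. The names and the `host|user|pass` parsing mirror the test scripts above; this is an illustration, not the app's actual code:

```python
import os

def proxy_attempt_order(tor_attempts=3):
    """Yield proxy URLs in the order the revalidation app tries them:
    TOR_PROXY first (up to tor_attempts times), then each SOCKS5 proxy."""
    tor = os.environ.get("TOR_PROXY", "").strip()
    if tor:
        for _ in range(tor_attempts):
            yield tor
    raw = os.environ.get("SOCKS5_PROXIES", "").strip()
    # Entries are separated by ';' or newlines, each "host:port|user|pass".
    for entry in [p.strip() for p in raw.replace(";", "\n").splitlines() if p.strip()]:
        parts = entry.split("|", 2)
        if len(parts) != 3:
            continue  # skip malformed entries
        host_port, user, passw = (x.strip() for x in parts)
        yield f"socks5://{user}:{passw}@{host_port}"
```

Unsetting TOR_PROXY therefore removes the first three attempts and goes straight to the SOCKS5 list, which is why that is the suggested workaround when Tor is blocked. (Unlike the test scripts, this sketch does not URL-quote credentials.)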

Revalidating DOI metrics cache

Refreshes Crossref, Altmetric, and Google Scholar cache. Runs daily at 02:00 in the cron container. DOIs are read from scholar_data (all publications with a DOI):

docker exec scholar_web python -u scripts/revalidate_doi_metrics.py

  • Phase 1 (every run): refetches DOIs with no cache or with a blocked/warning cache (missing or previously failed), so they are retried on each daily run.
  • Phase 2: revalidates DOIs whose successful cache is older than a week.

Neither phase uses force_refresh, so if a request is blocked the existing cache is kept. Run manually after fixing proxies to fill in missing DOIs.

To remove DOIs from the data and cache so they no longer appear or get refetched, run scripts/remove_dois_from_data_and_cache.py DOI [DOI ...] (e.g. docker exec scholar_web python scripts/remove_dois_from_data_and_cache.py 10.1093/conphys/coab030 10.14288/1.0071389).
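The per-DOI decision behind the two phases amounts to a small pure function. The field names below are hypothetical, not the actual cache schema:

```python
import time

WEEK = 7 * 24 * 3600

def needs_refetch(entry, now=None):
    """Decide whether a DOI's cached metrics should be refetched.

    entry: None (no cache) or a dict like {"ok": bool, "fetched_at": epoch_seconds}.
    Returns "phase1", "phase2", or None (fresh successful cache)."""
    now = time.time() if now is None else now
    if entry is None or not entry.get("ok"):
        return "phase1"  # missing or blocked/warning cache: retry every run
    if now - entry["fetched_at"] > WEEK:
        return "phase2"  # successful cache older than a week: revalidate
    return None          # fresh successful cache: leave alone
```

Because neither phase forces a refresh, a blocked response simply leaves the entry in the "phase1" bucket for the next daily run.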

Project structure

  • main.py – Orchestration only: loads config, runs pipeline, handles idempotency
  • src/scholar_fetcher.py – Author, coauthors, publications from scholarly (with retries)
  • src/doi_resolver.py – DOI lookup and resolution (with retries)
  • src/output.py – Load/save JSON, schema_version, last_fetched, resume indices
  • src/config.py – Config loaded from .env
  • src/retry.py – Retry decorator with exponential backoff
  • src/logging_config.py – Structured logging (text or JSON)

Output JSON includes schema_version, last_fetched, and _last_successful_*_index for resume support.
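src/retry.py is described as a retry decorator with exponential backoff; a minimal stand-in looks like the following (the signature is hypothetical — the real module presumably reads MAX_RETRIES and RETRY_BASE_DELAY from config):

```python
import functools
import time

def retry(max_retries=3, base_delay=1.0):
    """Retry a function on exception with exponential backoff:
    delays of base_delay, 2*base_delay, 4*base_delay, ... between attempts."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_retries - 1:
                        raise  # out of attempts: propagate the last error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Applied to the scholarly and DOI calls, this smooths over transient rate limits without hammering the upstream services.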

Development

Linting and formatting

This project uses Ruff for linting and formatting. Run after code changes:

ruff check . --fix && ruff format .

Testing

pip install -r requirements.txt
pytest tests/ -v

Tests marked integration require network access. Tests that need google-credentials.json or Scrapling browsers will skip when unavailable. Run lint and format before committing:

ruff check . && ruff format --check .

Starting the Flask server

Navigate to the project directory and run:

python server.py

Or with Docker:

docker compose up web cron -d --build

To reset the cache volume and rebuild from scratch:

docker compose down; docker volume rm scholar_cache; docker compose up web cron -d --build

The API is available at http://localhost:8000 (Docker maps 8000→5000).

API Endpoints

URL                       Method  Description
/                         GET     Welcome message
/health                   GET     Health check
/scholars                 GET     List available scholar IDs
/scholar/<id>             GET     Get scholar data by ID (e.g. /scholar/ynWS968AAAAJ)
/altmetric/<doi>          GET     Altmetric score for a DOI (cached 2 weeks); 401 if not Rummer/Bergseth/Wu
/scholar-citations/<doi>  GET     Google Scholar citation count for a DOI (cached 2 weeks); 401 if not Rummer/Bergseth/Wu
/crossref/<doi>           GET     Crossref works API data for a DOI (cached 1 month); 404 if not found
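Once the web container is up, the routes can be exercised from Python. A tiny hypothetical helper for building the URLs (BASE_URL assumes the Docker port mapping above; DOIs keep their slashes as path segments):

```python
BASE_URL = "http://localhost:8000"  # Docker maps 8000 -> 5000; use 5000 for server.py directly

def endpoint(*parts):
    """Join path parts onto the API base, e.g. endpoint("scholar", "ynWS968AAAAJ")."""
    return "/".join([BASE_URL, *parts])

scholar_url = endpoint("scholar", "ynWS968AAAAJ")
crossref_url = endpoint("crossref", "10.1093/conphys/coab030")
```

With the server running, something like requests.get(crossref_url, timeout=10).json() should return the cached Crossref record, or a 404 if the DOI is unknown.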

Environment variables

  • FLASK_HOST – Bind host (default: 0.0.0.0)
  • FLASK_PORT – Bind port (default: 5000)
  • SCHOLAR_DATA_DIR – Path to scholar JSON files (default: scholar_data)
  • CACHE_DIR – HTTP cache directory (default: cache)
  • CACHE_EXPIRE_SECONDS – Cache expiry (default: 30 days)
  • FRESH_DATA_SECONDS – Skip full fetch if data is newer (default: 7 days)
  • MAX_RETRIES, RETRY_BASE_DELAY – Retry settings for Scholar/DOI APIs
  • COAUTHOR_DELAY, PUBLICATION_DELAY – Rate limiting (seconds)
  • LOG_FORMAT – Set to json for structured JSON logs (e.g. in Docker)
  • NEWS_API_ORG_KEY, THE_GUARDIAN_API_KEY – For news aggregation (see .env.template)
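src/config.py presumably resolves these with defaults along the lines of this sketch (env_int is a hypothetical helper, not the actual code):

```python
import os

def env_int(name, default):
    """Read an integer setting from the environment, falling back to default."""
    raw = os.environ.get(name, "").strip()
    return int(raw) if raw else default

FLASK_HOST = os.environ.get("FLASK_HOST", "0.0.0.0")
FLASK_PORT = env_int("FLASK_PORT", 5000)
CACHE_EXPIRE_SECONDS = env_int("CACHE_EXPIRE_SECONDS", 30 * 24 * 3600)  # 30 days
FRESH_DATA_SECONDS = env_int("FRESH_DATA_SECONDS", 7 * 24 * 3600)       # 7 days
```

Centralizing the defaults this way keeps main.py, server.py, and the cron scripts in agreement about cache and freshness windows.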

About

JSON data of published papers with DOIs and Journal Impact Factors for professors
