
CPU Loop and memory leak triggered by circular paleoContext -> localities relationship #8064

@grantfitzsimmons

Description


Describe the bug
This was reported by Tomislav Urban at UT Austin in BoldDesk ticket #892:

When users in UT Austin paleo collections (specifically those linking paleoContext through locality.PaleoContextID) run stored queries via the stored_query/ephemeral/ endpoint, Gunicorn workers enter a CPU-bound loop in Python and never send the query to the database. Each request leaks approximately 20 MiB of memory, and the workers eventually time out and are SIGKILLed. During these hangs the database is fully idle, with zero active queries, transactions, or locks.
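For illustration only, here is a minimal sketch (not the actual specifyweb code) of how a join traversal with no visited-edge check spins forever on such a cycle; the table names mirror the warnings quoted below, but the graph is a toy:

```python
from collections import deque

# Toy relationship graph; the localities back-reference closes the cycle.
SCHEMA = {
    "collectingEvent": ["locality"],
    "locality": ["paleoContext"],
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],
}

def collect_join_paths(root):
    paths = []
    queue = deque([(root,)])
    while queue:  # never drains once the cycle is reached
        path = queue.popleft()
        for related in SCHEMA.get(path[-1], ()):
            new_path = path + (related,)  # each lap allocates a longer tuple,
            paths.append(new_path)        # so memory grows without bound
            queue.append(new_path)
    return paths

# collect_join_paths("collectingEvent") never returns: it cycles through
# paleoContext -> localities -> paleoContext forever, burning CPU while
# making zero system calls -- the exact profile strace captured.
```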

To Reproduce
Steps to reproduce the behavior:

  1. Open a paleo collection in the UT Austin database (e.g., Vertebrate Paleontology) where paleoContext records are linked specifically through locality.PaleoContextID.
  2. Run a stored query that utilizes the stored_query/ephemeral/ endpoint.
  3. Monitor the Gunicorn worker processes.
  4. See error: the worker enters a CPU-bound loop (strace shows zero system calls), memory climbs by ~20 MiB, no SQL reaches the MariaDB backend, and the request eventually ends in a WORKER TIMEOUT. (A stack-dump hook such as the sketch below can confirm where the worker is spinning.)
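To see exactly where a stuck worker is spinning, a stdlib faulthandler hook can be added at worker startup (a debugging sketch; where to wire it in is deployment-specific):

```python
import signal
import faulthandler

# Dump every thread's Python stack to stderr when the worker receives
# SIGUSR1, e.g. `kill -USR1 <stuck worker pid>` inside the container.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump automatically whenever execution runs past 60 seconds:
# faulthandler.dump_traceback_later(60, exit=False)
```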

Expected behavior
The Python query builder should successfully traverse the paleoContext -> localities relationship without entering an infinite loop or leaking memory, allowing the SQL query to be promptly generated and executed against the database.

Screenshots
N/A (The user captured SHOW FULL PROCESSLIST, docker top, and an strace confirming zero system calls and a completely idle database during the application hangs).

Crash Report

  • OS: Linux (Docker Deployment)
  • Browser: N/A
  • Specify 7 Version: v7.12.0.5
  • Database Name: N/A
  • Collection name: VPL (Vertebrate Paleontology)
  • User Name: N/A
  • URL: stored_query/ephemeral/ (Internal Server Errors also reported on /report_runner/run/)

Reported By
Tomislav Urban

Additional context
Original Ticket: https://specifysoftware.bolddesk.com/support/tickets/892

Summary: When VPL users run stored queries via the stored_query/ephemeral/ endpoint, Gunicorn workers occasionally enter a CPU-bound loop in Python and never send a query to the database. The worker burns CPU for the full timeout period and gets SIGKILL'd.

Evidence:

DATABASE IS IDLE DURING HANGS
We captured SHOW FULL PROCESSLIST, INNODB_TRX, INNODB_LOCK_WAITS, and SHOW ENGINE INNODB STATUS three separate times while the user was actively experiencing a hang. In every capture, all connections were sleeping: zero active queries, zero transactions, zero locks, zero pending I/O, and a buffer pool hit rate of 1000/1000.
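For reference, this idle-database check can be scripted; a sketch assuming pymysql is installed, with placeholder credentials:

```python
import pymysql  # assumption: pymysql is available; credentials are placeholders

conn = pymysql.connect(host="localhost", user="specify",
                       password="...", database="specify")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    rows = cur.fetchall()
    # Columns: Id, User, Host, db, Command, Time, State, Info
    active = [r for r in rows
              if r[4] != "Sleep" and r[0] != conn.thread_id()]
    print(f"{len(rows)} connections, {len(active)} active")
    # During every captured hang this would print "N connections, 0 active":
    # the stall is in Python, not in MariaDB.
conn.close()
```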

WORKERS ARE CPU-BOUND IN PYTHON
During hangs, Gunicorn workers show 13-20% CPU each via docker top. An strace on a stuck worker produced zero system calls — pure CPU computation. Workers never reach the point of sending SQL to the database.

MEMORY LEAK (~20 MiB PER REQUEST)
VPL workers leak approximately 20 MiB per request. After a fresh container restart, a single query takes worker memory from 15 MiB to 713 MiB, and it then climbs ~20 MiB with each subsequent request. Other collections on the same server (PRC, herps, entomology) do not exhibit this; their long-running workers stay under 300 MiB each.
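The stdlib tracemalloc module can attribute this growth to specific allocation sites. A sketch; the report_growth hook is hypothetical and would need to be called from whatever request-handling code is convenient:

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames so leak sites are identifiable
baseline = tracemalloc.take_snapshot()

def report_growth():
    """Call once per request: prints the five fastest-growing allocation
    sites since the previous call. A true per-request leak shows the
    same ~20 MiB site on every report."""
    global baseline
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:5]:
        print(stat)
    baseline = snapshot
```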

VPL-SPECIFIC DUE TO PALEO RELATIONSHIPS
Docker logs show these warnings on every record save, unique to paleo collections:

  • "circular/redundant relationship paleoContext -> localities found in data"
  • "circular/redundant relationship collectionObjectAttribute -> collectionobjects found in data"
  • "circular/redundant relationship collectingEvent -> collectionobjects found in data"

We believe the circular relationship traversal in the query builder is what causes both the memory leak and the occasional infinite (or near-infinite) loop.
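If that hypothesis holds, the fix direction is a cycle guard in the traversal. A minimal sketch of the idea (again a toy, not the project's actual code):

```python
# Same toy cycle as above: paleoContext <-> localities.
SCHEMA = {
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],
}

def collect_join_paths_guarded(root):
    paths = []
    queue = [((root,), frozenset())]
    while queue:
        path, seen = queue.pop()
        for related in SCHEMA.get(path[-1], ()):
            edge = (path[-1], related)
            if edge in seen:   # cycle guard: walk each edge once per path
                continue
            paths.append(path + (related,))
            queue.append((path + (related,), seen | {edge}))
    return paths

# Terminates immediately with
# [("paleoContext", "localities"),
#  ("paleoContext", "localities", "paleoContext")]:
# every relationship is still reachable, but the loop cannot recur.
```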

INTERMITTENT BUT REPRODUCIBLE
The same query can succeed quickly one time and hang the next. This is consistent with a memory-dependent trigger: workers behave normally when fresh but become vulnerable as memory accumulates.

Mitigation in place:
We've configured Gunicorn with --max-requests 200 --max-requests-jitter 50 to auto-recycle workers before memory bloats. This reduces the frequency (one worker timeout instead of all workers cascading), but doesn't eliminate the problem. We still see occasional WORKER TIMEOUT events.
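For deployments driven by a config file rather than CLI flags, the equivalent gunicorn.conf.py settings are:

```python
# gunicorn.conf.py -- recycle each worker after ~150-250 requests so leaked
# memory is reclaimed before it can trigger cascading worker timeouts.
max_requests = 200
max_requests_jitter = 50  # stagger recycling so workers don't restart at once
```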

Environment:

  • Specify 7 v7.12.0.5
  • Containerized deployment (Docker)
  • MariaDB backend
  • The failing endpoint is stored_query/ephemeral/
  • The relevant code path appears to be in specifyweb.backend.stored_queries
  • Also seeing Internal Server Error on /report_runner/run/, which may be related

Comparative Analysis & Root Cause Hypothesis:
The user conducted a comparative analysis between their VPL and NPL collections on the same server infrastructure. The issue appears entirely dependent on the specific circular relationship path used for paleoContext:

  • VPL (Severe Hangs): 4,583 paleoContext records, with 4,568 linked through locality.PaleoContextID.
  • NPL (Rare Hangs): 88,562 paleoContext records, with 88,475 linked through collectingevent.PaleoContextID.

The locality cycle (collectingEvent -> locality -> paleoContext -> localities -> ...) appears to trigger a tighter loop in the ORM's relationship graph than the collectingEvent cycle, producing the CPU-bound loop in Python.
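The "tighter loop" claim can be made concrete with a toy check: the VPL wiring contains a two-edge cycle (paleoContext <-> localities), while in the NPL wiring the shortest route back around is longer. The edge sets below are illustrative, not the full Specify schema:

```python
# Illustrative join-graph subsets for the two wirings (not the full schema).
VPL_EDGES = {
    "collectingEvent": ["locality"],
    "locality": ["paleoContext"],
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],           # two-edge cycle
}
NPL_EDGES = {
    "collectingEvent": ["paleoContext"],
    "paleoContext": ["collectingEvents"],
    "collectingEvents": ["collectingEvent"],  # longer route back around
}

def shortest_cycle(graph):
    """BFS from every node; returns the fewest edges in any cycle."""
    best = None
    for start in graph:
        frontier, seen, depth = [start], set(), 0
        while frontier:
            depth += 1
            nxt = []
            for node in frontier:
                for neighbor in graph.get(node, ()):
                    if neighbor == start:
                        best = depth if best is None else min(best, depth)
                    elif neighbor not in seen:
                        seen.add(neighbor)
                        nxt.append(neighbor)
            frontier = nxt
    return best

print(shortest_cycle(VPL_EDGES))  # 2 -- a naive walk laps this very quickly
print(shortest_cycle(NPL_EDGES))  # 3 in this toy graph
```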

Docker logs in the affected environment continually emit the same three circular/redundant relationship warnings listed above on every record save.

Temporary Mitigation: The institution currently runs Gunicorn with --max-requests 200 --max-requests-jitter 50 to aggressively auto-recycle workers before the ~20 MiB/request memory leak triggers cascading timeouts.


Labels

  • 2 - Queries: Issues that are related to the query builder or queries in general
  • 4 - Performance: Issues related to performance, concurrency, and optimization
