
CPU Loop and memory leak triggered by circular paleoContext -> localities relationship #8064

@grantfitzsimmons

Description


Describe the bug
This was reported by Tomislav Urban at UT Austin in BoldDesk ticket #892:

When users in UT Austin paleo collections (specifically those linking paleoContext through locality.PaleoContextID) run stored queries via the stored_query/ephemeral/ endpoint, Gunicorn workers enter a CPU-bound loop in Python and never send the query to the database. Each request leaks approximately 20 MiB of memory, and the workers eventually time out and are SIGKILLed. During these hangs the database is fully idle, with zero active queries, transactions, or locks.
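For illustration only, here is a minimal sketch (not the actual specifyweb code) of how a join traversal with no visited-edge check spins forever on such a cycle; the table names mirror the warnings quoted below, but the graph is a toy:

```python
from collections import deque

# Toy relationship graph; the localities back-reference closes the cycle.
SCHEMA = {
    "collectingEvent": ["locality"],
    "locality": ["paleoContext"],
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],
}

def collect_join_paths(root):
    paths = []
    queue = deque([(root,)])
    while queue:  # never drains once the cycle is reached
        path = queue.popleft()
        for related in SCHEMA.get(path[-1], ()):
            new_path = path + (related,)  # each lap allocates a longer tuple,
            paths.append(new_path)        # so memory grows without bound
            queue.append(new_path)
    return paths

# collect_join_paths("collectingEvent") never returns: it cycles through
# paleoContext -> localities -> paleoContext forever, burning CPU while
# making zero system calls -- the exact profile strace captured.
```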

To Reproduce
Steps to reproduce the behavior:

  1. Open a paleo collection in the UT Austin database (e.g., Vertebrate Paleontology) where paleoContext records are linked specifically through locality.PaleoContextID.
  2. Run a stored query that utilizes the stored_query/ephemeral/ endpoint.
  3. Monitor the Gunicorn worker processes.
  4. See error: the worker enters a CPU-bound loop (strace shows zero system calls), memory climbs by ~20 MiB, no SQL reaches the MariaDB backend, and the request eventually ends in a WORKER TIMEOUT. (A stack-dump hook such as the sketch below can confirm where the worker is spinning.)
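To see exactly where a stuck worker is spinning, a stdlib faulthandler hook can be added at worker startup (a debugging sketch; where to wire it in is deployment-specific):

```python
import signal
import faulthandler

# Dump every thread's Python stack to stderr when the worker receives
# SIGUSR1, e.g. `kill -USR1 <stuck worker pid>` inside the container.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Or dump automatically whenever execution runs past 60 seconds:
# faulthandler.dump_traceback_later(60, exit=False)
```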

Expected behavior
The Python query builder should successfully traverse the paleoContext -> localities relationship without entering an infinite loop or leaking memory, allowing the SQL query to be promptly generated and executed against the database.

Screenshots
N/A (The user captured SHOW FULL PROCESSLIST, docker top, and an strace confirming zero system calls and a completely idle database during the application hangs).

Crash Report

  • OS: Linux (Docker Deployment)
  • Browser: N/A
  • Specify 7 Version: v7.12.0.5
  • Database Name: N/A
  • Collection name: VPL (Vertebrate Paleontology)
  • User Name: N/A
  • URL: stored_query/ephemeral/ (Internal Server Errors also reported on /report_runner/run/)

Reported By
Tomislav Urban

Additional context
Original Ticket: https://specifysoftware.bolddesk.com/support/tickets/892

Summary: When VPL users run stored queries via the stored_query/ephemeral/ endpoint, Gunicorn workers occasionally enter a CPU-bound loop in Python and never send a query to the database. The worker burns CPU for the full timeout period and gets SIGKILL'd.

Evidence:

DATABASE IS IDLE DURING HANGS
We captured SHOW FULL PROCESSLIST, INNODB_TRX, INNODB_LOCK_WAITS, and SHOW ENGINE INNODB STATUS three separate times while the user was actively experiencing a hang. In every capture, all connections were sleeping: zero active queries, zero transactions, zero locks, zero pending I/O, and a buffer pool hit rate of 1000/1000.
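For reference, this idle-database check can be scripted; a sketch assuming pymysql is installed, with placeholder credentials:

```python
import pymysql  # assumption: pymysql is available; credentials are placeholders

conn = pymysql.connect(host="localhost", user="specify",
                       password="...", database="specify")
with conn.cursor() as cur:
    cur.execute("SHOW FULL PROCESSLIST")
    rows = cur.fetchall()
    # Columns: Id, User, Host, db, Command, Time, State, Info
    active = [r for r in rows
              if r[4] != "Sleep" and r[0] != conn.thread_id()]
    print(f"{len(rows)} connections, {len(active)} active")
    # During every captured hang this would print "N connections, 0 active":
    # the stall is in Python, not in MariaDB.
conn.close()
```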

WORKERS ARE CPU-BOUND IN PYTHON
During hangs, Gunicorn workers show 13-20% CPU each via docker top. An strace on a stuck worker produced zero system calls — pure CPU computation. Workers never reach the point of sending SQL to the database.

MEMORY LEAK (~20 MiB PER REQUEST)
VPL workers leak approximately 20 MiB per request. After a fresh container restart, a single query takes worker memory from 15 MiB to 713 MiB, and it then climbs ~20 MiB with each subsequent request. Other collections on the same server (PRC, herps, entomology) do not exhibit this; their long-running workers stay under 300 MiB each.
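The stdlib tracemalloc module can attribute this growth to specific allocation sites. A sketch; the report_growth hook is hypothetical and would need to be called from whatever request-handling code is convenient:

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames so leak sites are identifiable
baseline = tracemalloc.take_snapshot()

def report_growth():
    """Call once per request: prints the five fastest-growing allocation
    sites since the previous call. A true per-request leak shows the
    same ~20 MiB site on every report."""
    global baseline
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.compare_to(baseline, "lineno")[:5]:
        print(stat)
    baseline = snapshot
```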

VPL-SPECIFIC DUE TO PALEO RELATIONSHIPS
Docker logs show these warnings on every record save, unique to paleo collections:

  • "circular/redundant relationship paleoContext -> localities found in data"
  • "circular/redundant relationship collectionObjectAttribute -> collectionobjects found in data"
  • "circular/redundant relationship collectingEvent -> collectionobjects found in data"

We believe the circular relationship traversal in the query builder is what causes both the memory leak and the occasional infinite (or near-infinite) loop.
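If that hypothesis holds, the fix direction is a cycle guard in the traversal. A minimal sketch of the idea (again a toy, not the project's actual code):

```python
# Same toy cycle as above: paleoContext <-> localities.
SCHEMA = {
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],
}

def collect_join_paths_guarded(root):
    paths = []
    queue = [((root,), frozenset())]
    while queue:
        path, seen = queue.pop()
        for related in SCHEMA.get(path[-1], ()):
            edge = (path[-1], related)
            if edge in seen:   # cycle guard: walk each edge once per path
                continue
            paths.append(path + (related,))
            queue.append((path + (related,), seen | {edge}))
    return paths

# Terminates immediately with
# [("paleoContext", "localities"),
#  ("paleoContext", "localities", "paleoContext")]:
# every relationship is still reachable, but the loop cannot recur.
```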

INTERMITTENT BUT REPRODUCIBLE
The same query can succeed quickly one time and hang the next. This is consistent with a memory-dependent trigger: workers behave normally when fresh but become vulnerable as memory accumulates.

Mitigation in place:
We've configured Gunicorn with --max-requests 200 --max-requests-jitter 50 to auto-recycle workers before memory bloats. This reduces the frequency (one worker timeout instead of all workers cascading), but doesn't eliminate the problem. We still see occasional WORKER TIMEOUT events.
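For deployments driven by a config file rather than CLI flags, the equivalent gunicorn.conf.py settings are:

```python
# gunicorn.conf.py -- recycle each worker after ~150-250 requests so leaked
# memory is reclaimed before it can trigger cascading worker timeouts.
max_requests = 200
max_requests_jitter = 50  # stagger recycling so workers don't restart at once
```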

Environment:

  • Specify 7 v7.12.0.5
  • Containerized deployment (Docker)
  • MariaDB backend
  • The failing endpoint is stored_query/ephemeral/
  • The relevant code path appears to be in specifyweb.backend.stored_queries
  • Also seeing Internal Server Error on /report_runner/run/, which may be related

Comparative Analysis & Root Cause Hypothesis:
The user conducted a comparative analysis between their VPL and NPL collections on the same server infrastructure. The issue appears entirely dependent on the specific circular relationship path used for paleoContext:

  • VPL (Severe Hangs): 4,583 paleoContext records, with 4,568 linked through locality.PaleoContextID.
  • NPL (Rare Hangs): 88,562 paleoContext records, with 88,475 linked through collectingevent.PaleoContextID.

The locality cycle (collectingEvent -> locality -> paleoContext -> localities -> ...) appears to trigger a tighter loop in the ORM's relationship graph than the collectingEvent cycle, producing the CPU-bound loop in Python.
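The "tighter loop" claim can be made concrete with a toy check: the VPL wiring contains a two-edge cycle (paleoContext <-> localities), while in the NPL wiring the shortest route back around is longer. The edge sets below are illustrative, not the full Specify schema:

```python
# Illustrative join-graph subsets for the two wirings (not the full schema).
VPL_EDGES = {
    "collectingEvent": ["locality"],
    "locality": ["paleoContext"],
    "paleoContext": ["localities"],
    "localities": ["paleoContext"],           # two-edge cycle
}
NPL_EDGES = {
    "collectingEvent": ["paleoContext"],
    "paleoContext": ["collectingEvents"],
    "collectingEvents": ["collectingEvent"],  # longer route back around
}

def shortest_cycle(graph):
    """BFS from every node; returns the fewest edges in any cycle."""
    best = None
    for start in graph:
        frontier, seen, depth = [start], set(), 0
        while frontier:
            depth += 1
            nxt = []
            for node in frontier:
                for neighbor in graph.get(node, ()):
                    if neighbor == start:
                        best = depth if best is None else min(best, depth)
                    elif neighbor not in seen:
                        seen.add(neighbor)
                        nxt.append(neighbor)
            frontier = nxt
    return best

print(shortest_cycle(VPL_EDGES))  # 2 -- a naive walk laps this very quickly
print(shortest_cycle(NPL_EDGES))  # 3 in this toy graph
```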

Docker logs in the affected environment continually emit the same three circular/redundant relationship warnings listed above on every record save.

Temporary Mitigation: The institution currently runs Gunicorn with --max-requests 200 --max-requests-jitter 50 to aggressively auto-recycle workers before the ~20 MiB/request memory leak triggers cascading timeouts.


Labels

  • 2 - Queries: Issues that are related to the query builder or queries in general
  • 4 - Performance: Issues related to performance, concurrency, and optimization
