Queued async events appear to be lost on normal CLI process exit
Summary
When a CLI process exits normally with async_backend_comm = true, some events that are still queued in the background sender appear not to arrive at APM Server after the process ends.
I reproduced this on Linux with agent version v1.16.0. The script reaches the end normally and exits with code 0.
This report is about the normal-exit loss path specifically. In the runs behind this report, I do not see queue-overflow errors explaining the missing data.
Version Scope
- My original reproduction is on
v1.16.0
- I have not yet retested this exact scenario on the latest
1.16.x
Environment
- PHP 7.3
- Elastic APM PHP agent
v1.16.0
- CLI SAPI
- Linux
async_backend_comm = true
server_timeout = 120s
Reproduction Pattern
The workload is a CLI script that:
- Loops over many units of work in a single process
- Creates one APM transaction per iteration
- Produces roughly 200 to 400 auto-instrumented DB spans per transaction
- Reaches the end of the script normally and exits with code
0
This is not a fatal-error, crash, or forced-kill scenario.
Observed Behavior
- The number of transactions that arrive at APM Server depends on how much the background sender flushes before the script ends.
- Events that are still queued when the process exits do not appear later.
- In the runs behind this report, I do not see queue-overflow errors such as:
Already queued events are above max queue
- With
dev_internal_backend_comm_log_verbose = true, the last backend-comm logs show a request body being uploaded:
CURLINFO_DATA_OUT: [10923 bytes of span data]
CURLINFO_TEXT: upload completely sent off: 10923 out of 10923 bytes
- I do not see the corresponding HTTP response afterwards, such as
202 Accepted.
Additional Evidence
1. DEBUG logs stop at request shutdown
With log_level_stderr = DEBUG, the last log line in one run was:
[22:24:57] [Lifecycle] [lifecycle.cpp:615] [elasticApmRequestShutdown] Exiting...
After that, I do not see log entries from:
elasticApmModuleShutdown
signalBackgroundBackendCommThreadToExit
unwindBackgroundBackendComm
I do not take that as proof that module shutdown never ran. It only means I do not have positive log evidence that the async shutdown path completed.
2. TRACE-level shared-state logs did not show the shutdown flag flipping
In a separate TRACE-level run, the last shared-state snapshot I captured from the background thread still showed shouldExit: false.
I do not think that is conclusive by itself, because the sender thread logs snapshots at loop boundaries. If shutdown starts while the thread is blocked in an in-flight HTTP send, the next snapshot may never be logged.
Why This Looks Unexpected
From reading the source, I expected queued events to keep draining during module shutdown, at least until the queue became empty or the background sender decided its shutdown deadline had expired:
PHP_MSHUTDOWN_FUNCTION(elastic_apm)
-> elasticApmModuleShutdown()
-> backgroundBackendCommOnModuleShutdown()
-> signalBackgroundBackendCommThreadToExit()
-> unwindBackgroundBackendComm()
The relevant pieces appear to be:
PHP_MSHUTDOWN_FUNCTION(elastic_apm) calls elasticApmModuleShutdown()
elasticApmModuleShutdown() calls backgroundBackendCommOnModuleShutdown()
signalBackgroundBackendCommThreadToExit() sets shouldExit = true
- it computes
shouldExitBy from server_timeout
- the worker loop keeps running until the queue is empty or the shutdown deadline has passed
- if the shutdown deadline is exceeded with queued data still pending, the worker logs that the remaining events will be dropped
One Linux-specific nuance in v1.16.0 may matter here:
unwindBackgroundBackendComm() passes the computed deadline into timedJoinAndDeleteThread()
- but the Linux implementation uses
pthread_join() rather than pthread_timedjoin_np()
So on Linux v1.16.0, server_timeout clearly affects the worker's internal shutdown deadline and curl timeout, but it is not a hard timeout for the join helper itself.
Question
Given the shutdown path above, I would expect queued async events to continue draining during normal CLI module shutdown until the queue becomes empty or the worker reaches its shutdown deadline.
What I am seeing looks like that expected drain path is not completing, or is starting too late to finish, in this normal-exit CLI case.
Is that interpretation correct?
If so, this seems like a shutdown/drain bug or a shutdown-while-POST-is-in-flight bug rather than expected behavior.
Suggested Investigation
- Retest this exact scenario on the latest
1.16.x
- Confirm whether
elasticApmApiEntered() can fail in PHP_MSHUTDOWN_FUNCTION(elastic_apm) for this CLI case
- Confirm whether
elasticApmModuleShutdown() reaches backgroundBackendCommOnModuleShutdown()
- Add temporary logs around:
PHP_MSHUTDOWN_FUNCTION(elastic_apm)
elasticApmModuleShutdown()
signalBackgroundBackendCommThreadToExit()
unwindBackgroundBackendComm()
- the return from
timedJoinAndDeleteThread()
- Correlate those logs with the final verbose curl output to see whether shutdown begins while the sender is still blocked in
curl_easy_perform()
- Confirm whether the existing async-shutdown timeout warning is emitted anywhere in this reproduction path
Additional Context
- I can provide a reduced reproduction script if useful
- This report is meant to describe the exit-time loss path specifically, not the separate queue-overflow path that logs
Already queued events are above max queue
Queued async events appear to be lost on normal CLI process exit
Summary
When a CLI process exits normally with
async_backend_comm = true, some events that are still queued in the background sender appear not to arrive at APM Server after the process ends.I reproduced this on Linux with agent version
v1.16.0. The script reaches the end normally and exits with code0.This report is about the normal-exit loss path specifically. In the runs behind this report, I do not see queue-overflow errors explaining the missing data.
Version Scope
v1.16.01.16.xEnvironment
v1.16.0async_backend_comm = trueserver_timeout = 120sReproduction Pattern
The workload is a CLI script that:
0This is not a fatal-error, crash, or forced-kill scenario.
Observed Behavior
dev_internal_backend_comm_log_verbose = true, the last backend-comm logs show a request body being uploaded:202 Accepted.Additional Evidence
1. DEBUG logs stop at request shutdown
With
log_level_stderr = DEBUG, the last log line in one run was:After that, I do not see log entries from:
elasticApmModuleShutdownsignalBackgroundBackendCommThreadToExitunwindBackgroundBackendCommI do not take that as proof that module shutdown never ran. It only means I do not have positive log evidence that the async shutdown path completed.
2. TRACE-level shared-state logs did not show the shutdown flag flipping
In a separate TRACE-level run, the last shared-state snapshot I captured from the background thread still showed
shouldExit: false.I do not think that is conclusive by itself, because the sender thread logs snapshots at loop boundaries. If shutdown starts while the thread is blocked in an in-flight HTTP send, the next snapshot may never be logged.
Why This Looks Unexpected
From reading the source, I expected queued events to keep draining during module shutdown, at least until the queue became empty or the background sender decided its shutdown deadline had expired:
The relevant pieces appear to be:
PHP_MSHUTDOWN_FUNCTION(elastic_apm)callselasticApmModuleShutdown()elasticApmModuleShutdown()callsbackgroundBackendCommOnModuleShutdown()signalBackgroundBackendCommThreadToExit()setsshouldExit = trueshouldExitByfromserver_timeoutOne Linux-specific nuance in
v1.16.0may matter here:unwindBackgroundBackendComm()passes the computed deadline intotimedJoinAndDeleteThread()pthread_join()rather thanpthread_timedjoin_np()So on Linux
v1.16.0,server_timeoutclearly affects the worker's internal shutdown deadline and curl timeout, but it is not a hard timeout for the join helper itself.Question
Given the shutdown path above, I would expect queued async events to continue draining during normal CLI module shutdown until the queue becomes empty or the worker reaches its shutdown deadline.
What I am seeing looks like that expected drain path is not completing, or is starting too late to finish, in this normal-exit CLI case.
Is that interpretation correct?
If so, this seems like a shutdown/drain bug or a shutdown-while-POST-is-in-flight bug rather than expected behavior.
Suggested Investigation
1.16.xelasticApmApiEntered()can fail inPHP_MSHUTDOWN_FUNCTION(elastic_apm)for this CLI caseelasticApmModuleShutdown()reachesbackgroundBackendCommOnModuleShutdown()PHP_MSHUTDOWN_FUNCTION(elastic_apm)elasticApmModuleShutdown()signalBackgroundBackendCommThreadToExit()unwindBackgroundBackendComm()timedJoinAndDeleteThread()curl_easy_perform()Additional Context
Already queued events are above max queue