Skip to content

Queued async events appear to be lost on normal CLI process exit #1450

@githoober

Description

@githoober

Queued async events appear to be lost on normal CLI process exit

Summary

When a CLI process exits normally with async_backend_comm = true, some events that are still queued in the background sender appear not to arrive at APM Server after the process ends.

I reproduced this on Linux with agent version v1.16.0. The script reaches the end normally and exits with code 0.

This report is about the normal-exit loss path specifically. In the runs behind this report, I do not see queue-overflow errors explaining the missing data.

Version Scope

  • My original reproduction is on v1.16.0
  • I have not yet retested this exact scenario on the latest 1.16.x

Environment

  • PHP 7.3
  • Elastic APM PHP agent v1.16.0
  • CLI SAPI
  • Linux
  • async_backend_comm = true
  • server_timeout = 120s

Reproduction Pattern

The workload is a CLI script that:

  1. Loops over many units of work in a single process
  2. Creates one APM transaction per iteration
  3. Produces roughly 200 to 400 auto-instrumented DB spans per transaction
  4. Reaches the end of the script normally and exits with code 0

This is not a fatal-error, crash, or forced-kill scenario.

Observed Behavior

  • The number of transactions that arrive at APM Server depends on how much the background sender flushes before the script ends.
  • Events that are still queued when the process exits do not appear later.
  • In the runs behind this report, I do not see queue-overflow errors such as:
Already queued events are above max queue
  • With dev_internal_backend_comm_log_verbose = true, the last backend-comm logs show a request body being uploaded:
CURLINFO_DATA_OUT: [10923 bytes of span data]
CURLINFO_TEXT: upload completely sent off: 10923 out of 10923 bytes
  • I do not see the corresponding HTTP response afterwards, such as 202 Accepted.

Additional Evidence

1. DEBUG logs stop at request shutdown

With log_level_stderr = DEBUG, the last log line in one run was:

[22:24:57] [Lifecycle] [lifecycle.cpp:615] [elasticApmRequestShutdown] Exiting...

After that, I do not see log entries from:

  • elasticApmModuleShutdown
  • signalBackgroundBackendCommThreadToExit
  • unwindBackgroundBackendComm

I do not take that as proof that module shutdown never ran. It only means I do not have positive log evidence that the async shutdown path completed.

2. TRACE-level shared-state logs did not show the shutdown flag flipping

In a separate TRACE-level run, the last shared-state snapshot I captured from the background thread still showed shouldExit: false.

I do not think that is conclusive by itself, because the sender thread logs snapshots at loop boundaries. If shutdown starts while the thread is blocked in an in-flight HTTP send, the next snapshot may never be logged.

Why This Looks Unexpected

From reading the source, I expected queued events to keep draining during module shutdown, at least until the queue became empty or the background sender decided its shutdown deadline had expired:

PHP_MSHUTDOWN_FUNCTION(elastic_apm)
    -> elasticApmModuleShutdown()
        -> backgroundBackendCommOnModuleShutdown()
            -> signalBackgroundBackendCommThreadToExit()
            -> unwindBackgroundBackendComm()

The relevant pieces appear to be:

  • PHP_MSHUTDOWN_FUNCTION(elastic_apm) calls elasticApmModuleShutdown()
  • elasticApmModuleShutdown() calls backgroundBackendCommOnModuleShutdown()
  • signalBackgroundBackendCommThreadToExit() sets shouldExit = true
  • it computes shouldExitBy from server_timeout
  • the worker loop keeps running until the queue is empty or the shutdown deadline has passed
  • if the shutdown deadline is exceeded with queued data still pending, the worker logs that the remaining events will be dropped

One Linux-specific nuance in v1.16.0 may matter here:

  • unwindBackgroundBackendComm() passes the computed deadline into timedJoinAndDeleteThread()
  • but the Linux implementation uses pthread_join() rather than pthread_timedjoin_np()

So on Linux v1.16.0, server_timeout clearly affects the worker's internal shutdown deadline and curl timeout, but it is not a hard timeout for the join helper itself.

Question

Given the shutdown path above, I would expect queued async events to continue draining during normal CLI module shutdown until the queue becomes empty or the worker reaches its shutdown deadline.

What I am seeing looks like that expected drain path is not completing, or is starting too late to finish, in this normal-exit CLI case.

Is that interpretation correct?

If so, this seems like a shutdown/drain bug or a shutdown-while-POST-is-in-flight bug rather than expected behavior.

Suggested Investigation

  • Retest this exact scenario on the latest 1.16.x
  • Confirm whether elasticApmApiEntered() can fail in PHP_MSHUTDOWN_FUNCTION(elastic_apm) for this CLI case
  • Confirm whether elasticApmModuleShutdown() reaches backgroundBackendCommOnModuleShutdown()
  • Add temporary logs around:
    • PHP_MSHUTDOWN_FUNCTION(elastic_apm)
    • elasticApmModuleShutdown()
    • signalBackgroundBackendCommThreadToExit()
    • unwindBackgroundBackendComm()
    • the return from timedJoinAndDeleteThread()
  • Correlate those logs with the final verbose curl output to see whether shutdown begins while the sender is still blocked in curl_easy_perform()
  • Confirm whether the existing async-shutdown timeout warning is emitted anywhere in this reproduction path

Additional Context

  • I can provide a reduced reproduction script if useful
  • This report is meant to describe the exit-time loss path specifically, not the separate queue-overflow path that logs Already queued events are above max queue

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions