Queued async events appear to be lost on normal CLI process exit

# Queued async events appear to be lost on normal CLI process exit

## Summary

When a CLI process exits normally with `async_backend_comm = true`, some events that are still queued in the background sender appear not to arrive at APM Server after the process ends.

I reproduced this on Linux with agent version `v1.16.0`. The script reaches the end normally and exits with code `0`.

This report is about the normal-exit loss path specifically. In the runs behind this report, I do not see queue-overflow errors explaining the missing data.

## Version Scope

- My original reproduction is on `v1.16.0`
- I have not yet retested this exact scenario on the latest `1.16.x`

## Environment

- PHP 7.3
- Elastic APM PHP agent `v1.16.0`
- CLI SAPI
- Linux
- `async_backend_comm = true`
- `server_timeout = 120s`

## Reproduction Pattern

The workload is a CLI script that:

1. Loops over many units of work in a single process
2. Creates one APM transaction per iteration
3. Produces roughly 200 to 400 auto-instrumented DB spans per transaction
4. Reaches the end of the script normally and exits with code `0`

This is not a fatal-error, crash, or forced-kill scenario.

## Observed Behavior

- The number of transactions that arrive at APM Server depends on how much the background sender flushes before the script ends.
- Events that are still queued when the process exits do not appear later.
- In the runs behind this report, I do not see queue-overflow errors such as:

```text
Already queued events are above max queue
```

- With `dev_internal_backend_comm_log_verbose = true`, the last backend-comm logs show a request body being uploaded:

```text
CURLINFO_DATA_OUT: [10923 bytes of span data]
CURLINFO_TEXT: upload completely sent off: 10923 out of 10923 bytes
```

- I do not see the corresponding HTTP response afterwards, such as `202 Accepted`.

## Additional Evidence

### 1. DEBUG logs stop at request shutdown

With `log_level_stderr = DEBUG`, the last log line in one run was:

```text
[22:24:57] [Lifecycle] [lifecycle.cpp:615] [elasticApmRequestShutdown] Exiting...
```

After that, I do not see log entries from:

- `elasticApmModuleShutdown`
- `signalBackgroundBackendCommThreadToExit`
- `unwindBackgroundBackendComm`

I do not take that as proof that module shutdown never ran. It only means I do not have positive log evidence that the async shutdown path completed.

### 2. TRACE-level shared-state logs did not show the shutdown flag flipping

In a separate TRACE-level run, the last shared-state snapshot I captured from the background thread still showed `shouldExit: false`.

I do not think that is conclusive by itself, because the sender thread logs snapshots at loop boundaries. If shutdown starts while the thread is blocked in an in-flight HTTP send, the next snapshot may never be logged.

## Why This Looks Unexpected

From reading the source, I expected queued events to keep draining during module shutdown, at least until the queue became empty or the background sender decided its shutdown deadline had expired:

```c
PHP_MSHUTDOWN_FUNCTION(elastic_apm)
    -> elasticApmModuleShutdown()
        -> backgroundBackendCommOnModuleShutdown()
            -> signalBackgroundBackendCommThreadToExit()
            -> unwindBackgroundBackendComm()
```

The relevant pieces appear to be:

- `PHP_MSHUTDOWN_FUNCTION(elastic_apm)` calls `elasticApmModuleShutdown()`
- `elasticApmModuleShutdown()` calls `backgroundBackendCommOnModuleShutdown()`
- `signalBackgroundBackendCommThreadToExit()` sets `shouldExit = true`
- it computes `shouldExitBy` from `server_timeout`
- the worker loop keeps running until the queue is empty or the shutdown deadline has passed
- if the shutdown deadline is exceeded with queued data still pending, the worker logs that the remaining events will be dropped

One Linux-specific nuance in `v1.16.0` may matter here:

- `unwindBackgroundBackendComm()` passes the computed deadline into `timedJoinAndDeleteThread()`
- but the Linux implementation uses `pthread_join()` rather than `pthread_timedjoin_np()`

So on Linux `v1.16.0`, `server_timeout` clearly affects the worker's internal shutdown deadline and curl timeout, but it is not a hard timeout for the join helper itself.

## Question

Given the shutdown path above, I would expect queued async events to continue draining during normal CLI module shutdown until the queue becomes empty or the worker reaches its shutdown deadline.

What I am seeing looks like that expected drain path is not completing, or is starting too late to finish, in this normal-exit CLI case.

Is that interpretation correct?

If so, this seems like a shutdown/drain bug or a shutdown-while-POST-is-in-flight bug rather than expected behavior.

## Suggested Investigation

- Retest this exact scenario on the latest `1.16.x`
- Confirm whether `elasticApmApiEntered()` can fail in `PHP_MSHUTDOWN_FUNCTION(elastic_apm)` for this CLI case
- Confirm whether `elasticApmModuleShutdown()` reaches `backgroundBackendCommOnModuleShutdown()`
- Add temporary logs around:
  - `PHP_MSHUTDOWN_FUNCTION(elastic_apm)`
  - `elasticApmModuleShutdown()`
  - `signalBackgroundBackendCommThreadToExit()`
  - `unwindBackgroundBackendComm()`
  - the return from `timedJoinAndDeleteThread()`
- Correlate those logs with the final verbose curl output to see whether shutdown begins while the sender is still blocked in `curl_easy_perform()`
- Confirm whether the existing async-shutdown timeout warning is emitted anywhere in this reproduction path

## Additional Context

- I can provide a reduced reproduction script if useful
- This report is meant to describe the exit-time loss path specifically, not the separate queue-overflow path that logs `Already queued events are above max queue`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Queued async events appear to be lost on normal CLI process exit #1450

Queued async events appear to be lost on normal CLI process exit

Summary

Version Scope

Environment

Reproduction Pattern

Observed Behavior

Additional Evidence

1. DEBUG logs stop at request shutdown

2. TRACE-level shared-state logs did not show the shutdown flag flipping

Why This Looks Unexpected

Question

Suggested Investigation

Additional Context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Queued async events appear to be lost on normal CLI process exit #1450

Description

Queued async events appear to be lost on normal CLI process exit

Summary

Version Scope

Environment

Reproduction Pattern

Observed Behavior

Additional Evidence

1. DEBUG logs stop at request shutdown

2. TRACE-level shared-state logs did not show the shutdown flag flipping

Why This Looks Unexpected

Question

Suggested Investigation

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions