Title
Handle NoSuchUploadException in UploadPart operation to prevent broker crash
Description
When using AutoMQ with Tencent Cloud COS (or other S3-compatible storage), the broker terminates via Runtime.getRuntime().halt(1) when it encounters a NoSuchUploadException during an UploadPart operation.
Root Cause
The current implementation in AwsObjectStorage.toRetryStrategyAndCause() only handles NoSuchUploadException specially for the COMPLETE_MULTI_PART_UPLOAD operation:
    if (COMPLETE_MULTI_PART_UPLOAD == operation) {
        if (cause instanceof NoSuchUploadException) {
            strategy = RetryStrategy.VISIBILITY_CHECK;
        }
    }
However, for the UPLOAD_PART operation, NoSuchUploadException (HTTP 404) results in RetryStrategy.ABORT, which eventually propagates to S3Storage.commitDeltaWALUpload() and triggers:
    Runtime.getRuntime().halt(1);
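To make the gap concrete, here is a simplified, self-contained model of the classification described above. It is a sketch, not AutoMQ's actual code: the `classify` method, the reduced enums, and the default of ABORT for this error are assumptions based on the behavior reported in this issue, and the SDK exception is replaced by a local stand-in class so the snippet compiles without the AWS SDK.

```java
public class RetryClassificationSketch {
    // Reduced, hypothetical versions of the real enums.
    enum Operation { UPLOAD_PART, COMPLETE_MULTI_PART_UPLOAD }
    enum RetryStrategy { RETRY, VISIBILITY_CHECK, ABORT }

    // Stand-in for software.amazon.awssdk.services.s3.model.NoSuchUploadException.
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    // Models the reported behavior: a missing upload (HTTP 404) aborts by default,
    // and only COMPLETE_MULTI_PART_UPLOAD is downgraded to a visibility check.
    static RetryStrategy classify(Operation operation, Throwable cause) {
        RetryStrategy strategy = cause instanceof NoSuchUploadException
            ? RetryStrategy.ABORT   // assumption: 404-style errors are not retried
            : RetryStrategy.RETRY;
        if (Operation.COMPLETE_MULTI_PART_UPLOAD == operation) {
            if (cause instanceof NoSuchUploadException) {
                strategy = RetryStrategy.VISIBILITY_CHECK;
            }
        }
        return strategy;
    }

    public static void main(String[] args) {
        Throwable cause = new NoSuchUploadException("The specified multipart upload does not exist.");
        System.out.println(classify(Operation.COMPLETE_MULTI_PART_UPLOAD, cause)); // VISIBILITY_CHECK
        System.out.println(classify(Operation.UPLOAD_PART, cause));                // ABORT
    }
}
```

In this model nothing intercepts the UPLOAD_PART case, so the same exception that is tolerated at completion time becomes fatal while parts are still being uploaded.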
Scenario
This issue occurs when:
- A multipart upload is initiated
- The upload takes longer than expected (due to network issues, throttling, or large data)
- The cloud storage's lifecycle rule automatically aborts incomplete multipart uploads (e.g., after 1-7 days)
- Subsequent UploadPart calls fail with NoSuchUploadException
- Broker crashes
Error Log
[ERROR] UploadPart for object 2b180480/_kafka_ops_sh/138445234-2 fail (com.automq.stream.s3.operator.AbstractObjectStorage)
software.amazon.awssdk.services.s3.model.NoSuchUploadException: The specified multipart upload does not exist.
The upload ID might be invalid, or the multipart upload might have been aborted or completed.
[ERROR] Unexpected exception when commit stream set object (com.automq.stream.s3.S3Storage)
java.util.concurrent.CompletionException: software.amazon.awssdk.services.s3.model.NoSuchUploadException: ...
Proposed Solution
- When NoSuchUploadException occurs during UPLOAD_PART, instead of aborting, the system should:
  - Invalidate the current uploadId
  - Re-initiate a new multipart upload
  - Retry the upload from the beginning
- Add uploadId validity tracking to detect stale upload sessions early
- Consider adding a configurable timeout for multipart uploads to proactively restart uploads that are taking too long
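The first proposal (invalidate, re-initiate, retry from the beginning) could look roughly like the sketch below. Everything in it is hypothetical: `MultipartClient`, `uploadWithRestart`, `maxRestarts`, and the in-memory `FakeClient` are illustrative names, not AutoMQ or AWS SDK types, and the real implementation is asynchronous whereas this sketch is synchronous for brevity.

```java
import java.util.ArrayList;
import java.util.List;

public class MultipartRestartSketch {
    // Stand-in for software.amazon.awssdk.services.s3.model.NoSuchUploadException.
    static class NoSuchUploadException extends RuntimeException {
        NoSuchUploadException(String message) {
            super(message);
        }
    }

    // Hypothetical minimal client interface, used only for this illustration.
    interface MultipartClient {
        String createMultipartUpload(String key);
        String uploadPart(String key, String uploadId, int partNumber, byte[] data); // returns an ETag
        void completeMultipartUpload(String key, String uploadId, List<String> etags);
    }

    // Uploads all parts; if the uploadId disappears mid-upload, discards it,
    // starts a new multipart upload, and re-sends every part from the beginning.
    static String uploadWithRestart(MultipartClient client, String key, List<byte[]> parts, int maxRestarts) {
        int restarts = 0;
        while (true) {
            String uploadId = client.createMultipartUpload(key);
            List<String> etags = new ArrayList<>();
            boolean stale = false;
            for (int i = 0; i < parts.size() && !stale; i++) {
                try {
                    etags.add(client.uploadPart(key, uploadId, i + 1, parts.get(i)));
                } catch (NoSuchUploadException e) {
                    if (restarts++ >= maxRestarts) {
                        throw e; // give up after the configured number of restarts
                    }
                    stale = true; // invalidate this uploadId and start over
                }
            }
            if (!stale) {
                client.completeMultipartUpload(key, uploadId, etags);
                return uploadId;
            }
        }
    }

    // In-memory fake whose first uploadId "expires" after one part,
    // simulating a lifecycle rule aborting the incomplete upload.
    static class FakeClient implements MultipartClient {
        int created = 0;
        int completed = 0;

        public String createMultipartUpload(String key) {
            return "upload-" + (++created);
        }

        public String uploadPart(String key, String uploadId, int partNumber, byte[] data) {
            if (uploadId.equals("upload-1") && partNumber > 1) {
                throw new NoSuchUploadException("The specified multipart upload does not exist.");
            }
            return "etag-" + partNumber;
        }

        public void completeMultipartUpload(String key, String uploadId, List<String> etags) {
            completed++;
        }
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient();
        List<byte[]> parts = List.of(new byte[] {1}, new byte[] {2}, new byte[] {3});
        String uploadId = uploadWithRestart(client, "demo-object", parts, 2);
        System.out.println(uploadId + " after " + client.created + " initiations"); // upload-2 after 2 initiations
    }
}
```

Restarting from the first part is required because parts already uploaded under the aborted uploadId cannot be reused. The restart is bounded so that a persistently failing bucket still surfaces an error instead of looping forever.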
Environment
- Cloud Provider: Tencent Cloud COS (S3-compatible)
- The bucket has lifecycle rules configured to abort incomplete multipart uploads
Impact
- Severity: High
- Broker crashes and requires manual restart
- Data durability may be affected if WAL upload fails