linguoxuan commented Jan 24, 2026

Purpose

Linked issue: close #2360

Brief change log

  1. Add exception handling in TieringSplitReader to catch tiering task failures and queue failure information instead of crashing the entire job.
  2. Add FailedTableInfo inner class to store failed table ID and failure reason.
  3. Add TableTieringFailedEvent event class for Enumerator to broadcast table tiering failure notifications to all Readers.
  4. In TieringSourceReader, add a processFailedTables() method that detects failed tables and sends a FailedTieringEvent to the Enumerator, and a handleSourceEvents() method that processes TableTieringFailedEvent from the Enumerator and emits failure markers to the downstream Committer.
  5. In TieringSourceEnumerator, add broadcastTableTieringFailedEvent() method to broadcast failure events to all other Readers upon receiving a failure event.
  6. In TableBucketWriteResult, add failedMarker and failReason fields, and add failedMarker() static factory method to create failure markers.
  7. In TieringCommitOperator, add failure marker handling logic: when a failure marker is detected, clean up collected write results for the failed table.
  8. Update TableBucketWriteResultSerializer to support serialization/deserialization of the new failedMarker and failReason fields.
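
As a rough illustration of items 6 and 7, here is a minimal, self-contained sketch of a failure marker and of the committer-side cleanup it triggers. This is not the code in this PR: `WriteResult`, `CommitBuffer`, `success()` and `pendingFor()` are hypothetical stand-ins for TableBucketWriteResult and TieringCommitOperator, and it assumes a failure marker only needs to carry the table ID and a reason.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for TableBucketWriteResult:
// either a normal per-bucket result or a failure marker for a whole table.
final class WriteResult {
    final long tableId;
    final int bucket;             // -1 for failure markers (assumption of this sketch)
    final boolean isFailedMarker;
    final String failReason;      // null for normal results

    private WriteResult(long tableId, int bucket, boolean isFailedMarker, String failReason) {
        this.tableId = tableId;
        this.bucket = bucket;
        this.isFailedMarker = isFailedMarker;
        this.failReason = failReason;
    }

    static WriteResult success(long tableId, int bucket) {
        return new WriteResult(tableId, bucket, false, null);
    }

    // Static factory for failure markers, in the spirit of failedMarker() described above.
    static WriteResult failedMarker(long tableId, String failReason) {
        return new WriteResult(tableId, -1, true, failReason);
    }
}

// Hypothetical stand-in for the result buffering done by the commit operator.
final class CommitBuffer {
    private final Map<Long, List<WriteResult>> collected = new HashMap<>();

    // Buffers normal results; a failure marker drops everything collected for that table.
    void process(WriteResult result) {
        if (result.isFailedMarker) {
            collected.remove(result.tableId);
            return;
        }
        collected.computeIfAbsent(result.tableId, id -> new ArrayList<>()).add(result);
    }

    // Number of write results still waiting to be committed for the given table.
    int pendingFor(long tableId) {
        List<WriteResult> results = collected.get(tableId);
        return results == null ? 0 : results.size();
    }
}
```

In this sketch a failure marker for one table leaves the buffered results of every other table untouched, which is the property the change is after: a single failing table should not affect the rest of the job.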

Tests

TieringCommitOperatorTest
TableBucketWriteResultSerializerTest
TieringSourceEnumeratorTest

API and Format

Documentation

@linguoxuan
Author

Hi @luoyuxia, could you take a look when you have time? The failure notification mechanism works as follows: when a Reader encounters an unrecoverable exception, it sends a FailedTieringEvent to the Enumerator, which then broadcasts a TableTieringFailedEvent to all Readers. On receiving this event, each Reader cleans up its state and emits a failure marker through the data stream so that the downstream Committer cleans up its state as well.
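
The round trip described above can be sketched with plain Java objects. This is only an illustration under assumptions, not the PR's implementation: `SketchReader`, `SketchEnumerator` and their methods are hypothetical names, the Flink source event plumbing is replaced by direct method calls, and the marker is represented as a string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical enumerator: receives a failure report and broadcasts it to every registered reader.
final class SketchEnumerator {
    private final List<SketchReader> readers = new ArrayList<>();

    void register(SketchReader reader) {
        readers.add(reader);
    }

    // Plays the role of handling a FailedTieringEvent sent by one reader.
    void onFailedTieringEvent(long tableId, String reason) {
        for (SketchReader reader : readers) {
            reader.onTableTieringFailedEvent(tableId, reason);
        }
    }
}

// Hypothetical reader: records the failure markers it would emit downstream.
final class SketchReader {
    final int subtaskId;
    final List<String> emitted = new ArrayList<>();

    SketchReader(int subtaskId) {
        this.subtaskId = subtaskId;
    }

    // Step 1: a reader hits an unrecoverable exception and reports it to the enumerator.
    void reportFailure(SketchEnumerator enumerator, long tableId, String reason) {
        enumerator.onFailedTieringEvent(tableId, reason);
    }

    // Step 3: on the broadcast, a real reader would first drop its state for the table;
    // this sketch only records the marker that would flow to the committer.
    void onTableTieringFailedEvent(long tableId, String reason) {
        emitted.add("FAILED_MARKER table=" + tableId + " reason=" + reason);
    }
}
```

With two readers registered, a failure reported by one of them results in each reader emitting exactly one marker for the failed table, so the committer hears about the failure regardless of which reader owned the failing split.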

Development

Successfully merging this pull request may close these issues.

Handle individual table failures without failing the entire job
