[feat] Handle individual table failures without failing the entire job #2463

linguoxuan · 2026-01-24T07:25:25Z

Purpose

Linked issue: close #2360

Brief change log

Add exception handling in TieringSplitReader to catch tiering task failures and queue failure information instead of crashing the entire job.
Add FailedTableInfo inner class to store failed table ID and failure reason.
Add TableTieringFailedEvent event class for Enumerator to broadcast table tiering failure notifications to all Readers.
In TieringSourceReader, add a processFailedTables() method to detect failed tables and send a FailedTieringEvent to the Enumerator, and add handleSourceEvents() to process the TableTieringFailedEvent from the Enumerator and emit failure markers to the downstream Committer.
In TieringSourceEnumerator, add broadcastTableTieringFailedEvent() method to broadcast failure events to all other Readers upon receiving a failure event.
In TableBucketWriteResult, add failedMarker and failReason fields, and add failedMarker() static factory method to create failure markers.
In TieringCommitOperator, add failure marker handling logic: when a failure marker is detected, clean up collected write results for the failed table.
Update TableBucketWriteResultSerializer to support serialization/deserialization of the new failedMarker and failReason fields.
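
To illustrate the failure-marker idea from the change log, here is a minimal sketch of how a marker might flow into committer-side cleanup. It is not the code in this PR: `WriteResultSketch` and `CommitterSketch` are hypothetical, simplified stand-ins for TableBucketWriteResult and TieringCommitOperator, and the `-1` bucket placeholder is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified stand-in for TableBucketWriteResult. */
final class WriteResultSketch {
    final long tableId;
    final int bucket;
    final boolean failedMarker;
    final String failReason; // null unless failedMarker is true

    private WriteResultSketch(long tableId, int bucket, boolean failedMarker, String failReason) {
        this.tableId = tableId;
        this.bucket = bucket;
        this.failedMarker = failedMarker;
        this.failReason = failReason;
    }

    /** A normal write result for one bucket of a table. */
    static WriteResultSketch success(long tableId, int bucket) {
        return new WriteResultSketch(tableId, bucket, false, null);
    }

    /** Static factory for a failure marker; the -1 bucket is an assumed placeholder. */
    static WriteResultSketch failedMarker(long tableId, String failReason) {
        return new WriteResultSketch(tableId, -1, true, failReason);
    }
}

/** Hypothetical stand-in for the committer's per-table bookkeeping. */
final class CommitterSketch {
    private final Map<Long, List<WriteResultSketch>> collected = new HashMap<>();

    void process(WriteResultSketch result) {
        if (result.failedMarker) {
            // Drop everything collected so far for the failed table only.
            collected.remove(result.tableId);
            return;
        }
        collected.computeIfAbsent(result.tableId, k -> new ArrayList<>()).add(result);
    }

    /** Number of write results still pending for the given table. */
    int pendingFor(long tableId) {
        List<WriteResultSketch> results = collected.get(tableId);
        return results == null ? 0 : results.size();
    }
}
```

In this sketch a failure marker for one table removes only that table's collected results, so other tables can still be committed.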

Tests

TieringCommitOperatorTest
TableBucketWriteResultSerializerTest
TieringSourceEnumeratorTest

API and Format

Documentation

linguoxuan · 2026-01-24T07:47:20Z

Hi, @luoyuxia. Can you take a look if you have time? The failure notification mechanism works as follows: when a Reader encounters an unrecoverable exception, it sends a FailedTieringEvent to the Enumerator, which then broadcasts a TableTieringFailedEvent to all Readers. Upon receiving this event, each Reader cleans up its state for the failed table and emits a failure marker through the data stream, which notifies the downstream Committer to clean up its own state.
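
The notification flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `TieringEnumeratorSketch` and `TieringReaderSketch` are hypothetical classes that call each other directly rather than going through Flink's source-event API, the broadcast here reaches every registered reader including the one that reported the failure, and the failure marker is modeled as a plain string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for the enumerator side of the flow. */
final class TieringEnumeratorSketch {
    private final List<TieringReaderSketch> readers = new ArrayList<>();

    void register(TieringReaderSketch reader) {
        readers.add(reader);
    }

    /** Reader -> Enumerator: one table failed; fan the notice out to all readers. */
    void onFailedTieringEvent(long tableId, String reason) {
        for (TieringReaderSketch reader : readers) {
            reader.onTableTieringFailedEvent(tableId, reason);
        }
    }
}

/** Hypothetical stand-in for the reader side of the flow. */
final class TieringReaderSketch {
    private final TieringEnumeratorSketch enumerator;
    final Set<Long> activeTables = new HashSet<>();
    final List<String> emitted = new ArrayList<>(); // models the downstream data stream

    TieringReaderSketch(TieringEnumeratorSketch enumerator) {
        this.enumerator = enumerator;
    }

    /** Called when this reader hits an unrecoverable error for a table. */
    void reportFailure(long tableId, String reason) {
        enumerator.onFailedTieringEvent(tableId, reason);
    }

    /** Enumerator -> Reader: clean up local state, then emit a failure marker. */
    void onTableTieringFailedEvent(long tableId, String reason) {
        activeTables.remove(tableId);
        emitted.add("FAILED table=" + tableId + " reason=" + reason);
    }
}
```

Routing the failure through the Enumerator rather than handling it locally means readers that hold other splits of the same table also learn about the failure and can stop work on it.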

[feat] Handle individual table failures without failing the entire job

bbb2761
