Skip to content

feat: add alerting for unreachable errors#7046

Open
simone-stacks wants to merge 11 commits intostacks-network:developfrom
simone-stacks:feat/unreachable-errors-alert
Open

feat: add alerting for unreachable errors#7046
simone-stacks wants to merge 11 commits intostacks-network:developfrom
simone-stacks:feat/unreachable-errors-alert

Conversation

@simone-stacks
Copy link
Copy Markdown
Contributor

@simone-stacks simone-stacks commented Mar 26, 2026

Description

Adds distinct alerting when "unreachable" errors are triggered. These are errors that should never occur in valid execution and indicate potential bugs.

Today these errors flow through the generic "problematic transaction" path and get logged at info! level with no way to distinguish them from normal rejectable errors.

This PR adds:

  • is_unreachable() methods on RuntimeCheckErrorKind, StaticCheckErrorKind, and ParseError
  • error! level logs with a UNREACHABLE_ERROR_TRIGGERED marker and structured event_name=unreachable_error field at all consensus rejection sites (block validation + miner)
  • a stacks_unreachable_errors_total Prometheus counter (behind monitoring_prom feature flag)

There's still an open discussion about whether some StaticCheckErrorKind variants documented as internal errors (TypeAlreadyAnnotatedFailure, CheckerImplementationFailure ecc) should also be alerted on, see stx-labs/core-epics#119 (comment). Keeping those out for now.

Applicable issues

  • fixes stx-labs/core-epics#160

Checklist

  • Test coverage for new or modified code paths
  • For new Clarity features or consensus changes, add property tests (see docs/property-testing.md)
  • Changelog is updated
  • Required documentation changes (e.g., rpc/openapi.yaml for RPC endpoints, event-dispatcher.md for new events)
  • New clarity functions have corresponding PR in clarity-benchmarking repo

@coveralls
Copy link
Copy Markdown

coveralls commented Mar 27, 2026

Coverage Report for CI Build 24078323220

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-8.3%) to 77.398%

Details

  • Coverage decreased (-8.3%) from the base build.
  • Patch coverage: 8 uncovered changes across 1 file (134 of 142 lines covered, 94.37%).
  • 19497 coverage regressions across 183 files.

Uncovered Changes

File Changed Covered %
stackslib/src/chainstate/stacks/db/transactions.rs 122 114 93.44%

Coverage Regressions

19497 previously-covered lines in 183 files lost coverage.

Top 10 Files by Coverage Loss Lines Losing Coverage Coverage
stacks-node/src/neon_node.rs 2537 4.48%
stacks-node/src/burnchains/bitcoin_regtest_controller.rs 1946 24.22%
stacks-signer/src/v0/signer.rs 1249 0.0%
stacks-node/src/nakamoto_node/relayer.rs 1037 15.13%
stacks-node/src/nakamoto_node/miner.rs 1012 8.62%
stacks-node/src/run_loop/neon.rs 556 0.94%
stacks-signer/src/v0/signer_state.rs 510 34.32%
stacks-node/src/event_dispatcher.rs 468 33.41%
stackslib/src/net/p2p.rs 389 65.4%
stacks-node/src/run_loop/nakamoto.rs 386 0.0%

Coverage Stats

Coverage Status
Relevant Lines: 217854
Covered Lines: 168614
Line Coverage: 77.4%
Coverage Strength: 7582881.09 hits per line

💛 - Coveralls

@simone-stacks simone-stacks marked this pull request as ready for review March 31, 2026 19:58
Comment on lines +1253 to +1303

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn runtime_check_unreachable_returns_true() {
let err = RuntimeCheckErrorKind::Unreachable("test bug".into());
assert!(err.is_unreachable());
}

#[test]
fn runtime_check_cost_overflow_not_unreachable() {
assert!(!RuntimeCheckErrorKind::CostOverflow.is_unreachable());
}

#[test]
fn runtime_check_type_error_not_unreachable() {
let err = RuntimeCheckErrorKind::TypeError(
Box::new(TypeSignature::IntType),
Box::new(TypeSignature::UIntType),
);
assert!(!err.is_unreachable());
}

#[test]
fn runtime_check_value_too_large_not_unreachable() {
assert!(!RuntimeCheckErrorKind::ValueTooLarge.is_unreachable());
}

#[test]
fn static_check_unreachable_returns_true() {
let err = StaticCheckErrorKind::Unreachable("test bug".into());
assert!(err.is_unreachable());
}

#[test]
fn static_check_supertype_too_large_not_unreachable() {
assert!(!StaticCheckErrorKind::SupertypeTooLarge.is_unreachable());
}

#[test]
fn static_check_cost_overflow_not_unreachable() {
assert!(!StaticCheckErrorKind::CostOverflow.is_unreachable());
}

#[test]
fn static_check_type_already_annotated_not_unreachable() {
assert!(!StaticCheckErrorKind::TypeAlreadyAnnotatedFailure.is_unreachable());
}
}
Copy link
Copy Markdown
Contributor

@federico-stacks federico-stacks Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few suggestions that might help:

  • for consistency in test structure (at least it seems for vm/analysis), it may be worth moving these into src/vm/analysis/tests/errors.rs
  • tests could be collapsed per error type, e.g. static_check_is_unreachable() and runtime_check_is_unreachable()
  • At the moment, the tests are not exhaustive, as they don’t cover all variants. I see a few possible approaches here:
    • Make them fully exhaustive by matching all variants per type (though this could become quite heavy and maybe not worthy in this case).
    • Keep the pattern matching focused on the “unreachable” variants, and use a catch-all _ arm for the rest.
    • Alternatively, we may consider to drop them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crc: 4fca9f1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that point 1 and point 2 have been addressed, but not point 3. I would like to know your thoughts about it :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I missed the exhaustiveness concern, crc is in 68e514f, I dropped them entirely. is_unreachable() is a trivial matches! one-liner, these were just testing Rust pattern matching. removed from both analysis and ast. What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, the point here isn’t the triviality of the code, but rather that we’re not using is_unreachable() as part of the business logic (as stated in the docstring).

I’m fine with this. Let’s keep this open for now to gather feedback from other reviewers.

brice-stacks
brice-stacks previously approved these changes Apr 7, 2026
Copy link
Copy Markdown
Contributor

@brice-stacks brice-stacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Copy Markdown
Contributor

@federico-stacks federico-stacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good overall.

I’ve left a specific note in this thread: #7046 (comment)

One additional thought: since there are other error variants that might be considered unreachable (as mentioned in the PR description), it could be worth including them in our monitoring as well. More in general we could collect from the consensus tests documentation all the error variants currently treated as unreachable and track them accordingly. (This could also be done in a follow-up PR)

Any thoughts?

@simone-stacks
Copy link
Copy Markdown
Contributor Author

Yeah i think it's a great follow-up. The strongest candidates are TypeAlreadyAnnotatedFailure and CheckerImplementationFailure: they're documented as "indicating a bug" and their error message says "internal error please file an issue"

There's also VmInternalError::Expect and InvariantViolation in vm/errors.rs which are documented as "believed to be unreachable", but those flow through the Acceptable path in handle_clarity_runtime_error so they silently eat fees without any alerting today

There's also epoch-gated stuff (SupertypeTooLarge, ExpressionStackDepthTooDeep) but I believe the situation is different here since those are expected consensus behavior and not bugs

I'd tackle it in a follow-up PR, focusing on the first two categories. does that make sense? anything else I might be missing?

@federico-stacks
Copy link
Copy Markdown
Contributor

I'd tackle it in a follow-up PR, focusing on the first two categories. does that make sense? anything else I might be missing?

we could consider all the unreachable variants from the consensus tests (the one under stackslib src/chainstate/tests/)

Copy link
Copy Markdown
Contributor

@federico-stacks federico-stacks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

heads-up: a lot of CI checks are failed. Should not be related, but worth to have a clean run before merge this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants