Skip to content

feat(cpp-phase2): PR-04 resolution-quality fixes (glibc attrs, transitive includes)#682

Open
shivasurya wants to merge 4 commits intoshiva/cpp-phase2-pr03-remotefrom
shiva/cpp-phase2-pr04-resolution-quality
Open

feat(cpp-phase2): PR-04 resolution-quality fixes (glibc attrs, transitive includes)#682
shivasurya wants to merge 4 commits intoshiva/cpp-phase2-pr03-remotefrom
shiva/cpp-phase2-pr04-resolution-quality

Conversation

@shivasurya
Copy link
Copy Markdown
Owner

Summary

PR-04 of the C/C++ Phase 2 stdlib stack — three resolution-quality fixes uncovered during local validation against redis and proxygen. Stacked on PR-03; should merge after.

What's in here

Three independent fixes, each in its own commit:

1. Recover glibc-decorated declarations (fix(generator))

Tree-sitter's C grammar can't parse:

extern size_t strlen (const char *__s)
     __THROW __attribute_pure__ __nonnull ((1));

It collapses the whole declaration into an ERROR node and the symbol vanishes. Effect: strlen, strcmp, strncmp, strcasecmp, memcmp, memmem, snprintf, vsnprintf and dozens of similar glibc functions never made it into the manifest.

tools/internal/clikeextract/clike_preprocess.go strips the macros to whitespace before tree-sitter parses (length-preserving, line-number-accurate). Macro list sourced from <sys/cdefs.h>. Wired into both C and C++ extractors.

2. Resolver tolerant of generator's map placement (fix(registry))

std::move is emitted into manifest's functions map but PR-02's GetFreeFunction only consulted free_functions. Lookup now falls through to both maps. Strict superset of previous behaviour.

3. Transitive-include fallback (feat(builder))

Real-world C++ relies heavily on transitive includes. A file that calls std::move without directly #include <utility> (utility comes through <vector> etc.) used to fail resolution. The resolver now walks every manifest header as a fallback, bounded to qualified names so unqualified project symbols can't accidentally bind. Required adding ListHeaders() to the CStdlibLoader / CppStdlibLoader interfaces.

Validation results

Resolution-report on real codebases, regenerated registries, file:// loader.

Project PR-03 baseline PR-04
smoke fixture (testdata/c/stdlib) 71% (5/7) 100% (7/7)
redis 60.6% 63.1% (+2.5 pp)
proxygen 12.0% 17.1% (+5.1 pp)

Top-1 unresolved in proxygen (std::move, 1603 occurrences) is gone — all std::move calls now resolve via the transitive fallback. Total Linux C functions extracted: 8,467 → 9,205 (+738).

The proxygen number is still well below the 98% spec target, but the remaining unresolveds are largely non-stdlib: gtest / glog / folly logging macros (VLOG, EXPECT_EQ, CHECK), folly types (folly::makeUnexpected, hasError), and unqualified class-method calls that need receiver-type inference (separate concern, future PR).

Verification

  • gradle buildGo — clean
  • go test ./... — all packages pass
  • golangci-lint run ./... — 0 issues

Out of scope (future PRs)

  • Receiver-type inference for chained STL method calls (vec.size(), str.empty())
  • gtest / glog / folly macro recognition (project-specific overlay)
  • Generator emitter cleanup so namespaced symbols always land in free_functions

Test plan

  • CI green
  • Operator validation: regenerate Linux registries from this branch and re-run resolution-report against own codebases

🤖 Generated with Claude Code

shivasurya and others added 3 commits May 3, 2026 20:31
Tree-sitter's C grammar can't parse the trailing GCC attribute macro
chain on glibc declarations:

    extern size_t strlen (const char *__s)
         __THROW __attribute_pure__ __nonnull ((1));

The parser collapses the entire surrounding declaration into an ERROR
node, the walker skips ERROR nodes by design, and the symbol vanishes
from the manifest. Effect on Linux: ~50% of <string.h> and parts of
<stdio.h> were missing — strlen, strcmp, strncmp, strcasecmp, memcmp,
memmem, snprintf, vsnprintf and friends.

clike_preprocess.go strips known no-op glibc attribute macros to
whitespace before tree-sitter parses. Two passes:

- Bare-token list (__THROW, __THROWNL, __attribute_pure__, __wur,
  __returns_nonnull, …) sourced from /usr/include/x86_64-linux-gnu/
  sys/cdefs.h. Word-boundary regex so we don't match inside identifiers.
- Function-like list (__attribute__, __nonnull, __attr_access,
  __attr_dealloc, __REDIRECT, …) with balanced-paren consumption so
  the trailing argument list is fully eaten.

Length-preserving (replace with spaces, leave newlines intact) so
tree-sitter's reported line numbers stay accurate. Idempotent. Doesn't
strip __inline / __restrict / __flexarr — those are real keywords or
have semantic effect on parsing.

Wired into both extractCHeader and extractCppHeader since libstdc++
borrows the same decorations.

Validation against /usr/include on Ubuntu 24.04 (libstdc++-13):
- string.h: 38 → 48 functions
- stdio.h:  recovered scanf/sprintf/snprintf/vsnprintf
- Total Linux C: 8467 → 9205 functions (+738)

Smoke fixture (testdata/c/stdlib): 5/7 → 7/7 resolved.
Redis full tree: 60.6% → 63.1% (+2.5 pp).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The PR-01 generator emits some namespaced free-function entries
(notably std::move, std::forward, std::swap as overlay-only entries)
into the manifest's `functions` map rather than `free_functions`. The
PR-02 resolver only consulted `free_functions`, so those symbols
silently failed to resolve at the call site.

GetFreeFunction now consults both maps. The lookup order is
free_functions first (the canonical map), then functions (the
fallback). This is a strict superset of the previous behaviour — any
entry that resolved before still resolves at the same priority — so
it's safe to ship without coordinating a generator-side cleanup.

Future work: tighten the generator emitter so namespaced symbols
always land in `free_functions`. Tracked outside this stack.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Real-world C++ code routinely relies on transitive includes — a file
calls std::move without `#include <utility>` because some other header
it includes pulls utility in. PR-02's resolver walked only the caller
file's direct system includes, so namespaced std::* calls failed to
resolve in the majority of files that should have benefited.

lookupCppStdlibFreeFunction now runs in two stages:

1. Direct system includes (the existing fast path).
2. Manifest-wide scan when the direct walk doesn't yield a hit. Bounded
   to qualified names (containing "::") so unqualified project symbols
   can never accidentally bind to a stdlib entry. First hit wins;
   stdlib FQNs are unique across headers so order doesn't affect
   correctness.

To support the fallback, both CStdlibLoader and CppStdlibLoader gain a
ListHeaders() method returning every manifest header name. Implemented
on the file://+HTTP loaders by reading manifest.Headers; test fakes
implement it by walking their fixture map. Fakes in the cmd package
reuse the same pattern.

Validation against proxygen full tree:
- 12.0% → 17.1% resolved (+5.1 pp)
- std::move drops out of the top-20 unresolved (was #1 with 1603
  occurrences, all resolved now)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@shivasurya shivasurya added bug Something isn't working enhancement New feature or request go Pull requests that update go code labels May 4, 2026
@shivasurya shivasurya self-assigned this May 4, 2026
@safedep
Copy link
Copy Markdown

safedep Bot commented May 4, 2026

SafeDep Report Summary

Green Malicious Packages Badge Green Vulnerable Packages Badge Green Risky License Badge

No dependency changes detected. Nothing to scan.

View complete scan results →

This report is generated by SafeDep Github App

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

Code Pathfinder Security Scan

Pass Critical High Medium Low Info

No security issues detected.

Metric Value
Files Scanned 14
Rules 205

Powered by Code Pathfinder

@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.68%. Comparing base (2d5c149) to head (f0bfd81).

Additional details and impacted files
@@                       Coverage Diff                        @@
##           shiva/cpp-phase2-pr03-remote     #682      +/-   ##
================================================================
+ Coverage                         85.67%   85.68%   +0.01%     
================================================================
  Files                               202      203       +1     
  Lines                             29141    29209      +68     
================================================================
+ Hits                              24966    25029      +63     
- Misses                             3203     3207       +4     
- Partials                            972      973       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Direct tests for the new ListHeaders() method on both loaders:
- canonical case after LoadManifest returns the manifest's header set
- before LoadManifest returns nil (matches HeaderCount semantics)
- mutating the returned slice doesn't bleed into a subsequent call

Pushes both methods from 0% to 100% coverage; satisfies codecov/patch
threshold on PR #682.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@shivasurya
Copy link
Copy Markdown
Owner Author

Pure-stdlib gap analysis (third-party / macros excluded)

Cross-referencing each unresolved call's target name against the actual symbols in our linux/c and linux/cpp manifests reveals two distinct unresolved buckets — both fully fixable on the stdlib side, neither addressed by this PR but both worth queueing as follow-ups.

C — transitive-include fallback gap

3,107 redis unresolved calls have targets that already exist in our linux/c manifest. Top:

Symbol Count
strcasecmp 952
memcpy 146
strlen 136
snprintf 128
strerror 119
memset 114
exit 104
free / malloc 89 + 63
strcmp 82
printf 81
fclose / fopen 42 + 20
pthread_mutex_lock / unlock 27 + 27
ntohs / htons 30 + 17

Root cause: redis files #include "server.h", and server.h pulls in <string.h>, <stdio.h>, <stdlib.h>, <pthread.h>, etc. The PR-02 C resolver walks only the caller file's direct system includes, so when redis-cli.c calls strlen it never looks in <string.h> because that header is brought in transitively through server.h.

Fix: mechanical port of this PR's C++ transitive fallback (lookupCppStdlibFreeFunction) to the C resolver's lookupCStdlib. About 30 LoC. Bounded to the manifest's header set so no risk of binding project symbols.

Estimated uplift: redis 63.1% → ~67.5% (+4.4 pp). Similar shape on any C codebase that uses an internal "common.h" pattern, which is essentially every non-trivial C project.

C++ — receiver-type inference gap

~3,621 proxygen unresolved calls have targets that exist as STL class methods. Top:

Method Calls Likely class
empty 525 vector / string / map / optional
size 460 vector / string / map
append 221 string
end 206 iterator endpoints
emplace_back 202 vector
value 166 optional
reset 154 unique_ptr / shared_ptr
find 145 string / map / set
front / back 137 + 54 vector / deque
data 136 string / vector
push_back 132 vector
begin 122 iterator endpoints
length 105 string
erase 93 container methods
has_value 89 optional
release 85 unique_ptr
clear / reserve 66 + 64 container methods
c_str 64 string

Root cause: these are unqualified vec.size() / str.empty() / opt.value() calls. The resolver needs to know the receiver's type to dispatch through GetMethod("vector", "std::vector", "size"). Today the type engine doesn't propagate types through:

  • function returns (auto v = obj.getVec();)
  • class members (vec_.size() inside a method when vec_ is a private field)
  • chained calls (obj.getVec().size())

Fix: larger work — extending the type engine's propagation rules. Estimated uplift: proxygen 17.1% → ~30% (+13 pp).

Out of scope (third-party / macros)

  • gtest/glog macros (VLOG, EXPECT_EQ, CHECK, WillRepeatedly) — ~5,500 proxygen calls
  • folly library (folly::makeUnexpected, folly::to, chainLength) — ~700
  • redis project API (RedisModule_*, serverLog, REDISMODULE_NOT_USED) — thousands

Recommended sequence

  1. PR-05 — port the transitive-include fallback to C. Small, mechanical, ships the redis +4.4 pp uplift.
  2. (Future) — Receiver-type inference. Larger, ships the proxygen +13 pp uplift.

Both target the stdlib-resolved-rate metric specifically, not the noisier "% of all calls" overall number.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request go Pull requests that update go code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant