feat(tools): C/C++ stdlib registry generator + Linux overlay #679
Open
shivasurya wants to merge 3 commits into main
Adds the schema contract consumed by both the PR-01 generator (this stack)
and the loader landing in PR-02:
- CStdlibRegistry / NewCStdlibRegistry: root in-memory container per
(platform, language) axis, with accessors HasHeader, GetHeader,
GetFunction, GetClass, GetMethod that mirror the existing
GoStdlibRegistry surface.
- CStdlibManifest + CStdlibHeaderEntry + CStdlibStatistics: the top-level
manifest.json shape; HasHeader / GetHeaderEntry helpers for the loader's
lazy-fetch path.
- CStdlibHeader: per-header content; one type works for both C and C++
(C++-only fields are tagged omitempty so C output stays clean).
- CStdlibFunction / CStdlibParam / CStdlibTypedef / CStdlibConstant /
CppStdlibClass / CppStdlibConstructor: leaf entries.
- Source / language / platform string constants so consumers don't
hard-code "header" / "overlay" / "merged" / "linux" / "c" / "cpp".
JSON tags are snake_case to match the Python and Go stdlib registry files
already on disk; nolint:tagliatelle directives match the pattern in
go_stdlib_types.go.
100% test coverage on the new file via 12 round-trip + accessor tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
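To make the schema description above concrete, here is a minimal sketch of a registry container with snake_case JSON tags and an omitempty C++-only field. The type and field names (Header, Function, Class, Registry) are hypothetical stand-ins chosen for illustration; they are not the definitions in clike_stdlib_types.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Function is a hypothetical leaf entry; the real CStdlibFunction may differ.
type Function struct {
	Name       string `json:"name"`
	ReturnType string `json:"return_type"`
}

// Class is a hypothetical C++-only leaf entry.
type Class struct {
	Name string `json:"name"`
}

// Header holds per-header content for both C and C++.
type Header struct {
	Header    string              `json:"header"`
	Language  string              `json:"language"`
	Functions map[string]Function `json:"functions"`
	// C++-only: omitted from the JSON when empty, so C output stays clean.
	Classes map[string]Class `json:"classes,omitempty"`
}

// Registry is an in-memory container for one (platform, language) pair.
type Registry struct {
	Platform string
	Language string
	Headers  map[string]*Header
}

// NewRegistry returns an empty registry for the given axis.
func NewRegistry(platform, language string) *Registry {
	return &Registry{Platform: platform, Language: language, Headers: map[string]*Header{}}
}

// HasHeader reports whether the header is loaded.
func (r *Registry) HasHeader(name string) bool {
	_, ok := r.Headers[name]
	return ok
}

// GetFunction looks up a function inside a header.
func (r *Registry) GetFunction(header, name string) (Function, bool) {
	h, ok := r.Headers[header]
	if !ok {
		return Function{}, false
	}
	fn, ok := h.Functions[name]
	return fn, ok
}

// EncodeHeader serializes one header to compact JSON.
func EncodeHeader(h *Header) (string, error) {
	b, err := json.Marshal(h)
	return string(b), err
}

func main() {
	r := NewRegistry("linux", "c")
	r.Headers["stdio.h"] = &Header{
		Header:    "stdio.h",
		Language:  "c",
		Functions: map[string]Function{"printf": {Name: "printf", ReturnType: "int"}},
	}
	out, err := EncodeHeader(r.Headers["stdio.h"])
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // no "classes" key appears for a C header
}
```

Sharing one header type across both languages keeps the loader path uniform; the omitempty tag is what prevents empty C++ fields from leaking into C output.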
Adds tools/internal/clikeextract — the Go package that walks installed
C/C++ system headers and emits per-header JSON registry files. Mirrors
the existing tools/internal/goextract layout (thin entry-point binary
plus a fully-tested internal package), one concern per file:
- doc.go — package docs (pipeline overview, reuse rules)
- config.go — Config + GeneratorVersion / SchemaVersion / RegistryVersion
/ DefaultBaseURL constants. Validate() rejects unsupported targets and
languages early.
- normalize.go — strip __attribute__((...)), _GLIBCXX_*, _LIBCPP_*,
__THROW, _Nonnull etc.; canonicalize std::__cxx11:: -> std::; private-
symbol detection (single-underscore lowercase, double-underscore);
SanitizeHeaderName for output filenames.
- walker.go — DiscoverHeaderSources for linux/c (glibc) and linux/cpp
(libstdc++), system-tag detection, deterministic header walking with
bits/ / internal/ skip rules. Windows/Darwin paths return an explicit
PR-03-deferred error so the surface is forward-compatible.
- overlay.go — yaml.v3-based loader for c_stdlib_overlay.yaml /
cpp_stdlib_overlay.yaml. Validates language match, exactly-one-of
function/method/typedef/constant, and skip-rule shape. MergeOverlay
applies overrides in place and returns the count for statistics.
- c_extractor.go — C function / typedef / preproc-def extraction over
the tree-sitter AST. Reuses graph/clike helpers (ExtractFunctionInfo,
ExtractTypeString, ExtractParameters). Conservative #define handling:
emit constants only when the body parses as a literal.
- cpp_extractor.go — C++ classes (with template_parameter_list capture),
methods, namespace-qualified free functions, constructors. Adds a
local findFunctionDeclarator that handles C++ reference_declarator
wrappers (Phase 1's clike helper does not — this is reachable for
T&-returning methods like vector::operator[] / vector::at).
- emitter.go — per-header JSON write with sha256 checksums, statistics
tally, top-level manifest.json. Output is deterministic across runs
(sorted by header name) and idempotent.
- extractor.go — Run() orchestrator stitching discover -> walk -> extract
-> merge -> emit. Continue-on-parse-failure pattern matches goextract;
fatal errors only on missing search dirs, invalid overlay, unwritable
output dir.
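As an illustration of the normalize.go step listed above, the following sketch strips __attribute__((...)) groups and the __THROW / _Nonnull markers, then canonicalizes std::__cxx11:: to std::. The function names StripAttributes and Normalize are assumptions made for this example, and the real package may handle more macros (for instance the _GLIBCXX_* and _LIBCPP_* families) in a different way.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// macroRe matches a small, illustrative subset of annotation macros.
var macroRe = regexp.MustCompile(`\b(?:__THROW|_Nonnull)\b`)

// StripAttributes removes every __attribute__((...)) group. Attribute
// arguments can nest parentheses, e.g. format(printf, 1, 2), so the
// function counts parenthesis depth instead of relying on a regexp.
func StripAttributes(s string) string {
	const marker = "__attribute__"
	for {
		start := strings.Index(s, marker)
		if start < 0 {
			return s
		}
		i := start + len(marker)
		for i < len(s) && s[i] == ' ' {
			i++
		}
		if i >= len(s) || s[i] != '(' {
			// No argument list follows: drop only the marker itself.
			s = s[:start] + s[i:]
			continue
		}
		depth := 0
		end := i
		for ; end < len(s); end++ {
			if s[end] == '(' {
				depth++
			} else if s[end] == ')' {
				depth--
				if depth == 0 {
					end++ // include the closing parenthesis
					break
				}
			}
		}
		s = s[:start] + s[end:]
	}
}

// Normalize applies the cleanup steps in order and collapses whitespace.
func Normalize(decl string) string {
	s := StripAttributes(decl)
	s = macroRe.ReplaceAllString(s, "")
	s = strings.ReplaceAll(s, "std::__cxx11::", "std::")
	s = strings.Join(strings.Fields(s), " ")
	return strings.ReplaceAll(s, " ;", ";")
}

func main() {
	fmt.Println(Normalize(`extern int printf(const char *fmt, ...) __attribute__((format(printf, 1, 2))) __THROW;`))
	fmt.Println(Normalize(`std::__cxx11::basic_string<char> name()`))
}
```

A depth counter is used because Go's regexp package (RE2) cannot match balanced parentheses.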
testdata/c/{stdio.h,string.h,unistd.h,inline.h} and
testdata/cpp/{vector,string,utility} provide synthetic fixture headers
for unit + integration tests. End-to-end TestRunFixtureLinux{C,Cpp}
exercise the full pipeline.
Coverage: 91.5% on the new package across 99 test cases. Remaining
gaps are defensive nil paths and tree-sitter shapes the synthetic
fixtures don't reach (operator_name, destructor_name fallbacks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing generator binary plus the two hand-curated YAML
overlay files that augment tree-sitter extraction with security tags,
template return types, and skip rules:
- tools/generate_clike_stdlib_registry.go — //go:build cpf_generate_stdlib_registry
entry-point binary. Mirrors the layout of generate_go_stdlib_registry.go:
flag-parse target / language / output-dir / overlay / base-url, hand
off to clikeextract.NewExtractor(cfg).Run().
- tools/c_stdlib_overlay.yaml — 28 hand-curated entries covering
format-string sinks (printf family with __attribute__((format))),
command-injection sinks (system, popen, exec*), buffer-overflow sinks
(strcpy, gets, sprintf), allocation sources (malloc, calloc), tainted
sources (getenv, read), plus skip rules for compiler-internal symbols.
- tools/cpp_stdlib_overlay.yaml — 55 entries covering STL methods whose
template return types tree-sitter cannot substitute: vector at /
operator[] / data, basic_string c_str / data, unique_ptr/shared_ptr
get/reset/operator*, optional value/value_or, map/unordered_map find/
insert/operator[]/at, std::move / std::forward, ostream/istream stream
operators. Throws annotations on at() (std::out_of_range), value()
(std::bad_optional_access).
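The overlay rules described above (language match, exactly one of function/method/typedef/constant, overrides applied in place with a count returned) might be sketched as follows. This is an assumption-laden illustration: the struct fields are hypothetical, the YAML decoding done by yaml.v3 in the real loader is left out, and MergeOverlay here only overrides return types.

```go
package main

import "fmt"

// OverlayEntry is a hypothetical overlay record; exactly one of the four
// target fields must be set.
type OverlayEntry struct {
	Function   string
	Method     string
	Typedef    string
	Constant   string
	ReturnType string
	Throws     []string
}

// Overlay is a hypothetical decoded overlay file.
type Overlay struct {
	Language string
	Entries  []OverlayEntry
}

// ValidateOverlay rejects a language mismatch and any entry that names
// zero or several targets.
func ValidateOverlay(o Overlay, wantLanguage string) error {
	if o.Language != wantLanguage {
		return fmt.Errorf("overlay language %q does not match %q", o.Language, wantLanguage)
	}
	for i, e := range o.Entries {
		set := 0
		for _, v := range []string{e.Function, e.Method, e.Typedef, e.Constant} {
			if v != "" {
				set++
			}
		}
		if set != 1 {
			return fmt.Errorf("entry %d: want exactly one of function/method/typedef/constant, got %d", i, set)
		}
	}
	return nil
}

// MergeOverlay overrides extracted return types in place and returns how
// many overrides were applied, which can feed the statistics tally.
func MergeOverlay(returnTypes map[string]string, o Overlay) int {
	applied := 0
	for _, e := range o.Entries {
		name := e.Function
		if name == "" {
			name = e.Method
		}
		if name == "" || e.ReturnType == "" {
			continue
		}
		returnTypes[name] = e.ReturnType
		applied++
	}
	return applied
}

func main() {
	o := Overlay{Language: "cpp", Entries: []OverlayEntry{
		{Method: "std::vector::at", ReturnType: "T&", Throws: []string{"std::out_of_range"}},
	}}
	if err := ValidateOverlay(o, "cpp"); err != nil {
		panic(err)
	}
	types := map[string]string{"std::vector::at": "auto"}
	fmt.Println(MergeOverlay(types, o), types["std::vector::at"])
}
```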
End-to-end smoke against this host's /usr/include + /usr/include/c++/13
produced 1875 C headers (8467 functions) and 121 C++ headers (497
classes, 564 functions) — both manifests parsed and statistics check
out.
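One way to get manifests that are reproducible across runs, as the emitter description claims, is to sort header names before writing and to checksum each per-header payload. The sketch below assumes a hypothetical ManifestEntry shape and a "sha256:" checksum prefix; neither is taken from the actual manifest.json schema.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// ManifestEntry is a hypothetical manifest row for one header file.
type ManifestEntry struct {
	Header   string `json:"header"`
	File     string `json:"file"`
	Checksum string `json:"checksum"`
}

// BuildEntries turns header-name -> encoded-JSON payloads into manifest
// rows sorted by header name, so map iteration order never leaks into
// the output.
func BuildEntries(files map[string][]byte) []ManifestEntry {
	names := make([]string, 0, len(files))
	for name := range files {
		names = append(names, name)
	}
	sort.Strings(names)

	entries := make([]ManifestEntry, 0, len(names))
	for _, name := range names {
		sum := sha256.Sum256(files[name])
		entries = append(entries, ManifestEntry{
			Header:   name,
			File:     name + ".json",
			Checksum: "sha256:" + hex.EncodeToString(sum[:]),
		})
	}
	return entries
}

// EncodeManifest serializes the rows with stable indentation.
func EncodeManifest(entries []ManifestEntry) ([]byte, error) {
	return json.MarshalIndent(entries, "", "  ")
}

func main() {
	files := map[string][]byte{
		"string.h": []byte(`{"header":"string.h"}`),
		"stdio.h":  []byte(`{"header":"stdio.h"}`),
	}
	out, err := EncodeManifest(BuildEntries(files))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Running the same input twice yields byte-identical output, which is what makes a checksum comparison between runs meaningful.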
Run with:
go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go \
--target=linux --language=c --output-dir=/tmp/cpf-c
go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go \
--target=linux --language=cpp --output-dir=/tmp/cpf-cpp
Output is local-only in this PR; remote deployment + CDN URL come in
PR-03. Loader + engine integration come in PR-02.
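For readers unfamiliar with the entry-point pattern, the flag handling and early validation could look roughly like the sketch below. The flag names come from the commit message; the Config fields, default values, and error wording are assumptions, and the real binary hands off to clikeextract.NewExtractor(cfg).Run() rather than printing.

```go
package main

import (
	"flag"
	"fmt"
)

// Config is a hypothetical mirror of the generator configuration.
type Config struct {
	Target      string
	Language    string
	OutputDir   string
	OverlayPath string
	BaseURL     string
}

// Validate rejects unsupported targets and languages before any work starts.
func (c Config) Validate() error {
	if c.Target != "linux" {
		return fmt.Errorf("unsupported target %q", c.Target)
	}
	if c.Language != "c" && c.Language != "cpp" {
		return fmt.Errorf("unsupported language %q", c.Language)
	}
	if c.OutputDir == "" {
		return fmt.Errorf("output-dir is required")
	}
	return nil
}

// ParseArgs parses CLI arguments into a Config and validates it.
func ParseArgs(args []string) (Config, error) {
	fs := flag.NewFlagSet("generate_clike_stdlib_registry", flag.ContinueOnError)
	var c Config
	fs.StringVar(&c.Target, "target", "linux", "target platform")
	fs.StringVar(&c.Language, "language", "c", "c or cpp")
	fs.StringVar(&c.OutputDir, "output-dir", "", "directory for registry JSON")
	fs.StringVar(&c.OverlayPath, "overlay", "", "path to the YAML overlay")
	fs.StringVar(&c.BaseURL, "base-url", "", "base URL recorded in the manifest")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	return c, c.Validate()
}

func main() {
	cfg, err := ParseArgs([]string{"--target=linux", "--language=cpp", "--output-dir=/tmp/cpf-cpp"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```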
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SafeDep Report Summary
No dependency changes detected. Nothing to scan.
Code Pathfinder Security Scan
No security issues detected.
Codecov Report
@@            Coverage Diff             @@
##             main     #679      +/-   ##
==========================================
+ Coverage   85.43%   85.55%   +0.11%
==========================================
  Files         187      196       +9
  Lines       27278    28341    +1063
==========================================
+ Hits        23305    24247     +942
- Misses       3082     3156      +74
- Partials      891      938      +47



PR-01 of the C/C++ Phase 2 stdlib-resolution stack. Closes the 88% unresolved-call gap from Phase 1 by giving us a generator that walks installed system headers and emits per-header JSON manifests the loader (PR-02) will consume.

What's in this PR
- graph/callgraph/core/clike_stdlib_types.go: public schema (CStdlibRegistry, CStdlibManifest, CStdlibHeader, function/class/typedef/constant/param types). Stable contract between this PR's generator and PR-02's loader.
- tools/internal/clikeextract/: new internal package, one concern per file:
  - walker.go: DiscoverHeaderSources (linux/c, linux/cpp), glibc/libstdc++ probing, deterministic header walk with bits/ / internal/ skip rules.
  - c_extractor.go, cpp_extractor.go: tree-sitter walkers reusing graph/clike helpers; C++ adds template-parameter capture, namespace-stack tracking, and a local findFunctionDeclarator that handles C++ reference_declarator (Phase 1's clike helper doesn't, blocking T&-returning methods like vector::operator[] / vector::at).
  - overlay.go: yaml.v3 loader + MergeOverlay with strict validation (language match, exactly-one-of function/method/typedef/constant, skip-rule shape).
  - emitter.go: per-header JSON write + sha256 checksums + deterministic manifest.json.
  - normalize.go: strip __attribute__, _GLIBCXX_*, _LIBCPP_*, canonicalize std::__cxx11:: → std::, SanitizeHeaderName.
  - extractor.go: Run() orchestrator (continue-on-parse-failure, mirrors goextract).
- tools/c_stdlib_overlay.yaml (28 entries): security-critical sinks: format-string (printf family with __attribute__((format))), command-injection (system, popen, exec*), buffer-overflow (strcpy, gets, sprintf), allocation/tainted-source markers.
- tools/cpp_stdlib_overlay.yaml (55 entries): STL methods whose template return types tree-sitter cannot substitute: vector / basic_string / unique_ptr / shared_ptr / optional / map / unordered_map; std::move, std::forward; throws annotations on at() / value().
- tools/generate_clike_stdlib_registry.go: thin //go:build cpf_generate_stdlib_registry entry point wiring CLI flags into clikeextract.Extractor.

Why a separate clikeextract package (not flat in tools/)
Mirrors the existing tools/internal/goextract/ precedent: a thin entry point plus heavy logic in an internal package, one file per concern, fully testable under regular go test ./.... The tech spec sketched everything flat in tools/; this layout is cleaner and consistent with the rest of the repo.

End-to-end smoke (this host's /usr/include + /usr/include/c++/13)
The run produced 1875 C headers (8467 functions) and 121 C++ headers (497 classes, 564 functions). Both manifests parse round-trip. C output exceeds the spec's "~80 headers / ~1800 functions" budget because the walker captures the full POSIX/sys surface in addition to libc.

Out of scope (per PR-01 plan)
- c_builder.go / cpp_builder.go: PR-02
- Windows/Darwin header discovery: PR-03 (the walker returns an explicit PR-03-deferred error)
- --diagnose-stdlib, resolution-report enhancements: PR-04

Verification
- go test ./... passes all tests, with no regressions on Phase 1
- golangci-lint run ./... reports zero issues across the entire repo
- gradle buildGo builds cleanly
- 91.5% coverage on the clikeextract package, 100% on the new clike_stdlib_types.go
- go run -tags cpf_generate_stdlib_registry tools/generate_clike_stdlib_registry.go --target=linux --language={c,cpp} --output-dir=... produces valid manifests on a real Ubuntu host

Test plan
- go test ./tools/internal/clikeextract/... ./graph/callgraph/core/ passes
- go run -tags cpf_generate_stdlib_registry ./tools/generate_clike_stdlib_registry.go --target=linux --language=c --output-dir=/tmp/cpf-c produces manifest.json + per-header JSONs
- gradle buildGo and gradle lintGo succeed (Python lint failure on unrelated files is pre-existing)

🤖 Generated with Claude Code