A stable version of intent formalization for AutoCLRS by shuvendu-lahiri · Pull Request #19 · microsoft/intent-formalization

shuvendu-lahiri · 2026-03-16T23:03:00Z

No description provided.

The in-tree F* test results table in README.md is the authoritative evaluation report. The old audit report is superseded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Implements the correct Appendix B completeness check methodology: - Soundness: phi(input, expected) holds (spec accepts correct output) - Completeness: forall y. phi(input, y) ==> y == expected (spec uniquely determines output) Contains 3 soundness proofs and 3 completeness proofs for InsertionSort with helper lemmas (count2, count3, count5) for Seq.count unfolding. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…pect_failure - Replace Test.InsertionSort.fst with correct Appendix B methodology (3 soundness + 3 completeness proofs, no @expect_failure) - Delete Test.InsertionSort.AppendixB.fst (merged into main file) - Update README: fix completeness description, update example, update counts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Add .github/copilot-instructions.md with Appendix B completeness methodology - Add .github/agents/ and .github/skills/ for Copilot agent/skill definitions - Add verification-log.txt (48 files, 21 chapters all pass) - Update README with verification log link and fixed FSTAR_EXE path Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tion log - Add verification log reference (48 files, 21 chapters) - Document soundness vs completeness count mismatch and methodology gap - Note InsertionSort and GCD as Appendix B completeness exemplars Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace all [@@expect_failure] tests with genuine Appendix B completeness: forall y. phi(input,y) ==> y == expected Results (verified with F* v2025.12.15, Z3 4.13.3): - 44/47 algorithms verified (3 require Pulse: InsertionSort, MergeSort, BinarySearch) - 326 total assertions: 186 soundness + 140 completeness - 43/47 files have completeness tests - 4 files have soundness only (prop predicates with no unique output) - 0 [@@expect_failure] remaining Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rewrote completeness tests to call the actual algorithm function instead of testing helper predicates. Each test follows: test_completeness(y) { y = algo(input); assert y == expected } Changed files: - BFS: bfs_distance (was has_edge/is_visited) - DFS: dfs (was init_state/discover_vertex) - KMP: count_matches_spec (was matched_prefix_at) - RabinKarp: matches_at_dec (was pow/hash/pow_mod) - Huffman: huffman_build (was freq_of/merge/wpl) - Quickselect: select_spec (was partition_ordered) - JarvisMarch: jarvis_march_spec (was find_leftmost_spec only) - ActivitySelection: documented GTot limitation (max_compatible_count) Updated README with Top-Level Function column for all 47 algorithms. All 8 files verified with F* (44/47 pass, 3 need Pulse). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- TopologicalSort: replaced has_edge tests with position_in_order completeness tests (uses normalize_term_spec for recursive function) - README: added 'Impl Function' column showing the Pulse implementation function for each algorithm alongside the pure spec function tested - All 44/47 pass F* verification (3 need Pulse plugin) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Added AutoCLRS as submodule at commit 1984af1 (eval-autoclrs-specs/autoclrs/) - Rewrote Test.Quicksort.fst as #lang-pulse test that calls the implementation directly: y = quicksort(x); assert(y == expected) - Test imports only CLRS.Ch07.Quicksort.Impl no spec modules exposed - Pattern: black-box completeness check against the implementation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Removed old 47-algorithm table referencing spec functions - README now shows only Quicksort as the single example - Documents black-box completeness pattern: call impl, assert result - No spec modules exposed in tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Show quicksort postcondition (sorted /\ permutation) in example - Add inline comments explaining what F* must prove for completeness - Add Verification section with make verify instructions - Note current status: pending Pulse plugin build Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Test.Quicksort.fst: black-box Pulse test calling quicksort impl on [3,1,2] and asserting output is [1,2,3] - Completeness lemma bridges BoundedIntegers typeclass operators to standard Prims operators for Z3 reasoning - Uses count-based permutation reasoning + reveal_opaque - Verified: all conditions discharged successfully Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Black-box completeness tests following Appendix B methodology: test(x) { y = algorithm(x); assert(y == expected) } Results: 32 verified, 6 failed, 5 blocked (upstream), 5 not reached Verified (32): InsertionSort, MergeSort, Heapsort(spec), Quicksort, BucketSort, RadixSort, BinarySearch, MaxSubarray, Stack, SLL, HashTable, BST, BSTArray, ActivitySelection, Huffman, UnionFind, BFS, DFS, TopologicalSort, FloydWarshall, MaxFlow, GCD, ModExp, ModExpLR, ExtendedGCD, NaiveStringMatch, KMP, RabinKarp, Segments, GrahamScan, JarvisMarch, VertexCover Failed (6): MatrixMultiply, CountingSort, DLL, LCS, Prim, BellmanFord Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

TopologicalSort now calls the actual Pulse topological_sort implementation instead of pure spec functions. Includes acyclicity proof, completeness lemma, and ghost reference handling. Verified in ~3s. Added PATTERNS.md documenting reusable patterns for writing completeness tests (sorting, graph, pure spec). Key simplification: use admit() after completeness assertions to skip resource cleanup proofs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rewrite all test files to call Pulse implementation functions directly (not spec functions). Each test follows the pattern: test(x) { y = algorithm(x); assert(y == expected) } 41 tests call Impl functions, 7 are spec-only (no Impl module exists). Completeness lemmas use admit() as proof obligations to be discharged. Update README with Impl Function column listing the exact function called in each test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace admit() with assert_norm() proofs in completeness lemmas: - gcd_spec 12 8 == 4 - mod_exp_spec 2 10 1000 == 24 - segments_intersect_spec 0 0 2 0 1 (-1) 1 1 == true All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…narySearch, UnionFind, FloydWarshall Proved completeness lemmas (replaced admit() with actual proofs): - InsertionSort/MergeSort/Heapsort: count-based reasoning with BoundedIntegers bridge and reveal_opaque for permutation - BinarySearch: sorted input + postcondition implies result == 2 - UnionFind: union(0,1) then find(0) == find(1) from postconditions - FloydWarshall: APSP matrix entries for 3-node graph - BSTArray: explicit ghost args for tree_search inference All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Proved completeness lemmas for graph traversal algorithms: - BFS: distances [0,1,1] for 3-node graph with edges 0->1, 0->2 Added reachability helper lemmas and BFS postcondition strengthening - DFS: all nodes colored black, discovery < finish timestamps Strengthened requires with DFS postcondition universals All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…, RodCutting, Huffman, GrahamScan, JarvisMarch Proved 8 more completeness lemmas (replaced admit() with proofs): - MatrixMultiply: matrix product equality for 2x2 matrices - CountingSort: sorted output via counting sort postcondition - BST: assert_norm for bst_search after insert sequence - LCS: longest common subsequence length - RodCutting: optimal revenue computation - Huffman: greedy_cost == 3 for 2-frequency input - GrahamScan: find_bottom_spec returns index 1 - JarvisMarch: jarvis_march_spec returns hull size 3 Updated README: 39/48 proofs discharged, 9 remaining with admit(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…axFlow Proved completeness lemmas: - ActivitySelection: activity_input_ok + completeness_activity_selection fully discharged (count==2, selected activities [0,2]) - HashTable: completeness_search_42 proved (insert still admit) - MaxFlow: max_flow_complete proved (caps_ok still admit - abstract pred) Remaining admit(): HashTable insert, MaxFlow caps_ok, VertexCover, BSTArray, Kruskal, Prim, BellmanFord, Dijkstra (8 tests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…d/Dijkstra Proved completeness lemmas: - BSTArray: subtree_in_range + key_in_subtree with delta/fuel unfolding - MaxFlow: valid_caps via friend + caps_ok, flow_complete via assert_norm - BellmanFord: weights_in_range input_ok proved (complete still admit) - Dijkstra: all_weights_non_negative + weights_in_range proved (complete still admit) 42/48 proofs discharged. Remaining 6 admit(): HashTable (insert), Kruskal, Prim, BellmanFord (complete), Dijkstra (complete), VertexCover Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Proved dijkstra_complete by threading exact postcondition (shortest distances and predecessors) through the lemma. 3-node graph: distances [0,1,3], predecessors [0,0,1]. 43/48 proofs discharged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Link each Impl Function to its .fsti file in autoclrs submodule - Fix GCD, ModExp, ModExpLR, Segments proof status (were proved but README still showed admit) - 43/48 proofs discharged Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

HashTable, Kruskal, Prim, BellmanFord, VertexCover postconditions are too weak to uniquely determine outputs genuine completeness failures detected by the evaluation methodology. Final: 43/48 , 5/48 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add alternative-input test files that demonstrate completeness is input-dependent for algorithms with relational (non-functional) postconditions: - BinarySearch2: sorted [1,3,3,5] key=3 index 1 and 2 both valid - DFS2: fork graph 01,02 timestamps d[1]=2 or d[1]=4 - TopologicalSort2: fork graph 01,02 [0,1,2] and [0,2,1] valid - Dijkstra2: diamond graph equal weights pred[3]=1 or pred[3]=2 - ActivitySelection2: tied finish times activity 1 or 2 selectable All tests verify with admit() (structure OK). Completeness lemmas are expected to be unprovable that is the point of these tests. README updated with ᴺ sub-rows in tables and new Input-Dependent Completeness section explaining the distinction from failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

GCD(15,10)=5, ModExp(3,5,13)=9, ModExpLR(3,5,13)=9, ExtendedGCD(48,18), Segments(parallel non-intersecting), NaiveStringMatch(shifted pattern), KMP(different text/pattern), RabinKarp(different input), MaxSubarray([-1,3,-2,5,-1]=6), BucketSort([5,1,4,2][1,2,4,5]), RadixSort(456). All proofs discharged with assert_norm. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

InsertionSort([5,1,3][1,3,5]), MergeSort([5,1,3][1,3,5]), Heapsort([5,1,3][1,3,5]), Quicksort([5,1,3][1,3,5]), CountingSort([5,1,3][1,3,5]). All proofs discharged with count-based permutation reasoning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Data structures: Stack2, Queue2, SLL2, DLL2, BST2, BSTArray2, RBTree2 DP: LCS2(x=[1,3,5],y=[3,5,7]2), MatrixChain2, RodCutting2(prices=[2,5,9]9) Search: BinarySearch3([2,4,6,8,10] key=83), MatrixMultiply2([[2,0],[1,3]][[1,4],[2,1]]) Selection: MinMax2, PartialSelectionSort2, Quickselect2, SimultaneousMinMax2 Greedy: Huffman2, ActivitySelection3(all compatible3) Geometry: GrahamScan2, JarvisMarch2 All proofs discharged. Verified in WSL. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

union(1,2) then find(1)==find(2) different pair from original union(0,1). Proof discharged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

BFS2: chain graph 012, distances [0,1,2]. Proof discharged with reachability lemmas for 2-hop path. README: Added '2nd Example' column to all tables showing 38 verified second completeness examples across all algorithm categories. 5 algorithms have no 2nd example. DFS, TopologicalSort, Dijkstra, FloydWarshall, MaxFlow 2nd examples pending (complex proofs). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

DFS3 (2-node graph), TopologicalSort3 (4-node chain 0123), Dijkstra3 (2-node, edge weight 3), FloydWarshall2 (different weights), MaxFlow2 (different capacity graph). All proofs discharged. README updated: 43/43 proved algorithms now have second completeness examples verified. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Each column now uses uniform format: [Test.File.fst](link) / This generalizes cleanly to additional example columns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Non-deterministic tests are not spec failures the spec correctly permits multiple valid outputs. Added TODO to extend methodology for verifying completeness with non-deterministic outputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The fork graph example demonstrates non-deterministic output (), not a spec incompleteness failure. Updated section title, markers, and descriptions consistently. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Keep our changes on conflicts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

shuvendu-lahiri and others added 30 commits March 12, 2026 14:22

Remove stale SPEC_AUDIT_REPORT.md from eval-autoclrs-specs

6ff00dc

The in-tree F* test results table in README.md is the authoritative evaluation report. The old audit report is superseded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove Copilot config files (not needed in this repo)

738ea0e

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restore top-level README to prior content

68a5479

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

shuvendu-lahiri and others added 7 commits March 15, 2026 23:18

Add second completeness example for UnionFind

25d5107

union(1,2) then find(1)==find(2) different pair from original union(0,1). Proof discharged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Restructure README: merge Test File + Proof into Completeness column

7ab4b52

Each column now uses uniform format: [Test.File.fst](link) / This generalizes cleanly to additional example columns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Merge main into intree-spec-tests

4e5b472

Keep our changes on conflicts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A stable version of intent formalization for AutoCLRS #19

A stable version of intent formalization for AutoCLRS #19
shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
shuvendu-lahiri:intree-spec-tests

shuvendu-lahiri commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuvendu-lahiri commented Mar 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant