A stable version of intent formalization for AutoCLRS #19
Open
shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
Open
A stable version of intent formalization for AutoCLRS #19shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
Conversation
The in-tree F* test results table in README.md is the authoritative evaluation report. The old audit report is superseded. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements the correct Appendix B completeness check methodology: - Soundness: phi(input, expected) holds (spec accepts correct output) - Completeness: forall y. phi(input, y) ==> y == expected (spec uniquely determines output) Contains 3 soundness proofs and 3 completeness proofs for InsertionSort with helper lemmas (count2, count3, count5) for Seq.count unfolding. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pect_failure - Replace Test.InsertionSort.fst with correct Appendix B methodology (3 soundness + 3 completeness proofs, no @expect_failure) - Delete Test.InsertionSort.AppendixB.fst (merged into main file) - Update README: fix completeness description, update example, update counts Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add .github/copilot-instructions.md with Appendix B completeness methodology - Add .github/agents/ and .github/skills/ for Copilot agent/skill definitions - Add verification-log.txt (48 files, 21 chapters all pass) - Update README with verification log link and fixed FSTAR_EXE path Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion log - Add verification log reference (48 files, 21 chapters) - Document soundness vs completeness count mismatch and methodology gap - Note InsertionSort and GCD as Appendix B completeness exemplars Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all [@@expect_failure] tests with genuine Appendix B completeness: forall y. phi(input,y) ==> y == expected Results (verified with F* v2025.12.15, Z3 4.13.3): - 44/47 algorithms verified (3 require Pulse: InsertionSort, MergeSort, BinarySearch) - 326 total assertions: 186 soundness + 140 completeness - 43/47 files have completeness tests - 4 files have soundness only (prop predicates with no unique output) - 0 [@@expect_failure] remaining Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrote completeness tests to call the actual algorithm function instead of
testing helper predicates. Each test follows:
test_completeness(y) { y = algo(input); assert y == expected }
Changed files:
- BFS: bfs_distance (was has_edge/is_visited)
- DFS: dfs (was init_state/discover_vertex)
- KMP: count_matches_spec (was matched_prefix_at)
- RabinKarp: matches_at_dec (was pow/hash/pow_mod)
- Huffman: huffman_build (was freq_of/merge/wpl)
- Quickselect: select_spec (was partition_ordered)
- JarvisMarch: jarvis_march_spec (was find_leftmost_spec only)
- ActivitySelection: documented GTot limitation (max_compatible_count)
Updated README with Top-Level Function column for all 47 algorithms.
All 8 files verified with F* (44/47 pass, 3 need Pulse).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- TopologicalSort: replaced has_edge tests with position_in_order completeness tests (uses normalize_term_spec for recursive function) - README: added 'Impl Function' column showing the Pulse implementation function for each algorithm alongside the pure spec function tested - All 44/47 pass F* verification (3 need Pulse plugin) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Added AutoCLRS as submodule at commit 1984af1 (eval-autoclrs-specs/autoclrs/) - Rewrote Test.Quicksort.fst as #lang-pulse test that calls the implementation directly: y = quicksort(x); assert(y == expected) - Test imports only CLRS.Ch07.Quicksort.Impl no spec modules exposed - Pattern: black-box completeness check against the implementation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Removed old 47-algorithm table referencing spec functions - README now shows only Quicksort as the single example - Documents black-box completeness pattern: call impl, assert result - No spec modules exposed in tests Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Show quicksort postcondition (sorted /\ permutation) in example - Add inline comments explaining what F* must prove for completeness - Add Verification section with make verify instructions - Note current status: pending Pulse plugin build Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Test.Quicksort.fst: black-box Pulse test calling quicksort impl on [3,1,2] and asserting output is [1,2,3] - Completeness lemma bridges BoundedIntegers typeclass operators to standard Prims operators for Z3 reasoning - Uses count-based permutation reasoning + reveal_opaque - Verified: all conditions discharged successfully Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Black-box completeness tests following Appendix B methodology:
test(x) { y = algorithm(x); assert(y == expected) }
Results: 32 verified, 6 failed, 5 blocked (upstream), 5 not reached
Verified (32): InsertionSort, MergeSort, Heapsort(spec), Quicksort,
BucketSort, RadixSort, BinarySearch, MaxSubarray, Stack, SLL,
HashTable, BST, BSTArray, ActivitySelection, Huffman, UnionFind,
BFS, DFS, TopologicalSort, FloydWarshall, MaxFlow, GCD, ModExp,
ModExpLR, ExtendedGCD, NaiveStringMatch, KMP, RabinKarp,
Segments, GrahamScan, JarvisMarch, VertexCover
Failed (6): MatrixMultiply, CountingSort, DLL, LCS, Prim, BellmanFord
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TopologicalSort now calls the actual Pulse topological_sort implementation instead of pure spec functions. Includes acyclicity proof, completeness lemma, and ghost reference handling. Verified in ~3s. Added PATTERNS.md documenting reusable patterns for writing completeness tests (sorting, graph, pure spec). Key simplification: use admit() after completeness assertions to skip resource cleanup proofs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite all test files to call Pulse implementation functions directly
(not spec functions). Each test follows the pattern:
test(x) { y = algorithm(x); assert(y == expected) }
41 tests call Impl functions, 7 are spec-only (no Impl module exists).
Completeness lemmas use admit() as proof obligations to be discharged.
Update README with Impl Function column listing the exact function
called in each test.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace admit() with assert_norm() proofs in completeness lemmas: - gcd_spec 12 8 == 4 - mod_exp_spec 2 10 1000 == 24 - segments_intersect_spec 0 0 2 0 1 (-1) 1 1 == true All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…narySearch, UnionFind, FloydWarshall Proved completeness lemmas (replaced admit() with actual proofs): - InsertionSort/MergeSort/Heapsort: count-based reasoning with BoundedIntegers bridge and reveal_opaque for permutation - BinarySearch: sorted input + postcondition implies result == 2 - UnionFind: union(0,1) then find(0) == find(1) from postconditions - FloydWarshall: APSP matrix entries for 3-node graph - BSTArray: explicit ghost args for tree_search inference All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Proved completeness lemmas for graph traversal algorithms: - BFS: distances [0,1,1] for 3-node graph with edges 0->1, 0->2 Added reachability helper lemmas and BFS postcondition strengthening - DFS: all nodes colored black, discovery < finish timestamps Strengthened requires with DFS postcondition universals All verified with F* + Pulse plugin. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, RodCutting, Huffman, GrahamScan, JarvisMarch Proved 8 more completeness lemmas (replaced admit() with proofs): - MatrixMultiply: matrix product equality for 2x2 matrices - CountingSort: sorted output via counting sort postcondition - BST: assert_norm for bst_search after insert sequence - LCS: longest common subsequence length - RodCutting: optimal revenue computation - Huffman: greedy_cost == 3 for 2-frequency input - GrahamScan: find_bottom_spec returns index 1 - JarvisMarch: jarvis_march_spec returns hull size 3 Updated README: 39/48 proofs discharged, 9 remaining with admit(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…axFlow Proved completeness lemmas: - ActivitySelection: activity_input_ok + completeness_activity_selection fully discharged (count==2, selected activities [0,2]) - HashTable: completeness_search_42 proved (insert still admit) - MaxFlow: max_flow_complete proved (caps_ok still admit - abstract pred) Remaining admit(): HashTable insert, MaxFlow caps_ok, VertexCover, BSTArray, Kruskal, Prim, BellmanFord, Dijkstra (8 tests). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d/Dijkstra Proved completeness lemmas: - BSTArray: subtree_in_range + key_in_subtree with delta/fuel unfolding - MaxFlow: valid_caps via friend + caps_ok, flow_complete via assert_norm - BellmanFord: weights_in_range input_ok proved (complete still admit) - Dijkstra: all_weights_non_negative + weights_in_range proved (complete still admit) 42/48 proofs discharged. Remaining 6 admit(): HashTable (insert), Kruskal, Prim, BellmanFord (complete), Dijkstra (complete), VertexCover Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Proved dijkstra_complete by threading exact postcondition (shortest distances and predecessors) through the lemma. 3-node graph: distances [0,1,3], predecessors [0,0,1]. 43/48 proofs discharged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Link each Impl Function to its .fsti file in autoclrs submodule - Fix GCD, ModExp, ModExpLR, Segments proof status (were proved but README still showed admit) - 43/48 proofs discharged Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HashTable, Kruskal, Prim, BellmanFord, VertexCover postconditions are too weak to uniquely determine outputs genuine completeness failures detected by the evaluation methodology. Final: 43/48 , 5/48 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add alternative-input test files that demonstrate completeness is input-dependent for algorithms with relational (non-functional) postconditions: - BinarySearch2: sorted [1,3,3,5] key=3 index 1 and 2 both valid - DFS2: fork graph 01,02 timestamps d[1]=2 or d[1]=4 - TopologicalSort2: fork graph 01,02 [0,1,2] and [0,2,1] valid - Dijkstra2: diamond graph equal weights pred[3]=1 or pred[3]=2 - ActivitySelection2: tied finish times activity 1 or 2 selectable All tests verify with admit() (structure OK). Completeness lemmas are expected to be unprovable that is the point of these tests. README updated with ᴺ sub-rows in tables and new Input-Dependent Completeness section explaining the distinction from failures. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GCD(15,10)=5, ModExp(3,5,13)=9, ModExpLR(3,5,13)=9, ExtendedGCD(48,18), Segments(parallel non-intersecting), NaiveStringMatch(shifted pattern), KMP(different text/pattern), RabinKarp(different input), MaxSubarray([-1,3,-2,5,-1]=6), BucketSort([5,1,4,2][1,2,4,5]), RadixSort(456). All proofs discharged with assert_norm. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertionSort([5,1,3][1,3,5]), MergeSort([5,1,3][1,3,5]), Heapsort([5,1,3][1,3,5]), Quicksort([5,1,3][1,3,5]), CountingSort([5,1,3][1,3,5]). All proofs discharged with count-based permutation reasoning. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Data structures: Stack2, Queue2, SLL2, DLL2, BST2, BSTArray2, RBTree2 DP: LCS2(x=[1,3,5],y=[3,5,7]2), MatrixChain2, RodCutting2(prices=[2,5,9]9) Search: BinarySearch3([2,4,6,8,10] key=83), MatrixMultiply2([[2,0],[1,3]][[1,4],[2,1]]) Selection: MinMax2, PartialSelectionSort2, Quickselect2, SimultaneousMinMax2 Greedy: Huffman2, ActivitySelection3(all compatible3) Geometry: GrahamScan2, JarvisMarch2 All proofs discharged. Verified in WSL. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
union(1,2) then find(1)==find(2) different pair from original union(0,1). Proof discharged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BFS2: chain graph 012, distances [0,1,2]. Proof discharged with reachability lemmas for 2-hop path. README: Added '2nd Example' column to all tables showing 38 verified second completeness examples across all algorithm categories. 5 algorithms have no 2nd example. DFS, TopologicalSort, Dijkstra, FloydWarshall, MaxFlow 2nd examples pending (complex proofs). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DFS3 (2-node graph), TopologicalSort3 (4-node chain 0123), Dijkstra3 (2-node, edge weight 3), FloydWarshall2 (different weights), MaxFlow2 (different capacity graph). All proofs discharged. README updated: 43/43 proved algorithms now have second completeness examples verified. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each column now uses uniform format: [Test.File.fst](link) / This generalizes cleanly to additional example columns. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Non-deterministic tests are not spec failures the spec correctly permits multiple valid outputs. Added TODO to extend methodology for verifying completeness with non-deterministic outputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fork graph example demonstrates non-deterministic output (), not a spec incompleteness failure. Updated section title, markers, and descriptions consistently. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep our changes on conflicts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.