Skip to content

A stable version of intent formalization for AutoCLRS #19

Open
shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
shuvendu-lahiri:intree-spec-tests
Open

A stable version of intent formalization for AutoCLRS #19
shuvendu-lahiri wants to merge 37 commits intomicrosoft:mainfrom
shuvendu-lahiri:intree-spec-tests

Conversation

@shuvendu-lahiri
Copy link
Member

No description provided.

shuvendu-lahiri and others added 30 commits March 12, 2026 14:22
The in-tree F* test results table in README.md is the authoritative
evaluation report. The old audit report is superseded.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements the correct Appendix B completeness check methodology:
- Soundness: phi(input, expected) holds (spec accepts correct output)
- Completeness: forall y. phi(input, y) ==> y == expected (spec uniquely determines output)

Contains 3 soundness proofs and 3 completeness proofs for InsertionSort
with helper lemmas (count2, count3, count5) for Seq.count unfolding.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pect_failure

- Replace Test.InsertionSort.fst with correct Appendix B methodology
  (3 soundness + 3 completeness proofs, no @expect_failure)
- Delete Test.InsertionSort.AppendixB.fst (merged into main file)
- Update README: fix completeness description, update example, update counts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add .github/copilot-instructions.md with Appendix B completeness methodology
- Add .github/agents/ and .github/skills/ for Copilot agent/skill definitions
- Add verification-log.txt (48 files, 21 chapters all pass)
- Update README with verification log link and fixed FSTAR_EXE path

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion log

- Add verification log reference (48 files, 21 chapters)
- Document soundness vs completeness count mismatch and methodology gap
- Note InsertionSort and GCD as Appendix B completeness exemplars

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace all [@@expect_failure] tests with genuine Appendix B completeness:
  forall y. phi(input,y) ==> y == expected

Results (verified with F* v2025.12.15, Z3 4.13.3):
- 44/47 algorithms verified (3 require Pulse: InsertionSort, MergeSort, BinarySearch)
- 326 total assertions: 186 soundness + 140 completeness
- 43/47 files have completeness tests
- 4 files have soundness only (prop predicates with no unique output)
- 0 [@@expect_failure] remaining

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrote completeness tests to call the actual algorithm function instead of
testing helper predicates. Each test follows:
  test_completeness(y) { y = algo(input); assert y == expected }

Changed files:
- BFS: bfs_distance (was has_edge/is_visited)
- DFS: dfs (was init_state/discover_vertex)
- KMP: count_matches_spec (was matched_prefix_at)
- RabinKarp: matches_at_dec (was pow/hash/pow_mod)
- Huffman: huffman_build (was freq_of/merge/wpl)
- Quickselect: select_spec (was partition_ordered)
- JarvisMarch: jarvis_march_spec (was find_leftmost_spec only)
- ActivitySelection: documented GTot limitation (max_compatible_count)

Updated README with Top-Level Function column for all 47 algorithms.
All 8 files verified with F* (44/47 pass, 3 need Pulse).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- TopologicalSort: replaced has_edge tests with position_in_order
  completeness tests (uses normalize_term_spec for recursive function)
- README: added 'Impl Function' column showing the Pulse implementation
  function for each algorithm alongside the pure spec function tested
- All 44/47 pass F* verification (3 need Pulse plugin)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Added AutoCLRS as submodule at commit 1984af1 (eval-autoclrs-specs/autoclrs/)
- Rewrote Test.Quicksort.fst as #lang-pulse test that calls the
  implementation directly: y = quicksort(x); assert(y == expected)
- Test imports only CLRS.Ch07.Quicksort.Impl  no spec modules exposed
- Pattern: black-box completeness check against the implementation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Removed old 47-algorithm table referencing spec functions
- README now shows only Quicksort as the single example
- Documents black-box completeness pattern: call impl, assert result
- No spec modules exposed in tests

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Show quicksort postcondition (sorted /\ permutation) in example
- Add inline comments explaining what F* must prove for completeness
- Add Verification section with make verify instructions
- Note current status: pending Pulse plugin build

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Test.Quicksort.fst: black-box Pulse test calling quicksort impl
  on [3,1,2] and asserting output is [1,2,3]
- Completeness lemma bridges BoundedIntegers typeclass operators
  to standard Prims operators for Z3 reasoning
- Uses count-based permutation reasoning + reveal_opaque
- Verified: all conditions discharged successfully

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Black-box completeness tests following Appendix B methodology:
  test(x) { y = algorithm(x); assert(y == expected) }

Results: 32 verified, 6 failed, 5 blocked (upstream), 5 not reached

Verified (32): InsertionSort, MergeSort, Heapsort(spec), Quicksort,
  BucketSort, RadixSort, BinarySearch, MaxSubarray, Stack, SLL,
  HashTable, BST, BSTArray, ActivitySelection, Huffman, UnionFind,
  BFS, DFS, TopologicalSort, FloydWarshall, MaxFlow, GCD, ModExp,
  ModExpLR, ExtendedGCD, NaiveStringMatch, KMP, RabinKarp,
  Segments, GrahamScan, JarvisMarch, VertexCover

Failed (6): MatrixMultiply, CountingSort, DLL, LCS, Prim, BellmanFord

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TopologicalSort now calls the actual Pulse topological_sort implementation
instead of pure spec functions. Includes acyclicity proof, completeness
lemma, and ghost reference handling. Verified in ~3s.

Added PATTERNS.md documenting reusable patterns for writing completeness
tests (sorting, graph, pure spec). Key simplification: use admit() after
completeness assertions to skip resource cleanup proofs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rewrite all test files to call Pulse implementation functions directly
(not spec functions). Each test follows the pattern:
  test(x) { y = algorithm(x); assert(y == expected) }

41 tests call Impl functions, 7 are spec-only (no Impl module exists).
Completeness lemmas use admit() as proof obligations to be discharged.

Update README with Impl Function column listing the exact function
called in each test.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace admit() with assert_norm() proofs in completeness lemmas:
- gcd_spec 12 8 == 4
- mod_exp_spec 2 10 1000 == 24
- segments_intersect_spec 0 0 2 0 1 (-1) 1 1 == true

All verified with F* + Pulse plugin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…narySearch, UnionFind, FloydWarshall

Proved completeness lemmas (replaced admit() with actual proofs):
- InsertionSort/MergeSort/Heapsort: count-based reasoning with
  BoundedIntegers bridge and reveal_opaque for permutation
- BinarySearch: sorted input + postcondition implies result == 2
- UnionFind: union(0,1) then find(0) == find(1) from postconditions
- FloydWarshall: APSP matrix entries for 3-node graph
- BSTArray: explicit ghost args for tree_search inference

All verified with F* + Pulse plugin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Proved completeness lemmas for graph traversal algorithms:
- BFS: distances [0,1,1] for 3-node graph with edges 0->1, 0->2
  Added reachability helper lemmas and BFS postcondition strengthening
- DFS: all nodes colored black, discovery < finish timestamps
  Strengthened requires with DFS postcondition universals

All verified with F* + Pulse plugin.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, RodCutting, Huffman, GrahamScan, JarvisMarch

Proved 8 more completeness lemmas (replaced admit() with proofs):
- MatrixMultiply: matrix product equality for 2x2 matrices
- CountingSort: sorted output via counting sort postcondition
- BST: assert_norm for bst_search after insert sequence
- LCS: longest common subsequence length
- RodCutting: optimal revenue computation
- Huffman: greedy_cost == 3 for 2-frequency input
- GrahamScan: find_bottom_spec returns index 1
- JarvisMarch: jarvis_march_spec returns hull size 3

Updated README: 39/48 proofs discharged, 9 remaining with admit().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…axFlow

Proved completeness lemmas:
- ActivitySelection: activity_input_ok + completeness_activity_selection
  fully discharged (count==2, selected activities [0,2])
- HashTable: completeness_search_42 proved (insert still admit)
- MaxFlow: max_flow_complete proved (caps_ok still admit - abstract pred)

Remaining admit(): HashTable insert, MaxFlow caps_ok, VertexCover,
BSTArray, Kruskal, Prim, BellmanFord, Dijkstra (8 tests).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d/Dijkstra

Proved completeness lemmas:
- BSTArray: subtree_in_range + key_in_subtree with delta/fuel unfolding
- MaxFlow: valid_caps via friend + caps_ok, flow_complete via assert_norm
- BellmanFord: weights_in_range input_ok proved (complete still admit)
- Dijkstra: all_weights_non_negative + weights_in_range proved (complete still admit)

42/48 proofs discharged. Remaining 6 admit():
HashTable (insert), Kruskal, Prim, BellmanFord (complete),
Dijkstra (complete), VertexCover

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Proved dijkstra_complete by threading exact postcondition
(shortest distances and predecessors) through the lemma.
3-node graph: distances [0,1,3], predecessors [0,0,1].

43/48 proofs discharged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Link each Impl Function to its .fsti file in autoclrs submodule
- Fix GCD, ModExp, ModExpLR, Segments proof status (were proved but
  README still showed admit)
- 43/48 proofs discharged

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
HashTable, Kruskal, Prim, BellmanFord, VertexCover postconditions
are too weak to uniquely determine outputs  genuine completeness
failures detected by the evaluation methodology.

Final: 43/48 , 5/48

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add alternative-input test files that demonstrate completeness is input-dependent
for algorithms with relational (non-functional) postconditions:

- BinarySearch2: sorted [1,3,3,5] key=3  index 1 and 2 both valid
- DFS2: fork graph 01,02  timestamps d[1]=2 or d[1]=4
- TopologicalSort2: fork graph 01,02  [0,1,2] and [0,2,1] valid
- Dijkstra2: diamond graph equal weights  pred[3]=1 or pred[3]=2
- ActivitySelection2: tied finish times  activity 1 or 2 selectable

All tests verify with admit() (structure OK). Completeness lemmas are
expected to be unprovable  that is the point of these tests.

README updated with ᴺ sub-rows in tables and new Input-Dependent
Completeness section explaining the distinction from  failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GCD(15,10)=5, ModExp(3,5,13)=9, ModExpLR(3,5,13)=9, ExtendedGCD(48,18),
Segments(parallel non-intersecting), NaiveStringMatch(shifted pattern),
KMP(different text/pattern), RabinKarp(different input), MaxSubarray([-1,3,-2,5,-1]=6),
BucketSort([5,1,4,2][1,2,4,5]), RadixSort(456).

All proofs discharged with assert_norm.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
InsertionSort([5,1,3][1,3,5]), MergeSort([5,1,3][1,3,5]),
Heapsort([5,1,3][1,3,5]), Quicksort([5,1,3][1,3,5]),
CountingSort([5,1,3][1,3,5]).

All proofs discharged with count-based permutation reasoning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Data structures: Stack2, Queue2, SLL2, DLL2, BST2, BSTArray2, RBTree2
DP: LCS2(x=[1,3,5],y=[3,5,7]2), MatrixChain2, RodCutting2(prices=[2,5,9]9)
Search: BinarySearch3([2,4,6,8,10] key=83), MatrixMultiply2([[2,0],[1,3]][[1,4],[2,1]])
Selection: MinMax2, PartialSelectionSort2, Quickselect2, SimultaneousMinMax2
Greedy: Huffman2, ActivitySelection3(all compatible3)
Geometry: GrahamScan2, JarvisMarch2

All proofs discharged. Verified in WSL.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
shuvendu-lahiri and others added 7 commits March 15, 2026 23:18
union(1,2) then find(1)==find(2)  different pair from original union(0,1).
Proof discharged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
BFS2: chain graph 012, distances [0,1,2]. Proof discharged with
reachability lemmas for 2-hop path.

README: Added '2nd Example' column to all tables showing 38 verified
second completeness examples across all algorithm categories.
5  algorithms have no 2nd example. DFS, TopologicalSort, Dijkstra,
FloydWarshall, MaxFlow 2nd examples pending (complex proofs).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
DFS3 (2-node graph), TopologicalSort3 (4-node chain 0123),
Dijkstra3 (2-node, edge weight 3), FloydWarshall2 (different weights),
MaxFlow2 (different capacity graph).

All proofs discharged. README updated: 43/43 proved algorithms now
have second completeness examples verified.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Each column now uses uniform format: [Test.File.fst](link) /
This generalizes cleanly to additional example columns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Non-deterministic tests are not spec failures  the spec correctly
permits multiple valid outputs. Added TODO to extend methodology
for verifying completeness with non-deterministic outputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The fork graph example demonstrates non-deterministic output (),
not a spec incompleteness failure. Updated section title, markers,
and descriptions consistently.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Keep our changes on conflicts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant