This is proof recording code for Lean 3 associated with the paper Proof Artifact Co-training for Theorem Proving with Language Models.
Have elan and leanproject installed. The scripts require Python 3.7+.
python3 -m venv venv # new virtual env
source ./venv/bin/activate
pip install . # or pip install -e . for editable modeOnly tested on mac and linux.
- Make an empty directory to store the data.
- Run
bash ./scripts/run_all.sh $DATA_DIR.
This will take from many minutes to many hours based on the available
parallelization. The data needed to train a language model can be found in
$DATA_DIR/cleaned_training_data.
This will go through the steps of the master script and describe the various types of recorded data, which may be of interest to various researchers. Only tested on mac and linux.
This tool works by modifying lean code so that it traces proof step information. It does this with a combination of python scripts and some lean hacks. No files outside of this directory are modified, so you don't have to worry about messing up your lean install.
Run this command to build lean, and populate _target with both mathlib and lean/library code.
It also makes some unstable modifications to leanpkg.path.
python3 -m lean_proof_recording.pipeline.refreshRun this command to modify the lean and mathlib files. (Note, the modifications are only to files in
_target and can be undone with the refresh command above. Use the flag --dryrun to see what will
be modified without modifying it.)
python3 -m lean_proof_recording.pipeline.insert_proof_recording_code(Optional) It's helpful to compile the base lean to make sure the modifications worked. No trace code will show since certain options are not set.
lean --make src/empty.leanBuild all.lean for mathlib.
bash _target/deps/mathlib/scripts/mk_all.shThis command generates data. This is very time consuming (many hours). The data directory must exist
and is encouraged to be empty. Replace src/empty.lean with _target/deps/mathlib/src/all.lean to
trace all of mathlib.
python3 -m lean_proof_recording.pipeline.run_lean_and_save_results src/empty.lean <path/to/data/directory>After this completes, the data directory (which you supplied above) will have the following structure.
lean_filesA copy of all the (modified) lean files. These will be used to extract other information in the next steps.leanpkg.tomlA copy of this file which records the version of lean and mathlib for reproducibility.path_map.jsonInformation connecting the paths which are in the trace data to the new paths in this dataset.lean_stdout.logThe traced results of the run. Sinceleanwas run with--jsonall lean outputs are in json form. Later steps will parse this data.lean_stderr.logAny errors (but not regular lean compile errors.) Check to make sure this file is empty.
Run this command to extract the traced data into a series of tables.
python3 -m lean_proof_recording.pipeline.extract_trace_data <path/to/data/directory>This will add the following to the data directory:
lean_errors.jsonlThis includes all Lean messages from the run which are not informational traces. This includes any Lean compiliation errors. Check that there are no messages.raw_traced_dataThis is a collection of relational tables containing the traced data (along with the filename and position of the trace). You can load the files into apandasdataframe as follows:
pd.read_json("path/to/file.jsonl", orient="records", lines=True)(Further cleaning may be neccessary to properly handle missing values.) Each row (within a table) is uniquely keyed by the "filename" and "key" columns.
Run this command to parse the Lean code proofs. This will use the traced data to guide a custom (and ad-hoc) Lean code parser to parse all tactic proofs into their respective tactics. It will also parse the tactic arguments.
python3 -m lean_proof_recording.pipeline.extract_proof_data <path/to/data/directory>This script will add a directory extacted_proof_data to the data directory. It contains the
following jsonl files:
proof_trees.jsonlThis includes JSON ASTs for every proof. (Note, this doesn't include term proofs, and for various reasons a Lean theorem can be have multiple associated proofs. Roughly speaking, but with some exceptions, a tactic proof is one which is started by thebyorbeginkeywords.)proof.jsonlA 2D table of data for each proof. Use thekeycolumn to relate to the other data structures.tactic.jsonlA 2D table of data for each tactic command. Use thekeycolumn to relate to the other data structures.args.jsonlA 2D table of data for each tactic command argument. Use thekeycolumn to relate to the other data structures.
This file step is to condense the currently extracted data into training data suitable for training a language model.
python3 -m lean_proof_recording.pipeline.extract_training_testing_data <path/to/data/directory>This will filter and clean the data slighlty as well as split the data into training (92%), validation (4%), and testing (4%). The split is deterministic based on a hash of the declaration name, allowing intercompatibility with other sources of lean data and other lean environments.
For ease of understanding the data, an example notebook is included.
Edit the leanpkg.toml file to change the version of lean and mathlib. This
is really important to maintain exact compatability with other projects. (Double
check after running the refresh.py script above that the versions are the
same as you set them.)
The information traced by these tools may not be the information that another user needs.
For that reason, these tools make it easy to add custom tracing code. The tracing code is stored in
lean_modifications. Feel free to add additional traces (following the examples there). The best
places to add new code are where it is labeled "BEGIN/END CUSTOMIZABLE CODE".
Some additional ideas for tracing:
- Trace goal, hypothesis, or parameter expressions into another form, e.g. as s-expressions (or other abstract syntax tree) or using the pp.all=true form
- Trace type information about expressions or their parts. For example, knowing is an expression is of type Prop or not
- Trace simp lemmas available in the environment
All of these are fairly easy to add, but may slow down the time to trace all of mathlib.