mzIdentML-polars

A fast Rust-based writer for mzIdentML 1.3 files using Polars DataFrames as input. This project simplifies the generation of standard-compliant proteomics identification files, with built-in support for:

Polars Integration: Directly write mzIdentML from high-performance DataFrames.
ProForma v2: Support for standard peptide sequence notation (e.g., PEPT[Unimod:35]IDEK).
Crosslinking: Native encoding for crosslinked peptide matches (CSMs).
mzIdentML 1.3.0 Compliance: Generates valid XML according to the latest PSI-PI standards.

Installation

You can install the Python bindings directly from the source using maturin:

# Clone the repository
git clone https://github.com/Rappsilber-Laboratory/mzIdentML-polars.git
cd mzIdentML-polars

# Install via pipenv (requires a Rust toolchain and maturin)
pipenv install
pipenv run maturin develop

Usage

The primary functions are write_mzidentml (for file output) and serialize_mzidentml (for string output). Both take Polars DataFrames and a dictionary for metadata. Note that write_mzidentml takes the output path as its first argument.

Writing to a File (Recommended)

This method is memory-efficient because it streams the XML directly to disk. If the filename ends in .gz, the output is Gzip-compressed automatically.

import polars as pl
import mzidentml_polars

# ... define DataFrames ...

# Generate mzIdentML directly to a file
mzidentml_polars.write_mzidentml("output.mzid", csms, prot_seqs, spectra, metadata)

# Automatic Gzip compression
mzidentml_polars.write_mzidentml("output.mzid.gz", csms, prot_seqs, spectra, metadata)
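
To confirm that a compressed output is readable, the file can be opened with Python's standard `gzip` module and parsed as XML. The sketch below is illustrative only: it writes a tiny stand-in document instead of calling `write_mzidentml`, and the helper name `root_tag_of` is hypothetical, not part of this library.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def root_tag_of(path):
    """Return the root element name of an XML file, decompressing .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        root = ET.parse(handle).getroot()
    # Drop a leading "{namespace}" prefix if the document declares one.
    return root.tag.rsplit("}", 1)[-1]


# Stand-in file for demonstration; real output would come from write_mzidentml.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.mzid.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(b"<MzIdentML><SequenceCollection/></MzIdentML>")

print(root_tag_of(demo_path))
```

The same helper works on an uncompressed `.mzid` file, because it only switches to `gzip.open` when the path ends in `.gz`.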

Serializing to a String

# Generate mzIdentML as a string (if needed for further processing)
xml_string = mzidentml_polars.serialize_mzidentml(csms, prot_seqs, spectra, metadata)
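
As one example of further processing, the returned string can be parsed with the standard library to count elements by name. This is a minimal sketch: the `xml_string` below is a hand-written stand-in, not real library output, and `count_elements` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_string):
    """Count elements by local name, ignoring any XML namespace."""
    root = ET.fromstring(xml_string)
    return Counter(elem.tag.rsplit("}", 1)[-1] for elem in root.iter())


# Stand-in string; in practice this would come from serialize_mzidentml.
xml_string = (
    "<MzIdentML>"
    "<SequenceCollection><PeptideEvidence/><PeptideEvidence/></SequenceCollection>"
    "</MzIdentML>"
)
counts = count_elements(xml_string)
print(counts["PeptideEvidence"])
```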

Testing

The project includes a comprehensive test suite using pytest that validates output against official mzIdentML XML schemas.

Prerequisites

pip install pytest lxml

Running Tests

pipenv run pytest tests/

Troubleshooting

`TypeError: ... compat_level has invalid type: 'int'`

If you see this error, it indicates a version mismatch between your Python polars and the polars Rust crate used during compilation.

The build process now automatically synchronizes these versions by updating pyproject.toml based on Cargo.toml. If you encounter this after manual dependency changes, simply rebuild the project:

pipenv run maturin develop

This will ensure your Python environment matches the compiled extension's expected ABI.
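
When diagnosing a mismatch, it can help to first check which polars version is installed in the active Python environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper. The Python package and the Rust crate are versioned separately, so the number printed here is a starting point for comparison with the constraint in pyproject.toml, not a direct match for the crate version in Cargo.toml.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("polars:", installed_version("polars"))
```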

`No module named 'pyarrow'`

pyo3-polars may require pyarrow for internal data conversions:

pip install pyarrow

Input Schemas

`prot_seqs` (DataFrame)

| Column | Type | Description |
| --- | --- | --- |
| `protein_id` | String | Unique internal ID for the protein |
| `accession` | String | Public accession (e.g., UniProt) |
| `protein_name` | String | Optional. Descriptive name for the protein (e.g., `BIPA_BACSU`) |
| `sequence` | String | Full amino acid sequence |
| `is_decoy` | Boolean | Whether the protein is a decoy (default: `false`) |

`csms` (DataFrame)

| Column | Type | Description |
| --- | --- | --- |
| `spectrum_id` | String | ID of the spectrum (e.g., `index=1` or `scan=123`) |
| `file_path` | String | Path to the source file to resolve duplicate IDs across files. |
| `peptide1_seq` | String | ProForma v2 sequence of the first peptide |
| `protein1_id` | String / List[Str] | ID matching `prot_seqs` |
| `peptide1_start` | UInt32 / List[U32] | Start position in protein (1-based) |
| `peptide1_end` | UInt32 / List[U32] | End position in protein (1-based) |
| `charge` | Int32 | Precursor charge state |
| `rank` | UInt32 | Identification rank (1 = top match) |
| `is_crosslink` | Boolean | Whether this is a crosslink match |
| `is_looplink` | Boolean | Whether this is a looplink match |
| `experimental_mz` | Float64 | Recommended. Observed precursor m/z |
| `calculated_mz` | Float64 | Recommended. Theoretical precursor m/z |
| `score` | Float64 | Recommended. Primary search engine score |
| `peptide1_link_pos` | Int32 | 1-based link position on peptide 1 |
| `peptide2_link_pos` | Int32 | 1-based link position on peptide 2 (or site 2 for looplink) |
| `peptide2_seq` | String | (Crosslink only) Second peptide sequence |
| `protein2_id` | String / List[Str] | (Crosslink only) Second protein ID |
| `peptide2_start` | UInt32 / List[U32] | (Crosslink only) Start position |
| `peptide2_end` | UInt32 / List[U32] | (Crosslink only) End position |
| `crosslinker_name` | String | Recommended. Name of the crosslinker (e.g., `DSSO`) |
| `crosslinker_accession` | String | Recommended. CV accession of the crosslinker (e.g., `MS:1003124`) |
| `crosslinker_mass` | Float64 | Recommended. Mass of the crosslinker |

`metadata` (Dictionary)

| Key | Type | Description |
| --- | --- | --- |
| `software_name` | String | Name of the analysis software (default: `mzidentml-polars`) |
| `software_version` | String | Version of the software |
| `author` | String | Name of the primary researcher/author |
| `is_ppm` | Boolean | Whether tolerances are in PPM (default: `True`) |
| `parent_plus` | Float | Precursor tolerance upper bound |
| `parent_minus` | Float | Precursor tolerance lower bound |
| `frag_plus` | Float | Fragment tolerance upper bound |
| `frag_minus` | Float | Fragment tolerance lower bound |
| `enzymes` | List[Dict] | Enzymes used: `[{"name": "Trypsin", "accession": "MS:1001251"}]` |
| `modifications` | List[Dict] | Search mods: `[{"fixed": True, "mass": 57.02, "residues": "C", "name": "Carbamidomethyl", "accession": "UNIMOD:4"}]` |
| `search_params` | List[Dict] | Additional parameters: `[{"name": "xi:min_peptide_length", "accession": "MS:1002543", "value": "6"}]` |
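
Putting the keys above together, a complete `metadata` dictionary might look like the sketch below. The version, author, and tolerance values are made up for illustration; the enzyme, modification, and search parameter entries are the examples from the table. Booleans are written as Python `True`/`False`. The `DOCUMENTED_KEYS` check is a hypothetical convenience for catching misspelled keys, not something the library is known to provide.

```python
# Illustrative values only; adjust them to match the actual search settings.
metadata = {
    "software_name": "mzidentml-polars",
    "software_version": "0.1.0",
    "author": "Jane Doe",
    "is_ppm": True,
    "parent_plus": 10.0,
    "parent_minus": 10.0,
    "frag_plus": 20.0,
    "frag_minus": 20.0,
    "enzymes": [{"name": "Trypsin", "accession": "MS:1001251"}],
    "modifications": [
        {
            "fixed": True,
            "mass": 57.02,
            "residues": "C",
            "name": "Carbamidomethyl",
            "accession": "UNIMOD:4",
        }
    ],
    "search_params": [
        {"name": "xi:min_peptide_length", "accession": "MS:1002543", "value": "6"}
    ],
}

# Keys listed in the table above; anything else is probably a typo.
DOCUMENTED_KEYS = {
    "software_name", "software_version", "author", "is_ppm",
    "parent_plus", "parent_minus", "frag_plus", "frag_minus",
    "enzymes", "modifications", "search_params",
}
unknown_keys = set(metadata) - DOCUMENTED_KEYS
print(sorted(unknown_keys))
```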

Protein Ambiguity

If a peptide sequence maps to multiple proteins, you can encode this using Polars List columns in the csms DataFrame. For each mapped protein, provide the corresponding ID, start, and end positions in the lists. The library will generate multiple <PeptideEvidence> entries for that match.

csms = pl.DataFrame({
    "protein1_id": [["PROT_A", "PROT_B"], ["PROT_C"]],
    "peptide1_start": [[1, 50], [10]],
    "peptide1_end": [[10, 60], [20]],
    # ... other columns
})
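
Because each row's ID, start, and end lists must line up position by position, a quick length check before writing can catch mistakes early. The sketch below works on plain Python lists; `check_ambiguity_lengths` is a hypothetical helper, and whether the library performs a similar check itself is not stated here.

```python
def check_ambiguity_lengths(ids, starts, ends):
    """Return indices of rows whose ID, start, and end lists differ in length."""
    bad_rows = []
    for i, (row_ids, row_starts, row_ends) in enumerate(zip(ids, starts, ends)):
        if not (len(row_ids) == len(row_starts) == len(row_ends)):
            bad_rows.append(i)
    return bad_rows


# Same values as the DataFrame example above.
protein1_id = [["PROT_A", "PROT_B"], ["PROT_C"]]
peptide1_start = [[1, 50], [10]]
peptide1_end = [[10, 60], [20]]

print(check_ambiguity_lengths(protein1_id, peptide1_start, peptide1_end))
```

An empty result means every row is consistent; any index returned points at a row where one list is longer or shorter than the others.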

Development & Releases

Version Management

This project uses Git tags as the single source of truth for versioning.

Python: Managed by setuptools_scm. The version is automatically derived from the latest Git tag (e.g., v0.1.0). If no tag is present, it defaults to a .dev version.
Rust: The version is hardcoded in Cargo.toml. To ensure consistency, always use cargo-release to bump versions.

Bumping the Version

To create a new release (e.g., moving from 0.1.0 to 0.2.0):

Install cargo-release:
```
cargo install cargo-release
```
Verify changes (Dry Run):
```
cargo release minor --no-publish
```
Perform the Release:
```
# Bumps version, commits, tags, and pushes
cargo release minor --execute --no-publish
```
Note: --no-publish skips publishing to crates.io; the GitHub Action handles PyPI publishing via tags.
CI/CD: The GitHub Action (.github/workflows/pypi.yml) will automatically trigger on the new tag and publish the updated wheels to PyPI.

Syncing Polars

If you change the polars version in Cargo.toml, the build script (build.rs) will automatically run sync_polars.py to update the constraints in pyproject.toml.

License

This project is licensed under the Apache-2.0 License.

TODO

Implement basic protein grouping (ProteinDetectionList) support, even as a simple 1-to-1 mapping if full inference isn't required.
Write the passThreshold column.
