bin_refinement.sh copies non-FASTA files (e.g., .txt) into output directory, causing CheckM errors #589

@XREAL1TY

Hi, thank you for developing metaWRAP — it has been incredibly helpful for my metagenomics workflow!

Bug Description

In bin_refinement.sh, the file-copying loop for input bins only filters by file size but does not filter by file extension. This causes non-FASTA files (e.g., bin.BinMembers.txt generated by some binning tools) to be copied into the output directory (e.g., binsA/), which subsequently leads to CheckM errors.

Relevant Code

https://github.com/bxlab/metaWRAP/blob/master/bin/metawrap-modules/bin_refinement.sh

for F in ${bins1}/*; do
    SIZE=$(stat -c%s "$F")
    if (( $SIZE > 50000)) && (( $SIZE < 20000000)); then 
        BASE=${F##*/}
        cp $F ${out}/binsA/${BASE%.*}.fa
    fi
done

The loop iterates over all files in the input directory. Any non-FASTA file that happens to fall within the 50KB–20MB size range will be copied and renamed with a .fa extension, corrupting the input for CheckM.
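
To make the failure concrete, here is a minimal reproduction sketch on throwaway data. The temporary directory layout and the dummy file contents are assumptions made for illustration; only the loop body comes from the snippet above (with variables quoted). It assumes GNU `stat`, as the original does.

```shell
#!/usr/bin/env bash
# Reproduction sketch; all paths below are hypothetical scratch locations.
work=$(mktemp -d)
bins1="$work/bins1"
out="$work/out"
mkdir -p "$bins1" "$out/binsA"

# A ~60 kB text file standing in for bin.BinMembers.txt (not FASTA).
head -c 60000 /dev/zero | tr '\0' 'x' > "$bins1/bin.BinMembers.txt"

# The size-only loop as quoted above.
for F in "${bins1}"/*; do
    SIZE=$(stat -c%s "$F")   # GNU stat: file size in bytes
    if (( SIZE > 50000 )) && (( SIZE < 20000000 )); then
        BASE=${F##*/}
        cp "$F" "${out}/binsA/${BASE%.*}.fa"
    fi
done

ls "$out/binsA"   # lists bin.BinMembers.fa
```

The text file passes the size check, loses its `.txt` suffix, and lands in binsA/ as `bin.BinMembers.fa`, so CheckM is handed it as if it were a bin.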

Suggested Fix

Add a file extension filter to only process FASTA files. For example:

# Only consider files with a FASTA extension
for F in "${bins1}"/*.fa "${bins1}"/*.fasta; do
    # Skip the literal pattern when a glob matches nothing
    [[ -e "$F" ]] || continue
    SIZE=$(stat -c%s "$F")
    if (( SIZE > 50000 )) && (( SIZE < 20000000 )); then
        BASE=${F##*/}
        cp "$F" "${out}/binsA/${BASE%.*}.fa"
    else
        echo "Skipping $F because the bin size is not between 50kb and 20Mb"
    fi
done

The same fix should also be applied to the corresponding blocks for binsB and binsC.
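
One way to avoid repeating the loop three times would be a small helper function. The sketch below is only an illustration: `copy_fasta_bins` is a hypothetical name that does not exist in metaWRAP, and the `bins2`/`bins3` variable names in the usage comment are assumed by analogy with `bins1`. It uses `wc -c` instead of `stat -c%s` to read the file size, which is a substitution for portability rather than what the script currently does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of metaWRAP): copy only FASTA bins whose size
# is inside the accepted window, normalising the extension to .fa.
copy_fasta_bins() {
    local src=$1 dest=$2 F SIZE BASE
    mkdir -p "$dest"
    for F in "$src"/*.fa "$src"/*.fasta; do
        [[ -e "$F" ]] || continue        # glob matched nothing
        SIZE=$(wc -c < "$F")             # size in bytes (stand-in for stat -c%s)
        if (( SIZE > 50000 && SIZE < 20000000 )); then
            BASE=${F##*/}
            cp "$F" "$dest/${BASE%.*}.fa"
        else
            echo "Skipping $F because the bin size is not between 50kb and 20Mb"
        fi
    done
}

# Demo on throwaway data (hypothetical paths and dummy contents).
demo=$(mktemp -d)
mkdir -p "$demo/bins1"
head -c 60000 /dev/zero > "$demo/bins1/bin.1.fa"
head -c 60000 /dev/zero > "$demo/bins1/bin.2.fasta"
head -c 60000 /dev/zero > "$demo/bins1/bin.BinMembers.txt"   # wrong extension
head -c 100   /dev/zero > "$demo/bins1/bin.3.fa"             # too small
copy_fasta_bins "$demo/bins1" "$demo/out/binsA"

# In the script itself the calls might look like this (bins2/bins3 assumed):
#   copy_fasta_bins "$bins1" "${out}/binsA"
#   copy_fasta_bins "$bins2" "${out}/binsB"
#   copy_fasta_bins "$bins3" "${out}/binsC"
```

In the demo, only `bin.1.fa` and `bin.2.fa` end up in binsA/; the `.txt` file is never considered and the undersized bin is reported as skipped. Note that `x.fa` and `x.fasta` in the same input directory would map to the same output name, so a real patch may want to guard against that.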


Thank you for your time, and I'd be happy to submit a pull request if that would be helpful!
