Hi, thank you for developing metaWRAP — it has been incredibly helpful for my metagenomics workflow!
Bug Description
In bin_refinement.sh, the file-copying loop for input bins only filters by file size but does not filter by file extension. This causes non-FASTA files (e.g., bin.BinMembers.txt generated by some binning tools) to be copied into the output directory (e.g., binsA/), which subsequently leads to CheckM errors.
Relevant Code
https://github.com/bxlab/metaWRAP/blob/master/bin/metawrap-modules/bin_refinement.sh
for F in ${bins1}/*; do
    SIZE=$(stat -c%s "$F")
    if (( $SIZE > 50000)) && (( $SIZE < 20000000)); then
        BASE=${F##*/}
        cp $F ${out}/binsA/${BASE%.*}.fa
    fi
done
The loop iterates over all files in the input directory. Any non-FASTA file that happens to fall within the 50KB–20MB size range will be copied and renamed with a .fa extension, corrupting the input for CheckM.
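To make the renaming step concrete, here is a minimal sketch of what the two parameter expansions in the loop do to a non-FASTA file name. The path is hypothetical and chosen only for illustration.

```shell
# Hypothetical path, used only to show the renaming step from the loop.
F="/data/bins/bin.BinMembers.txt"
BASE=${F##*/}            # strip the directory part -> bin.BinMembers.txt
RENAMED="${BASE%.*}.fa"  # drop the last extension and append .fa
echo "$RENAMED"          # prints bin.BinMembers.fa
```

The resulting file looks like a FASTA bin by name, so nothing downstream can tell it apart from a real one.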
Suggested Fix
Add a file extension filter to only process FASTA files. For example:
for F in "${bins1}"/*.fa "${bins1}"/*.fasta; do
    # A pattern that matches nothing is left as a literal string, so skip it
    [[ -e "$F" ]] || continue
    SIZE=$(stat -c%s "$F")
    if (( SIZE > 50000 && SIZE < 20000000 )); then
        BASE=${F##*/}
        cp "$F" "${out}/binsA/${BASE%.*}.fa"
    else
        echo "Skipping $F because the bin size is not between 50kb and 20Mb"
    fi
done
The same fix should also be applied to the corresponding blocks for binsB and binsC.
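Since the same loop is repeated for each input set, one option might be to factor the filter into a small helper. The sketch below is only a suggestion: `copy_fasta_bins` is a hypothetical name that does not exist in metaWRAP, and it uses `wc -c` in place of `stat -c%s` purely so the example also runs where GNU `stat` is unavailable.

```shell
# copy_fasta_bins SRC DST
# Hypothetical helper (not part of metaWRAP): copy only .fa/.fasta files
# between 50 kB and 20 MB from SRC into DST, renaming them to .fa.
copy_fasta_bins() {
    local src=$1 dst=$2 f size base
    mkdir -p "$dst"
    for f in "$src"/*.fa "$src"/*.fasta; do
        # A pattern that matches nothing stays a literal string; skip it.
        [ -f "$f" ] || continue
        # Byte count via wc; the arithmetic expansion trims any padding.
        size=$(( $(wc -c < "$f") ))
        if [ "$size" -gt 50000 ] && [ "$size" -lt 20000000 ]; then
            base=${f##*/}
            cp "$f" "$dst/${base%.*}.fa"
        else
            echo "Skipping $f because the bin size is not between 50kb and 20Mb"
        fi
    done
}
```

It could then be called once per input set, for example `copy_fasta_bins "${bins1}" "${out}/binsA"`, and likewise for binsB and binsC with whatever variables hold those input directories. One caveat of any extension-based filter: `x.fa` and `x.fasta` in the same directory would both map to `x.fa`, so the second copy would overwrite the first.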
Thank you for your time, and I'd be happy to submit a pull request if that would be helpful!