- Support PRO-seq background in addition to PRO-cap background via the
  `bg_genebody` argument to `TSSHMM()`.
- `train()` automatically converges using the forward likelihood instead of
  requiring the user to loop through the full dataset.
- Significantly sped up `train()` by separating out the `encode_obs()` step and
  removing batch processing in `train()`. The separate observations
  `IntegerList` object from `encode_obs()` allows running faster replicate
  tests of randomizing the order of observations fed to the model by `train()`.
- Model objects are now serializable, and therefore support the cache used in
  RMarkdown, etc. without needing to explicitly reinitialize the model.
- Documentation in the vignette has been greatly expanded and also meets the Bioconductor contributor guidelines.
- Example subset data from Core 2014 is provided, along with a reproducible
  script `inst/script/core2014.R` used to generate the R data from the NIH data
  repository.
- Added `TSSHMM()` convenience function instead of using `new("TSSHMM", ...)`.
- TSSHMM-class now performs S4 validation. Due to limitations in the GHMM C
  API, the specific list of failure reasons printed to stderr is not captured
  by the R character vector.
- Simplified `train()` to handle convergence, which eliminates the need to loop
  over the entire dataset multiple times or to visually inspect parameter
  convergence by plotting.
- The `seed` argument to `train()` has been removed following Bioconductor
  guidelines of not calling `set.seed()`. Instead, `set.seed()` may be called
  before running `train()`.
- All training data generated by `create_obs()` is now fully stored in RAM as
  an `IntegerList`. `train()` is CPU bound, but `create_obs()` is slow and
  memory inefficient for large datasets and therefore still uses a batch
  process to limit memory use and to report progress; there is good potential
  for accelerating `create_obs()` using C.
- Increased test coverage from under 50% to 98%.
- Private accessors for TSSHMM-class have been added following Bioconductor's
  contributor guidelines, namely: `transitions()`, `emissions()`,
  `emissions_tied()`, `start()`, and `bg_genebody()`. This also reduced code
  complexity.
- Eliminated fragile `externalptr` in the TSSHMM-class to the C model instance,
  and instead the model is quickly created on demand for each C function call.
  Using `externalptr` somehow leads to horrific bugs where 2 active model
  instances clobber each other's `externalptr` values, such that initializing a
  PRO-seq background model appears to change the number of parameters of an
  already initialized PRO-cap background model; a regression test for this has
  been added to `test-parameters.R`.
- All model building logic has now been moved from the C-layer into R, to
  eliminate needing the fragile `externalptr` TSSHMM-class slot. This
  generalizes the C API for communicating with GHMM and thus makes it possible
  to split this package into a separate R dependency RGHMM for any GHMM model
  construction, training, and inference. The separate package would
  significantly improve build times on systems using the bundled GHMM
  dependency and allow better reuse of GHMM.
- The C API now treats all arguments as read-only, and instead allocates and
  returns any new objects by wrapping them in a `list()` if necessary. This
  safer approach of C functions not modifying R objects in-place was suggested
  by Martin Morgan in private correspondence.
- Fixed `train()` causing an intermittent error from the garbage collector
  "internal logical NA value has been modified" using the above principle of
  treating C arguments as read-only.
- Fixed `replace_strand()` edge case of incorrectly detecting unstranded reads
  and returning unsorted GRanges per suggestions from Hervé Pagès on the
  bioc-devel mailing list.
- Set package LazyData to false following Bioconductor contributor guidelines.
- Added an address sanitizer build option `--enable-asan` for debugging from
  feedback by Henrik Bengtsson on the bioc-devel mailing list. Using this build
  option requires carefully setting `LD_PRELOAD` and `ASAN_OPTIONS`, both of
  which are documented in `configure --help`; otherwise the package will fail
  the "loading from installed location" test phase.
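
Since `train()` no longer takes a `seed` argument, reproducibility of a randomized observation order is the caller's job. A minimal base-R sketch of that idea follows; the toy `obs` list and the `shuffle_obs()` helper are hypothetical stand-ins (a plain list instead of an `IntegerList`), and the hand-off to `train()` is left as a comment because its exact arguments are not stated in these notes.

```r
# Toy stand-in for encoded observations (hypothetical data; a plain list is
# used instead of an IntegerList so the sketch needs only base R).
obs <- list(a = c(0L, 1L, 2L), b = c(2L, 2L, 0L),
            c = c(1L, 0L, 0L), d = c(0L, 0L, 1L))

# Hypothetical helper: seed the RNG in user code, then permute the order.
shuffle_obs <- function(obs, seed) {
  set.seed(seed)            # the seed is set by the caller, not by train()
  obs[sample(length(obs))]  # random permutation of the observation order
}

run1 <- shuffle_obs(obs, seed = 42)
run2 <- shuffle_obs(obs, seed = 42)

# The same seed reproduces the same order, so a replicate can be rerun exactly.
identical(names(run1), names(run2))

# The shuffled observations would then be passed to train(); that call is
# omitted because its signature is an assumption here.
```

Different seeds give different replicate orders, which is the kind of replicate test the separate observations object is meant to make cheap.
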
- Support Baum-Welch training. The trained model can be loaded and saved using
  the `params()` accessor and setter.
- Support genome-wide Viterbi inference and remove the limiting regions input.
- New models are created by the S4 `TSSHMM-class` for training and inference.
- `hmm()` has been renamed to `viterbi()`.
- Added `train()` for model training.
- Added `params()` accessor and setter to load and save the model transition
  and emission matrices.
- `tss()` ties are now broken by strand. In other words, if identical peaks are
  found on the positive strand, the left-most peak is chosen, but on the
  negative strand, the right-most peak is chosen. This feature was added to
  alleviate the weaker signal seen when plotting the CA transcription
  initiation motif of the negative strand compared to the stronger signal seen
  on the positive strand.
- Documentation in the vignette appendix now describes the derivation and
  calculation of the minimum distance between neighboring promoter regions.
  This distance is necessary to flank reads so that Viterbi can be run
  genome-wide without exhausting RAM with the dense genomic windows.
- Documentation in the vignette appendix now includes `sessionInfo()`.
- Generating large GRanges of reversed windows has been significantly sped up
  from 52 minutes down to 2 seconds by hacking the GRanges and IRanges `tile()`
  core with a drop-in replacement called `tile_with_rev()`. This is effectively
  a one-line code change to the core Bioconductor functions which will be
  upstreamed.
- All C functions are now documented using doxygen markup, and C documentation
  consistency is also checked by continuous integration.
- HMM computation is now handled using the published GHMM library. This was necessary for complex requirements of Baum-Welch training of the model such as tying emission states. A system installed GHMM library is automatically preferred; if no system GHMM is detected and no GHMM_ROOT to a prefix installation is supplied, the fallback bundled GHMM dependency is instead patched, compiled and installed.
- C-level tests for GHMM using multiple models are in a separate repository
  gitlab.com/omsai/tssghmm/
- Build system has migrated from `Makevars` to autotools `configure.ac` and
  `src/Makefile.am` to manage the GHMM dependency, and run recursive `make`
  when using the bundled GHMM library.
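
The strand-aware tie-breaking described for `tss()` can be illustrated with a small base-R sketch. The `pick_peak()` helper below is hypothetical and is not the package's C implementation; it only shows the rule of taking the left-most tied peak on the positive strand and the right-most on the negative strand.

```r
# Hypothetical helper, not the package's implementation: return the position
# of the maximum score, breaking ties according to strand.
pick_peak <- function(pos, score, strand = c("+", "-")) {
  strand <- match.arg(strand)
  tied <- pos[score == max(score)]         # all positions sharing the top score
  if (strand == "+") min(tied) else max(tied)
}

pos   <- c(100L, 110L, 120L, 130L)
score <- c(5, 9, 9, 2)                     # positions 110 and 120 are tied

pick_peak(pos, score, "+")                 # left-most of the tied peaks
pick_peak(pos, score, "-")                 # right-most of the tied peaks
```

Both choices correspond to the same end of the tie in transcript orientation, which is why the rule is mirrored between strands.
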
- `hmm()` now returns a metadata column containing all hidden states as an
  `IntegerList()` to inform whether the region is a peaked or non-peaked
  promoter, with the state of each window to aid visualization of the model
  behavior output track against the input sequencing data track.
- The timing and intermediate steps of `hmm()` can be inspected by importing
  the `futile.logger` library and setting `flog.threshold(DEBUG)`.
- `hmm()` on negative strand data no longer suffers from a large speed penalty.
  Previously, calculations performed on the negative strand were extremely
  inefficient due to a single `endoapply(..., rev)` call to reverse windows.
  This call has been eliminated by instead patching the `tile()` methods used
  by `GRanges` and `IRanges` to directly produce reversed windows.
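
The difference between reversing windows after the fact and producing them already reversed can be sketched in base R. The `window_starts()` helper is hypothetical and works on plain integers rather than `GRanges`; it is only meant to show that both routes give the same windows while the second avoids a per-region `rev()` pass.

```r
# Hypothetical helper on plain integers (the package patches tile() instead):
# start coordinates of fixed-width windows, optionally emitted in reverse.
window_starts <- function(start, end, width, reverse = FALSE) {
  n <- (end - start + 1L) %/% width          # number of full windows
  stopifnot(n >= 1L)
  idx <- if (reverse) (n - 1L):0L else 0L:(n - 1L)
  start + idx * width
}

regions <- data.frame(start = c(1L, 101L), end = c(30L, 140L))

# Old approach: build forward windows, then reverse every list element.
slow <- lapply(Map(window_starts, regions$start, regions$end, 10L), rev)

# New approach: generate the windows in reversed order directly.
fast <- Map(window_starts, regions$start, regions$end, 10L, TRUE)

identical(slow, fast)
```
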
- `viterbi()` has been vectorized at the R layer by allowing for a list of
  regions to be provided, resulting in about a 4x speedup.
- `viterbi()` has been further sped up using OpenMP multithreading at the C
  layer. The speed of `viterbi()` is now comparable to the speed of `tss()`.
- Cleaned up `devtools::check()` warnings and notes.
- `hmm()` and `tss()` now have documentation and examples.
- `hmm()` now returns `GRanges` instead of strand split `GRangesList`.
- Require sorted signal input to `tss()`, as required by the underlying C
  algorithm.
- Support both strands for `hmm()` and `tss()`.
- Negative signal and background scores are no longer supported.
- Unstranded signal and background ranges are no longer supported.
- Remove the fragile starting state assumption of the published HMM by instead considering all starting states as equally likely.
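
To illustrate what treating all starting states as equally likely means for decoding, here is a minimal log-space Viterbi sketch in base R. The two-state toy model, its probabilities, and the `viterbi_path()` helper are all hypothetical and unrelated to the package's C code; the fixed-start call is included only as a contrast.

```r
# Toy log-space Viterbi for a discrete HMM (hypothetical sketch, base R only).
# The default start distribution is uniform over the hidden states.
viterbi_path <- function(obs, trans, emis,
                         start = rep(1 / nrow(trans), nrow(trans))) {
  n <- nrow(trans)
  len <- length(obs)
  score <- matrix(-Inf, n, len)   # best log-probability ending in each state
  back <- matrix(0L, n, len)      # back-pointers to the previous state
  score[, 1] <- log(start) + log(emis[, obs[1]])
  for (t in seq_len(len)[-1]) {
    for (j in seq_len(n)) {
      cand <- score[, t - 1] + log(trans[, j])
      back[j, t] <- which.max(cand)
      score[j, t] <- max(cand) + log(emis[j, obs[t]])
    }
  }
  path <- integer(len)
  path[len] <- which.max(score[, len])
  for (t in rev(seq_len(len)[-1])) path[t - 1] <- back[path[t], t]
  path
}

# State 1 = background, state 2 = peak; symbol 1 = low, symbol 2 = high signal.
trans <- matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
emis  <- matrix(c(0.9, 0.1,
                  0.1, 0.9), nrow = 2, byrow = TRUE)
obs   <- c(2L, 2L, 2L, 1L, 1L, 1L)

uniform <- viterbi_path(obs, trans, emis)                   # equal start weights
fixed   <- viterbi_path(obs, trans, emis, start = c(1, 0))  # forced to state 1
```

With the uniform start, the decoded path may begin in the peak state when the data begin with high signal, whereas a fixed start forces the first window into one state regardless of the data.
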
- Added `tss()` peak detection accelerated by C API.
- Added `hmm()` high level GRanges interface to `viterbi()`.
- Added package vignette illustrating the model.
- Package passes BiocCheck without any errors.
- Added `viterbi()` decoding accelerated by C API.
- Added GitLab continuous integration to regularly install and run
  `R CMD check` using scripts in the `tools/` directory.
- C-level tests for `viterbi()` using multiple models are in a separate
  repository gitlab.com/omsai/viterbi/