tsshmm 0.8.0 (2021-10-18)

New features

  • Support PRO-seq background in addition to PRO-cap background via the bg_genebody argument to TSSHMM().
  • train() automatically converges using the forward likelihood instead of requiring the user to loop through the full dataset.
  • Significantly sped up train() by separating out the encode_obs() step and removing batch processing from train(). The separate observations IntegerList object returned by encode_obs() makes it faster to run replicate tests that randomize the order of observations fed to the model by train().
  • Model objects are now serializable, and therefore support the cache used in R Markdown, etc., without needing to explicitly reinitialize the model.

Significant user-visible changes

  • Documentation in the vignette has been greatly expanded and also meets the Bioconductor contributor guidelines.
  • Example subset data from Core 2014 is provided, along with a reproducible script inst/script/core2014.R used to generate the R data from the NIH data repository.
  • Added the TSSHMM() convenience function to use instead of new("TSSHMM", ...).
  • TSSHMM-class now performs S4 validation. Due to limitations in the GHMM C API, the specific list of failure reasons printed to stderr is not captured in the R character vector.
  • Simplified train() to handle convergence, which eliminates the need to loop over the entire dataset multiple times or to visually inspect parameter convergence by plotting.
  • The seed argument to train() has been removed, following the Bioconductor guideline of not calling set.seed() inside package code. Instead, call set.seed() before running train().
  • All training data generated by create_obs() is now fully stored in RAM as an IntegerList. train() is CPU bound, but create_obs() is slow and memory inefficient for large datasets and therefore still uses a batch process to limit memory use and to report progress; there is good potential for accelerating create_obs() using C.
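
As a rough illustration of the likelihood-based convergence that train() now handles, the sketch below repeats a training pass until the forward log-likelihood stops changing by more than a tolerance. It is plain C and does not call GHMM. The names step_fn, train_until_converged, toy_model and toy_step, as well as the tolerance and iteration cap, are hypothetical and are not taken from the package or from GHMM, whose actual stopping rule may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs one re-estimation pass over the observations
 * and returns the forward log-likelihood under the updated model. */
typedef double (*step_fn)(void *model);

/* Repeat training passes until the log-likelihood changes by less than
 * `tol` between consecutive passes, or until `max_iter` passes have run.
 * Stores the last log-likelihood in `final_loglik` and returns the number
 * of passes performed. */
size_t train_until_converged(step_fn step, void *model, double tol,
                             size_t max_iter, double *final_loglik)
{
    assert(step != NULL && final_loglik != NULL && max_iter > 0);
    double prev = step(model);
    size_t n = 1;
    while (n < max_iter) {
        double curr = step(model);
        ++n;
        double delta = curr > prev ? curr - prev : prev - curr;
        prev = curr;
        if (delta < tol)
            break;
    }
    *final_loglik = prev;
    return n;
}

/* Toy stand-in for a real model (an assumption for demonstration only):
 * each pass improves the log-likelihood by a gain that halves every time. */
struct toy_model {
    double loglik;
    double gain;
};

double toy_step(void *model)
{
    struct toy_model *m = model;
    m->loglik += m->gain;
    m->gain /= 2.0;
    return m->loglik;
}
```

With a real model, the callback would wrap one Baum-Welch pass. The iteration cap guards against a likelihood that oscillates and never settles below the tolerance.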

Bug fixes and improvements

  • Increased test coverage from under 50% to 98%.
  • Private accessors for TSSHMM-class have been added following Bioconductor's contributor guidelines, namely: transitions(), emissions(), emissions_tied(), start(), and bg_genebody(). This also reduced code complexity.
  • Eliminated the fragile externalptr in the TSSHMM-class to the C model instance; instead, the model is quickly created on demand for each C function call. Using externalptr led to severe bugs in which two active model instances clobbered each other's externalptr values, so that initializing a PRO-seq background model appeared to change the number of parameters of an already initialized PRO-cap background model. A regression test for this has been added to test-parameters.R.
  • All model building logic has now been moved from the C layer into R, to eliminate the need for the fragile externalptr TSSHMM-class slot. This generalizes the C API for communicating with GHMM and thus makes it possible to split this package into a separate R dependency RGHMM for any GHMM model construction, training, and inference. The separate package would significantly improve build times on systems using the bundled GHMM dependency and allow better reuse of GHMM.
  • The C API now treats all arguments as read-only, and instead allocates and returns any new objects by wrapping them in a list() if necessary. This safer approach of C functions not modifying R objects in-place was suggested by Martin Morgan in private correspondence.
  • Fixed train() causing an intermittent error from the garbage collector "internal logical NA value has been modified" using the above principle of treating C arguments as read-only.
  • Fixed replace_strand() edge case of incorrectly detecting unstranded reads and returning unsorted GRanges, per suggestions from Hervé Pagès on the bioc-devel mailing list.
  • Set package LazyData to false following Bioconductor contributor guidelines.
  • Added an address sanitizer build option --enable-asan for debugging, from feedback by Henrik Bengtsson on the bioc-devel mailing list. Using this build option requires carefully setting LD_PRELOAD and ASAN_OPTIONS, both of which are documented in configure --help; otherwise the package will fail the "loading from installed location" test phase.
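
The read-only convention for C arguments mentioned in this list can be sketched in plain C, without the R headers: the input is declared const and the result is returned in freshly allocated memory owned by the caller. The function name normalized_copy and the normalization task are illustrative assumptions, not part of the package's C API; in the package itself the inputs and outputs are R objects.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Return a newly allocated copy of `row` scaled so its entries sum to 1.
 * The input is declared const and is never written to.  Returns NULL if
 * allocation fails or the row sums to zero; the caller owns the result
 * and must free() it. */
double *normalized_copy(const double *row, size_t n)
{
    assert(row != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += row[i];
    if (sum == 0.0)
        return NULL;
    double *out = malloc(n * sizeof *out);
    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        out[i] = row[i] / sum;
    return out;
}
```

Because the caller's buffer is never modified, a value shared with other code cannot be corrupted behind its back, which is the class of problem the garbage collector error in this list points to.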

tsshmm 0.7.0 (2021-09-17)

New features

  • Support Baum-Welch training. The trained model can be loaded and saved using the params() accessor and setter.
  • Support genome-wide Viterbi inference and remove the limiting regions input.

Significant user-visible changes

  • New models are created by the S4 TSSHMM-class for training and inference.
  • hmm() has been renamed to viterbi().
  • Added train() for model training.
  • Added params() accessor and setter to load and save the model transition and emission matrices.
  • tss() ties are now broken by strand. In other words, if identical peaks are found on the positive strand, the left-most peak is chosen, but on the negative strand, the right-most peak is chosen. This feature was added to alleviate the weaker signal seen when plotting the CA transcription initiation motif of the negative strand compared to the stronger signal seen on the positive strand.
  • Documentation in the vignette appendix now describes the derivation and calculation of the minimum distance between neighboring promoter regions. This distance is necessary to flank reads so that Viterbi can be run genome-wide without exhausting RAM with the dense genomic windows.
  • Documentation in the vignette appendix now includes sessionInfo().
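
The strand-aware tie-breaking of tss() described in this list can be sketched as below. This is a minimal illustration, assuming that "identical peaks" means equal maximum signal values within one region and that the array is ordered by increasing genomic coordinate. The function name peak_index is hypothetical and is not the package's C API.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the maximum of `signal`.  Ties are broken by strand:
 * '+' keeps the left-most maximum, '-' keeps the right-most maximum. */
size_t peak_index(const int *signal, size_t n, char strand)
{
    assert(signal != NULL && n > 0);
    assert(strand == '+' || strand == '-');
    size_t best = 0;
    for (size_t i = 1; i < n; ++i) {
        /* A strictly larger value always wins; an equal value replaces
         * the current best only on the negative strand. */
        if (signal[i] > signal[best] ||
            (strand == '-' && signal[i] == signal[best]))
            best = i;
    }
    return best;
}
```

For the signal {1, 5, 2, 5, 0}, this returns index 1 on the positive strand and index 3 on the negative strand, so in both cases the chosen peak is the one furthest upstream in the direction of transcription.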

Bug fixes and improvements

  • Generating large GRanges of reversed windows has been significantly sped up from 52 minutes down to 2 seconds by hacking the GRanges and IRanges tile() core with a drop-in replacement called tile_with_rev(). This is effectively a one-line code change to the core Bioconductor functions which will be upstreamed.
  • All C functions are now documented using doxygen markup, and C documentation consistency is also checked by continuous integration.
  • HMM computation is now handled using the published GHMM library. This was necessary for complex requirements of Baum-Welch training of the model such as tying emission states. A system installed GHMM library is automatically preferred; if no system GHMM is detected and no GHMM_ROOT to a prefix installation is supplied, the fallback bundled GHMM dependency is instead patched, compiled and installed.
  • C-level tests for using GHMM with multiple models are in a separate repository gitlab.com/omsai/tssghmm/
  • Build system has migrated from Makevars to autotools configure.ac and src/Makefile.am to manage the GHMM dependency, and run recursive make when using the bundled GHMM library.

tsshmm 0.6.0 (2021-07-21)

New features

  • hmm() now returns a metadata column containing all hidden states as an IntegerList() to inform whether the region is a peaked or non-peaked promoter, with the state of each window to aid visualization of the model behavior output track against the input sequencing data track.
  • The timing and intermediate steps of hmm() can be inspected by importing the futile.flogger library and setting flog.threshold(DEBUG).

Significant user-visible changes

  • hmm() on negative strand data no longer suffers from a large speed penalty. Previously, calculations performed on the negative strand were extremely inefficient due to a single call to endoapply(..., rev) to reverse windows. This call has been eliminated by instead patching the tile() methods used by GRanges and IRanges to directly produce reversed windows.

Bug fixes and improvements

  • viterbi() has been vectorized at the R layer by allowing a list of regions to be provided, resulting in about a 4x speedup.
  • viterbi() has been further sped up using OpenMP multithreading at the C layer. The speed of viterbi() is now comparable to the speed of tss().
  • Cleaned up devtools::check() warnings and notes.

tsshmm 0.5.0 (2020-06-15)

Significant user-visible changes

  • hmm() and tss() now have documentation and examples.
  • hmm() now returns GRanges instead of strand split GRangesList.

Bug fixes and improvements

  • Require sorted signal input to tss(), as needed by the underlying C algorithm.

tsshmm 0.4.0 (2020-06-13)

New features

  • Support both strands for hmm() and tss().

Significant user-visible changes

  • Negative signal and background scores are no longer supported.
  • Unstranded signal and background ranges are no longer supported.

Bug fixes and improvements

  • Remove the fragile starting state assumption of the published HMM by instead considering all starting states as equally likely.
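
Treating all starting states as equally likely amounts to a uniform start distribution, with each of the N hidden states given probability 1/N. A minimal sketch follows; the function name uniform_start is hypothetical and is not the package's C API.

```c
#include <assert.h>
#include <stddef.h>

/* Fill the output buffer `start` with a uniform distribution over
 * `n_states` hidden states, so that no state is favored as the first
 * state of a sequence. */
void uniform_start(double *start, size_t n_states)
{
    assert(start != NULL && n_states > 0);
    for (size_t i = 0; i < n_states; ++i)
        start[i] = 1.0 / (double)n_states;
}
```

With a uniform start, decoding does not depend on which state a region happens to begin in, which matters once regions are cut at arbitrary genomic positions.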

tsshmm 0.3.0 (2020-06-02)

New features

  • Added tss() peak detection accelerated by C API.

tsshmm 0.2.0 (2020-05-31)

New features

  • Added hmm() high level GRanges interface to viterbi().

Significant user-visible changes

  • Added package vignette illustrating the model.

Bug fixes and improvements

  • Package passes BiocCheck without any errors.

tsshmm 0.1.0 (2020-05-30)

New features

  • Added viterbi() decoding accelerated by C API.

Bug fixes and improvements

  • Added GitLab continuous integration to regularly install the package and run R CMD check using scripts in the tools/ directory.
  • C-level tests for viterbi() using multiple models are in a separate repository gitlab.com/omsai/viterbi/