---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    fig.path = "man/figures/README-",
    out.width = "100%"
)
```

# blit: Bioinformatics Library for Integrated Tools <img src="man/figures/logo.png" alt="logo" align="right" height="140" width="120"/>

<!-- badges: start -->
[![R-CMD-check](https://github.com/WangLabCSU/blit/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/WangLabCSU/blit/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/blit)](https://CRAN.R-project.org/package=blit)
[![](https://cranlogs.r-pkg.org/badges/blit)](https://cran.r-project.org/package=blit)
<!-- badges: end -->

The goal of `blit` is to make it easy to execute command-line tools from R.

## Installation

You can install `blit` from `CRAN` using:
```{r, eval=FALSE}
install.packages("blit")
```

Alternatively, install the development version from
[GitHub](https://github.com/) with:
``` r
# install.packages("remotes")
remotes::install_github("WangLabCSU/blit")
```

## Example
```{r}
library(blit)
```

### Execute command
To build a `command`, simply use `exec`. The first argument is the command name,
and you can also provide the full path. After that, pass the command parameters.
This will create a `command` object:
```{r}
exec("echo", "$PATH")
```

To run the command, pass the `command` object to `cmd_run()`. (Note:
`stdout = "|"` is always used in the vignette to ensure that the standard output
can be captured by knitr.)
```{r}
Sys.setenv(TEST = "blit is awesome")
exec("echo", "$TEST") |> cmd_run(stdout = "|")
```

Alternatively, you can run it in the background. In this case, a
[`process`](https://processx.r-lib.org/index.html) object will be returned. For
more information, refer to the official site:
```{r eval=FALSE}
proc <- exec("echo", "$TEST") |> cmd_background(stdout = "")
proc$kill()
Sys.unsetenv("TEST")
```

> We use some tricks to capture the output from the background process. The
actual implementation in the `README.Rmd` differs, but the output remains the
same.

```{r echo=FALSE}
file <- tempfile()
proc <- exec("echo", "$TEST") |> cmd_background(stdout = file)
cat(readLines(file), sep = "\n")
Sys.unsetenv("TEST")
```
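
Because the returned object is a `processx` process, its usual methods should be available. The following is a minimal sketch, assuming `cmd_background()` returns a standard `processx::process` object as described above; the output file `out` is only an illustration:

```r
library(blit)

out <- tempfile()
proc <- exec("echo", "hello") |> cmd_background(stdout = out)

proc$wait()            # block until the background process has finished
proc$is_alive()        # should now be FALSE
proc$get_exit_status() # exit status reported by processx
readLines(out)         # standard output that was redirected to the file

file.remove(out)
```

`wait()`, `is_alive()`, `get_exit_status()` and `kill()` are documented methods of `processx::process`.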

`cmd_background()` is provided for completeness. Instead of using this function,
we recommend using `cmd_parallel()`, which can run multiple commands in the
background while ensuring that all processes are properly cleaned up when the
process exits.
```{r}
# IP addresses are copied from Quora <What are some famous IP addresses?>: https://qr.ae/pYlnbQ
address <- c("localhost", "208.67.222.222", "8.8.8.8", "8.8.4.4")
cmd_parallel(
    !!!lapply(address, function(ip) exec("ping", ip)),
    stdouts = TRUE,
    stdout_callbacks = lapply(
        seq_len(4),
        function(i) {
            force(i)
            function(text, proc) {
                sprintf("Connection %d: %s", i, text)
            }
        }
    ),
    timeouts = 4, # terminate after 4s
    threads = 4
)
```

### Environment context
The `blit` package provides several functions to manage and control the
environment context:

  - `cmd_wd`: define the working directory.
  - `cmd_envvar`: define the environment variables.
  - `cmd_envpath`: define the `PATH`-like environment variables.
  - `cmd_condaenv`: add conda environments to the `PATH` environment variable.

```{r}
exec("echo", "$(pwd)") |>
    cmd_wd(tempdir()) |>
    cmd_run(stdout = "|")
```

```{r}
exec("echo", "$TEST") |>
    cmd_envvar(TEST = "blit is very awesome") |>
    cmd_run(stdout = "|")
```
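
Since each context function takes a `command` and returns one, they can presumably be combined in a single pipeline. A minimal sketch, assuming the modifiers compose in this way (the variable name `GREETING` is made up for illustration):

```r
library(blit)

# Build the command with both a working directory and an environment variable.
cmd <- exec("echo", "$GREETING", "$(pwd)") |>
    cmd_wd(tempdir()) |>
    cmd_envvar(GREETING = "hello from")

# Run it; the output should show the greeting followed by the temporary directory.
cmd_run(cmd, stdout = "|")
```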

```{r}
exec("echo", "$PATH") |>
    cmd_envpath("PATH_IS_HERE", action = "replace") |>
    cmd_run(stdout = "|")
```

> Note: `echo` is a built-in command of the Linux shell, so it remains available
even after modifying the `PATH` environment variable.

`cmd_condaenv()` can add a `conda`/`mamba` environment prefix to the `PATH`
environment variable.

`Conda`/`mamba` are open-source package and environment management systems that
facilitate the installation of multiple software versions and their
dependencies. They allow easy switching between environments and are compatible
with Linux, macOS, and Windows.

The `cmd_condaenv()` function accepts multiple `conda`/`mamba` environment
prefixes and an optional `root` argument specifying the path to the
`conda`/`mamba` root prefix. If `root` is not provided, the function searches
for the root in the following order:

   1. the option: `blit.conda.root`.
   2. the environment variable: `BLIT_CONDA_ROOT`.
   3. the root prefix of `appmamba()` (see the `Software management` section for details).

The `cmd_condaenv()` function searches for the specified environment prefix within
the provided `root` path.
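
The lookup order above can be sketched in base R. This is only an illustration of the documented precedence, not the package's actual code: `resolve_conda_root()` is a hypothetical helper, and the `fallback` argument stands in for the `appmamba()` root prefix, whose location is not given here.

```r
# Hypothetical helper, not part of blit: mimics the documented lookup order.
resolve_conda_root <- function(root = NULL, fallback = NULL) {
    # An explicit `root` argument always wins.
    if (!is.null(root)) return(root)
    # 1. the option `blit.conda.root`
    opt <- getOption("blit.conda.root")
    if (!is.null(opt)) return(opt)
    # 2. the environment variable `BLIT_CONDA_ROOT`
    env <- Sys.getenv("BLIT_CONDA_ROOT", unset = NA_character_)
    if (!is.na(env) && nzchar(env)) return(env)
    # 3. stand-in for the appmamba() root prefix (an assumption of this sketch)
    fallback
}

# Placeholder paths, for illustration only.
Sys.setenv(BLIT_CONDA_ROOT = "/env/root")
options(blit.conda.root = "/opt/root")
resolve_conda_root() # the option takes precedence over the environment variable

# Restore the session state.
options(blit.conda.root = NULL)
Sys.unsetenv("BLIT_CONDA_ROOT")
```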

### Software management
The `blit` package integrates with `micromamba`, a lightweight version of the
mamba package manager, for efficient software environment management.

You can install `micromamba` with `install_appmamba()`.
```{r}
install_appmamba()
```

The `appmamba()` function executes specified `micromamba` commands. Running it
without arguments shows the help document:
```{r}
appmamba()
```

To create a new environment named `samtools` and install `samtools` from
`Bioconda`, use:
```{r}
appmamba("create", "--yes", "--name samtools", "bioconda::samtools")
```

Once the environment is created, you can execute commands within it. The
following example locates the samtools binary within the specified environment:
```{r}
exec("which", "samtools") |>
    cmd_condaenv("samtools") |>
    cmd_run()
```

You may want to remove the `samtools` environment afterwards:
```{r}
appmamba("env", "remove", "--yes", "--name samtools")
```

For more details, please see
<https://mamba.readthedocs.io/en/latest/user_guide/micromamba.html>.

### Schedule expressions
Several functions allow you to schedule expressions:

  - `cmd_on_start`/`cmd_on_exit`: define code to run when the command starts or exits.
  - `cmd_on_succeed`/`cmd_on_fail`: define code to run when the command succeeds or fails.

```{r}
file <- tempfile()
file.create(file)
file.exists(file)
exec("ping", "localhost") |>
    cmd_on_exit(file.remove(file)) |>
    cmd_run(timeout = 5, stdout = "|") # terminate it after 5s
file.exists(file)
```

We can also register code to run only when the command succeeds or only when it
fails (a timeout counts as a failure):
```{r}
file <- tempfile()
file.create(file)
file.exists(file)
exec("ping", "localhost") |>
    cmd_on_fail(file.remove(file)) |>
    cmd_run(timeout = 5, stdout = "|") # terminate it after 5s
file.exists(file)
```

```{r}
file <- tempfile()
file.create(file)
file.exists(file)
exec("ping", "localhost") |>
    cmd_on_succeed(file.remove(file)) |>
    cmd_run(timeout = 5, stdout = "|") # terminate it after 5s
file.exists(file) # the file still exists because a timeout counts as a failure
file.remove(file)
```

### Built-in commands
`blit` provides several built-in functions for directly executing specific
commands. These include:
[samtools](https://www.htslib.org/),
[alleleCounter](https://github.com/cancerit/alleleCount),
[cellranger](https://www.10xgenomics.com/cn/support/software/cell-ranger/latest),
[fastq_pair](https://github.com/linsalrob/fastq-pair),
[gistic2](https://broadinstitute.github.io/gistic2/),
[KrakenTools](https://github.com/jenniferlu717/KrakenTools),
[kraken2](https://github.com/DerrickWood/kraken2/wiki/Manual),
[perl](https://www.perl.org/), [pySCENIC](https://github.com/aertslab/pySCENIC),
[python](https://www.python.org/), [seqkit](https://bioinf.shenwei.me/seqkit/),
[trust4](https://github.com/liulab-dfci/TRUST4).

For these commands, you can also use `cmd_help()` to print the help document.
```{r}
python() |> cmd_help(stdout = "|")
```

```{r}
perl() |> cmd_help(stdout = "|")
```

It is also easy to extend `blit` to other commands.

### Pipe
One of the great features of `blit` is its ability to translate the R pipe
(`%>%` or `|>`) into the Linux pipe (`|`). All functions used to create a
`command` object can accept another `command` object. Internally, the first
unnamed input value is captured. If it is a `command` object, it is removed
from the call and saved. When the `command` object is run, the output of the
saved command is passed through the pipe (`|`) to the new command. Here we take
the `gzip` command as an example (assuming you're using a Linux system).

```{r}
tmpdir <- tempdir()
file <- tempfile(tmpdir = tmpdir)
writeLines(letters, con = file)
file2 <- tempfile()
exec("gzip", "-c", file) |>
    exec("gzip", "-d", ">", file2) |>
    cmd_run(stdout = "|")
identical(readLines(file), readLines(file2))
```

Finally, we clean up the temporary files.
```{r}
file.remove(file)
file.remove(file2)
```
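
To make the pipe translation described in this section more concrete, here is a toy model in base R. It is not `blit`'s implementation: `toy_exec()`, `toy_render()` and the `toy_command` class are hypothetical names, and the sketch only builds a shell string instead of running anything. For simplicity it inspects the first argument, whereas the description above refers to the first unnamed input.

```r
# Toy stand-in for a `command` constructor; all names here are hypothetical.
toy_exec <- function(...) {
    args <- list(...)
    upstream <- character()
    # If the first input is a toy command, remove it from the arguments and save it.
    if (length(args) > 0L && inherits(args[[1L]], "toy_command")) {
        upstream <- args[[1L]]$stages
        args <- args[-1L]
    }
    # The remaining arguments form the new stage of the pipeline.
    stage <- paste(unlist(args), collapse = " ")
    structure(list(stages = c(upstream, stage)), class = "toy_command")
}

# Join the saved stages with the shell pipe.
toy_render <- function(x) paste(x$stages, collapse = " | ")

pipeline <- toy_exec("gzip", "-c", "input.txt") |> toy_exec("gzip", "-d")
toy_render(pipeline)
#> [1] "gzip -c input.txt | gzip -d"
```

The R pipe places the left-hand `toy_command` in the first argument position, which is why detecting it there is enough to recover the upstream stage.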

## Development
To add a new command, use the `make_command` function. This helper function is
designed to assist developers in creating functions that initialize new
`command` objects. A `command` object is a bundle of multiple `Command` R6
objects (note the uppercase `"C"` in `Command`, which distinguishes it from the
`command` object) and the associated running environment (including the working
directory and environment variables).

The `make_command` function accepts a function that initializes a new `Command`
object and, when necessary, validates the input arguments. The core purpose is
to create a new `Command` R6 object, so familiarity with the R6 class system is
essential.

There are several private methods or fields you may want to override when
creating a new `Command` R6 object. The first method is `command_locate`, which
determines how to locate the command path. By default, it will attempt to use
the `cmd` argument provided by the user. If no `cmd` argument is supplied, it
will try to locate the command using the `alias` method. In most cases, you will
only need to provide values for the `alias` method, rather than overriding the
`command_locate` method.

For example, consider the `ping` command. Here is how you can define it:
```{r, error = TRUE}
Ping <- R6::R6Class(
    "Ping",
    inherit = Command,
    private = list(alias = function() "ping")
)
ping <- make_command("ping", function(..., ping = NULL) {
    Ping$new(cmd = ping, ...)
})
ping("8.8.8.8") |> cmd_run(timeout = 5, stdout = "|") # terminate it after 5s
```

For command-line tools, the input parameters should always be character
strings. The core principle of the `Command` object is to convert all R objects
(such as data frames) into character strings, typically the file paths of R
objects that have been saved to disk.
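
As an illustration of that principle, the sketch below turns a data frame into a command-line argument by saving it to a temporary file and returning the path. `df_to_arg()` is a hypothetical helper written in base R; how a given `Command` subclass actually performs this conversion is not shown here.

```r
# Hypothetical helper, not part of blit: save a data frame and return its path.
df_to_arg <- function(df, fileext = ".tsv") {
    path <- tempfile(fileext = fileext)
    utils::write.table(
        df,
        file = path, sep = "\t",
        quote = FALSE, row.names = FALSE
    )
    path # a character string that can be passed to a command-line tool
}

arg <- df_to_arg(head(mtcars, 3))
is.character(arg) # TRUE: the data frame is now represented by a file path
file.exists(arg)  # TRUE: the table was written to disk
file.remove(arg)
```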


## Session Information
```{r}
sessionInfo()
```