-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathreadme.shtml
More file actions
275 lines (156 loc) · 10.8 KB
/
readme.shtml
File metadata and controls
275 lines (156 loc) · 10.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>msync -- wrapper for rsync that allows multistream transfer</title>
</head>
<style>
body {Background-color: #FFFFFF link: #0000CC vlink: #660066}
h1 { font-family: arial; text-align: Center; color: #CC0000 }
h2 { font-family: arial; text-align: Center; color: #0000CC }
h3 { font-family: arial; text-align: Center }
h4 { font-family: Arial; color: #006400}
h5 { font-family: Arial; color: #333300; font-style: italic}
h6 { font-family: Arial; font-size: 10pt; font-style: bold}
em { font-style: italic; font-weight: bold; color: #FF0000}
cite { font-style: italic; font-weight: bold; color: blue }
a {color: #0000FF;}
td {font-family:arial,helvetica,sans-serif; font-size:10pt;}
p.petit { font-family: arial; font-size: smaller }
table.swb {Background-color: cyan}
td.swb { border-right-style: solid; border-bottom-style: solid }
table.nb {Background-color: #ffffcc}
tr.nb { text-align: Center; font-weight: bold }
td.nb { border-style: solid }
kbd {font-family: Fixedsys; font-size: 12pt; color: #0000FF;}
tt {font-family: Fixedsys; font-size: 12pt; color: #0000FF;}
code {font-family: Fixedsys; font-size: 12pt; color: #0000FF;}
pre {font-family: Fixedsys; color: #0000FF; font-size: 12pt; margin: 1em 40px; }
pre.code {font-family: Fixedsys; color: #0000FF; margin: 1em 40px;}
blockquote {font-size: 10pt; font-family: arial; background: #F9F9F9}
</style>
<body>
<h1>msync -- wrapper for rsync that allows multistream transfer of a number of large files or a directory tree over high latency WAN
links</h1>
<h3>Can be used for transferring files to Amazon and AWS<h3>
<p> <b>Nikolai Bezroukov, 2018-2020. Licensed under Artistic license
</b>
<h3> Abstract</h3>
<p> Utility<tt> msync </tt>is a wrapper for rsync and<tt> </tt>is designed to speed up transfer multiple large files such as
genomes files to the target server over high latency WAN links (for example to/from AWS)
<p> It organizes files into given number of "piles" and transfers all piles in parallel using for each separate invocation of rsync or tar pipe to the target server.
Can be used for transferring large file such as genomes files to/from AWS. The number of parallel processed launched is specified
via option -p (the default is 4).
<p>It detects and automatically restarts the transmission of partially transferred files, but only if the file name of partially
transferred file is the same as the original (this will not be the case if you use rsync and the target server was rebooted; in this
case you need to rename the files manually)</p>
<p>While this restart option is handy, it is still it is recommended to split files over 5TB into chunks for transfer.</p>
<p>
If not all files were transferred on the next invocation it calculates the differences and acts accordingly until all files
or directory tree is transferred to the target server.
<p>
For this purpose it scans the content of the remote size at the beginning and compare its status with the original:
only missing files will be transferred.
<p>
Can serve as poor man replacement for UDP based transfere software (such as IBM Aspera), although UDP protocol is definitly better suited
for WAN.
<p>
With latency around 100 msec and 8 threads transfer rate varies between 50 and 100 MB/sec.
<p>
Sometimes I managed to transfer over 4TB in 24 hours on 100 msec latency links.
<h3> INVOCATION
</h3>
<p> <em>ATTENTION</em>: both parameters are obligatory and can't be omitted unless they are specified in config file
<p> Like may copy utilities, this utility has two parameters (source dir and target dir). Both are obligatory and can't be omitted unless they are specified in config file
<ul>
<li><b>1st (BASE_DIR)</b> -- <b>the name of BASE directory:</b> the root of the tree from which files are transferred.
<ul>
<li>If option -f is given abs path for all entries in the list that do not stat with '/' will be prefixed this value (see below).
</li>
<li>If option -f is not given the utility transfers all files in this tree, unless grep regular expression is specified via
option -g. In the latter case only subset filtered by this this regex is transferred</li>
</ul>
</li>
<li><b>2nd (TARGET_DIR )</b> -- name of the target directory on target server where files need to be copied to. </li>
</ul>
<p> There are two ways to invoke this utility:
<p>1. By specifying absolute path to BASE directory on the source and target. All files in this directory will be transferred
although it might need multiple invocations.</p>
<pre class="code">msync BASE_DIR_SOURCE BASE_DIR_ON_TARGET </pre>
<p>For example:
</p>
<pre class="code">msync -u backup -t 10.1.1.1 /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 </pre>
<p>2. Specifying selected list of files and directories either with abs path or relative to BASE directory. Only they will be transferred.
</p>
<p> This is subset should be stored one entry per line in the file provided in option -f.
<p> For example:
<pre class="code">msync -t 10.1.1.1 -f critical.lst /doolittle/Projects/Critical/Genome033/ /gpfs/backup/data/bioinformatics/Critical/Genome033 </pre>
<p> <em>NOTE:
</em>
<p> if invoked with three parameters the third parameter is interpreted as the common tail and added to both <tt>BASE_DIR_SOURCE</tt> and
<tt>BASE_DIR_ON_TARGET</tt>
<pre class="code">msync BASE_DIR TARGET_DIRECTORY SUBDIRECTORY </pre>
<p>That means that the previous example can be rewritten as
</p>
<pre class="code">msync -u backup -t 10.1.1.1 -f critical.lst /doolittle/Projects /gpfs/nobackup/data/bioinformatics Critical/Genome033 </pre>
<p> </p>
<h3> OPTIONS
</h3>
<ul>
<li><tt>-c</tt> -- <b>absolute or relative path to the configuration file</b>. Default is the first in the following list:<tt> ~/.config/msync.conf</tt>,
<tt>~/msync.conf</tt>,<tt> /etc/msync.conf</tt> </li>
<li><tt>-f</tt> -- <b>filelist</b>. Either absolute pathname should be specified, or the file should be in BASE directory. Each line can specify either a file with path or a directory (in the latter case all files in this subtree will be transferred).
Files with relative path are considered to based on the BASE DIRECTORY (the first parameter of the invocation) to the target.
</li>
<li><tt>-h</tt> -- help </li>
<li><tt>-g</tt> -- <b>egrep regular expression for selecting files and directories</b> (works only in tree copy mode,
ignored if -f is specified)</li>
<li><tt>-l</tt> -- <b>directory for log files </b>(the default is <tt>/tmp/Msync_</tt><font color="#0000FF"><i><b><userid></b></i></font> )
</li>
<li><tt>-m</tt> -- <b>max amount to be transferred</b> (useful for weekend transfers). Can be specified in T,G,M or K. For example -m 4T
</li>
<li><tt>-p</tt> -- (<b>level of parallelism</b>) number of parallel streams (default is 4);
</li>
<li><tt>-r</tt> -- <b>RUN AT specified time</b>. Specified time is passed to at command (so <tt>now</tt> is acceptable.) If this option is specified it is
passed to at command which launches the set of parallel transfer scripts (as many as specified in option -p) generated by
this utility. For example as <tt>-r now</tt> or -r 19:30 If option -r is not specified, then the launcher
script is generated but not executed. The command for launching them <tt>(launcher.sh </tt>) is very simple and is listed in the protocol. You can schedule the msync utility via cron can be invoked at 7PM each day
and use limit of transferred file (see option -m above) to finish it in the morning. The same trick can work to transfers
sceduled for weekends.</li>
<li><tt>-S</tt> -- <b>list of ssh parameters which is passed to SSH</b> and RSYNC "as is" (for example -S <tt>'-i /home/bezroun/.ssh/id_rsa.tt -P 2222'</tt>;
</li>
<li><tt>-t </tt>-- <b>target site IP or DNS name</b> (should be configured with the passwordless SSH login from the source site)
</li>
<li><tt>-u</tt> -- <b>user name on target site</b> if different from the <tt>USER</tt> env variable (should be configured with the passwordless SSH login from the source site)
</li>
<li><tt>-v</tt> -- <b>verbosity</b> (0-3). 0 -- no messages; 1 -- only serious; 2 - serious and errors; 3 -- serious, errors and warnings;
Default 3</li>
<li><tt>-h</tt> -- <b>help</b> </li>
<li><tt>-d</tt> -- <b>debug flag</b> (0 -production mode; 1-3 -various debugging modes with additional debugging output). if debug is greater then zero, the source is saved in archive
<tt>~/Archive</tt>, if it was changed form the prev run. </li>
</ul>
<h3> NOTES
</h3>
<ol>
<li>Acceleration depends on the number of processes that will be launched in parallel. While default is 4, the number
up to 8 usually work faster on links with latency 100 msec. Further increase to 16 leads to eventual saturation.
<p>You need to experiment to find the optimal number for your situation. For some reason on large files rsync works better
than tar,<br>
even if they do not exist of the target.</li>
<li>If the utility is started in debug mode it compares its body with the last version in archive and if it changed,
updates the archive preserving prev generation<br>
The archive directory is $HOME/Archive. Again, this happens only if debug variable is set to non zero via option -d or
config file. This directory should be different from the directory, from which script is launched</li>
<li>In the future versions option <tt>-b</tt> specified chunk size might be implemented. Right now I do not feel that it is necessary
although I did encountered situation in which some large files need to be split for storage. One advantage of option -b would be that
if the number of threads equal 8 and the file we need to transfer is 40TB will be split into exactly 8 chunks. </li>
</ol>
<pre class="code"> split --bytes 5T --numeric-suffixes --suffix-length=3 foo /tmp/foo. </pre>
<blockquote>
<p>The split commands generate chunks named: foo.000, foo.001 ...
</p>
<p>For re-assembling you need to sort chunks first
<pre class="code">cd $TARGET && cat `ls foo.* | sort` > foo </pre>
</blockquote>
</pre>
</body>
</html>