To deal with large indexes (where the sorted cdx data cannot all fit in core), an idx file is produced, containing one line for each chunk of (typically 3000) cdx lines. Each idx record has four tab-separated fields, one record per line: the cdx key (the surt + ' ' + date), the name of the cdx file the record refers to, the byte offset in that file where the chunk's gzip record starts, and the length in bytes of the gzip record.
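As an illustration of the record layout described above, here is a minimal sketch of parsing one idx line. The names IdxRecord and parse_idx_line are hypothetical, and the sample line uses made-up values; only the field order and the tab separator come from the description.

```python
from typing import NamedTuple


class IdxRecord(NamedTuple):
    """One idx line (hypothetical helper type, not part of any existing tool)."""
    cdx_key: str       # surt + ' ' + date of the first cdx line in the chunk
    cdx_filename: str  # name of the cdx file that holds the chunk
    offset: int        # byte offset of the chunk's gzip record in that file
    length: int        # length in bytes of the gzip record


def parse_idx_line(line: str) -> IdxRecord:
    """Split a tab-separated idx line into its four fields."""
    cdx_key, cdx_filename, offset, length = line.rstrip('\n').split('\t')
    return IdxRecord(cdx_key, cdx_filename, int(offset), int(length))


# Made-up sample record, for illustration only.
rec = parse_idx_line('org,example)/ 20130101000000\texample.cdx.gz\t0\t1234\n')
```

The key itself contains a space, which is why the fields are separated by tabs and not by whitespace in general.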
Roughly, the following scheme is employed, starting with a sorted cdx_lines_source:
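import gzip
import io
import itertools

# idx_output, cdx_output, cdx_filename and cdx_lines_source are supplied by the caller
cdx_block_line_count = 3000
cdx_lines = iter(cdx_lines_source)  # sorted cdx lines, without trailing newlines

with open(idx_output, 'wb') as idx, open(cdx_output, 'wb') as cdx:
    while True:
        chunk = list(itertools.islice(cdx_lines, cdx_block_line_count))
        if not chunk:
            break  # source exhausted
        # compress the chunk as one self-contained gzip record
        b = io.BytesIO()
        with gzip.GzipFile(fileobj=b, mode='wb') as g:
            g.write(('\n'.join(chunk) + '\n').encode('utf-8'))
        z = b.getvalue()
        # cdx key = first two fields (surt and date) of the chunk's first line
        cdx_key = ' '.join(chunk[0].split(' ')[0:2])
        # idx record: key, cdx file name, offset of the gzip record, its length
        record = '%s\t%s\t%d\t%d\n' % (cdx_key, cdx_filename, cdx.tell(), len(z))
        idx.write(record.encode('utf-8'))
        cdx.write(z)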
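The reason for recording the offset and length is that a reader can then fetch a single gzip record without scanning the whole cdx file. A minimal sketch of such a lookup follows, assuming the idx file is small enough to hold in memory, that the file name stored in each idx record can be opened as a path, and that keys sort the same way as Python strings. The function names load_idx, read_block and lookup are hypothetical. For simplicity it reads only one block, so matches that continue from the previous block or into the next one are not returned.

```python
import bisect
import gzip


def load_idx(idx_path):
    """Read the idx file into a sorted key list and a parallel list of block locations."""
    keys, blocks = [], []
    with open(idx_path, 'rb') as f:
        for raw in f:
            key, filename, offset, length = raw.decode('utf-8').rstrip('\n').split('\t')
            keys.append(key)
            blocks.append((filename, int(offset), int(length)))
    return keys, blocks


def read_block(cdx_path, offset, length):
    """Seek to one gzip record, decompress it, and return its cdx lines."""
    with open(cdx_path, 'rb') as f:
        f.seek(offset)
        data = f.read(length)
    return gzip.decompress(data).decode('utf-8').splitlines()


def lookup(keys, blocks, cdx_key):
    """Return cdx lines whose key equals cdx_key, searching a single block."""
    # The candidate block is the last one whose first key is <= cdx_key.
    i = bisect.bisect_right(keys, cdx_key) - 1
    if i < 0:
        return []  # cdx_key sorts before the first block
    filename, offset, length = blocks[i]
    lines = read_block(filename, offset, length)
    return [line for line in lines if line.startswith(cdx_key + ' ')]
```

A caller would run load_idx once and then call lookup for each query, so each query costs one binary search plus one seek and one small decompression.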
Ideally, the cdx writer could write cdx files with these handy idx index files on the side. It would also be wonderful if we had a cdx editor which could, using the idx file, allow records in a cdx file to be edited (as long as the gzip block containing the record didn't grow).