support idx, for 2 level indexes

To deal with large indexes (where the sorted cdx data cannot all fit in memory), an idx file is produced, containing one line for each chunk of cdx lines (typically 3000 lines per chunk). Each chunk is written to the cdx file as its own gzip record. Each idx record has four tab-separated fields, one record per line: the cdx key (surt + ' ' + date) of the first line in the chunk, the name of the cdx file the record refers to, the offset in that file where the chunk's gzip record starts, and the length of that gzip record.
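As an illustration of the record layout described above, here is a minimal sketch of parsing one idx line. The names `IdxRecord` and `parse_idx_line`, and the sample line, are made up for this example and are not part of any existing code.

```python
from typing import NamedTuple


class IdxRecord(NamedTuple):
    """One line of an idx file (hypothetical helper type)."""
    key: str        # cdx key: surt + ' ' + date
    cdx_file: str   # name of the cdx file the record refers to
    offset: int     # byte offset of the gzip record in that file
    length: int     # byte length of the gzip record


def parse_idx_line(line: str) -> IdxRecord:
    # Fields are tab separated; the key itself contains a space, not a tab.
    key, cdx_file, offset, length = line.rstrip('\n').split('\t')
    return IdxRecord(key, cdx_file, int(offset), int(length))


# Made-up sample line, for illustration only.
rec = parse_idx_line('org,example)/ 20140101000000\texample.cdx.gz\t0\t1234\n')
print(rec)
```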

Roughly, the following scheme is employed, starting with a sorted cdx_lines_source:

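```
import gzip
import io
import itertools

cdx_block_line_count = 3000

# cdx_lines_source: sorted cdx lines (str, no trailing newline);
# idx_output, cdx_output and cdx_filename are supplied by the caller
lines = iter(cdx_lines_source)
with open(idx_output, 'wb') as idx, open(cdx_output, 'wb') as cdx:
    while True:
        chunk = list(itertools.islice(lines, cdx_block_line_count))
        if not chunk:
            break  # source exhausted
        # compress the chunk as one self-contained gzip record
        b = io.BytesIO()
        with gzip.GzipFile(fileobj=b, mode='wb') as g:
            g.write(('\n'.join(chunk) + '\n').encode('utf-8'))
        z = b.getvalue()
        # the key is the first two fields (surt and date) of the chunk's first line
        cdx_key = ' '.join(chunk[0].split(' ')[0:2])
        idx.write(('%s\t%s\t%d\t%d\n' % (cdx_key, cdx_filename, cdx.tell(), len(z))).encode('utf-8'))
        cdx.write(z)
```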
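To show how a reader might use such an idx file, here is a minimal lookup sketch: binary search the idx keys for the last block whose first key is less than or equal to the query, then decompress only that block. The function names and the in-memory demo data are assumptions for illustration. The sketch reads a single block, so it does not handle lines with the same key that span a block boundary.

```python
import bisect
import gzip
import io


def find_block(idx_records, key):
    """Return the idx record of the block that may contain `key`.

    idx_records is a list of (cdx_key, cdx_filename, offset, length)
    tuples sorted by cdx_key. Returns None if `key` sorts before the
    first block.
    """
    keys = [r[0] for r in idx_records]
    i = bisect.bisect_right(keys, key) - 1
    return idx_records[i] if i >= 0 else None


def read_block(cdx_file, offset, length):
    """Decompress one gzip record and return its cdx lines."""
    cdx_file.seek(offset)
    return gzip.decompress(cdx_file.read(length)).decode('utf-8').splitlines()


def lookup(idx_records, cdx_file, key):
    """Return the cdx lines in the candidate block whose key equals `key`."""
    rec = find_block(idx_records, key)
    if rec is None:
        return []
    _, _, offset, length = rec
    return [l for l in read_block(cdx_file, offset, length)
            if ' '.join(l.split(' ')[0:2]) == key]


# Tiny demo: two blocks of made-up cdx lines held in memory.
blocks = [
    ['com,a)/ 2014 x', 'com,b)/ 2014 y'],
    ['com,c)/ 2014 z', 'com,d)/ 2014 w'],
]
cdx_file = io.BytesIO()
idx_records = []
for block in blocks:
    z = gzip.compress(('\n'.join(block) + '\n').encode('utf-8'))
    key = ' '.join(block[0].split(' ')[0:2])
    idx_records.append((key, 'demo.cdx.gz', cdx_file.tell(), len(z)))
    cdx_file.write(z)

print(lookup(idx_records, cdx_file, 'com,d)/ 2014'))
```

Only the idx keys need to be held in memory (or binary searched on disk); each query then costs one seek and one small decompress.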

Ideally, the cdx writer would write cdx files with these handy idx files on the side. It would also be wonderful to have a cdx editor that uses the idx file to edit records in a cdx file in place, as long as the gzip block containing the record doesn't grow.
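One possible shape for such an editor, as a rough sketch only: decompress the block named by an idx record, apply an edit, recompress, and write it back at the same offset if it still fits. The function `edit_block` and the demo data are hypothetical. If the block shrinks, the leftover bytes stay in the file as unused padding, so the idx length would need updating and the file could no longer be decompressed as one continuous gzip stream. Editing the first line's key would also require updating the idx key.

```python
import gzip
import io


def edit_block(cdx_file, offset, length, edit):
    """Rewrite one gzip block in place and return its new compressed length.

    `edit` maps a list of cdx lines to a new list of cdx lines. Raises
    ValueError if the recompressed block is larger than the original.
    """
    cdx_file.seek(offset)
    lines = gzip.decompress(cdx_file.read(length)).decode('utf-8').splitlines()
    new_z = gzip.compress(('\n'.join(edit(lines)) + '\n').encode('utf-8'))
    if len(new_z) > length:
        raise ValueError('edited block grew from %d to %d bytes'
                         % (length, len(new_z)))
    cdx_file.seek(offset)
    cdx_file.write(new_z)
    return len(new_z)


# Demo on a single in-memory block of made-up cdx lines.
z = gzip.compress(b'com,a)/ 2014 old-value\ncom,b)/ 2014 y\n')
cdx_file = io.BytesIO(z)
new_len = edit_block(cdx_file, 0, len(z),
                     lambda ls: [l.replace('old-value', 'new') for l in ls])
print(len(z), new_len)
```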

