
support idx, for 2 level indexes #8

@samuelinsf


To deal with large indexes (where the sorted cdx data cannot all fit in memory), an idx file is produced, containing one line for each chunk of (typically 3000) cdx lines. Each idx record has four tab-separated fields, one record per line: the cdx key of the chunk's first line (the surt + ' ' + date), the name of the cdx file the record refers to, the offset in that file where the chunk's gzip record starts, and the length of that gzip record in bytes.
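
For illustration, a reader of such an idx file might parse each record like this. This is only a sketch: the names `IdxRecord` and `parse_idx_line` are hypothetical, and the sample key and file name are made up.

```python
from typing import NamedTuple


class IdxRecord(NamedTuple):
    key: str       # surt + ' ' + date of the first cdx line in the chunk
    cdx_file: str  # name of the cdx file holding the chunk
    offset: int    # byte offset of the chunk's gzip record in that file
    length: int    # length of the gzip record in bytes


def parse_idx_line(line: str) -> IdxRecord:
    # The key contains a space, so split on tabs only.
    key, cdx_file, offset, length = line.rstrip('\n').split('\t')
    return IdxRecord(key, cdx_file, int(offset), int(length))


# Made-up sample record, for illustration only.
rec = parse_idx_line('org,example)/ 20130101000000\texample.cdx.gz\t0\t1234\n')
```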

Roughly, the following scheme is employed, starting with a sorted cdx_lines_source (an iterator over cdx lines):

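import gzip
import io
import itertools

cdx_block_line_count = 3000

# cdx_lines_source yields sorted cdx lines as str, without trailing newlines
lines = iter(cdx_lines_source)
with open(idx_output, 'wb') as idx, open(cdx_output, 'wb') as cdx:
    while True:
        chunk = list(itertools.islice(lines, cdx_block_line_count))
        if not chunk:
            break  # source exhausted
        # compress the chunk as one standalone gzip record
        b = io.BytesIO()
        with gzip.GzipFile(fileobj=b, mode='wb') as g:
            g.write(('\n'.join(chunk) + '\n').encode('utf-8'))
        z = b.getvalue()
        # key = first two space-separated fields (surt, date) of the chunk's first line
        cdx_key = ' '.join(chunk[0].split(' ')[0:2])
        # idx record: key, cdx file name, offset of the gzip record, its length
        rec = '%s\t%s\t%d\t%d\n' % (cdx_key, cdx_filename, cdx.tell(), len(z))
        idx.write(rec.encode('utf-8'))
        cdx.write(z)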
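
On the reading side, a lookup could work roughly as follows. This is a sketch under assumptions, not an existing API: `find_first_block` and `read_block` are hypothetical names, the idx keys are assumed to be loaded into a sorted list, and blocks are assumed to be UTF-8 text compressed as in the scheme above.

```python
import bisect
import gzip


def find_first_block(idx_keys, key):
    """Return the index of the first block that may contain `key`.

    idx_keys is the sorted list of first keys, one per block. Lines equal
    to `key` can start in the block just before the first block whose
    first key is >= key, so that earlier block is returned.
    """
    return max(bisect.bisect_left(idx_keys, key) - 1, 0)


def read_block(cdx, offset, length):
    """Read one gzip record from an open binary file; return its cdx lines."""
    cdx.seek(offset)
    data = gzip.decompress(cdx.read(length))
    return data.decode('utf-8').splitlines()
```

A caller would scan forward from the returned block, reading following blocks until the lines sort past the key being searched for.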

Ideally, the cdx writer could write cdx files with these handy idx files on the side. It would also be wonderful to have a cdx editor that could use the idx file to let records in a cdx file be edited in place (as long as the gzip block containing the record didn't grow).
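
One possible shape for such an in-place edit is sketched below. Everything here is an assumption rather than an existing tool: `rewrite_block` is a hypothetical name, the freed space is padded with zero bytes (whether a given reader tolerates that padding would need checking), and the idx record would still need its length field updated to the returned value.

```python
import gzip


def rewrite_block(cdx, offset, length, edit):
    """Apply `edit` (list of lines -> list of lines) to one gzip block in place.

    `cdx` is a file opened in binary read/write mode. Returns the new
    compressed length; raises ValueError if the block would grow.
    """
    cdx.seek(offset)
    lines = gzip.decompress(cdx.read(length)).decode('utf-8').splitlines()
    z = gzip.compress(('\n'.join(edit(lines)) + '\n').encode('utf-8'))
    if len(z) > length:
        raise ValueError('block would grow from %d to %d bytes' % (length, len(z)))
    cdx.seek(offset)
    # Assumption: pad with zero bytes so later blocks keep their recorded offsets.
    cdx.write(z + b'\0' * (length - len(z)))
    return len(z)
```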
