Currently, the database is 378MB. This seems huge considering that it theoretically only contains hashes, versions and filenames.
A quick investigation revealed that:
- The database contains hashes for useless filenames. Some files are inside the IntelliJ `.idea` folder, and knowing their hash is not useful. The same goes for all files in test-related folders: a query on the database revealed that these test files account for at least 40% of all files. Some files also end with the `.php` extension. There is no chance that these files will be useful to detect a version. PHP files represent 50% of all files.
- The versions are stored in JSON, which is quite verbose. All hashes seem to correspond to a continuous range of versions. Replacing the JSON field with two fields, 'initial_version' and 'last_version', could greatly reduce the size of this data. The range could then be computed using the `version` table. Ideally, the `version` table would store the versions in order. By limiting it to numerical versions (the others are useless because Cyberwatch cannot find associated CVEs), the ordering can be alphanumeric.
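- The hashes are stored as strings, using 64 bytes each instead of 8 bytes. As there are 600 000 hashes, converting these strings to binary could save up to 33 MB (600 000 × 56 bytes ≈ 33.6 MB). Note that if the strings are full 64-character hexadecimal digests, their binary form takes 32 bytes rather than 8, and the saving would be closer to 19 MB.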
- The `versions` table seems useless, but I may be wrong. There are only 1600 entries, so this is not very important.
- The name of the technology is stored in each row of the `hash` and `file` tables. Each of these entries uses 8 bytes. Adding a `technology` table and using small foreign keys (u32, for example) can save some space.
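The ideas above could be sketched roughly as follows, using Python and SQLite. This is only an illustration under assumptions, not the actual schema: the skipped folder names, the column names `digest`, `ordinal` and `label`, and the helper functions are all hypothetical. The table names `hash`, `version` and `technology` and the fields `initial_version` and `last_version` come from the points above. The sketch also assumes the hashes are hexadecimal strings.

```python
import sqlite3
from pathlib import PurePosixPath

# Hypothetical filter rules; the real folder names and extensions would
# come from inspecting the actual database.
SKIPPED_DIRS = {".idea", "test", "tests"}
SKIPPED_SUFFIXES = {".php"}


def is_useful(path: str) -> bool:
    """Return False for files whose hash cannot help identify a version."""
    p = PurePosixPath(path)
    if p.suffix.lower() in SKIPPED_SUFFIXES:
        return False
    # Only the directory components are checked, not the file name itself.
    return not any(part.lower() in SKIPPED_DIRS for part in p.parts[:-1])


def hash_to_blob(hex_digest: str) -> bytes:
    """Convert a hexadecimal digest string to raw bytes (half the size)."""
    return bytes.fromhex(hex_digest)


# Assumed schema: technology names are stored once, hashes are binary,
# and each hash points at the first and last version of its range.
SCHEMA = """
CREATE TABLE technology (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE version (
    id            INTEGER PRIMARY KEY,
    technology_id INTEGER NOT NULL REFERENCES technology(id),
    ordinal       INTEGER NOT NULL,  -- rank of the version in release order
    label         TEXT NOT NULL
);
CREATE TABLE hash (
    digest          BLOB NOT NULL,
    technology_id   INTEGER NOT NULL REFERENCES technology(id),
    initial_version INTEGER NOT NULL REFERENCES version(id),
    last_version    INTEGER NOT NULL REFERENCES version(id)
);
"""


def versions_for(conn: sqlite3.Connection, digest: bytes) -> list:
    """Expand the stored (initial_version, last_version) pair into labels."""
    rows = conn.execute(
        """
        SELECT v.label
        FROM hash h
        JOIN version v_lo ON v_lo.id = h.initial_version
        JOIN version v_hi ON v_hi.id = h.last_version
        JOIN version v
          ON v.technology_id = h.technology_id
         AND v.ordinal BETWEEN v_lo.ordinal AND v_hi.ordinal
        WHERE h.digest = ?
        ORDER BY v.ordinal
        """,
        (digest,),
    ).fetchall()
    return [row[0] for row in rows]
```

The `file` table could reference `technology` in the same way. SQLite stores integers in a variable number of bytes, so a small technology id already takes less space than a repeated name; a fixed u32 key would be the equivalent in other engines.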
Action required