Cache does not detect stale files #251

@arthurlorenzi

Problem

When a file is updated on the DataSUS FTP server, PySUS silently returns the outdated local copy if a file with the same name already exists in the cache. There is no staleness check of any kind; the only test is whether a file with that name exists:

if existing.exists():

Proposed Solution

The FileInfo metadata fetched from the FTP server already contains the file's modification date and size. These can be compared against cached metadata to determine staleness.

A safe approach is to store the server's modification timestamp and file size in a separate sidecar metadata file at download time:

import json

# on download, write a sidecar next to the cached file
meta = {
    "last_update": self.__info["last_update"].isoformat(),
    "size": int(self.__info["size"]),
}
with open(str(filepath) + ".pysus_meta.json", "w") as f:
    json.dump(meta, f)

Then, on cache check, compare the server metadata against the values stored in the sidecar. Something like:

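import json
import pathlib
from datetime import datetime

meta_path = pathlib.Path(str(existing) + ".pysus_meta.json")
if meta_path.exists():
    meta = json.loads(meta_path.read_text())
    cached_modify = datetime.fromisoformat(meta["last_update"])
    # cache is valid only if it is at least as new as the server copy and the size matches
    same_size = meta["size"] == int(self.__info["size"])
    if cached_modify >= self.__info["last_update"] and same_size:
        return Data(str(existing))
# otherwise (no sidecar, older timestamp, or size mismatch), re-download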

No additional FTP requests are needed. The metadata is already available at download time.
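
For reference, here is a minimal standalone sketch of the sidecar scheme described above. The helper names `write_sidecar` and `is_fresh`, the dict-shaped `info`, and the file name `example.dbc` are assumptions made for illustration; they are not existing PySUS APIs, and the real change would live inside the download method.

```python
import json
import pathlib
import tempfile
from datetime import datetime

SIDECAR_SUFFIX = ".pysus_meta.json"  # suffix taken from the proposal above


def write_sidecar(filepath, info):
    """Hypothetical helper: record the server metadata next to the cached file."""
    meta = {
        "last_update": info["last_update"].isoformat(),
        "size": int(info["size"]),
    }
    pathlib.Path(str(filepath) + SIDECAR_SUFFIX).write_text(json.dumps(meta))


def is_fresh(filepath, info):
    """Hypothetical helper: True only if the cached copy matches the server metadata."""
    filepath = pathlib.Path(filepath)
    meta_path = pathlib.Path(str(filepath) + SIDECAR_SUFFIX)
    if not filepath.exists() or not meta_path.exists():
        return False  # nothing cached, or cached before sidecars were written
    meta = json.loads(meta_path.read_text())
    cached_modify = datetime.fromisoformat(meta["last_update"])
    return cached_modify >= info["last_update"] and meta["size"] == int(info["size"])


# Simulate a download followed by a server-side update of the same file.
with tempfile.TemporaryDirectory() as tmp:
    cached = pathlib.Path(tmp) / "example.dbc"  # placeholder file name
    cached.write_bytes(b"x" * 10)

    old_info = {"last_update": datetime(2024, 1, 1), "size": 10}
    write_sidecar(cached, old_info)
    fresh_before = is_fresh(cached, old_info)

    # The server now reports a newer timestamp and a different size.
    new_info = {"last_update": datetime(2024, 6, 1), "size": 12}
    fresh_after = is_fresh(cached, new_info)
```

Treating a missing sidecar as stale means files cached by older versions would be re-downloaded once and then tracked normally.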

Possible API Extension

Beyond staleness detection, it would be worth exposing a use_cache parameter to allow users to bypass the cache entirely:

file.download(use_cache=False)
file.async_download(use_cache=False)

This is particularly useful for pipelines that need guaranteed fresh data regardless of cache state, without having to manually delete cached files.
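
A rough sketch of how the flag could behave, independent of PySUS internals. The `download` function, its `fetch` callable, and the file name are hypothetical stand-ins; only the `use_cache` parameter name comes from the proposal above.

```python
import pathlib
import tempfile


def download(name, local_dir, fetch, use_cache=True):
    """Hypothetical stand-in for a download method with a cache bypass.

    `fetch` is an assumed callable that returns the remote bytes for `name`.
    """
    target = pathlib.Path(local_dir) / name
    if use_cache and target.exists():
        return target  # cache hit: nothing is transferred
    target.write_bytes(fetch(name))  # cache miss or explicit bypass: overwrite
    return target


calls = []


def fake_fetch(name):
    """Fake transfer that records each call so the demo can count them."""
    calls.append(name)
    return f"payload {len(calls)}".encode()


with tempfile.TemporaryDirectory() as tmp:
    download("example.dbc", tmp, fake_fetch)  # first call transfers
    download("example.dbc", tmp, fake_fetch)  # second call hits the cache
    transfers_with_cache = len(calls)

    path = download("example.dbc", tmp, fake_fetch, use_cache=False)  # forced refresh
    final_bytes = path.read_bytes()
```

Defaulting `use_cache` to `True` would keep the current behavior for existing callers.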

Context

This was identified while building a platform that will periodically monitor DataSUS sources and download files when they are new or updated. The final goal is to compile a collection of epidemiological indicators. If PySUS returns stale data from its cache, those indicators would be computed from outdated files without any warning. We are currently wrapping every download in a temporary directory block, but that is just a workaround:

with tempfile.TemporaryDirectory() as tmp:
    data = sinan.download(file, local_dir=tmp)

If you are OK with the proposed changes, I can write the PR.
