Problem
When a file is updated on the DataSUS FTP server, PySUS will silently return the outdated local copy if a file with the same name already exists in the cache. There is no staleness check of any kind; the code only checks whether a file with the same name exists (PySUS/pysus/ftp/__init__.py, lines 161 and 205 at 469548d).
Proposed Solution
The FileInfo metadata fetched from the FTP server already contains the file's modification date and size. These can be compared against cached metadata to determine staleness.
A safe approach is to store the server's modification timestamp and size in a separate metadata file at download time:
```python
# on download, write a sidecar metadata file next to the data file
meta = {
    "last_update": self.__info["last_update"].isoformat(),
    "size": int(self.__info["size"]),
}
with open(str(filepath) + ".pysus_meta.json", "w") as f:
    json.dump(meta, f)
```
Then on cache check, compare server metadata against the local file. Something like:
```python
meta_path = pathlib.Path(str(existing) + ".pysus_meta.json")
if meta_path.exists():
    meta = json.loads(meta_path.read_text())
    cached_modify = datetime.fromisoformat(meta["last_update"])
    # cache hit only if the local copy is at least as new and the same size
    if cached_modify >= self.__info["last_update"] and meta["size"] == int(self.__info["size"]):
        return Data(str(existing))
# otherwise, re-download
```
No additional FTP requests are needed. The metadata is already available at download time.
Possible API Extension
Beyond staleness detection, it would be worth exposing a use_cache parameter to allow users to bypass the cache entirely:
```python
file.download(use_cache=False)
file.async_download(use_cache=False)
```
This is particularly useful for pipelines that need guaranteed fresh data regardless of cache state, without having to manually delete cached files.
Context
This was identified while building a platform that periodically monitors DataSUS sources and downloads files when they are new or updated, with the end goal of compiling a collection of epidemiological indicators. If PySUS returns stale data because of caching, those indicators would silently be computed from outdated files. We are currently wrapping every download in a temporary directory to defeat the cache, but that is just a workaround:
```python
with tempfile.TemporaryDirectory() as tmp:
    data = sinan.download(file, local_dir=tmp)
```
If you are OK with the proposed changes, I can write the PR.