Steps to reproduce:
- Write a container with an app that generates a PDF from its inputs. For example, something similar to the basic examples for the Python libraries reportlab and matplotlib.
- Upload the container into Kive, and launch a run.
- When the run is finished, rerun it.
Expected behaviour: the outputs should match.
Actual behaviour: PDF outputs usually don't match.
Analysis
Most libraries write PDF files with a creation timestamp in the file content. That means the exact same data inputs won't generate byte-identical outputs if the two runs happened at different times.
Good news, though. It looks like both matplotlib and reportlab support the SOURCE_DATE_EPOCH environment variable, which is intended to help make outputs reproducible. If we took the timestamp of the container and passed it to each run in SOURCE_DATE_EPOCH, that would probably avoid the problem of PDFs not matching after reruns.
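A minimal sketch of the container-timestamp idea, assuming the run launcher can build the environment for the container process. The helper name `build_run_env` and the `container_created` argument are hypothetical, not existing Kive code.

```python
import os
from datetime import datetime, timezone


def build_run_env(container_created, base_env=None):
    """Return a copy of the environment with SOURCE_DATE_EPOCH set.

    container_created is a hypothetical timezone-aware datetime for when
    the container was uploaded; the real field name in Kive may differ.
    """
    env = dict(os.environ if base_env is None else base_env)
    # SOURCE_DATE_EPOCH is an integer number of seconds since the Unix epoch.
    env["SOURCE_DATE_EPOCH"] = str(int(container_created.timestamp()))
    return env


created = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
env = build_run_env(created, base_env={"PATH": "/usr/bin"})
print(env["SOURCE_DATE_EPOCH"])
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when the run is launched. Because the container timestamp never changes, every rerun would see the same value.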
Another option is to take the latest date from the container and all the input datasets. I think that would be easier to understand, but I suspect it would be unreliable. For example, if we rerun two nested runs where the output of one is the input of the other, and we have to recreate that output, then the recreated output would have a different timestamp from the first one. The downstream run would then get a different SOURCE_DATE_EPOCH, and its PDFs would stop matching again.
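To illustrate why the latest-date option looks unreliable, here is a small sketch with made-up timestamps. The `latest_epoch` helper is hypothetical and only models the rule "newest of the container and its inputs".

```python
from datetime import datetime, timedelta, timezone


def latest_epoch(container_created, input_created):
    """Pick the newest timestamp among the container and its inputs."""
    newest = max([container_created, *input_created])
    return int(newest.timestamp())


container = datetime(2020, 1, 1, tzinfo=timezone.utc)
# The intermediate dataset is the output of the first nested run
# and the input of the second one.
intermediate = datetime(2020, 1, 3, tzinfo=timezone.utc)
original_epoch = latest_epoch(container, [intermediate])

# On rerun, the intermediate dataset has to be recreated,
# so it gets a newer timestamp than the original had.
recreated = intermediate + timedelta(days=30)
rerun_epoch = latest_epoch(container, [recreated])

print(original_epoch == rerun_epoch)
```

The two values differ, so the second run would embed a different date on rerun even though the data is the same.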
If a pipeline wants to generate PDFs with the current date as the creation timestamp, it could unset the SOURCE_DATE_EPOCH environment variable before calling the library code.
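A pipeline that wants the current date could opt out with something like the sketch below. The function name is hypothetical, and it assumes the library reads the variable when the document is created rather than at import time.

```python
import os


def use_current_time_for_pdfs():
    """Remove SOURCE_DATE_EPOCH and return its previous value, if any."""
    # pop() with a default does nothing if the variable was never set.
    return os.environ.pop("SOURCE_DATE_EPOCH", None)


os.environ["SOURCE_DATE_EPOCH"] = "1577934245"  # as if set by the run launcher
previous = use_current_time_for_pdfs()
# ... call the PDF library here ...
```

Returning the previous value lets the pipeline restore it afterwards if only some of its outputs should carry the current date.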