feature/dataset-config-refactor #256

Merged: lbesnard merged 12 commits into main from feature/dataset-config-refactor on Apr 10, 2026
Conversation

@thommodin (Contributor) commented Mar 2, 2026

This pull request updates dependency versions and improves configuration file handling in the codebase. The most significant changes are the upgrade of the h5py library, the addition of pydantic as a dependency, and a fix to the way configuration files are accessed using importlib.resources.

Dataset Config Refactor
Refactored DatasetConfig into separate files and added loading from disk instead of from the module:

  • Ability to load config from external sources
  • Improved readability and maintainability of the config pydantic classes.

Dependency updates:

  • Upgraded h5py from version 3.11.0 to 3.13.0 in pyproject.toml, which allows installation on macOS
  • The workaround for development on macOS is to upgrade h5py to 3.13 locally, then roll back before publication and let the unit tests run in CI/CD. Take special care around errors and ensure they are related to the nanosecond/microsecond timestamp issue.
  • Added pydantic>=2.12.5 to the dependencies in pyproject.toml for data validation and settings management.

Configuration handling improvements:

  • Refactored the import of importlib.resources to use as_file for safer and more correct file handling, and updated the context manager in load_variable_from_config to use as_file when accessing the common.json configuration file. This change improves compatibility and reliability when working with package resources.

Delete uv lock
uv.lock snuck into the main branch somehow(?)

Deleted it...

❯ git log --name-status -- uv.lock
commit aabf718 (HEAD -> feature/dataset-config-refactor, origin/feature/dataset-config-refactor)
Author: Tom Galindo <98626996+thommodin@users.noreply.github.com>
Date: Mon Mar 2 15:46:43 2026 +1100

init

D uv.lock

commit 686fe73
Author: karishma <karishma24khanna@gmail.com>
Date: Fri Jan 23 11:36:25 2026 +1100

AMSA Config File

A uv.lock

Copilot AI review requested due to automatic review settings March 2, 2026 04:50

# Validate
return DatasetConfig.model_validate(dataset_config)

thommodin (Contributor, Author):
load_from_path gives us the ability to load configuration from a path instead of from the module

lbesnard (Collaborator):

I'm trying to get my bearings again, I thought the existing load_dataset_config could load from a path already:

14:from aodn_cloud_optimised.lib.config import load_dataset_config
121:        dataset_ardc_netcdf_config = load_dataset_config(DATASET_CONFIG_NC_ARDC_JSON)
133:        dataset_soop_sst_netcdf_config = load_dataset_config(

thommodin (Contributor, Author):

You are correct.

When I opened this task I must have assumed there was tight coupling to loading from the aodn_cloud_optimised config dir; it actually exists in dataflow-orchestration here.

dataset_config = load_dataset_config(
    str(files("aodn_cloud_optimised.config.dataset").joinpath(dataset_config))
)

Anyway, the refactoring of all these classes into separate files makes this a lot easier to maintain and extend.

Attaching the load logic to the class and returning instances of the class from the load function are also improvements that:

  1. Make it easier to track and update load logic (now part of the class functionality)
  2. Allow extension later, e.g. loading config from external config stores such as Prefect or AWS.

Copilot AI left a comment

Pull request overview

This PR upgrades key Python dependencies and refactors dataset configuration validation into a dedicated aodn_cloud_optimised/bin/config package, while also fixing package-resource access to use importlib.resources.as_file for safer config file handling.

Changes:

  • Upgrade h5py and add pydantic as a first-class dependency.
  • Switch load_variable_from_config to use importlib.resources.as_file when reading common.json.
  • Extract the large, inline Pydantic config schema from generic_cloud_optimised_creation.py into separate modules under aodn_cloud_optimised/bin/config/.

Reviewed changes

Copilot reviewed 2 out of 17 changed files in this pull request and generated no comments.

Summary per file:
File Description
pyproject.toml Updates dependency pins and adds pydantic dependency.
aodn_cloud_optimised/lib/config.py Uses as_file() to safely access packaged common.json.
aodn_cloud_optimised/bin/generic_cloud_optimised_creation.py Removes inline schema classes and imports new config models.
aodn_cloud_optimised/bin/config/__init__.py Exposes new config models as a package.
aodn_cloud_optimised/bin/config/cluster_config.py Defines Pydantic model for cluster configuration.
aodn_cloud_optimised/bin/config/coiled_cluster_options.py Defines Pydantic model for Coiled cluster options.
aodn_cloud_optimised/bin/config/worker_options.py Defines Pydantic model for worker sizing options.
aodn_cloud_optimised/bin/config/ec2_cluster_options.py Defines Pydantic model for EC2 cluster options.
aodn_cloud_optimised/bin/config/ec2_adapt_options.py Defines Pydantic model for EC2 adapt (min/max) options.
aodn_cloud_optimised/bin/config/path_config.py Defines and validates dataset input path configuration.
aodn_cloud_optimised/bin/config/run_settings.py Defines and validates run-level settings and S3 option blocks.
aodn_cloud_optimised/bin/config/csv_config_model.py Validates mutually exclusive pandas vs polars CSV reader config.
aodn_cloud_optimised/bin/config/parquet_schema_transformation.py Validates parquet schema transformation blocks.
aodn_cloud_optimised/bin/config/zarr_schema_transformation.py Validates zarr schema transformation blocks.
aodn_cloud_optimised/bin/config/dataset_config.py Defines the top-level dataset config model and file-loading helpers.

thommodin (Contributor, Author):

@lbesnard I started work on that external config load ability, starting with a refactor of the DatasetConfig class.

Looks like it is failing some tests in the Actions runs but they don't fail locally, so I'm a bit confused. If you search on AssertionError you'll see that numbers appear to be read incorrectly all over the place; have you seen this before?

lbesnard (Collaborator) commented Mar 2, 2026

Refactored DatasetConfig into separate files and added loading from disk instead of from the module:

It could maybe be less confusing if all the pydantic models were in a separate folder, e.g. under config/pydantic.

lbesnard (Collaborator) commented Mar 2, 2026

Maybe try using https://github.com/nektos/act to help debug the failed tests? There are definitely a lot; this one is a bit weird:

self.assertEqual(parquet_dataset["timestamp"][0], 1293840000)
E       AssertionError: 1293840 != 1293840000

@thommodin changed the title from init to feature/dataset-config-refactor on Mar 3, 2026
thommodin (Contributor, Author):

@lbesnard I think I pinpointed the issue with the h5py 3.13 upgrade; it looks like the default timestamp export was in microseconds rather than nanoseconds in the new version, which caused the unit test failures.

I rolled h5py back to 3.11 and it looks like everything is fine now.
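For reference, the failing values in the earlier AssertionError differ by exactly a factor of 1000 (1293840000 seconds is 2011-01-01T00:00:00Z). A hedged sketch, not from this PR, of one way to normalise epoch values before comparing, so a unit change surfaces as an obvious datetime mismatch rather than a confusing integer difference:

```python
import datetime


def epoch_to_utc(value, unit="s"):
    """Convert an epoch value in the given unit to an aware UTC datetime."""
    scale = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}[unit]
    return datetime.datetime.fromtimestamp(value / scale, tz=datetime.timezone.utc)
```

Comparing datetimes this way makes tests resilient to a library changing its default export unit between versions.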

Also chucked the config models into a separate folder as suggested :)

codecov-commenter commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 83.14607% with 135 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.48%. Comparing base (0ee3bfa) to head (1c23fa0).

Files with missing lines Patch % Lines
.../bin/config/model/parquet_schema_transformation.py 72.52% 50 Missing ⚠️
...dn_cloud_optimised/bin/config/model/path_config.py 72.44% 27 Missing ⚠️
...sed/bin/config/model/zarr_schema_transformation.py 80.35% 22 Missing ⚠️
...cloud_optimised/bin/config/model/dataset_config.py 80.55% 21 Missing ⚠️
...n_cloud_optimised/bin/config/model/run_settings.py 87.87% 8 Missing ⚠️
test_aodn_cloud_optimised/conftest.py 33.33% 4 Missing ⚠️
...oud_optimised/bin/config/model/csv_config_model.py 88.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
+ Coverage   65.00%   67.48%   +2.48%     
==========================================
  Files          28       41      +13     
  Lines        5404     5678     +274     
==========================================
+ Hits         3513     3832     +319     
+ Misses       1891     1846      -45     


),
)
# TODO: Implement aws_opendata_registry model for validation
aws_opendata_registry: dict | None = pydantic.Field(
thommodin (Contributor, Author):

Added aws_opendata_registry

There are some checks for empty values, but they now skip aws_opendata_registry so they pass.

parent_config: str | None = pydantic.Field(
default=None,
description="Name of the parent config to load on top of. May be recursive.",
)
thommodin (Contributor, Author):

Also logger_name, handler_class and parent_config are all added with defaults. This allows the DatasetConfig class to pick them up if they exist.

@@ -75,11 +75,6 @@
}
}
},
thommodin (Contributor, Author):

The spatial_extent is properly captured elsewhere in the config now

@@ -93,11 +93,9 @@ def load_variable_from_config(variable_name) -> str:
:raises KeyError: If the variable is not found in the configuration file.
"""
# Obtain the file path using the context manager
thommodin (Contributor, Author):

Could consider using the import directly, e.g.:

pathlib.Path(aodn_cloud_optimised.config.__path__[0]) / "common.json"

Kind of cleaner than the importlib files/as_file approach.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 24 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

Comment thread test_aodn_cloud_optimised/conftest.py
lbesnard (Collaborator):

@copilot can you check that nothing got missed from the previous aodn_cloud_optimised/bin/generic_cloud_optimised_creation.py file when refactored and moved to the various config/models/*.py . If the new class/methods/functions aren't a direct copy/paste, can you highlight it, explain the changes, and tell me which line and file to look at manually?

Copilot AI commented Mar 16, 2026

@lbesnard I've opened a new pull request, #258, to work on those changes. Once the pull request is ready, I'll request review from you.

Comment thread pyproject.toml
Comment thread .vscode/settings.json Outdated
lbesnard (Collaborator):

what is that file for? where is it used?

thommodin (Contributor, Author):

This routes the path correctly for udunits on macOS.


# Config files that intentionally contain "FILL UP MANUALLY" placeholders and
# are expected to fail DatasetConfig validation — excluded from the compat sweep.
_PLACEHOLDER_CONFIGS = {
lbesnard (Collaborator):

From what I understand here, you're running the tests directly on the configs available in config/dataset.
The load_from_cloud_optimised_directory function should be mocked, or point to test configs available in the test/resources path.
The config files you're pointing to right now may not have placeholders in the future.

thommodin (Contributor, Author):

Currently this dynamically loads all config in the directory except for the _PLACEHOLDER_CONFIGS

These are skipped because they contain "FILL UP MANUALLY"

The intent of the test is to verify that all the legacy loads match the new load method for every config that is valid to load.

lbesnard (Collaborator):

yeah but that's my point.

later you have

    def test_all_configs_load_without_error(self):
        for config_path in self._loadable_configs():
            with self.subTest(config=config_path.name):
                DatasetConfig.load_from_cloud_optimised_directory(config_path.name)

which IMO should not be in the unit tests. This is more a job for pre-commit; see

entry: python3 aodn_cloud_optimised/bin/pydantic_precommit_hook.py --validate-configs aodn_cloud_optimised/config/dataset

and
https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/bin/pydantic_precommit_hook.py
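For context, the hook entry quoted above would typically be wired into .pre-commit-config.yaml along these lines. Only the entry line is taken from this thread; the surrounding keys are standard pre-commit boilerplate and may differ from the repo's actual file:

```yaml
repos:
  - repo: local
    hooks:
      - id: validate-dataset-configs
        name: Validate dataset configs with pydantic
        entry: python3 aodn_cloud_optimised/bin/pydantic_precommit_hook.py --validate-configs aodn_cloud_optimised/config/dataset
        language: system
        pass_filenames: false
```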

Comment thread aodn_cloud_optimised/bin/config/model/cluster_config.py Outdated
Comment thread aodn_cloud_optimised/bin/config/model/dataset_config.py


lbesnard and others added 3 commits April 10, 2026 15:56
…eck)

The bug fix from 075c557 corrected any('.parquet' in f for f in filter)
to '.parquet' in filter in generic_cloud_optimised_creation.py, but after
the refactor moved PathConfig to bin/config/model/path_config.py, the
old buggy code remained there. This commit ports the fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
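The bug that commit describes is easy to reproduce: iterating a string yields single characters, so the any() form can never match a multi-character substring. An illustration (not the project's actual code):

```python
def has_parquet_buggy(path_filter: str) -> bool:
    # Iterates the characters of the string, so ".parquet" in <one char>
    # is always False for any input.
    return any(".parquet" in f for f in path_filter)


def has_parquet_fixed(path_filter: str) -> bool:
    # The corrected form: a plain substring check on the whole string.
    return ".parquet" in path_filter
```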
…patibility

@model_validator(mode='after') on a classmethod-style function (cls, values)
is deprecated since Pydantic v2.12 and will be removed in v3.0. Changed
validate_cross_fields to an instance method using self, as required by the
Pydantic v2.12 API.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lbesnard lbesnard force-pushed the feature/dataset-config-refactor branch from e78df6f to 3311314 Compare April 10, 2026 06:24
lbesnard (Collaborator):

@thommodin see my latest comment above re testing all config files, which I don't think should happen as part of the unit tests. Anyway, I'll merge this PR as we need to move forward.

@lbesnard lbesnard merged commit 6daaa01 into main Apr 10, 2026
7 checks passed
@lbesnard lbesnard deleted the feature/dataset-config-refactor branch April 13, 2026 06:44