feature/dataset-config-refactor #256

Merged: lbesnard merged 12 commits into main from feature/dataset-config-refactor on Apr 10, 2026
Conversation

@thommodin (Contributor) commented Mar 2, 2026

This pull request updates dependency versions and improves configuration file handling in the codebase. The most significant changes are the upgrade of the h5py library, the addition of pydantic as a dependency, and a fix to the way configuration files are accessed using importlib.resources.

Dataset Config Refactor
Refactored DatasetConfig into separate files and added loading from disk instead of from the module:

  • Ability to load config from external sources
  • Improved readability and maintainability of the config pydantic classes.

Dependency updates:

  • Upgraded h5py from version 3.11.0 to 3.13.0 in pyproject.toml, which allows installation on macOS
  • The workaround for development on macOS is to upgrade h5py to 3.13 locally, then roll back before publication and let the unit tests run in CI/CD. Take special care around errors and ensure they are related to the nanosecond/microsecond timestamp issue.
  • Added pydantic>=2.12.5 to the dependencies in pyproject.toml for data validation and settings management.

Configuration handling improvements:

  • Refactored the import of importlib.resources to use as_file for safer and more correct file handling, and updated the context manager in load_variable_from_config to use as_file when accessing the common.json configuration file. This change improves compatibility and reliability when working with package resources.

Delete uv lock
uv.lock snuck into the main branch somehow(?)

Deleted it...

❯ git log --name-status -- uv.lock
commit aabf718 (HEAD -> feature/dataset-config-refactor, origin/feature/dataset-config-refactor)
Author: Tom Galindo <98626996+thommodin@users.noreply.github.com>
Date: Mon Mar 2 15:46:43 2026 +1100

init

D uv.lock

commit 686fe73
Author: karishma <karishma24khanna@gmail.com>
Date: Fri Jan 23 11:36:25 2026 +1100

AMSA Config File

A uv.lock

Copilot AI review requested due to automatic review settings March 2, 2026 04:50

# Validate
return DatasetConfig.model_validate(dataset_config)

thommodin (Contributor, Author):
load_from_path gives us the ability to load configuration from a path instead of from the module

lbesnard (Collaborator):

I'm trying to get my bearings again, I thought the existing load_dataset_config could load from a path already:

14:from aodn_cloud_optimised.lib.config import load_dataset_config
121:        dataset_ardc_netcdf_config = load_dataset_config(DATASET_CONFIG_NC_ARDC_JSON)
133:        dataset_soop_sst_netcdf_config = load_dataset_config(

thommodin (Contributor, Author):

You are correct.

When I opened this task I must have assumed there was tight coupling to loading from the aodn_cloud_optimised config dir; it actually exists in dataflow-orchestration here.

dataset_config = load_dataset_config(
    str(files("aodn_cloud_optimised.config.dataset").joinpath(dataset_config))
)

Anyway, the refactoring of all these classes into separate files makes this a lot easier to maintain and extend.

Attaching the load logic to the class and returning instances of the class from the load function are also improvements that:

  1. Make it easier to track and update load logic (now part of the class functionality)
  2. Allow extension later, e.g. loading config from external config stores such as Prefect or AWS.

Copilot AI left a comment

Pull request overview

This PR upgrades key Python dependencies and refactors dataset configuration validation into a dedicated aodn_cloud_optimised/bin/config package, while also fixing package-resource access to use importlib.resources.as_file for safer config file handling.

Changes:

  • Upgrade h5py and add pydantic as a first-class dependency.
  • Switch load_variable_from_config to use importlib.resources.as_file when reading common.json.
  • Extract the large, inline Pydantic config schema from generic_cloud_optimised_creation.py into separate modules under aodn_cloud_optimised/bin/config/.

Reviewed changes

Copilot reviewed 2 out of 17 changed files in this pull request and generated no comments.

Summary per file:
File Description
pyproject.toml Updates dependency pins and adds pydantic dependency.
aodn_cloud_optimised/lib/config.py Uses as_file() to safely access packaged common.json.
aodn_cloud_optimised/bin/generic_cloud_optimised_creation.py Removes inline schema classes and imports new config models.
aodn_cloud_optimised/bin/config/__init__.py Exposes new config models as a package.
aodn_cloud_optimised/bin/config/cluster_config.py Defines Pydantic model for cluster configuration.
aodn_cloud_optimised/bin/config/coiled_cluster_options.py Defines Pydantic model for Coiled cluster options.
aodn_cloud_optimised/bin/config/worker_options.py Defines Pydantic model for worker sizing options.
aodn_cloud_optimised/bin/config/ec2_cluster_options.py Defines Pydantic model for EC2 cluster options.
aodn_cloud_optimised/bin/config/ec2_adapt_options.py Defines Pydantic model for EC2 adapt (min/max) options.
aodn_cloud_optimised/bin/config/path_config.py Defines and validates dataset input path configuration.
aodn_cloud_optimised/bin/config/run_settings.py Defines and validates run-level settings and S3 option blocks.
aodn_cloud_optimised/bin/config/csv_config_model.py Validates mutually exclusive pandas vs polars CSV reader config.
aodn_cloud_optimised/bin/config/parquet_schema_transformation.py Validates parquet schema transformation blocks.
aodn_cloud_optimised/bin/config/zarr_schema_transformation.py Validates zarr schema transformation blocks.
aodn_cloud_optimised/bin/config/dataset_config.py Defines the top-level dataset config model and file-loading helpers.

thommodin (Contributor, Author):

@lbesnard I started work on that external config load ability, starting with a refactor of the DatasetConfig class.

Looks like it is failing some tests in the Actions runs but they don't fail locally, so I'm a bit confused. If you search on AssertionError you'll see that numbers appear to be read incorrectly all over the place; have you seen this before?

lbesnard (Collaborator) commented Mar 2, 2026

Refactored DatasetConfig into separate files and added loading from disk instead of from the module:

It could maybe be less confusing if all the pydantic models were in a separate folder, e.g. under config/pydantic.

lbesnard (Collaborator) commented Mar 2, 2026

Maybe try using https://github.com/nektos/act to help debug the failed tests? There are definitely a lot; this one is a bit weird:

self.assertEqual(parquet_dataset["timestamp"][0], 1293840000)
E       AssertionError: 1293840 != 1293840000

@thommodin changed the title from init to feature/dataset-config-refactor on Mar 3, 2026
thommodin (Contributor, Author):

@lbesnard I think I pinpointed the issue with the h5py 3.13 upgrade; it looks like the default timestamp export was in microseconds rather than nanoseconds in the new version, which caused the unit test failures.

I rolled h5py back to 3.11 and it looks like everything is fine now.
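For reference, the failing values in the earlier AssertionError differ by exactly a factor of 1000 (1293840000 seconds is 2011-01-01T00:00:00Z). A hedged sketch, not from this PR, of one way to normalise epoch values before comparing, so a unit change surfaces as an obvious datetime mismatch rather than a confusing integer difference:

```python
import datetime


def epoch_to_utc(value, unit="s"):
    """Convert an epoch value in the given unit to an aware UTC datetime."""
    scale = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}[unit]
    return datetime.datetime.fromtimestamp(value / scale, tz=datetime.timezone.utc)
```

Comparing datetimes this way makes tests resilient to a library changing its default export unit between versions.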

Also chucked the config models into a separate folder as suggested :)

codecov-commenter commented Mar 3, 2026

Codecov Report

❌ Patch coverage is 83.14607% with 135 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.48%. Comparing base (0ee3bfa) to head (1c23fa0).

Files with missing lines Patch % Lines
.../bin/config/model/parquet_schema_transformation.py 72.52% 50 Missing ⚠️
...dn_cloud_optimised/bin/config/model/path_config.py 72.44% 27 Missing ⚠️
...sed/bin/config/model/zarr_schema_transformation.py 80.35% 22 Missing ⚠️
...cloud_optimised/bin/config/model/dataset_config.py 80.55% 21 Missing ⚠️
...n_cloud_optimised/bin/config/model/run_settings.py 87.87% 8 Missing ⚠️
test_aodn_cloud_optimised/conftest.py 33.33% 4 Missing ⚠️
...oud_optimised/bin/config/model/csv_config_model.py 88.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #256      +/-   ##
==========================================
+ Coverage   65.00%   67.48%   +2.48%     
==========================================
  Files          28       41      +13     
  Lines        5404     5678     +274     
==========================================
+ Hits         3513     3832     +319     
+ Misses       1891     1846      -45     


),
)
# TODO: Implement aws_opendata_registry model for validation
aws_opendata_registry: dict | None = pydantic.Field(
thommodin (Contributor, Author):

Added aws_opendata_registry

There are some checks for empty values, but they now skip aws_opendata_registry so they pass.

parent_config: str | None = pydantic.Field(
default=None,
description="Name of the parent config to load on top of. May be recursive.",
)
thommodin (Contributor, Author):

Also logger_name, handler_class and parent_config are all added with defaults. This allows the DatasetConfig class to pick them up if they exist.

@@ -75,11 +75,6 @@
}
}
},
thommodin (Contributor, Author):

The spatial_extent is properly captured elsewhere in the config now

@@ -93,11 +93,9 @@ def load_variable_from_config(variable_name) -> str:
:raises KeyError: If the variable is not found in the configuration file.
"""
# Obtain the file path using the context manager
thommodin (Contributor, Author):

Could consider using the import directly, e.g.:

pathlib.Path(aodn_cloud_optimised.config.__path__[0]) / "common.json"

Kind of cleaner than the importlib files/as_file approach.

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 24 changed files in this pull request and generated 1 comment.


You can also share your feedback on Copilot code review. Take the survey.

Comment thread test_aodn_cloud_optimised/conftest.py
lbesnard (Collaborator):

@copilot can you check that nothing got missed from the previous aodn_cloud_optimised/bin/generic_cloud_optimised_creation.py file when refactored and moved to the various config/models/*.py . If the new class/methods/functions aren't a direct copy/paste, can you highlight it, explain the changes, and tell me which line and file to look at manually?

Copilot AI commented Mar 16, 2026

@lbesnard I've opened a new pull request, #258, to work on those changes. Once the pull request is ready, I'll request review from you.

Comment thread pyproject.toml
Comment thread .vscode/settings.json Outdated
lbesnard (Collaborator):

what is that file for? where is it used?

thommodin (Contributor, Author):

This routes the path correctly for udunits on macOS.


# Config files that intentionally contain "FILL UP MANUALLY" placeholders and
# are expected to fail DatasetConfig validation — excluded from the compat sweep.
_PLACEHOLDER_CONFIGS = {
lbesnard (Collaborator):

From what I understand here, you're running the tests directly on the configs available in config/dataset.
The load_from_cloud_optimised_directory function should be mocked, or point to test configs available in the test/resources path.
The config files you're pointing to right now may not have placeholders in the future.

thommodin (Contributor, Author):

Currently this dynamically loads all config in the directory except for the _PLACEHOLDER_CONFIGS

These are skipped because they contain "FILL UP MANUALLY"

The intent of the test is to verify that all the legacy loads match the new load method for every config that is valid to load.

lbesnard (Collaborator):

yeah but that's my point.

later you have

    def test_all_configs_load_without_error(self):
        for config_path in self._loadable_configs():
            with self.subTest(config=config_path.name):
                DatasetConfig.load_from_cloud_optimised_directory(config_path.name)

which IMO should not be in the unit tests. This is more a job for pre-commit; see

entry: python3 aodn_cloud_optimised/bin/pydantic_precommit_hook.py --validate-configs aodn_cloud_optimised/config/dataset

and
https://github.com/aodn/aodn_cloud_optimised/blob/main/aodn_cloud_optimised/bin/pydantic_precommit_hook.py
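For context, the hook entry quoted above would typically be wired into .pre-commit-config.yaml along these lines. Only the entry line is taken from this thread; the surrounding keys are standard pre-commit boilerplate and may differ from the repo's actual file:

```yaml
repos:
  - repo: local
    hooks:
      - id: validate-dataset-configs
        name: Validate dataset configs with pydantic
        entry: python3 aodn_cloud_optimised/bin/pydantic_precommit_hook.py --validate-configs aodn_cloud_optimised/config/dataset
        language: system
        pass_filenames: false
```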

Comment thread aodn_cloud_optimised/bin/config/model/cluster_config.py Outdated
Comment thread aodn_cloud_optimised/bin/config/model/dataset_config.py


lbesnard and others added 3 commits April 10, 2026 15:56
…eck)

The bug fix from 075c557 corrected any('.parquet' in f for f in filter)
to '.parquet' in filter in generic_cloud_optimised_creation.py, but after
the refactor moved PathConfig to bin/config/model/path_config.py, the
old buggy code remained there. This commit ports the fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
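The bug that commit describes is easy to reproduce: iterating a string yields single characters, so the any() form can never match a multi-character substring. An illustration (not the project's actual code):

```python
def has_parquet_buggy(path_filter: str) -> bool:
    # Iterates the characters of the string, so ".parquet" in <one char>
    # is always False for any input.
    return any(".parquet" in f for f in path_filter)


def has_parquet_fixed(path_filter: str) -> bool:
    # The corrected form: a plain substring check on the whole string.
    return ".parquet" in path_filter
```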
…patibility

@model_validator(mode='after') on a classmethod-style function (cls, values)
is deprecated since Pydantic v2.12 and will be removed in v3.0. Changed
validate_cross_fields to an instance method using self, as required by the
Pydantic v2.12 API.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lbesnard lbesnard force-pushed the feature/dataset-config-refactor branch from e78df6f to 3311314 Compare April 10, 2026 06:24
lbesnard (Collaborator):

@thommodin see my latest comment above re testing all config files, which I don't think should happen as part of the unit tests. Anyway, I'll merge this PR as we need to move forward.

@lbesnard lbesnard merged commit 6daaa01 into main Apr 10, 2026
7 checks passed
@lbesnard lbesnard deleted the feature/dataset-config-refactor branch April 13, 2026 06:44