diff --git a/README.md b/README.md index 734f2280..b7bf745e 100644 --- a/README.md +++ b/README.md @@ -59,17 +59,21 @@ git clone https://github.com/ROCm/MAD.git && cd MAD # Discover available models madengine discover --tags dummy -# Run locally +# Run locally (full workflow: discover/build/run as configured by the model) +madengine run --tags dummy + +# Or with explicit configuration madengine run --tags dummy \ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' ``` +> **Note**: For build operations, `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` if not specified. For production deployments or non-AMD/Ubuntu environments, explicitly specify these values. + If ROCm is not installed under `/opt/rocm` (e.g. Rock or pip install), use `--rocm-path` or set `ROCM_PATH`: ```bash -madengine run --tags dummy --rocm-path /path/to/rocm \ - --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' -# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ... +madengine run --tags dummy --rocm-path /path/to/rocm +# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy ``` **Results:** Performance data is written to `perf.csv` (and optionally `perf_entry.csv`). The file is created automatically if missing. Failed runs (including pre-run setup failures) are recorded with status `FAILURE` so every attempted model appears in the table. See [Exit Codes](docs/cli-reference.md#exit-codes) for CI/script usage. @@ -92,13 +96,14 @@ madengine provides five main commands for model automation and benchmarking: # Discover models madengine discover --tags dummy -# Build image -madengine build --tags dummy \ - --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +# Build image (uses AMD/UBUNTU defaults) +madengine build --tags dummy # Run model -madengine run --tags dummy \ - --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +madengine run --tags dummy + +# For non-AMD/Ubuntu environments, specify explicitly: +# madengine build --tags dummy --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' # Generate report madengine report to-html --csv-file perf_entry.csv diff --git a/docs/cli-reference.md b/docs/cli-reference.md index 7ccbb672..bfd1ca51 100644 --- a/docs/cli-reference.md +++ b/docs/cli-reference.md @@ -144,7 +144,30 @@ madengine build --tags model \ madengine build --tags model --live-output --verbose ``` -**Required Context for Build:** +**Default Values:** + +The build command applies the following defaults if not specified: + +- **gpu_vendor**: `AMD` +- **guest_os**: `UBUNTU` + +Example with defaults: +```bash +# Equivalent to providing {"gpu_vendor": "AMD", "guest_os": "UBUNTU"} +madengine build --tags dummy +``` + +You will see a message indicating which defaults were applied: + +``` +ā„¹ļø Using default values for build configuration: + • gpu_vendor: AMD (default) + • guest_os: UBUNTU (default) + +šŸ’” To customize, use --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' +``` + +**Supported Values:** - `gpu_vendor`: `"AMD"` or `"NVIDIA"` - `guest_os`: `"UBUNTU"` or `"CENTOS"` diff --git a/docs/configuration.md b/docs/configuration.md index 7ed01662..c1d4d13d 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -26,17 +26,66 @@ madengine run --tags model --additional-context-file config.json } ``` -## Basic Configuration +## Default Configuration Values -### Required for Local Execution +madengine provides sensible defaults for common AMD/Ubuntu workflows: -```json -{ - "gpu_vendor": "AMD", - "guest_os": "UBUNTU" -} +| Field | Default Value | Customization | +|-------|---------------|---------------| +| `gpu_vendor` | `AMD` | Set to `NVIDIA` for NVIDIA GPUs | +| `guest_os` | `UBUNTU` | Set to `CENTOS` for CentOS containers | + +### When Defaults Apply + +Defaults are applied during the **build** command when fields are not explicitly provided: + +```bash +# Uses defaults: {"gpu_vendor": "AMD", "guest_os": "UBUNTU"} +madengine build --tags model + +# Explicit override +madengine build --tags model \ + --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' ``` +When defaults are applied, you'll see an informative message: + +``` +ā„¹ļø Using default values for build configuration: + • gpu_vendor: AMD (default) + • guest_os: UBUNTU (default) + +šŸ’” To customize, use --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' +``` + +### Partial Configuration + +You can provide one field and let the other default: + +```bash +# Override only gpu_vendor (guest_os defaults to UBUNTU) +madengine build --tags model \ + --additional-context '{"gpu_vendor": "NVIDIA"}' + +# Override only guest_os (gpu_vendor defaults to AMD) +madengine build --tags model \ + --additional-context '{"guest_os": "CENTOS"}' +``` + +### Production Recommendations + +For production deployments: +- āœ… **DO** explicitly specify all configuration values +- āœ… **DO** use configuration files for reproducibility +- āš ļø **AVOID** relying on defaults in automated workflows + +### Run Command Behavior + +The **run** command does NOT require these values because it can detect GPU vendor at runtime. +Defaults only apply to the **build** command where Dockerfile selection requires them. + +## Basic Configuration + **gpu_vendor** (case-insensitive): - `"AMD"` - AMD ROCm GPUs - `"NVIDIA"` - NVIDIA CUDA GPUs diff --git a/docs/usage.md b/docs/usage.md index 8e846c7d..f29f8426 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -24,11 +24,16 @@ pip install git+https://github.com/ROCm/madengine.git # Discover models madengine discover --tags dummy -# Run locally +# Run locally (full workflow: discover/build/run as configured by the model) +madengine run --tags dummy + +# Or with explicit configuration madengine run --tags dummy \ --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' ``` +> **Note**: `gpu_vendor` defaults to `AMD` and `guest_os` defaults to `UBUNTU` for build operations. For production or non-AMD/Ubuntu environments, specify these values explicitly. + Results are saved to `perf_entry.csv`. ## Commands Overview @@ -51,13 +56,14 @@ For complete command options and detailed examples, see **[CLI Command Reference # Discover models madengine discover --tags dummy -# Build image -madengine build --tags model \ - --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +# Build image (uses AMD/UBUNTU defaults) +madengine build --tags model # Run model -madengine run --tags model \ - --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' +madengine run --tags model + +# For NVIDIA or other configurations, specify explicitly: +# madengine build --tags model --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' # Generate HTML report madengine report to-html --csv-file perf_entry.csv diff --git a/src/madengine/cli/commands/build.py b/src/madengine/cli/commands/build.py index b7d9a1c2..5b10a65c 100644 --- a/src/madengine/cli/commands/build.py +++ b/src/madengine/cli/commands/build.py @@ -173,16 +173,18 @@ def build( ) try: - # Validate additional context - validate_additional_context(additional_context, additional_context_file) + # Validate additional context and merge file + CLI; defaults wired into orchestrator + validated_context = validate_additional_context( + additional_context, additional_context_file + ) # Create arguments object args = create_args_namespace( tags=effective_tags, target_archs=target_archs, registry=registry, - additional_context=additional_context, - additional_context_file=additional_context_file, + additional_context=repr(validated_context), + additional_context_file=None, clean_docker_cache=clean_docker_cache, manifest_output=manifest_output, live_output=live_output, @@ -221,15 +223,8 @@ def build( # Handle batch manifest post-processing if batch_data: with console.status("Processing batch manifest..."): - additional_context_dict = getattr(args, "additional_context", None) - if isinstance(additional_context_dict, str): - additional_context_dict = json.loads(additional_context_dict) - guest_os = ( - additional_context_dict.get("guest_os") if additional_context_dict else None - ) - gpu_vendor = ( - additional_context_dict.get("gpu_vendor") if additional_context_dict else None - ) + guest_os = validated_context.get("guest_os") + gpu_vendor = validated_context.get("gpu_vendor") process_batch_manifest_entries( batch_data, manifest_output, registry, guest_os, gpu_vendor ) diff --git a/src/madengine/cli/commands/run.py b/src/madengine/cli/commands/run.py index 3b4f035c..a2b174f2 100644 --- a/src/madengine/cli/commands/run.py +++ b/src/madengine/cli/commands/run.py @@ -5,7 +5,6 @@ Copyright (c) Advanced Micro Devices, Inc. All rights reserved. """ -import ast import json import os from typing import List, Optional @@ -43,6 +42,11 @@ display_results_table, display_performance_table, ) +from ..validators import ( + additional_context_needs_cli_validation, + finalize_additional_context_dict, + merge_additional_context_from_sources, +) def run( @@ -167,33 +171,16 @@ def run( ) raise typer.Exit(ExitCode.INVALID_ARGS) - # When both --additional-context-file and --additional-context are provided, - # load file first then overlay CLI (CLI overrides file). + # Merge file + CLI (CLI wins), then validate (same rules as `build`) when non-empty. effective_additional_context = additional_context effective_additional_context_file = additional_context_file - if additional_context_file and additional_context != "{}": - merged = {} - if os.path.exists(additional_context_file): - try: - with open(additional_context_file, "r") as f: - merged = json.load(f) - except json.JSONDecodeError: - console.print( - f"āŒ [red]Invalid JSON format in {additional_context_file}[/red]" - ) - raise typer.Exit(ExitCode.INVALID_ARGS) - try: - try: - cli_context = json.loads(additional_context) - except json.JSONDecodeError: - cli_context = ast.literal_eval(additional_context) - if isinstance(cli_context, dict): - merged.update(cli_context) - except (json.JSONDecodeError, ValueError, SyntaxError): - console.print( - f"āŒ [red]Invalid additional_context format: {additional_context}[/red]" - ) - raise typer.Exit(ExitCode.INVALID_ARGS) + if additional_context_needs_cli_validation( + additional_context, additional_context_file + ): + merged, _ = merge_additional_context_from_sources( + additional_context, additional_context_file + ) + finalize_additional_context_dict(merged) effective_additional_context = repr(merged) effective_additional_context_file = None @@ -296,34 +283,7 @@ def run( raise typer.Exit(ExitCode.RUN_FAILURE) else: - # Check if MAD_CONTAINER_IMAGE is provided - this enables local image mode - additional_context_dict = {} - try: - if additional_context and additional_context != "{}": - additional_context_dict = json.loads(additional_context) - except json.JSONDecodeError: - try: - # Try parsing as Python dict literal - additional_context_dict = ast.literal_eval(additional_context) - except (ValueError, SyntaxError): - console.print( - f"āŒ [red]Invalid additional_context format: {additional_context}[/red]" - ) - raise typer.Exit(ExitCode.INVALID_ARGS) - - # Load additional context from file if provided - if additional_context_file and os.path.exists(additional_context_file): - try: - with open(additional_context_file, 'r') as f: - file_context = json.load(f) - additional_context_dict.update(file_context) - except json.JSONDecodeError: - console.print( - f"āŒ [red]Invalid JSON format in {additional_context_file}[/red]" - ) - raise typer.Exit(ExitCode.INVALID_ARGS) - - # MAD_CONTAINER_IMAGE handling is now done in RunOrchestrator + # MAD_CONTAINER_IMAGE handling is done in RunOrchestrator # Full workflow (may include MAD_CONTAINER_IMAGE mode) if manifest_file: console.print( diff --git a/src/madengine/cli/validators.py b/src/madengine/cli/validators.py index 6bfc7bdb..d2d0a288 100644 --- a/src/madengine/cli/validators.py +++ b/src/madengine/cli/validators.py @@ -5,16 +5,21 @@ Copyright (c) Advanced Micro Devices, Inc. All rights reserved. """ +import ast import glob import json import os -from typing import Dict, List, Optional +from typing import Any, Dict, List, Optional, Tuple import typer from rich.console import Console -from rich.panel import Panel from madengine.utils.discover_models import DiscoverModels +from madengine.core.additional_context_defaults import ( + DEFAULT_GPU_VENDOR, + DEFAULT_GUEST_OS, + apply_build_context_defaults, +) from .constants import ExitCode, VALID_GPU_VENDORS, VALID_GUEST_OS from .utils import create_args_namespace @@ -22,31 +27,59 @@ # Initialize Rich console console = Console() +_EXAMPLE_ADDITIONAL_CONTEXT = ( + '--additional-context \'{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}\'' +) -def validate_additional_context( - additional_context: str, - additional_context_file: Optional[str] = None, -) -> Dict[str, str]: - """ - Validate and parse additional context. - Args: - additional_context: JSON string containing additional context - additional_context_file: Optional file containing additional context +def parse_additional_context_cli_string(additional_context: str) -> Dict[str, Any]: + """ + Parse --additional-context string: JSON first, then Python literal (single-quoted dicts). + """ + if not additional_context or additional_context.strip() == "{}": + return {} + try: + parsed = json.loads(additional_context) + except json.JSONDecodeError: + try: + parsed = ast.literal_eval(additional_context) + except (ValueError, SyntaxError) as e: + console.print( + f"āŒ Invalid additional_context format: [red]{e}[/red]" + ) + console.print( + "šŸ’” Use JSON or a Python dict literal, e.g. " + + _EXAMPLE_ADDITIONAL_CONTEXT + ) + raise typer.Exit(ExitCode.INVALID_ARGS) + if not isinstance(parsed, dict): + console.print( + "āŒ additional_context must be a JSON object at the top level, not a list or scalar." + ) + raise typer.Exit(ExitCode.INVALID_ARGS) + return parsed - Returns: - Dict containing parsed additional context - Raises: - typer.Exit: If validation fails +def merge_additional_context_from_sources( + additional_context: str, + additional_context_file: Optional[str], +) -> Tuple[Dict[str, Any], bool]: """ - context = {} + Load file first, then overlay CLI string (CLI wins). - # Load from file first + Returns: + (merged dict, loaded_from_cli_non_empty) for messaging. + """ + context: Dict[str, Any] = {} if additional_context_file: try: with open(additional_context_file, "r") as f: context = json.load(f) + if not isinstance(context, dict): + console.print( + "āŒ additional-context file must contain a JSON object at the top level." + ) + raise typer.Exit(ExitCode.INVALID_ARGS) console.print( f"āœ… Loaded additional context from file: [cyan]{additional_context_file}[/cyan]" ) @@ -54,54 +87,102 @@ def validate_additional_context( console.print(f"āŒ Failed to load additional context file: [red]{e}[/red]") raise typer.Exit(ExitCode.INVALID_ARGS) - # Parse string context (overrides file) - if additional_context and additional_context != "{}": - try: - string_context = json.loads(additional_context) - context.update(string_context) - console.print("āœ… Loaded additional context from command line") - except json.JSONDecodeError as e: - console.print(f"āŒ Invalid JSON in additional context: [red]{e}[/red]") - console.print("šŸ’” Please provide valid JSON format") - raise typer.Exit(ExitCode.INVALID_ARGS) + cli_non_empty = bool(additional_context and additional_context.strip() != "{}") + if cli_non_empty: + string_context = parse_additional_context_cli_string(additional_context) + context.update(string_context) + console.print("āœ… Loaded additional context from command line") - if not context: - console.print("āŒ [red]No additional context provided[/red]") - console.print( - "šŸ’” For build operations, you must provide additional context with gpu_vendor and guest_os" - ) + return context, cli_non_empty + + +def _fail_structure(key: str, expected: str) -> None: + console.print( + f"āŒ Invalid additional context: key [red]{key}[/red] must be {expected}." + ) + console.print(f"šŸ’” Example: {_EXAMPLE_ADDITIONAL_CONTEXT}") + raise typer.Exit(ExitCode.INVALID_ARGS) + + +def validate_additional_context_structure(context: Dict[str, Any]) -> None: + """Validate types of known keys after defaults are applied.""" + if "docker_build_arg" in context: + v = context["docker_build_arg"] + if not isinstance(v, dict): + _fail_structure("docker_build_arg", "an object (string values)") + for bk, bv in v.items(): + if not isinstance(bk, str): + _fail_structure("docker_build_arg", "an object with string keys") + if not isinstance(bv, (str, int, float, bool)): + _fail_structure( + f"docker_build_arg['{bk}']", + "a string, number, or boolean (Docker build-arg values)", + ) - # Show example usage - example_panel = Panel( - """[bold cyan]Example usage:[/bold cyan] -madengine build --tags dummy --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}' + if "docker_env_vars" in context and not isinstance( + context["docker_env_vars"], dict + ): + _fail_structure("docker_env_vars", "an object") -[bold cyan]Or using a file:[/bold cyan] -madengine build --tags dummy --additional-context-file context.json + if "docker_mounts" in context and not isinstance(context["docker_mounts"], dict): + _fail_structure("docker_mounts", "an object") -[bold cyan]Required fields:[/bold cyan] -• gpu_vendor: [green]AMD[/green], [green]NVIDIA[/green] -• guest_os: [green]UBUNTU[/green], [green]CENTOS[/green]""", - title="Additional Context Help", - border_style="blue", - ) - console.print(example_panel) - raise typer.Exit(ExitCode.INVALID_ARGS) + if "env_vars" in context and not isinstance(context["env_vars"], dict): + _fail_structure("env_vars", "an object") - # Validate required fields - required_fields = ["gpu_vendor", "guest_os"] - missing_fields = [field for field in required_fields if field not in context] + if "tools" in context and not isinstance(context["tools"], list): + _fail_structure("tools", "an array") - if missing_fields: - console.print( - f"āŒ Missing required fields: [red]{', '.join(missing_fields)}[/red]" - ) - console.print( - "šŸ’” Both gpu_vendor and guest_os are required for build operations" - ) - raise typer.Exit(ExitCode.INVALID_ARGS) + if "pre_scripts" in context and not isinstance(context["pre_scripts"], list): + _fail_structure("pre_scripts", "an array") + + for nest in ( + "k8s", + "slurm", + "kubernetes", + "distributed", + "vllm", + "deployment_config", + "deploy", + ): + if nest in context and not isinstance(context[nest], dict): + _fail_structure(nest, "an object") - # Validate gpu_vendor + if "docker_gpus" in context and not isinstance(context["docker_gpus"], str): + _fail_structure("docker_gpus", "a string") + + if "MAD_CONTAINER_IMAGE" in context and not isinstance( + context["MAD_CONTAINER_IMAGE"], str + ): + _fail_structure("MAD_CONTAINER_IMAGE", "a string") + + if "timeout" in context and not isinstance( + context["timeout"], (int, float, type(None)) + ): + _fail_structure("timeout", "a number") + + if "debug" in context and not isinstance(context["debug"], (bool, type(None))): + _fail_structure("debug", "a boolean") + + if "gpu_vendor" in context and not isinstance(context["gpu_vendor"], str): + _fail_structure("gpu_vendor", "a string") + + if "guest_os" in context and not isinstance(context["guest_os"], str): + _fail_structure("guest_os", "a string") + + +def _normalize_docker_build_arg_values(context: Dict[str, Any]) -> None: + dba = context.get("docker_build_arg") + if not isinstance(dba, dict): + return + for k in list(dba.keys()): + v = dba[k] + if not isinstance(v, str): + dba[k] = str(v) + + +def _validate_gpu_vendor_guest_after_defaults(context: Dict[str, Any]) -> None: + """Validate gpu_vendor / guest_os enums (expects keys present after defaults).""" gpu_vendor = context["gpu_vendor"].upper() if gpu_vendor not in VALID_GPU_VENDORS: console.print(f"āŒ Invalid gpu_vendor: [red]{context['gpu_vendor']}[/red]") @@ -110,7 +191,6 @@ def validate_additional_context( ) raise typer.Exit(ExitCode.INVALID_ARGS) - # Validate guest_os guest_os = context["guest_os"].upper() if guest_os not in VALID_GUEST_OS: console.print(f"āŒ Invalid guest_os: [red]{context['guest_os']}[/red]") @@ -119,9 +199,81 @@ def validate_additional_context( ) raise typer.Exit(ExitCode.INVALID_ARGS) + # Canonical form for Context.filter() and downstream (exact string match) + context["gpu_vendor"] = gpu_vendor + context["guest_os"] = guest_os + console.print( f"āœ… Context validated: [green]{gpu_vendor}[/green] + [green]{guest_os}[/green]" ) + + +def finalize_additional_context_dict( + context: Dict[str, Any], + *, + print_defaults_banner: bool = True, +) -> Dict[str, Any]: + """ + Apply gpu_vendor/guest_os defaults, validate structure and enums. + Mutates context in place. + """ + missing_gpu = "gpu_vendor" not in context + missing_guest = "guest_os" not in context + apply_build_context_defaults(context) + defaults_applied = [] + if missing_gpu: + defaults_applied.append(("gpu_vendor", DEFAULT_GPU_VENDOR)) + if missing_guest: + defaults_applied.append(("guest_os", DEFAULT_GUEST_OS)) + + if print_defaults_banner and defaults_applied: + console.print("\nā„¹ļø [cyan]Using default values for build configuration:[/cyan]") + for field, value in defaults_applied: + console.print(f" • {field}: [green]{value}[/green] (default)") + console.print( + "\nšŸ’” [dim]To customize, use --additional-context " + '\'{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}\'[/dim]\n' + ) + + validate_additional_context_structure(context) + _normalize_docker_build_arg_values(context) + _validate_gpu_vendor_guest_after_defaults(context) + return context + + +def additional_context_needs_cli_validation( + additional_context: str, + additional_context_file: Optional[str], +) -> bool: + """True when the user supplied a non-empty context (file and/or CLI string).""" + if additional_context_file: + return True + if additional_context and additional_context.strip() != "{}": + return True + return False + + +def validate_additional_context( + additional_context: str, + additional_context_file: Optional[str] = None, +) -> Dict[str, Any]: + """ + Validate and parse additional context. + + Args: + additional_context: JSON string containing additional context + additional_context_file: Optional file containing additional context + + Returns: + Dict containing parsed additional context + + Raises: + typer.Exit: If validation fails + """ + context, _ = merge_additional_context_from_sources( + additional_context, additional_context_file + ) + finalize_additional_context_dict(context) return context @@ -251,7 +403,11 @@ def process_batch_manifest_entries( f"No Dockerfile found for {dockerfile_specified}" ) else: - dockerfile_matched = dockerfile_matched_list[0].split("/")[-1].replace(".Dockerfile", "") + dockerfile_matched = ( + dockerfile_matched_list[0] + .split("/")[-1] + .replace(".Dockerfile", "") + ) # Create a synthetic image name for this model synthetic_image_name = f"ci-{model_name}_{dockerfile_matched}" @@ -275,12 +431,16 @@ def process_batch_manifest_entries( } # Add to built_models - include all discovered model fields - model_entry = model_info.copy() # Start with all fields from discovered model + model_entry = ( + model_info.copy() + ) # Start with all fields from discovered model # Ensure minimum required fields have fallback values model_entry.setdefault("name", model_name) model_entry.setdefault("dockerfile", f"docker/{model_name}") - model_entry.setdefault("scripts", f"scripts/{model_name}/run.sh") + model_entry.setdefault( + "scripts", f"scripts/{model_name}/run.sh" + ) model_entry.setdefault("n_gpus", "1") model_entry.setdefault("owner", "") model_entry.setdefault("training_precision", "") @@ -288,7 +448,9 @@ def process_batch_manifest_entries( model_entry.setdefault("args", "") model_entry.setdefault("cred", "") - build_manifest["built_models"][synthetic_image_name] = model_entry + build_manifest["built_models"][ + synthetic_image_name + ] = model_entry break except Exception as e: @@ -324,4 +486,3 @@ def process_batch_manifest_entries( console.print( f"āœ… Added entries for all models from batch manifest to {manifest_output}" ) - diff --git a/src/madengine/core/additional_context_defaults.py b/src/madengine/core/additional_context_defaults.py new file mode 100644 index 00000000..7fcab260 --- /dev/null +++ b/src/madengine/core/additional_context_defaults.py @@ -0,0 +1,21 @@ +#!/usr/bin/env python3 +""" +Default gpu_vendor / guest_os for build-time additional context. + +Used by CLI validation and BuildOrchestrator so Dockerfile filtering (Context.filter) +sees the same keys as the validated build defaults. +""" + +from typing import Any, Dict + +DEFAULT_GPU_VENDOR = "AMD" +DEFAULT_GUEST_OS = "UBUNTU" + + +def apply_build_context_defaults(ctx: Dict[str, Any]) -> Dict[str, Any]: + """Fill missing gpu_vendor and guest_os. Mutates and returns ctx.""" + if "gpu_vendor" not in ctx: + ctx["gpu_vendor"] = DEFAULT_GPU_VENDOR + if "guest_os" not in ctx: + ctx["guest_os"] = DEFAULT_GUEST_OS + return ctx diff --git a/src/madengine/core/context.py b/src/madengine/core/context.py index 836ab81e..24763588 100644 --- a/src/madengine/core/context.py +++ b/src/madengine/core/context.py @@ -170,10 +170,7 @@ def init_build_context(self) -> None: # Don't detect GPU-specific contexts in build-only mode # These should be provided via additional_context if needed for build args - if "MAD_SYSTEM_GPU_ARCHITECTURE" not in self.ctx.get("docker_build_arg", {}): - print( - "Info: MAD_SYSTEM_GPU_ARCHITECTURE not provided - should be set via --additional-context for GPU-specific builds" - ) + # (GPU arch guidance is emitted in BuildOrchestrator after model/Dockerfile discovery.) # Don't initialize NUMA balancing check for build-only nodes # This is runtime-specific and should be handled on execution nodes diff --git a/src/madengine/execution/docker_builder.py b/src/madengine/execution/docker_builder.py index 6727ffda..5e0c78d7 100644 --- a/src/madengine/execution/docker_builder.py +++ b/src/madengine/execution/docker_builder.py @@ -693,7 +693,11 @@ def _build_model_single_arch( if registry: try: registry_image = self._create_registry_image_name( - build_info["docker_image"], registry, batch_build_metadata, model_info + build_info["docker_image"], + registry, + batch_build_metadata, + model_info, + credentials, ) self.push_image(build_info["docker_image"], registry, credentials, registry_image) build_info["registry_image"] = registry_image @@ -780,28 +784,29 @@ def _create_base_image_name(self, model_info: typing.Dict, dockerfile: str) -> s return f"ci-{model_info['name']}_{model_info['name']}.{context_suffix}" def _create_registry_image_name( - self, - image_name: str, - registry: str, - batch_build_metadata: typing.Optional[dict], - model_info: typing.Dict + self, + image_name: str, + registry: str, + batch_build_metadata: typing.Optional[dict], + model_info: typing.Dict, + credentials: typing.Optional[dict] = None, ) -> str: """Create registry image name.""" if batch_build_metadata and model_info["name"] in batch_build_metadata: meta = batch_build_metadata[model_info["name"]] if meta.get("registry_image"): return meta["registry_image"] - - # Default registry naming - return self._determine_registry_image_name(image_name, registry) + + return self._determine_registry_image_name(image_name, registry, credentials) def _create_arch_registry_image_name( - self, - image_name: str, - gpu_arch: str, - registry: str, - batch_build_metadata: typing.Optional[dict], - model_info: typing.Dict + self, + image_name: str, + gpu_arch: str, + registry: str, + batch_build_metadata: typing.Optional[dict], + model_info: typing.Dict, + credentials: typing.Optional[dict] = None, ) -> str: """Create architecture-specific registry image name.""" # For multi-arch builds, add architecture to the tag @@ -810,9 +815,10 @@ def _create_arch_registry_image_name( if meta.get("registry_image"): # Append architecture to existing registry image return f"{meta['registry_image']}_{gpu_arch}" - - # Default arch-specific registry naming - base_registry_name = self._determine_registry_image_name(image_name, registry) + + base_registry_name = self._determine_registry_image_name( + image_name, registry, credentials + ) return f"{base_registry_name}" # Architecture already in image_name def _determine_registry_image_name( @@ -913,74 +919,3 @@ def _build_model_for_arch( arch_results.append(build_info) return arch_results - - def _build_model_single_arch( - self, - model_info: typing.Dict, - credentials: typing.Dict, - clean_cache: bool, - registry: str, - phase_suffix: str, - batch_build_metadata: typing.Optional[dict] - ) -> typing.List[typing.Dict]: - """Build model using existing single architecture flow.""" - - # Find dockerfiles for this model - dockerfiles = self._get_dockerfiles_for_model(model_info) - - results = [] - for dockerfile in dockerfiles: - build_info = self.build_image( - model_info, - dockerfile, - credentials, - clean_cache, - phase_suffix - ) - - # Extract GPU architecture from build args or context for manifest - gpu_arch = self._get_effective_gpu_architecture(model_info, dockerfile) - if gpu_arch: - build_info["gpu_architecture"] = gpu_arch - - # Handle registry push (existing logic) - if registry: - registry_image = self._determine_registry_image_name( - build_info["docker_image"], registry, credentials - ) - try: - self.push_image(build_info["docker_image"], registry, credentials, registry_image) - build_info["registry_image"] = registry_image - except Exception as e: - build_info["push_error"] = str(e) - - results.append(build_info) - - return results - - def _get_effective_gpu_architecture(self, model_info: typing.Dict, dockerfile_path: str) -> str: - """Get effective GPU architecture for single arch builds.""" - # Check if MAD_SYSTEM_GPU_ARCHITECTURE is in build args from additional_context - if ("docker_build_arg" in self.context.ctx and - "MAD_SYSTEM_GPU_ARCHITECTURE" in self.context.ctx["docker_build_arg"]): - return self.context.ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] - - # Try to extract from Dockerfile defaults - try: - with open(dockerfile_path, 'r') as f: - content = f.read() - - # Look for ARG or ENV declarations - patterns = [ - r"ARG\s+MAD_SYSTEM_GPU_ARCHITECTURE=([^\s\n]+)", - r"ENV\s+MAD_SYSTEM_GPU_ARCHITECTURE=([^\s\n]+)" - ] - - for pattern in patterns: - match = re.search(pattern, content, re.IGNORECASE) - if match: - return match.group(1).strip('"\'') - except Exception: - pass - - return None diff --git a/src/madengine/execution/dockerfile_utils.py b/src/madengine/execution/dockerfile_utils.py index 7a7a1bf9..c8392b5a 100644 --- a/src/madengine/execution/dockerfile_utils.py +++ b/src/madengine/execution/dockerfile_utils.py @@ -100,6 +100,55 @@ def is_target_arch_compatible_with_variable( return True +def dockerfile_requires_explicit_mad_arch_build_arg(dockerfile_content: str) -> bool: + """ + True when the Dockerfile declares ARG MAD_SYSTEM_GPU_ARCHITECTURE without a + non-empty default (user must pass --build-arg or rely on a non-empty ARG/ENV default). + + If the variable is not declared as ARG, returns False. If any ARG line gives a + non-empty default, returns False (Dockerfile supplies a default). + """ + found_arg = False + has_nonempty_default = False + has_bare_or_empty_default = False + for raw_line in dockerfile_content.splitlines(): + line = raw_line.strip() + if not line or line.startswith("#"): + continue + upper = line.upper() + if not upper.startswith("ARG") or "MAD_SYSTEM_GPU_ARCHITECTURE" not in upper: + continue + found_arg = True + m = re.match( + r"ARG\s+MAD_SYSTEM_GPU_ARCHITECTURE\s*=\s*(\S+)", + line, + re.IGNORECASE, + ) + if m and m.group(1).strip("\"'"): + has_nonempty_default = True + continue + m_bare = re.match( + r"ARG\s+MAD_SYSTEM_GPU_ARCHITECTURE\s*$", + line, + re.IGNORECASE, + ) + if m_bare: + has_bare_or_empty_default = True + continue + m_empty = re.match( + r"ARG\s+MAD_SYSTEM_GPU_ARCHITECTURE\s*=\s*$", + line, + re.IGNORECASE, + ) + if m_empty: + has_bare_or_empty_default = True + if not found_arg: + return False + if has_nonempty_default: + return False + return has_bare_or_empty_default + + def is_compilation_arch_compatible(compile_arch: str, target_arch: str) -> bool: """Check if compilation architecture is compatible with target architecture.""" compatibility_matrix = { diff --git a/src/madengine/orchestration/build_orchestrator.py b/src/madengine/orchestration/build_orchestrator.py index a7557ef8..da06f91f 100644 --- a/src/madengine/orchestration/build_orchestrator.py +++ b/src/madengine/orchestration/build_orchestrator.py @@ -19,6 +19,7 @@ from madengine.core.console import Console from madengine.core.context import Context +from madengine.core.additional_context_defaults import apply_build_context_defaults from madengine.core.errors import ( BuildError, ConfigurationError, @@ -28,6 +29,9 @@ ) from madengine.utils.discover_models import DiscoverModels from madengine.execution.docker_builder import DockerBuilder +from madengine.execution.dockerfile_utils import ( + dockerfile_requires_explicit_mad_arch_build_arg, +) class BuildOrchestrator: @@ -84,6 +88,9 @@ def __init__(self, args, additional_context: Optional[Dict] = None): if additional_context: merged_context.update(additional_context) + # Match CLI validate_additional_context: defaults so Context.filter() can select Dockerfiles + apply_build_context_defaults(merged_context) + self.additional_context = merged_context # Apply ConfigLoader to infer deploy type, validate, and apply defaults @@ -116,7 +123,8 @@ def __init__(self, args, additional_context: Optional[Dict] = None): # Initialize context in build-only mode (no GPU detection) # Context expects additional_context as a string representation of Python dict # Use repr() instead of json.dumps() because Context uses ast.literal_eval() - context_string = repr(merged_context) if merged_context else None + # Use self.additional_context (post-ConfigLoader), not pre-defaults merged_context + context_string = repr(self.additional_context) self.context = Context( additional_context=context_string, build_only_mode=True, @@ -182,6 +190,46 @@ def _copy_scripts(self): # No-op: This method is deprecated and should not be called pass + def _warn_if_mad_arch_unresolved_for_dockerfiles( + self, models: List[Dict], builder: DockerBuilder + ) -> None: + """Warn when a selected Dockerfile needs MAD_SYSTEM_GPU_ARCHITECTURE and none is resolved.""" + from collections import defaultdict + + by_dockerfile: Dict[str, List[str]] = defaultdict(list) + for model in models: + for dockerfile in builder._get_dockerfiles_for_model(model): + arch = builder._get_effective_gpu_architecture(model, dockerfile) + if arch is not None: + continue + try: + with open(dockerfile, "r", encoding="utf-8", errors="replace") as f: + content = f.read() + except OSError: + continue + if dockerfile_requires_explicit_mad_arch_build_arg(content): + by_dockerfile[dockerfile].append(model["name"]) + + if not by_dockerfile: + return + + self.rich_console.print( + "[yellow]āš ļø Warning: MAD_SYSTEM_GPU_ARCHITECTURE is required for some " + "Dockerfile(s) but was not set in additional context and has no default " + "in the Dockerfile[/yellow]" + ) + for df_path in sorted(by_dockerfile.keys()): + names = ", ".join(sorted(set(by_dockerfile[df_path]))) + self.rich_console.print(f" [dim]• models:[/dim] [cyan]{names}[/cyan]") + self.rich_console.print(f" [dim]Dockerfile:[/dim] {df_path}") + self.rich_console.print( + "[dim] Provide GPU architecture via --additional-context, e.g.[/dim]" + ) + self.rich_console.print( + '[dim] --additional-context \'{"docker_build_arg": ' + '{"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}\'[/dim]\n' + ) + def execute( self, registry: Optional[str] = None, @@ -232,27 +280,16 @@ def execute( self.rich_console.print(f"[green]āœ“ Found {len(models)} models[/green]\n") - # Step 2: Validate build context (scripts not needed for build phase) - # Build phase only creates Docker images - script execution happens in run phase - # Note: K8s and Slurm have their own script management mechanisms - if "MAD_SYSTEM_GPU_ARCHITECTURE" not in self.context.ctx["docker_build_arg"]: - self.rich_console.print( - "[yellow]āš ļø Warning: MAD_SYSTEM_GPU_ARCHITECTURE not provided[/yellow]" - ) - self.rich_console.print( - "[dim] Provide GPU architecture via --additional-context:[/dim]" - ) - self.rich_console.print( - '[dim] --additional-context \'{"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx90a"}}\'[/dim]\n' - ) - - # Step 3: Build Docker images - self.rich_console.print("[bold cyan]šŸ—ļø Building Docker images...[/bold cyan]") + # Step 2: Per-model / per-Dockerfile MAD_SYSTEM_GPU_ARCHITECTURE hint (after discovery) builder = DockerBuilder( self.context, self.console, live_output=getattr(self.args, "live_output", False), ) + self._warn_if_mad_arch_unresolved_for_dockerfiles(models, builder) + + # Step 3: Build Docker images + self.rich_console.print("[bold cyan]šŸ—ļø Building Docker images...[/bold cyan]") # Determine phase suffix for log files # Build phase always uses .build suffix to avoid conflicts with run logs diff --git a/tests/integration/test_orchestrator_workflows.py b/tests/integration/test_orchestrator_workflows.py index 12aa1a91..143e56e5 100644 --- a/tests/integration/test_orchestrator_workflows.py +++ b/tests/integration/test_orchestrator_workflows.py @@ -65,6 +65,7 @@ def test_build_orchestrator_initialization(self, mock_context): """Test orchestrator initialization with minimal args.""" mock_args = MagicMock() mock_args.additional_context = None + mock_args.additional_context_file = None mock_args.live_output = True mock_context_instance = MagicMock() @@ -75,7 +76,10 @@ def test_build_orchestrator_initialization(self, mock_context): assert orchestrator.args == mock_args assert orchestrator.context == mock_context_instance - assert orchestrator.additional_context == {} + assert orchestrator.additional_context == { + "gpu_vendor": "AMD", + "guest_os": "UBUNTU", + } assert orchestrator.credentials is None @patch( diff --git a/tests/unit/test_additional_context_defaults.py b/tests/unit/test_additional_context_defaults.py new file mode 100644 index 00000000..f42d9c2e --- /dev/null +++ b/tests/unit/test_additional_context_defaults.py @@ -0,0 +1,21 @@ +"""Tests for madengine.core.additional_context_defaults.""" + +from madengine.core.additional_context_defaults import ( + DEFAULT_GUEST_OS, + DEFAULT_GPU_VENDOR, + apply_build_context_defaults, +) + + +def test_apply_fills_missing_keys(): + ctx = {} + apply_build_context_defaults(ctx) + assert ctx == {"gpu_vendor": DEFAULT_GPU_VENDOR, "guest_os": DEFAULT_GUEST_OS} + + +def test_apply_preserves_overrides(): + ctx = {"gpu_vendor": "NVIDIA", "guest_os": "CENTOS", "tools": []} + apply_build_context_defaults(ctx) + assert ctx["gpu_vendor"] == "NVIDIA" + assert ctx["guest_os"] == "CENTOS" + assert ctx["tools"] == [] diff --git a/tests/unit/test_build_orchestrator_mad_arch.py b/tests/unit/test_build_orchestrator_mad_arch.py new file mode 100644 index 00000000..2f32523b --- /dev/null +++ b/tests/unit/test_build_orchestrator_mad_arch.py @@ -0,0 +1,36 @@ +#!/usr/bin/env python3 +"""BuildOrchestrator MAD_SYSTEM_GPU_ARCHITECTURE messaging.""" + +from unittest.mock import MagicMock + +from madengine.orchestration.build_orchestrator import BuildOrchestrator + + +def test_warn_skipped_when_dockerfile_has_default(tmp_path): + df = tmp_path / "d.Dockerfile" + df.write_text("ARG MAD_SYSTEM_GPU_ARCHITECTURE=gfx942\n") + orch = object.__new__(BuildOrchestrator) + orch.rich_console = MagicMock() + builder = MagicMock() + builder._get_dockerfiles_for_model = lambda m: [str(df)] + builder._get_effective_gpu_architecture = lambda m, d: None + + BuildOrchestrator._warn_if_mad_arch_unresolved_for_dockerfiles( + orch, [{"name": "model_a"}], builder + ) + orch.rich_console.print.assert_not_called() + + +def test_warn_when_bare_arg_and_unresolved(tmp_path): + df = tmp_path / "d.Dockerfile" + df.write_text("ARG MAD_SYSTEM_GPU_ARCHITECTURE\n") + orch = object.__new__(BuildOrchestrator) + orch.rich_console = MagicMock() + builder = MagicMock() + builder._get_dockerfiles_for_model = lambda m: [str(df)] + builder._get_effective_gpu_architecture = lambda m, d: None + + BuildOrchestrator._warn_if_mad_arch_unresolved_for_dockerfiles( + orch, [{"name": "model_a"}], builder + ) + orch.rich_console.print.assert_called() diff --git a/tests/unit/test_cli.py b/tests/unit/test_cli.py index d186f8b2..81b74a4b 100644 --- a/tests/unit/test_cli.py +++ b/tests/unit/test_cli.py @@ -230,18 +230,22 @@ def test_validate_additional_context_invalid_json(self): assert exc_info.value.exit_code == ExitCode.INVALID_ARGS mock_console.print.assert_called() - def test_validate_additional_context_missing_required_fields(self): - """Test validation with missing required fields.""" + def test_validate_additional_context_defaults_fill_partial_fields(self): + """Missing gpu_vendor or guest_os is filled from defaults (no error).""" + from madengine.core.additional_context_defaults import ( + DEFAULT_GPU_VENDOR, + DEFAULT_GUEST_OS, + ) + with patch("madengine.cli.validators.console") as mock_console: - # Missing gpu_vendor - with pytest.raises(typer.Exit) as exc_info: - validate_additional_context('{"guest_os": "UBUNTU"}') - assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + r1 = validate_additional_context('{"guest_os": "UBUNTU"}') + assert r1["gpu_vendor"] == DEFAULT_GPU_VENDOR + assert r1["guest_os"] == "UBUNTU" - # Missing guest_os - with pytest.raises(typer.Exit) as exc_info: - validate_additional_context('{"gpu_vendor": "AMD"}') - assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + r2 = validate_additional_context('{"gpu_vendor": "AMD"}') + assert r2["gpu_vendor"] == "AMD" + assert r2["guest_os"] == DEFAULT_GUEST_OS + mock_console.print.assert_called() def test_validate_additional_context_invalid_values(self): """Test validation with invalid field values.""" diff --git a/tests/unit/test_context_logic.py b/tests/unit/test_context_logic.py index 0ce6504f..c9bc860c 100644 --- a/tests/unit/test_context_logic.py +++ b/tests/unit/test_context_logic.py @@ -52,4 +52,19 @@ def test_generates_build_args_for_amd(self, mock_arch, mock_vendor): assert context.ctx["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx90a" +@pytest.mark.unit +class TestBuildOnlyContextMessages: + """Build-only context should not emit duplicate MAD_SYSTEM_GPU_ARCHITECTURE info.""" + + @patch.object(Context, "get_ctx_test", return_value="test") + @patch.object(Context, "get_host_os", return_value="linux") + def test_build_only_no_mad_arch_info_line(self, mock_host, mock_ctx): + from unittest.mock import patch + + with patch("builtins.print") as p: + Context(additional_context="{}", build_only_mode=True) + msgs = [str(c.args[0]) for c in p.call_args_list if c.args] + assert not any("MAD_SYSTEM_GPU_ARCHITECTURE" in m for m in msgs) + + # Total: 5 unit tests diff --git a/tests/unit/test_docker_builder.py b/tests/unit/test_docker_builder.py new file mode 100644 index 00000000..3fe97f9b --- /dev/null +++ b/tests/unit/test_docker_builder.py @@ -0,0 +1,46 @@ +"""Unit tests for madengine.execution.docker_builder (DockerBuilder). + +Currently covers registry image naming for single-arch builds (credentials → repository:tag). +""" + +from unittest.mock import MagicMock + +import pytest + +from madengine.execution.docker_builder import DockerBuilder + + +@pytest.fixture +def docker_builder(): + ctx = MagicMock() + ctx.ctx = {} + return DockerBuilder(ctx) + + +def test_create_registry_image_name_uses_dockerhub_repository(docker_builder): + creds = { + "dockerhub": { + "repository": "myorg/ci", + "username": "u", + "password": "p", + } + } + out = docker_builder._create_registry_image_name( + "ci-dummy_dummy.ubuntu.amd", + "dockerhub", + None, + {"name": "dummy"}, + creds, + ) + assert out == "myorg/ci:ci-dummy_dummy.ubuntu.amd" + + +def test_create_registry_image_name_without_credentials_matches_local_tag(docker_builder): + out = docker_builder._create_registry_image_name( + "ci-dummy_dummy.ubuntu.amd", + "dockerhub", + None, + {"name": "dummy"}, + None, + ) + assert out == "ci-dummy_dummy.ubuntu.amd" diff --git a/tests/unit/test_dockerfile_mad_arch.py b/tests/unit/test_dockerfile_mad_arch.py new file mode 100644 index 00000000..80924052 --- /dev/null +++ b/tests/unit/test_dockerfile_mad_arch.py @@ -0,0 +1,24 @@ +#!/usr/bin/env python3 +"""Tests for MAD_SYSTEM_GPU_ARCHITECTURE Dockerfile heuristics.""" + +import pytest + +from madengine.execution.dockerfile_utils import ( + dockerfile_requires_explicit_mad_arch_build_arg, +) + + +@pytest.mark.parametrize( + "content,expected", + [ + ("ARG MAD_SYSTEM_GPU_ARCHITECTURE=gfx942\nFROM ubuntu\n", False), + ("ARG MAD_SYSTEM_GPU_ARCHITECTURE\n", True), + ("FROM ubuntu\n", False), + ( + "ARG MAD_SYSTEM_GPU_ARCHITECTURE\nARG MAD_SYSTEM_GPU_ARCHITECTURE=gfx942\n", + False, + ), + ], +) +def test_dockerfile_requires_explicit_mad_arch(content, expected): + assert dockerfile_requires_explicit_mad_arch_build_arg(content) is expected diff --git a/tests/unit/test_orchestration.py b/tests/unit/test_orchestration.py index 6c82ff9b..7c1506c4 100644 --- a/tests/unit/test_orchestration.py +++ b/tests/unit/test_orchestration.py @@ -7,6 +7,10 @@ filter_images_by_gpu_compatibility, filter_images_by_skip_gpu_arch, ) +from madengine.core.additional_context_defaults import ( + DEFAULT_GPU_VENDOR, + DEFAULT_GUEST_OS, +) from madengine.orchestration.build_orchestrator import BuildOrchestrator from madengine.orchestration.run_orchestrator import RunOrchestrator from madengine.core.errors import ConfigurationError @@ -111,28 +115,37 @@ class TestBuildOrchestratorInit: @patch("madengine.orchestration.build_orchestrator.Context") @patch("os.path.exists", return_value=False) def test_initializes_with_minimal_args(self, mock_exists, mock_context): - """Should initialize with minimal arguments.""" + """Should initialize with minimal arguments and build defaults for Dockerfile filtering.""" mock_args = MagicMock() mock_args.additional_context = None + mock_args.additional_context_file = None mock_args.live_output = True orchestrator = BuildOrchestrator(mock_args) assert orchestrator.args == mock_args - assert orchestrator.additional_context == {} + assert orchestrator.additional_context == { + "gpu_vendor": DEFAULT_GPU_VENDOR, + "guest_os": DEFAULT_GUEST_OS, + } assert orchestrator.credentials is None @patch("madengine.orchestration.build_orchestrator.Context") @patch("os.path.exists", return_value=False) def test_parses_additional_context_json(self, mock_exists, mock_context): - """Should parse JSON additional context.""" + """Should parse JSON additional context and merge build defaults.""" mock_args = MagicMock() mock_args.additional_context = '{"key": "value"}' + mock_args.additional_context_file = None mock_args.live_output = True orchestrator = BuildOrchestrator(mock_args) - assert orchestrator.additional_context == {"key": "value"} + assert orchestrator.additional_context == { + "key": "value", + "gpu_vendor": DEFAULT_GPU_VENDOR, + "guest_os": DEFAULT_GUEST_OS, + } @pytest.mark.unit diff --git a/tests/unit/test_validators.py b/tests/unit/test_validators.py new file mode 100644 index 00000000..cb74cce0 --- /dev/null +++ b/tests/unit/test_validators.py @@ -0,0 +1,196 @@ +#!/usr/bin/env python3 +""" +Unit tests for madengine CLI validators + +Copyright (c) Advanced Micro Devices, Inc. All rights reserved. +""" + +import json +import tempfile + +import pytest +import typer + +from madengine.core.additional_context_defaults import ( + DEFAULT_GPU_VENDOR, + DEFAULT_GUEST_OS, +) +from madengine.cli.validators import validate_additional_context +from madengine.cli.constants import ExitCode + + +class TestValidateAdditionalContext: + """Test suite for validate_additional_context function""" + + def test_validate_additional_context_with_defaults_applied(self, capsys): + """Test that defaults are applied when context is empty""" + result = validate_additional_context(additional_context="{}") + + assert result["gpu_vendor"] == DEFAULT_GPU_VENDOR + assert result["guest_os"] == DEFAULT_GUEST_OS + + # Verify console output mentions defaults + # Note: capsys won't capture Rich console output, so we just verify the result + + def test_validate_additional_context_no_defaults_when_provided(self): + """Test that explicit values are preserved and no defaults applied""" + explicit_context = '{"gpu_vendor": "NVIDIA", "guest_os": "CENTOS"}' + result = validate_additional_context(additional_context=explicit_context) + + assert result["gpu_vendor"] == "NVIDIA" + assert result["guest_os"] == "CENTOS" + + def test_validate_additional_context_partial_default_gpu_vendor(self): + """Test that only gpu_vendor is defaulted when guest_os is provided""" + partial_context = '{"guest_os": "CENTOS"}' + result = validate_additional_context(additional_context=partial_context) + + assert result["gpu_vendor"] == DEFAULT_GPU_VENDOR + assert result["guest_os"] == "CENTOS" + + def test_validate_additional_context_partial_default_guest_os(self): + """Test that only guest_os is defaulted when gpu_vendor is provided""" + partial_context = '{"gpu_vendor": "NVIDIA"}' + result = validate_additional_context(additional_context=partial_context) + + assert result["gpu_vendor"] == "NVIDIA" + assert result["guest_os"] == DEFAULT_GUEST_OS + + def test_validate_additional_context_file_with_defaults(self): + """Test that defaults are applied after file is loaded""" + # Create temporary config file with extra fields but missing required ones + with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: + json.dump({"extra_field": "value", "timeout": 30}, f) + temp_file = f.name + + try: + result = validate_additional_context( + additional_context="{}", additional_context_file=temp_file + ) + + # Should have defaults plus the extra field + assert result["gpu_vendor"] == DEFAULT_GPU_VENDOR + assert result["guest_os"] == DEFAULT_GUEST_OS + assert result["extra_field"] == "value" + assert result["timeout"] == 30 + finally: + import os + + os.unlink(temp_file) + + def test_validate_additional_context_invalid_values_no_defaults(self): + """Test that invalid values cause validation error, not default application""" + invalid_context = '{"gpu_vendor": "INVALID", "guest_os": "INVALID"}' + + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context(additional_context=invalid_context) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_empty_string_no_defaults(self): + """Test that empty string values cause validation error, not defaults""" + empty_string_context = '{"gpu_vendor": "", "guest_os": ""}' + + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context(additional_context=empty_string_context) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_case_insensitive(self): + """Lowercase gpu_vendor/guest_os are accepted and normalized to canonical uppercase.""" + lowercase_context = '{"gpu_vendor": "amd", "guest_os": "ubuntu"}' + result = validate_additional_context(additional_context=lowercase_context) + + assert result["gpu_vendor"] == "AMD" + assert result["guest_os"] == "UBUNTU" + + def test_validate_additional_context_file_and_cli_merge_with_defaults(self): + """Test that file + CLI merge works and defaults fill gaps""" + # Create temporary file with one field + with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: + json.dump({"gpu_vendor": "NVIDIA"}, f) + temp_file = f.name + + try: + # CLI provides different field (neither provides guest_os) + result = validate_additional_context( + additional_context='{"timeout": 60}', additional_context_file=temp_file + ) + + # Should merge file + CLI + defaults + assert result["gpu_vendor"] == "NVIDIA" # From file + assert result["guest_os"] == DEFAULT_GUEST_OS # From defaults + assert result["timeout"] == 60 # From CLI + finally: + import os + + os.unlink(temp_file) + + def test_validate_additional_context_cli_overrides_file(self): + """Test that CLI values override file values""" + # Create temporary file + with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f: + json.dump({"gpu_vendor": "AMD", "guest_os": "UBUNTU"}, f) + temp_file = f.name + + try: + # CLI overrides gpu_vendor + result = validate_additional_context( + additional_context='{"gpu_vendor": "NVIDIA"}', + additional_context_file=temp_file, + ) + + assert result["gpu_vendor"] == "NVIDIA" # From CLI (override) + assert result["guest_os"] == "UBUNTU" # From file + finally: + import os + + os.unlink(temp_file) + + def test_validate_additional_context_invalid_json(self): + """Test that invalid JSON raises appropriate error""" + invalid_json = '{"gpu_vendor": "AMD", invalid}' + + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context(additional_context=invalid_json) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_file_not_found(self): + """Test that missing file raises appropriate error""" + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context( + additional_context="{}", + additional_context_file="/nonexistent/file.json", + ) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_docker_build_arg_must_be_object(self): + bad = '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "docker_build_arg": "oops"}' + + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context(additional_context=bad) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_nested_docker_build_arg_ok(self): + ctx = ( + '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", ' + '"docker_build_arg": {"MAD_SYSTEM_GPU_ARCHITECTURE": "gfx942"}}' + ) + result = validate_additional_context(additional_context=ctx) + assert result["docker_build_arg"]["MAD_SYSTEM_GPU_ARCHITECTURE"] == "gfx942" + + def test_validate_additional_context_tools_must_be_list(self): + bad = '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "tools": {}}' + + with pytest.raises(typer.Exit) as exc_info: + validate_additional_context(additional_context=bad) + + assert exc_info.value.exit_code == ExitCode.INVALID_ARGS + + def test_validate_additional_context_tools_list_ok(self): + ctx = '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "tools": []}' + result = validate_additional_context(additional_context=ctx) + assert result["tools"] == []