Pull request overview
This PR adds a stable-audio command-line interface entrypoint for running Stable Audio 3 generation (text-to-audio, audio-to-audio, inpainting/continuation) without writing Python, along with unit tests and documentation.
Changes:
- Introduces `stable_audio_3/cli.py`, implementing argparse-based generation and WAV output saving.
- Adds `tests/test_cli.py`, covering CLI flag wiring via mocked model and torchaudio I/O.
- Updates `README.md` with CLI usage examples and adds the `stable-audio` console script in `pyproject.toml`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `stable_audio_3/cli.py` | New CLI implementation: argument parsing, model loading, generation invocation, and output saving. |
| `tests/test_cli.py` | New unit tests validating CLI flag wiring with mocked dependencies. |
| `README.md` | Documents CLI usage and adds a note about `uv sync` behavior. |
| `pyproject.toml` | Adds a `stable-audio` console script entrypoint. |
Comments suppressed due to low confidence (2)
stable_audio_3/cli.py:225
`torchaudio.load()` returns `(waveform, sample_rate)`, but `StableAudioModel.generate()` expects audio inputs as `(sample_rate, tensor)` tuples (see the unpacking in `StableAudioModel._encode_audio_input`). Passing the raw `torchaudio.load()` result here swaps sample rate and waveform and breaks audio-to-audio and inpainting. Load as `(waveform, sr)` and then pass `(sr, waveform)` for both `init_audio` and `inpaint_audio`.
init_audio = None
if args.init_audio:
    init_audio = torchaudio.load(args.init_audio)
inpaint_audio = None
if args.inpaint_audio:
    inpaint_audio = torchaudio.load(args.inpaint_audio)
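The reordering suggested in this comment can be sketched as a small helper. This is a minimal illustration, not the PR's code: `load_audio_for_generate` and its `load_fn` parameter are hypothetical names. In the CLI, `load_fn` would be `torchaudio.load`, which returns `(waveform, sample_rate)`.

```python
from typing import Any, Callable, Optional, Tuple


def load_audio_for_generate(
    path: Optional[str],
    load_fn: Callable[[str], Tuple[Any, int]],
) -> Optional[Tuple[int, Any]]:
    """Load audio and return it in (sample_rate, waveform) order.

    Hypothetical helper: `load_fn` stands in for torchaudio.load, which
    returns (waveform, sample_rate). Per the review comment,
    StableAudioModel.generate() expects (sample_rate, tensor) instead.
    """
    if not path:
        # Flag not given: keep the "no audio input" sentinel.
        return None
    waveform, sample_rate = load_fn(path)
    # Swap the order so the model unpacks the tuple correctly.
    return (sample_rate, waveform)
```

Taking the loader as a parameter keeps the helper testable without real audio files; the same call would serve both `--init-audio` and `--inpaint-audio`.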
stable_audio_3/cli.py:195
When `--prompt`/`--duration`/`--negative-prompt` are passed with multiple values, the CLI passes lists into `model.generate()`. The model validates list lengths with `assert len(value) == batch_size` (in `StableAudioModel._build_conditioning_dicts`), which raises an `AssertionError` (and can become a silent logic bug under `python -O`, where assertions are stripped). Add explicit CLI-side validation (or auto-infer `batch_size` from the list lengths when the user didn’t set it) and fail with `parser.error(...)` on mismatches.
# --- Build scalar / list args ---
prompt = args.prompt[0] if len(args.prompt) == 1 else args.prompt
negative_prompt = None
if args.negative_prompt:
    negative_prompt = (
        args.negative_prompt[0]
        if len(args.negative_prompt) == 1
        else args.negative_prompt
    )
duration = args.duration[0] if len(args.duration) == 1 else args.duration
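One way the suggested validation could look, as a hedged sketch: `resolve_batch_size` is a hypothetical helper, and it assumes each multi-value flag arrives as a list (or `None`) and that `batch_size` is `None` when the user did not set it. Single-value flags are treated as broadcast and place no constraint.

```python
import argparse
from typing import Optional, Sequence


def resolve_batch_size(
    parser: argparse.ArgumentParser,
    batch_size: Optional[int],
    **list_flags: Optional[Sequence],
) -> int:
    """Infer or validate the batch size from multi-value flags (hypothetical)."""
    # Only flags with more than one value constrain the batch size.
    lengths = {
        name: len(values)
        for name, values in list_flags.items()
        if values is not None and len(values) > 1
    }
    if batch_size is None:
        distinct = set(lengths.values())
        if len(distinct) > 1:
            parser.error(f"Multi-value flags disagree on batch size: {lengths}")
        # No multi-value flags at all means a batch of one.
        return distinct.pop() if distinct else 1
    for name, count in lengths.items():
        if count != batch_size:
            flag = "--" + name.replace("_", "-")
            parser.error(f"Got {count} {flag} values but batch size is {batch_size}")
    return batch_size
```

Because `parser.error` prints a usage message and exits with status 2, the check still runs under `python -O`, unlike an `assert`.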
SAMPLE_RATE = 44100
CHANNELS = 2
FAKE_AUDIO_PATH = "some/audio.wav"
_FAKE_LOAD_RESULT = (SAMPLE_RATE, torch.zeros(CHANNELS, SAMPLE_RATE * 5))
if args.inpaint_starts and len(args.inpaint_starts) != len(args.inpaint_ends):
    parser.error(
        "--inpaint-start and --inpaint-end must be specified the same number of times"
    )
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
stable_audio_3/cli.py:199
Same issue as prompts: when `--duration` is provided multiple times, the CLI passes a list through to `StableAudioModel.generate`, which asserts that the list length equals `batch_size`. Please validate the length against `--batch-size` (or require `batch_size` to match) and fail via `parser.error` so users get a clear CLI message instead of an `AssertionError`.
duration = args.duration[0] if len(args.duration) == 1 else args.duration
prompt = args.prompt[0] if len(args.prompt) == 1 else args.prompt
negative_prompt = None
if args.negative_prompt:
    negative_prompt = (
        args.negative_prompt[0]
if args.loras:
    print(f"Loading LoRA(s): {args.loras}")
    model.load_lora(args.loras)
    if args.lora_strength is not None:
        model.set_lora_strength(args.lora_strength, lora_index=args.lora_index)
warnings.filterwarnings("ignore")
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
stable_audio_3/cli.py:188
Inpaint region times are not validated for non-negativity or ordering. Negative seconds can produce negative sample indices (masking the end of the clip due to Python negative indexing), and `end <= start` becomes a silent no-op region. Validate each start/end pair after parsing (`>= 0` and `end > start`) and call `parser.error` on invalid regions.
# --- Validate inpaint args ---
if (args.inpaint_starts is None) != (args.inpaint_ends is None):
    parser.error("--inpaint-start and --inpaint-end must both be provided together")
if args.inpaint_starts and len(args.inpaint_starts) != len(args.inpaint_ends):
    parser.error(
        "--inpaint-start and --inpaint-end must be specified the same number of times"
    )
if args.inpaint_starts and not args.inpaint_audio:
    parser.error("--inpaint-start/--inpaint-end require --inpaint-audio")
if args.inpaint_audio and not args.inpaint_starts:
    parser.error("--inpaint-audio requires --inpaint-start and --inpaint-end")
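The per-region check requested in this comment might look like the following sketch. `validate_inpaint_regions` is a hypothetical helper; it assumes the starts and ends are already parsed as floats in seconds and have equal length, which the existing checks above enforce.

```python
import argparse
from typing import List, Sequence, Tuple


def validate_inpaint_regions(
    parser: argparse.ArgumentParser,
    starts: Sequence[float],
    ends: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, or exit via parser.error (hypothetical)."""
    regions = []
    for index, (start, end) in enumerate(zip(starts, ends), start=1):
        # A negative time would turn into a negative sample index, which
        # Python slicing interprets as counting from the end of the clip.
        if start < 0 or end < 0:
            parser.error(
                f"Inpaint region {index}: times must be >= 0 (got {start} to {end})"
            )
        # An empty or reversed region would otherwise mask nothing, silently.
        if end <= start:
            parser.error(
                f"Inpaint region {index}: end ({end}) must be greater than start ({start})"
            )
        regions.append((start, end))
    return regions
```

Numbering the regions in the message tells the user which repeated `--inpaint-start`/`--inpaint-end` pair to fix.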
stable_audio_3/cli.py:214
`--duration` accepts zero and negative values. In `StableAudioModel.generate`, non-positive durations cause sample-size adaptation to fall back to the default `sample_size`, which can generate unexpectedly long audio instead of failing fast. Add a check that all provided durations are > 0 and raise `parser.error` otherwise.
# --- Validate list-flag lengths against batch size ---
if (
    args.negative_prompt
    and len(args.negative_prompt) > 1
    and len(args.negative_prompt) != batch_size
):
    parser.error(
        f"Got {len(args.negative_prompt)} --negative-prompt values but batch size is {batch_size}"
    )
if len(args.duration) > 1 and len(args.duration) != batch_size:
    parser.error(
        f"Got {len(args.duration)} --duration values but batch size is {batch_size}"
    )
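An alternative to a post-parse check for durations is to reject bad values during parsing with an argparse `type=` callable. The sketch below is illustrative only: `positive_float` and `build_parser` are hypothetical names, and the `nargs="+"` form and the default value are assumptions, since the real flag definition is not shown here.

```python
import argparse
import math


def positive_float(text: str) -> float:
    """argparse type callable that accepts only finite values > 0."""
    try:
        value = float(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not a number") from None
    if not math.isfinite(value) or value <= 0:
        raise argparse.ArgumentTypeError(
            f"duration must be a positive number of seconds, got {text!r}"
        )
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stable-audio")
    parser.add_argument(
        "--duration",
        type=positive_float,
        nargs="+",        # assumption: the real CLI may use action="append"
        default=[10.0],   # placeholder, not the CLI's actual default
        help="Duration in seconds (must be > 0)",
    )
    return parser
```

argparse turns an `ArgumentTypeError` into the same usage-and-exit behavior as `parser.error`, so the user sees which flag and value were rejected.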
parser.add_argument(
    "--init-noise-level",
    type=float,
    default=0.9,
    help="Noise level for audio-to-audio (0.0–1.0, default: 0.9)",
)