Merged
10 changes: 5 additions & 5 deletions docs/docs/extraction/audio.md
@@ -1,8 +1,8 @@
 # Extract Speech with NeMo Retriever Library
 
 This documentation describes two methods to run [NeMo Retriever Library](overview.md)
-with the [RIVA ASR NIM microservice](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)
-to extract speech from audio files.
+with the [parakeet-1-1b-ctc-en-us ASR NIM microservice](https://docs.nvidia.com/nim/speech/latest/asr/deploy-asr-models/parakeet-ctc-en-us.html)
+(`nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us`) to extract speech from audio files.
 
 - Run the NIM locally by using Docker Compose
 - Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference
@@ -22,12 +22,12 @@ Currently, you can extract speech from the following file types:

 [NeMo Retriever Library](overview.md) supports extracting speech from audio files for Retrieval Augmented Generation (RAG) applications.
 Similar to how the multimodal document extraction pipeline leverages object detection and image OCR microservices,
-NeMo Retriever leverages the [RIVA ASR NIM microservice](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)
+NeMo Retriever leverages the [parakeet-1-1b-ctc-en-us ASR NIM microservice](https://docs.nvidia.com/nim/speech/latest/asr/deploy-asr-models/parakeet-ctc-en-us.html)
 to transcribe speech to text, which is then embedded by using the NeMo Retriever embedding NIM.
 
 !!! important
 
-    Due to limitations in available VRAM controls in the current release, the RIVA ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). For the full list of requirements, refer to [Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix.html).
+    Due to limitations in available VRAM controls in the current release, the parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). For the full list of requirements, refer to [Support Matrix](support-matrix.md).
 
 This pipeline enables users to retrieve speech files at the segment level.

@@ -43,7 +43,7 @@ Use the following procedure to run the NIM locally.

 !!! important
 
-    The RIVA ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). Edit docker-compose.yaml to set the device_id to a dedicated GPU: device_ids: ["1"] or higher.
+    The parakeet-1-1b-ctc-en-us ASR NIM microservice must run on a [dedicated additional GPU](support-matrix.md). Edit `docker-compose.yaml` to set `device_ids` to a dedicated GPU, for example `device_ids: ["1"]` or higher.
 
 1. To access the required container images, log in to the NVIDIA Container Registry (nvcr.io). Use [your NGC key](ngc-api-key.md) as the password. Run the following command in your terminal.

2 changes: 1 addition & 1 deletion docs/docs/extraction/nv-ingest-python-api.md
@@ -571,5 +571,5 @@ results = ingestor.ingest()
 - [Split Documents](chunking.md)
 - [Troubleshoot Nemo Retriever Extraction](troubleshoot.md)
 - [Advanced Visual Parsing](nemoretriever-parse.md)
-- [Use NeMo Retriever Library with Riva for Audio Processing](audio.md)
+- [Use NeMo Retriever Library with the Parakeet ASR NIM for Audio Processing](audio.md)
 - [Use Multimodal Embedding](vlm-embed.md)
2 changes: 1 addition & 1 deletion docs/docs/extraction/python-api-reference.md
@@ -663,5 +663,5 @@ results = ingestor.ingest()
 - [Split Documents](chunking.md)
 - [Troubleshoot NeMo Retriever Library](troubleshoot.md)
 - [Advanced Visual Parsing](nemoretriever-parse.md)
-- [Use the NeMo Retriever Library with Riva for Audio Processing](audio.md)
+- [Use the NeMo Retriever Library with the Parakeet ASR NIM for Audio Processing](audio.md)
 - [Use Multimodal Embedding](vlm-embed.md)
2 changes: 1 addition & 1 deletion docs/docs/extraction/quickstart-guide.md
@@ -393,7 +393,7 @@ You can specify multiple `--profile` options.
 | Profile | Type | Description |
 |-----------------------|----------|-------------------------------------------------------------------|
 | `retrieval` | Core | Enables the embedding NIM and (optional) GPU-accelerated Milvus. Omit this profile to use the default LanceDB backend. |
-| `audio` | Advanced | Use [Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) for processing audio files. For more information, refer to [Audio Processing](audio.md). |
+| `audio` | Advanced | Use the [parakeet-1-1b-ctc-en-us](https://docs.nvidia.com/nim/speech/latest/asr/deploy-asr-models/parakeet-ctc-en-us.html) ASR NIM (`nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us`) for processing audio files. For more information, refer to [Audio Processing](audio.md). |
 | `nemotron-parse` | Advanced | Use [nemotron-parse](https://build.nvidia.com/nvidia/nemotron-parse), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing](nemoretriever-parse.md). |
 | `vlm` | Advanced | Use [llama 3.1 Nemotron 8B Vision](https://build.nvidia.com/nvidia/llama-3.1-nemotron-nano-vl-8b-v1/modelcard) for image captioning of unstructured images and infographics. This profile enables the `caption` method in the Python API to generate text descriptions of visual content. For more information, refer to [Use Multimodal Embedding](vlm-embed.md) and [Extract Captions from Images](nv-ingest-python-api.md#extract-captions-from-images). |

8 changes: 4 additions & 4 deletions docs/docs/extraction/support-matrix.md
@@ -22,7 +22,7 @@ The core pipeline features include the following:
 Advanced features require additional GPU support and disk space.
 This includes the following:
 
-- Audio extraction — Use [Riva](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html) for processing audio files. For more information, refer to [Audio Processing](audio.md).
+- Audio extraction (parakeet-1-1b-ctc-en-us) — Use the [Parakeet CTC English (en-US) ASR NIM](https://docs.nvidia.com/nim/speech/latest/asr/deploy-asr-models/parakeet-ctc-en-us.html) (`nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us`) for processing audio files. For more information, refer to [Audio Processing](audio.md).
 - Advanced visual parsing — Use [nemotron-parse](https://docs.nvidia.com/nim/vision-language-models/latest/examples/nemotron-parse/overview.html), which adds state-of-the-art text and table extraction. For more information, refer to [Advanced Visual Parsing ](nemoretriever-parse.md).
 - VLM — Use [nemotron-nano-12b-v2-vl](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard) for experimental image captioning of unstructured images.

@@ -55,8 +55,8 @@ The following are the hardware requirements to run NeMo Retriever Library.
 | GPU | Memory | 96GB | 180GB | 141GB | 80GB | 80GB | 40GB | 24GB | 48GB | 32GB GDDR7 (GB203) |
 | Core Features | Total GPUs | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
 | Core Features | Total Disk Space | ~150GB | ~150GB | ~150GB | ~150GB | ~150GB | ~150GB | ~150GB | ~150GB | ~150GB |
-| Audio | Additional Dedicated GPUs | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1¹ |
-| Audio | Additional Disk Space | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB¹ |
+| Audio (parakeet-1-1b-ctc-en-us) | Additional Dedicated GPUs | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1¹ |
+| Audio (parakeet-1-1b-ctc-en-us) | Additional Disk Space | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB | ~37GB¹ |
 | nemotron-parse | Additional Dedicated GPUs | Not supported | Not supported | Not supported | 1 | 1 | 1 | 1 | 1 | Not supported² |
 | nemotron-parse | Additional Disk Space | Not supported | Not supported | Not supported | ~16GB | ~16GB | ~16GB | ~16GB | ~16GB | Not supported² |
 | VLM | Additional Dedicated GPUs | 1 | 1 | 1 | 1 | 1 | Not supported | Not supported | 1 | Not supported³ |
@@ -73,4 +73,4 @@ and run only the embedder, reranker, and your vector database.
 - [Prerequisites](prerequisites.md)
 - [Release Notes](releasenotes-nv-ingest.md)
 - [NVIDIA NIM for Vision Language Models Support Matrix](https://docs.nvidia.com/nim/vision-language-models/latest/support-matrix.html)
-- [NVIDIA Riva Support Matrix](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/support-matrix/support-matrix.html)
+- [NVIDIA Speech NIM Microservices](https://docs.nvidia.com/nim/speech/latest/reference/support-matrix/index.html)
4 changes: 2 additions & 2 deletions docs/docs/extraction/troubleshoot.md
@@ -40,7 +40,7 @@ Before you change the `-u` setting, consider the following:
 - For `-u` we recommend 10,000 as a baseline, but you might need to raise or lower it based on your actual usage and system configuration.
 
 ```bash
-ulimit -u 10,000
+ulimit -u 10000
 ```


@@ -89,7 +89,7 @@ Before you change the `-n` setting, consider the following:
 - For `-n` we recommend 10,000 as a baseline, but you might need to raise or lower it based on your actual usage and system configuration.
 
 ```bash
-ulimit -n 10,000
+ulimit -n 10000
 ```

