Is there any way to do spoken content grounding? That is, to check whether a given sentence is actually spoken in the audio or not (not caption