
[RNE Rewrite] feat: add voice activity detection pipeline#1298

Draft
msluszniak wants to merge 3 commits into rne-rewrite from @ms/rewrite-vad

Conversation

@msluszniak

Member

Description

Adds a Voice Activity Detection (VAD) task pipeline and a corresponding speech example app. The whole pipeline (feature extraction, chunked inference, segment postprocessing and streaming) runs in TypeScript on top of the core model.execute primitive — no new C++.
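The chunked-inference step mentioned above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RunChunk` is a hypothetical callback standing in for a call to model.execute (whose actual signature is not shown here), and `runChunked`, `featureDim`, and `maxFrames` are names invented for the example.

```typescript
// Hypothetical stand-in for one model.execute call on a chunk of features.
// The real signature is not shown in this PR description.
type RunChunk = (features: Float32Array, frames: number) => Float32Array;

// Split a flat [totalFrames, featureDim] feature buffer into chunks of at most
// maxFrames rows, run each chunk, and collect the per-chunk outputs in order.
function runChunked(
  features: Float32Array,
  featureDim: number,
  maxFrames: number,
  runChunk: RunChunk,
): Float32Array[] {
  if (featureDim <= 0 || maxFrames <= 0) {
    throw new Error("featureDim and maxFrames must be positive");
  }
  const totalFrames = Math.floor(features.length / featureDim);
  const outputs: Float32Array[] = [];
  for (let start = 0; start < totalFrames; start += maxFrames) {
    // The last chunk may be shorter, which is why the input needs a
    // variable-length frame dimension.
    const frames = Math.min(maxFrames, totalFrames - start);
    // subarray creates a view over the same memory, so no copy is made here.
    const chunk = features.subarray(
      start * featureDim,
      (start + frames) * featureDim,
    );
    outputs.push(runChunk(chunk, frames));
  }
  return outputs;
}
```

With 5 frames and `maxFrames = 2`, the callback would be invoked with 2, 2, and 1 frames respectively.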

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

Screenshots

Related issues

Closes #1249

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

  • Depends on the relaxed input validation via get_dynamic_dims from the text-embeddings PR ([RNE Rewrite] Add image and text embeddings pipelines #1247): VAD feeds a variable-length [frames, 512] input tensor per chunk. Outputs are still validated exactly, so the output tensor is pre-allocated at the model-declared shape. This PR therefore requires #1247 to land and the fsmn-vad model to be re-exported with a get_dynamic_dims method.
  • Segments are returned in seconds (the old native path returned raw sample indices).
  • The FSMN output contract is assumed to be [1, frames, classes] with class 0 = non-speech (speech = 1 - p0), matching the current native implementation.
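To make the last two notes concrete, here is a small sketch of how a flat [1, frames, classes] output buffer might be turned into per-frame speech probabilities, and how sample indices map to seconds. The function names and the `SegmentSeconds` type are hypothetical and are not taken from this PR; the only assumptions carried over from the notes are the output layout and that class 0 is non-speech.

```typescript
// Per-frame speech probability from a flat, row-major [1, frames, classes]
// buffer, assuming class 0 is non-speech (speech = 1 - p0).
function speechProbabilities(
  output: Float32Array,
  frames: number,
  classes: number,
): Float32Array {
  if (output.length !== frames * classes) {
    throw new Error("output length does not match frames * classes");
  }
  const probs = new Float32Array(frames);
  for (let f = 0; f < frames; f++) {
    // Index f * classes is the class-0 entry of frame f.
    probs[f] = 1 - output[f * classes];
  }
  return probs;
}

// Hypothetical segment type with boundaries in seconds.
interface SegmentSeconds {
  start: number;
  end: number;
}

// Convert raw sample indices (the old native convention) to seconds.
function samplesToSeconds(
  startSample: number,
  endSample: number,
  sampleRate: number,
): SegmentSeconds {
  return { start: startSample / sampleRate, end: endSample / sampleRate };
}
```

For instance, at a 16000 Hz sample rate a segment spanning samples 16000 to 32000 would be reported as 1 s to 2 s.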

@msluszniak msluszniak self-assigned this Jul 2, 2026
@msluszniak msluszniak added the refactoring and feature labels Jul 2, 2026
@msluszniak msluszniak linked an issue Jul 2, 2026 that may be closed by this pull request
Port the VAD feature to the rewrite as a pure-TypeScript pipeline on top of
the core model.execute primitive (no new C++):

- src/extensions/speech/tasks/vad.ts: createVAD runner replicating the native
  FSMN-VAD algorithm (framing + Hann window + pre-emphasis, chunked inference,
  thresholding / min-duration / padding / merge). Segments are returned in
  seconds. Relies on the get_dynamic_dims relaxed input validation for the
  dynamic frame dimension; the fsmn-vad model is re-exported with it.
- src/extensions/speech/vadStreamer.ts: pure streaming state machine driving
  onSpeechBegin / onSpeechEnd over an accumulating buffer.
- src/hooks/useVAD.ts: hook wrapping createVAD + streamer lifecycle.
- Register models.vad.FSMN_VAD and export the speech extension.
- apps/speech: expo-router demo (mirrors apps/nlp) with a real-time mic VAD
  screen via react-native-audio-api.
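As an illustration of the thresholding / min-duration / padding / merge stage listed above, a rough sketch might look like the following. The option names, the `Segment` type, and the order of the steps are assumptions made for this example; the actual vad.ts may differ.

```typescript
// Hypothetical segment type, boundaries in seconds.
interface Segment {
  start: number;
  end: number;
}

// Hypothetical option names; the real VADOptions fields are not shown here.
interface PostprocessOptions {
  threshold: number; // speech probability cutoff
  minDuration: number; // drop segments shorter than this (seconds)
  padding: number; // extend each kept segment on both sides (seconds)
  frameDuration: number; // seconds covered by one frame step
}

function probsToSegments(
  probs: ArrayLike<number>,
  opts: PostprocessOptions,
): Segment[] {
  const { threshold, minDuration, padding, frameDuration } = opts;

  // 1. Thresholding: collect runs of consecutive speech frames.
  const raw: Segment[] = [];
  let startFrame = -1;
  for (let i = 0; i <= probs.length; i++) {
    // The extra iteration at i === probs.length closes a trailing run.
    const isSpeech = i < probs.length && probs[i] >= threshold;
    if (isSpeech && startFrame < 0) {
      startFrame = i;
    } else if (!isSpeech && startFrame >= 0) {
      raw.push({ start: startFrame * frameDuration, end: i * frameDuration });
      startFrame = -1;
    }
  }

  // 2-4. Drop short runs, pad the rest, and merge overlapping neighbours.
  const total = probs.length * frameDuration;
  const merged: Segment[] = [];
  for (const seg of raw) {
    if (seg.end - seg.start < minDuration) continue;
    const padded: Segment = {
      start: Math.max(0, seg.start - padding),
      end: Math.min(total, seg.end + padding),
    };
    const last = merged[merged.length - 1];
    if (last !== undefined && padded.start <= last.end) {
      last.end = Math.max(last.end, padded.end);
    } else {
      merged.push(padded);
    }
  }
  return merged;
}
```

Padding is clamped to the audio bounds, and two padded segments that touch or overlap are merged into one.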
…arams into model config

Frame geometry (sample rate, window/hop, FFT size, pre-emphasis, min frames)
is FSMN-specific and now lives on VADModel.featureConfig (supplied by the models
registry) instead of hardcoded constants in the task. The pipeline and streamer
are parameterized by it; detection thresholds remain generic VADOptions.
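A sketch of how framing could be parameterized by such a config is shown below. The `FeatureConfig` field names are hypothetical (the commit lists the parameters but not their names), and the choices of per-frame pre-emphasis, a symmetric Hann window, and applying pre-emphasis before windowing are assumptions made for the example rather than a description of the FSMN front end.

```typescript
// Hypothetical field names for the frame geometry described above.
interface FeatureConfig {
  sampleRate: number; // Hz
  windowSize: number; // samples per frame
  hopSize: number; // samples between frame starts
  preEmphasis: number; // pre-emphasis coefficient, e.g. 0.97
}

// Cut the signal into overlapping frames, apply per-frame pre-emphasis
// y[n] = x[n] - a * x[n - 1], then multiply by a symmetric Hann window.
function frameSignal(samples: Float32Array, cfg: FeatureConfig): Float32Array[] {
  const { windowSize, hopSize, preEmphasis } = cfg;
  if (windowSize < 2 || hopSize <= 0) {
    throw new Error("windowSize must be >= 2 and hopSize must be positive");
  }

  // Precompute the Hann window: w[n] = 0.5 - 0.5 * cos(2 * pi * n / (N - 1)).
  const hann = new Float32Array(windowSize);
  for (let n = 0; n < windowSize; n++) {
    hann[n] = 0.5 - 0.5 * Math.cos((2 * Math.PI * n) / (windowSize - 1));
  }

  const frames: Float32Array[] = [];
  for (let start = 0; start + windowSize <= samples.length; start += hopSize) {
    const frame = new Float32Array(windowSize);
    for (let n = 0; n < windowSize; n++) {
      const x = samples[start + n];
      // For the first sample of a frame, reuse x as its own predecessor
      // (an assumption; other conventions use zero or the previous frame).
      const prev = n > 0 ? samples[start + n - 1] : x;
      frame[n] = (x - preEmphasis * prev) * hann[n];
    }
    frames.push(frame);
  }
  return frames;
}
```

Keeping the geometry in a config object means the same framing code could serve another model with a different window or hop, while detection thresholds stay in the generic options.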

Labels

feature (PRs that implement a new feature), refactoring

Development

Successfully merging this pull request may close these issues.

[RNE Rewrite] Speech - add VAD pipeline implementation

1 participant