Canary-v2 streamatt speech processor

Hello, 

I implemented support for [decoding_input_ids](https://github.com/NVIDIA-NeMo/NeMo/pull/15449) and tested that it works. Although the PR is not merged yet, I'm working towards contributing Canary with streamatt to the repo.

I have a question about the implementation of the audio history. From the base_streamatt implementation, I see that the audio is supposed to be stored as mel features. Although it is possible to extract the mel features first in the NeMo framework, it's much easier to work with a raw waveform history. I was considering two options for tweaking the implementation so that updating a raw history is supported:
1. Set self.audio_subsampling_factor to subsampling_factor * MEL_HOP_SAMPLES. This maps one encoder frame to the corresponding number of raw samples. My concern is that this changes the meaning of the attribute, since it would no longer be a pure subsampling factor.
2. Override _update_audio_history completely, at the cost of duplicating code from the base class.
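To make option 1 concrete, here is a minimal sketch of the unit conversion I have in mind. It does not use the actual base_streamatt code: `ToyAudioHistory`, `units_per_encoder_frame`, and `drop_encoder_frames` are hypothetical names, and the values 8 and 160 are assumptions for illustration (8x encoder subsampling, 10 ms hop at 16 kHz).

```python
from dataclasses import dataclass, field

# Assumed values for illustration only (not taken from the repo):
SUBSAMPLING_FACTOR = 8   # mel frames per encoder frame
MEL_HOP_SAMPLES = 160    # raw samples per mel frame (10 ms at 16 kHz)


@dataclass
class ToyAudioHistory:
    """Hypothetical stand-in for a processor that keeps an audio history."""

    # Number of history elements (mel frames or raw samples) per encoder frame.
    units_per_encoder_frame: int
    history: list = field(default_factory=list)

    def append(self, chunk) -> None:
        """Add new audio units to the end of the history."""
        self.history.extend(chunk)

    def drop_encoder_frames(self, n_frames: int) -> None:
        """Discard the oldest n_frames encoder frames' worth of audio."""
        cut = n_frames * self.units_per_encoder_frame
        self.history = self.history[cut:]


# Mel-feature history: one encoder frame covers SUBSAMPLING_FACTOR mel frames.
mel_history = ToyAudioHistory(SUBSAMPLING_FACTOR)
# Raw-waveform history: one encoder frame covers factor * hop raw samples.
raw_history = ToyAudioHistory(SUBSAMPLING_FACTOR * MEL_HOP_SAMPLES)

mel_history.append(range(100))     # 100 mel frames
raw_history.append(range(16000))   # 1 s of raw audio at 16 kHz

# Dropping the same number of encoder frames trims both histories consistently.
mel_history.drop_encoder_frames(5)
raw_history.drop_encoder_frames(5)

mel_left = len(mel_history.history)
raw_left = len(raw_history.history)
```

In this sketch the trimming logic is identical for both histories and only the per-frame unit count differs, which is why option 1 looks attractive to me despite the naming concern.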

Which approach would you prefer, or is there a better way to implement this?

Canary-v2 streamatt speech processor #28

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Canary-v2 streamatt speech processor #28

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions