
Faster streaming support #137

Open

@ameenba

Have you tried building the spectrogram and encoder output in smaller chunks and appending them? I think the spectrogram can be generated incrementally with minimal noise, depending on the chunk size, and the encoder output can also be appended given sufficiently large chunks.

So the encoder instead takes Bx80xN as its input and outputs Bx(N/2)x(embedding size). If you wanted to send 1 s of audio into the tiny.en model, for example: 1x80x100 -> 1x50x384. This should make processing much faster for short clips (when the audio is <30 s) and allows real-time streaming without much wasted computation (no need to compute a full x1500 encoding for each chunk of audio).
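A minimal sketch of the shape arithmetic above, using numpy and a hypothetical stand-in for the tiny.en encoder (a mean-pool over pairs of frames plus a random 80->384 projection stands in for the real stride-2 conv stack; only the shapes are meant to match the proposal):

```python
import numpy as np

N_MELS, EMBED = 80, 384  # tiny.en: 80 mel bins, 384-dim embedding
rng = np.random.default_rng(0)
W = rng.standard_normal((N_MELS, EMBED)) * 0.01  # dummy projection, not real weights

def encode_chunk(mel):
    """Stand-in encoder: (B, 80, N) -> (B, N/2, 384), halving the time axis."""
    pooled = mel.reshape(mel.shape[0], N_MELS, -1, 2).mean(-1)  # (B, 80, N/2)
    return np.einsum('bmt,me->bte', pooled, W)                  # (B, N/2, 384)

# 1 s of audio at 100 mel frames/s: 1x80x100 in, 1x50x384 out.
chunk = rng.standard_normal((1, N_MELS, 100))
out = encode_chunk(chunk)

# Streaming: encode each 1 s chunk as it arrives and append along time,
# instead of recomputing a full 1500-frame encoding every time.
stream = np.concatenate(
    [encode_chunk(rng.standard_normal((1, N_MELS, 100))) for _ in range(3)],
    axis=1)  # (1, 150, 384) after 3 s
```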

Some noise may be introduced at certain chunk sizes (the spectrogram chunk size can be independent of the encoder chunk size), and overlapping the spectrogram input/encoder output may help reduce that noise further. This also allows better scheduling on deployment, where the decoder, encoder, and spectrogram can run on different threads at the same time to produce the transcription.
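One way the overlap idea could look in practice: process each chunk with a few frames of context on either side, then discard the context when appending, so each chunk sees its neighbours but the assembled output has no duplicated frames. The chunk and overlap sizes below are arbitrary illustrative values:

```python
import numpy as np

def chunk_with_overlap(mel, chunk=100, overlap=10):
    """Yield (offset, slice) pairs over the time axis of mel (..., T).

    Each slice carries up to `overlap` extra context frames on each side;
    `offset` marks where the chunk's "real" region starts inside the slice.
    """
    T = mel.shape[-1]
    for start in range(0, T, chunk):
        lo = max(0, start - overlap)
        hi = min(T, start + chunk + overlap)
        yield start - lo, mel[..., lo:hi]

# Reassemble by keeping only each chunk's non-overlapping core.
mel = np.arange(80 * 300, dtype=float).reshape(1, 80, 300)
pieces = [sl[..., off:off + 100] for off, sl in chunk_with_overlap(mel)]
out = np.concatenate(pieces, axis=-1)  # identical to mel, no duplicated frames
```

In a real pipeline the slice would be fed through the spectrogram/encoder before trimming, so the context frames reduce edge noise without appearing twice in the appended encoding.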

Choosing when to decode will be another challenge, since you don't want to decode while a word is still incomplete in the encoding, but there are definitely solutions around that as well.
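One simple heuristic for that (an assumption on my part, not something from the model): only decode up to the end of the last sufficiently long low-energy run in the mel frames, on the theory that a pause is unlikely to fall mid-word. A toy sketch:

```python
import numpy as np

def safe_boundary(energy, thresh, min_gap):
    """Return the frame index just past the last run of >= min_gap quiet
    frames, i.e. the furthest point it seems safe to decode up to.
    Returns 0 if no such pause has been seen yet."""
    quiet = energy < thresh
    run, best = 0, 0
    for i, q in enumerate(quiet):
        run = run + 1 if q else 0
        if run >= min_gap:
            best = i + 1  # pause long enough; decoding up to here is "safe"
    return best

# Loud speech, a 15-frame pause, then more speech: decode only up to
# the end of the pause, and hold the trailing (possibly mid-word) frames.
energy = np.concatenate([np.ones(50), np.zeros(15), np.ones(35)])
b = safe_boundary(energy, thresh=0.5, min_gap=10)
```

A real implementation would probably use a proper VAD rather than raw frame energy, but the scheduling idea is the same: the decoder thread waits for a boundary instead of decoding every chunk.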

Metadata

Labels: ideas (Interesting ideas for experimentation)
