Technical notes on speech, language, and machine learning.
Autoregressive Models for Speech
Encoder/decoder architecture, RVQ with EMA codebook updates, MS-STFT discriminator, loss balancer, streaming vs. non-streaming modes, variable bitrate, and ablation results.
RVQ mechanics, codebook delay pattern, semantic vs. acoustic token comparison, codebook collapse, exposure bias, streaming implementation, and EnCodec vs. DAC vs. Mimi.