FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

Architecture of FlexiCodec. The model employs dynamic frame rates and an ASR feature-assisted dual-stream encoder, and supports a controllable frame rate between 3Hz and 12.5Hz.

Abstract

Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS.

Method

Our motivation was to explore very low frame rate neural audio codecs, which make downstream tasks more efficient. Our final system, FlexiCodec, is a novel low-frame-rate codec built on three design choices: (1) dynamic frame rate, (2) rich semantics guided by ASR features, and (3) frame rate controllability. These choices allow the codec to operate at very low frame rates, and the controllable frame rate accommodates different efficiency-performance tradeoffs. A general comparison between different audio tokenization methods is shown in the table below.
Method Frame Rate (Hz) Dynamic Rate Controllable Rate Semantic Augmentation Has TTS Results?
DAC 75
SpeechTokenizer 50 SSL feature (HuBERT)
CodecSlime 40 (Avg) ✓ (40,50,67,80Hz options)
Mimi 12.5 SSL feature (WavLM)
DualCodec 12.5/25 SSL feature (w2v-bert-2)
TaDiCodec 6.25 Text
Phoneme Tokens 11.7 (Avg) - -
BPE Text Tokens 4.5 (Avg) - -
FlexiCodec 6.25/8.3/12.5 (Avg) ✓ (Any from 3 to 12.5Hz) ASR feature
Comparison of different audio tokenization methods and their properties.

The following figure shows a zoomed-in view of the frame merging and unmerging modules, which enable the dynamic frame rate. Each module contains a Transformer. A merging "threshold" parameter controls the intensity of frame merging: a lower threshold results in a lower average frame rate (we can only report an "average" frame rate because the frame rate is not fixed). During inference, users can select a threshold between 0.7 and 1.0 to obtain the frame rate they want (3.0 to 12.5Hz). In the following sections, we provide demo audios for our experiments.
FlexiCodec merging and unmerging architecture
The system adaptively merges semantically similar frames during encoding and reconstructs the original temporal resolution during decoding.
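
The merge and unmerge behaviour described above can be illustrated with a minimal Python sketch. It is not the FlexiCodec implementation: the actual modules contain Transformers, whereas this sketch assumes plain mean pooling, a greedy comparison of adjacent frames by cosine similarity, and a 12.5Hz base frame rate (the rate reported at threshold 1.0). The function names `merge_frames`, `unmerge_frames`, and `average_frame_rate` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_frames(frames, threshold):
    """Greedily group adjacent frames whose similarity exceeds the threshold.

    Assumption: a frame joins the current group when its cosine similarity
    to the previous frame is strictly greater than `threshold`; each group
    is then mean-pooled into one merged frame.
    """
    if not frames:
        return [], []
    groups = [[frames[0]]]
    for prev, cur in zip(frames, frames[1:]):
        if cosine(prev, cur) > threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    merged = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    lengths = [len(g) for g in groups]
    return merged, lengths


def unmerge_frames(merged, lengths):
    """Restore the original temporal resolution by repeating merged frames."""
    return [list(vec) for vec, n in zip(merged, lengths) for _ in range(n)]


def average_frame_rate(lengths, base_rate_hz=12.5):
    """Average frame rate after merging, given the group lengths."""
    return base_rate_hz * len(lengths) / sum(lengths)


# Toy input: two near-identical frames followed by two identical frames.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.0, 1.0]]
merged, lengths = merge_frames(frames, threshold=0.9)
restored = unmerge_frames(merged, lengths)
```

With this toy input, a threshold of 0.9 merges the four frames into two groups (an average rate of 6.25Hz), while a threshold of 1.0 merges nothing and keeps the rate at 12.5Hz.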

Examining the impact of very low frame rates

We first investigate the performance of representative audio codecs at very low frame rates. We created three new baseline versions by retraining DAC and DualCodec to operate at 12.5Hz, 8.3Hz, and 6.25Hz, respectively.

Semantic Information Preservation

The following audios are reconstructed by each codec using only its RVQ-1 tokens. For these RVQ-1 audios, we focus on intelligibility rather than acoustic detail: at such a low frame rate and bitrate it is not possible to reconstruct the acoustic details, but we can preserve the core semantic information that is useful for downstream models.

Model 6.25Hz (RVQ1) 8.3Hz (RVQ1) 12.5Hz (RVQ1)
Reference Text if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec
DualCodec
DAC
GroundTruth
Semantic reconstruction demos: audio reconstructed using RVQ-1 tokens at several low frame rate settings.

The audios above show that FlexiCodec audios are more intelligible than those of the other codecs, especially at lower frame rates such as 6.25Hz. This is confirmed by using a speech recognition model to compute the WERs of the audios, as illustrated below. The rich semantic information in FlexiCodec RVQ-1 tokens helps downstream models (especially AR models) generate more intelligible speech.

RVQ1 Chart Analysis
Quantitative analysis of RVQ-1 token performance across different frame rates and codec architectures.
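
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference text and the recognizer's transcript, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script used for these results; the function name `word_error_rate` is hypothetical, and the ASR step that produces the hypothesis is assumed to happen elsewhere.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
example = word_error_rate("we will stretch a point", "we will stretch the point")
```

In this example one of the five reference words is substituted, so the WER is 0.2.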

Acoustic Quality Evaluation

Audio codecs can use more RVQ layers to reconstruct acoustic details. The following audios are reconstructed using 8 RVQ layers for each codec variant. The differences may be hard to hear, so we also include a figure of the PESQ metric.

Model 6.25Hz (RVQ1:8) 8.3Hz (RVQ1:8) 12.5Hz (RVQ1:8)
Reference Text if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec
DualCodec
DAC
GroundTruth
Audio reconstruction quality across different frame rates using RVQ-1:8 tokens.
RVQ8 Chart Analysis
Quantitative analysis of RVQ-8 token performance across different frame rates and codec architectures.

The acoustic quality metrics show more moderate differences across systems and frame rates than the dramatic differences seen in the previous semantic evaluation. We think that acoustic fidelity is constrained more by bitrate; because the bitrates of the three systems are at the same level, the difference is not pronounced.

Additional Codec-Reconstructed Audio Comparisons

The following audio comparisons showcase the quality differences between FlexiCodec and other open-source codec systems. These samples demonstrate how FlexiCodec performs compared to established codec architectures at different compression rates.

System RVQ1:8 RVQ1
Reference Text if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec@12.5Hz (1.3kbps) (0.23kbps)
FlexiCodec@8.3Hz (0.85kbps) (0.15kbps)
FlexiCodec@6.25Hz (0.64kbps) (0.11kbps)
GroundTruth
Encodec (6kbps) (1.5kbps)
Mimi (1.1kbps) N/A
SNAC (0.98kbps) N/A
XCodec2 (0.8kbps) Same as left
XYTokenizer (1.0kbps) N/A
DualCodec-12.5Hz (1.2kbps) (0.19kbps)
SpeechTokenizer (4.0kbps) (0.5kbps)
TaDiCodec (>0.15kbps, uses additional reference audio) Same as left
WavTokenizer (0.90kbps) Same as left
Codec reconstructed audio comparison.
We compared FlexiCodec with open-source neural audio codecs spanning various bitrates and frame rates. We find that FlexiCodec has state-of-the-art acoustic quality (RVQ1:8) and semantic preservation (RVQ1) at various bitrate levels. It is also competitive with higher frame rate systems such as SpeechTokenizer-50Hz, Encodec-75Hz, and WavTokenizer-75Hz.
System RVQ1 BR(kbps) RVQ1:8 BR(kbps)/n_q Param Semantic Test Acoustic Test (RVQ1:8)
WER(RVQ1)↓ WER(RVQ1:8)↓ PESQ↑ UTMOS↑ MCD↓ SIM↑
> 1kbps Acoustic Bitrate
DAC-75Hz 0.75 6.0 / 8q 74M 31.2 2.27 3.77 3.62 2.34 0.90
Encodec-75Hz 1.50 6.0 / 8q 15M 5.90 2.24 3.12 3.01 2.60 0.89
SpeechTokenizer-50Hz 0.50 4.0 / 8q 103M 5.56 2.47 3.01 3.90 3.17 0.85
Mimi-12.5Hz - 1.1 / 8q 78M - 3.15 2.75 3.56 3.62 0.73
DualCodec-12.5Hz 0.19 1.2 / 8q 84M 5.93 2.26 3.29 4.18 2.81 0.85
XYTokenizer-12.5Hz - 1.0 / 8q 520M - 2.36 3.00 4.00 3.28 0.84
FlexiCodec @12.5Hz 0.23 1.3 / 8q 216M 2.76 2.23 3.35 4.22 2.76 0.85
∼0.8kbps Acoustic Bitrate
WavTokenizer-75Hz 0.90 0.90 / 1q 81M 4.57 4.57 2.86 3.98 3.51 0.68
SNAC-12,23,47Hz - 0.98 / 3q 20M - 4.21 2.51 3.43 3.61 0.67
XCodec2-50Hz 0.80 0.80 / 1q 210M 2.80 2.80 2.77 4.08 3.65 0.82
TS3Codec(X2)-50Hz 0.85 0.85 / 1q 204M 4.09 4.09 2.80 3.80 3.38 0.68
FlexiCodec @8.3Hz 0.15 0.85 / 8q 216M 2.98 2.28 3.03 4.21 3.10 0.78
<0.7kbps Acoustic Bitrate
TS3Codec(X4)-40Hz 0.68 0.68 / 1q 204M 5.14 5.14 2.58 3.67 3.65 0.63
TaDiCodec-6.25Hz 0.15 0.15 / 1q 751M 4.32 4.32 1.73 4.05 9.75 0.83
FlexiCodec @6.25Hz 0.11 0.64 / 8q 216M 4.15 2.53 2.76 4.18 3.42 0.71
Comparison between FlexiCodec and other open-source neural audio codecs.

Text-to-Speech

We built a text-to-speech system with FlexiCodec. The system is an AR + NAR (flow matching) TTS. The AR model predicts each FlexiCodec token and its frame length using two parallel prediction heads. Because the AR model is usually the slowest part of a TTS system, it benefits greatly from FlexiCodec's low frame rate. For the NAR model, we tried two schemes: one uses the 50Hz mel spectrogram feature, and the other uses the 12.5Hz FlexiCodec feature.
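
To make the efficiency argument concrete, the sketch below counts AR decoding steps at different average frame rates and expands (token, frame length) pairs back onto a fixed-rate timeline. It is illustrative only: the names `ar_steps` and `expand_tokens`, the toy token IDs, and the assumption that frame lengths are counted in 12.5Hz frames are made up for this example and are not confirmed details of the system.

```python
def ar_steps(duration_s, avg_frame_rate_hz):
    """Expected number of AR decoding steps, assuming one step per token."""
    return duration_s * avg_frame_rate_hz


def expand_tokens(tokens, lengths):
    """Repeat each token by its predicted frame length.

    Assumption: the two prediction heads yield one (token, length) pair per
    AR step, with lengths counted in fixed-rate frames.
    """
    if len(tokens) != len(lengths):
        raise ValueError("tokens and lengths must have the same size")
    timeline = []
    for token, n in zip(tokens, lengths):
        timeline.extend([token] * n)
    return timeline


# A 10-second utterance at two average frame rates.
steps_at_12_5hz = ar_steps(10.0, 12.5)  # 125.0 steps
steps_at_6_25hz = ar_steps(10.0, 6.25)  # 62.5 steps on average

# Three AR steps covering six fixed-rate frames (toy token IDs).
timeline = expand_tokens([17, 4, 92], [1, 3, 2])  # [17, 4, 4, 4, 92, 92]
```

Under these assumptions, halving the average frame rate halves the number of AR steps needed for the same audio duration.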

Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6 Sample-7 Sample-8 Sample-9 Sample-10
Reference Text Give me a check for a hundred and fifty, and I'll turn over to you the forged check and quash further proceedings. I have been here this quarter of an hour," replied La Valliere. What is the tumult and rioting"? cried out the Squire, authoritatively, and he blew twice on a silver whistle which hung at his belt. You have come to us threatening us with absolute destruction. The sound of an imperative and uncompromising bell recalled me in due time to the regions of reality. He was soft hearted and impetuous," said Beth; "and, being in love, he didn't stop to count the cost". The goat's warlike spirit was roused by this successful attack. His conduct and presence of mind in this emergence appeared conspicuous. The Nautilus nearly perishes in the Antarctic and Nemo sinks into a growing depression. Jack would become Eva's happy husband, and would remain amidst the hurried duties of the eager world.
Ground Truth
FlexiCodec-TTS (6.25Hz AR, 50Hz NAR)
FlexiCodec-TTS (8.3Hz AR, 50Hz NAR)
FlexiCodec-TTS (12.5Hz AR, 50Hz NAR)
CosyVoice
FlexiCodec-TTS (6.25Hz AR, 12.5Hz NAR)
FlexiCodec-TTS (8.3Hz AR, 12.5Hz NAR)
FlexiCodec-TTS (12.5Hz AR, 12.5Hz NAR)
SparkTTS
FireRedTTS

Text-to-Speech demos across models. Columns are different text samples.

We find that FlexiCodec-TTS delivers strong quality with less compute than the baselines. With 6.25Hz AR + 50Hz NAR, it matches or surpasses CosyVoice and SparkTTS. A 50Hz NAR consistently outperforms a 12.5Hz NAR across WER, SIM-0, NMOS, and QMOS, showing that higher temporal resolution is important for the NAR stage. Lower AR rates (8.3/6.25Hz) preserve or slightly improve accuracy with the 50Hz NAR, likely due to better semantic-acoustic disentanglement and shorter sequences that simplify attention.

Frame rate controllability analysis

This section demonstrates FlexiCodec's frame rate controllability through different merging threshold settings. The threshold parameter controls the dynamic frame rate mechanism: lower thresholds lead to more aggressive frame merging and lower effective frame rates. As can be heard in the audios, intelligibility is adequate at frame rates above 4.5Hz but becomes unusable at 3.6Hz and 3.0Hz.

Threshold / Avg frame rate Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7 Sample 8 Sample 9 Sample 10
Reference Text you were quite right to say no ambrose began never smoke with john jago his cigars will poison you paul declares that the false apostles were called or sent neither by men nor by man his hat had a peaked crown and a flat brim and around the brim was a row of tiny golden bells that tinkled when he moved i reside in the marais rue de douze portes the utility of consumption as an evidence of wealth is to be classed as a derivative growth and there's linen in the house as i could well spare you for i've got lots o sheeting and table clothing and towelling as isn't made up but i wrestled with this fellow and do know that he played unfairly in the second bout of course he reflected she always had that combination of something homely and sensible and something utterly wild and daft to all these inquiries the count responded in the affirmative yes something everything said rachel hurriedly looking frowningly at a flower which she was twirling in her fingers
1.00 (12.5Hz)
0.90 (7.9Hz)
0.80 (4.5Hz)
0.75 (3.6Hz)
0.70 (3.0Hz)
Ground Truth
FlexiCodec-reconstructed RVQ-1 audios using different merging thresholds. Lower threshold values (0.7, 0.75, 0.8) result in more aggressive frame merging and lower effective frame rates, while higher thresholds (0.90, 1.0) preserve more temporal detail.
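
A user who wants a particular average frame rate needs to pick a threshold. One rough way to do so, sketched below, is to interpolate linearly between the (threshold, average frame rate) pairs listed in the table above. The linear interpolation and the function name `threshold_for_rate` are assumptions for illustration; the true relationship depends on the audio content, so the result is only a starting point.

```python
# (threshold, average frame rate in Hz) pairs taken from the table above.
MEASURED = [(0.70, 3.0), (0.75, 3.6), (0.80, 4.5), (0.90, 7.9), (1.00, 12.5)]


def threshold_for_rate(target_hz):
    """Estimate a merging threshold for a target average frame rate.

    Assumption: the frame rate varies linearly with the threshold between
    neighbouring measured points.
    """
    lowest, highest = MEASURED[0][1], MEASURED[-1][1]
    if not lowest <= target_hz <= highest:
        raise ValueError(f"target must lie within {lowest} to {highest} Hz")
    for (t0, r0), (t1, r1) in zip(MEASURED, MEASURED[1:]):
        if r0 <= target_hz <= r1:
            return t0 + (t1 - t0) * (target_hz - r0) / (r1 - r0)


# Halfway between the 4.5Hz and 7.9Hz measurements.
estimate = threshold_for_rate(6.2)  # approximately 0.85
```

Targets outside the measured 3.0 to 12.5Hz range raise an error rather than extrapolating.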