Neural audio codecs are foundational to speech language models. It is expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring an ASR feature-assisted dual stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses less frames at information-sparse regions through adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments on 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec excels over baseline systems in semantic information preservation and delivers a high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS.

Method

Method	Frame Rate (Hz)	Dynamic Rate	Controllable Rate	Semantic Augmentation	Has TTS Results?
DAC	75	✗	✗	✗	✗
SpeechTokenizer	50	✗	✗	SSL feature (HuBERT)	✓
CodecSlime	40 (Avg)	✓	✓ (40,50,67,80Hz options)	✗	✗
Mimi	12.5	✗	✗	SSL feature (WavLM)	✓
DualCodec	12.5/25	✗	✗	SSL feature (w2v-bert-2)	✓
TaDiCodec	6.25	✗	✗	Text	✓
Phoneme Tokens	11.7 (Avg)	✓	✗	-	-
BPE Text Tokens	4.5 (Avg)	✓	✗	-	-
FlexiCodec	6.25/8.3/12.5 (Avg)	✓	✓ (Any from 3 to 12.5Hz)	ASR feature	✓

The following figure shows a zoomed-in view of the frame merging and unmerging module, which enables dynamic frame rate. Each has a transformer inside. A merging "threshold" parameter controls the intensity of frame merging. A lower threshold results in a lower average frame rate (we can only report "average" frame rate because the frame rate is not fixed). During inference, users can select from a threshold from 0.7 to 1.0, to use a frame rate (3.0 to 12.5Hz) they want. In the following sections, we provide demo audios for our experiments.

Examining the impact of very low frame rates

We first investigate the performance of representative audio codecs at very low frame rates. We created three new baseline versions by retraining DAC and DualCodec to operate at 12.5Hz, 8.3Hz, and 6.25Hz, respectively.

Semantic Information Preservation

The following audios are the codec-reconstructed audios using their RVQ-1 tokens. For these RVQ-1 audios, we focus on their intelligibility instead of the acoustic details, because at such low frame rate and bitrate, it is not possible to reconstruct the acoustic details, but we can maintain the core semantic information that are useful for downstream models.

Model	6.25Hz (RVQ1)	8.3Hz (RVQ1)	12.5Hz (RVQ1)
Reference Text	if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec
DualCodec
DAC
GroundTruth

The audios above show that FlexiCodec audios are more intelligible than other codecs especially at lower frame rates like 6.25Hz. This is confirmed by using a speech recognition to compute the WERs of the audios, as illustrated below. The rich semantic information of FlexiCodec RVQ-1 tokens is helpful for the downstream models (especially the AR models) to generate more intlligible speech.

Acoustic Quality Evaluation

Audio codecs can use more RVQ layers to reconstruct acoustic details. The following audios are reconstructed audios using 8 RVQ layers for each codec variant. You may not easily hear the difference, so we attach a figure of the PESQ metric.

Model	6.25Hz (RVQ1:8)	8.3Hz (RVQ1:8)	12.5Hz (RVQ1:8)
Reference Text	if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec
DualCodec
DAC
GroundTruth

The results show that acoustic quality metrics show more moderate differences across systems and frame rates, compared to the dramatic differences in the previous semantic evaluation. We think that the acoustic fidelity is more constrained by bitrate, and because the bitrates of the three systems are at the same level, the difference is not pronounced.

Additional Codec-Reconstructed Audio Comparisons

The following audio comparisons showcase the quality differences between FlexiCodec and other open-source codec systems. These samples demonstrate how FlexiCodec performs compared to established codec architectures at different compression rates.

System	RVQ1:8	RVQ1
Reference Text	if you will give us your promise to meet captain battleax here at this time to morrow we will stretch a point and delay the departure of the john bright for twenty four hours
FlexiCodec@12.5Hz	(1.3kbps)	(0.23kbps)
FlexiCodec@8.3Hz	(0.85kbps)	(0.15kbps)
FlexiCodec@6.25Hz	(0.64kbps)	(0.11kbps)
GroundTruth
Encodec	(6kbps)	(1.5kbps)
Mimi	(1.1kbps)	N/A
SNAC	(0.98kbps)	N/A
XCodec2	(0.8kbps)	Same as left
XYTokenizer	(1.0kbps)	N/A
DualCodec-12.5Hz	(1.2kbps)	(0.19kbps)
SpeechTokenizer	(4.0kbps)	(0.5kbps)
TaDiCodec	(>0.15kbps, uses additional reference audio)	same as left
WavTokenizer	(0.90kbps)	same as left

We have compared Flexi-Codec with open-source neural audio codecs, spanning various bitrate and frame rates. We find that FlexiCodec has state-of-the-art acoustic quality (RVQ1:8) and semantic preservation (RVQ1) at various bitrate levels. And it is competitive to higher frame rate systems, such as SpeechTokenizer-50Hz, Encodec-75Hz, WavTokenizer-75Hz, etc.

System	RVQ1 BR(kbps)	RVQ1:8 BR(kbps)/n_q	Param	Semantic Test		Acoustic Test (RVQ1:8)
System	RVQ1 BR(kbps)	RVQ1:8 BR(kbps)/n_q	Param	WER(RVQ1)↓	WER(RVQ1:8)↓	PESQ↑	UTMOS↑	MCD↓	SIM↑
> 1kbps Acoustic Bitrate
DAC-75Hz	0.75	6.0 / 8q	74M	31.2	2.27	3.77	3.62	2.34	0.90
Encodec-75Hz	1.50	6.0 / 8q	15M	5.90	2.24	3.12	3.01	2.60	0.89
SpeechTokenizer-50Hz	0.50	4.0 / 8q	103M	5.56	2.47	3.01	3.90	3.17	0.85
Mimi-12.5Hz	-	1.1 / 8q	78M	-	3.15	2.75	3.56	3.62	0.73
DualCodec-12.5Hz	0.19	1.2 / 8q	84M	5.93	2.26	3.29	4.18	2.81	0.85
XYTokenizer-12.5Hz	-	1.0 / 8q	520M	-	2.36	3.00	4.00	3.28	0.84
FlexiCodec @12.5Hz	0.23	1.3 / 8q	216M	2.76	2.23	3.35	4.22	2.76	0.85
∼0.8kbps Acoustic Bitrate
WavTokenizer-75Hz	0.90	0.90 / 1q	81M	4.57	4.57	2.86	3.98	3.51	0.68
SNAC-12,23,47Hz	-	0.98 / 3q	20M	-	4.21	2.51	3.43	3.61	0.67
XCodec2-50Hz	0.80	0.80 / 1q	210M	2.80	2.80	2.77	4.08	3.65	0.82
TS3Codec(X2)-50Hz	0.85	0.85 / 1q	204M	4.09	4.09	2.80	3.80	3.38	0.68
FlexiCodec @8.3Hz	0.15	0.85 / 8q	216M	2.98	2.28	3.03	4.21	3.10	0.78
<0.7kbps Acoustic Bitrate
TS3Codec(X4)-40Hz	0.68	0.68 / 1q	204M	5.14	5.14	2.58	3.67	3.65	0.63
TaDiCodec-6.25Hz	0.15	0.15 / 1q	751M	4.32	4.32	1.73	4.05	9.75	0.83
FlexiCodec @6.25Hz	0.11	0.64 / 8q	216M	4.15	2.53	2.76	4.18	3.42	0.71

Text-to-Speech

We built a text-to-speech system with FlexiCodec. This system is an AR + NAR(flow matching) TTS. The AR predicts each FlexiCodec token and its frame length using two parallel prediction heads. Because the AR model is usually the slowest part in TTS, it can greatly benefit from FlexiCodec's low frame rate. For the NAR, we tried two schemes, one is using the 50Hz male spectrogram feature, the second one is using the 12.5Hz FlexiCodec feature.

	Sample-1	Sample-2	Sample-3	Sample-4	Sample-5	Sample-6	Sample-7	Sample-8	Sample-9	Sample-10
Reference Text	Give me a check for a hundred and fifty, and I'll turn over to you the forged check and quash further proceedings.	I have been here this quarter of an hour," replied La Valliere.	What is the tumult and rioting"? cried out the Squire, authoritatively, and he blew twice on a silver whistle which hung at his belt.	You have come to us threatening us with absolute destruction.	The sound of an imperative and uncompromising bell recalled me in due time to the regions of reality.	He was soft hearted and impetuous," said Beth; "and, being in love, he didn't stop to count the cost".	The goat's warlike spirit was roused by this successful attack.	His conduct and presence of mind in this emergence appeared conspicuous.	The Nautilus nearly perishes in the Antarctic and Nemo sinks into a growing depression.	Jack would become Eva's happy husband, and would remain amidst the hurried duties of the eager world.
Ground Truth
FlexiCodec-TTS (6.25Hz AR, 50Hz NAR)
FlexiCodec-TTS (8.3Hz AR, 50Hz NAR)
FlexiCodec-TTS (12.5Hz AR, 50Hz NAR)
CosyVoice
FlexiCodec-TTS (6.25Hz AR, 12.5Hz NAR)
FlexiCodec-TTS (8.3Hz AR, 12.5Hz NAR)
FlexiCodec-TTS (12.5Hz AR, 12.5Hz NAR)
SparkTTS
FireRedTTS

We find that FlexiCodec-TTS delivers strong quality at less compute than baselines. With 6.25Hz AR + 50Hz NAR, it matches or surpasses CosyVoice and SparkTTS. A 50Hz NAR consistently outperforms 12.5Hz across WER, SIM-0, NMOS, and QMOS, showing that higher temporal resolution is important for NAR. Lower AR rates (8.3/6.25Hz) preserve or slightly improve accuracy with 50Hz NAR—likely due to better semantic-acoustic disentanglement and shorter sequences that simplify attention.

Frame rate controllability analysis

This section demonstrates FlexiCodec's frame rate controllability through different merging threshold settings. The threshold parameter controls the dynamic frame rate mechanism, where lower thresholds lead to more aggressive frame merging and lower effective frame rates. As can be heard from the audios. the intelligibility is adequate at > 4.5Hz, but becomes unusable for 3.6Hz and 3.0Hz.

Threshold / Avg frame rate	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5	Sample 6	Sample 7	Sample 8	Sample 9	Sample 10
Reference Text	you were quite right to say no ambrose began never smoke with john jago his cigars will poison you	paul declares that the false apostles were called or sent neither by men nor by man	his hat had a peaked crown and a flat brim and around the brim was a row of tiny golden bells that tinkled when he moved	i reside in the marais rue de douze portes	the utility of consumption as an evidence of wealth is to be classed as a derivative growth	and there's linen in the house as i could well spare you for i've got lots o sheeting and table clothing and towelling as isn't made up	but i wrestled with this fellow and do know that he played unfairly in the second bout	of course he reflected she always had that combination of something homely and sensible and something utterly wild and daft	to all these inquiries the count responded in the affirmative	yes something everything said rachel hurriedly looking frowningly at a flower which she was twirling in her fingers
1.00 (12.5Hz)
0.90 (7.9Hz)
0.80 (4.5Hz)
0.75 (3.6Hz)
0.70 (3.0Hz)
Ground Truth

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates