
Autoregressive Image Generation with
Masked Bit Modeling

Amazon FAR (Frontier AI & Robotics)

Abstract

This paper challenges the dominance of continuous pipelines in visual generation. We systematically investigate the performance gap between discrete and continuous methods. Contrary to the belief that discrete tokenizers are intrinsically inferior, we demonstrate that the disparity arises primarily from the total number of bits allocated in the latent space (i.e., the compression ratio). We show that scaling up the codebook size effectively bridges this gap, allowing discrete tokenizers to match or surpass their continuous counterparts. However, existing discrete generation methods struggle to capitalize on this insight, suffering from performance degradation or prohibitive training costs with enlarged codebooks. To address this, we propose masked Bit AutoRegressive modeling (BAR), a scalable framework that supports arbitrary codebook sizes. By equipping an autoregressive transformer with a masked bit modeling head, BAR predicts discrete tokens by progressively generating their constituent bits. BAR achieves a state-of-the-art gFID of 0.99 on ImageNet-256, outperforming leading methods across both continuous and discrete paradigms while sampling significantly faster than prior continuous approaches.

Figure: Performance comparison across methods. BAR (masked Bit AutoRegressive modeling) achieves a superior quality-cost trade-off on ImageNet-256, outperforming leading methods across both discrete and continuous paradigms while significantly reducing sampling time.

Key Insights

01 The Performance Gap Isn't About Discrete vs. Continuous

Contrary to conventional wisdom, we demonstrate that discrete tokenizers can match or surpass continuous tokenizers when operating under comparable bit budgets. The perceived inferiority of discrete methods stems from insufficient bit allocation to the latent space, not from fundamental architectural limitations.

02 Scaling Codebook Size is Key

By scaling the codebook size from 2¹⁰ to 2³², we observe consistent reconstruction quality improvements, demonstrating that discrete tokenizers can match and surpass continuous methods with sufficient bit allocation.

03 Masked Bit Modeling Enables Efficient Scaling

Traditional approaches face prohibitive memory and computational costs with large vocabularies. BAR's masked bit modeling head predicts discrete tokens through iterative bit-wise unmasking, requiring only a small number of forward passes. This elegant solution allows discrete generation with arbitrarily large vocabularies while remaining computationally efficient.

04 State-of-the-Art Performance Across All Metrics

BAR-L achieves FID 0.99 on ImageNet-256, establishing a new state of the art across both discrete and continuous paradigms. BAR-B matches continuous RAE (FID 1.13) with only 415M parameters and outperforms leading diffusion models (DDT, FID 1.26) and prior discrete methods (RAR, FID 1.48; VAR, FID 1.92), while being substantially faster at inference time.

Discrete Tokenizers Beat Continuous Tokenizers

Scaling Codebook Size Enables Discrete to Outperform Continuous

We observe that discrete tokenizers generally exhibit worse reconstruction quality while using substantially fewer bits than continuous tokenizers. This prompts a critical question: Is the perceived inferiority intrinsic to the quantization bottleneck, or merely a consequence of insufficient bit allocation? By systematically scaling the codebook size from 2¹⁰ to 2³² using FSQ quantization, we demonstrate that discrete tokenizers' reconstruction quality improves consistently with increased bit budget. This leads to our core finding: the main performance bottleneck of discrete tokenizers lies in insufficient bit budget, and scaling up the codebook size enables discrete tokenization to outperform continuous approaches.

Scaling codebook size enables discrete tokenizers to surpass continuous baselines

Figure: Scaling BAR's discrete tokenizer (BAR-FSQ) with bit budget. Standard discrete methods (green circles) historically lag behind continuous baselines (blue circles) primarily due to restricted bit allocation. By systematically scaling the codebook size, BAR-FSQ (red curve) demonstrates that a discrete tokenizer's reconstruction performance is not inherently bounded: it matches and then surpasses continuous reconstruction fidelity as the bit budget increases, challenging the assumption that continuous latent spaces are required for high-fidelity reconstruction.
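
To make the bit-budget arithmetic concrete, here is a minimal Python sketch of how FSQ level choices translate into codebook size and bits per image. In FSQ the effective codebook size is the product of the per-channel level counts, so the bit budget per token is the sum of their log2 values; the level configurations and the 16×16 token grid below are illustrative assumptions, not the paper's exact settings.

import math

def fsq_budget(levels, grid=(16, 16)):
    """Bit budget implied by an FSQ tokenizer.

    levels: quantization levels per latent channel (codebook size = prod(levels))
    grid:   spatial token grid of the tokenizer (tokens per image = h * w)
    """
    codebook_size = math.prod(levels)           # number of distinct discrete tokens
    bits_per_token = math.log2(codebook_size)   # = sum(log2(L) for L in levels)
    tokens_per_image = grid[0] * grid[1]
    return codebook_size, bits_per_token, bits_per_token * tokens_per_image

# Illustrative level configurations only (not the paper's exact settings):
for levels in [(4,) * 5,     # 4^5  = 2^10 codebook, 10 bits per token
               (4,) * 8,     # 4^8  = 2^16 codebook, 16 bits per token
               (16,) * 8]:   # 16^8 = 2^32 codebook, 32 bits per token
    size, bits, total = fsq_budget(levels)
    print(f"levels={levels}: |V| = 2^{bits:.0f}, {total:.0f} bits per image")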

Discrete Autoregressive Models Beat Diffusion

The Vocabulary Scaling Problem

While scaling codebook size resolves the reconstruction bottleneck, it introduces a critical challenge: the vocabulary scaling problem. Standard autoregressive models with linear prediction heads face prohibitive computational costs when vocabularies expand to millions (2²⁰) or billions (2³⁰) of entries, making training unaffordable. Prior works typically cap codebook sizes at 2¹⁸ (262K), accepting a ceiling on reconstruction fidelity.
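
As a rough back-of-the-envelope comparison (the transformer width of 1024 below is an assumed value for illustration, not the paper's configuration), the weight matrix of a plain linear classification head grows linearly with the codebook size, whereas a head that predicts one logit per bit grows only with log₂ of it:

# Size of the output projection alone, assuming an illustrative hidden width of 1024.
hidden = 1024

for log2_vocab in (18, 20, 30):
    vocab = 2 ** log2_vocab
    linear_head = hidden * vocab        # O(C): one weight column per codebook entry
    bit_head = hidden * log2_vocab      # O(log2 C): one logit per bit
    print(f"|V| = 2^{log2_vocab}: linear head {linear_head * 4 / 2**30:,.0f} GiB fp32, "
          f"bit head {bit_head * 4 / 2**10:.0f} KiB fp32")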

Masked Bit Modeling: Prediction Head as a Bit Generator

To overcome this, we propose a paradigm shift: rather than treating token prediction as a massive classification task, we formulate it as a conditional generation task. Our Masked Bit Modeling (MBM) head generates the target discrete token via an iterative, bit-wise unmasking process conditioned on the autoregressive transformer's output. This design offers two key advantages: (1) Scalability—decomposing tokens into bits bypasses the need for a monolithic softmax, reducing memory complexity from O(C) to O(log₂ C); (2) Robustness—bit-wise masking acts as a strong regularizer, consistently improving generation quality across all codebook scales.
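
The PyTorch sketch below illustrates the kind of progressive bit-wise unmasking loop described above for a single next-token prediction. The mbm_head interface, the -1/0/+1 bit-state encoding, the linear reveal schedule, and the confidence-based selection are illustrative assumptions, not the paper's exact procedure.

import torch

def sample_token_bits(mbm_head, h, num_bits=32, steps=4):
    """Minimal sketch of masked bit unmasking for one next-token prediction.

    mbm_head: assumed module mapping (condition h, current bit states) -> per-bit logits
    h:        [B, D] conditioning vector from the autoregressive transformer
    Bit states: -1 = masked, 0 / 1 = revealed bit values.
    """
    B = h.shape[0]
    bits = torch.full((B, num_bits), -1.0, device=h.device)               # start fully masked
    revealed = torch.zeros(B, num_bits, dtype=torch.bool, device=h.device)

    for step in range(steps):
        logits = mbm_head(h, bits)                     # [B, num_bits] Bernoulli logits
        probs = torch.sigmoid(logits)
        sampled = torch.bernoulli(probs)               # tentative 0/1 value for every bit
        confidence = torch.where(sampled.bool(), probs, 1 - probs)
        confidence = confidence.masked_fill(revealed, -1.0)   # keep revealed bits fixed

        # Linear reveal schedule (assumed): unmask an equal share of bits each step.
        target = num_bits * (step + 1) // steps
        k = target - int(revealed[0].sum())
        if k > 0:
            idx = confidence.topk(k, dim=-1).indices   # most confident still-masked bits
            newly = torch.zeros_like(revealed)
            newly.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
            bits = torch.where(newly, sampled, bits)
            revealed |= newly

    return bits.long()   # one fully revealed bit-string per sample = one codebook index

Because every bit of the token is predicted in parallel and only a handful of unmasking steps are run, the per-token cost stays small even for very large codebooks.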

BAR framework overview

Figure: Overview of the proposed BAR framework. We decompose autoregressive visual generation into two stages: (a) context modeling, where an autoregressive transformer produces latent conditions through causal attention; and token prediction, where (b) a standard linear head predicts logits (effective for small vocabularies but fails to scale), (c) a bit-based head predicts bits directly (scalable but inferior quality), and (d) our Masked Bit Modeling (MBM) head generates bits via progressive unmasking, achieving both exceptional scalability and superior generation quality.
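
For completeness, here is a sketch of how the two stages might fit together at sampling time. Every interface in it (ar_transformer, its KV-cache handling, tokenizer.decode, the class conditioning) is a hypothetical placeholder wrapped around the sample_token_bits sketch above, not the released API.

import torch

@torch.no_grad()
def generate(ar_transformer, mbm_head, tokenizer, class_emb, seq_len=256):
    """Sketch of the two-stage sampling loop: causal context modeling + MBM head."""
    tokens, cache = [], None
    cond = class_emb                                      # class conditioning as the first input
    for _ in range(seq_len):
        h, cache = ar_transformer(cond, past_kv=cache)    # (a) hidden state for the next slot
        bits = sample_token_bits(mbm_head, h)             # (d) MBM head -> discrete token bits
        tokens.append(bits)
        cond = bits                                       # feed the new token back autoregressively
    return tokenizer.decode(torch.stack(tokens, dim=1))   # de-tokenize bits into an image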

Main Results

ImageNet-256×256 Generation Results

Method | Type | #Params | FID↓ | IS↑ | Prec.↑ | Rec.↑
BAR-L (ours) | Discrete | 1.1B | 0.99 | 296.9 | 0.77 | 0.69
BAR-B (ours) | Discrete | 415M | 1.13 | 289.0 | 0.77 | 0.66
RAE | Continuous | 839M | 1.13 | 262.6 | 0.78 | 0.67
DDT | Continuous | 675M | 1.26 | 310.6 | 0.79 | 0.65
RAR | Discrete | 1.5B | 1.48 | 326.0 | 0.80 | 0.63
MAR | Continuous | 943M | 1.55 | 303.7 | 0.81 | 0.62
VAR | Discrete | 2.0B | 1.92 | 323.1 | 0.82 | 0.59

Table: ImageNet-256×256 generation results with classifier-free guidance, sorted by FID (lower is better). BAR-L achieves state-of-the-art FID of 0.99, while BAR-B matches RAE's performance (FID 1.13) with only 415M parameters (vs. RAE's 839M). Both BAR variants significantly outperform prior discrete methods like RAR (FID 1.48) and VAR (FID 1.92).

ImageNet-512×512 Generation Results

Method | Type | #Params | FID↓ | IS↑
BAR-L (ours) | Discrete | 1.1B | 1.09 | 311.1
RAE | Continuous | 839M | 1.13 | 259.6
DDT | Continuous | 675M | 1.28 | 305.1
RAR | Discrete | 1.5B | 1.66 | 295.7
xAR | Continuous | 608M | 1.70 | 281.5

Table: ImageNet-512×512 generation results with classifier-free guidance, sorted by FID (lower is better). BAR-L achieves state-of-the-art FID of 1.09 with 311.1 IS, surpassing all prior methods including RAE (FID 1.13), DDT (FID 1.28), and RAR (FID 1.66). Note: Due to computational constraints, the 512×512 model was trained for only 200 epochs.

Sampling Efficiency

Method | #Params | FID↓ | Images/sec↑
PAR-4× | 3.1B | 2.29 | 4.92
VAR | 2.0B | 1.92 | 8.08
MeanFlow | 676M | 2.20 | 151.48

BAR-B/4 (ours) | 416M | 2.34 | 445.48
BAR-B/2 (ours) | 415M | 1.35 | 150.52

MAR | 943M | 1.55 | 1.19
VA-VAE | 675M | 1.35 | 1.51
DDT | 675M | 1.26 | 1.62
xAR | 1.1B | 1.24 | 2.03
RAE | 839M | 1.13 | 6.62

BAR-B (ours) | 415M | 1.13 | 24.33
BAR-L (ours) | 1.1B | 0.99 | 10.65

Table: Sampling throughput (including de-tokenization) benchmarked on a single H200 with float32 precision. BAR uses only a KV-cache, without further optimization. The table is organized into four groups: (1) baseline fast methods, (2) efficient BAR variants optimized for speed, (3) high-quality continuous/diffusion methods, and (4) main BAR models. BAR-L achieves a state-of-the-art FID of 0.99 at 10.65 images/sec, while BAR-B/4 generates 445.48 images/sec (2.94× faster than MeanFlow, 374× faster than MAR) with a competitive FID of 2.34. BAR-B matches RAE's quality (FID 1.13) while being 3.68× faster.

Generated Samples

Figure: Random samples generated by BAR on ImageNet-256 (categories shown: Goldfish, Turtle, Lorikeet, Balloon, Husky, Golden Retriever, Fountain, Burger). BAR produces high-fidelity, diverse images across various object categories while maintaining fine-grained details and semantic coherence.

BibTeX

@article{yu2026bar,
  author    = {Qihang Yu and Qihao Liu and Ju He and Xinyang Zhang and Yang Liu and Liang-Chieh Chen and Xi Chen},
  title     = {Autoregressive Image Generation with Masked Bit Modeling},
  journal   = {arXiv preprint},
  year      = {2026}
}