#ml
- Jun 26, 2026
baidu/Unlimited-OCR (paper) parses dozens of pages in one forward pass by fixing the real bottleneck in end-to-end OCR.
- The problem: an OCR decoder types out the page token by token, and its KV cache grows without bound as the output gets longer. Memory climbs and speed decays the longer the doc, which forces page-by-page loops that wipe the decoder's memory at each step.
- The trick (Reference Sliding Window Attention): split context into two zones with different retention. Vision/prompt tokens stay fully visible and pinned forever (the "source book"); each generated token attends to only the last 128 generated tokens (the "last few words you wrote"). The KV cache becomes the pinned reference block plus a fixed-size queue, so memory and TPS stay flat regardless of output length.
- The non-obvious part: discarding history improves accuracy (+6% on OmniDocBench, beating Qwen2.5-VL-72B with 3B total / 0.5B active parameters). Full attention can diverge on long dense output, while the pinned vision tokens never blur.
- Cheap to build: freeze DeepSeek-OCR's encoder, fine-tune only the decoder with the swapped attention.
- Generalizes past OCR: separate what you reference from what you remember.
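
A rough sketch of the two-zone idea, not the paper's code. Everything named here (`can_attend`, `ReferenceWindowCache`, `n_ref`) is made up for illustration. It assumes reference tokens occupy positions `0..n_ref-1`, generated tokens follow, and the 128-token window counts the current token itself; the paper may define these boundaries differently.

```python
from collections import deque


def can_attend(query_pos, key_pos, n_ref, window=128):
    """Hypothetical mask rule: may the token at query_pos see key_pos?

    Positions 0..n_ref-1 are reference (vision/prompt) tokens,
    positions n_ref and up are generated tokens (assumed layout).
    """
    if key_pos > query_pos:
        return False  # causal: never look ahead
    if key_pos < n_ref:
        return True  # reference zone: always visible
    # Output zone: only the most recent `window` tokens, current one included.
    return query_pos - key_pos < window


class ReferenceWindowCache:
    """Hypothetical KV cache: pinned reference entries plus a bounded queue."""

    def __init__(self, reference_entries, window=128):
        self.reference = list(reference_entries)  # never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, entry):
        self.recent.append(entry)

    def visible(self):
        # What the next attention step would read: reference, then recent outputs.
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


# Usage: 1000 reference entries, then 5000 decoded tokens.
cache = ReferenceWindowCache(range(1000), window=128)
for step in range(5000):
    cache.append(step)
print(len(cache))  # 1128: 1000 pinned + 128 recent, however long decoding runs
```

The point of the `deque(maxlen=...)` is that eviction costs nothing extra: cache size is `n_ref + window` after the first 128 outputs and never changes again, which is where the flat memory and TPS come from.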