Sun, Jun 28, 2026

2 notes this week

Jun 26, 2026
baidu/Unlimited-OCR (paper) parses dozens of pages in a single decoding run by fixing the real bottleneck in end-to-end OCR.
- The problem: an OCR decoder emits the page token by token, and its KV cache grows without bound as the output gets longer. Memory climbs and throughput decays on longer documents, which forces page-by-page loops that clear the cache at every step.
- The trick (Reference Sliding Window Attention, R-SWA): split the context into two zones with different retention. Vision/prompt tokens stay fully visible and pinned for the whole generation (the "source book"); each new token attends only to the last 128 generated tokens (the "last few words you wrote"). The cache for generated tokens becomes a fixed-size queue, so memory and tokens per second (TPS) stay flat regardless of output length.
- The non-obvious part: discarding history improves accuracy (+6% on OmniDocBench, beating Qwen2.5-VL-72B with 3B total / 0.5B active parameters). Full attention can diverge on long, dense output, while the pinned vision tokens never blur.
- Cheap to build: freeze DeepSeek-OCR's encoder, fine-tune only the decoder with the swapped attention.
- Generalizes past OCR: separate what you reference from what you remember.
- Ran it on an M1 Mac (32GB) with R-SWA intact, no NVIDIA. The official HF model (trust_remote_code) already ships R-SWA as a pure-PyTorch eager ring buffer (SlidingWindowLlamaAttention, picked via attn_implementation="eager") — not buried in a CUDA/FlashAttention kernel. So the "port" was just stripping .cuda()/torch.autocast("cuda") and running bf16 on MPS; zero MPS op-gaps hit. bf16 works on MPS in torch 2.12.
- Proof R-SWA is live: instrument the KV cache during decode. On a 540-token page it plateaus at prefill (277) + window (128) = 405 and never grows; full attention would reach 277 + 540 = 817. That constant-cache assertion plus Distinct-35≈1.0 are the decisive signals, not edit distance (too noisy against a single reference).
- Gotcha: feeding all pages to one generate() (infer_multi) loops at the document level — the 128-token decode window forgets the doc was already transcribed while the vision prefix still shows every page, so it restarts from page 1. Fix = one generate() per page (matches official infer.py PAGE_SIZE=1). Accuracy then lands at/below gemini-vs-docai inter-annotator noise (~0.31 mean edit). ~35–55 s/page on M1.
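The fixed-size-queue behavior in the bullets above can be sketched in plain Python. This is a toy model of the cache bookkeeping only, not the model's actual SlidingWindowLlamaAttention code: the class name `ReferenceSlidingCache` and the helper `simulate` are hypothetical, and the cache stores token positions rather than real key/value tensors. The numbers (277 prefill tokens, 128-token window, 540 decoded tokens) are the ones from the note.

```python
from collections import deque

WINDOW = 128  # decode window size quoted in the note


class ReferenceSlidingCache:
    """Toy cache: a pinned reference prefix plus a fixed-size ring of recent outputs."""

    def __init__(self, prefill_len: int, window: int = WINDOW):
        # Vision/prompt positions are kept for the whole generation.
        self.reference = list(range(prefill_len))
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.recent = deque(maxlen=window)

    def append(self, position: int) -> None:
        """Record one newly decoded token position."""
        self.recent.append(position)

    def visible_positions(self) -> list:
        """Positions the next token may attend to: all reference + recent window."""
        return self.reference + list(self.recent)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


def simulate(prefill_len: int, new_tokens: int, window: int = WINDOW) -> list:
    """Return the cache size after each decoded token."""
    cache = ReferenceSlidingCache(prefill_len, window)
    sizes = []
    for step in range(new_tokens):
        cache.append(prefill_len + step)
        sizes.append(len(cache))
    return sizes


sizes = simulate(prefill_len=277, new_tokens=540)
print(max(sizes))  # 405: prefill (277) + window (128)
print(277 + 540)   # 817: what an unbounded cache would hold
```

The same plateau check (cache length never exceeds prefill + window) is the kind of assertion one could run against the real cache during decode, assuming its length can be read at each step.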
ai ml ocr
Jun 23, 2026
Many asymmetric embedding models need task prefixes on inputs, and skipping them quietly degrades relevance. Each model has its own scheme, and the schemes are not interchangeable: nomic (search_query:/search_document:), E5 (query:/passage:), BGE (an instruction sentence on queries, bare docs). OpenAI text-embedding-3-* and all-MiniLM-L6-v2 need none. Who adds the prefix depends on the serving layer, not the model: raw endpoints (llama.cpp /v1/embeddings, HF TEI, Ollama) embed whatever text they receive, so it's on you, while sentence-transformers (when called with prompt_name="query") and vendor SDKs inject it for you. Adding search_query:/search_document: to a nomic-v1.5 call lifted cosine similarity on a real query/doc pair from 0.54 to 0.60 at zero cost.
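For raw endpoints, a minimal sketch of adding the prefix yourself might look like the helper below. The family keys ("nomic", "e5", "bge", "none") and the function `with_prefix` are hypothetical names for illustration. The BGE query instruction shown is the one commonly documented for the English BGE models and is an assumption here; check each model card for the exact string before relying on it.

```python
# Prefix table keyed by a hypothetical "family" label (not an official identifier).
PREFIXES = {
    "nomic": {"query": "search_query: ", "document": "search_document: "},
    "e5": {"query": "query: ", "document": "passage: "},
    # BGE: instruction sentence on queries only, documents stay bare.
    # Assumed string; verify against the model card.
    "bge": {
        "query": "Represent this sentence for searching relevant passages: ",
        "document": "",
    },
    # Models that need no prefix (e.g. all-MiniLM-L6-v2).
    "none": {"query": "", "document": ""},
}


def with_prefix(text: str, family: str, role: str) -> str:
    """Return text with the task prefix for the given family and role."""
    if role not in ("query", "document"):
        raise ValueError(f"role must be 'query' or 'document', got {role!r}")
    if family not in PREFIXES:
        raise ValueError(f"unknown family: {family!r}")
    return PREFIXES[family][role] + text


print(with_prefix("how do ring buffers work", "nomic", "query"))
# search_query: how do ring buffers work
print(with_prefix("A ring buffer is a fixed-size queue.", "e5", "document"))
# passage: A ring buffer is a fixed-size queue.
```

Apply this only when the serving layer passes text through unchanged; if the client library already injects the prefix, adding it again would double it.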

rag ai