Sun, Jun 28, 2026
2 notes this week
- Jun 26, 2026
baidu/Unlimited-OCR (paper) parses dozens of pages in one forward pass by fixing the real bottleneck in end-to-end OCR.
- The problem: an OCR decoder types out the page token-by-token, and its KV cache grows unbounded with output length — so memory climbs and speed decays the longer the doc, forcing page-by-page loops that wipe memory each step.
- The trick (Reference Sliding Window Attention): split context into two zones with different retention. Vision/prompt tokens stay fully visible and pinned forever (the "source book"); each generated token attends to only the last 128 output tokens (the "last few words you wrote"). The output part of the KV cache becomes a fixed-size queue, so memory and TPS stay flat regardless of output length.
- The non-obvious part: discarding history improves accuracy (+6% on OmniDocBench, beating Qwen2.5-VL-72B with a 3B-parameter model, 0.5B active). Full attention can diverge on long dense output, while pinned vision tokens never blur.
- Cheap to build: freeze DeepSeek-OCR's encoder, fine-tune only the decoder with the swapped attention.
- Generalizes past OCR: separate what you reference from what you remember.
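A toy sketch of the two-zone cache described above, to make "fixed-size queue" concrete. This is not the paper's implementation: the class name, the string placeholders standing in for key/value tensors, and the tiny window of 3 (instead of 128) are all assumptions for illustration.

```python
from collections import deque


class ReferenceSlidingWindowCache:
    """Toy KV cache: pinned reference entries plus a bounded queue of recent outputs."""

    def __init__(self, reference, window=128):
        # Vision/prompt entries: written once, never evicted (the "source book").
        self.reference = list(reference)
        # Output entries: deque(maxlen=...) silently drops the oldest when full.
        self.recent = deque(maxlen=window)

    def append(self, entry):
        """Record the cache entry for a newly generated token."""
        self.recent.append(entry)

    def visible(self):
        """Everything the next token may attend to."""
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


# Hypothetical run: 4 pinned reference entries, window of 3, 10 decode steps.
cache = ReferenceSlidingWindowCache(
    reference=[f"img{i}" for i in range(4)], window=3
)
sizes = []
for step in range(10):
    cache.append(f"out{step}")
    sizes.append(len(cache))

print(sizes)            # cache size after each step
print(cache.visible())  # pinned entries plus the 3 most recent outputs
```

The cache size stops growing once the window fills (4 pinned + 3 recent = 7 here), which is where the flat memory and throughput come from. The pinned entries are always still visible, however long the output gets.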
- Jun 23, 2026
Many asymmetric embedding models need task prefixes on inputs, and skipping them quietly degrades relevance. Each model has its own scheme: nomic (search_query:/search_document:), E5 (query:/passage:), BGE (a query instruction sentence, bare docs). These schemes are not interchangeable. OpenAI text-embedding-3-* and all-MiniLM-L6-v2 need none. Whether you have to add the prefix yourself depends on the serving layer, not the model: raw endpoints (llama.cpp /v1/embeddings, HF TEI, Ollama) send bare text so it's on you, while sentence-transformers (prompt_name="query") and vendor SDKs inject it for you. Adding search_query:/search_document: to a nomic-v1.5 call lifted cosine similarity on a real query/doc pair from 0.54 to 0.60 at zero cost.
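A minimal sketch of adding the prefixes yourself before calling a raw endpoint, using the schemes listed above. The family keys ("nomic", "e5", "none") and the `with_prefix` helper are hypothetical names, not part of any library. BGE is left out because its query instruction sentence varies by model and is not spelled out in this note.

```python
# Prefix per model family and input role. Keys are illustrative labels only.
PREFIXES = {
    "nomic": {"query": "search_query: ", "document": "search_document: "},
    "e5": {"query": "query: ", "document": "passage: "},
    # Models that take bare text (e.g. all-MiniLM-L6-v2, text-embedding-3-*).
    "none": {"query": "", "document": ""},
}


def with_prefix(text: str, family: str, role: str) -> str:
    """Return text with the task prefix the model family expects for this role."""
    try:
        return PREFIXES[family][role] + text
    except KeyError:
        raise ValueError(f"unknown family/role: {family!r}/{role!r}") from None


query = with_prefix("how do sliding window caches work?", "nomic", "query")
doc = with_prefix("The KV cache becomes a fixed-size queue.", "nomic", "document")
print(query)
print(doc)
```

Apply this only when the serving layer sends bare text. If the layer already injects the prefix, adding it again would double it.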