#ai

19 notes

Jul 28, 2026
Python's Ellipsis literal (...) is a real object (type(...) is <class 'ellipsis'>), not just a placeholder. Beyond type stubs and numpy slicing, NVIDIA's NOOA framework repurposes it as a runtime dispatch marker: methods with ... bodies are completed by an LLM at runtime, while methods with real bodies stay deterministic Python. The boundary between "code I control" and "code the model controls" becomes greppable: grep -n '\.\.\.' agent.py. Clever reuse of existing syntax that linters and type checkers already accept.
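A minimal sketch of the "greppable boundary" idea, using only the standard library. It is not how NOOA does its dispatch (the note does not describe that); it is a static check that lists which functions have a bare `...` body. The `stub_methods` helper and the `Agent` class are hypothetical names made up for illustration.

```python
import ast

# `...` is the literal spelling of the Ellipsis singleton.
assert ... is Ellipsis
assert type(...).__name__ == "ellipsis"


def stub_methods(source: str) -> list[str]:
    """Return names of functions whose body is only `...` (optionally after a docstring)."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body = node.body
        # Skip a leading docstring, if there is one.
        if (len(body) > 1 and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)):
            body = body[1:]
        # A stub is a single expression statement whose value is Ellipsis.
        if (len(body) == 1 and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and body[0].value.value is Ellipsis):
            found.append(node.name)
    return found


# Hypothetical agent: one stubbed method, one deterministic method.
EXAMPLE = '''
class Agent:
    def summarize(self, text: str) -> str:
        """Summarize the text."""
        ...

    def word_count(self, text: str) -> int:
        return len(text.split())
'''

print(stub_methods(EXAMPLE))  # ['summarize']
```

Unlike the plain grep, the AST check ignores `...` inside strings, comments, and slices.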

python
Jul 24, 2026
Embedding pipelines do not feed whole documents straight into the model. They first use the embedding model's tokenizer to measure and split text into token-bounded, overlapping chunks, then embed each chunk and store the vectors. The tokenizer must match the model, otherwise chunk lengths are counted in the wrong units and a chunk can overflow the model's context. Tools like marcelroed/gigatoken accelerate this preparation at corpus scale but do not create embeddings themselves.
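A rough sketch of token-bounded, overlapping chunking. The tokenizer is passed in as a pair of functions so the real model tokenizer can be swapped in. The whitespace tokenizer below is only a stand-in, and `chunk_by_tokens` and its parameters are hypothetical names rather than any library's API.

```python
from typing import Callable


def chunk_by_tokens(
    text: str,
    tokenize: Callable[[str], list[str]],
    detokenize: Callable[[list[str]], str],
    max_tokens: int = 8,
    overlap: int = 2,
) -> list[str]:
    """Split text into chunks of at most max_tokens tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    tokens = tokenize(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(detokenize(tokens[start:start + max_tokens]))
        # Stop once this window already reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Stand-in tokenizer (assumption): whitespace split, NOT a real model tokenizer.
def ws_tokenize(s: str) -> list[str]:
    return s.split()


def ws_detokenize(toks: list[str]) -> str:
    return " ".join(toks)


print(chunk_by_tokens("a b c d e f g h i j", ws_tokenize, ws_detokenize,
                      max_tokens=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```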

rag performance
Jun 26, 2026
baidu/Unlimited-OCR (paper) parses dozens of pages in one forward pass by fixing the real bottleneck in end-to-end OCR.
- The problem: an OCR decoder types out the page token-by-token, and its KV cache grows unbounded with output length — so memory climbs and speed decays the longer the doc, forcing page-by-page loops that wipe memory each step.
- The trick (Reference Sliding Window Attention): split context into two zones with different retention. Vision/prompt tokens stay fully visible and pinned forever (the "source book"); each token attends to only the last 128 of its own outputs (the "last few words you wrote"). KV cache becomes a fixed-size queue, so memory and TPS stay flat regardless of output length.
- The non-obvious part: discarding history improves accuracy (+6% on OmniDocBench, beating Qwen2.5-VL-72B at 3B/0.5B-active) — full attention can diverge on long dense output, and pinned vision tokens never blur.
- Cheap to build: freeze DeepSeek-OCR's encoder, fine-tune only the decoder with the swapped attention.
- Generalizes past OCR: separate what you reference from what you remember.
- Ran it on an M1 Mac (32GB) with R-SWA intact, no NVIDIA. The official HF model (trust_remote_code) already ships R-SWA as a pure-PyTorch eager ring buffer (SlidingWindowLlamaAttention, picked via attn_implementation="eager") — not buried in a CUDA/FlashAttention kernel. So the "port" was just stripping .cuda()/torch.autocast("cuda") and running bf16 on MPS; zero MPS op-gaps hit. bf16 works on MPS in torch 2.12.
- Proof R-SWA is live: instrument the KV cache during decode — on a 540-token page it plateaus at prefill(277) + window(128) = 405 and never grows (full MHA would hit 817). That constant-cache assertion + Distinct-35≈1.0 are the decisive signals, not edit distance (too noisy vs a single reference).
- Gotcha: feeding all pages to one generate() (infer_multi) loops at the document level — the 128-token decode window forgets the doc was already transcribed while the vision prefix still shows every page, so it restarts from page 1. Fix = one generate() per page (matches official infer.py PAGE_SIZE=1). Accuracy then lands at/below gemini-vs-docai inter-annotator noise (~0.31 mean edit). ~35–55 s/page on M1.
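The cache arithmetic above can be reproduced with a toy simulation. This is not the model's attention code, only a sketch of the two-zone bookkeeping: a pinned prefix that is never evicted, plus a bounded queue of recent outputs. `cache_sizes` is a hypothetical helper; the numbers 277, 128, and 540 are the ones from the notes above.

```python
from collections import deque
from typing import Optional


def cache_sizes(prefill_len: int, new_tokens: int, window: Optional[int]) -> list[int]:
    """KV-cache length after each decoded token.

    window=None models full attention (nothing is ever evicted).
    """
    pinned = prefill_len            # vision/prompt tokens: always kept
    recent = deque(maxlen=window)   # decoded tokens: oldest evicted when full
    sizes = []
    for t in range(new_tokens):
        recent.append(t)
        sizes.append(pinned + len(recent))
    return sizes


rswa = cache_sizes(prefill_len=277, new_tokens=540, window=128)
full = cache_sizes(prefill_len=277, new_tokens=540, window=None)

print(max(rswa), rswa[-1])  # 405 405
print(full[-1])             # 817
```

The check that matters is that the sliding-window series stops growing once 128 tokens have been decoded, while the full-attention series keeps climbing by one per token.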
ml ocr
Jun 23, 2026
Many asymmetric embedding models need task prefixes on inputs, and skipping them quietly degrades relevance. Each model has its own scheme: nomic (search_query:/search_document:), E5 (query:/passage:), BGE (a query instruction sentence, bare docs) — not interchangeable. OpenAI text-embedding-3-* and all-MiniLM-L6-v2 need none. Whether you add the prefix depends on the serving layer, not the model: raw endpoints (llama.cpp /v1/embeddings, HF TEI, Ollama) send bare text so it's on you, while sentence-transformers (prompt_name="query") and vendor SDKs inject it for you. Adding search_query:/search_document: to a nomic-v1.5 call lifted cosine similarity on a real query/doc pair from 0.54 to 0.60 at zero cost.
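For raw endpoints, a small helper keeps the prefix logic in one place. This is a sketch: `with_prefix` and the `PREFIXES` table are hypothetical names, and the trailing space after each colon is my assumption about the expected format. BGE is left out because its query instruction sentence is not spelled out here.

```python
# Prefix schemes named in the note; "none" covers models that take bare text.
PREFIXES = {
    "nomic": {"query": "search_query: ", "document": "search_document: "},
    "e5": {"query": "query: ", "document": "passage: "},
    "none": {"query": "", "document": ""},
}


def with_prefix(text: str, family: str, role: str) -> str:
    """Prepend the task prefix for a model family ('nomic', 'e5', 'none')."""
    try:
        return PREFIXES[family][role] + text
    except KeyError:
        raise ValueError(f"unknown family/role: {family!r}/{role!r}") from None


print(with_prefix("how do I rotate api keys?", "nomic", "query"))
# search_query: how do I rotate api keys?
print(with_prefix("Rotate keys from the settings page.", "e5", "document"))
# passage: Rotate keys from the settings page.
```

Only use this when the serving layer sends bare text; if the client already injects prompts, the prefix would be applied twice.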

rag
Jun 15, 2026
Moshi is the best SSH client I've found for iPhone — connects to your laptop's tmux sessions over Mosh, making it rock-solid on mobile networks with roaming support. Perfect for an agent view on the go; I use it to monitor and interact with running pi/Claude sessions from my iPhone without losing the session on network switches.

macos tools
Jun 8, 2026
fabric is a crowd-sourced collection of prompts bundled with a CLI to use them via piped interfaces.

tools open-source
Jun 8, 2026
karpathy/autoresearch is a concept/framework for using an LLM to run an optimization loop against a given criterion. davebcn87/pi-autoresearch takes this further, building a generic long-running optimization loop on top of the pi agent. gemini

performance
May 29, 2026
Dynamic workflows in Claude Code. Claude dynamically writes orchestration scripts that spawn tens to hundreds of parallel subagents. It plans, decomposes into subtasks, fans out agents, and checks results before returning a coordinated answer. Adversarial agents try to refute findings, iterating until convergence. Enable via ultracode setting or ask Claude to "create a workflow". Notable: Jarred Sumner used it to port Bun from Zig to Rust (~750K lines, 11 days). Available on Max, Team, Enterprise plans. Consumes significantly more tokens than typical sessions.
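The fan-out / check shape described here can be sketched with asyncio. This is not Claude Code's implementation or API: every function is a hypothetical stub, and the "adversarial" step is reduced to a trivial check that only rejects empty findings.

```python
import asyncio


async def run_subagent(task: str) -> str:
    # Hypothetical stand-in for a real subagent call.
    await asyncio.sleep(0)
    return f"result for {task}"


async def refute(finding: str) -> bool:
    # Hypothetical adversarial check: returns True if the finding is rejected.
    await asyncio.sleep(0)
    return not finding.strip()


async def workflow(subtasks: list[str]) -> dict[str, str]:
    # Fan out: one subagent per subtask, run concurrently.
    findings = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    # Check: try to refute each finding, also concurrently.
    rejected = await asyncio.gather(*(refute(f) for f in findings))
    # Fan in: keep only the findings that survived.
    return {t: f for t, f, bad in zip(subtasks, findings, rejected) if not bad}


print(asyncio.run(workflow(["parse config", "port tests"])))
# {'parse config': 'result for parse config', 'port tests': 'result for port tests'}
```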

devops
May 9, 2026
Use HTML instead of Markdown to plan and review effectively: "Claude Code: The Unreasonable Effectiveness of HTML" (X post)

markdown
May 9, 2026
colbymchenry/codegraph: Pre-indexed code knowledge graph for Claude Code — fewer tokens, fewer tool calls, 100% local

tools
Apr 22, 2026
A good collection of practices on automated AI code reviews by Ankit Jain
- The Scalability Crisis: Manual post-PR review is no longer viable. AI agents have nearly doubled code output, causing human review time to spike by 91%, creating a bottleneck that traditional workflows cannot solve.
- The Upstream Pivot: Human value must shift from reviewing implementation to defining intent. Instead of checking syntax, humans spend their energy writing rigorous specs and acceptance criteria before the code is written, which the machine then uses to self-verify.
- The Swiss-Cheese Defense: Rather than one "perfect" human gate, the model uses a stack of imperfect automated layers. By layering signals like agent competition, deterministic guardrails, and adversarial "red-team" agents, the system catches errors where their individual failure modes don't overlap.
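A back-of-the-envelope for the Swiss-cheese point: if each layer misses a given defect independently, the stack's miss rate is the product of the layers' miss rates. The rates below are made-up illustration values, not measurements, and independence is the assumption doing all the work (overlapping failure modes make the real number worse).

```python
from math import prod


def combined_miss_rate(miss_rates: list[float]) -> float:
    """Probability that every layer misses a defect, assuming independent layers."""
    if any(not 0.0 <= m <= 1.0 for m in miss_rates):
        raise ValueError("miss rates must be probabilities in [0, 1]")
    return prod(miss_rates)


# Hypothetical layers: agent competition, deterministic guardrails, red-team agent.
layers = [0.3, 0.4, 0.5]
print(f"{combined_miss_rate(layers):.2f}")  # 0.06
```

Three mediocre layers that each miss 30 to 50 percent of defects still let through only about 6 percent together, which is the argument for stacking imperfect checks rather than hunting for one perfect gate.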
architecture
Mar 24, 2026
pageindex generates a semantic, tree-like JSON index of a lengthy document, enabling reasoning-based RAG without a vector DB.
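A toy version of tree-index retrieval, to show the shape of the idea. The node schema (`title`, `summary`, `pages`, `children`) is a guess for illustration, not pageindex's actual format, and the word-overlap `score` is a stand-in for the step where an LLM would reason about which branch to follow.

```python
# Hypothetical tree index for one document.
TREE = {
    "title": "Annual report",
    "summary": "finance and operations overview",
    "pages": (1, 40),
    "children": [
        {"title": "Financial results", "summary": "revenue profit cash flow",
         "pages": (2, 20), "children": []},
        {"title": "Operations", "summary": "hiring offices supply chain",
         "pages": (21, 40), "children": []},
    ],
}


def score(query: str, node: dict) -> int:
    # Stand-in for LLM reasoning: count query words found in title + summary.
    words = set(query.lower().split())
    text = set((node["title"] + " " + node["summary"]).lower().split())
    return len(words & text)


def locate(query: str, node: dict) -> dict:
    """Walk from the root to a leaf, picking the best-scoring child at each level."""
    while node["children"]:
        node = max(node["children"], key=lambda child: score(query, child))
    return node


hit = locate("what was the revenue", TREE)
print(hit["title"], hit["pages"])  # Financial results (2, 20)
```

Retrieval returns a page range to read rather than nearest-neighbor chunks, which is why no vector store is needed.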

rag databases
Feb 15, 2026
For generating embeddings locally, nomic-embed-text is a long-context text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small on short- and long-context tasks. It balances speed, 8k context, and accuracy for English-centric apps. BGE-M3, Qwen3-Embedding, and E5-Small are alternatives.

rag
Feb 15, 2026
yichuan-w/LEANN is a RAG framework focused on efficient storage, with built-in chunking strategies, embedding model management, and an MCP server. gemini

rag
Feb 15, 2026
K-dense, known for using skills to enable deep research, has published 140+ skills for scientific research, including literature review, data analysis, etc.
Feb 6, 2026
Opus 4.6 launch.
- context compaction (beta) and a 1M context window enable longer agentic tasks without losing context.
- they claim it has found 500 zero-day flaws in open-source projects (yet to see the proofs though)
- agent teams: multiple agents coordinate with a lead agent. https://code.claude.com/docs/en/agent-teams
open-source security
Jan 31, 2026
Notes on "How AI assistance impacts the formation of coding skills" (Article, HN)
- AI speeds up coding but reduces deep understanding and mastery
- Juniors (1-3 years experience) showed speed improvements with AI, but 4+ year developers showed no difference
- Modern software work is more about requirements, specs, documentation, and communication than raw coding skill
- Small sample size (n<8) and study design limitations make results questionable
- Takeaways:
  1. Use AI for high-scoring interaction patterns: Ask conceptual questions and request explanations rather than just code generation
  2. Adopt AI for documentation and specs: Multiple developers report dramatic improvements in tickets, PRs, and documentation quality
  3. Be deliberate about learning: If using AI, actively practice explaining concepts and avoid pure copy-paste workflows
  4. Use AI to reduce grunt work: Let it handle boilerplate, test writing, and repetitive tasks while focusing on architecture and requirements
- The research confirms what many suspected: AI coding assistants create a real trade-off between speed and skill development, but the practical significance is hotly contested. The critical question isn't whether AI reduces learning (it does), but whether deep coding skill remains as valuable as expressing requirements clearly—and whether we're comfortable with a generation of developers who can't function without AI assistance.
architecture
Dec 24, 2025
"LangGraph is an orchestration framework for building stateful multi-agent applications using LLMs. It provides low-level primitives such as nodes and edges, along with built-in features that give developers granular control over agent workflows, memory management and state persistence. This means developers can start with a simple pre-built graph and scale to complex, evolving agent architectures. With support for streaming, advanced context management and resilience patterns like model fallbacks and tool error handling, LangGraph enables you to build robust, production-grade agentic applications. Its graph-based approach ensures predictable, customizable workflows and simplifies debugging and scaling."

python
Dec 22, 2025
Notes from Thoughtworks - Technology Radar vol 33
- Text-to-SQL solutions aren't working as expected
- pnpm, LangGraph, and Pydantic recommended for adoption
architecture databases
