This document explains the approach used in this repository to automatically group til.md notes into meaningful categories.
The pipeline reads til.md and parses entries using the existing date-based bullet structure. Each top-level dated item is treated as one note, and its nested bullets/sub-content are kept as part of that note.
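The parsing rule above can be sketched as follows. This is a minimal illustration, not the repository's parser: the exact bullet layout of til.md is not shown here, so the sketch assumes each note starts with a top-level bullet such as `- 2024-05-01: text`, and the names `DATED_BULLET` and `parse_notes` are hypothetical.

```python
import re

# Assumption: a note starts with a top-level bullet that begins with an ISO date,
# e.g. "- 2024-05-01: text". The real til.md layout may differ.
DATED_BULLET = re.compile(r"^[-*]\s+(\d{4}-\d{2}-\d{2})\b[:\s-]*(.*)$")


def parse_notes(markdown_text):
    """Split markdown into notes; nested lines stay attached to their dated parent."""
    notes = []
    current = None
    for line in markdown_text.splitlines():
        match = DATED_BULLET.match(line)
        if match:
            current = {"date": match.group(1), "title": match.group(2).strip(), "body": []}
            notes.append(current)
        elif current is not None and line.strip() and line[0] in " \t":
            # Indented content (sub-bullets, links) belongs to the current note.
            current["body"].append(line.rstrip())
        elif line.strip():
            # A non-indented, non-dated line (e.g. a heading) ends the current note.
            current = None
    return notes
```

Keeping the nested lines verbatim in `body` means the sub-bullets and links can later be written out unchanged.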
At a high level:
CATEGORIES.md (and JSON metadata when needed).
Semantic embeddings are generated locally with Ollama using:
nomic-embed-text
Why embeddings:
Operationally, each note is converted to text and embedded into a fixed-size vector (commonly 768 dimensions for this model).
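A rough sketch of this step is shown below. It assumes Ollama is listening on its default local port and that the `/api/embeddings` endpoint is used; the repository's scripts may call Ollama differently (the Python version appears to use httpx, while this sketch sticks to the standard library). The helper names `note_to_text`, `post_json`, and `embed` are hypothetical.

```python
import json
import urllib.request

# Assumptions: default local Ollama address and the /api/embeddings endpoint.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "nomic-embed-text"


def note_to_text(title, body_lines):
    """Flatten a note (title plus nested lines) into one string for embedding."""
    return "\n".join([title, *(line.strip() for line in body_lines)])


def post_json(url, payload):
    """Send a JSON POST request and decode the JSON response (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def embed(text, send=post_json):
    """Return the embedding vector for one note as a list of floats."""
    reply = send(OLLAMA_URL, {"model": MODEL, "prompt": text})
    return [float(x) for x in reply["embedding"]]
```

The `send` parameter exists only so the request can be swapped out in tests or replaced with another HTTP client.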
The grouping step applies k-means to the note embeddings, using a cosine-style similarity measure.
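One common way to get cosine-style behavior from standard k-means is to scale every embedding to unit length first, since for unit vectors the squared Euclidean distance equals 2 minus twice the cosine similarity. Whether the scripts here do exactly this is an assumption; the sketch below only illustrates the identity, and the function names are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity of two vectors, computed via their unit-length forms."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def squared_distance(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

With normalized inputs, minimizing squared distance and maximizing cosine similarity rank neighbors in the same order.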
Instead of hardcoding one cluster count, the script evaluates a candidate range and selects a good k using silhouette score.
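The selection loop can be sketched without dependencies as below. This is an illustration only: a real script would more likely rely on a library implementation of k-means and silhouette scoring, and the tiny deterministic `kmeans`, the `silhouette` helper, and `choose_k` are hypothetical names. Every candidate k is assumed to be at least 2 and smaller than the number of notes.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50):
    """Tiny Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; meaningful only with at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton (or single-cluster) case scores 0
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def choose_k(points, candidates):
    """Return the candidate k with the highest silhouette score."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))
```

A call such as `choose_k(vectors, range(2, 8))` would evaluate each k in the range; the range itself is an example, not a value taken from the repository.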
Benefits:
k.
Clusters are converted into readable category names using a hybrid approach:
This keeps labels stable and understandable while still adapting to new content.
After initial clustering, very small categories (especially singletons) are merged into semantically related larger categories.
Merge behavior combines:
This improves final readability by avoiding noisy one-off buckets.
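The merge step might look roughly like the sketch below. It covers only one plausible ingredient, similarity between a small cluster's notes and the centroids of larger clusters; the actual merge criteria are not spelled out here, so the `min_size` threshold and the function names are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def merge_small_clusters(vectors, labels, min_size=2):
    """Reassign members of clusters smaller than min_size to the most similar larger cluster."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    # Centroids of the clusters that are large enough to absorb others.
    large = {
        label: centroid([vectors[i] for i in idx])
        for label, idx in groups.items()
        if len(idx) >= min_size
    }
    if not large:
        return list(labels)  # nothing to merge into
    merged = list(labels)
    for label, idx in groups.items():
        if len(idx) < min_size:
            for i in idx:
                merged[i] = max(large, key=lambda name: cosine(vectors[i], large[name]))
    return merged
```

With the default `min_size=2`, only singletons are reassigned, which matches the emphasis on one-off buckets.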
Primary generated artifacts:
CATEGORIES.md: human-readable grouped notes with full note content (including sub-bullets/links).
categories.json: machine-friendly category metadata.
The markdown output is designed for browsing and can be used directly in docs/README workflows.
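As a sketch of how the two artifacts could be produced from grouped notes: the heading style, the note dictionary shape (`date`, `title`, `body`), and the JSON fields below are all hypothetical, since the real layout of CATEGORIES.md and categories.json is not specified here.

```python
import json


def render_markdown(categories):
    """Render {category name: [note dicts]} as markdown, keeping nested lines intact."""
    lines = []
    for name, notes in categories.items():
        lines.append(f"## {name}")
        lines.append("")
        for note in notes:
            lines.append(f"- {note['date']}: {note['title']}")
            lines.extend(note["body"])  # sub-bullets and links, already indented
        lines.append("")
    return "\n".join(lines)


def render_metadata(categories):
    """Build machine-friendly metadata: one record per category with its size and dates."""
    records = [
        {"name": name, "count": len(notes), "dates": [n["date"] for n in notes]}
        for name, notes in categories.items()
    ]
    return json.dumps({"categories": records}, indent=2)
```

Both functions return strings, so writing them to disk is left to the caller.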
Two implementations exist:
categorize.js
categorize.py (numpy, scikit-learn, httpx).
Both produce equivalent category outcomes when run with similar settings.
til.md.
categorize.js or categorize.py).
CATEGORIES.md.
As note volume increases, you can improve quality by:
k ranges,
This approach combines semantic embeddings, unsupervised clustering, and light rule-based post-processing to keep TIL notes organized automatically while preserving high-quality, readable output.