Categorizing TIL Notes with Embeddings

This document explains how this repository automatically groups the notes in til.md into meaningful categories.

Goals

Input Format

The pipeline reads til.md and parses entries using the existing date-based bullet structure. Each top-level dated item is treated as one note, and its nested bullets/sub-content are kept as part of that note.
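The parsing rule above can be sketched as follows. This is a minimal illustration, not the repository's parser: the `YYYY-MM-DD` date format, the `Note` record, and the `parse_notes` name are all assumptions made for the example, since the exact layout of til.md is not shown here.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Assumption: a top-level entry starts at column 0 and looks like "- 2024-01-15 ...".
DATED_ITEM = re.compile(r"^[-*]\s+(\d{4}-\d{2}-\d{2})\b[:\s]*(.*)$")
BULLET_PREFIX = re.compile(r"^[-*]\s+")


@dataclass
class Note:
    """Hypothetical note record: one dated item plus its nested content."""
    date: str
    lines: list = field(default_factory=list)


def parse_notes(markdown: str) -> list:
    notes = []
    current: Optional[Note] = None
    for raw in markdown.splitlines():
        match = DATED_ITEM.match(raw)
        if match:
            # A new top-level dated item starts a new note.
            current = Note(date=match.group(1))
            notes.append(current)
            inline = match.group(2).strip()
            if inline:
                current.lines.append(inline)
        elif current is not None and raw.strip():
            if raw[0] in " \t":
                # Indented content belongs to the current note.
                current.lines.append(BULLET_PREFIX.sub("", raw.strip()))
            else:
                # Unindented, undated text (e.g. a heading) ends the note.
                current = None
    return notes
```

Keeping nested bullets attached to their parent means each note carries its full context into the embedding step.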

At a high level:

  1. Parse markdown into note records.
  2. Build an embedding for each note.
  3. Cluster similar embeddings.
  4. Label clusters into human-readable categories.
  5. Merge tiny categories into nearby larger ones.
  6. Write categorized output to CATEGORIES.md (and JSON metadata when needed).

Embeddings

Semantic embeddings are generated locally with Ollama using:

Why embeddings:

Operationally, each note is converted to text and embedded into a fixed-size vector (commonly 768 dimensions for this model).
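As a rough sketch of this step, the snippet below requests one embedding from a local Ollama server over HTTP. It assumes Ollama's default port and its `/api/embeddings` endpoint; the `build_request` and `embed` names are hypothetical, and the model name is left as a parameter rather than guessed. The repository's scripts may call Ollama differently.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def build_request(text: str, model: str) -> urllib.request.Request:
    """Build the POST request for a single embedding."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(text: str, model: str, opener=urllib.request.urlopen) -> list:
    """Return the embedding vector for `text`.

    `opener` is injectable so the function can be exercised without a server.
    """
    with opener(build_request(text, model)) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["embedding"]
```

Because every note goes through the same model, all returned vectors have the same length and can be compared directly.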

Clustering Strategy

The grouping step runs k-means over the note embeddings, comparing vectors by direction rather than magnitude so that the result behaves like clustering on cosine similarity.
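One common way to get cosine-like behavior from k-means is to scale every vector to unit length and assign points by dot product (often called spherical k-means). The sketch below uses only the standard library to stay self-contained; the function names, the fixed seed, and the iteration cap are assumptions, and the repository's scripts may well rely on a library instead.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans_cosine(vectors, k, iterations=50, seed=0):
    """Cluster vectors by direction; returns one cluster index per vector."""
    points = [normalize(v) for v in vectors]
    rng = random.Random(seed)          # fixed seed keeps runs reproducible
    centroids = rng.sample(points, k)  # requires k <= len(points)
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity.
        new_labels = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels
```

On unit vectors, ranking by dot product is the same as ranking by cosine similarity, which is why the normalization step matters.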

Choosing the number of clusters

Instead of hardcoding a single cluster count, the script evaluates a range of candidate values and picks the k with the best silhouette score.
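To make the selection concrete, here is a small standard-library sketch of the silhouette coefficient with cosine distance, plus a helper that picks the best-scoring k. It assumes the cluster labels for each candidate k have already been produced by whatever clustering routine is in use; `silhouette_score` and `pick_best_k` are illustrative names, not the repository's API.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient over all points, using cosine distance."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the point's own cluster.
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(labels)


def pick_best_k(vectors, labelings):
    """`labelings` maps each candidate k to the labels produced for that k."""
    scores = {k: silhouette_score(vectors, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores
```

Scores close to 1 indicate tight, well-separated clusters; scores near 0 or below suggest that points sit between clusters or in the wrong one.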

Benefits:

Category Labeling

Clusters are converted into readable category names using a hybrid approach:

  1. TF-IDF keywords to identify salient terms in each cluster.
  2. Rule-based patterns to normalize/override labels for known themes.

This keeps labels stable and understandable while still adapting to new content.
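The hybrid approach might look roughly like the sketch below. It treats each cluster as one document, ranks terms with a smoothed TF-IDF, and then lets a rule table override the label. The rule table, the tokenizer, and the fallback label are all hypothetical examples, and a real pipeline would likely also filter stop words.

```python
import math
import re
from collections import Counter

# Hypothetical rule table: the first rule that matches a top keyword wins.
LABEL_RULES = [
    (re.compile(r"^(git|rebase|commit)$"), "Git"),
    (re.compile(r"^(docker|container)$"), "Docker"),
]

TOKEN = re.compile(r"[a-z][a-z0-9+#-]*")


def tokenize(text):
    return TOKEN.findall(text.lower())


def top_keywords(cluster_texts, limit=3):
    """Rank each cluster's terms by TF-IDF, treating a cluster as one document."""
    term_counts = {
        cid: Counter(tok for text in texts for tok in tokenize(text))
        for cid, texts in cluster_texts.items()
    }
    # Document frequency: in how many clusters does each term appear?
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    n = len(term_counts)
    result = {}
    for cid, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scored = {
            term: (count / total) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by score, breaking ties alphabetically for stable output.
        ranked = sorted(scored, key=lambda term: (-scored[term], term))
        result[cid] = ranked[:limit]
    return result


def label_cluster(keywords):
    """Apply rule overrides first, then fall back to the joined keywords."""
    for keyword in keywords:
        for pattern, label in LABEL_RULES:
            if pattern.match(keyword):
                return label
    return " / ".join(keywords) if keywords else "Misc"
```

The rules give known themes a fixed name, while the keyword fallback still produces a usable label for clusters that no rule anticipates.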

Merging Small Categories

After initial clustering, very small categories (especially singletons) are merged into semantically related larger categories.

Merge behavior combines:

This improves final readability by avoiding noisy one-off buckets.
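As one possible illustration, the sketch below reassigns every cluster under a size threshold to the larger cluster whose centroid is most similar by cosine similarity. It covers only the embedding-similarity side of merging; the `min_size` threshold and the function names are assumptions, and the repository's actual merge criteria may include more signals.

```python
import math


def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_small_clusters(vectors, labels, min_size=2):
    """Reassign members of clusters smaller than `min_size` to the nearest large cluster."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    large = {lab: idx for lab, idx in members.items() if len(idx) >= min_size}
    if not large:
        return list(labels)  # nothing to merge into

    centroids = {lab: centroid([vectors[i] for i in idx]) for lab, idx in large.items()}
    merged = list(labels)
    for label, idx in members.items():
        if label in large:
            continue
        small_centroid = centroid([vectors[i] for i in idx])
        # Pick the large cluster whose centroid points in the most similar direction.
        target = max(centroids, key=lambda lab: cosine(small_centroid, centroids[lab]))
        for i in idx:
            merged[i] = target
    return merged
```

With the default `min_size=2`, only singletons are merged; raising it folds progressively larger clusters into their neighbors.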

Output

Primary generated artifacts:

The markdown output is designed for browsing and can be used directly in docs/README workflows.
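For illustration only, a renderer for this kind of output could look like the sketch below. The layout (one heading per category with a note count, newest notes first) and the JSON shape are hypothetical; the actual format of CATEGORIES.md is not specified here.

```python
import json


def render_markdown(categories):
    """`categories` maps a category name to a list of (date, text) notes."""
    lines = ["# Categories", ""]
    for name in sorted(categories):
        notes = categories[name]
        lines.append(f"## {name} ({len(notes)})")
        lines.append("")
        # Newest notes first; ISO dates sort correctly as strings.
        for date, text in sorted(notes, reverse=True):
            first_line = text.splitlines()[0] if text else ""
            lines.append(f"- {date}: {first_line}")
        lines.append("")
    return "\n".join(lines)


def render_json(categories):
    """Serialize the same grouping as JSON metadata."""
    payload = {
        name: [{"date": date, "text": text} for date, text in notes]
        for name, notes in categories.items()
    }
    return json.dumps(payload, indent=2, sort_keys=True)
```

Sorting categories and notes before writing keeps the generated file stable between runs, which makes diffs easier to review.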

Implementations in This Repo

Two implementations exist:

JavaScript version

Python version

Both produce equivalent categories when run with similar settings.

Typical Workflow

  1. Add/update notes in til.md.
  2. Run the categorization script (categorize.js or categorize.py).
  3. Review CATEGORIES.md.
  4. Optionally refresh README sections to reference/use categorized output.

Tuning and Extensions

As note volume increases, you can improve quality by:

Summary

This approach combines semantic embeddings, unsupervised clustering, and light rule-based post-processing to keep TIL notes organized automatically while keeping the output readable.