Skip to main content
Image de couverture pour Cartog: incremental indexing with a Merkle tree

Cartog: incremental indexing with a Merkle tree

TL;DR
Re-indexing 10,000 files on every save costs several seconds. With a Merkle tree, only the affected edges are re-resolved. Result: ~15 ms for a modified file, ~3 ms when nothing changed.
Try it directly: documentation · source code.

The previous articles build a complete code graph: symbols extracted by tree-sitter, semantic embeddings, edges resolved by heuristics and LSP. But that graph is recomputed in full on every cartog index.

On a small project (69 files, 4k lines), that takes ~95 ms. Acceptable. On a 10,000-file project it climbs to several seconds, a real drag, especially in watch mode where every save triggers a re-index.

If the developer edits a single file, why re-parse the other 9,999?

To avoid re-parsing everything, you need to know what changed.
But how do you compare two versions of a function when its line number just moved?

To go incremental, you have to identify a symbol in a stable way between two indexing runs. If the symbol changes whenever a line is added above it, any comparison becomes impossible. So you have to keep only what is stable.

Cartog identifies each symbol by file_path:kind:qualified_name: the file path, the symbol kind (class, method, function
), and its qualified name, that is, prefixed by its container (UserService.authenticate rather than authenticate alone):

src/auth.py:class:UserService
src/auth.py:method:UserService.authenticate
src/auth.py:function:verify_token

This format is:

  • Stable: adding a blank line above verify_token does not change its ID
  • Deterministic: same code generates the same symbol ID, no random UUID
  • Readable: useful for debugging and logs

The only things that invalidate an ID: renaming the symbol, changing its kind, or moving it to another file. These are exactly the cases where re-indexing is necessary.

Before parsing anything, cartog filters files through a cascade, where each level eliminates more than the previous one:

flowchart TD
    START["cartog index"] --> FORCE{"--force ?"}
    FORCE -->|yes| REINSERT["Re-parse every file<br>→ full reinsertion<br>(Merkle hashes recomputed)"]
    FORCE -->|no| GIT{"Level 1<br>git diff available?"}
    GIT -->|yes| GITLIST["Files flagged by git"]
    GIT -->|no| ALL["All files"]
    GITLIST --> HASH{"Level 2<br>SHA-256 identical?"}
    ALL --> HASH
    HASH -->|yes| SKIP["Skip file<br>(touched but unchanged)"]
    HASH -->|no| PARSE["Level 3<br>Parse → Merkle diff"]

Level 1 — Git diff: when .git is available and a last_commit is stored, the union of git diff --name-only last_commit..HEAD, git ls-files --others --exclude-standard, and the working tree/cached diffs identifies committed, staged, unstaged, and untracked files. Files outside that set are not even read from disk: this is the filter that contributes the most.

Level 2 — SHA-256: for each file that level 1 let through (or all files, when cold), the contents are read and hashed. If the hash matches the one stored in the database, the file is skipped before parsing. This catches the common case of an editor saving a file with no real change (timestamp modified, contents identical).

Level 3 — Symbol Merkle: only the files that actually changed are parsed, and the Merkle diff (next section) drills down to symbol granularity.

The --force flag skips levels 1 and 2: every file is re-parsed and its symbols re-inserted. The Merkle hashes (level 3) are recomputed and re-stored, but the surgical diff is not applied: it is a full reinsertion. Useful in case of index corruption or after a git rebase that invalidates the stored last_commit.

What about a Merkle at the folder level? If we hashed src/, we would know nothing inside changed without descending.

Tempting but a trap. On every filesystem cartog targets (APFS, ext4, NTFS, btrfs), a folder’s mtime only changes when the set of its children’s names changes: creation, deletion, rename. Modifying a file’s contents does not propagate up. Relying on the folder’s mtime would silently miss the most common case (e.g. a developer saving a modified file).

Walking the folder and hashing each file remains mandatory: that is exactly what level 2 does.

We know a file changed, but not which symbols inside. If we added a method to a class, there is no point deleting and recreating all of the file’s symbols, including its imports, its other functions, its constants.

How do you know, without re-parsing everything, that one precise symbol changed when it lives among hundreds of others?

A Merkle tree is a tree where each node stores a hash derived from its children’s hashes. A change in any leaf therefore propagates up to the root.

The property that makes this fast: if a node’s hash is identical on both sides, its entire subtree is unchanged, and there is no need to look inside. Comparing the two root hashes answers “did anything move?” in one shot; and where they differ, you descend only into the affected branches. Instead of comparing hundreds of symbols one by one, you follow only the path of hashes that changed. It is the same structure Git uses for its commits, and Bitcoin for its blocks.

Cartog applies this idea to the symbols extracted from a file:

The values below are fake hashes shortened to 4 characters for readability (in reality, SHA-256 over 64 characters). Key point: a parent combines the subtree_hash of each child (not its content_hash), so that a change in a deep leaf propagates all the way to the root. For a leaf with no children, subtree_hash = content_hash.

graph TD
    FILE["auth.py<br>file_hash = sha256(9f3a + 7c01)<br>= b8d2"] -->|subtree_hash 9f3a| F
    F["class UserService<br>content_hash = c4e7<br>subtree_hash = sha256(c4e7 + 1a2b + 5d6e)<br>= 9f3a"] -->|subtree_hash 1a2b| M1["authenticate()<br>content_hash = 1a2b"]
    F -->|subtree_hash 5d6e| M2["refresh_token()<br>content_hash = 5d6e"]
    FILE -->|subtree_hash 7c01| V["verify_token()<br>content_hash = 7c01"]

    style FILE fill:#e8eaf6,stroke:#3f51b5
    style F fill:#e3f2fd,stroke:#1976d2
    style M1 fill:#f3e5f5,stroke:#7b1fa2
    style M2 fill:#f3e5f5,stroke:#7b1fa2
    style V fill:#f3e5f5,stroke:#7b1fa2

So the file_hash combines 9f3a (the subtree_hash of UserService, which already includes its methods) and 7c01 (verify_token, a leaf), and not the class’s content_hash c4e7, which would ignore its methods.

If the body of authenticate() changes, its content_hash moves from 1a2b to something else; the subtree_hash of UserService therefore changes (9f3a → a new value), then the file_hash of auth.py. By contrast, verify_token (7c01) stays intact: its hash did not move, so we do not re-insert it.

Two levels of hash per symbol:

content_hash = SHA-256(kind + name + signature + the symbol’s raw source code) Captures the exact state of the symbol itself.

subtree_hash = SHA-256(content_hash + sorted children hashes) Captures the state of the symbol AND of its entire descent (a class’s methods, nested functions).

Sorting the children guarantees a canonical hash: declaration order does not affect the hash if the children are the same.

By comparing the old hashes (in the database) with the new hashes (fresh extraction), cartog classifies each symbol:

flowchart TD
    SYM["Extracted symbol"] --> EXISTS{"ID exists<br>in database?"}
    EXISTS -->|no| ADDED["Added<br>→ INSERT"]
    EXISTS -->|yes| CH{"content_hash<br>identical?"}
    CH -->|no| MODIFIED["Modified<br>→ UPDATE"]
    CH -->|yes| SH{"subtree_hash<br>identical?"}
    SH -->|yes| UNCHANGED["Unchanged<br>→ SKIP"]
    SH -->|no| CHILDREN["Children changed<br>→ UPDATE subtree_hash"]

    OLD["Symbol in database<br>but not extracted"] --> REMOVED["Removed<br>→ DELETE"]

    style ADDED fill:#c8e6c9,stroke:#388e3c
    style MODIFIED fill:#fff9c4,stroke:#f9a825
    style REMOVED fill:#ffcdd2,stroke:#d32f2f
    style UNCHANGED fill:#e0e0e0,stroke:#757575
    style CHILDREN fill:#e1f5fe,stroke:#0288d1
CategoryConditionAction
AddedID absent from the databaseINSERT symbol
ModifiedID exists, content_hash differentUPDATE symbol
Children changedcontent_hash identical, subtree_hash differentUPDATE subtree_hash only
UnchangedBoth hashes identicalNothing
RemovedID in the database but not in the extractionDELETE symbol

The “children changed” category is the key to granularity: if we add a method to UserService, the class itself (its body, its signature) has not changed, only its subtree_hash is updated. Existing edges pointing to UserService stay valid.

Modifying a symbol in one file can break edges coming from other files. Do we then have to re-index everyone that pointed to it?

Not everyone, only the affected edges. An edge is a relation between two symbols (A calls B, A imports B, A inherits from B); it is what makes the graph a graph. If verify_token is renamed to validate_token, the calls → verify_token edges from other files point into the void.

We have to find them and re-resolve them, but without touching the thousands of other edges that stayed valid.

Resolution happens in two stages:

sequenceDiagram
    participant DB as SQLite
    participant IDX as Indexer

    Note over IDX: Phase 1: Invalidation
    IDX->>DB: Find edges whose target no longer exists
    DB-->>IDX: edges with dangling target_id
    IDX->>DB: SET target_id = NULL

    Note over IDX: Phase 2: Re-resolution
    IDX->>DB: Resolve edges with target_id = NULL
    Note over DB: 6-level heuristic<br>only on the unresolved ones

    Note over IDX: Phase 3: Centrality
    IDX->>DB: Recompute in_degree<br>for dirty files

Only the affected edges are re-resolved, not the thousands of edges of unchanged files.

Phase 3 recomputes centrality (determined by the number of symbols that point to a given target) only for the dirty files, that is, those touched by the diff, no need to recompute it everywhere.

In cartog watch mode, a file watcher detects changes and triggers an incremental re-index. On every change, the structural index (symbols and edges) is updated: that is all an agent needs to navigate (name search, callees, impact).

flowchart LR
    FS["Filesystem events"] -->|debounce 5s| IDX["Incremental index<br>(symbols + edges)"]
    IDX --> FS

    style IDX fill:#e8f5e9,stroke:#4caf50

The filesystem debounce (5s by default, configurable via --debounce) groups events: saving a file often generates several (write, chmod, rename), and the debounce merges them into a single re-index. The 5-second value absorbs massive bursts (git pull, branch switch) without triggering re-index storms, while staying short enough to feel “live” as you type.

The graph stays fresh continuously: the MCP agent running in parallel sees changes in near real time thanks to SQLite’s WAL (Write-Ahead Logging) mode, which allows concurrent reads during writes.

On the test project (69 Python files, 4k lines):

ScenarioTime
Full index (cold)~95ms
Incremental index (1 file modified)~15ms
Incremental index (0 changes)~3ms

The cold figure is published in the project README; the two incremental figures are measured locally via cargo bench -p cartog-indexer.

The “0 changes” case is the most frequent in watch mode (a debounce on an event that produces no real diff).

The 3ms correspond to the git diff + hash verification: no tree-sitter parsing.

The graph is now precise and kept up to date in seconds or even milliseconds. All of it is stored in a local SQLite file.

The LLM agent that should exploit it does not have direct access to it yet.

How do you expose this graph to Claude Code, Cursor, or another coding agent without rewriting a client per editor?

That will be the topic of the next article in the series: exposing a code graph to an LLM agent via the Model Context Protocol.