TL;DR
Re-indexing 10,000 files on every save costs several seconds. With a Merkle tree, only the affected edges are re-resolved. Result: ~15 ms for a modified file, ~3 ms when nothing changed.
Try it directly: documentation · source code.
The previous articles build a complete code graph: symbols extracted by tree-sitter, semantic embeddings, edges resolved by heuristics and LSP. But that graph is recomputed in full on every cartog index.
On a small project (69 files, 4k lines), that takes ~95 ms. Acceptable. On a 10,000-file project it climbs to several seconds, a real drag, especially in watch mode where every save triggers a re-index.
If the developer edits a single file, why re-parse the other 9,999?
To avoid re-parsing everything, you need to know what changed.
But how do you compare two versions of a function when its line number just moved?
To go incremental, you have to identify a symbol in a stable way between two indexing runs. If the symbol changes whenever a line is added above it, any comparison becomes impossible. So you have to keep only what is stable.
Cartog identifies each symbol by file_path:kind:qualified_name: the file path, the symbol kind (class, method, functionâŠ), and its qualified name, that is, prefixed by its container (UserService.authenticate rather than authenticate alone):
src/auth.py:class:UserService
src/auth.py:method:UserService.authenticate
src/auth.py:function:verify_token
This format is:
- Stable: adding a blank line above
verify_tokendoes not change its ID - Deterministic: same code generates the same symbol ID, no random UUID
- Readable: useful for debugging and logs
The only things that invalidate an ID: renaming the symbol, changing its kind, or moving it to another file. These are exactly the cases where re-indexing is necessary.
Before parsing anything, cartog filters files through a cascade, where each level eliminates more than the previous one:
flowchart TD
START["cartog index"] --> FORCE{"--force ?"}
FORCE -->|yes| REINSERT["Re-parse every file<br>â full reinsertion<br>(Merkle hashes recomputed)"]
FORCE -->|no| GIT{"Level 1<br>git diff available?"}
GIT -->|yes| GITLIST["Files flagged by git"]
GIT -->|no| ALL["All files"]
GITLIST --> HASH{"Level 2<br>SHA-256 identical?"}
ALL --> HASH
HASH -->|yes| SKIP["Skip file<br>(touched but unchanged)"]
HASH -->|no| PARSE["Level 3<br>Parse â Merkle diff"]
Level 1 â Git diff: when .git is available and a last_commit is stored, the union of git diff --name-only last_commit..HEAD, git ls-files --others --exclude-standard, and the working tree/cached diffs identifies committed, staged, unstaged, and untracked files. Files outside that set are not even read from disk: this is the filter that contributes the most.
Level 2 â SHA-256: for each file that level 1 let through (or all files, when cold), the contents are read and hashed. If the hash matches the one stored in the database, the file is skipped before parsing. This catches the common case of an editor saving a file with no real change (timestamp modified, contents identical).
Level 3 â Symbol Merkle: only the files that actually changed are parsed, and the Merkle diff (next section) drills down to symbol granularity.
The --force flag skips levels 1 and 2: every file is re-parsed and its symbols re-inserted. The Merkle hashes (level 3) are recomputed and re-stored, but the surgical diff is not applied: it is a full reinsertion. Useful in case of index corruption or after a git rebase that invalidates the stored last_commit.
What about a Merkle at the folder level? If we hashed
src/, we would know nothing inside changed without descending.
Tempting but a trap. On every filesystem cartog targets (APFS, ext4, NTFS, btrfs), a folderâs mtime only changes when the set of its childrenâs names changes: creation, deletion, rename. Modifying a fileâs contents does not propagate up. Relying on the folderâs mtime would silently miss the most common case (e.g. a developer saving a modified file).
Walking the folder and hashing each file remains mandatory: that is exactly what level 2 does.
We know a file changed, but not which symbols inside. If we added a method to a class, there is no point deleting and recreating all of the fileâs symbols, including its imports, its other functions, its constants.
How do you know, without re-parsing everything, that one precise symbol changed when it lives among hundreds of others?
A Merkle tree is a tree where each node stores a hash derived from its childrenâs hashes. A change in any leaf therefore propagates up to the root.
The property that makes this fast: if a nodeâs hash is identical on both sides, its entire subtree is unchanged, and there is no need to look inside. Comparing the two root hashes answers âdid anything move?â in one shot; and where they differ, you descend only into the affected branches. Instead of comparing hundreds of symbols one by one, you follow only the path of hashes that changed. It is the same structure Git uses for its commits, and Bitcoin for its blocks.
Cartog applies this idea to the symbols extracted from a file:
The values below are fake hashes shortened to 4 characters for readability (in reality, SHA-256 over 64 characters). Key point: a parent combines the subtree_hash of each child (not its content_hash), so that a change in a deep leaf propagates all the way to the root. For a leaf with no children, subtree_hash = content_hash.
graph TD
FILE["auth.py<br>file_hash = sha256(9f3a + 7c01)<br>= b8d2"] -->|subtree_hash 9f3a| F
F["class UserService<br>content_hash = c4e7<br>subtree_hash = sha256(c4e7 + 1a2b + 5d6e)<br>= 9f3a"] -->|subtree_hash 1a2b| M1["authenticate()<br>content_hash = 1a2b"]
F -->|subtree_hash 5d6e| M2["refresh_token()<br>content_hash = 5d6e"]
FILE -->|subtree_hash 7c01| V["verify_token()<br>content_hash = 7c01"]
style FILE fill:#e8eaf6,stroke:#3f51b5
style F fill:#e3f2fd,stroke:#1976d2
style M1 fill:#f3e5f5,stroke:#7b1fa2
style M2 fill:#f3e5f5,stroke:#7b1fa2
style V fill:#f3e5f5,stroke:#7b1fa2
So the file_hash combines 9f3a (the subtree_hash of UserService, which already includes its methods) and 7c01 (verify_token, a leaf), and not the classâs content_hash c4e7, which would ignore its methods.
If the body of authenticate() changes, its content_hash moves from 1a2b to something else; the subtree_hash of UserService therefore changes (9f3a â a new value), then the file_hash of auth.py. By contrast, verify_token (7c01) stays intact: its hash did not move, so we do not re-insert it.
Two levels of hash per symbol:
content_hash = SHA-256(kind + name + signature + the symbolâs raw source code)
Captures the exact state of the symbol itself.
subtree_hash = SHA-256(content_hash + sorted children hashes)
Captures the state of the symbol AND of its entire descent (a classâs methods, nested functions).
Sorting the children guarantees a canonical hash: declaration order does not affect the hash if the children are the same.
By comparing the old hashes (in the database) with the new hashes (fresh extraction), cartog classifies each symbol:
flowchart TD
SYM["Extracted symbol"] --> EXISTS{"ID exists<br>in database?"}
EXISTS -->|no| ADDED["Added<br>â INSERT"]
EXISTS -->|yes| CH{"content_hash<br>identical?"}
CH -->|no| MODIFIED["Modified<br>â UPDATE"]
CH -->|yes| SH{"subtree_hash<br>identical?"}
SH -->|yes| UNCHANGED["Unchanged<br>â SKIP"]
SH -->|no| CHILDREN["Children changed<br>â UPDATE subtree_hash"]
OLD["Symbol in database<br>but not extracted"] --> REMOVED["Removed<br>â DELETE"]
style ADDED fill:#c8e6c9,stroke:#388e3c
style MODIFIED fill:#fff9c4,stroke:#f9a825
style REMOVED fill:#ffcdd2,stroke:#d32f2f
style UNCHANGED fill:#e0e0e0,stroke:#757575
style CHILDREN fill:#e1f5fe,stroke:#0288d1
| Category | Condition | Action |
|---|---|---|
| Added | ID absent from the database | INSERT symbol |
| Modified | ID exists, content_hash different | UPDATE symbol |
| Children changed | content_hash identical, subtree_hash different | UPDATE subtree_hash only |
| Unchanged | Both hashes identical | Nothing |
| Removed | ID in the database but not in the extraction | DELETE symbol |
The âchildren changedâ category is the key to granularity: if we add a method to UserService, the class itself (its body, its signature) has not changed, only its subtree_hash is updated. Existing edges pointing to UserService stay valid.
Modifying a symbol in one file can break edges coming from other files. Do we then have to re-index everyone that pointed to it?
Not everyone, only the affected edges. An edge is a relation between two symbols (A calls B, A imports B, A inherits from B); it is what makes the graph a graph. If verify_token is renamed to validate_token, the calls â verify_token edges from other files point into the void.
We have to find them and re-resolve them, but without touching the thousands of other edges that stayed valid.
Resolution happens in two stages:
sequenceDiagram
participant DB as SQLite
participant IDX as Indexer
Note over IDX: Phase 1: Invalidation
IDX->>DB: Find edges whose target no longer exists
DB-->>IDX: edges with dangling target_id
IDX->>DB: SET target_id = NULL
Note over IDX: Phase 2: Re-resolution
IDX->>DB: Resolve edges with target_id = NULL
Note over DB: 6-level heuristic<br>only on the unresolved ones
Note over IDX: Phase 3: Centrality
IDX->>DB: Recompute in_degree<br>for dirty files
Only the affected edges are re-resolved, not the thousands of edges of unchanged files.
Phase 3 recomputes centrality (determined by the number of symbols that point to a given target) only for the dirty files, that is, those touched by the diff, no need to recompute it everywhere.
In cartog watch mode, a file watcher detects changes and triggers an incremental re-index. On every change, the structural index (symbols and edges) is updated: that is all an agent needs to navigate (name search, callees, impact).
flowchart LR
FS["Filesystem events"] -->|debounce 5s| IDX["Incremental index<br>(symbols + edges)"]
IDX --> FS
style IDX fill:#e8f5e9,stroke:#4caf50
The filesystem debounce (5s by default, configurable via --debounce) groups events: saving a file often generates several (write, chmod, rename), and the debounce merges them into a single re-index. The 5-second value absorbs massive bursts (git pull, branch switch) without triggering re-index storms, while staying short enough to feel âliveâ as you type.
The graph stays fresh continuously: the MCP agent running in parallel sees changes in near real time thanks to SQLiteâs WAL (Write-Ahead Logging) mode, which allows concurrent reads during writes.
On the test project (69 Python files, 4k lines):
| Scenario | Time |
|---|---|
| Full index (cold) | ~95ms |
| Incremental index (1 file modified) | ~15ms |
| Incremental index (0 changes) | ~3ms |
The cold figure is published in the project README; the two incremental figures are measured locally via cargo bench -p cartog-indexer.
The â0 changesâ case is the most frequent in watch mode (a debounce on an event that produces no real diff).
The 3ms correspond to the git diff + hash verification: no tree-sitter parsing.
The graph is now precise and kept up to date in seconds or even milliseconds. All of it is stored in a local SQLite file.
The LLM agent that should exploit it does not have direct access to it yet.
How do you expose this graph to Claude Code, Cursor, or another coding agent without rewriting a client per editor?
That will be the topic of the next article in the series: exposing a code graph to an LLM agent via the Model Context Protocol.