
Document Pipeline

A document pipeline built as a set of Claude tools — reads a doc, masks PII, validates the extraction with structured JSON outputs, escalates anything unusual to a human reviewer, tracks token cost. Built for regulated environments: PII masking, HITL review, and validation as the default, not the add-on.

5 stage tools · MCP server · ajv validation · Python · Node
The Brief

Document processing pipelines look simple on a slide. Read it. Check it. Hide the personal info. Send anything wrong to a human. Ship. In practice they're a black box with Claude in the middle, doing everything from text extraction (OCR) to business rules to deciding what counts as personal info. Hard to debug, hard to know the cost, hard to swap one piece out.

This pipeline pulls those concerns apart. Five custom Claude tools, one per stage. Claude is the conductor, not the pipeline itself. Each tool owns its piece of the work — doc_extract returns a schema and the rules to follow, pii_mask hides personal info by field path and by pattern matching, doc_validate runs schema + business-rule checks and produces a list of exceptions, hitl_route turns those exceptions into a prioritized queue, cost_report totals the tokens.

The interesting design choice: doc_extract doesn't call Claude. It returns a work order — task, schema, rules — for Claude to carry out. No Anthropic API key needed for the demo. Production swaps the work order for a real API call without changing any other stage. The pattern survives the boundary.
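Here's the shape as a sketch, using the @modelcontextprotocol/sdk for Node; the server name, the schema/rule tables, and the handler body are stand-ins, not the repo's actual code:

import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

// Sketch only: doc_extract registered as an MCP tool that returns a contract,
// not a completion. SCHEMAS and RULES stand in for the per-doc-type files.
const SCHEMAS = { claim: { /* full AutoClaim JSON Schema */ } };
const RULES = { claim: ['Output JSON only, conforming exactly to the provided schema.' /* , 12 more */] };

const server = new McpServer({ name: 'doc-pipeline', version: '1.0.0' });

server.tool(
  'doc_extract',
  { doc_text: z.string(), doc_type: z.string() },
  async ({ doc_text, doc_type }) => {
    const workOrder = {
      task: `Extract structured data from this ${doc_type} document. Conform exactly to the schema. Return JSON only.`,
      doc_type,
      schema: SCHEMAS[doc_type],
      extraction_rules: RULES[doc_type],
      on_completion: `Pass the extracted JSON to pii_mask with doc_type='${doc_type}'.`,
      estimate: { stage: 'extract', input_tokens: Math.ceil(doc_text.length / 4) },
    };
    // No Anthropic call here: Claude, as the MCP client, executes the contract.
    return { content: [{ type: 'text', text: JSON.stringify(workOrder) }] };
  },
);

The handler never needs an API key; the contract is the whole job.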

01 — The pipeline 5 stages · 3 patterns · ~1240 lines of source code

Five tools, one conductor.

Each stage is a separate custom Claude tool, swappable on its own. The conductor is Claude. Three of the five tools change something (extract, mask, route); one is a check (validate); one totals things up (cost). The pipeline is composition.

  • stage 1 · doc_extract (mutation · work-order): returns the task, schema, and 13 rules. Claude does the extraction against the document text.
  • stage 2 · pii_mask (mutation · two-layer): hides known fields by path, plus regex matches anywhere. Free-text fields get sent back to Claude for a second look (sketched after this list).
  • stage 3 · doc_validate (critique · redaction-aware): schema check (with ajv) plus 12 business rules. Skips format checks on fields where the data is hidden.
  • stage 4 · hitl_route (mutation · triage): sends exceptions to a queue, auto-prioritized by rule type. Each entry comes with a suggested fix.
  • stage 5 · cost_report (aggregation): per-stage tokens × rates. Totals, plus what this would cost per 1,000 and per 100,000 documents.
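Roughly, in code, here is stage 2's two-layer masking; this is a minimal sketch, and the field paths, regex patterns, and return shape are assumptions, not the repo's actual tables:

// Layer 1: redact known PII fields by JSON path.
// Layer 2: regex-scan every remaining string value for pattern PII.
const PII_PATHS = ['policyholder.ssn', 'policyholder.dob']; // assumed paths
const PII_PATTERNS = [
  { type: 'SSN', re: /\b\d{3}-\d{2}-\d{4}\b/g },
  { type: 'PHONE', re: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g },
];

function maskByPath(doc, path) {
  const keys = path.split('.');
  const parent = keys.slice(0, -1).reduce((obj, k) => obj?.[k], doc);
  if (parent && parent[keys.at(-1)] != null) parent[keys.at(-1)] = '[PII-REDACTED]';
}

function maskPatterns(node) {
  for (const [key, value] of Object.entries(node)) {
    if (typeof value === 'string') {
      node[key] = PII_PATTERNS.reduce(
        (s, { type, re }) => s.replace(re, `[${type}-REDACTED]`), value);
    } else if (value && typeof value === 'object') {
      maskPatterns(value);
    }
  }
}

function piiMask(extracted) {
  PII_PATHS.forEach((p) => maskByPath(extracted, p)); // layer 1
  maskPatterns(extracted);                            // layer 2
  // Free-text fields flagged for Claude's second pass.
  return { masked: extracted, review_fields: ['incident.description'] }; // assumed field
}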
Design rationale: Separation by stage is separation by concern. Each tool's input is the previous tool's output — no shared state, no surprises. The pipeline becomes inspectable: drop a tool's trace into Claude, swap one piece for another, test different prompts side by side. The work-order pattern keeps the demo runnable on a Claude Max subscription; production with direct API access plugs in at doc_extract without touching the four downstream tools.
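A sketch of that chaining, with the handler shapes assumed:

// Each call below stands in for one tool (loosely mirroring scripts/run-pipeline.mjs).
async function runPipeline(docText, docType) {
  const order     = await docExtract({ doc_text: docText, doc_type: docType });
  const extracted = await executeWorkOrder(order);  // Claude (or an API call) does this part
  const masked    = await piiMask(extracted);
  const report    = await docValidate(masked.masked, docType);
  const queue     = await hitlRoute(report.exceptions);
  const costs     = await costReport([order, masked, report, queue]);
  return { extracted, masked, report, queue, costs }; // one trace per run
}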
02 — A real run 4 pre-baked traces · click a tab

Pick a doc. Watch it move.

Four runs: three clean documents, plus one claim with deliberately broken amounts and a missing required field. The traces below are real outputs from the pipeline runner — same JSON the MCP tools produce when run end-to-end.

03 — Inside the work-order pattern no API key required · production-portable

Tools return contracts, not responses.

The interesting MCP design choice. doc_extract doesn't call Claude — it's a tool that returns a structured work order for Claude to carry out. Production with direct API access plugs in here without touching any downstream stage.

tools/call · doc_extract [ MUTATION · WORK ORDER ]
request
{
  "doc_text": "NORTHSTAR MUTUAL — AUTO CLAIM SUBMISSION\nClaim ID: NSM-2026-0451829\n...",
  "doc_type": "claim"
}

response
{
  "task": "Extract structured data from this claim document. Conform exactly to the schema. Return JSON only.",
  "doc_type": "claim",
  "schema": { /* full JSON Schema for AutoClaim — types, requireds, patterns, formats */ },
  "extraction_rules": [
    "Output JSON only, conforming exactly to the provided schema. No commentary, no markdown fences.",
    "Use ISO 8601 for all dates (YYYY-MM-DD) and date-times (YYYY-MM-DDTHH:mm:ssZ or with offset).",
    "Currency values: strip $ and commas, return as numbers.",
    "Missing fields → null. Do not invent values.",
    "policyholder.address.state should be the two-letter USPS code.",
    "vehicle.year is an integer, not a string.",
    /* 7 more rules */
  ],
  "output_format": "Plain JSON, valid against the schema. No prose, no fences, no comments.",
  "on_completion": "Pass the extracted JSON to pii_mask with doc_type='claim'.",
  "estimate": { "stage": "extract", "input_tokens": 1955 }
}
Why the tool doesn't call Claude: In MCP, Claude is the agent calling the tools. Asking the server to call Claude AGAIN to do the extraction is double work for no gain — and locks every demo to an API key. Returning a work order keeps Claude as the executor. The server's job is to encode the contract: which schema, which rules, what to do with the result. Claude reads the contract, generates the JSON, hands off to the next stage. Production swaps the work order for an Anthropic API call at this single seam; the downstream tools don't notice.
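That seam, sketched with the official @anthropic-ai/sdk for Node; the model id and the prompt assembly here are placeholders, not the repo's code:

import Anthropic from '@anthropic-ai/sdk';

// Sketch: production replaces "Claude executes the work order" with one API call.
const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

async function executeWorkOrder(order) {
  const prompt = [
    order.task,
    `Schema:\n${JSON.stringify(order.schema, null, 2)}`,
    `Rules:\n${order.extraction_rules.join('\n')}`,
    order.output_format,
  ].join('\n\n');

  const msg = await client.messages.create({
    model: 'claude-sonnet-4-5', // placeholder model id
    max_tokens: 4096,
    messages: [{ role: 'user', content: prompt }],
  });
  // Same JSON shape the demo produces: downstream tools don't see the difference.
  return JSON.parse(msg.content[0].text);
}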
In review
Craft decisions
  • Five tools, one per stage. doc_extract / pii_mask / doc_validate / hitl_route / cost_report. Each owns one contract; the pipeline is composition.
  • The work-order pattern. doc_extract returns a schema + rules + task instead of calling Claude. Claude executes. No API key for the demo; the same plug-in point takes a real API call in production.
  • PII masking is two-layer: deterministic regex + path-based redaction first, then a list of free-text fields flagged for Claude to scan as a second pass. The tool draws the line between what can be caught with rules and what can't.
  • doc_validate is redaction-aware. Format and pattern errors on values that look like [TYPE-REDACTED] are filtered out automatically (sketched after this list). Rules that need the original PII (the age check against a redacted DOB, for example) skip with a note that they should run before masking in production.
  • Business rules live alongside the schema (a businessRules array in the JSON Schema file). One file per doc type. The functions that evaluate them live in the validator file. Adding a rule is two changes: schema entry plus evaluator function.
  • The pipeline runner (scripts/run-pipeline.mjs) composes all five tools end-to-end and emits a single trace per fixture. Those traces — checked into the repo — are the source-of-truth for the playback on this page.
  • Exceptions get auto-prioritized in hitl_route based on rule type. Missing required fields are high; business rule failures with monetary impact are high; other schema issues are medium; everything else is low.
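The redaction-aware filter, sketched; the placeholder regex and the error-walking are assumptions built around standard ajv v8 behavior:

import Ajv from 'ajv';
import addFormats from 'ajv-formats';

const REDACTED = /^\[[A-Z-]+-REDACTED\]$/; // assumed placeholder format

function validateMasked(schema, data) {
  const ajv = new Ajv({ allErrors: true });
  addFormats(ajv);
  const validate = ajv.compile(schema);
  validate(data);
  const errors = (validate.errors ?? []).filter((err) => {
    if (err.keyword !== 'format' && err.keyword !== 'pattern') return true;
    // Resolve the offending value from its instancePath, e.g. "/policyholder/ssn".
    const value = err.instancePath
      .split('/').slice(1)
      .reduce((obj, key) => obj?.[key], data);
    // Keep the error unless the value is a redaction placeholder.
    return !(typeof value === 'string' && REDACTED.test(value));
  });
  return { valid: errors.length === 0, exceptions: errors };
}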
What I'd do differently
  • PDF / OCR is not in the loop. Fixtures are plain text. A real version takes a PDF, reads the text from it (OCR), then enters at doc_extract. The OCR tool is the obvious next addition — and a clean MCP shape: take a PDF, return the text plus a confidence score per page.
  • The human queue is in-memory only. Production would save each entry to a ticket system (Notion, Linear, Jira) and route by team. The shape — queue_id, priority, suggested_resolution — is portable. The persistence layer isn't.
  • The second-pass review for pii_mask is documented but not auto-run. The tool returns a list of free-text fields to scan; Claude can scan those. A future tool — pii_review — would close that loop.
  • Token cost is estimated via chars / 4. Real API runs return measured token counts; cost_report already accepts a method: 'measured' flag and reports mode accordingly. Demo runs in estimated; production flips it (a sketch of the estimate path follows this list).
  • The way schema validation handles redacted values is to filter errors after the fact. A cleaner version would emit a second schema per doc type — same shape, but with the PII fields allowed to contain placeholder strings. The current approach is shorter; the cleaner version is more honest.
  • Auto-numbering, edge-case test fixtures, prompt-iteration logging. A second pass would build a proper evaluation system (an "evals harness") around doc_extract — the same kind of system Project 010 (voice_check v2) is going to need.
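The estimate path, sketched; the per-token rates and stage shape are assumptions, and chars / 4 is the demo's heuristic:

// Sketch of cost_report's estimated mode (rates and shapes assumed).
const RATES = { input: 3.00 / 1e6, output: 15.00 / 1e6 }; // $/token, placeholder pricing

const estimateTokens = (text) => Math.ceil(text.length / 4); // demo heuristic

function costReport(stages, method = 'estimated') {
  const perStage = stages.map(({ stage, inputText, outputText }) => ({
    stage,
    usd: estimateTokens(inputText) * RATES.input
       + estimateTokens(outputText) * RATES.output,
  }));
  const perDoc = perStage.reduce((sum, s) => sum + s.usd, 0);
  return {
    method, // 'estimated' here; real API runs report 'measured' counts instead
    perStage,
    perDoc,
    per1k: perDoc * 1_000,
    per100k: perDoc * 100_000,
  };
}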