A fine-tuned classifier that scores machine identity credentials by risk state. Two models trained on the same task — DistilBERT for production triage, Qwen 2.5 1.5B + LoRA for cachet. Side-by-side comparison with a real eval harness.
A real ML portfolio piece — synthetic dataset, two fine-tuned models, eval harness with per-class metrics and a cost/latency comparison. Classifies a machine identity credential into one of four risk states (active_legitimate · orphaned · over_scoped · suspicious). The classifier becomes the brain of the AI Agent Governance Console's Triage agent.
Machine Learning Engineer · Applied ML / NLP Engineer · AI Platform Engineer · AI Security Engineer
Most non-human identity (NHI) tooling is rules-based or LLM-prompted. Neither scales: rules miss the long tail, and LLM API calls cost too much at enterprise volume. A small fine-tuned classifier sits in the right place: cheap enough to run on every credential, accurate enough to triage.
Each class is defined by probabilistic signals in the synthetic dataset — not deterministic rules — so the model has to generalize patterns instead of memorizing signatures.
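To make "probabilistic signals" concrete, here is a minimal sketch of class-conditional sampling. The field names (`days_since_last_use`, `owner_missing`, `admin_scope`, `new_geo_access`) and every probability below are hypothetical illustrations, not the project's actual schema or parameters. The point it shows is that each class only shifts the odds of a signal, so records from different classes can look alike.

```python
import random

CLASSES = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

# Hypothetical per-class parameters (illustrative only, not the real dataset).
# Every signal is a probability or a distribution mean, so classes overlap.
PROFILES = {
    "active_legitimate": {"idle_mean": 3, "p_owner_missing": 0.02,
                          "p_admin_scope": 0.05, "p_new_geo": 0.02},
    "orphaned": {"idle_mean": 120, "p_owner_missing": 0.80,
                 "p_admin_scope": 0.10, "p_new_geo": 0.02},
    "over_scoped": {"idle_mean": 5, "p_owner_missing": 0.05,
                    "p_admin_scope": 0.85, "p_new_geo": 0.03},
    "suspicious": {"idle_mean": 2, "p_owner_missing": 0.10,
                   "p_admin_scope": 0.40, "p_new_geo": 0.75},
}


def sample_record(label, rng):
    """Draw one synthetic credential record for the given class label."""
    p = PROFILES[label]
    return {
        # Exponential idle time: long-idle records are likely, not guaranteed.
        "days_since_last_use": int(rng.expovariate(1 / p["idle_mean"])),
        "owner_missing": rng.random() < p["p_owner_missing"],
        "admin_scope": rng.random() < p["p_admin_scope"],
        "new_geo_access": rng.random() < p["p_new_geo"],
        "label": label,
    }


def make_dataset(n, seed=0):
    """Generate n labeled records; the seed makes the split reproducible."""
    rng = random.Random(seed)
    return [sample_record(rng.choice(CLASSES), rng) for _ in range(n)]
```

Because an `orphaned` record can still have an owner and an `active_legitimate` one can occasionally hold admin scope, no single field acts as a signature for a class.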
Both models see the same input (a serialized credential record), output the same four-way probability distribution, and are scored on the same held-out test set.
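The shared interface can be sketched as two small functions: one that serializes a record to text, and one that turns four logits into a probability distribution. The `key=value` format and the helper names here are assumptions for illustration; the project's actual serialization may differ.

```python
import math

LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]


def serialize(record):
    """Flatten a record into 'key=value' text with a stable key order."""
    return " | ".join(f"{key}={record[key]}" for key in sorted(record))


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_distribution(logits):
    """Map four model logits to a label -> probability dict."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one logit per label")
    return dict(zip(LABELS, softmax(logits)))
```

Sorting the keys keeps the text identical for identical records, so either model sees the same string no matter how the record dict was built.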
| | DistilBERT [ MEASURED ] | Qwen 2.5 1.5B + LoRA [ PROJECTED ] |
|---|---|---|
| Parameters | 67M (including classification head) | 1.5B base + ~8M LoRA adapter |
| Train time | ~6 min · MacBook CPU | ~hours · single T4 GPU |
| Test accuracy | 99.33% | — |
| F1 (macro) | 0.9933 | — |
| Inference latency (CPU) | 10.8 ms p50 · 11.2 ms p95 | ~60 ms / record |
| Deployment size on disk | 256 MB | ~30 MB (adapter only) |
| Cost per 1M predictions | $0.15 | ~$0.83 |
| Best use | Production triage queue | Hard cases + rationale |
DistilBERT numbers are measured on a MacBook CPU (450-record test set). Qwen 2.5 numbers are projections from comparable LoRA fine-tunes — the training script ships and runs against any LoRA-compatible base. The point of training both isn’t to declare a winner; it’s to make the production-vs-research trade-off visible.
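As a sanity check on the table, the cost and size rows can be related back to the latency and parameter rows. The sketch below assumes serial, unbatched inference on one CPU and fp32 (4-byte) weights. Neither assumption is stated in the source, and the hourly rate it derives is implied, not quoted.

```python
def cpu_hours_per_million(latency_ms):
    """CPU-hours to serve 1M predictions one at a time at this latency."""
    return latency_ms * 1_000_000 / 1000 / 3600


def implied_hourly_rate(cost_per_million, latency_ms):
    """Dollar-per-CPU-hour rate that would produce the quoted cost."""
    return cost_per_million / cpu_hours_per_million(latency_ms)


def fp32_size_mib(num_params):
    """Size of the weights in MiB, assuming 4 bytes per parameter."""
    return num_params * 4 / 2**20


# Figures taken from the comparison table above.
distilbert_rate = implied_hourly_rate(0.15, 10.8)  # 3.0 CPU-hours per 1M
qwen_rate = implied_hourly_rate(0.83, 60.0)        # about 16.7 CPU-hours per 1M

distilbert_mib = fp32_size_mib(67_000_000)  # about 256 MiB
adapter_mib = fp32_size_mib(8_000_000)      # about 30 MiB
```

Under these assumptions both cost figures work out to roughly $0.05 per CPU-hour, and both on-disk sizes match fp32 storage of the stated parameter counts, so the rows appear internally consistent.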
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| active_legitimate | 0.9818 | 0.9908 | 0.9863 | 109 |
| orphaned | 1.0000 | 1.0000 | 1.0000 | 103 |
| over_scoped | 0.9909 | 0.9909 | 0.9909 | 110 |
| suspicious | 1.0000 | 0.9922 | 0.9961 | 128 |
Three misses out of 450. The matrix below shows where they went.
Two of the three errors cross the active_legitimate ↔ over_scoped boundary, one in each direction. That is the same boundary a human reviewer would hesitate at. One suspicious credential was miscalled active_legitimate: the highest-cost error type, and the one to monitor in production. Nothing was miscalled suspicious or orphaned (precision is 1.0000 for both), so the model never raised a false alarm on those two labels.
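The per-class table and the error breakdown together pin down the full confusion matrix. The sketch below is a reconstruction from those reported numbers, not output copied from the harness, and it recomputes precision, recall, and F1 to confirm the figures agree.

```python
LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

# Rows are true labels, columns are predicted labels (same order as LABELS).
# Reconstructed from the reported supports, precision, and recall.
CONFUSION = [
    [108, 0, 1, 0],    # one active_legitimate called over_scoped
    [0, 103, 0, 0],    # orphaned: no errors
    [1, 0, 109, 0],    # one over_scoped called active_legitimate
    [1, 0, 0, 127],    # one suspicious called active_legitimate
]


def per_class_metrics(cm):
    """Compute precision, recall, F1, and support for each class."""
    size = len(cm)
    metrics = {}
    for i in range(size):
        tp = cm[i][i]
        support = sum(cm[i])                       # all true members of class i
        predicted = sum(cm[r][i] for r in range(size))  # all predicted as class i
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[LABELS[i]] = {"precision": precision, "recall": recall,
                              "f1": f1, "support": support}
    return metrics


METRICS = per_class_metrics(CONFUSION)
TOTAL = sum(sum(row) for row in CONFUSION)
ACCURACY = sum(CONFUSION[i][i] for i in range(len(LABELS))) / TOTAL
MACRO_F1 = sum(m["f1"] for m in METRICS.values()) / len(METRICS)
```

Rounded to four places, these values reproduce the table: 447 of 450 correct gives 0.9933 accuracy, and the macro F1 is also 0.9933.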
Accuracy alone hides per-class collapse and the cost of being wrong on the rare classes. The harness reports the things a deployment review would actually ask about.
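One of the things a deployment review asks about is latency, which the comparison table reports as p50 and p95. A minimal way to measure it might look like the sketch below. The `predict` callable, the warm-up count, and the nearest-rank percentile method are all assumptions, not the harness's actual code.

```python
import time


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def measure_latency_ms(predict, inputs, warmup=5):
    """Time predict() on each input, one record at a time, in milliseconds."""
    for item in inputs[:warmup]:
        predict(item)  # warm caches before timing
    timings = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        timings.append((time.perf_counter() - start) * 1000)
    return {"p50": percentile(timings, 50), "p95": percentile(timings, 95)}
```

Reporting p95 alongside p50 shows whether a small share of slow records would stall a triage queue even when the typical call is fast.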
Four make targets get you from a fresh clone to a trained DistilBERT model with a comparison report. The Qwen LoRA fine-tune needs a GPU.
The whole thing is <1,500 lines of Python. The training scripts are intentionally vanilla Hugging Face Trainer — no clever wrappers, no proprietary framework. Anyone reading the code can verify what's happening.