Machine Learning · Fine-Tuning · Hugging Face

NHI Credential Risk Classifier

A fine-tuned classifier that scores machine identity credentials by risk state. Two models trained on the same task — DistilBERT for production triage, Qwen 2.5 1.5B + LoRA for cachet. Side-by-side comparison with a real eval harness.

Two fine-tuned models · 3,000-record synthetic dataset · 4-class classification · Open source
At a glance 30-second read

What it is

A real ML portfolio piece — synthetic dataset, two fine-tuned models, eval harness with per-class metrics and a cost/latency comparison. Classifies a machine identity credential into one of four risk states (active_legitimate · orphaned · over_scoped · suspicious). The classifier becomes the brain of the AI Agent Governance Console's Triage agent.

Maps to

Machine Learning Engineer · Applied ML / NLP Engineer · AI Platform Engineer · AI Security Engineer

Skills demonstrated

  • Synthetic dataset design with controlled, learnable patterns
  • Transformer fine-tuning (Hugging Face Trainer)
  • LoRA / PEFT parameter-efficient fine-tuning
  • Eval methodology: per-class P/R/F1, confusion matrix, latency, cost
  • Production-vs-research model trade-off analysis
  • FastAPI inference server, CLI predictor
Hugging Face Transformers · PEFT / LoRA · PyTorch · scikit-learn · FastAPI · Python
What it is
Stage 1 · Dataset · 3,000 synthetic records
Stage 2 · Fine-tune · DistilBERT + Qwen LoRA
Stage 3 · Evaluate · P/R/F1, latency, cost
Stage 4 · Serve · FastAPI · CLI

Most non-human identity (NHI) tooling is rules-based or LLM-prompted. Neither scales: rules miss the long tail, and LLM API calls cost too much at enterprise volume. A small fine-tuned classifier sits in the right place: cheap enough to run on every credential, accurate enough to triage.

01 — The Task four risk classes

Four classes. One label per credential.

Each class is defined by probabilistic signals in the synthetic dataset — not deterministic rules — so the model has to generalize patterns instead of memorizing signatures.

Active Legitimate
active_legitimate
Owned, scoped to actual use, rotated within policy, recent activity. The healthy default state. Most credentials should land here.
Pattern · known owner · recent use · scope matches frequency
Orphaned
orphaned
No owner, no recent use, no documentation — often all three. Created for a project that ended; nobody knows it exists.
Pattern · null owner · unused 120+ days · no docs · old or missing rotation
Over-Scoped
over_scoped
Elevated or admin scope on a workload that looks read-only. Common after a one-time elevation that was never rolled back.
Pattern · scope > observed need · light/infrequent use
Suspicious
suspicious
Anomalous pattern — unusual geo, dormant-then-active, elevated+undocumented, fresh+unowned. Surface to a human now.
Pattern · geo anomaly · revival · doc gap · ownership gap
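The "probabilistic signals, not deterministic rules" idea can be sketched in a few lines. This is a minimal illustration, not the repository's generator: the field names (`owner`, `days_since_last_use`, `has_docs`) and every probability below are assumptions chosen for the example.

```python
import random

def sample_orphaned(rng: random.Random) -> dict:
    """Sample one hypothetical 'orphaned' record.

    Each signal fires with a probability rather than always, so no single
    field is a perfect giveaway for the label. All values are illustrative.
    """
    return {
        # Usually no owner, but not always.
        "owner": None if rng.random() < 0.9 else "team-platform",
        # Usually unused for 120+ days, occasionally used more recently.
        "days_since_last_use": (
            rng.randint(120, 720) if rng.random() < 0.85 else rng.randint(0, 119)
        ),
        # Documentation is rare but possible.
        "has_docs": rng.random() < 0.1,
        "label": "orphaned",
    }

rng = random.Random(0)  # fixed seed so the sample is reproducible
records = [sample_orphaned(rng) for _ in range(1000)]
null_owner_rate = sum(r["owner"] is None for r in records) / len(records)
print(f"null-owner rate among orphaned records: {null_owner_rate:.2f}")
```

Because some orphaned records still carry an owner or recent activity, a classifier trained on data like this has to weigh several signals together instead of keying on one field.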
02 — Two Models production vs. cachet

Two fine-tunes on the same task.

Both models see the same input (a serialized credential record), output the same four-way probability distribution, and are scored on the same held-out test set.
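As a rough sketch of that shared contract, the snippet below turns a record into text and turns four raw scores into a probability distribution. The serialization format, the record fields, and the logit values are all assumptions for illustration; the source does not specify them.

```python
import math

LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

def serialize(record: dict) -> str:
    """Flatten a record into 'key: value' text, sorted by key for stable output.

    Hypothetical format; the project's real serialization may differ.
    """
    return " | ".join(f"{key}: {record[key]}" for key in sorted(record))

def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

record = {"owner": None, "scope": "admin", "days_since_last_use": 200}
text = serialize(record)

# Made-up logits standing in for a model's output on `text`.
probs = dict(zip(LABELS, softmax([0.2, 2.5, 0.1, 1.0])))
top = max(probs, key=probs.get)
print(text)
print(top, round(probs[top], 3))
```

Either model can sit behind this interface, which is what makes a like-for-like comparison on the same test set possible.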

Metric | DistilBERT [ MEASURED ] | Qwen 2.5 1.5B + LoRA [ PROJECTED ]
Parameters | 67M (classification head) | 1.5B base + ~8M LoRA adapter
Train time | ~6 min · MacBook CPU | ~hours · single T4 GPU
Test accuracy | 99.33% | not reported
F1 (macro) | 0.9933 | not reported
Inference latency (CPU) | 10.8 ms p50 · 11.2 ms p95 | ~60 ms / record
Deployment size on disk | 256 MB | ~30 MB (adapter only)
Cost per 1M predictions | $0.15 | ~$0.83
Best use | Production triage queue | Hard cases + rationale

DistilBERT numbers are measured on a MacBook CPU (450-record test set). Qwen 2.5 numbers are projections from comparable LoRA fine-tunes — the training script ships and runs against any LoRA-compatible base. The point of training both isn’t to declare a winner; it’s to make the production-vs-research trade-off visible.
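The source does not say how the cost-per-million figures were derived. One reading that happens to reproduce both numbers is sequential single-record inference priced at a flat CPU rate; the $0.05 per CPU-hour rate below is an assumption inferred from the figures, not something the project states.

```python
def cost_per_million(latency_ms: float, usd_per_cpu_hour: float) -> float:
    """Projected cost of 1M sequential single-record predictions on one CPU."""
    cpu_hours = latency_ms / 1000 * 1_000_000 / 3600
    return cpu_hours * usd_per_cpu_hour

ASSUMED_USD_PER_CPU_HOUR = 0.05  # assumption: not stated in the source

# Latencies taken from the comparison table (p50 for DistilBERT).
distilbert_cost = cost_per_million(10.8, ASSUMED_USD_PER_CPU_HOUR)
qwen_cost = cost_per_million(60.0, ASSUMED_USD_PER_CPU_HOUR)
print(f"DistilBERT: ${distilbert_cost:.2f}  Qwen LoRA: ${qwen_cost:.2f}")
```

Under that assumption, 10.8 ms per record works out to 3 CPU-hours per million predictions and 60 ms to about 16.7 CPU-hours, so the cost gap tracks the latency gap directly.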

DistilBERT — per-class breakdown

Class | Precision | Recall | F1 | Support
active_legitimate | 0.9818 | 0.9908 | 0.9863 | 109
orphaned | 1.0000 | 1.0000 | 1.0000 | 103
over_scoped | 0.9909 | 0.9909 | 0.9909 | 110
suspicious | 1.0000 | 0.9922 | 0.9961 | 128

Three misses out of 450. The matrix below shows where they went.

Confusion matrix — 450 held-out records

actual ↓ / predicted → | active_legit. | orphaned | over_scoped | suspicious
active_legit. | 108 | 0 | 1 | 0
orphaned | 0 | 103 | 0 | 0
over_scoped | 1 | 0 | 109 | 0
suspicious | 1 | 0 | 0 | 127

Legend: correct (447) · misclassification (3) · zero

Two of the three errors cross the active_legitimate ↔ over_scoped boundary, one in each direction; it is the same boundary a human reviewer would hesitate at. One suspicious credential was miscalled active_legitimate, which is the highest-cost error type and the one to monitor in production. Nothing was miscalled suspicious or orphaned, and two of the three errors land on the less alarming label.
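The per-class figures can be re-derived from the confusion matrix alone. The sketch below uses plain Python rather than the project's scikit-learn harness, so the function name and structure are illustrative; only the matrix values come from the page.

```python
LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

# Rows are actual classes, columns are predicted classes (450 held-out records).
CM = [
    [108, 0, 1, 0],
    [0, 103, 0, 0],
    [1, 0, 109, 0],
    [1, 0, 0, 127],
]

def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class index."""
    n = len(cm)
    results = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[row][i] for row in range(n))  # column total
        actual = sum(cm[i])                              # row total (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1, actual))
    return results

metrics = per_class_metrics(CM)
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
accuracy = sum(CM[i][i] for i in range(len(CM))) / sum(map(sum, CM))

for label, (p, r, f1, support) in zip(LABELS, metrics):
    print(f"{label:<18} {p:.4f} {r:.4f} {f1:.4f} {support}")
print(f"macro F1 {macro_f1:.4f}  accuracy {accuracy:.4f}")
```

Both the macro F1 and the accuracy round to 0.9933, matching the headline numbers in the comparison table.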

03 — Eval Methodology scikit-learn + custom latency harness

What the harness measures.

Accuracy alone hides per-class collapse and the cost of being wrong on the rare classes. The harness reports the things a deployment review would actually ask about.

Per-class P/R/F1
Precision, recall, F1, support per class. Macro F1 is the headline metric — it weights all four classes equally so a model that nails the easy ones and tanks suspicious can't hide.
Confusion matrix
Where each class gets misclassified. Suspicious → orphaned is a different failure than suspicious → active_legitimate. The matrix lets you see which mistakes are expensive.
Latency + cost
p50 and p95 single-record CPU latency, model size on disk, projected dollar cost per million predictions. The numbers that decide whether you can run this on every credential.
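A single-record latency measurement along these lines might look like the sketch below. It is not the project's harness: `measure_latency_ms` and `fake_predict` are hypothetical names, and the stand-in predictor exists only so the example runs without a trained model.

```python
import statistics
import time

def measure_latency_ms(predict, sample, warmup=10, runs=200):
    """Time repeated single-record calls and report p50 and p95 in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(timings, n=100)
    return {"p50": statistics.median(timings), "p95": cuts[94]}

def fake_predict(text: str) -> int:
    """Stand-in for a model call; returns a class index in 0..3."""
    return sum(ord(ch) for ch in text) % 4

result = measure_latency_ms(fake_predict, "owner: None | scope: admin")
print(f"p50 {result['p50']:.4f} ms  p95 {result['p95']:.4f} ms")
```

Reporting p95 next to p50 matters because a triage queue is sized for the slow tail, not the typical call.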
04 — Try it clone & train

Reproducible end-to-end.

Four make targets get you from a fresh clone to a trained DistilBERT model with a comparison report. The Qwen LoRA fine-tune needs a GPU.

nhi-classifier · setup
$ git clone https://github.com/iambrucedavis/nhi-classifier
$ cd nhi-classifier && python -m venv .venv && source .venv/bin/activate
$ make install
$ make data  # generate 3,000 records
$ make train-distilbert  # ~5 min CPU
$ make eval  # writes results/comparison.md
$ make demo  # FastAPI at localhost:8000

The whole thing is <1,500 lines of Python. The training scripts are intentionally vanilla Hugging Face Trainer — no clever wrappers, no proprietary framework. Anyone reading the code can verify what's happening.

05 — Honest Limits what this is not

A portfolio piece, not a product.

Synthetic data
The dataset is structurally realistic but generated. A production version needs real credential telemetry — CloudTrail, Entra ID sign-in logs, secrets-manager access patterns — labeled by an identity team.
Four classes is a start
Real NHI taxonomies have more: compromised, deprecated, third-party, break-glass. The training pipeline scales by adding class definitions to the schema; apart from a wider output layer, the model architecture doesn't change.
LoRA is the demo, not the deploy
The LoRA fine-tune proves the technique. For real-world classification, DistilBERT wins on cost and latency. The LLM has a role for hard cases plus rationale, not bulk triage.