Machine Learning · Fine-Tuning · Hugging Face

NHI Credential Risk Classifier

A fine-tuned classifier that assigns machine identity (non-human identity, NHI) credentials to a risk state. Two models are trained on the same task: DistilBERT for production triage, and Qwen 2.5 1.5B + LoRA for cachet. They are compared side by side with a real eval harness.

Two fine-tuned models · 3,000-record synthetic dataset · 4-class classification · Open source

At a glance 30-second read

What it is

A real ML portfolio piece — synthetic dataset, two fine-tuned models, eval harness with per-class metrics and a cost/latency comparison. Classifies a machine identity credential into one of four risk states (active_legitimate · orphaned · over_scoped · suspicious). The classifier becomes the brain of the AI Agent Governance Console's Triage agent.

Maps to

Machine Learning Engineer · Applied ML / NLP Engineer · AI Platform Engineer · AI Security Engineer

Skills demonstrated

Synthetic dataset design with controlled, learnable patterns
Transformer fine-tuning (Hugging Face Trainer)
LoRA / PEFT parameter-efficient fine-tuning
Eval methodology: per-class P/R/F1, confusion matrix, latency, cost
Production-vs-research model trade-off analysis
FastAPI inference server, CLI predictor

Hugging Face Transformers · PEFT / LoRA · PyTorch · scikit-learn · FastAPI · Python

What it is

Stage 1: Dataset (3,000 synthetic records) → Stage 2: Fine-tune (DistilBERT + Qwen LoRA) → Stage 3: Evaluate (P/R/F1, latency, cost) → Stage 4: Serve (FastAPI · CLI)

Most NHI tooling is rules-based or LLM-prompted. Neither scales: rules miss the long tail, and LLM API calls cost too much at enterprise volume. A small fine-tuned classifier sits in the right place: cheap enough to run on every credential, accurate enough to triage.

01 — The Task four risk classes

Four classes. One label per credential.

Each class is defined by probabilistic signals in the synthetic dataset — not deterministic rules — so the model has to generalize patterns instead of memorizing signatures.

Active Legitimate

active_legitimate

Owned, scoped to actual use, rotated within policy, recent activity. The healthy default state. Most credentials should land here.

Pattern · known owner · recent use · scope matches frequency

Orphaned

orphaned

No owner, no recent use, no documentation — often all three. Created for a project that ended; nobody knows it exists.

Pattern · null owner · unused 120+ days · no docs · old or missing rotation

Over-Scoped

over_scoped

Elevated or admin scope on a workload that looks read-only. Common after a one-time elevation that was never rolled back.

Pattern · scope > observed need · light/infrequent use

Suspicious

suspicious

Anomalous pattern — unusual geo, dormant-then-active, elevated+undocumented, fresh+unowned. Surface to a human now.

Pattern · geo anomaly · revival · doc gap · ownership gap
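
The idea of probabilistic rather than deterministic signals can be sketched as follows. This is a minimal illustration, not the project's generator: the field names (has_owner, recently_used, documented, admin_scope) and every probability are assumptions chosen to echo the patterns above.

```python
import random

# Hypothetical per-class signal probabilities (assumed for illustration;
# the real generator's fields and values are not shown in this write-up).
SIGNAL_PROBS = {
    "active_legitimate": {"has_owner": 0.97, "recently_used": 0.95, "documented": 0.90, "admin_scope": 0.05},
    "orphaned":          {"has_owner": 0.10, "recently_used": 0.05, "documented": 0.15, "admin_scope": 0.10},
    "over_scoped":       {"has_owner": 0.85, "recently_used": 0.60, "documented": 0.70, "admin_scope": 0.95},
    "suspicious":        {"has_owner": 0.40, "recently_used": 0.90, "documented": 0.30, "admin_scope": 0.60},
}


def sample_record(label, rng):
    """Draw one record: each signal is a coin flip biased by the class."""
    record = {name: rng.random() < p for name, p in SIGNAL_PROBS[label].items()}
    record["label"] = label
    return record


def make_dataset(n, seed=0):
    """Generate n labeled records with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    labels = list(SIGNAL_PROBS)
    return [sample_record(rng.choice(labels), rng) for _ in range(n)]


dataset = make_dataset(3000)
```

Because no signal is ever 0% or 100%, some orphaned records still have an owner and some healthy ones lack documentation, so no single field gives the label away.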

02 — Two Models production vs. cachet

Two fine-tunes on the same task.

Both models see the same input (a serialized credential record), output the same four-way probability distribution, and are scored on the same held-out test set.
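
A rough sketch of that shared interface: a record is flattened to text, and four raw scores become a probability distribution. The record fields, the "key: value" format, and the example logits are all assumptions; the project's actual serialization is not shown here.

```python
import math

LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]


def serialize(record):
    """Flatten a credential record into 'key: value' text for a tokenizer."""
    return " | ".join(f"{key}: {record[key]}" for key in sorted(record))


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical record and logits, for illustration only.
record = {"owner": None, "days_since_last_use": 212, "scope": "admin", "documented": False}
text = serialize(record)
probs = dict(zip(LABELS, softmax([0.2, 3.1, 0.4, 1.0])))
```

Sorting the keys keeps the serialized text stable across records, which matters when the same string format is fed to two different tokenizers.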

	DistilBERT [ MEASURED ]	Qwen 2.5 1.5B + LoRA [ PROJECTED ]
Parameters	67M (including classification head)	1.5B base + ~8M LoRA adapter
Train time	~6 min · MacBook CPU	~hours · single T4 GPU
Test accuracy	99.33%	—
F1 (macro)	0.9933	—
Inference latency (CPU)	10.8 ms p50 · 11.2 ms p95	~60 ms / record
Deployment size on disk	256 MB	~30 MB (adapter only; the 1.5B base model is additional)
Cost per 1M predictions	$0.15	~$0.83
Best use	Production triage queue	Hard cases + rationale

DistilBERT numbers are measured on a MacBook CPU (450-record test set). Qwen 2.5 numbers are projections from comparable LoRA fine-tunes — the training script ships and runs against any LoRA-compatible base. The point of training both isn’t to declare a winner; it’s to make the production-vs-research trade-off visible.
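
The cost row can be reproduced from the latency row with simple arithmetic. The sketch below assumes sequential single-record inference and a compute price of $0.05 per hour; that rate is not stated anywhere in this write-up, it is simply the value that makes both table figures come out.

```python
def cost_per_million(latency_ms, usd_per_compute_hour):
    """Dollar cost of 1M sequential predictions at a given per-record latency."""
    compute_hours = latency_ms / 1000 * 1_000_000 / 3600
    return compute_hours * usd_per_compute_hour


RATE = 0.05  # assumed USD per compute-hour, inferred from the table, not sourced

distilbert_cost = cost_per_million(10.8, RATE)  # p50 latency from the table
qwen_cost = cost_per_million(60.0, RATE)        # projected latency from the table
```

At 10.8 ms per record, a million predictions take 3 compute-hours; at 60 ms they take about 16.7. The cost gap is the latency gap, nothing more.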

DistilBERT — per-class breakdown

Class	Precision	Recall	F1	Support
active_legitimate	0.9818	0.9908	0.9863	109
orphaned	1.0000	1.0000	1.0000	103
over_scoped	0.9909	0.9909	0.9909	110
suspicious	1.0000	0.9922	0.9961	128

Three misses out of 450. The matrix below shows where they went.

Confusion matrix — 450 held-out records

Actual ↓ / Predicted →	active_legit.	orphaned	over_scoped	suspicious
active_legit.	108	0	1	0
orphaned	0	103	0	0
over_scoped	1	0	109	0
suspicious	1	0	0	127

447 correct · 3 misclassified · all other cells zero

Two of the three errors cross the active_legitimate ↔ over_scoped boundary, one in each direction, the same boundary a human reviewer would hesitate at. One suspicious credential was miscalled active_legitimate: the highest-cost error type, and the one to monitor in production. Nothing was miscalled suspicious or orphaned, so the model raised no false alarms on those two classes; two of the three errors landed on the less alarming label.
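
As a consistency check, the per-class table can be recomputed from the confusion matrix in plain Python. The off-diagonal cells used here are inferred from the per-class precision and recall figures and the error description above; the function name is illustrative.

```python
LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

# Rows are actual classes, columns are predicted classes.
CM = [
    [108, 0, 1, 0],
    [0, 103, 0, 0],
    [1, 0, 109, 0],
    [1, 0, 0, 127],
]


def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class index."""
    size = len(cm)
    rows = []
    for i in range(size):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(size))  # column total
        actual = sum(cm[i])                             # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1, actual))
    return rows


metrics = per_class_metrics(CM)
accuracy = sum(CM[i][i] for i in range(len(CM))) / sum(map(sum, CM))
macro_f1 = sum(row[2] for row in metrics) / len(metrics)
```

Rounded to four places, this gives back the 99.33% accuracy and 0.9933 macro F1 from the comparison table.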

03 — Eval Methodology scikit-learn + custom latency harness

What the harness measures.

Accuracy alone hides per-class collapse and the cost of being wrong on the rare classes. The harness reports the things a deployment review would actually ask about.

Per-class P/R/F1

Precision, recall, F1, support per class. Macro F1 is the headline metric — it weights all four classes equally so a model that nails the easy ones and tanks suspicious can't hide.
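
A toy example of why macro F1 is the headline. The counts below are hypothetical, not project results: an imbalanced test set and a model that never predicts suspicious.

```python
LABELS = ["active_legitimate", "orphaned", "over_scoped", "suspicious"]

# Hypothetical imbalanced test set (not the project's data).
y_true = (["active_legitimate"] * 400 + ["orphaned"] * 40
          + ["over_scoped"] * 40 + ["suspicious"] * 20)
# A model that is perfect everywhere except it calls every suspicious record healthy.
y_pred = ["active_legitimate" if t == "suspicious" else t for t in y_true]


def f1_for(label, y_true, y_pred):
    """One-vs-rest F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_for(label, y_true, y_pred) for label in LABELS) / len(LABELS)
```

Accuracy comes out at 0.96 while macro F1 drops to roughly 0.74, because the zero on suspicious counts as a full quarter of the average.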

Confusion matrix

Where each class gets misclassified. Suspicious → orphaned is a different failure than suspicious → active_legitimate. The matrix lets you see which mistakes are expensive.
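
One way to make "expensive" concrete is to weight each error by a cost. The weights below are hypothetical placeholders, not part of the harness; the three errors are the ones reported for DistilBERT.

```python
# Hypothetical cost per (actual, predicted) error; unlisted errors cost DEFAULT_COST.
ERROR_COST = {
    ("suspicious", "active_legitimate"): 10.0,  # missed threat: worst case
    ("suspicious", "orphaned"): 3.0,            # still gets cleaned up, but slowly
}
DEFAULT_COST = 1.0


def weighted_error_cost(errors, cost=ERROR_COST, default=DEFAULT_COST):
    """Sum the cost of a list of (actual, predicted) misclassifications."""
    return sum(cost.get(pair, default) for pair in errors)


distilbert_errors = [
    ("active_legitimate", "over_scoped"),
    ("over_scoped", "active_legitimate"),
    ("suspicious", "active_legitimate"),
]
total_cost = weighted_error_cost(distilbert_errors)
```

Under these assumed weights, the single suspicious miss accounts for 10 of the 12 cost units, even though it is only one of three errors.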

Latency + cost

p50 and p95 single-record CPU latency, model size on disk, projected dollar cost per million predictions. The numbers that decide whether you can run this on every credential.
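
A minimal sketch of how single-record p50 and p95 latency can be measured. The function name, the warm-up count, and the stand-in predictor are assumptions; the project's harness may differ.

```python
import statistics
import time


def measure_latency(predict, inputs, warmup=5):
    """Time predict() one record at a time and report p50/p95 in milliseconds."""
    for item in inputs[:warmup]:
        predict(item)  # warm-up calls are not timed

    samples_ms = []
    for item in inputs:
        start = time.perf_counter()
        predict(item)
        samples_ms.append((time.perf_counter() - start) * 1000)

    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "n": len(samples_ms)}


# Stand-in predictor so the sketch runs without a trained model.
stats = measure_latency(lambda text: len(text.split()), ["owner: None | scope: admin"] * 200)
```

Swapping the lambda for a real model call is the only change needed; timing one record at a time matches the single-record figures reported in the comparison table.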

04 — Try it clone & train

Reproducible end-to-end.

Four make targets get you from a fresh clone to a trained DistilBERT model with a comparison report. The Qwen LoRA fine-tune needs a GPU.

nhi-classifier · setup

$ git clone https://github.com/iambrucedavis/nhi-classifier
$ cd nhi-classifier && python -m venv .venv && source .venv/bin/activate
$ make install
$ make data               # generate 3,000 records
$ make train-distilbert   # ~5 min CPU
$ make eval               # writes results/comparison.md
$ make demo               # FastAPI at localhost:8000

The whole thing is <1,500 lines of Python. The training scripts are intentionally vanilla Hugging Face Trainer — no clever wrappers, no proprietary framework. Anyone reading the code can verify what's happening.

05 — Honest Limits what this is not

A portfolio piece, not a product.

Synthetic data

The dataset is structurally realistic but generated. A production version needs real credential telemetry — CloudTrail, Entra ID sign-in logs, secrets-manager access patterns — labeled by an identity team.

Four classes is a start

Real NHI taxonomies have more: compromised, deprecated, third-party, break-glass. The training pipeline scales by adding labels to the schema and labeled records to the dataset; the model architecture doesn't change beyond the size of the output layer.

LoRA is the demo, not the deploy

The LoRA fine-tune proves the technique. For real-world classification, DistilBERT wins on cost and latency. The LLM has a role for hard cases plus rationale, not bulk triage.
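
One plausible way to split the work is confidence-based routing: the small model handles everything it is sure about and escalates the rest. The threshold and route names below are hypothetical; this write-up does not specify how hard cases are selected.

```python
def route(probs, threshold=0.90):
    """Pick the top label and decide whether the small model's answer stands.

    probs: mapping of label -> probability from the bulk classifier.
    threshold: assumed confidence cutoff, to be tuned on held-out data.
    """
    label = max(probs, key=probs.get)
    destination = "auto_triage" if probs[label] >= threshold else "llm_review"
    return {"label": label, "confidence": probs[label], "route": destination}


confident = route({"active_legitimate": 0.97, "orphaned": 0.01,
                   "over_scoped": 0.01, "suspicious": 0.01})
borderline = route({"active_legitimate": 0.52, "orphaned": 0.02,
                    "over_scoped": 0.41, "suspicious": 0.05})
```

The first record stays in the cheap path; the second, split between active_legitimate and over_scoped, is the kind of case where an LLM rationale earns its cost.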
