Quick Start
Fit a Probe
from transformers import AutoModelForCausalLM, AutoTokenizer
from hprobes import HProbes
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-4b-it",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")
# samples: list of dicts, each with a question, the answer options (dict or list,
# stored here under "choices"), and the correct answer (stored under "answer")
probe = HProbes(model, tokenizer, l1_C=0.01)
probe.fit(samples, options_key="choices", answer_key="answer")
print(f"H-Neurons found: {probe.n_neurons_}")
print(f"Layer distribution: {probe.layer_distribution_}")
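The sample format described in the comment above can be sketched as follows. This is only an illustration: the `"question"` key name, the letter-style answers, and the `check_sample` helper are assumptions made for the example, not part of the hprobes API. The `"choices"` and `"answer"` keys mirror the `options_key` and `answer_key` arguments passed to `fit`.

```python
# Hypothetical MCQ samples. Options may be a dict (letter -> text) or a plain list.
samples = [
    {
        "question": "Which organ is most affected?",
        "choices": {"A": "Heart", "B": "Lung", "C": "Liver", "D": "Kidney"},
        "answer": "C",
    },
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "choices": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
        "answer": "C",  # assumed convention: letters index the list in order
    },
]


def check_sample(sample, options_key="choices", answer_key="answer"):
    """Return a list of problems found in one sample (empty if it looks fine)."""
    problems = []
    if not isinstance(sample.get("question"), str):
        problems.append("missing or non-string question")
    options = sample.get(options_key)
    if not isinstance(options, (dict, list)) or len(options) == 0:
        problems.append(f"{options_key!r} must be a non-empty dict or list")
    if answer_key not in sample:
        problems.append(f"missing {answer_key!r}")
    return problems


# Report problems per sample before handing the list to a probe.
for index, sample in enumerate(samples):
    print(index, check_sample(sample) or "ok")
```

Running a check like this before fitting catches malformed rows early, before any model forward passes are spent on them.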
Score and Validate
results = probe.score()
print(f"AUROC: {results['auroc']:.3f}")
print(f"Balanced accuracy: {results['balanced_accuracy']:.3f}")
print(f"Threshold (Youden's J): {results['threshold']:.3f}")
# Causal validation: scale H-Neuron activations and measure accuracy change
cv = probe.causal_validate()
# Example result, mapping each activation scale factor to accuracy:
# {0.0: 0.15, 0.5: 0.22, 1.0: 0.30, 1.5: 0.28, 2.0: 0.25}
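For readers unfamiliar with the Youden's J threshold reported above, here is a minimal pure-Python sketch of the general technique: pick the cutoff that maximizes J = TPR - FPR. The scores and labels are made up for illustration, and `youden_threshold` is a hypothetical helper; how hprobes computes its threshold internally is not shown in this README.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR.

    A sample is predicted positive when its score >= threshold.
    labels: 1 = positive (assumed here to mean "hallucination"), 0 = negative.
    Ties keep the lowest threshold that reaches the maximum J.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_threshold, best_j = None, -1.0
    for candidate in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if s >= candidate and y == 1)
        false_pos = sum(1 for s, y in zip(scores, labels) if s >= candidate and y == 0)
        j = true_pos / positives - false_pos / negatives
        if j > best_j:
            best_threshold, best_j = candidate, j
    return best_threshold, best_j


# Made-up probe scores and ground-truth labels.
scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, j = youden_threshold(scores, labels)
print(f"threshold={threshold:.2f}  J={j:.2f}")
```

On this toy data the cutoffs 0.40 and 0.70 both reach J = 0.75, and the sketch returns the lower of the two.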
Production Hallucination Detection
No ground truth is required; detection takes one or two forward passes per sample.
# Single sample
risk = probe.detect("Which organ is most affected? A) Heart B) Lung C) Liver D) Kidney\n\nAnswer:")
print(f"Risk: {risk:.3f}") # 0 = likely correct, 1 = likely hallucinating
# With a pre-computed answer letter (saves one forward pass)
risk = probe.detect(prompt, answer_letter="C")
# Batched (GPU-efficient)
scores = probe.detect_batch(prompts, batch_size=8)
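One way to act on batched risk scores is to flag the highest-risk prompts for review. The sketch below assumes `detect_batch` returns one float per prompt, in input order; that is an assumption about the return shape. The `flag_high_risk` helper and the 0.5 cutoff are hypothetical placeholders, not part of hprobes.

```python
def flag_high_risk(prompts, scores, threshold=0.5):
    """Keep (prompt, score) pairs at or above the threshold, highest risk first."""
    flagged = [(p, s) for p, s in zip(prompts, scores) if s >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Placeholder prompts and made-up scores standing in for detect_batch output.
prompts = ["prompt A", "prompt B", "prompt C"]
scores = [0.12, 0.81, 0.55]

for prompt, score in flag_high_risk(prompts, scores, threshold=0.5):
    print(f"{score:.2f}  {prompt}")
```

In practice the cutoff would be chosen on validation data rather than fixed at 0.5.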
Save and Load
probe.save("results/gemma_medqa") # writes .json + .pkl
probe = HProbes.load("results/gemma_medqa", model, tokenizer)
# Score on a new dataset with the loaded probe
probe.score_on(new_samples, options_key="choices", answer_key="answer")
CLI
# Fit and score on an MCQ dataset
hprobes run --model google/gemma-3-4b-it --data dataset.jsonl --samples 500
# Transfer a saved probe to a different model
hprobes transfer --probe results/probe --model google/gemma-3-4b --data dataset.jsonl
# Fit from pre-generated responses with judge labels
hprobes responses --model google/gemma-3-4b-it --data responses.jsonl
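The `--data` arguments above point at JSONL files, meaning one JSON object per line. A minimal sketch of writing and reading such a file with the standard library is shown below. The field names (`question`, `choices`, `answer`) are assumptions carried over from the Python example; the schema the CLI actually expects is not documented in this section.

```python
import json
import os
import tempfile

# Hypothetical records; field names are assumed, not confirmed by the CLI docs.
records = [
    {
        "question": "Which organ is most affected?",
        "choices": {"A": "Heart", "B": "Lung", "C": "Liver", "D": "Kidney"},
        "answer": "C",
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round-trip through a temporary file to confirm the format is lossless.
path = os.path.join(tempfile.mkdtemp(), "dataset.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
print(f"{len(loaded)} record(s) read from {path}")
```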
Output Files
Each probe.save() or CLI run produces two files:
| File | Contents |
|---|---|
| `<name>.json` | Human-readable results: H-Neurons, AUROC, accuracy, causal validation |
| `<name>.pkl` | Serialized classifier, normalization stats, neuron list |