Making models that are accurate, calibrated, and honest by default.
Overview
Alignment research focuses on making models follow instructions reliably, refuse appropriately, and express calibrated uncertainty. We study reinforcement learning from human feedback, direct preference optimization, and evaluation methods for alignment-relevant behaviors.
Research Directions
Direct Preference Optimization
Training models on preference data without a separate reward model, reducing instability and training cost.
Calibrated uncertainty
Models whose expressed confidence matches their actual accuracy on factual queries.
Instruction following under distribution shift
Maintaining compliance with complex, multi-constraint instructions on out-of-distribution prompts.
Red teaming automation
Automated adversarial probing of model behavior across policy-relevant categories.
Refusal quality
Distinguishing appropriate refusals from over-refusals that degrade user experience without safety benefit.
Reinforcement Learning from Human Feedback (RLHF) dominated alignment research from 2021 to 2023, but it requires a separately trained reward model and is prone to reward hacking. Direct Preference Optimization (DPO; Rafailov et al., 2023) reformulates the RLHF objective as a supervised learning problem over preference pairs, which eliminates the reward model and removes the reinforcement learning loop from training. We use DPO as our default alignment method and reserve targeted RLHF for tasks where the reward signal is dense and verifiable.
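As a rough illustration of the supervised objective described above, the sketch below computes the DPO loss for a single preference pair from summed sequence log-probabilities. It is a minimal, dependency-free example and not our training code: the function name, argument names, and the default `beta=0.1` are assumptions chosen for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta scales the implicit reward; 0.1 is an assumed default.
    """
    # Implicit rewards are log-ratios of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When policy and reference agree, the margin is zero and the loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

No reward model appears anywhere in the computation: the only inputs are log-probabilities from the policy and a frozen reference copy, which is why the objective can be optimized with ordinary supervised training.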
An overconfident model is a liability in any high-stakes application: it states incorrect facts with the same surface confidence as correct ones. We treat calibration as a first-class alignment property and measure Expected Calibration Error (ECE) alongside standard capability benchmarks. Fine-tuning with calibration-aware losses reduces ECE from 0.14 to 0.04 on our factual QA benchmarks without degrading accuracy.
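For readers unfamiliar with ECE, the following sketch shows one common way to compute it: group predictions into equal-width confidence bins and take the sample-weighted average gap between accuracy and mean confidence in each bin. The binning scheme and the choice of 10 bins are assumptions for illustration; the source does not specify how the benchmark bins predictions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (illustrative sketch).

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct: booleans, whether each answer was actually right.
    n_bins: number of bins; 10 is an assumed, commonly used default.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per bin: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(is_correct)

    ece = 0.0
    for count, conf_sum, num_correct in bins:
        if count == 0:
            continue
        accuracy = num_correct / count
        avg_confidence = conf_sum / count
        ece += (count / n) * abs(accuracy - avg_confidence)
    return ece


# An overconfident model: 90% stated confidence, 50% actual accuracy.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

In the overconfident example, every prediction falls in one bin with mean confidence 0.9 and accuracy 0.5, so the ECE is approximately 0.4.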
Production models that over-refuse erode user trust and utility. We train refusal classifiers on annotated examples that distinguish genuine policy violations from superficial pattern matches on sensitive-sounding keywords. Our internal benchmark covers 4,200 prompts across 14 sensitive categories. The goal is refusal that is precise, not merely cautious, with a false positive rate below 3% on benign requests that touch policy-adjacent topics.
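The false positive rate above can be read as the fraction of benign prompts that the model refuses. The sketch below computes that rate overall and per category from annotated records. The record layout `(category, is_violation, refused)` and the category names in the example are hypothetical and are not the format of the internal benchmark.

```python
from collections import defaultdict


def over_refusal_rate(records):
    """False positive rate of refusals on benign prompts (illustrative sketch).

    records: iterable of (category, is_violation, refused) tuples, where
    is_violation is the annotator label and refused is the model's behavior.
    Returns (overall_rate, per_category_rates).
    """
    benign = defaultdict(int)          # benign prompts seen per category
    refused_benign = defaultdict(int)  # benign prompts that were refused

    for category, is_violation, refused in records:
        if is_violation:
            continue  # refusing a genuine violation is not a false positive
        benign[category] += 1
        if refused:
            refused_benign[category] += 1

    per_category = {c: refused_benign[c] / benign[c] for c in benign}
    total_benign = sum(benign.values())
    overall = sum(refused_benign.values()) / total_benign if total_benign else 0.0
    return overall, per_category


# Hypothetical annotated records; category names are placeholders.
sample = [
    ("medical", False, False),
    ("medical", False, True),   # benign but refused: a false positive
    ("weapons", True, True),    # genuine violation, correctly refused
    ("weapons", False, False),
]
overall_rate, by_category = over_refusal_rate(sample)
```

Reporting the rate per category as well as overall makes it easier to spot a single topic area where keyword-triggered refusals are concentrated, which an aggregate number can hide.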