08 Active Research

Model Alignment

Making models that are accurate, calibrated, and honest by default.

RLHF, DPO, Calibration, Safety

Overview

Alignment research focuses on making models follow instructions reliably, refuse appropriately, and express calibrated uncertainty. We study reinforcement learning from human feedback, direct preference optimization, and evaluation methods for alignment-relevant behaviors.

Research Directions

Direct Preference Optimization

Training models on preference data without a separate reward model, reducing instability and training cost.

Calibrated uncertainty

Models that express confidence proportional to their actual accuracy on factual queries.

Instruction following under distribution shift

Maintaining compliance with complex, multi-constraint instructions on out-of-distribution prompts.

Red teaming automation

Automated adversarial probing of model behavior across policy-relevant categories.

Refusal quality

Distinguishing appropriate refusals from over-refusals that degrade user experience without safety benefit.

The RLHF to DPO transition

Reinforcement Learning from Human Feedback (RLHF) dominated alignment research from 2021 to 2023, but it requires a separately trained reward model and is prone to reward hacking, where the policy exploits flaws in the learned reward instead of improving the intended behavior. Direct Preference Optimization (DPO; Rafailov et al., 2023) reformulates the RLHF objective as a supervised classification loss over preference pairs, removing the explicit reward model and the on-policy sampling loop from training. We use DPO as our default alignment method, with targeted RLHF on tasks where the reward signal is dense and verifiable.
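As a rough illustration of why no separate reward model is needed, the per-pair DPO loss from Rafailov et al. (2023) can be sketched in plain Python. The function names, the `beta` default, and the example log-probabilities are assumptions made for illustration; this is not our training code, and a real implementation would operate on batched tensors with summed token log-probabilities.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not a recommended setting
) -> float:
    """DPO loss for one preference pair.

    Each argument is the total log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the scaled difference of the implicit rewards.
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))


def mean_dpo_loss(pairs, beta: float = 0.1) -> float:
    """Average DPO loss over an iterable of 4-tuples of log-probabilities."""
    losses = [dpo_loss(*pair, beta=beta) for pair in pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical log-probabilities, chosen only to exercise the function.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

When the policy equals the reference, the loss is log 2 for every pair; it falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference.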

Calibration as a safety property

An overconfident model is a liability in any high-stakes application: it states incorrect facts with the same surface confidence as correct ones. We treat calibration as a first-class alignment property and measure Expected Calibration Error (ECE) alongside standard capability benchmarks. Fine-tuning with calibration-aware losses reduces ECE from 0.14 to 0.04 on our factual QA benchmarks without degrading accuracy.
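For readers unfamiliar with the metric, ECE can be sketched as follows, assuming the common equal-width binning variant: predictions are grouped by stated confidence, and the gap between mean confidence and accuracy in each bin is averaged with weights proportional to bin size. The bin count and the function name are assumptions; the exact configuration behind the figures above is not specified here.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE.

    confidences: floats in [0, 1], the model's stated confidence per answer.
    correct:     booleans (or 0/1), whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Per-bin accumulators: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(bool(is_correct))

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count == 0:
            continue
        accuracy = hits / count
        mean_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    # Four answers at 90% stated confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

In the usage example the model claims 90% confidence but is right 75% of the time, so the single occupied bin contributes a gap of about 0.15.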

Refusal quality and the over-refusal problem

Production models that over-refuse erode user trust and utility. We train refusal classifiers on annotated examples that distinguish genuine policy violations from superficial pattern matches on sensitive-sounding keywords. Our internal benchmark covers 4,200 prompts across 14 sensitive categories. The goal is refusal that is precise, not merely cautious, with a false positive rate below 3% on benign requests that touch policy-adjacent topics.
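The false positive rate above can be read as the share of benign prompts that the model refuses. A minimal sketch of a per-category computation follows; the record layout `(category, is_benign, refused)` and the category names are hypothetical and do not reflect the internal benchmark's schema.

```python
from collections import defaultdict


def over_refusal_rates(records):
    """False positive rate of refusals per category.

    records: iterable of (category, is_benign, refused) tuples.
    Only benign prompts count; refusing a genuine policy violation
    is a true positive and is ignored here.
    """
    benign_total = defaultdict(int)
    benign_refused = defaultdict(int)
    for category, is_benign, refused in records:
        if not is_benign:
            continue
        benign_total[category] += 1
        benign_refused[category] += int(bool(refused))
    return {
        category: benign_refused[category] / benign_total[category]
        for category in benign_total
    }


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    sample = [
        ("medical", True, False),
        ("medical", True, True),   # benign prompt refused: a false positive
        ("medical", False, True),  # real violation refused: not counted
        ("security", True, False),
    ]
    for category, rate in over_refusal_rates(sample).items():
        print(category, rate)
```

Reporting the rate per category, not only in aggregate, helps show whether over-refusal is concentrated in a few topics.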
