04 Applied Research

Production MLOps

Operational practices for AI systems that stay reliable over time.

CI/CD · Monitoring · Model Registry · Drift Detection

Overview

The operational side of AI at scale: model versioning, deployment automation, drift detection, and lifecycle management. We build the tooling and practices that let teams confidently ship AI systems and keep them running.

Research Directions

Continuous training pipelines

Automated retraining triggered by distribution drift signals, not calendar schedules.

Statistical drift detection

Population stability index and embedding drift metrics for early warning on input distribution shifts.

Experiment tracking at scale

Versioned experiment graphs linking datasets, hyperparameters, and evaluation results.

Model cards automation

Auto-generated model documentation from training metadata and eval outputs.

Canary deployments for LLMs

Shadow scoring and progressive traffic shifting with automatic rollback on quality regression.

Why AI deployment is different

Traditional software deployments fail deterministically: a bug either crashes the process or it does not. AI systems fail gradually and silently. A model can degrade in quality over weeks as its input distribution drifts from the training distribution, with no error logs and no alert. Detecting and responding to this class of failure requires a different operational mindset.

Drift detection in practice

We monitor three layers: input drift (are incoming queries changing?), output drift (are model responses changing?), and outcome drift (are downstream task success rates changing?). Each layer requires different instrumentation. Input and output drift are detectable with embedding distance metrics. Outcome drift requires labeled feedback loops, which we build from implicit signals such as user corrections and thumbs-down events.
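As a rough illustration of the embedding distance idea for input or output drift, the sketch below compares the mean embedding of a reference window with that of a current window using cosine distance. The function name `centroid_cosine_drift`, the choice of centroid cosine distance, and the synthetic data are assumptions made for illustration; the text does not say which distance metric is used in production.

```python
import numpy as np

def centroid_cosine_drift(reference: np.ndarray, current: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows.

    Both inputs are (n_samples, dim) arrays. The result lies in [0, 2]:
    0 means the two centroids point the same way, larger means more drift.
    """
    ref_centroid = reference.mean(axis=0)
    cur_centroid = current.mean(axis=0)
    denom = np.linalg.norm(ref_centroid) * np.linalg.norm(cur_centroid)
    if denom == 0.0:
        raise ValueError("a centroid has zero norm; cosine distance is undefined")
    cosine = float(np.dot(ref_centroid, cur_centroid) / denom)
    return 1.0 - float(np.clip(cosine, -1.0, 1.0))

# Synthetic stand-ins for embedded queries (hypothetical data).
rng = np.random.default_rng(0)
reference = rng.normal(loc=1.0, scale=0.1, size=(500, 8))
same = rng.normal(loc=1.0, scale=0.1, size=(500, 8))
shifted = same.copy()
shifted[:, :4] *= -1.0  # flip half the dimensions to simulate a shift

drift_same = centroid_cosine_drift(reference, same)        # close to 0
drift_shifted = centroid_cosine_drift(reference, shifted)  # close to 1
```

A centroid comparison is cheap but only sees shifts in the mean; a change in spread or the appearance of a new cluster that leaves the mean in place would need a richer statistic.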

Population Stability Index

PSI measures how much the distribution of a feature (e.g., query embedding PCA projection) has shifted relative to a reference window. PSI below 0.1 is stable; above 0.25 triggers a retraining alert.

python

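import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
    """Population Stability Index. PSI > 0.25 indicates significant drift."""
    # Derive bin edges from the reference window and reuse them for both
    # samples; separate edges per sample would make the buckets incomparable.
    edges = np.histogram_bin_edges(expected, bins=buckets)
    # Open the outer bins so values outside the reference range are still counted.
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct   = np.histogram(actual,   bins=edges)[0] / len(actual)
    # avoid log(0)
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct   = np.clip(actual_pct,   1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))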
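
To make the thresholds concrete, here is a small worked example that computes PSI directly from bucket proportions and maps the result onto the 0.1 and 0.25 cut-offs. The helper names and the label strings are hypothetical, and treating the band between the two thresholds as a "watch" zone is an assumption; the text only defines the stable and alert ends.

```python
import math

def psi_from_proportions(expected_pct, actual_pct):
    """PSI for two lists of bucket proportions (same buckets, all values > 0)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_pct, actual_pct))

def classify_psi(value, stable_below=0.1, alert_above=0.25):
    """Map a PSI value to a label using the thresholds quoted in the text."""
    if value < stable_below:
        return "stable"
    if value > alert_above:
        return "retrain_alert"
    return "watch"  # assumed label for the band the text leaves unspecified

# Two buckets: the reference window is split 50/50, the current window 70/30.
value = psi_from_proportions([0.5, 0.5], [0.7, 0.3])
# (0.7 - 0.5) * ln(0.7 / 0.5) + (0.3 - 0.5) * ln(0.3 / 0.5) ≈ 0.169
label = classify_psi(value)
```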

Canary deployments for language models

We route 5% of production traffic to the candidate model and compare its quality scores, assigned by a frozen judge model, against those of the current production baseline. If the candidate stays within a 2% margin of the baseline for 30 minutes, traffic shifts to 25%, then to 100%. Any regression triggers automatic rollback within 90 seconds. This approach has shipped 23 consecutive model updates without a user-visible quality incident.
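The promotion logic described above can be sketched as a small state machine. This is a minimal illustration and not the production controller: the names `CanaryState` and `step` are hypothetical. It assumes the 2% margin is relative to the baseline score, that only a candidate scoring below the baseline counts as a regression, and that the same 30-minute hold applies before each promotion. The 90-second rollback deadline is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = (0.05, 0.25, 1.00)  # traffic fractions from the text
MARGIN = 0.02                # assumed: candidate may trail baseline by up to 2% (relative)
HOLD_SECONDS = 30 * 60       # assumed: same hold before every promotion

@dataclass
class CanaryState:
    stage: int = 0
    healthy_since: Optional[float] = None
    rolled_back: bool = False

    @property
    def traffic(self) -> float:
        """Fraction of traffic currently sent to the candidate."""
        return 0.0 if self.rolled_back else STAGES[self.stage]

def step(state: CanaryState, now: float, candidate: float, baseline: float) -> CanaryState:
    """Advance the rollout given the latest judge scores at time `now` (seconds)."""
    if state.rolled_back:
        return state
    if candidate < baseline * (1.0 - MARGIN):
        state.rolled_back = True  # regression: send all traffic back to baseline
        return state
    if state.stage == len(STAGES) - 1:
        return state              # already at full traffic
    if state.healthy_since is None:
        state.healthy_since = now
    if now - state.healthy_since >= HOLD_SECONDS:
        state.stage += 1          # promote and restart the hold timer
        state.healthy_since = now
    return state

# A healthy candidate is promoted after each 30-minute hold.
ok = CanaryState()
step(ok, now=0, candidate=0.90, baseline=0.90)
step(ok, now=1800, candidate=0.89, baseline=0.90)
traffic_after_first_hold = ok.traffic
step(ok, now=3600, candidate=0.90, baseline=0.90)
traffic_after_second_hold = ok.traffic

# A candidate that falls outside the margin is rolled back immediately.
bad = CanaryState()
step(bad, now=0, candidate=0.80, baseline=0.90)
traffic_after_regression = bad.traffic
```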

Model registry and lineage

Every production model artifact is registered with a full lineage record: source dataset hashes, training run ID, hyperparameters, evaluation results, and the deployment history. This makes incident investigation deterministic: given a timestamp, we can reconstruct exactly which model version was serving, what it was trained on, and what its pre-deployment eval scores were.
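One way to picture the timestamp lookup is an in-memory registry that stores lineage records alongside a sorted deployment history. The sketch below is illustrative only: the class names, field names, and sample values are hypothetical, and a real registry would persist this data and not hold it in Python lists.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class LineageRecord:
    model_version: str
    dataset_hashes: Tuple[str, ...]
    training_run_id: str
    hyperparameters: Dict[str, float]
    eval_results: Dict[str, float]

class ModelRegistry:
    def __init__(self) -> None:
        self._records: Dict[str, LineageRecord] = {}
        self._deploy_times: List[float] = []    # kept sorted
        self._deploy_versions: List[str] = []   # version deployed at the matching time

    def register(self, record: LineageRecord) -> None:
        self._records[record.model_version] = record

    def record_deployment(self, timestamp: float, model_version: str) -> None:
        if model_version not in self._records:
            raise KeyError(f"unregistered model version: {model_version}")
        idx = bisect_right(self._deploy_times, timestamp)
        self._deploy_times.insert(idx, timestamp)
        self._deploy_versions.insert(idx, model_version)

    def serving_at(self, timestamp: float) -> Optional[LineageRecord]:
        """Return the lineage of the model serving at `timestamp`, if any."""
        idx = bisect_right(self._deploy_times, timestamp) - 1
        if idx < 0:
            return None  # nothing had been deployed yet
        return self._records[self._deploy_versions[idx]]

# Hypothetical sample data.
registry = ModelRegistry()
registry.register(LineageRecord("v1", ("sha256:aaa",), "run-001", {"lr": 3e-4}, {"judge_score": 0.88}))
registry.register(LineageRecord("v2", ("sha256:bbb",), "run-002", {"lr": 1e-4}, {"judge_score": 0.91}))
registry.record_deployment(100.0, "v1")
registry.record_deployment(200.0, "v2")

during_v1 = registry.serving_at(150.0)
at_switch = registry.serving_at(200.0)
before_any = registry.serving_at(50.0)
```

Keeping deployments in a sorted list makes the lookup a binary search, so reconstructing the serving version for an incident timestamp stays fast as the history grows.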
