← Back to Blog
    July 5, 20265 min readDPO
    Article

    Model Alignment - Gold Labels vs Preference Pairs vs Verifier?

    Your eval signal picks your method: gold labels → SFT, preference pairs → DPO, verifiers → RFT. The math, trade-offs, and failure modes explained.

    N
    Neelam Pawar
    Syntropylabs
    DPOPreference tuningRFTModel alignment
    Model Alignment - Gold Labels vs Preference Pairs vs Verifier?

    Preference Tuning

    It Post training alignment method that teaches a model from relative judgments instead of absolute answers. You don't hand the model a gold label saying "this is the correct response." You hand it a pair — one response marked chosen, one marked rejected — and the model learns to shift probability mass toward the kind of answer people prefer and away from the kind they don't. That makes it the right tool precisely where correctness is subjective: tone, politeness, brand voice, style, safety posture, clarity — cases where many valid outputs exist and "better" can only be defined by comparison

    Screenshot 2026-07-05 at 9.31.48 AM

    When should you use Preference tuning?

    Example applications where DPO is particularly effective in aligning AI responses include:

    • Enhancing Conversational AI Responses
    • Improving Code Generation Quality & Style
    • Ensuring Compliance with Legal, Ethical & Safety Standards
    • Controlling Brand Voice, Professionalism, & Tone
    • Customizing Creative Outputs & User Experience

    How it Works

    • Step 1: Generate Responses

    The process begins with an unaligned or partially trained language model (typically right after Supervised Fine-Tuning). The Action: A prompt is fed into the LLM, and the model generates two distinct candidate answers (Response A and Response B) by varying its internal sampling settings. The Goal: To capture different potential angles, phrasing strategies, or factual structures for the exact same input instruction.

    • Step 2: Get Human (or AI) Feedback

    Once the responses are generated, they are evaluated by an external judge—traditionally a human annotator, though increasingly an advanced LLM (AI Feedback / RLAIF). The Action: The reviewer looks at both completions side-by-side and chooses which one handles the prompt more safely, accurately, or helpfully. The Output: In this specific example, Response A is marked with a red cross (Rejected) because it might contain a flaw, while Response B is marked with a green checkmark (Chosen) as the preferred variant.

    • Step 3: Create a Preference Dataset

    Individual evaluation decisions are aggregated into a highly structured data matrix. The Action: The prompt along with its corresponding evaluated response pair is saved into an offline tabular training file. The Output: Each clean row in this database contains a specific triplet: [Prompt, Chosen Response, Rejected Response]. Thousands or millions of these rows are collected to establish a broad baseline of what high-quality model behavior looks like.

    • Step 4: Update LLM Weights directly(The DPO Algorithm)

    This is the core innovation of DPO. Instead of using reinforcement learning loops or building an intermediate scoring model, DPO processes the offline dataset using standard supervised learning techniques. The Action: The training algorithm feeds both the chosen and rejected text paths back through the LLM simultaneously to examine the model's current token-probability distributions ($P$). The Math Engine: The DPO loss function forces a mathematical adjustment: it actively pushes the probability curve of the Chosen path higher while compressing the probability curve of the Rejected path down.

    The End Result: The internal parameters ($\theta$) of the LLM are updated directly, ensuring that the next time the model encounters a similar prompt, it will natively favour the formatting, tone, and logical path of the preferred answers.

    Screenshot 2026-07-05 at 9.36.18 AM

    Training Job for DPO

    ts
    #OpenAI-style DPO job
    ft = client.fine_tuning.jobs.create(
        model=base_model,
        training_file=train_file_id,
        validation_file=val_file_id,
        method={
            "type": "dpo",
            "dpo": {"hyperparameters": {
                "n_epochs": 2,
                "beta": 0.1,
                "batch_size": 8,
            }},
        },
    )
    Screenshot 2026-07-05 at 9.38.57 AM
    • The one hyperparameter that matters β

    β is the KL knob inherited from equation (1) — it controls how far the tuned model may drift from the reference. In OpenAI's fine-tuning API it ranges 0–2; Google's Gemini preference-tuning jobs recommend 0.01–0.5 (and β = 0 stops learning entirely). The intuition is a spectrum:

    β → 0 · aggressive chases new preferences hard; stylistic shift, collapse risk

    Screenshot 2026-07-05 at 9.40.37 AM

    Issue with DPO

    • Out-of-Distribution (OOD)Vulnerability: DPO is highly sensitive to the distribution shift between the base model's outputs and the preference dataset. It can easily find biased solutions that over-allocate probability to unseen, low-quality, or completely out-of-distribution responses.
    • Failure on Complex Tasks: In challenging domains like code generation (benchmarked on APPS and CodeContests), DPO completely failed to improve upon the base Supervised Fine-Tuning (SFT) model, sometimes even degrading to a 0% pass rate by producing meaningless code snippets.
    Screenshot 2026-07-05 at 9.45.51 AM

    The mental modeL for Preference tuning

    One pipeline, two stages, increasing refinement SFT can only ever push the model toward the examples it is shown. It has no way to express what to move away from, because it never sees a "bad" answer. DPO closes that gap: every training example is a pair, so the model gets a positive signal (the chosen response) and a negative signal (the rejected one) at the same time.

    Screenshot 2026-07-05 at 9.49.52 AM

    Why DPO completes SFT?

    This reduces the magnitude of weight updates during DPO, stabilizing training and preventing overfitting." The combined SFT-then-DPO workflow "converges faster and yields higher-quality results.

    Screenshot 2026-07-05 at 9.52.33 AM

    SFT vs RFT vs Preference Tuning(DPO) - when to pick which ?

    Screenshot 2026-07-05 at 9.54.58 AM
    Screenshot 2026-07-05 at 9.56.33 AM

    A big issue for agent builders:

    Multi-turn breaks vanilla DPO and any other alignment methods. Everything above assumed single-turn: one prompt, one response. Agents — web navigation, tool use, multi-step workflows — violate the derivation in two ways. First, the reward arrives only at the end of a trajectory, so credit assignment across a dozen actions is ambiguous. Second, and more fundamentally, the partition function Z Z that so elegantly canceled in equation (4) only canceled because it depended on a static prompt. In a multi-turn environment it depends on the ever-changing state, it no longer cancels, and standard DPO acquires a length-dependent bias

    Next blog we will discuss about Multi turn Agentic use case Model alignment.

    Related reading

    View all →

    More insights await

    Explore our latest articles on AI evaluation, LLM optimization, and engineering best practices.

    Read more articles →