Model alignment

In AI, alignment is the process of steering a model's behavior so that its outputs match human values, intent, and safety standards. Think of a raw AI model as a brilliant but untrained library assistant that has memorized the entire internet. Left to itself, it will simply guess the next word in a sentence, regardless of whether that word is helpful, rude, or dangerous. Alignment is the training that turns that assistant into a polite, helpful, and safe professional. You might be wondering how to align a model so that it understands your company's voice. Below are some ways to align an LLM for your company.

Deep Dive into Alignment Methods

1) SFT (Supervised Fine-Tuning)

Supervised Fine-Tuning (SFT) is a post-training technique used when you have a well-defined task and a high-quality, labeled dataset.

How it works: It adapts model behavior by adjusting the model's internal weights to minimize the difference between what the model predicts and the actual correct labels in your dataset.
What it achieves: It emphasizes knowledge already present inside the model, refines response structure or tone, and optimizes costs or latency (by saving tokens you would otherwise spend on long prompts). It acts as a way to turn a raw, chaotic model into a specialized, disciplined assistant.
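To make "minimizing the difference between predictions and labels" concrete, here is a minimal sketch of the loss SFT typically minimizes: the average negative log-likelihood (cross-entropy) of the labeled tokens. The 3-token vocabulary, the logits, and the function names are assumptions made up for illustration; a real training job computes this over the model's full vocabulary and backpropagates it into the weights.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the labeled (correct) tokens."""
    nll = 0.0
    for logits, target in zip(token_logits, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[target])
    return nll / len(target_ids)

# Hypothetical toy setup: 3-token vocabulary, 2 positions, labels are tokens 0 and 1.
targets = [0, 1]
unsure = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]      # model has no preference yet
confident = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]   # model favors the labeled tokens

loss_unsure = sft_loss(unsure, targets)
loss_confident = sft_loss(confident, targets)
print(loss_unsure, loss_confident)
```

The loss is lower when the model puts more probability on the labeled tokens, which is the direction the weight updates push in.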

Use Cases for SFT

a) Improve Accuracy and Performance: Improving the model's overall task success rate.
b) Optimize Output Structure or Formatting: Ensuring the AI consistently formats its output (e.g., specific JSON schemas, style guides) without needing long, complex prompts.
c) Increase Domain-Specific Understanding: Helping the model grasp unique, industry-specific language and terminology.
d) Reduce Cost and Latency: Shortening prompt templates (which saves tokens) and allowing organizations to achieve high-quality results using smaller, faster models.
e) Improve Factuality and Reduce Hallucinations: Helping models produce factual answers aligned with specific guidelines.

Limitation of SFT

2) RL (Reinforcement Learning)

Reinforcement learning (RL) is a technique used to align models with human values, enforce safety guidelines, and teach them how to solve complex reasoning problems.

Various Ways to Implement RL

a) RFT (Reinforcement Fine-Tuning)

Reinforcement fine-tuning (RFT) is a powerful technique for adapting large language models (LLMs) to maximize the reward provided by a user-defined reward function (which can be code-based or an LLM-based reward model). The model explores many possible answers, observes a numeric reward for each, and gradually shifts its behavior so that high-reward answers become more likely and low-reward answers become less likely.

Unlike Supervised Fine-Tuning (SFT), which teaches a model by forcing it to imitate static, pre-written examples, RFT allows the model to dynamically explore multiple response paths, evaluate its own outcomes against a reward signal, and update its internal weights to favor high-scoring behaviors.
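The idea of "shifting behavior toward high-reward answers" can be sketched with a toy policy that holds one preference score per candidate answer. Everything here is an illustrative assumption: the three candidate answers, the `reward` function, and the learning rate are made up, and the loop scores every candidate each step (using the expected policy-gradient update) instead of sampling from a real LLM, so the result is deterministic.

```python
import math

def softmax(prefs):
    """Convert preference scores into answer probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def reward(answer):
    """Hypothetical code-based reward: 1.0 for the correct answer, else 0.0."""
    return 1.0 if answer == "17" else 0.0

answers = ["17", "25", "20"]   # candidate answers to "What is 2 + 3 x 5?"
prefs = [0.0, 0.0, 0.0]        # toy stand-in for model weights
lr = 0.5                       # illustrative learning rate

initial_probs = softmax(prefs)
for _ in range(50):
    probs = softmax(prefs)
    rewards = [reward(a) for a in answers]
    # Baseline = expected reward under the current policy.
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # Raise preferences for above-baseline answers, lower the rest.
    prefs = [th + lr * p * (r - baseline)
             for th, p, r in zip(prefs, probs, rewards)]
final_probs = softmax(prefs)
print(initial_probs, final_probs)
```

No pre-written target answer is ever shown to the policy; it only sees numeric rewards, which is the key contrast with SFT.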

When to use RFT

Complex instruction following: When the model needs to follow intricate instructions where the "correct" output isn't a single answer but requires a set of checks. Reinforcement learning can help the model learn to navigate these complex instruction sets.
Creative content generation: For tasks like generating creative writing, poetry, or highly specialized code, where objective metrics are difficult to define. Reinforcement learning, with a well-designed reward function, can guide the model toward more desirable and innovative outputs.
Complex reasoning tasks: For tasks that require heavy reasoning, such as solving word puzzles, function calling, or mathematical problems, reinforcement learning can incentivize the model to explore different "thought" sequences to learn which patterns of reasoning are most effective at solving problems.

Algorithm used in RFT

1) GRPO (supported by Google Cloud Gemini RFT)

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm popularized by DeepSeek (specifically in models like DeepSeekMath and DeepSeek-R1). It was designed as a more efficient, less resource-heavy alternative to traditional PPO (Proximal Policy Optimization) for aligning Large Language Models.

How GRPO Works

Step 1: Generate a Group of Responses

Instead of generating a single response or a pair of responses, GRPO samples a group of $N$ outputs (typically $N = 4$ to $8$) from the same initial prompt using the current LLM policy.

Example: For the prompt "What is $2 + 3 \times 5$?", the model might generate four different variations (Responses A, B, C, and D), ranging from incorrect guesses to structured, step-by-step reasoning chains.

Step 2: Assign Raw Scores (The Reward Function)

Each of the $N$ responses is fed into a Reward Function to receive a raw score. This reward function can be a programmatic check or a separate reward model.

Verifiable Tasks (Math/Code): The reward function can be a simple script. If Response B gives the correct answer (17) and follows formatting rules, it gets a high score (e.g., 0.7). If Response C hallucinates an incorrect answer (25), it gets a low score (e.g., 0.1).
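A "simple script" reward for the arithmetic prompt could look like the sketch below. The `Answer: <number>` output format, the 0.2 / 0.8 split between formatting and correctness, and the four sample responses are all assumptions chosen for illustration; they are not the scores quoted above or any particular platform's scoring rule.

```python
import re

def math_reward(response: str, expected: int = 17) -> float:
    """Hypothetical reward: 0.2 for following the format, plus 0.8 if correct."""
    score = 0.0
    # Assumed format rule: the response must end with "Answer: <integer>".
    match = re.search(r"Answer:\s*(-?\d+)\s*$", response.strip())
    if match:
        score += 0.2
        if int(match.group(1)) == expected:
            score += 0.8
    return score

# Made-up responses to "What is 2 + 3 x 5?"
responses = {
    "A": "Answer: 17",
    "B": "3 x 5 = 15, then 2 + 15 = 17. Answer: 17",
    "C": "2 + 3 = 5, then 5 x 5 = 25. Answer: 25",
    "D": "Multiplication comes first, so the result is 17. Answer: 17",
}
scores = {name: math_reward(text) for name, text in responses.items()}
print(scores)
```

Splitting the score into a format part and a correctness part is one common design choice: a well-formatted but wrong answer still earns a little, which gives the model a signal to learn the output format early.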

Step 3: Compute Relative Rewards

GRPO normalizes the scores within the group. It calculates the mean and standard deviation of the scores only for that specific group of responses, and then computes a relative advantage for each response.

The Math: For response $i$ with raw score $r_i$, the advantage is $A_i = (r_i - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the group's scores. If a response scores significantly better than the group average, it gets a highly positive advantage. If it scores worse than the group average, it gets a negative advantage.
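The normalization step is short enough to sketch directly. The raw scores below are made-up values for Responses A to D, and two details are assumptions: the population standard deviation is used, and a small `eps` is added so a group with identical scores does not divide by zero.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize raw scores within one group: (score - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 1.0, 0.2, 1.0]   # hypothetical raw scores for A, B, C, D
advantages = group_advantages(rewards)
print(advantages)
```

Because the baseline is just the group mean, no separate value model is needed to decide what counts as "better than expected" for this prompt.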

Step 4: Update the LLM Weights

Finally, the algorithm updates the LLM's weights using these relative advantage scores:

Reinforce Success: The tokens and reasoning steps that led to above-average responses (Responses A, B, and D in the diagram) are reinforced so the model repeats them in the future.
Suppress Failure: The paths that led to below-average responses (Response C) are suppressed.

To prevent the model from drifting too far from its original capabilities during training, a KL-Divergence penalty is also applied, ensuring it updates its weights safely without degrading its language coherence.
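A rough sketch of how the advantages and the KL penalty can combine into a single loss is shown below. It is a simplification under stated assumptions: it works on plain lists of per-token log-probabilities, it omits the probability-ratio clipping that full GRPO implementations use, the KL term uses one common non-negative estimator, and `beta=0.04` is only an illustrative weight.

```python
import math

def kl_estimate(policy_logp, ref_logp):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = p_ref / p_policy."""
    log_ratio = ref_logp - policy_logp
    return math.exp(log_ratio) - log_ratio - 1.0

def grpo_loss(policy_logps, ref_logps, advantages, beta=0.04):
    """Simplified group loss (no clipping): lower is better.

    policy_logps[i][t] is the current policy's log-prob of token t in response i;
    ref_logps has the same shape for the frozen reference model.
    """
    per_response = []
    for logps, refs, adv in zip(policy_logps, ref_logps, advantages):
        # Reward tokens of high-advantage responses, penalize drift from the reference.
        terms = [adv * lp - beta * kl_estimate(lp, rp)
                 for lp, rp in zip(logps, refs)]
        per_response.append(sum(terms) / len(terms))
    return -sum(per_response) / len(per_response)

# Hypothetical group of two responses, two tokens each.
policy_logps = [[-1.0, -1.0], [-2.0, -2.0]]
ref_logps = [[-1.0, -1.0], [-2.0, -2.0]]   # identical to the policy, so KL is zero
advantages = [1.0, -1.0]
loss = grpo_loss(policy_logps, ref_logps, advantages)
print(loss)
```

Minimizing this loss raises the log-probability of tokens in positive-advantage responses and lowers it for negative-advantage ones, while the KL term grows as the policy moves away from the reference model.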

2) PPO (Proximal Policy Optimization)

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm widely used in Reinforcement Learning from Human Feedback (RLHF) to align Large Language Models with human preferences and safety guidelines and to improve their reasoning capabilities.

How PPO Works (Step-by-Step)

The architecture of PPO relies on balancing two distinct networks to update the LLM safely:

Generation and Value Prediction

When a prompt is fed into the system, two models process it simultaneously:

The Actor (The LLM): This is the actual language model you are trying to train. It generates a response (e.g., Response 1, 2, 3, 4 in the diagram).
The Critic (The Value Model): This is a separate network trained to look at the prompt and predict the "expected token-level value" ($V(t)$). It essentially guesses how difficult the prompt is and what kind of reward the model should expect to get.

Scoring and Referencing

The generated responses are then evaluated on two fronts:

The Reward Model: A separate classification model that assigns a quality/safety score ($R$) to the text.
The Reference Model: A frozen, unaligned copy of the original base LLM. PPO calculates the KL-Divergence between the Actor's and the Reference Model's token distributions. If the Actor starts altering its language style too drastically to chase high rewards, it is heavily penalized.
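One common way to combine the two signals is to subtract a per-token KL penalty from the reward and add the Reward Model's score on the final token. The sketch below assumes that formulation; the log-probabilities, the score of 1.0, and `beta=0.1` are made-up values, and `shaped_rewards` is a hypothetical helper name.

```python
def shaped_rewards(actor_logps, ref_logps, reward_score, beta=0.1):
    """Per-token rewards: KL penalty on every token, Reward Model score on the last."""
    # Penalize tokens the Actor makes more likely than the Reference Model does.
    rewards = [-beta * (a - r) for a, r in zip(actor_logps, ref_logps)]
    rewards[-1] += reward_score
    return rewards

# Hypothetical 3-token response: the Actor drifts from the reference on tokens 2 and 3.
actor_logps = [-0.5, -0.2, -0.1]
ref_logps = [-0.5, -1.2, -2.1]
rewards = shaped_rewards(actor_logps, ref_logps, reward_score=1.0)
print(rewards)
```

With these numbers the drift costs about 0.3 of the 1.0 score, so a response only comes out ahead if its reward gain outweighs how far it strays from the reference.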

Calculating the Advantage

PPO evaluates the quality of the response by computing the Advantage:

$\text{Advantage} = R - V$

where $R$ is the actual reward and $V$ is the Critic's predicted value. If the reward is higher than what the Critic predicted, the Advantage is positive, and the model is incentivized to repeat those tokens. If the reward is worse than predicted, the Advantage is negative, and the behavior is suppressed.

Dual Network Weight Updates

Finally, PPO performs two distinct updates:

Actor Update (Policy Loss): The LLM updates its weights to maximize positive advantages. To ensure training doesn't break, a Clip Ratio limits how much the model's token probabilities can change in a single training step.
Critic Update (Value Loss): The Critic updates its weights via Mean Squared Error (MSE) to make more accurate future value predictions based on the real rewards received.
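The advantage, the clipped policy loss, and the value loss can be sketched for a single token as follows. This is a simplified illustration: real PPO implementations average over many tokens and usually estimate advantages with GAE instead of a plain reward-minus-value difference, and the probabilities, reward, value, and `clip_ratio=0.2` below are assumed example values.

```python
import math

def ppo_losses(new_logp, old_logp, reward, value, clip_ratio=0.2):
    """Single-token sketch of PPO's two losses (both are minimized)."""
    advantage = reward - value                      # Advantage = R - V
    ratio = math.exp(new_logp - old_logp)           # how much the token probability moved
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Take the more pessimistic of the clipped and unclipped objectives.
    policy_loss = -min(ratio * advantage, clipped * advantage)
    value_loss = (value - reward) ** 2              # squared error for the Critic
    return advantage, policy_loss, value_loss

# Hypothetical step: token probability rose from 0.4 to 0.6, reward 1.0, Critic predicted 0.4.
advantage, policy_loss, value_loss = ppo_losses(
    new_logp=math.log(0.6), old_logp=math.log(0.4), reward=1.0, value=0.4
)
print(advantage, policy_loss, value_loss)
```

Here the probability ratio is 1.5, but the clip caps its contribution at 1.2, so the Actor gets no extra credit for moving further than the clip range in one step.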

Configuring an RFT Job

Hyperparameters:

samplesPerPrompt: 16: Generates 16 different candidate answers for every single prompt. The algorithm compares them against each other to learn what makes a good vs. bad response.
thinkingLevel: "HIGH": Forces the model to allocate a large compute budget to output an internal Chain-of-Thought (reasoning path) before giving its final answer.
epochCount: 15: The number of times the training process loops through your entire dataset.
batchSize: 32: The model updates its weights after evaluating 32 prompts at a time (which means processing $32 \times 16 = 512$ total generations per batch).
learningRateMultiplier: 1.0: Uses the default step-size for adjusting model weights based on rewards.
adapterSize: "ADAPTER_SIZE_SIXTEEN": Sets the rank ($r=16$) for the LoRA adapter. A larger rank allows the model to learn more complex behaviors but uses more memory.
maxOutputTokens: 32768: The maximum token length allowed per response. A high limit is required so the model doesn't get cut off while "thinking."
evaluateInterval: 5: Evaluates model performance against your validation dataset every 5 epochs to check for overfitting.
checkpointInterval: 5: Saves a permanent backup snapshot of the model weights every 5 epochs.
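As a quick sanity check on what these settings imply, the values above can be collected in a dictionary and the derived quantities computed. The dictionary is only an illustration that reuses the field names from the list; it is not a verified request payload for any tuning API, and the assumption that evaluation runs at every epoch divisible by the interval is a guess.

```python
# Hyperparameters copied from the list above (illustrative container only).
rft_hyperparameters = {
    "samplesPerPrompt": 16,
    "thinkingLevel": "HIGH",
    "epochCount": 15,
    "batchSize": 32,
    "learningRateMultiplier": 1.0,
    "adapterSize": "ADAPTER_SIZE_SIXTEEN",
    "maxOutputTokens": 32768,
    "evaluateInterval": 5,
    "checkpointInterval": 5,
}

hp = rft_hyperparameters

# Generations produced before each weight update: prompts per batch x samples per prompt.
generations_per_batch = hp["batchSize"] * hp["samplesPerPrompt"]

# Upper bound on output tokens per batch if every generation hits the limit.
worst_case_tokens_per_batch = generations_per_batch * hp["maxOutputTokens"]

# Assumed evaluation schedule: every epoch divisible by evaluateInterval.
eval_epochs = [e for e in range(1, hp["epochCount"] + 1)
               if e % hp["evaluateInterval"] == 0]

print(generations_per_batch, worst_case_tokens_per_batch, eval_epochs)
```

The worst-case token count is worth estimating before launching a job, since samplesPerPrompt and maxOutputTokens multiply together and dominate the cost of each batch.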

The Benefits of GRPO over PPO

While PPO has historically been the gold standard for LLM alignment, Group Relative Policy Optimization (GRPO) introduces structural optimizations that address PPO's biggest flaws.

Decision Tree

Which Method to Pick

How to Align Model Behaviour the Way Humans Like?