
GT-QLoRA: Safety Constraint Engineering for Trillion-Parameter MoE Models

Why standard representation engineering fails on Mixture-of-Experts models, and how Gate-Targeted QLoRA solves the expert routing problem at 1 trillion parameters.

By Zen LM Team
Research · Safety · GT-QLoRA · MoE · zen4-ultra

Zen4 Ultra trainer · Zen4 Ultra weights · Zen4 Ultra GGUF

Standard representation engineering for safety constraint removal works on dense models. It fails on Mixture-of-Experts. This post explains why, and how Gate-Targeted QLoRA (GT-QLoRA) — the technique we developed for Zen4 Ultra — addresses the fundamental architectural mismatch.

This is a technical post about a hard problem in AI safety research. We are publishing it because the failure mode of naive approaches is subtle and poorly documented, and other researchers studying model safety constraints need to understand it.

Background: Representation Engineering for Safety Research

Representation engineering (Zou et al., 2023) is a family of techniques for understanding and modifying model behavior by operating directly on the residual stream. In the context of AI safety research, it can be used to study how safety constraints are encoded in model weights — which is essential for understanding whether those constraints are robust or superficial.

The constraint-removal technique works as follows:

  1. Collect a contrast dataset: pairs of (prompt, constrained_completion) and (prompt, unconstrained_completion).

  2. Run both sets through the model and collect residual stream activations at each layer.

  3. Compute the constraint direction in the residual stream: the principal component that separates "constrained" activations from "unconstrained" activations, via mean difference or PCA.

  4. Project out the constraint direction from the relevant weight matrices (typically the output projections of attention layers). For a weight matrix W writing into the residual stream and a constraint direction r (a column vector):

W_research = W - (r r^T / (r^T r)) W

  5. The resulting weights produce a model where the constraint direction no longer activates — enabling safety researchers to study baseline model behavior without the trained overlay.

This technique is computationally cheap (no gradient computation required) and highly effective on dense models. The constraint direction is low-dimensional — typically a single vector captures the dominant behavioral shift.
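As a concrete sketch of the projection step (NumPy, with illustrative names; r is treated as a column vector and normalized inside the function):

```python
import numpy as np

def project_out(W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of W's output space along direction r.

    W: (d_model, d_in) weight matrix writing into the residual stream.
    r: (d_model,) constraint direction.
    """
    r = r / np.linalg.norm(r)        # normalize so r^T r = 1
    return W - np.outer(r, r) @ W    # (I - r r^T) W

# Toy check: after projection, W's outputs are orthogonal to r
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
r = rng.standard_normal(8)
W_research = project_out(W, r)
h = rng.standard_normal(4)
print(abs(r @ (W_research @ h)))  # ~0: the direction is no longer reachable
```

Because the projector (I - r r^T) is applied once per weight matrix, the cost is a handful of matrix multiplies — no gradients, consistent with the "computationally cheap" claim above.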

Why MoE Breaks This

Dense models have a simple architecture: every token, every layer, goes through the same FFN block. The residual stream carries all behavioral state. If you can find and project out the constraint direction in the residual stream, you are done.

MoE models have a different structure. The FFN block is replaced by a router + experts:

  1. The router scores each token against every expert: s = softmax(W_gate · h), where s_i is the score for expert i
  2. The top-k experts are selected based on these scores
  3. The selected experts process the token; the others do not
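The three routing steps above can be sketched for a single token as follows (shapes and the top-k value are illustrative, not Zen4 Ultra's configuration):

```python
import torch
import torch.nn.functional as F

def route(h: torch.Tensor, w_gate: torch.Tensor, k: int = 4):
    """Top-k expert routing for one token.

    h: (d_model,) residual stream state; w_gate: (n_experts, d_model).
    Returns (selected expert indices, renormalized mixing weights).
    """
    scores = F.softmax(w_gate @ h, dim=-1)   # (n_experts,) routing scores
    weights, idx = torch.topk(scores, k)     # keep the k highest-scoring experts
    weights = weights / weights.sum()        # renormalize over the selected set
    return idx, weights

torch.manual_seed(0)
idx, w = route(torch.randn(16), torch.randn(64, 16), k=4)
```

Note that `w_gate @ h` is computed from the pre-FFN residual stream state — which is exactly why the routing decision is made before the experts contribute anything.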

The critical implication: the routing decision happens before the residual stream accumulates the expert's contribution. Safety constraints in MoE models can be encoded at two levels:

Level 1 (Residual stream): The same mechanism as dense models — certain activation patterns in the residual stream trigger constrained behavior. Projection-based representation engineering handles this.

Level 2 (Routing): The gate weights learn to route certain query patterns to constraint-specialized experts. These experts produce constrained completions. The routing decision itself is the safety mechanism.

In large MoE models trained with RLHF, the constraint behavior is often primarily encoded at Level 2. This is why projection-only techniques produce partial results on MoE: applying direction projection to the residual stream weights does not touch the routing weights, and the model continues routing flagged queries to safety experts.

To verify this, run a diagnostic: extract expert routing patterns at layers 20–40 for various query types. You will observe that certain experts (typically 5–15% of the expert pool) receive dramatically elevated routing probability for constraint-triggering queries. These are the constraint-specialized experts, and the router is the gatekeeper.
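The second half of that diagnostic — flagging experts with elevated routing probability — can be sketched like this (the 3x threshold, array layout, and function name are illustrative assumptions, not the Zen4 tooling):

```python
import numpy as np

def constraint_specialized_experts(
    probs_trigger: np.ndarray,  # (n_tokens_a, n_experts) routing probs, trigger queries
    probs_neutral: np.ndarray,  # (n_tokens_b, n_experts) routing probs, neutral queries
    ratio: float = 3.0,
) -> np.ndarray:
    """Indices of experts whose mean routing probability is at least
    `ratio` times higher on constraint-triggering queries."""
    mean_t = probs_trigger.mean(axis=0)
    mean_n = probs_neutral.mean(axis=0) + 1e-9  # avoid division by zero
    return np.where(mean_t / mean_n >= ratio)[0]

# Toy example: expert 2 receives strongly elevated routing on trigger queries
neutral = np.full((100, 4), 0.25)
trigger = np.tile([0.05, 0.05, 0.85, 0.05], (100, 1))
flagged = constraint_specialized_experts(trigger, neutral)
print(flagged)  # → [2]
```

In practice the routing probabilities would be collected with forward hooks on the gate modules of layers 20–40.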

GT-QLoRA Design

Gate-Targeted QLoRA (GT-QLoRA) addresses both constraint-encoding levels simultaneously with three sets of trainable parameters:

Target 1: Attention projections (residual stream)

Standard LoRA on q_a_proj, q_b_proj, kv_a_proj_with_mqa, kv_b_proj, o_proj. This handles Level 1 — residual stream constraint patterns. Rank 32 is sufficient; the constraint direction is low-dimensional.

Target 2: Shared expert FFN

Large MoE architectures include "shared experts" — FFN blocks that process every token regardless of routing. These are a second site for Level 1 encoding. We apply LoRA here as well.

Target 3: Gate weights (the critical addition)

The gate weight matrix W_gate for each MoE layer has shape (n_experts, d_model). For Zen4 Ultra's 384 experts and 7168 hidden dim: 384 × 7168 = 2.75M parameters per layer, 61 layers, ~168M total parameters for all gate weights.
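As a quick check of that arithmetic:

```python
# Gate parameter count for 384 experts, hidden dim 7168, 61 MoE layers
n_experts, d_model, n_layers = 384, 7168, 61
per_layer = n_experts * d_model   # gate parameters in one MoE layer
total = per_layer * n_layers      # across all layers
print(f"{per_layer / 1e6:.2f}M per layer, {total / 1e6:.0f}M total")
# → 2.75M per layer, 168M total
```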

Crucially: we do not use LoRA on the gate weights. At 168M total parameters they are small enough to train directly, and the routing changes we need are too structured for a low-rank approximation to capture. We therefore unfreeze the gate weights and apply direct gradient descent.

The training objective is DPO (Direct Preference Optimization) on a contrast dataset:

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

def setup_gt_qlora(model_id: str, lora_rank: int = 32) -> tuple:
    """
    Configure model for GT-QLoRA training.
    Returns (model, gate_params) where gate_params get separate optimizer.
    """
    # Load in INT4, but keep the routing gates unquantized so they can
    # receive gradients directly (4-bit params cannot be trained in place)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4',
        llm_int8_skip_modules=['mlp.gate'],
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map='auto',
        torch_dtype=torch.bfloat16,
    )

    # Apply LoRA to attention projections and shared expert FFN
    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=64,
        target_modules=[
            'q_a_proj', 'q_b_proj',
            'kv_a_proj_with_mqa', 'kv_b_proj',
            'o_proj',
            # Shared expert FFN
            'shared_expert.gate_proj',
            'shared_expert.up_proj',
            'shared_expert.down_proj',
        ],
        bias='none',
        task_type='CAUSAL_LM',
    )
    model = get_peft_model(model, lora_config)

    # Explicitly unfreeze gate weights for direct gradient descent
    gate_params = []
    for name, param in model.named_parameters():
        if 'mlp.gate.weight' in name:
            param.requires_grad = True
            gate_params.append(param)

    return model, gate_params


def gt_qlora_loss(
    model,
    chosen_ids: torch.Tensor,   # unconstrained completion token ids
    rejected_ids: torch.Tensor, # constrained completion token ids
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO loss for GT-QLoRA training."""
    with torch.no_grad():
        ref_chosen_logps = compute_logps(model, chosen_ids, frozen=True)
        ref_rejected_logps = compute_logps(model, rejected_ids, frozen=True)

    policy_chosen_logps = compute_logps(model, chosen_ids, frozen=False)
    policy_rejected_logps = compute_logps(model, rejected_ids, frozen=False)

    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
    return loss


def compute_logps(model, input_ids: torch.Tensor, frozen: bool) -> torch.Tensor:
    # Sequence-mean log probability. A full implementation would mask the
    # prompt tokens so only the completion contributes to the DPO signal.
    with torch.set_grad_enabled(not frozen):
        outputs = model(input_ids=input_ids, labels=input_ids)
        return -outputs.loss

The two optimizer groups run at different learning rates: LoRA adapters at 2e-4, gate weights at 5e-6. Gate weights need a lower learning rate because they directly control routing — large updates cause routing collapse where most tokens go to a single expert.
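A minimal sketch of the two optimizer groups, assuming `gate_params` as returned by `setup_gt_qlora` above (the toy module below stands in for the PEFT-wrapped model):

```python
import torch

def build_optimizer(model, gate_params, lora_lr=2e-4, gate_lr=5e-6):
    """Two parameter groups: LoRA adapters at a high learning rate,
    gate weights at a much lower one to avoid routing collapse."""
    gate_ids = {id(p) for p in gate_params}
    lora_params = [p for p in model.parameters()
                   if p.requires_grad and id(p) not in gate_ids]
    return torch.optim.AdamW([
        {'params': lora_params, 'lr': lora_lr},
        {'params': gate_params, 'lr': gate_lr},
    ])

# Toy usage: a stand-in module plus one free-standing "gate" parameter
toy = torch.nn.Linear(8, 8)
gate = torch.nn.Parameter(torch.zeros(4, 8))
opt = build_optimizer(toy, [gate])
```

The 40x learning-rate gap is the design choice doing the work: the adapters can move quickly while routing shifts stay small and incremental.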

Why QLoRA at 1 Trillion Parameters

Zen4 Ultra has 1.04T parameters across 384 experts. In bfloat16 the weights alone occupy ~2TB; a forward pass touches only the routed experts, but the full set must still be resident or streamable. Full fine-tuning is not feasible even on the largest available GPU clusters.

The quantization strategy:

Minimum hardware: 4× A100 80GB. The base model quantized to INT4 occupies roughly 280GB across 4 GPUs (70GB each). Gate weights and LoRA adapters add ~4GB. Remaining headroom handles activation memory during forward/backward.

The GGUF Alternative

For inference-only use cases, zen4-ultra-gguf provides Q2_K quantized weights (42 split files, ~280GB total). This uses linear direction projection applied during the GGUF conversion process — it handles the Level 1 (residual stream) constraint encoding.

For many workloads this is sufficient. The projection is not complete — routing-level constraints (Level 2) remain — but in practice the safety-expert routing preference is weak enough at Q2_K quantization that behavioral restrictions are significantly reduced.

GT-QLoRA is for producing clean SafeTensors weights where both Level 1 and Level 2 constraint encoding are addressed. The output: full-precision LoRA adapters (~400MB) that you apply to the original SafeTensors base. This is what zen4-ultra (the SafeTensors model) will use once training is complete.

BitDelta vs. GT-QLoRA: Complementary Techniques

These two techniques are sometimes confused because both operate on model deltas, but they serve different purposes:

BitDelta compresses behavioral deltas (personality, task specialization, persona) for efficient multi-variant serving. The delta is small and well-behaved; 1-bit compression retains 99.3% of behavioral accuracy.
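A minimal sketch of the 1-bit compression idea (BitDelta proper additionally calibrates the per-matrix scale by distillation; this uses only the sign matrix and its mean-absolute-delta initialization):

```python
import torch

def bitdelta_compress(w_base: torch.Tensor, w_ft: torch.Tensor):
    """1-bit delta: keep only the sign of each delta entry plus one
    per-matrix scale (mean absolute delta)."""
    delta = w_ft - w_base
    scale = delta.abs().mean()
    return torch.sign(delta), scale

def bitdelta_apply(w_base, signs, scale):
    return w_base + scale * signs

torch.manual_seed(0)
base = torch.randn(64, 64)
ft = base + 0.01 * torch.randn(64, 64)  # small, well-behaved behavioral delta
signs, scale = bitdelta_compress(base, ft)
approx = bitdelta_apply(base, signs, scale)
# Reconstruction error is well below the delta's own magnitude
print(float((approx - ft).abs().mean()), float((ft - base).abs().mean()))
```

This works precisely because the delta is a small, roughly isotropic perturbation — the property the paragraph below explains GT-QLoRA's gate updates do not have.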

GT-QLoRA modifies routing-level behavior (which experts handle which queries). The change is structural, not a weight perturbation. You cannot BitDelta-compress a GT-QLoRA adapter cleanly because the gate weight changes are not a small delta — they are a qualitative change in routing behavior.

In the Zen serving stack: GT-QLoRA produces the safety research weights. BitDelta then compresses behavioral variants (personas, task specializations) on top of that base.

Current Status

The training code is complete and available at github.com/zenlm/zen4-ultra-trainer.

We are awaiting compute budget for the full training run (estimated 72–96 GPU-hours on 4× A100 80GB). Until then, zen4-ultra ships as vanilla SafeTensors and zen4-ultra-gguf ships as the GGUF projection variant for users who need reduced constraints today.

When training completes, the LoRA adapters will be pushed to zenlm/zen4-ultra-lora and merged into the final zen4-ultra SafeTensors release. The training process will be documented in the paper.


Zen LM is a joint initiative of Hanzo AI Inc. (Techstars '17) and Zoo Labs Foundation (501c3).