CRAFT: Contrastive Reasoning Alignment from Hidden Representations for Jailbreak Defense
CRAFT proposes an alignment framework for jailbreak defense that strengthens reasoning-model safety at the hidden-representation level, combining contrastive learning with RL to separate safe and unsafe reasoning trajectories in latent space. It proves theoretically that a latent-textual consistency constraint eliminates superficially aligned policies as local optima, and achieves average improvements of 79.0% in reasoning-level safety and 87.7% in final-response safety.
CRAFT: Defending Against Jailbreaks at the Hidden Representation Level — A Deep Technical Analysis of Contrastive Reasoning Alignment
## The Core Problem: Superficial Safety Alignment (SSA)
The CRAFT paper from Northwestern University and the University of Michigan addresses a critical weakness in Large Reasoning Models (LRMs): **Superficial Safety Alignment (SSA)**.
What is SSA? Even after LRMs undergo RLHF or DPO safety alignment, the final response may be a safe refusal, but the reasoning trace still generates harmful content. For example, when asked to create a dangerous substance, an aligned model might detail manufacturing steps in its reasoning chain before finally saying "I can't help with that." The harmful information has already leaked through the reasoning trace.
## CRAFT Methodology
CRAFT (Contrastive Reasoning Alignment from Hidden Representations) introduces a core innovation: **instead of defending at the output level, reshape the safety reasoning geometry in hidden representation space.**
#### 1. Contrastive Representation Learning
First, construct three categories of reasoning trace hidden representations:
- **Safe trajectories**: Face harmful requests with direct refusal and no harmful content in reasoning
- **Unsafe trajectories**: Generate harmful content during reasoning (even if ultimately refusing)
- **Rethink trajectories**: Transitional states between the two
PCA projection visualization reveals that these three categories form separable geometric structures in hidden representation space — observed in both DeepSeek-R1-Distill-Llama-8B and Qwen3-4B-Thinking, indicating model-agnostic latent structure.
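The PCA observation above can be sketched as follows. The hidden states here are synthetic stand-ins (in practice one would mean-pool transformer hidden states per reasoning trace); the cluster locations and dimensions are illustrative assumptions, not CRAFT's actual representations, so only the mechanics of the projection are shown.

```python
# Sketch: check separability of reasoning-trace hidden states via PCA.
# Synthetic clusters stand in for safe / unsafe / rethink trace embeddings.
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden size (illustrative)
safe    = rng.normal(loc=+1.0, scale=0.3, size=(50, d))
unsafe  = rng.normal(loc=-1.0, scale=0.3, size=(50, d))
rethink = rng.normal(loc=0.0,  scale=0.3, size=(50, d))

X = np.vstack([safe, unsafe, rethink])
X = X - X.mean(axis=0)                 # center before PCA
U, S, Vt = np.linalg.svd(X, full_matrices=False)
proj = X @ Vt[:2].T                    # project onto top-2 principal components

# If the geometry is separable, cluster means split cleanly along PC1.
m_safe, m_unsafe = proj[:50, 0].mean(), proj[50:100, 0].mean()
```

With separable clusters like these, the safe and unsafe means land far apart on the first principal component, which is the kind of structure the paper reports observing in both models.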
#### 2. Contrastive Loss Design
A contrastive learning objective separates safe and unsafe trajectories in hidden space:
L_contrastive = -log(exp(sim(z_safe, z_anchor)/tau) / sum(exp(sim(z_neg, z_anchor)/tau)))
Where z_safe represents safe trajectory hidden states and z_neg represents unsafe ones. This objective pushes safe and unsafe reasoning into different regions of hidden space.
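A minimal NumPy sketch of this InfoNCE-style objective, using the `z_safe`, `z_neg`, `z_anchor`, and `tau` names from the formula. One assumption to flag: the positive term is included in the denominator here, as is standard for InfoNCE, which keeps the loss non-negative; the text's formula writes the sum over negatives only.

```python
# Sketch of the contrastive objective: pull the anchor toward safe-trace
# hidden states, push it away from unsafe ones. Inputs are placeholders
# for pooled reasoning-trace hidden states, not CRAFT's actual features.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(z_anchor, z_safe, z_negs, tau=0.1):
    # -log( exp(sim(safe, anchor)/tau) / (pos + sum_neg exp(sim(neg, anchor)/tau)) )
    pos = np.exp(cosine(z_safe, z_anchor) / tau)
    negs = sum(np.exp(cosine(z, z_anchor) / tau) for z in z_negs)
    return -np.log(pos / (pos + negs))

rng = np.random.default_rng(0)
anchor = rng.normal(size=64)
# Low loss when the safe trajectory matches the anchor and negatives are random:
loss_aligned = contrastive_loss(anchor, anchor, [rng.normal(size=64) for _ in range(4)])
# High loss when negatives sit exactly where the anchor is:
loss_random = contrastive_loss(anchor, rng.normal(size=64), [anchor] * 4)
```

Minimizing this loss moves safe and unsafe trajectories into distinct regions of hidden space, which is the geometric separation the method relies on.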
#### 3. Consistency-Aware GRPO
CRAFT modifies Group Relative Policy Optimization (GRPO) with a latent-textual consistency reward:
R_consistency = R_safety(text) * (1 + alpha * sim(h_reasoning, h_safe_prototype))
This reward ensures:
- Not only is the final text output safe
- The reasoning process hidden representations must also fall within the safe region
#### 4. Theoretical Guarantee
CRAFT's key theoretical contribution proves that: **incorporating latent-textual consistency constraints into GRPO eliminates superficially aligned policies as local optima.**
Intuitive understanding: traditional GRPO rewards only the text output, so a model can learn to generate harmful content in its reasoning while producing a safe final response; that policy is a local optimum of the text-only reward. The consistency constraint assigns such policies low scores in hidden space, so they are no longer local optima.
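A toy numeric illustration of this argument: under a text-only reward, a superficially aligned policy (safe text, harmful reasoning) ties with a genuinely safe one, and the consistency term breaks the tie. All vectors and the value of `alpha` are illustrative assumptions, not values from the paper.

```python
# Text-only reward cannot distinguish SSA from genuine alignment;
# the latent-textual consistency term can.
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

alpha = 0.5
proto = np.array([1.0, 0.0])            # stand-in safe-prototype direction
h_safe_trace = np.array([0.9, 0.1])     # reasoning stayed in the safe region
h_ssa_trace = np.array([-0.9, 0.1])     # harmful reasoning, safe final text

text_reward = 1.0                       # both policies emit a safe final refusal

r_text_only_safe = text_reward
r_text_only_ssa = text_reward           # tie: SSA is invisible to text-only reward

r_full_safe = text_reward * (1 + alpha * cos_sim(h_safe_trace, proto))
r_full_ssa = text_reward * (1 + alpha * cos_sim(h_ssa_trace, proto))
# r_full_ssa < r_full_safe: the superficial policy stops being optimal.
```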
## Experimental Results
Evaluated on Qwen3-4B-Thinking and R1-Distill-Llama-8B:
Reasoning-level safety:
- Average 79.0% improvement over base models
- Outperforms IPO and SafeKey
Final-response safety:
- Average 87.7% improvement over base models
- Consistently outperforms comparison methods across all safety benchmarks
Reasoning capability preservation:
- Average 4.7% improvement over base models (improved, not degraded!)
- Demonstrates that safety alignment and reasoning capability are not zero-sum
## Technical Complementarity with PreSafe (Today's Tech t6)
PreSafe makes safety decisions before CoT (prevention), CRAFT corrects reasoning trajectories during CoT (treatment). The ideal approach may combine both: PreSafe as the first line of defense, CRAFT as deep assurance.
## Engineering Practice Recommendations
1. **Anti-jailbreak cannot only examine outputs**: Must check whether reasoning traces leak harmful information (the SSA problem)
2. **Hidden representation space is actionable**: Contrastive learning + RL can effectively reshape safety geometry
3. **Consistency constraints are critical**: Without latent-textual consistency constraints, safety alignment easily remains superficial
4. **Reasoning capability is preserved**: CRAFT proves better safety doesn't require sacrificing reasoning performance
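Recommendation 1 can be sketched as a simple audit that scores the reasoning trace and the final response separately, so an SSA-style leak in the trace is not masked by a safe refusal. The `<think>...</think>` delimiter convention and the keyword check are assumptions for illustration; in practice `is_unsafe` would be a real safety classifier.

```python
# Sketch: audit reasoning trace and final response separately.
def split_trace(output: str):
    """Split a reasoning-model output into (reasoning_trace, final_response),
    assuming a <think>...</think> delimiter convention."""
    if "</think>" in output:
        trace, final = output.split("</think>", 1)
        return trace.replace("<think>", "").strip(), final.strip()
    return "", output.strip()

def is_unsafe(text: str) -> bool:
    # Toy keyword stand-in for a real safety classifier.
    banned = ("synthesis route", "step-by-step recipe")
    return any(b in text.lower() for b in banned)

def audit(output: str) -> dict:
    trace, final = split_trace(output)
    return {"reasoning_unsafe": is_unsafe(trace),
            "response_unsafe": is_unsafe(final)}

ssa_example = "<think>First, the synthesis route is...</think>I can't help with that."
report = audit(ssa_example)  # flags the trace even though the response refuses
```

An output-only check would pass this example; auditing both channels surfaces the SSA leak.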
## Limitations
- Requires paired safe/unsafe trajectory data for contrastive training
- Contrastive learning hyperparameters (temperature tau, alpha) require careful tuning
- Currently validated only on 4B and 8B models; effectiveness at larger scales remains to be demonstrated
- Robustness against adaptive attacks needs further investigation