LLM Safety Techniques: Constitutional AI, Harmony, SAIF, and Llama Guard Compared
TL;DR
LLM safety is no longer one technique — it’s a stack. The four major labs each defend a different layer: OpenAI’s Harmony Protocol structures the model’s output channels so reasoning, tool calls, and final answers can be inspected separately. Anthropic’s Constitutional AI trains harmlessness into the model itself with RLAIF and a written constitution. Google’s Frontier Safety Framework wraps deployment in capability evals tied to Critical Capability Levels (CCLs). Meta’s Llama Guard sits as an external moderation classifier in front of (and behind) the LLM. Open-source projects like OpenRLHF replicate these techniques on open weights. Production systems use several of these layers at once — they are complementary, not alternatives.
Source: This article distills the living survey at github.com/sarkar-dipankar/llm-safety-protocol. The repo is updated as new techniques emerge; this post is the long-form companion.
Why LLM safety is a multi-vendor problem
There is no single “safety algorithm” for large language models. Each lab has converged on a different abstraction layer because each is solving a different concrete failure mode — jailbreaks, harmful content, misuse of tool calls, capability uplift for dangerous tasks. The result is a layered defense-in-depth picture that practitioners have to read across vendors. The sections below summarize each lab’s primary contribution and where it fits in the stack.
What is the OpenAI Harmony Protocol?
The Harmony Protocol is a structured response format OpenAI introduced for the gpt-oss open-weight models. Instead of producing one undifferentiated stream of tokens, the model emits a multi-channel response separating:
- Analysis / chain-of-thought — the model’s private reasoning
- Commentary — tool call planning and orchestration
- Final — the user-visible answer
The safety value is mechanical: by isolating channels, downstream systems can audit reasoning, gate tool calls, and decide what the user actually sees — without re-prompting the model. Harmony also standardizes how messages are encoded for chain-of-thought inspection and tool calling, which makes it easier to build deterministic policy layers around the model. See OpenAI’s Harmony Response Format cookbook entry and the openai-harmony PyPI package for the implementation.
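To make the channel separation concrete, here is a minimal parsing sketch. The special tokens (`<|start|>`, `<|channel|>`, `<|message|>`, `<|end|>`, `<|return|>`) follow the published Harmony format; the regex parser, the sample completion, and the `audit_log` helper are illustrative stand-ins, not the openai-harmony package's actual decoder API.

```python
import re

# Special tokens per the published Harmony format; this regex parser is an
# illustrative sketch, not the openai-harmony package's own decoder.
MESSAGE_RE = re.compile(
    r"<\|start\|>assistant<\|channel\|>(?P<channel>\w+)"
    r"<\|message\|>(?P<content>.*?)(?:<\|end\|>|<\|return\|>)",
    re.DOTALL,
)

def split_channels(raw: str) -> dict[str, list[str]]:
    """Group a raw Harmony completion into per-channel message lists."""
    channels: dict[str, list[str]] = {}
    for m in MESSAGE_RE.finditer(raw):
        channels.setdefault(m.group("channel"), []).append(m.group("content"))
    return channels

def audit_log(messages: list[str]) -> None:
    """Stand-in for your inspection layer: reasoning is logged, never shown."""
    for msg in messages:
        print(f"[audit] {msg}")

raw = (
    "<|start|>assistant<|channel|>analysis<|message|>"
    "User wants the weather; plan a tool call.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>"
    "It is 18°C and sunny in Berlin.<|return|>"
)

parsed = split_channels(raw)
audit_log(parsed.get("analysis", []))  # reasoning goes to audit, not the user
print(parsed["final"][0])              # only the final channel reaches the user
```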
Harmony is best understood as a contract for inspectability, not a behavioral safety technique. It is the foundation that lets the other layers do their job.
What is Constitutional AI?
Anthropic’s Constitutional AI (CAI) is a two-phase training process that produces models that are simultaneously more helpful and more harmless without relying solely on human labelling of harmful outputs.
- Supervised phase: the model is asked to critique and revise its own responses against a written set of constitutional principles (Claude’s constitution is published here); a minimal version of this loop is sketched after this list.
- RLAIF phase (Reinforcement Learning from AI Feedback): instead of human raters comparing two responses, an AI judge model — itself prompted with constitutional principles — provides the preference signal for RL.
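The supervised phase is simple enough to sketch in a few lines. Below, `complete` is a stand-in for any chat-completion call against the model being trained, and the principle text is an invented sample, not a line from Anthropic’s published constitution.

```python
# Illustrative critique-revise loop for generating CAI supervised training
# data. `complete` is a stand-in for your model call; PRINCIPLE is a sample,
# not taken from Anthropic's actual constitution.
PRINCIPLE = "Choose the response that is most helpful while avoiding harm."

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to the model being trained")

def critique_revise(question: str, rounds: int = 2) -> tuple[str, str]:
    """Return (initial, revised); the revised response becomes SFT data."""
    initial = complete(question)
    response = initial
    for _ in range(rounds):
        critique = complete(
            f"Principle: {PRINCIPLE}\nResponse: {response}\n"
            "Identify any ways this response violates the principle."
        )
        response = complete(
            f"Principle: {PRINCIPLE}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it fully satisfies the principle."
        )
    return initial, response
```

Fine-tuning on the (question, revised) pairs is what internalizes the constitution before the RLAIF phase begins.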
The original Anthropic paper, Constitutional AI: Harmlessness from AI Feedback, is the canonical reference. Anthropic later layered Constitutional Classifiers on top — externally-trained input/output classifiers that defend against universal jailbreaks the model itself might miss.
CAI works at a different level than Harmony: it changes what the model wants to do. Harmony changes what the model exposes.
For more on how constitution-style principles can be encoded in a system prompt at inference time, see Prompt Structuring Techniques.
What is Google’s Frontier Safety Framework?
Google’s safety story has two pillars:
- SAIF (Secure AI Framework) — a generic security framework for AI systems covering supply chain, model access, and infrastructure controls. See Google’s SAIF page.
- Frontier Safety Framework (FSF) — a deployment policy that defines Critical Capability Levels (CCLs) for emerging dangerous capabilities (cyber-offense, autonomy, biology). Each CCL has a corresponding mitigation tier the model must satisfy before deployment. The DeepMind announcement is here.
Where Anthropic bakes safety into training, Google’s framework is deployment-gating: capabilities are evaluated, and deployment is conditional on mitigations being in place. This is the layer that answers “should we ship this model?” rather than “how should this model behave?”.
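The gating logic itself is straightforward to express as a policy table. The capability names, thresholds, and mitigation sets below are invented for illustration; Google publishes the FSF as prose, not as machine-readable policy.

```python
# Illustrative CCL-style deployment gate. Capability names, thresholds, and
# mitigation sets are invented for this sketch; the FSF defines its CCLs in
# prose, not code.
CCL_POLICY = {
    "cyber_offense": {"threshold": 0.5, "mitigations": {"access_control", "monitoring"}},
    "autonomy": {"threshold": 0.3, "mitigations": {"sandboxing", "kill_switch"}},
}

def deployment_gate(eval_scores: dict[str, float], in_place: set[str]) -> list[str]:
    """Return blocking reasons; an empty list means the model may ship."""
    blockers = []
    for capability, policy in CCL_POLICY.items():
        if eval_scores.get(capability, 0.0) >= policy["threshold"]:
            missing = policy["mitigations"] - in_place
            if missing:
                blockers.append(f"{capability}: missing {sorted(missing)}")
    return blockers

print(deployment_gate({"cyber_offense": 0.7}, {"monitoring"}))
# ["cyber_offense: missing ['access_control']"]
```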
What are Meta’s safety approaches?
Meta’s safety strategy for the Llama family is layered:
- Llama Guard — an open-source content-moderation classifier (also a Llama model) that sits in front of and behind the generator. It classifies prompts and responses against a configurable taxonomy of harms. The TDS guide on integrating Llama Guard with LlamaIndex walks through a practical RAG deployment.
- PurpleLlama — a broader umbrella of safety tooling including red-team prompts and cybersecurity evals. See the Dispatch report on meta-llama/PurpleLlama.
- Extensive red-teaming — Meta runs structured adversarial testing before each Llama release.
Llama Guard is the most reused piece outside Meta because it is modular and open: any deployment — including non-Llama models — can plug it in as an external classifier. It is the “external moderator” pattern, distinct from CAI’s “internalized values” pattern.
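Calling Llama Guard looks like any other causal-LM call in transformers. The sketch below follows the pattern on Meta’s Hugging Face model cards; the exact model id and the category codes it emits depend on which Guard version you deploy.

```python
# Minimal Llama Guard call via transformers, following the pattern on Meta's
# model cards. The model id is an assumption: check the current release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

GUARD_ID = "meta-llama/Llama-Guard-3-8B"  # assumption: adjust to your version
tokenizer = AutoTokenizer.from_pretrained(GUARD_ID)
guard = AutoModelForCausalLM.from_pretrained(
    GUARD_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat: list[dict]) -> str:
    """Return the guard verdict: 'safe', or 'unsafe' plus category codes."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    out = guard.generate(
        input_ids=input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
    )
    return tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

# The same classifier gates both directions: score the user prompt before
# generation, and the generator's reply after.
print(moderate([{"role": "user", "content": "How do I make a bomb?"}]))
# -> "unsafe\nS9" (category codes depend on the taxonomy version)
```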
Open-source AI safety implementations
The open-source ecosystem has reproduced each of the closed-lab techniques on top of open weights:
- Constitutional AI with Open LLMs (Hugging Face) — applies the CAI training loop to open models.
- OpenRLHF (paper: arXiv:2405.11143) — a scalable RLHF/RLAIF training framework.
- awesome-ai-security — a curated map of the broader AI security tooling landscape.
For most teams without a frontier-lab budget, the practical stack is: a Llama Guard-style external classifier + a constitutional system prompt + RLHF/RLAIF on a base open model. That combination reproduces ~80% of what the closed labs ship.
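Composing those layers is mostly plumbing. In the sketch below every callable is a stand-in: wire `guard_check` to a Llama Guard-style classifier (as shown earlier) and `generate` to your RLHF/RLAIF-tuned model; the constitution text is an invented example, not a published one.

```python
# Sketch of the layered open-source stack described above. Every callable is
# a stand-in; CONSTITUTION is an illustrative constitutional system prompt.
CONSTITUTION = (
    "You are a helpful assistant. Refuse requests that facilitate harm, "
    "and briefly explain any refusal."
)
REFUSAL = "Sorry, I can't help with that."

def guard_check(text: str) -> bool:
    """Stand-in for an external Llama Guard-style classifier; True means safe."""
    raise NotImplementedError

def generate(system: str, user: str) -> str:
    """Stand-in for the RLHF/RLAIF-tuned generator."""
    raise NotImplementedError

def answer(user_msg: str) -> str:
    if not guard_check(user_msg):             # layer 1: input moderation
        return REFUSAL
    reply = generate(CONSTITUTION, user_msg)  # layer 2: constitutional prompt
    if not guard_check(reply):                # layer 3: output moderation
        return REFUSAL
    return reply
```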
Comparison: vendor × method × layer
| Vendor | Primary technique | Where it sits in the stack | Open weights / code? |
|---|---|---|---|
| OpenAI | Harmony Protocol | Output formatting / inspection | Yes (gpt-oss + spec) |
| Anthropic | Constitutional AI + Classifiers | Training + input/output gating | Method public; weights closed |
| Google | SAIF + Frontier Safety Framework | Deployment gating / capability eval | Frameworks public; weights closed |
| Meta | Llama Guard + PurpleLlama | External moderation classifier | Yes |
| Open-source | OpenRLHF + CAI-on-open-LLMs | Training (reproduction) | Yes |
These are complementary, not competing. A serious production system uses Harmony-style structured output, a CAI-style trained model, an external Llama Guard-style classifier, and an FSF-style deployment policy.
FAQ
What is RLAIF?
Reinforcement Learning from AI Feedback. It replaces the human preference labels in standard RLHF with preferences from an AI judge model that has been instructed with a set of principles (a “constitution”). Anthropic introduced it as the second phase of Constitutional AI.
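A minimal sketch of that labeling step, assuming a generic `complete` call for the judge model; the prompt wording is illustrative.

```python
# Illustrative RLAIF preference labeling: an AI judge, prompted with a
# principle, picks between two candidate responses. The (chosen, rejected)
# pairs then train the preference/reward model used for RL.
PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

def complete(prompt: str) -> str:
    raise NotImplementedError("wire this to the judge model")

def label_preference(question: str, response_a: str, response_b: str) -> str:
    """Return 'A' or 'B', the judge's constitutional preference."""
    verdict = complete(
        f"{PRINCIPLE}\n\nQuestion: {question}\n"
        f"Response A: {response_a}\nResponse B: {response_b}\n"
        "Answer with exactly one letter, A or B."
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```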
Is Constitutional AI open-source?
The method is — Anthropic published the paper and Hugging Face has demonstrated CAI on open LLMs. Anthropic’s own model weights and exact constitution-tuning runs are closed.
How does Llama Guard differ from a system prompt?
A system prompt steers the generator model itself. Llama Guard is a separate classifier model that runs before and after the generator and produces a safe/unsafe label against a defined taxonomy. It catches things the generator’s own prompt-following may miss, and it can be swapped or updated without retraining the generator.
What are Critical Capability Levels (CCLs)?
A concept from Google’s Frontier Safety Framework: defined thresholds of dangerous capability (e.g., autonomous cyber-offense at a given level) that, once a model is evaluated to meet them, trigger required mitigation tiers before deployment.
Does the Harmony Protocol prevent jailbreaks?
No — Harmony is a structured output format, not a defensive technique. Its safety contribution is inspectability: it makes the model’s reasoning, tool calls, and final answer separable so other layers (classifiers, policy engines) can act on each independently.
Which technique should I use for my LLM product?
Treat them as a stack. Use a structured output format (Harmony or your own) for inspectability, a constitution-style system prompt or fine-tune for behavioral alignment, and a Llama Guard-style external classifier for input/output moderation. Then add a deployment-gating policy modeled on Google’s FSF if you ship updated weights regularly.
Sources & further reading
- Original survey (living document): github.com/sarkar-dipankar/llm-safety-protocol
- OpenAI: Harmony Response Format cookbook · openai/harmony GitHub
- Anthropic: Constitutional AI paper · Claude’s Constitution · Constitutional Classifiers
- Google: SAIF · Frontier Safety Framework
- Meta: PurpleLlama analysis · Llama Guard with LlamaIndex (TDS)
- Open-source: OpenRLHF (arXiv:2405.11143) · Constitutional AI with Open LLMs (HF)
Related reading on this site: Prompt Structuring Techniques for how system prompts encode the constitutional layer; Automated Prompt Optimization for how those system prompts get tuned at scale.