Quick Reference

Chapter 39 of 40 · Haggai Roitman

Quick Reference

This chapter consolidates key equations, architecture specifications, API references, and failure mode diagnostics for rapid lookup during development and debugging.

28.1 Core RL & Alignment Equations

PPO Clip: L = E[min(rt ˆAt, clip(rt, 1±ϵ) ˆAt)], rt = πθ(at|st)/πold(at|st) (28.1)

DPO: L = −E[log σ(β log πθ(yw|x)

πref(yw|x) −β log πθ(yl|x)

πref(yl|x))] (28.2)

GRPO: ˆAi = (ri −µG)/σG, then PPO clip update (no critic) (28.3)

KTO: L = λw(1 −v(yw)) + λl · v(yl), v = σ(β log(πθ/πref) −z) (28.4)

IPO: L = E[(log(πθ(yw)/πref(yw)) −log(πθ(yl)/πref(yl)) −1/(2β))2] (28.5)

ORPO: L = LSFT(yw) −λ log σ(log(odds(yw)/odds(yl))) (28.6)

GAE: ˆAt = PT−t l=0 (γλ)lδt+l, δt = rt + γV (st+1) −V (st) (28.7)

KL Penalty: Rtotal = rϕ(x, y) −βDKL[πθ(y|x)∥πref(y|x)] (28.8)

RM (Bradley-Terry): L = −E[log σ(rϕ(x, yw) −rϕ(x, yl))] (28.9)

Best-of-N: y∗= arg max yi∼πθ(·|x), i=1..N rϕ(x, yi) (28.10)

28.2 Transformer & Architecture Formulas

Self-Attention: Attn(Q, K, V ) = softmax(QK⊤/ p

dk) · V (28.11)

Multi-Head: MHA(X) = Concat(head1, . . . , headh)W O, headi = Attn(XW Q i , XW K i , XW V i ) (28.12)

RoPE: f(xm, m) = xmeimθj, θj = 10000−2j/d (28.13)

LoRA: W ′ = W0 + (α/r) · BA, B ∈Rd×r, A ∈Rr×k (28.14)

KD (soft targets): LKD = (1−α)LCE(y, ˆy) + α T 2 · KL(pteacher T ∥pstudent T ) (28.15)

Method Formula / Rule Key Param

Greedy yt = arg maxv P(v|y<t) -- Beam search Keep top-B partial sequences by joint probability B = 4-8

Temperature P ′(v) = softmax(logitv/T) T ∈[0.1, 1.5] Top-k Zero out all but top-k logits, renormalize k = 40-100

Top-p (nucleus) Keep smallest set V ′

s.t. P v∈V ′ P(v) ≥p p = 0.9-0.95

Min-p Keep tokens with P(v) ≥ pmin · P(vmax) pmin = 0.05-0.1

Repetition penalty logitv ← logitv/θ if v appeared before θ = 1.1-1.3

28.4 Systems & Parallelism

Formula Value (70B, BF16) Description

Model memory 2P bytes 140 GB (weights only) Adam optimizer 2P × 4 bytes (m + v) 280 GB Full training footprint ∼8P bytes 560 GB (weights + opt + grad) FSDP memory/GPU 8P/NGPUs 70 GB with 8 GPUs Gen arithmetic intensity 2P/2P = 1 FLOP/byte Heavily memory-bound Token rate (gen) HBM_BW /(2P) ∼14 tok/s (A100, batch=1) TP AllReduce / layer 2 × 2 · T−1

T · bsd bytes ∼188 MB (70B, TP=8) PP bubble fraction (P −1)/(P + M −1) P=stages, M=micro-batches MFU observed_toks × 6P / peak_FLOPS Target: > 40%

28.5 GPU Hardware Specs

GPU Memory BW (HBM) BF16 TFLOPS NVLink Notes

Parameter Typical Range Default Notes

β (DPO/KTO) 0.05-0.5 0.1 Higher = more conservative ϵ (PPO clip) 0.1-0.3 0.2 Higher = more aggressive updates γ (GAE discount) 0.99-1.0 1.0 Use 1.0 for episodic tasks λ (GAE) 0.9-0.99 0.95 Lower = more biased, less variance KL coeff (βKL) 0.01-0.2 0.05 Auto-adapt to target KL ≈5-8 LR (RLHF) 1e-7 - 5e-6 5e-7 Much lower than pre-training LR (SFT) 1e-5 - 5e-5 2e-5 Standard fine-tuning range LoRA rank r 8-128 16-64 Higher r = more capacity, more memory LoRA alpha α r - 2r 2r Scaling factor; α/r is the effective scale Temperature (gen) 0.6-1.0 0.7 Lower = less diverse candidates Num generations K 4-64 4-16 For GRPO/Online DPO/Best-of-N Grad clip norm 0.5-2.0 1.0 Prevents gradient explosion

28.7 TRL API Quick Reference

Trainer Method Key Config Data Format

SFTTrainer Supervised FT packing, max_seq_length prompt + completion

RewardTrainer Reward model center_rewards_coefficient prompt + chosen + rejected PPOTrainer PPO init_kl_coef, target_kl, cliprange

prompts (online gen)

DPOTrainer DPO/IPO beta, loss_type="sigmoid"/"ipo" prompt + chosen + rejected

GRPOTrainer GRPO num_generations, beta, use_vllm prompts + reward_fn

OnlineDPOTrainer Online DPO num_generations, reward_model_path prompts (online gen)

KTOTrainer KTO desirable_weight, undesirable_weight prompt + completion + label

ORPOTrainer ORPO beta prompt + chosen + rejected Best-of-N (manual) Best-of-N n_samples prompts (inference)

28.8 RAG Pipeline Formulas

Cosine similarity: sim(q, d) = q · d ∥q∥· ∥d∥ (28.17)

Retrieval: Dk = top-kd∈C sim(embed(q), embed(d)) (28.18)

RAG generation: P(y|q) = PLLM(y | q, Dk) (28.19)

Chunking overlap: stride = chunk_size −overlap (28.20)

Pattern Structure Best For

ReAct Think →Act →Observe → loop General tool-use agents

Plan-and-Execute Plan →Execute steps →Revise Long-horizon, structured tasks

Supervisor Router →specialist agents Multi-domain, clear subtask boundaries Swarm (handoffs) Agent transfers control + context Customer service, escalation flows

Hierarchical Tree of delegating agents Complex decomposition Human-in-the-loop Agent →Approval gate → Continue High-stakes, irreversible actions

28.10 Agent Communication Protocols

Protocol Scope Transport Key Concept

MCP Tool integration stdio / HTTP+SSE Server exposes tools; client discovers & calls A2A Agent-to-agent HTTP + JSON-RPC Tasks with lifecycle (submitted→working→done) OpenAI Function Calling Tool use API payload JSON schema in tools[] array

28.11 Context Window Budget

C ≥ S |{z} system + M |{z} memory/RAG + T |{z} tool defs + H |{z} history + R |{z} reserved output (28.22)

Rule of thumb for 128K context:

• System prompt: 1-4K tokens (fixed)

• Tool definitions: 2-8K (scales with # tools)

• RAG context: 4-16K (top-k chunks)

• History: grows unbounded →summarize/truncate

• Reserved output: 2-8K

28.12 Common Failure Modes & Fixes

Symptom Likely Cause Fix

1. Have paired preferences (chosen + rejected)?

• Noisy labels →IPO

• Memory-constrained, no SFT done yet →ORPO

• Clean data, limited compute →DPO

• DPO plateaus, want exploration →Online DPO

2. Have only binary feedback (thumbs up/down)? →KTO

3. Have verifiable rewards (math/code)? →GRPO

4. Need maximum quality, any cost? →PPO

5. Want training-free improvement? →Best-of-N

28.14 Evaluation Metrics

Metric Range What It Measures

Perplexity [1, ∞) Model's surprise; lower = better language modeling Win Rate (vs. baseline) [0, 1] Fraction of outputs preferred by judge/human BLEU [0, 1] n-gram overlap with reference (precision-focused) ROUGE-L [0, 1] Longest common subsequence with reference Pass@k [0, 1] Probability ≥1 of k code samples passes tests MMLU / GPQA [0, 1] Multi-choice accuracy on knowledge/reasoning benchmarks HumanEval [0, 1] Functional correctness of generated code Faithfulness (RAG) [0, 1] Fraction of claims supported by retrieved context Context Relevancy [0, 1] Fraction of retrieved content relevant to query Answer Relevancy [0, 1] Degree to which answer addresses the question

28.15 Reasoning & Test-Time Scaling

Method Compute Cost Mechanism

Type Storage Use Case

Working memory Context window Current conversation, immediate tool results Episodic memory Vector store Past interactions, user preferences, session history Semantic memory Knowledge graph / embeddings Facts, concepts, domain knowledge

Procedural memory Skill library / code How-to procedures, learned workflows

28.17 MCP Quick Reference

Primitive Direction Side Effects? Purpose

Tools Client →Server Yes Execute actions (create, modify, delete) Resources Client →Server No (read-only) Read data (files, DB records, configs) Prompts Client →Server No Reusable templates for common tasks Sampling Server →Client No Server requests LLM generation from client

Transport: stdio (local subprocess) or HTTP+SSE (remote, streamable).

Discovery: Client calls tools/list, resources/list, prompts/list at connection init.

Tool annotations: readOnlyHint, destructiveHint, idempotentHint, openWorldHint.

28.18 A2A Protocol Quick Reference

Concept Description

Agent Card JSON at /.well-known/agent.json -- name, skills, supported content types Task Unit of work: id, status, artifacts. Lifecycle: submitted → working →completed/failed Message Communication unit within a task (role: user/agent, parts: text/file/data) Artifact Output produced by the agent (structured data, files, generated content) Push Notifications Webhook-based updates for long-running tasks (via tasks/pushNotification/set)

Framework Orchestration Multi-Agent Best For

LangGraph Explicit state graph Conditional routing Production: persistence, HITL, fine control OpenAI Agents SDK Declarative handoffs Handoff-based Simplicity: guardrails, tracing, fast start AutoGen (AG2) Conversation-driven GroupChat Prototyping: code execution, research CrewAI Role-based teams Sequential/parallel Low-code: quick demos, simple pipelines Google ADK Session + events A2A native Enterprise: artifact mgmt, multimodal

28.20 Agentic RL Formulas

t r(τi) t (28.23)

Trajectory GRPO: ˆAi = (R(τi) −µG)/σG, R(τi) = X

Agent reward: R = w1Rtask + w2Refficiency + w3Rsafety, Reff = max(0, 1 −steps/Nmax) (28.24)

t∈agent tokens min(rt ˆAt, clip(rt) ˆAt) (mask env outputs) (28.25)

Masking: L = X

�n−c k �

Pass@k : 1 −

�n k � , n = total samples, c = correct (28.26)

28.21 Agent Security Checklist

Threat Layer Mitigation

Metric Formula / Definition Target

Task Success Rate (TSR) Correct completions / total tasks > 85% (production)

Steps to completion Avg agent actions per successful task Lower = more efficient

Cost per task Total tokens × price/token Budget-dependent Latency (TTFC) Time from request to first useful output < 5s for interactive

Tool call accuracy Correct tool selections / total calls > 90%

Recovery rate Successful retries / initial failures > 60%

Human escalation rate Tasks requiring human / total tasks < 15%

28.23 Key Agentic Benchmarks

Benchmark Domain Metric SOTA (2025)

SWE-bench Verified Software engineering % resolved issues ∼70% WebArena Web browsing Task success rate ∼40% OSWorld Desktop computer use Task success rate ∼25%

GAIA General AI assistant Exact match accuracy ∼75% (L1)

Tau-bench Tool-use reliability Pass rate (5 trials) ∼65% HumanEval / MBPP Code generation Pass@1 > 95%

Chapter 29