Berkeley Campanile Image by jngai58

Notes on Advanced Large Language Model Agents, Spring 2025, an online class that picks up where the Large Language Model Agents MOOC, Fall 2024, left off. In the Berkeley catalog, it’s CS294/194-280.

Advanced LLM Agents MOOC

The Advanced LLM Agents course surveys twelve topics in current research, focusing on reasoning, tool use, and post-training and inference-time techniques, with applications spanning coding, mathematics, web interaction, and scientific discovery. Going beyond text completion, these models operate in workflows - planning, exploring, and evaluating iteratively.

Mid- and post-training steps are evolving into staged recipes that combine fine-tuning on curated reasoning traces with techniques like DPO and GRPO. Tasks with verifiable outputs — especially in code and math — serve as grounded reward signals for reinforcement learning, augmenting human feedback.

At inference time, models are augmented with retrieval, memory systems, and tool integration. Reasoning strategies such as chain-of-thought prompting and tree-based search allow models to decompose problems, explore solution spaces and self-correct.

Finally, the course underscores the convergence of LLMs with formal reasoning and scientific discovery. Systems like Lean, a mathematical programming language and theorem prover, provide rigor, while LLMs supply intuition and abstraction - a powerful combination.

Yet, as model capabilities increase, so does the potential for exploitation. The final lecture promotes security principles for safely deploying increasingly powerful agentic systems.

Instructors

Topics

  • Inference-time techniques for reasoning
  • Post-training methods for reasoning
  • Search and planning
  • Agentic workflow, tool use, and function calling
  • LLMs for code generation and verification
  • LLMs for mathematics: data curation, continual pretraining, and finetuning
  • LLM agents for theorem proving and autoformalization

Reading

Class Resources

Lecture 1: Inference-Time Techniques for LLM Reasoning

Xinyun Chen Google DeepMind

Large language model agents

Solving real-world tasks typically involves a trial-and-error process. Leveraging external tools and retrieving from external knowledge expand an LLM’s capabilities. Agentic workflows facilitate complex tasks.

  • Task decomposition
  • Allocation of subtasks to specialized modules
  • Division of labor for collaboration

“Let’s think step by step”

In chain-of-thought prompting, we allow the model to adapt the amount of computation to the difficulty of the problem. For complex questions, the model can use more reasoning steps. Models can be trained or instructed to use reasoning strategies like decomposition, planning, analogies, etc.
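A minimal sketch of zero-shot chain-of-thought prompting, just to make the idea concrete; `call_llm` is a hypothetical completion function (prompt in, text out), not any particular API.

```python
def build_cot_prompt(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase so the model spends more
    tokens reasoning before committing to an answer."""
    return f"Q: {question}\nA: Let's think step by step."


def answer_with_cot(question: str, call_llm) -> str:
    """call_llm is a hypothetical completion function: str -> str."""
    reasoning = call_llm(build_cot_prompt(question))
    # Second pass: ask the model to distill a final answer from its own reasoning.
    return call_llm(f"{build_cot_prompt(question)}\n{reasoning}\n"
                    f"Therefore, the final answer is:")
```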

Models can automate prompt design and optimize prompts. They can generate exemplars, gaining the benefits of few-shot reasoning without the human effort of writing examples.

Explore multiple branches

We should not limit the LLM to only one solution per problem. Exploring multiple branches allows the LLM to recover from mistakes and to generate multiple candidate solutions or multiple next steps.

Self-consistency

Self-consistency is a simple and general principle in which we ask the model for several responses and select the response with the most consistent final answer. Consistency is highly correlated with accuracy.
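A minimal sketch of self-consistency, assuming a hypothetical `sample_llm(prompt)` that returns one stochastic completion and an `extract_answer(completion)` that parses out the final answer:

```python
from collections import Counter


def self_consistency(prompt: str, sample_llm, extract_answer, n: int = 16) -> str:
    """Sample n chain-of-thought completions and majority-vote on the final answer.

    sample_llm(prompt) -> str          # one sampled completion (temperature > 0)
    extract_answer(completion) -> str  # parse the final answer out of the text
    """
    answers = [extract_answer(sample_llm(prompt)) for _ in range(n)]
    # The most consistent answer tends to be the most accurate one.
    return Counter(answers).most_common(1)[0][0]
```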

Sample-and-rank

Rather than counting, we can instead select the response with the highest log probability. This performs less well than self-consistency, unless the model has been specifically fine-tuned for this purpose.

She showed a nice example of clustering LLM output in the context of code generation.

LLM Code Generations

Tree of thought

Using the LLM to compare or rank candidate solutions, or to prioritize exploration of more promising partial solutions, enables us to (a sketch follows the list):

  • increase token budget for a single solution
  • increase width to explore the solution space
  • increase depth to refine the final solution over many steps
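A hedged sketch of that kind of tree-style search: expand a few candidate next steps per partial solution, score them with the model, and keep only the most promising branches. `propose_steps`, `score_state`, and `is_complete` are hypothetical LLM-backed helpers.

```python
def tree_search(problem: str, propose_steps, score_state, is_complete,
                width: int = 3, depth: int = 4, branch: int = 3) -> str:
    """Breadth-limited search over partial solutions (tree-of-thought style).

    propose_steps(problem, state, k) -> list[str]  # k candidate next steps
    score_state(problem, state) -> float           # LLM-as-judge value estimate
    is_complete(state) -> bool                     # solution / goal test
    """
    frontier = [""]  # each state is the partial solution text so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            if is_complete(state):
                return state
            for step in propose_steps(problem, state, branch):
                candidates.append(state + "\n" + step)
        # Keep only the `width` most promising partial solutions.
        frontier = sorted(candidates, key=lambda s: score_state(problem, s),
                          reverse=True)[:width]
    return max(frontier, key=lambda s: score_state(problem, s))
```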

Lecture 2: Learning to Self-Improve & Reason with LLMs

Jason Weston, Meta & NYU

Goal: An AI that “trains” itself as much as possible

  • Creates new tasks to train on (challenges itself)
  • Evaluates whether it gets them right (“self-rewarding”)
  • Updates itself based on what it understood

RLHF vs DPO

Research question: can this help it become superhuman? Can an LLM improve itself by assigning rewards to its own outputs and optimizing?

Self-rewarding LMs

Observations:

  • LLMs can improve if given good judgements on response quality
  • LLMs can provide good judgements

Train a self-rewarding language model that:

  • Has instruction following capability
  • Has evaluation capability, i.e., given a user instruction and one or more responses, it can judge the quality of those responses, aka LLM-as-judge

…then this model can go through an iterative process of data creation/curation and training on the new data, getting better at both instruction following and evaluation in each cycle.
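A hedged outline of one such cycle, as I understand it; every interface here (`model.generate`, `model.judge`, `dpo_update`) is a hypothetical placeholder, not the paper’s actual code.

```python
def self_rewarding_iteration(model, prompts, dpo_update, n_responses: int = 4):
    """One cycle: generate -> self-judge -> build preference pairs -> update.

    model.generate(prompt) -> str           # instruction-following capability
    model.judge(prompt, response) -> float  # LLM-as-judge score (e.g., 0-5)
    dpo_update(model, pairs) -> model       # preference-tuning step on (prompt, chosen, rejected)
    """
    preference_pairs = []
    for prompt in prompts:
        responses = [model.generate(prompt) for _ in range(n_responses)]
        ranked = sorted(responses, key=lambda r: model.judge(prompt, r))
        # Highest self-reward becomes "chosen", lowest becomes "rejected".
        preference_pairs.append((prompt, ranked[-1], ranked[0]))
    return dpo_update(model, preference_pairs)
```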

Self-rewarding language models

Lecture 3: On Reasoning, Memory, and Planning of Language Agents

Yu Su, Ohio State University

Language agent framework

Memory

Memory is central to human learning. We recognize patterns and associations relevant to the current context. HippoRAG uses a learned associative concept map as an index for non-parametric (in-context) learning. Large embedding models perform at least as well.
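For contrast, here is a minimal dense-retrieval sketch (the plain embedding-model baseline, not HippoRAG’s concept map); `embed` is a hypothetical function returning unit-norm vectors.

```python
import numpy as np


def build_index(passages, embed):
    """embed(text) -> 1-D unit-norm NumPy vector (hypothetical embedding model)."""
    return np.stack([embed(p) for p in passages]), list(passages)


def retrieve(query, index, embed, k: int = 5):
    """Return the k passages most similar to the query by cosine similarity."""
    vectors, passages = index
    scores = vectors @ embed(query)   # dot product == cosine sim for unit-norm vectors
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]
```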

Reasoning

Can LLMs learn compositional and comparative reasoning?

Examples:

  • Barack’s wife is Michelle. Michelle was born in 1964. When was Barack’s wife born?
  • Trump is 78. Biden is 82. Who is younger?

Grokked transformers are implicit reasoners: extended training far beyond overfitting enables reasoning without prompting or fine-tuning.

Planning

Simplified definition: Given a goal G, decide a sequence of actions \((a_0, a_1, \dots, a_n)\) that will lead to a state that passes the goal test \(g(\cdot)\).
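Under this definition, planning is a search for an action sequence that satisfies the goal test. A minimal breadth-first sketch, with hypothetical `successors` and `goal_test` callables and hashable states:

```python
from collections import deque


def plan(start, successors, goal_test, max_depth: int = 20):
    """Breadth-first search for actions (a_0, ..., a_n) reaching a goal state.

    successors(state) -> iterable of (action, next_state)
    goal_test(state) -> bool
    """
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal_test(state):
            return actions
        if len(actions) >= max_depth:
            continue
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                frontier.append((next_state, actions + [action]))
    return None  # no plan found within the depth limit
```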

General trends in planning settings for language agents

  • Increasing expressiveness in goal specification, e.g., in natural language as opposed to formal language
  • Substantially expanded or open-ended action space
  • Increasing difficulty in automated goal test

Language Agents tutorial: https://language-agent-tutorial.github.io/

Lecture 4: Open Training Recipes for Reasoning in Language Models

Hanna Hajishirzi, University of Washington and Allen Institute for AI (Ai2)

It’s critical for research that there be open frontier models with a training process that is transparent and reproducible. Ai2 has produced open pretrained LLMs (OLMo, OLMo2, OLMoE) and an open post-training process called Tulu. Hajishirzi’s talk is on open recipes for training LLMs and reasoning models.

Overview of open recipe for training LLMs and reasoning models

Open training recipe

Data curation

Training sets for proprietary models are kept secret, not least due to use of copyrighted material. Open models require a training set free of legal conflicts.

Datasets graphics

Datasets table

Open post training recipe

Post-training consists of three strategies: instruction fine-tuning, preference tuning (RLHF or RLAIF), and reinforcement learning with verifiable rewards (RLVR).

Open post-training recipe

Preference tuning

Preference tuning takes a base model oriented to document completion and improves its ability to follow instructions, hold a conversation, and perform reasoning tasks.

Proximal policy optimization

Proximal Policy Optimization (PPO; Schulman et al., 2017) first trains a reward model and then uses RL to optimize the policy to maximize those rewards.

\[\max_{\pi_{\theta}} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_{\theta}(y \mid x)} \left[ r_{\phi}(x, y) \right] - \beta \mathbb{D}_{\text{KL}} \left[ \pi_{\theta}(y \mid x) \,||\, \pi_{\text{ref}}(y \mid x) \right]\]

Direct preference optimization

Direct Preference Optimization (DPO; Rafailov et al., 2024) directly optimizes the policy on the preference dataset; no explicit reward model.

\[\mathcal{L}_{\text{DPO}}(\pi_{\theta}; \pi_{\text{ref}}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]\]
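The DPO loss translates almost line-for-line into PyTorch. A hedged sketch, assuming you have already computed the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-response log-probs, each of shape [batch]."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```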

SimPO

SimPO (Meng et al., 2024) drops the reference model, normalizing log-probabilities by response length and adding a target margin \(\gamma\).

\[\mathcal{L}_{\text{SimPO}}(\pi_{\theta}) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \frac{\beta}{|y_w|} \log \pi_{\theta}(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_{\theta}(y_l \mid x) - \gamma \right) \right]\]

GRPO

Group relative policy optimization (GRPO) was introduced in the DeepSeekMath paper and used to train DeepSeek R1. For each question \(q\), GRPO samples a group of outputs \(\{o_1, o_2, \dots, o_G\}\) from the old policy \(\pi_{\theta_{\text{old}}}\) and then optimizes the policy model \(\pi_{\theta}\) by maximizing the following objective:

\[\mathcal{J}_{GRPO}(\theta) = \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \min \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)} A_i, \text{clip} \left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}, 1 - \epsilon, 1 + \epsilon \right) A_i \right) \right] - \beta \mathbb{D}_{\text{KL}} \left( \pi_{\theta} \| \pi_{\text{ref}} \right)\] \[A_i = \left( \frac{r_i - \bar{r}}{\sigma_r} \right)\]

Or, as a loss to be more comparable to the other preference tuning losses:

\[\mathcal{L}_{GRPO}(\theta) = - \mathbb{E} \left[ \frac{1}{G} \sum_{i=1}^{G} \hat{A_i} \right] + \beta \mathbb{D}_{\text{KL}} \left( \pi_{\theta} \| \pi_{\text{ref}} \right)\]

…where \(\hat{A_i}\) is the clipped policy advantage of the ith output in the group.

Each output is assigned a reward \(r_i\), and the rewards are normalized by subtracting the group mean and dividing by the group standard deviation. This group-based reward normalization is more efficient because it eliminates the need for a separate value function, and it appears effective in practice.
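A hedged sketch of the group-normalized advantage and the clipped surrogate for a single group; `new_logps` and `old_logps` are assumed to be per-output summed token log-probabilities, and the KL penalty to the reference model is omitted.

```python
import torch


def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """A_i = (r_i - mean(r)) / std(r), computed within one group of G outputs."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


def grpo_objective(new_logps: torch.Tensor, old_logps: torch.Tensor,
                   rewards: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate for one group of G outputs (maximize; negate for a loss)."""
    advantages = grpo_advantages(rewards)
    ratio = torch.exp(new_logps - old_logps)      # pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return torch.min(ratio * advantages, clipped * advantages).mean()
```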

I’m still working on understanding GRPO, so don’t take this as gospel.

Mid-training

Given the observation that post-training yields larger improvements on more capable base models, how do we go about improving the reasoning capabilities of base models? A mid-training stage can be inserted at the end of pre-training that continues next-token prediction on curated high-quality data, including human-curated reasoning traces and math and coding problems with verifiably correct answers.

Lecture 5: Coding Agents and AI for Vulnerability Detection

Charles Sutton, Google DeepMind

Evaluating coding agents

SWE-bench is a dataset that tests systems’ ability to solve GitHub issues automatically. The dataset collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. Evaluation is performed by unit test verification using post-PR behavior as the reference solution.

SWE-agent lets your language model of choice (e.g. GPT-4o or Claude Sonnet 3.5) autonomously use tools to fix issues in real GitHub repositories, crack cybersecurity challenges, etc.

The ReAct loop

Repeat until timeout, error, or success (a minimal sketch follows the list):

  • LLM generates text given current trajectory
  • Run tools from LLM output
  • Append tool output to trajectory
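A minimal sketch of that loop; `call_llm`, `parse_tool_call`, and `run_tool` are hypothetical helpers, not SWE-agent’s actual interfaces.

```python
def react_loop(task: str, call_llm, parse_tool_call, run_tool,
               max_steps: int = 20) -> str:
    """LLM proposes an action, we run it, and the observation is appended.

    call_llm(trajectory) -> str                    # next thought/action text
    parse_tool_call(text) -> (name, args) or None  # None means a final answer was given
    run_tool(name, args) -> str                    # tool output (may raise on error)
    """
    trajectory = f"Task: {task}\n"
    for _ in range(max_steps):                     # stop on timeout
        output = call_llm(trajectory)
        trajectory += output + "\n"
        call = parse_tool_call(output)
        if call is None:                           # stop on success
            return trajectory
        try:
            observation = run_tool(*call)
        except Exception as err:                   # stop on error
            return trajectory + f"Tool error: {err}\n"
        trajectory += f"Observation: {observation}\n"
    return trajectory
```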

Passerine is Google’s coding agent for bug fixes.

AI for computer security

LLMs playing capture the flag. Agentic techniques seem a particularly natural fit. He lists datasets and a few challenge competitions.

Lecture 6: Web Agents

Ruslan Salakhutdinov, CMU/Meta

Rus spoke on multimodal autonomous AI agents for performing tasks on the web. The agents were trained and evaluated in mock environments, with mock shopping, search, etc.

Tree search benefits accuracy in these contexts, but is slow and suffers from hard-to-undo actions. Synthetic data can also help - use Llama to generate and verify synthetic agentic tasks.

Common vs rare cases

A general principle, maybe: training techniques that work well for common cases, where there are lots of labeled examples, differ from techniques that work well for rare cases.

Examples:

  • Synthetic data can cover whole problem space evenly. Human-curated real world data covers limited parts of the problem space.
  • CBOW vs skip-grams in word2vec: skip-grams seem to be helpful for rare words.

Multistep tasks

AI agents are brittle on multistep tasks due to compounding errors. If they go wrong, they have difficulty recovering. Work is needed on self-correction and recovery.

Common failure modes

  • long horizon reasoning and planning
  • getting stuck looping or oscillating
  • correctly performing tasks then undoing them
  • stopping exploration or execution too early

Attacks

An attack on web agents is to hide instructions in HTML comments, figure captions, etc.

Reinforcement learning framing

POMDP - Partially observable Markov decision process.

\[\mathcal{E} = \{\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}\}\]

…where \(\mathcal{S}\) is the state space, \(\mathcal{A}\) the action space, \(\mathcal{O}\) the observations, and \(\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}'\) the transition function.
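In code, this framing amounts to a small environment interface. A hedged sketch (not any particular benchmark’s API):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Step:
    observation: str  # what the agent sees, e.g., a page's accessibility tree
    reward: float     # task reward, often sparse
    done: bool        # whether the episode has ended


class WebEnv(Protocol):
    """Partially observable: the agent never sees the full state S, only
    observations O produced after each transition T: S x A -> S'."""

    def reset(self) -> str: ...              # initial observation
    def step(self, action: str) -> Step: ...
```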

Lecture 7: Multimodal Agents

Caiming Xiong, Salesforce AI Research

Another talk on multimodal agents for doing computer-based tasks. OSWorld is a virtual machine environment in which agents can perform tasks like coding, data analysis, making slides, manipulating images, and editing documents. Tasks come with evaluation scripts, a process they call execution-based evaluation.

Data Challenges for Agent Training

  • Agent models require expensive human annotation to collect agent trajectory data.
  • This contrasts with LLMs, which leverage existing text corpora.
  • Human annotation is time-consuming and costly, which limits scalability and makes it difficult to collect diverse, large-scale agent trajectory data.

Why not let models synthesize data? What they did is agent trajectory synthesis by guiding replay with web tutorials.

CoTA: Chains-of-Thought-and-Action

Lecture 8: AlphaProof - When RL meets Formal Maths

Thomas Hubert, Google DeepMind

Mathematics is a root node of intelligence. Being good at math requires:

  • Reasoning & Planning
  • Generalisation & Abstraction
  • Knowledge & Creativity
  • Open ended & Unbounded complexity
  • Even an eye for beauty

Formalisation in Mathematics provides:

  • Rigor and clarity
  • Efficiency and Communication
  • Abstraction and Generalisation
  • Unification
  • Created new fields

Lean

Lean is a programming language, theorem prover, and interactive proof assistant. It is a digital programmable formalism for mathematics that aims to be general and unified.

Reinforcement learning

Following in the footsteps of AlphaGo and AlphaFold, they hope to RL their way to superhuman math ability.

“If an agent can learn to master an environment tabula rasa, then you demonstrably have a system that is discovering and learning new knowledge by itself.”

He gives the following recipe for making a system superhuman:

  • Scaled up trial and error
  • Grounded feedback signal
  • Search
  • Curriculum

Lean gives us an environment in which to scale up trial and error with a grounded feedback signal - perfect proof verification.

Given a hard problem, they generate many variants of the problem, some of which will be easier to solve. They use “test-time RL” to iterate from easier variants of the problem towards the original hard problem. Not sure I got this part.

The slides have a couple of very cool demos, which he skipped over. Maybe I can find those online somewhere.

Lecture 9: Language models for autoformalization and theorem proving

Kaiyu Yang, Meta FAIR

LLMs are frequently evaluated on math and coding tasks because they are important examples of reasoning and are relatively easy to evaluate. Math and code are deeply connected. See: Epoch AI’s FrontierMath benchmark.

Training LLMs for math

SFT for math models: How NuminaMath Won the 1st AIMO Progress Prize, Fleureau et al., 2024

  • Supervised finetuning on mathematical data - MathOverflow pages or papers from arXiv.
  • SFT on problems with step-by-step solutions.
  • SFT on problems with tool-integrated solutions.
  • RL on problems with verifiable solutions but no intermediate steps.

See paper: Formal Mathematical Reasoning: A New Frontier in AI

Lean Dojo

Kaiyu Yang was first author on LeanDojo: Theorem Proving with Retrieval-Augmented Language Models (2023), in which they use vector similarity retrieval from a library of lemmas.

Proof search

Autoformalization

Autoformalization bridges informal and formal mathematical expressions. See: AlphaGeometry: An Olympiad-level AI system for geometry and Autoformalizing Euclidean Geometry, in which they translate 48 problems from Book 1 of Euclid’s Elements into Lean.

Challenges in autoformalization:

  • Human written proofs are full of holes - reasoning gaps
  • Evaluation is difficult

These are doable within limited domains, but how do you generalize across domains?

Lecture 10: Bridging Informal and Formal Mathematical Reasoning

Sean Welleck, CMU

Sean introduces Lean-STaR, an RL system that trains LMs to produce mathematical proofs by interleaving informal thoughts with steps of the proof, sampling many continuations of the proof, and getting a reward signal from Lean’s proof verification.

Automated Theorem Prover workflow

Two approaches:

  • Draft-sketch-prove
  • LeanHammer

A hammer theorem prover is a tool that uses external automated theorem provers (ATPs) to help find proofs within a larger proof assistant system.


Lecture 11: Abstraction and Discovery with Large Language Model Agents

Swarat Chaudhuri, UT Austin

Mathematical Discovery with LLM Agents

Challenges with NNs for proof generation

Data scarcity

  • Need traces or reward functions that enable rigorous mathematical reasoning
  • This is difficult beyond high-school or competition settings.

Lack of verifiability

  • Natural-language reasoning is hard to verify
  • In applications like system verification, edge cases are especially critical.

In-context learning for theorem proving

Copra - In-context learning for theorem proving

  • LLMs and in-context learning are powerful tools for mathematical reasoning.
    • RAG over lemmas.
    • Feedback from theorem prover, Coq or Lean.
  • Similar techniques work for program verification.

AI for Scientific Discovery

Scientific progress

Symbolic Regression - LLM Agents for Empirical Discovery

How do we derive a physical law from observed data? Genetic algorithms are often used for symbolic regression. But what if we use an LLM to generate candidate programs, playing the role of the crossover step in evolutionary algorithms?
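A hedged sketch of such an LLM-directed evolutionary loop; `propose_program` (the LLM standing in for crossover/mutation) and `fitness` (e.g., negative fit error on the observed data) are hypothetical.

```python
import random


def llm_evolution(seed_programs, propose_program, fitness,
                  generations: int = 50, population: int = 20) -> str:
    """Evolve candidate programs with an LLM as the variation operator.

    propose_program(parents) -> str  # LLM writes a new candidate given a few parents
    fitness(program) -> float        # higher is better
    """
    pool = list(seed_programs)
    for _ in range(generations):
        parents = random.sample(pool, k=min(2, len(pool)))
        child = propose_program(parents)  # prior world knowledge enters here
        pool.append(child)
        # Truncation selection: keep the best `population` candidates.
        pool = sorted(pool, key=fitness, reverse=True)[:population]
    return max(pool, key=fitness)
```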


  • LLM-directed evolution is a powerful tool for empirical scientific discovery.
  • Frontier LLMs inject prior world knowledge into mutation/crossover operators.
  • LLMs can be used to learn abstract concepts that accelerate evolution.
  • All this can be applied to settings with visual inputs as well.

Concluding notes: Combinations of agentic LLMs with other machinery are a very powerful tool for discovery.

Lecture 12: Towards building safe and secure agentic AI

Dawn Song, UC Berkeley

Attacks on Agentic Systems

  • SQL injection using LLM
  • Remote code execution (RCE) using LLM
  • Direct/Indirect Prompt Injection
  • Backdoor

Prompt Injection Attack Surface

  • Manipulated user input
  • Data poisoning: open datasets, documents on public internet

Prompt injection attack

Decodingtrust.github.io - Comprehensive Assessment of Trustworthiness in GPT Models

Defense Principles

  • Defense-in-depth
  • Least privilege & privilege separation
  • Safe-by-design, secure-by-design, provably secure

Defense Mechanisms

  • Harden models
  • Guardrail for input sanitization
  • Policy enforcement on actions
  • Privilege management
  • Privilege separation
  • Monitoring and detection
  • Information flow tracking
  • Secure-by-design and formal verification