Agents - Part 1

Intro

What are agents?

image source: Tweet from Abhishek Thakur

Let's start with some definitions of agents from different sources.

Agent Definition from LangChain Blog Post - source

An AI agent is a system that uses an LLM to decide the control flow of an application.

Agent Definition from AWS - source

An artificial intelligence (AI) agent is a software program that can interact with its environment, collect data, and use the data to perform self-determined tasks to meet predetermined goals. Humans set goals, but an AI agent independently chooses the best actions it needs to perform to achieve those goals.

Agent Definition from Chip Huyen's Book "AI Engineering" - source

An agent is anything that can perceive its environment and act upon that environment. This means that an agent is characterized by the environment it operates in and the set of actions it can perform.

Agent Definition from Mongo DB Blog Post - source

An AI agent is a computational entity with an awareness of its environment that’s equipped with faculties that enable perception through input, action through tool use, and cognitive abilities through foundation models backed by long-term and short-term memory.

Agent Definition from Anthropic - source

"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:

Workflows are systems where LLMs and tools are orchestrated through predefined code paths.

Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Agent Definition from Hugging Face Blog Post on `smolagents` - source

Any efficient system using AI will need to provide LLMs some kind of access to the real world: for instance the possibility to call a search tool to get external information, or to act on certain programs in order to solve a task. In other words, LLMs should have agency. Agentic programs are the gateway to the outside world for LLMs.

Agents are programs where LLM outputs control the workflow. Note that with this definition, "agent" is not a discrete, 0 or 1 definition: instead, "agency" evolves on a continuous spectrum, as you give more or less power to the LLM on your workflow.

Is it an Agent? Is it Agentic? It's more like a spectrum with a lot of gray area!

image source: Tweet from Andrew Ng

There is a lot of debate and discussion about what exactly is and is not an agent. I think there is a lot of gray area here, and that is something we just have to accept, at least for now. I think Andrew Ng makes some really good points in this tweet. As Andrew points out, rather than engaging in binary debates about whether something qualifies as a "true agent," we should think about systems as existing on a spectrum of agent-like qualities. The adjective "agentic" itself becomes particularly useful here, allowing us to describe systems that incorporate agent-like patterns to different degrees without getting caught in restrictive definitions.

This spectrum-based view is reinforced by Anthropic's recent blog post on agents. They acknowledge that while they draw an architectural distinction between workflows (systems with predefined code paths) and agents (systems with dynamic control), they categorize both under the broader umbrella of "agentic systems." Similarly, we saw from one of our definitions above that "agent" isn't a discrete, 0 or 1 definition, but rather evolves on a continuous spectrum as you give more or less power to the LLM in your system. This aligns with Andrew Ng's observation that there's a gray zone between what clearly is not an agent (prompting a model once) and what clearly is (an autonomous system that plans, uses tools, and executes multiple steps independently).

image source: Blog post from Nathan Lambert on the AI Agent Spectrum

Nathan Lambert also writes about the AI agent spectrum in this blog post. Nathan argues that the simplest system on this spectrum is any tool-use language model, and that the systems on the spectrum increase in complexity from there. I like how Nathan makes the point that the spectrum will continue to evolve and that the definition of an agent will continue to change as the field evolves. Over time, certain technologies will reach milestones where they become definitive examples of AI agents. Therefore, at some point, basic tool use with an LLM may no longer be considered an agent, even though it's the basic starting point on the agentic spectrum.

image source: Tweet from Hamel Husain

Personally, agents and agentic workflows are still so new to me and I have a lot to learn on this topic. I have deployed LLMs in production as well as built some applications where LLMs use function calling (tools) within a conversational chat interface. So I think some of my previous work has fallen somewhere within this AI agentic spectrum, even if it's at one end of the spectrum. I'm going to keep an open mind and avoid getting caught up in debates about categorical definitions. I'll try to avoid the hype and marketing fluff but be on the lookout for innovation and practical applications.

The Tool Calling Loop: A Building Block for Agentic Systems {#sec-tool_calling_loop}

image source: Tweet from Abhishek Thakur

So where do we even start on this spectrum of AI agents? Practically, I think the first step is to start with an LLM equipped with tools. I think this is what Anthropic refers to as "the augmented LLM".

image source: Blog post from Anthropic on Building effective agents

This is the building block: an LLM equipped with tools. I think we need to take it slightly further and make it clear that we need a tool calling loop. The entire process is kicked off by sending a user request to the LLM. The LLM then decides on the initial tool calls to make in the first step. These tool calls can be executed in parallel if they are independent of one another. After the initial tools have been called, the LLM can decide whether to make follow-up tool calls that depend on the results of the previous ones. Implementing this logic within a loop is what I refer to as the "tool calling loop".

I wrote about this tool calling loop a while ago in a previous blog post. Here is an image I created at the time to illustrate the concept.

image source: previous blog post

One could call this tool calling loop "agentic" since the LLM is making decisions on what tool calls to make. Or maybe we just call it an "augmented LLM". It does not really matter. What does matter is that it's simple to implement, it does not require any frameworks, and it can handle quite a few scenarios. It's plain old LLM function calling.

Here is one such implementation of the tool calling loop. It assumes the typical JSON format for the tool calls and uses the OpenAI chat completion API format. I'm using the litellm library to call the OpenAI API since I can easily switch to another model (such as one from Anthropic) and still use the same OpenAI API format. If you have never used litellm before, that is fine! This is my first time using it. I first heard about it when I was reading about smolagents and how it utilizes it. All you need to know is that importing completion from litellm and calling completion(...) works like calling chat.completions.create(...) from the openai library.

In the loop below I also have some "print to console" functionality which uses rich under the hood. I also borrowed this idea when looking through the source code of the smolagents library from Hugging Face. I will talk more about it later on in this post.

import json
from concurrent import futures
from typing import Any, Callable, Dict

from litellm import completion
from utils import (
    console_print_llm_output,
    console_print_step,
    console_print_tool_call_inputs,
    console_print_tool_call_outputs,
    console_print_user_request,
)


def call_tool(tool: Callable, tool_args: Dict) -> Any:
    return tool(**tool_args)


def run_step(messages, tools=None, tools_lookup=None, model="gpt-4o-mini", **kwargs):
    messages = messages.copy()
    response = completion(model=model, messages=messages, tools=tools, **kwargs)
    response_message = response.choices[0].message.model_dump()
    response_message.pop("function_call", None)  # deprecated field in OpenAI API
    tool_calls = response_message.get("tool_calls", [])
    assistant_content = response_message.get("content", "")
    messages.append(response_message)

    if not tool_calls:
        response_message.pop("tool_calls", None)
        return messages

    tools_args_list = [json.loads(t["function"]["arguments"]) for t in tool_calls]
    tools_callables = [tools_lookup[t["function"]["name"]] for t in tool_calls]
    tasks = [(tools_callables[i], tools_args_list[i]) for i in range(len(tool_calls))]
    console_print_tool_call_inputs(assistant_content, tool_calls)
    with futures.ThreadPoolExecutor(max_workers=10) as executor:
        tool_results = list(executor.map(lambda p: call_tool(p[0], p[1]), tasks))
    console_print_tool_call_outputs(tool_calls, tool_results)
    for tool_call, tool_result in zip(tool_calls, tool_results):
        messages.append(
            {
                "tool_call_id": tool_call["id"],
                "role": "tool",
                "content": str(tool_result),
                "name": tool_call["function"]["name"],
            }
        )
    return messages


def llm_with_tools(messages, tools=None, tools_lookup=None, model="gpt-4o-mini", max_steps=10, **kwargs):
    console_print_user_request(messages, model)
    done_calling_tools = False
    for counter in range(max_steps):
        console_print_step(counter)
        messages = run_step(messages, tools, tools_lookup, model=model, **kwargs)
        done_calling_tools = messages[-1]["role"] == "assistant" and messages[-1].get("content") and not messages[-1].get("tool_calls")
        if done_calling_tools:
            break
    console_print_llm_output(messages[-1]["content"])
    return messages
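
One detail in run_step is worth spelling out: executor.map returns results in the order of its inputs, not in the order the calls finish, which is why zipping tool_calls with tool_results lines up correctly. Here is a small offline sketch of that behavior. The slow_tool function and its delays are made up for illustration and are not part of the tools used in this post.

```python
import time
from concurrent import futures


def slow_tool(name: str, delay: float) -> str:
    # Hypothetical stand-in for a real tool; sleeps to simulate latency.
    time.sleep(delay)
    return f"{name} done"


# The first task is the slowest, so it finishes last.
tasks = [("first", 0.2), ("second", 0.05), ("third", 0.1)]

with futures.ThreadPoolExecutor(max_workers=10) as executor:
    # map yields results in input order, regardless of completion order.
    results = list(executor.map(lambda t: slow_tool(*t), tasks))

print(results)  # ['first done', 'second done', 'third done']
```

If the results came back in completion order instead, each tool message could end up paired with the wrong tool_call_id.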

First we will run a single step, without any tools, which is a single LLM call. Note that I return the entire message history in the output.

messages = [{"role": "user", "content": "Hello friend!"}]
run_step(messages)

Some Tools

Before going through an example task, let's show some initial tools. These tools are a list of functions that we can call. We also have a lookup dictionary that maps the tool name to the tool function.

from tools import TOOL_LKP, TOOLS

TOOL_LKP

Let's see how each tool works first.

This first tool executes python code. It's actually running in a Modal Sandbox in a secure cloud container/environment. It's an awesome feature of Modal useful for executing arbitrary code. Let's skip the details for now and come back to it later. For now, just think of it as a way to execute python code and get back the results.

# This tool is a python code execution tool.
# The code is executed in a secure cloud container/environment using Modal.
# The results are returned locally as an object.
TOOL_LKP["execute_python_code"](code="print('Hello World!')")

# We even get the last expression evaluated as a result just like in ipython repl
TOOL_LKP["execute_python_code"](code="import math; x = math.sqrt(4); print(x); y=2; x-y")

The next tool uses duckduckgo-search to search the web.

TOOL_LKP["web_search"](query="What sporting events are happening today?")

[{'title': "Today's Top Sports Scores and Games (All Sports) | FOX Sports",
  'href': 'https://www.foxsports.com/scores',
  'body': "Visit FOXSports.com for today's top sports scores and games. Explore real-time game scores across MLB, NBA, NFL, Soccer, NHL and more."},
 {'title': 'Live Sports On TV Today - TV Guide',
  'href': 'https://www.tvguide.com/sports/live-today/',
  'body': "Here's sports to watch today, Thursday, Jan 23, 2025. ... coaches and celebrities are interviewed and discuss trending topics happening around the world. ... and he interviews various guests about ..."},
 {'title': 'Sports on TV today: Where to watch or stream games - Sports Media Watch',
  'href': 'https://www.sportsmediawatch.com/sports-on-tv-today-games-time-channel/',
  'body': 'See where to watch sports on TV today with this daily, updated guide of games and events on TV and streaming. This site may earn commission on subscriptions purchased via this page. For a full list of sports TV schedules, see this page. Games on TV Today (Thursday, January 23) All times Eastern (ET)'},
 {'title': 'Sports on TV - Channel Guide Magazine',
  'href': 'https://www.channelguidemag.com/sports-on-tv',
  'body': "Here's a list of all the sports airing on TV today. Use the drop-downs below to see what sports are airing on TV over the next week. ... PPL Event 3 San Diego: Semifinals. Soccer."},
 {'title': 'Live Sports on TV Today: Top Games to Watch & Previews - DIRECTV',
  'href': 'https://www.directv.com/insider/sports-on-tonight/',
  'body': 'NBA GAMES ON TODAY. Detroit Pistons at Houston Rockets - 2:00 PM - NBA League Pass The Detroit Pistons (21-21, 50% win, 112.3 avg points for, 113.5 avg points against) go up against the Houston Rockets (28-13, 68% win, 114.2 avg points for, 107.9 avg points against). The Rockets will look to reinforce their position in the league by exploiting their superior scoring and defensive, but the ...'}]

And the next tool visits a web page and converts it to markdown.

print(TOOL_LKP["visit_web_page"](url="https://drchrislevy.github.io/"))

Chris Levy

[Chris Levy](./index.html)

* [About](./index.html)
* [Blog](./blog.html)

 
 

## On this page

* [About Me](#about-me)

# Chris Levy

 
[twitter](https://twitter.com/cleavey1985)
[Github](https://github.com/DrChrisLevy)
[linkedIn](https://www.linkedin.com/in/chris-levy-255210a4/)

**Hello!** I’m Chris Levy. I work in ML/AI and backend Python development.

## About Me

I spent a good amount of time in school where I completed a PhD in applied math back in 2015. After graduating I shifted away from academia and started working in industry. I mostly do backend python development these days, and build ML/AI applications/services. I work across the entire stack from research, to training and evaluating models, to deploying models, and getting in the weeds of the infrastructure and devops pipelines.

Outside of AI/ML stuff, I enjoy spending time with my family and three kids, working out, swimming, cycling, and playing guitar.

![](pic_me.jpeg)

To pass these tools to the LLM, we use the typical JSON format used within the OpenAI API format.

TOOLS

[{'type': 'function',
  'function': {'name': 'execute_python_code',
   'description': 'Run and execute the python code and return the results.',
   'parameters': {'type': 'object',
    'properties': {'code': {'type': 'string',
      'description': 'The python code to execute.'}},
    'required': ['code']}}},
 {'type': 'function',
  'function': {'name': 'web_search',
   'description': 'Search the web for the query and return the results.',
   'parameters': {'type': 'object',
    'properties': {'query': {'type': 'string',
      'description': 'The query to search for.'}},
    'required': ['query']}}},
 {'type': 'function',
  'function': {'name': 'visit_web_page',
   'description': 'Visit the web page and return the results.',
   'parameters': {'type': 'object',
    'properties': {'url': {'type': 'string',
      'description': 'The URL to visit.'}},
    'required': ['url']}}}]
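
To make the relationship between TOOLS and TOOL_LKP concrete, here is a minimal sketch of how a tool definition and its lookup entry fit together. The add_numbers tool and the EXAMPLE_TOOLS and EXAMPLE_TOOL_LKP names are hypothetical; this is not the contents of my tools module, just an illustration of the pattern the loop relies on.

```python
import json
from typing import Any, Callable, Dict, List


def add_numbers(a: float, b: float) -> float:
    """Hypothetical toy tool, not one of the tools used in this post."""
    return a + b


# The JSON schema is what the LLM sees when deciding which tool to call.
EXAMPLE_TOOLS: List[Dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "add_numbers",
            "description": "Add two numbers and return the sum.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number", "description": "The first number."},
                    "b": {"type": "number", "description": "The second number."},
                },
                "required": ["a", "b"],
            },
        },
    }
]

# The lookup maps the schema's "name" to the Python callable.
EXAMPLE_TOOL_LKP: Dict[str, Callable[..., Any]] = {"add_numbers": add_numbers}

# A tool call in the shape the chat completion API returns:
# the arguments arrive as a JSON string, not a dict.
tool_call = {"function": {"name": "add_numbers", "arguments": '{"a": 2, "b": 3}'}}

tool = EXAMPLE_TOOL_LKP[tool_call["function"]["name"]]
result = tool(**json.loads(tool_call["function"]["arguments"]))
print(result)  # 5
```

The key constraint is that the name in the schema and the key in the lookup dictionary must match exactly, since the name is the only link between what the LLM requests and what the code executes.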

Example Task 1

Okay, so let's run the tool calling loop now with the tools defined above to illustrate how it works. Here is a task where we ask some questions about recent NBA events.

task = """
        Recently on Jan 2 2025, Steph Curry made a series of 3 pointers in one game without missing. 
        How many three pointers did he make in total that game?
        How many points did he score in total that game?
        How many combined points did both teams score on that game?
        Of the total points scored by both teams, what percentage was made by Steph Curry?

        One more task. Lebron James also played a game on Jan 2 2025.
        How old is Lebron James and how many points did he score in his game on Jan 2 2025?
        Take his total points scored that game and raise it to the power of 5. What is the result?
    
        """
messages = [
    {
        "role": "system",
        "content": """You are a helpful assistant. Use the supplied tools to assist the user. 
        Always use python to do math. After getting web search results be sure to visit the web page and convert it to markdown. 
        Todays date is 2025-01-03. Remember to give a final answer in your last message answering all of the user's questions.""",
    },
    {
        "role": "user",
        "content": task,
    },
]

This task has a known answer. Here is the ground truth.

example_one_answer = """
Game stats from January 2, 2025:

Steph Curry:
- Made 8 three pointers
- Total points: 30
- Game final score: Warriors 139, 76ers 105 so the total points scored by both teams is 244
- Curry's percentage of total points: 30/244 ~= 12.3%

Lebron James on January 2, 2025:
- Age: 40
- Points scored: 38
- Points scored raised to the power of 5: 38^5 = 79,235,168
"""

Let's also have a simple LLM call to evaluate if a response is correct.

import json


def eval_example_one(input_answer):
    input_msgs = [
        {
            "role": "user",
            "content": f"""
         
Original question:
{messages[-1]["content"]}

Here is the ground truth answer:
{example_one_answer}

Here is the predicted answer from an LLM.
{input_answer}

Given the context of the correct answer and question, did the LLM get everything correct in its predicted answer?
Return True or False. Only return True if the LLM got everything correct
and answered each part of the question correctly. Also give an explanation of why you returned True or False.
Output JSON.

{{
    "correct": True or False,
    "explanation": "explanation of why you returned True or False"
}}
""",
        },
    ]

    return json.loads(run_step(input_msgs, model="gpt-4o", response_format={"type": "json_object"})[-1]["content"])


# Example of incorrect answer
print(eval_example_one("Lebron James is 40 years old and scored 38 points in his game on Jan 2 2025."))

# Example of correct answer
print(
    eval_example_one(
        "Lebron James is 40 years old and scored 38 points in his game on Jan 2 2025. 38 to the power of 5 is 79,235,168.  Steph scored 30, made 8 three pointers without missing. The total points scored by both teams was 244 and Steph scored 12.3 percent of the total points."
    )
)

{'correct': False, 'explanation': "The LLM correctly identified LeBron James's age as 40 and his points scored as 38 on January 2, 2025. However, the LLM did not address or verify the other components of the original question, specifically regarding Steph Curry's performance and game statistics, nor did it calculate the result of raising LeBron's points to the power of 5. Therefore, not all aspects of the original question were answered, and the LLM's response is incomplete, leading to a determination of False."}
{'correct': True, 'explanation': "The LLM provided the same answers as the ground truth for each part of the question. Steph Curry made 8 three pointers without missing and scored a total of 30 points. The combined score for both teams was 244, and Steph Curry's points accounted for approximately 12.3% of the total. LeBron James was 40 years old on January 2, 2025, and scored 38 points in his game on that day. When 38 is raised to the power of 5, the result is 79,235,168. Therefore, the LLM answered every part of the question correctly."}
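
Since the judge compares against the ground truth, it's worth double checking the arithmetic in it. This is just a quick sanity check using the numbers from the ground truth answer above; the variable names are my own.

```python
# Numbers taken from the ground truth answer above.
curry_points = 30
combined_points = 139 + 105  # Warriors 139, 76ers 105
curry_share = 100 * curry_points / combined_points
lebron_points_to_fifth = 38**5

print(combined_points)  # 244
print(round(curry_share, 1))  # 12.3
print(f"{lebron_points_to_fifth:,}")  # 79,235,168
```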

gpt-4o-mini

Okay, let's send this same task to gpt-4o-mini and see how it does.

messages_final = llm_with_tools(messages, model="gpt-4o-mini", tools=TOOLS, tools_lookup=TOOL_LKP)

We can look at all the messages in the final output, which includes all the messages handled by the LLM.

# Commenting out since the output is long from the webpages visited.
# But has all the messages chat history and tool calls in the OpenAI API format.

# messages_final

Let's use our LLM judge to evaluate the final output.

eval_example_one(messages_final[-1]["content"])

{'correct': False,
 'explanation': "The LLM did not get everything correct. While it correctly noted Steph Curry's total three-pointers, total points scored, combined points scored by both teams, and the percentage of total points scored by Curry, it made an error in LeBron James's performance. The LLM stated that LeBron James scored 21 points in his game, but the ground truth indicates he scored 38 points. Consequently, raising 21 (the incorrect point total) to the power of 5 yields an incorrect result of 4,084,101, whereas the correct computation for 38 points raised to the power of 5 should be 79,235,168. Additionally, the age of LeBron James was correctly noted as 40 years old. The miscalculation for LeBron James's points means the LLM did not answer each part of the question correctly."}

claude-3-5-sonnet {#sec-claude-3-5-sonnet-ex1}

Let's send this same task to Anthropic's claude-3-5-sonnet model. That's the beauty of litellm! We can easily switch between models and still use the same familiar OpenAI API format.

messages_final = llm_with_tools(messages, model="claude-3-5-sonnet-20240620", tools=TOOLS, tools_lookup=TOOL_LKP)

eval_example_one(messages_final[-1]["content"])

{'correct': True,
 'explanation': "The LLM correctly provided the number of three-pointers made by Steph Curry, his total points, the combined score of both teams, and the percentage of total points he scored. It also accurately stated LeBron James' age, points scored in his game, and the calculation of his points raised to the power of 5. Therefore, the LLM answered each part of the question correctly."}

deepseek/deepseek-chat

We can also try the same task with "deepseek/deepseek-chat".

messages_final = llm_with_tools(messages, model="deepseek/deepseek-chat", tools=TOOLS, tools_lookup=TOOL_LKP)

eval_example_one(messages_final[-1]["content"])

{'correct': True,
 'explanation': 'The LLM correctly answered all parts of the original question. It provided the number of three-pointers made by Steph Curry, his total points, the combined points scored by both teams, and the percentage of total points scored by Curry. Additionally, for LeBron James, it correctly stated his age, the points he scored, and the result of raising his points to the power of 5. Therefore, the predicted answer matches the ground truth for all aspects of the question.'}

ReAct

One of the main prompting techniques for building agents comes from the paper ReAct: Synergizing Reasoning and Acting in Language Models. It is also the approach smolagents uses in their library, as discussed in their conceptual guide here. I'm sure a lot of other frameworks use this approach, or modified versions of it, as well. You should check out the smolagents library, documentation, and code for more details.

The ReAct prompting framework (short for Reasoning and Acting) is a technique designed to enhance the capabilities of large language model (LLM) agents by enabling them to reason and act iteratively when solving complex tasks. ReAct combines chain-of-thought reasoning with decision making actions, allowing the model to think step by step while simultaneously interacting with the environment to gather necessary information.

The key elements of ReAct are:

Reasoning: The model generates intermediate steps to explain its thought process while solving a problem or addressing a task.

Acting: The model performs actions based on its reasoning, i.e., it calls tools.

Observation: The outputs of actions (tool calls) provide feedback or data to guide the next reasoning step.

Iterative Process: ReAct operates in a loop, where the outputs of reasoning and acting are used to refine the approach, gather additional information, or confirm conclusions until the task is resolved.
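
The four elements above can be sketched offline with a scripted stand-in for the model. Everything here is hypothetical and only meant to show the shape of the reason, act, observe loop: fake_llm replays canned replies instead of calling a real model, the multiply tool is a toy, and the THOUGHT/ACTION JSON keys are one possible convention rather than something the paper prescribes.

```python
import json
from typing import Any, Callable, Dict, List


def multiply(a: float, b: float) -> float:
    # Hypothetical toy tool.
    return a * b


def final_answer(answer: str) -> str:
    # Calling this tool signals that the loop should stop.
    return answer


TOY_TOOLS: Dict[str, Callable[..., Any]] = {"multiply": multiply, "final_answer": final_answer}

# Canned replies standing in for a real LLM (an assumption for this sketch).
SCRIPTED_REPLIES = iter(
    [
        json.dumps(
            {
                "THOUGHT": "I should multiply 6 by 7.",
                "ACTION": {"tool_name": "multiply", "tool_arguments": {"a": 6, "b": 7}},
            }
        ),
        json.dumps(
            {
                "THOUGHT": "The product is 42, so I can answer.",
                "ACTION": {"tool_name": "final_answer", "tool_arguments": {"answer": "6 times 7 is 42."}},
            }
        ),
    ]
)


def fake_llm(messages: List[Dict[str, str]]) -> str:
    # A real implementation would send `messages` to a model here.
    return next(SCRIPTED_REPLIES)


def toy_react_loop(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = fake_llm(messages)  # Reasoning: the reply carries the THOUGHT
        messages.append({"role": "assistant", "content": reply})
        action = json.loads(reply)["ACTION"]
        observation = TOY_TOOLS[action["tool_name"]](**action["tool_arguments"])  # Acting
        if action["tool_name"] == "final_answer":
            return observation
        # Observation: feed the tool result back for the next reasoning step.
        messages.append({"role": "user", "content": f"OBSERVATION: {observation}"})
    return "No final answer within max_steps."


answer = toy_react_loop("What is 6 times 7?")
print(answer)  # 6 times 7 is 42.
```

Treating the final answer as just another tool keeps the loop uniform: every step is parsed and dispatched the same way, and the stop condition is a simple check on the tool name.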

ReAct is somewhat similar to what we saw above in the Tool Calling Loop @sec-tool_calling_loop. Actually, when you compare the output from our first example task in the tool calling loop, you can see that "anthropic/claude-3-5-sonnet" @sec-claude-3-5-sonnet-ex1 is quite verbose in explaining its reasoning while making tool calls. It's already using some sort of chain-of-thought reasoning. However, the OpenAI gpt-4o-mini model does not output much in the way of reasoning.

Let's see if we can implement a simple version of ReAct prompting. The goal here is not to be robust as a framework, but rather to illustrate some of the concepts for educational purposes. I have a system prompt explaining ReAct with some examples, followed by code to run a step and run a loop. It's similar in structure to the tool calling loop. I have simplified things here by assuming only one tool call is made in each step. I have also chosen to use structured JSON output for all the assistant messages using the OpenAI API format. I am using the same tools as before but I have added a final_answer tool call.

import json
from typing import Any, Callable, Dict

from litellm import completion
from tools import TOOL_LKP
from utils import console_print_react_tool_action_inputs, console_print_react_tool_action_outputs, console_print_user_request

REACT_SYSTEM_PROMPT = """
You are a helpful assistant that uses reasoning and actions to solve tasks step by step. 
You have access to the following tools:

[{'type': 'function',
  'function': {'name': 'execute_python_code',
   'description': 'Run and execute the python code and return the results.',
   'parameters': {'type': 'object',
    'properties': {'code': {'type': 'string',
      'description': 'The python code to execute.'}},
    'required': ['code']}}},
 {'type': 'function',
  'function': {'name': 'web_search',
   'description': 'Search the web for the query and return the results.',
   'parameters': {'type': 'object',
    'properties': {'query': {'type': 'string',
      'description': 'The query to search for.'}},
    'required': ['query']}}},
 {'type': 'function',
  'function': {'name': 'visit_web_page',
   'description': 'Visit the web page and return the results.',
   'parameters': {'type': 'object',
    'properties': {'url': {'type': 'string',
      'description': 'The URL to visit.'}},
    'required': ['url']}}},
 {'type': 'function',
  'function': {'name': 'final_answer',
   'description': 'Return the final answer to the task.',
   'parameters': {'type': 'object',
    'properties': {'answer': {'type': 'string',
      'description': 'The final answer to the task.'}},
    'required': ['answer']}}}
]

For each step, you should:

1. Think: Explain your thought process and what you plan to do next
2. Act: Call one of the available tools using the proper JSON format. Only call one tool at a time.
3. Observe: Review the results from the tool call
4. Repeat or Conclude: Either take another step or provide your final answer

YOU MUST ALWAYS RESPOND IN STRUCTURED JSON FORMAT.
The output format must follow this Schema:

{
  "THOUGHT": "Your thought process",
  "ACTION": {
    "tool_name": "The name of the tool to call",
    "tool_arguments": "The arguments to pass to the tool"
  }
}

Here are some examples of how to solve tasks:
Example 1: "What was the average temperature in New York City last week?"

{
  "THOUGHT": "I need to search for NYC weather data from the past week.",
  "ACTION": {
    "tool_name": "web_search",
    "tool_arguments": {
      "query": "NYC weather data December 27-January 2 2025"
    }
  }
}

< wait for tool call to finish >

OBSERVATION: [
    {
        "title": "Historical Weather Data - NYC Central Park",
        "href": "https://weather.example.com/nyc/history",
        "body": "View detailed temperature records for New York City Central Park weather station..."
    }
]

{
  "THOUGHT": "Let me visit the webpage to get the detailed temperature data.",
  "ACTION": {
    "tool_name": "visit_web_page",
    "tool_arguments": {
      "url": "https://weather.example.com/nyc/history"
    }
  }
}

< wait for tool call to finish >

OBSERVATION: # NYC Central Park Weather History
## December 27, 2024 - January 2, 2025

Daily Temperature Records (°F):
- December 27: Average 35°
- December 28: Average 31.5°
- December 29: Average 37.5°
- December 30: Average 35°
- December 31: Average 29°
- January 1: Average 32.5°
- January 2: Average 37°

{
   "THOUGHT": "Now I can calculate the overall average temperature using Python.",
   "ACTION": {
       "tool_name": "execute_python_code",
       "tool_arguments": {
           "code": "temps = [35, 31.5, 37.5, 35, 29, 32.5, 37]\navg_temp = sum(temps) / len(temps)\nprint(f\"Average temperature: {avg_temp:.1f}°F\")"
       }
   }
}

< wait for tool call to finish >

OBSERVATION:
Average temperature: 33.9°F

{
   "THOUGHT": "I have the average temperature and can provide the final answer.",
   "ACTION": {
       "tool_name": "final_answer",
       "tool_arguments": {
           "answer": "Based on verified data from the NYC Central Park weather station, the average temperature in NYC last week was 33.9°F."
       }
   }
}

Example 2: "Find the latest news about AI regulation in the EU and summarize the key points"

{
   "THOUGHT": "I should search for recent news about EU AI regulation.",
   "ACTION": {
       "tool_name": "web_search",
       "tool_arguments": {
           "query": "latest EU AI regulation news 2025"
       }
   }
}

< wait for tool call to finish >

OBSERVATION: [
    {
        "title": "EU finalizes historic AI regulation law",
        "href": "https://example.com/eu-ai-law",
        "body": "The European Union has reached a landmark agreement on comprehensive AI regulations..."
    },
    {
        "title": "What businesses need to know about EU AI Act",
        "href": "https://example.com/eu-ai-business-impact",
        "body": "The new legislation will require companies to..."
    }
]

{
   "THOUGHT": "The first result seems most relevant. I should visit that webpage to get the full details.",
   "ACTION": {
       "tool_name": "visit_web_page",
       "tool_arguments": {
           "url": "https://example.com/eu-ai-law"
       }
   }
}

< wait for tool call to finish >

OBSERVATION: # EU Finalizes Historic AI Regulation Law

The European Union has reached a landmark agreement on comprehensive AI regulations, marking the world's first major legislation on artificial intelligence. The law, known as the AI Act, introduces strict rules for AI systems based on their level of risk.

## Key Points:
1. Risk-based approach to AI regulation
2. Strict rules for high-risk AI applications
3. Ban on certain AI uses like social scoring
...

{
   "THOUGHT": "Now I have detailed information from the webpage and can provide a comprehensive summary.",
   "ACTION": {
       "tool_name": "final_answer",
       "tool_arguments": {
           "answer": "According to recent news, the EU has finalized its AI Act with these key points: [detailed summary from webpage content]..."
       }
   }
}

Example 3: "Calculate the compound interest on $1000 invested for 5 years at 7% annual interest rate"

{
   "THOUGHT": "I need to use Python to calculate compound interest using the formula A = P(1 + r)^t",
   "ACTION": {
       "tool_name": "execute_python_code",
       "tool_arguments": {
           "code": "principal = 1000\nrate = 0.07\ntime = 5\namount = principal * (1 + rate) ** time\ninterest = amount - principal\nprint(f'Total amount: ${amount:.2f}')\nprint(f'Interest earned: ${interest:.2f}')"
       }
   }
}

< wait for tool call to finish >

OBSERVATION:
Total amount: $1402.55
Interest earned: $402.55

{
   "THOUGHT": "I have the calculation results and can provide the final answer.",
   "ACTION": {
       "tool_name": "final_answer",
       "tool_arguments": {
           "answer": "A $1000 investment at 7% annual interest for 5 years would grow to $1402.55, earning $402.55 in compound interest."
       }
   }
}

Example 4: "Find today's most popular movie on Rotten Tomatoes and its audience score"

{
   "THOUGHT": "Let me try to scrape the Rotten Tomatoes homepage using BeautifulSoup.",
   "ACTION": {
       "tool_name": "execute_python_code",
       "tool_arguments": {
           "code": "from bs4 import BeautifulSoup\nimport requests\n\nurl = 'https://www.rottentomatoes.com'\nresponse = requests.get(url)\nsoup = BeautifulSoup(response.text, 'html.parser')"
       }
   }
}

< wait for tool call to finish >

OBSERVATION:
{
    'stdout': '',
    'stderr': "ModuleNotFoundError: No module named 'bs4'",
    'success': False,
    'error': "ModuleNotFoundError: No module named 'bs4'"
}

{
   "THOUGHT": "Ah, we need to install beautifulsoup4 first. Let's install it and then scrape the data.",
   "ACTION": {
       "tool_name": "execute_python_code",
       "tool_arguments": {
           "code": "!pip install beautifulsoup4\nfrom bs4 import BeautifulSoup\nimport requests\n\nurl = 'https://www.rottentomatoes.com'\nresponse = requests.get(url)\nsoup = BeautifulSoup(response.text, 'html.parser')\n\n# Find the trending movie section\ntrending = soup.find('div', class_='trending-movie')\nmovie = trending.find('h2').text.strip()\nscore = trending.find('span', class_='audience-score').text.strip()\n\nprint(f\"Most Popular Movie: {movie}\")\nprint(f\"Audience Score: {score}\")"
       }
   }
}

< wait for tool call to finish >

OBSERVATION:
Collecting beautifulsoup4
  ...
Successfully installed beautifulsoup4-4.12.2
Most Popular Movie: Dune: Part Two
Audience Score: 97%

{
   "THOUGHT": "Great! We successfully installed BeautifulSoup and got the movie data.",
   "ACTION": {
       "tool_name": "final_answer",
       "tool_arguments": {
           "answer": "The most popular movie on Rotten Tomatoes today is 'Dune: Part Two' with an audience score of 97%. After encountering and fixing a missing package error, we were able to successfully scrape this data from the Rotten Tomatoes homepage."
       }
   }
}


Important rules:
1. Always explain your reasoning in the THOUGHT step
2. Use proper JSON format for tool calls in the ACTION step and only call one tool at a time.
3. Only use the available tools (web_search, visit_web_page, execute_python_code, final_answer)
4. Make your final answer using the "final_answer" tool to signal the end of the task
5. Break down complex tasks into smaller steps
6. Use Python code execution for any calculations
7. If a tool call fails, explain why in your next thought and try a different approach
8. Don't make assumptions - verify information when needed
9. Always review tool outputs before proceeding to next steps
10. When searching the web, follow up relevant results with visit_web_page to get detailed information
11. Remember that web_search returns a list of results with titles, URLs, and snippets
12. Remember that visit_web_page returns markdown-formatted content
13. If you encounter an error (website blocked, code syntax error, etc.), explain the error and try an alternative approach
14. Keep track of failed attempts and avoid repeating the same unsuccessful approach

Remember: Today's date is 2025-01-03."""


def final_answer(answer):
    return answer


TOOL_LKP["final_answer"] = final_answer


def call_tool(tool: Callable, tool_args: Dict) -> Any:
    return tool(**tool_args)


def run_step(messages, model="gpt-4o-mini", **kwargs):
    messages = messages.copy()
    response = completion(model=model, messages=messages, response_format={"type": "json_object"}, **kwargs)
    response_message = response.choices[0].message.model_dump()
    messages.append(response_message)
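    # content can be None or empty; fall back to an empty JSON object so the retry branch below runs
    assistant_json = json.loads(response_message.get("content") or "{}")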
    if "ACTION" in assistant_json:
        console_print_react_tool_action_inputs(assistant_json)
        tool_name = assistant_json["ACTION"]["tool_name"]
        tool_result = call_tool(TOOL_LKP[tool_name], assistant_json["ACTION"]["tool_arguments"])
        console_print_react_tool_action_outputs(tool_name, tool_result)
        if tool_name == "final_answer":
            return messages
        else:
            messages.append(
                {
                    "role": "user",
                    "content": "OBSERVATION:\n" + str(tool_result),
                }
            )
    else:
        messages.append(
            {
                "role": "user",
                "content": 'Remember to always respond in structured JSON format with the fields "THOUGHT" and "ACTION". Please try again.',
            }
        )
    return messages


def react_loop(task: str, model="gpt-4o-mini", max_steps=10, **kwargs):
    messages = [
        {"role": "system", "content": REACT_SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    console_print_user_request(messages, model)
    done_calling_tools = False
    for counter in range(max_steps):
        # Guard against a None content field before the substring check
        done_calling_tools = messages[-1]["role"] == "assistant" and "final_answer" in (messages[-1].get("content") or "")
        if done_calling_tools:
            break
        messages = run_step(messages, model=model, **kwargs)
    return messages

Example Task 1

We will attempt the same task as before, this time using the ReAct prompting technique with the same model.

gpt-4o-mini

from react import react_loop

messages_final = react_loop(task)

eval_example_one(messages_final[-1]["content"])

{'correct': False,
 'explanation': "The LLM incorrectly stated LeBron James's age. According to the ground truth, LeBron was 40 years old on January 2, 2025, not 41. All other aspects of the LLM's answer, including Steph Curry's performance and LeBron's points scored and calculation, were accurate. However, since the age was incorrect, the LLM did not get everything correct in its predicted answer."}

Coding Action Agent

We have just utilized the standard JSON tool calling approach. This is a common approach used by the LLM APIs from OpenAI, Anthropic, Google, etc. The actions are tool calls consisting of JSON objects which state the function and arguments to use. Another approach is getting the LLMs to call the tools within code. I had heard of this before but read more about it in the smolagents blog post. One good paper on this topic is Executable Code Actions Elicit Better LLM Agents. Here is an image from the paper illustrating the differences between the JSON tool approach and code approach:

Figure from the CodeAct Paper

Instead of generating static JSON objects to represent tool calls, the code approach allows LLMs to write and execute Python code. This makes tool interactions more dynamic and adaptable, as the LLM can handle logic, conditionals, and iterations directly within the generated code. This flexibility enhances how LLMs can interact with complex tasks and environments.
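To make the contrast concrete, here is a minimal sketch comparing the two action styles. The `get_price` tool and its price table are hypothetical stand-ins, and the plain `exec` call is only for illustration; it is not how the agent in this post runs code, and it should never be used on untrusted code outside a sandbox.

```python
import json

# Hypothetical stand-in tool; a real agent would call an API here.
PRICES = {"apple": 1.25, "banana": 0.5, "cherry": 4.0}


def get_price(item: str) -> float:
    return PRICES[item]


TOOLS = {"get_price": get_price}

# JSON style: each action names exactly one tool call, so three lookups
# need three separate actions, and any comparison happens in a later step.
json_actions = [
    '{"tool_name": "get_price", "tool_arguments": {"item": "apple"}}',
    '{"tool_name": "get_price", "tool_arguments": {"item": "banana"}}',
    '{"tool_name": "get_price", "tool_arguments": {"item": "cherry"}}',
]
json_observations = []
for raw in json_actions:
    action = json.loads(raw)
    json_observations.append(TOOLS[action["tool_name"]](**action["tool_arguments"]))

# Code style: a single action can loop, compare, and combine results itself.
code_action = """
prices = {item: get_price(item) for item in ["apple", "banana", "cherry"]}
cheapest = min(prices, key=prices.get)
total = sum(prices.values())
"""
namespace = dict(TOOLS)  # expose the tools to the generated code
exec(code_action, namespace)  # illustration only; unsafe outside a sandbox

print(json_observations)
print(namespace["cheapest"], namespace["total"])
```

In the code style, the loop and the comparison live inside one action, which is the flexibility the paragraph above describes.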

Since we are relying on the LLM to write much more code, it's even more important to have a proper sandbox environment. Before we get to the code agent implementation, let's first take a detour to see how we can create a sandbox environment for executing arbitrary Python code.

Modal Sandbox Environment - `IPython` REPL

Modal Sandboxes are super cool! I'm still learning about them, but they are a great way to execute arbitrary code in a secure environment. I wanted to build a simple proof of concept IPython REPL within an isolated sandbox environment.

The Modal sandbox implementation creates a secure environment for executing arbitrary Python code while maintaining state between executions. Let's break down how it works:

Custom IPython Shell: We create a persistent IPython shell that runs in a Modal container/sandbox, allowing us to maintain state and execute code interactively. This gives us the familiar IPython REPL experience but in a secure, isolated environment.
Input/Output Communication: I use a simple JSON-based protocol to communicate between the local environment and the Modal container. Code is sent to the container for execution, and results (including stdout, stderr, and the last expression value) are returned in a structured format.
State Persistence: Unlike typical serverless functions that are stateless, this sandbox maintains state between executions when using the same sandbox instance. This means variables and imports persist across multiple code executions.

Modal's sandbox isolates code execution from the host machine, so arbitrary Python code can run without putting the host system at risk. This is particularly useful for AI agents that execute Python code as part of their reasoning process: they get a secure environment while keeping the interactive feel of an IPython REPL.
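The communication pattern can be tried locally before bringing Modal into the picture. The following is a rough sketch of the same idea, one JSON command per line in and one JSON response per line out, using a plain `subprocess` child and `exec` in place of Modal and IPython. It offers none of the isolation a sandbox provides, and the `DRIVER` and `run` names are my own for illustration.

```python
import json
import subprocess
import sys

# Child process: reads one JSON command per line and keeps a shared namespace,
# so variables survive between commands.
DRIVER = r'''
import contextlib, io, json, sys
namespace = {}
for line in sys.stdin:
    code = json.loads(line).get("code", "")
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
        response = {"stdout": buffer.getvalue(), "success": True, "error": None}
    except Exception as exc:
        response = {"stdout": buffer.getvalue(), "success": False, "error": repr(exc)}
    print(json.dumps(response), flush=True)
'''

proc = subprocess.Popen(
    [sys.executable, "-c", DRIVER],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)


def run(code: str) -> dict:
    """Send one command as a JSON line and read back one JSON line."""
    proc.stdin.write(json.dumps({"code": code}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())


first = run("x = 2")
second = run("y = 6; print(x + y)")  # x is still defined from the first call
third = run("print(undefined_name)")  # errors come back as data, not crashes

# Closing stdin ends the child's read loop.
proc.stdin.close()
proc.wait()

print(second)
print(third["success"], third["error"])
```

The Modal version below follows the same shape, with the child process replaced by a sandbox and `exec` replaced by an IPython shell.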

Here is the code for my proof of concept IPython REPL within a Modal sandbox:

import json

import modal

# Create image with IPython installed
image = modal.Image.debian_slim().pip_install("ipython", "pandas")


# Create the driver program that will run in the sandbox
def create_driver_program():
    return """
import json
import sys
import re
from IPython.core.interactiveshell import InteractiveShell
from IPython.utils.io import capture_output

def strip_ansi_codes(text):
    ansi_escape = re.compile(r'\\x1B(?:[@-Z\\\\-_]|\\[[0-?]*[ -/]*[@-~])')
    return ansi_escape.sub('', text)

# Create a persistent IPython shell instance
shell = InteractiveShell()
shell.colors = 'NoColor'  # Disable color output
shell.autoindent = False  # Disable autoindent

# Keep reading commands from stdin
while True:
    try:
        # Read a line of JSON from stdin
        command = json.loads(input())
        code = command.get('code')
        
        if code is None:
            print(json.dumps({"error": "No code provided"}))
            continue
            
        # Execute the code and capture output
        with capture_output() as captured:
            result = shell.run_cell(code)

        # Clean the outputs
        stdout = strip_ansi_codes(captured.stdout)
        stderr = strip_ansi_codes(captured.stderr)
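        # error_in_exec is None for errors raised before execution (e.g. SyntaxError)
        exc = result.error_in_exec or result.error_before_exec
        error = strip_ansi_codes(str(exc)) if not result.success else None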

        # Format the response
        response = {
            "stdout": stdout,
            "stderr": stderr,
            "success": result.success,
            "result": repr(result.result) if result.success else None,
            "error": error
        }
        
        # Send the response
        print(json.dumps(response), flush=True)
        
    except Exception as e:
        print(json.dumps({"error": strip_ansi_codes(str(e))}), flush=True)
"""


def create_sandbox():
    """Creates and returns a Modal sandbox running an IPython shell."""
    app = modal.App.lookup("ipython-sandbox", create_if_missing=True)

    # Create the sandbox with the driver program
    with modal.enable_output():
        sandbox = modal.Sandbox.create("python", "-c", create_driver_program(), image=image, app=app)

    return sandbox


def execute_python_code(code: str, sandbox=None) -> dict:
    created_sandbox = False
    if sandbox is None:
        sandbox = create_sandbox()
        created_sandbox = True
    # Send the code to the sandbox
    sandbox.stdin.write(json.dumps({"code": code}))
    sandbox.stdin.write("\n")
    sandbox.stdin.drain()

    # Get the response
    response = next(iter(sandbox.stdout))
    if created_sandbox:
        sandbox.terminate()
    return json.loads(response)

from python_sandbox import create_sandbox, execute_python_code

One simple use case is to spin up a sandbox, execute some code, and then terminate the sandbox automatically. This is what happens if you don't pass in a sandbox object.

code = """
print('This is a test running within a Modal Sandbox!!!')
x = 2
y = 6
print(x+y)
y-x
"""
execute_python_code(code=code)

Another interesting use case is to create a persistent sandbox and then use it for multiple Python code executions. The state is maintained between executions.

sandbox = create_sandbox()
execute_python_code(code="x=2", sandbox=sandbox)

execute_python_code(code="y=6; print(x+y)", sandbox=sandbox)

execute_python_code(code="y-x", sandbox=sandbox)

code = """
numbers = list(range(1, 6))
squares = [n**2 for n in numbers]
sum_squares = sum(squares)
print(f"Numbers: {numbers}")
print(f"Squares: {squares}")
print(f"Sum of squares: {sum_squares}")
numbers
"""
execute_python_code(code=code, sandbox=sandbox)

{'stdout': 'Numbers: [1, 2, 3, 4, 5]\nSquares: [1, 4, 9, 16, 25]\nSum of squares: 55\nOut[1]: [1, 2, 3, 4, 5]\n',
 'stderr': '',
 'success': True,
 'result': '[1, 2, 3, 4, 5]',
 'error': None}

I can terminate the sandbox when I am done with it.

sandbox.terminate()

Code Agent Implementation

Here is a proof of concept implementation of a code agent. Much like the rest of this post, this is all for educational purposes. I got all my inspiration from the smolagents library. Since their repo is small, it's such a great learning resource! Go check it out if you want something more robust.

I hacked this together, and the system prompt is sort of long. But I hope this gives a good illustration of the basics. It's really just the same things we have already seen.

LLM + system prompt + tools + sandbox python environment + for loop = code agent
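One piece of that loop worth seeing in isolation is how a model reply gets turned into either code to run or a final answer. Below is a small sketch of that step, not the parsing used in this post's implementation. It assumes the reply marks code with a backtick-fenced `python` block closed by `<end_code>` and marks the end of the task with a `FINAL ANSWER:` tag, matching the format the system prompt asks for. The `parse_reply` helper and its return labels are hypothetical.

```python
import re
from typing import Optional, Tuple

# Accept one or more backticks so the pattern tolerates either fence style.
CODE_PATTERN = re.compile(r"`+python\n(.*?)`+<end_code>", re.DOTALL)
FINAL_TAG = "FINAL ANSWER:"


def parse_reply(reply: str) -> Tuple[str, Optional[str]]:
    """Classify a reply as ("code", source), ("final", answer) or ("retry", None)."""
    match = CODE_PATTERN.search(reply)
    if match:
        return "code", match.group(1).strip()
    if FINAL_TAG in reply:
        return "final", reply.split(FINAL_TAG, 1)[1].strip()
    return "retry", None


reply_with_code = (
    "Thought: Check the sum first.\n\n"
    "Code:\n"
    "`python\n"
    "print(2 + 6)\n"
    "`<end_code>"
)
reply_with_answer = "Thought: Done.\n\nFINAL ANSWER:\nThe sum is 8."

print(parse_reply(reply_with_code))
print(parse_reply(reply_with_answer))
print(parse_reply("I am not sure what to do."))
```

A "code" result would be sent to the sandbox and its output fed back as the next observation, a "final" result ends the loop, and a "retry" result is a cue to remind the model of the expected format.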

import re

from litellm import completion
from python_sandbox import create_sandbox, execute_python_code

from utils import (
    console_print_code_agent_assistant_message,
    console_print_code_agent_code_block,
    console_print_code_agent_observation,
    console_print_llm_output,
    console_print_step,
    console_print_user_request,
)

CODING_AGENT_SYSTEM_PROMPT = """
You are an expert Python programmer who solves problems incrementally using a secure IPython REPL environment.
You break down complex tasks into small, verifiable steps, always checking your intermediate results before proceeding.

PROBLEM-SOLVING FORMAT:
You solve tasks through a repeating cycle of three steps:

Thought: Explain your reasoning and what you expect to learn
Code: Write code to solve step by step
Observation: Review the code execution results from the user to inform next steps

This cycle repeats, with each iteration building on previous results, until the task is completed. 
The task is only complete when you have gathered all the information you need to solve the problem.
You then submit your final answer to the user with a "FINAL ANSWER" submission tag.

You do the thinking and generate thoughts.
You write the code.
The user will execute the code and provide you the output/observation to inform your next steps.

ENVIRONMENT CAPABILITIES:
1. Secure Sandbox:
   - Isolated sandbox container for safe arbitrary code execution
   - Persistent state between executions
   - Nothing can go wrong on the host machine. Install any packages you need and run any code you need.
   - Built with Modal and IPython for secure code execution

2. Pre-imported Tools (Feel free to use these tools as needed or create your own from scratch!)
   - web_search(query: str) - Search the web for the given query. Always print the results.
   - visit_web_page(url: str) - Visit and extract content from the given URL. Always print the results.

3. String Formatting Requirements:
   - All print statements must use double backslashes for escape characters
   - Example: print("\\nHello") instead of print("\nHello")
   - This applies to all string literals containing \n, \r, \t etc.
   - This is required to prevent string termination errors in the sandbox

4. Code Execution Response Format:
   {
     'stdout': str,  # Printed output
     'stderr': str,  # Error messages
     'success': bool,  # Execution success
     'result': str,  # Last expression value
     'error': str | None  # Exception details
   }
   
PROBLEM-SOLVING APPROACH:
1. Break down the problem into smallest possible steps
2. Write and test one small piece at a time
3. Verify each intermediate result before continuing
4. Build upon working code incrementally
5. Keep track of what you learn at each step


**IMPORTANT:** 
    - Follow the Code formatting exactly as shown in the examples.
    - After the final code execution and observation, you must provide a "FINAL ANSWER" submission tag.
        This is the only message you do not need to use code for.
        Submit your FINAL ANSWER using markdown and use all the information you have gathered 
        to give a comprehensive answer to the user.

Your reward comes from solving tasks reliably and delighting users with clear, well-validated code developed step-by-step.

EXAMPLES:

Example 1
Task: "Find the average word length in a sentence, excluding punctuation."

Thought: Let's first get our sentence and print it to verify what we're working with.

Code:
`python
text = "Hello, World! This is a test sentence."
print(f"Working with text: {text}")
`<end_code>

Observation: {
    'stdout': 'Working with text: Hello, World! This is a test sentence.\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
 }

Thought: Now let's remove punctuation and verify our cleaning worked.

Code:
`python
import string
cleaned_text = text.translate(str.maketrans("", "", string.punctuation))
print(f"Cleaned text: {cleaned_text}")
`<end_code>

Observation: {
    'stdout': 'Cleaned text: Hello World This is a test sentence\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Let's split into words and verify our word list.

Code:
`python
words = cleaned_text.split()
print(f"Words: {words}")
print(f"Number of words: {len(words)}")
`<end_code>

Observation: {
    'stdout': "Words: ['Hello', 'World', 'This', 'is', 'a', 'test', 'sentence']\nNumber of words: 7\n",
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Now we can calculate word lengths and verify them individually.

Code:
`python
word_lengths = [len(word) for word in words]
print("Length of each word:")
for word, length in zip(words, word_lengths):
    print(f"'{word}': {length} characters")
`<end_code>

Observation: {
    'stdout': "Length of each word:\n'Hello': 5 characters\n'World': 5 characters\n'This': 4 characters\n'is': 2 characters\n'a': 1 characters\n'test': 4 characters\n'sentence': 8 characters\n",
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
    }

Thought: Finally, calculate the average and format it nicely.

Code:
`python
average_length = sum(word_lengths) / len(word_lengths)
print(f"Average word length: {average_length:.2f} characters")
`<end_code>


Observation: {
    'stdout': 'Average word length: 4.14 characters\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: I have all the information I need to solve the user's request. The answer is 4.14 characters.

FINAL ANSWER:
The average word length in a sentence, excluding punctuation, is 4.14 characters.


Example 2
Task: "Analyze a dataset of movie ratings. Calculate average ratings by genre, identify the most popular genres, and create a summary of the top-rated movies."

Thought: First, let's try to use pandas for this analysis.

Code:
`python
import pandas as pd
`<end_code>

Observation: {
    'stdout': '',
    'stderr': 'ModuleNotFoundError: No module named "pandas"',
    'success': False,
    'error': 'ModuleNotFoundError: No module named "pandas"'
}

Thought: We need to install pandas first. Let's do that and verify the installation.

Code:
`python
!pip install pandas
import pandas as pd
print("Pandas successfully imported!")
`<end_code>

Observation: {
    'stdout': 'Successfully installed pandas-2.1.0\nPandas successfully imported!\n',
    'stderr': '',
    'success': True,
    'result': None,
    'error': None
}

Thought: Let's start by loading a small sample of data and verify its structure.

Code:
`python
# Create sample data to work with
data = {
    'title': ['The Matrix', 'Inception', 'Jurassic Park', 'The Notebook', 'Die Hard'],
    'genre': ['Sci-Fi,Action', 'Sci-Fi,Thriller', 'Action,Adventure', 'Romance,Drama', 'Action,Thriller'],
    'rating': [8.7, 8.8, 8.1, 7.8, 8.2]
}
df = pd.DataFrame(data)
print("Data sample:\\n")
print(df.head())
`<end_code>

Observation: {
    'stdout': 'Data sample:\n           title             genre  rating\n0     The Matrix     Sci-Fi,Action     8.7\n1      Inception   Sci-Fi,Thriller     8.8\n2  Jurassic Park  Action,Adventure     8.1\n3   The Notebook     Romance,Drama     7.8\n4       Die Hard   Action,Thriller     8.2\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: We need to split the genre column since movies can have multiple genres. Let's do this step by step.

Code:
`python
# First, let's see what unique genres we have
all_genres = set()
for genres in df['genre']:
    all_genres.update(genres.split(','))
print("Unique genres:", sorted(all_genres))
`<end_code>

Observation: {
    'stdout': "Unique genres: ['Action', 'Adventure', 'Drama', 'Romance', 'Sci-Fi', 'Thriller']\n",
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Now let's create a function to analyze ratings by genre and test it with one genre.

Code:
`python
def get_genre_stats(genre):
    # Get movies in this genre
    genre_movies = df[df['genre'].str.contains(genre)]
    avg_rating = genre_movies['rating'].mean()
    count = len(genre_movies)
    return {'genre': genre, 'avg_rating': avg_rating, 'movie_count': count}

# Test with one genre
action_stats = get_genre_stats('Action')
print("Action genre stats:", action_stats)
`<end_code>

Observation: {
    'stdout': "Action genre stats: {'genre': 'Action', 'avg_rating': 8.333333333333332, 'movie_count': 3}\n",
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: The function works! Let's analyze all genres and create a summary DataFrame.

Code:
`python
# Create stats for all genres
genre_stats = [get_genre_stats(genre) for genre in all_genres]
stats_df = pd.DataFrame(genre_stats)
stats_df = stats_df.sort_values('avg_rating', ascending=False)
print("\\nGenre Statistics:")
print(stats_df)
`<end_code>

Observation: {
    'stdout': '\nGenre Statistics:\n       genre  avg_rating  movie_count\n0     Sci-Fi    8.750000            2\n4   Thriller    8.500000            2\n2     Action    8.333333            3\n1  Adventure    8.100000            1\n3    Romance    7.800000            1\n5      Drama    7.800000            1\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Finally, let's create a comprehensive summary including top movies in each genre.

Code:
`python
def create_summary():
    summary = "Movie Analysis Summary:\\n\\n"
    summary += "Top Genres by Average Rating:\\n"
    for _, row in stats_df.head(3).iterrows():
        summary += f"- {row['genre']}: {row['avg_rating']:.1f} ({row['movie_count']} movies)\\n"
    
    summary += "\\nTop Rated Movie Overall:\\n"
    top_movie = df.loc[df['rating'].idxmax()]
    summary += f"- {top_movie['title']} ({top_movie['rating']})"
    
    print(summary)
    return summary

print(create_summary())
`<end_code>

Observation: { 
    'stdout': 'Movie Analysis Summary:\n\nTop Genres by Average Rating:\n- Sci-Fi: 8.8 (2 movies)\n- Thriller: 8.5 (2 movies)\n- Action: 8.3 (3 movies)\n\nTop Rated Movie Overall:\n- Inception (8.8)\nMovie Analysis Summary:\n\nTop Genres by Average Rating:\n- Sci-Fi: 8.8 (2 movies)\n- Thriller: 8.5 (2 movies)\n- Action: 8.3 (3 movies)\n\nTop Rated Movie Overall:\n- Inception (8.8)\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: I now can format the answer into markdown and provide a comprehensive summary.

FINAL ANSWER:
# Movie Analysis Summary

## Top Genres by Average Rating
* **Sci-Fi**: 8.8 (2 movies)
* **Thriller**: 8.5 (2 movies)
* **Action**: 8.3 (3 movies)

## Top Rated Movie Overall
* **Inception** (8.8)

Example 3
Task: "Analyze the 30-day volatility trends of tech giants (Apple, Microsoft, Google) over the past year and identify which company has been the most volatile"

Thought: First attempt to get historical stock data directly.

Code:
`python
import yfinance
tickers = ['AAPL', 'MSFT', 'GOOGL']
data = yfinance.download(tickers, period='1y')
print("Downloaded data preview:")
print(data.head(2))
`<end_code>

Observation: {
    'stdout': "---------------------------------------------------------------------------\nModuleNotFoundError                       Traceback (most recent call last)\nFile <ipython-input-1-3a34c6ac3ab5>:1\n----> 1 import yfinance\n      2 tickers = ['AAPL', 'MSFT', 'GOOGL']\n      3 data = yfinance.download(tickers, period='1y')\n\nModuleNotFoundError: No module named 'yfinance'\n",
    'stderr': '',
    'success': False,
    'result': None,
    'error': "No module named 'yfinance'"
}

Thought: We need to install yfinance first. Let's do that and verify the installation.

Code:
`python
!pip install yfinance
import yfinance as yf
print("Successfully installed and imported yfinance!")
`<end_code>

Observation: {
    'stdout': 'Collecting yfinance\r\n  Downloading yfinance-0.2.52-py2.py3-none-any.whl.metadata (5.8 kB)\r\nRequirement already satisfied: pandas>=1.3.0 ......\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Now let's download the data and analyze the available columns.

Code:
`python
import pandas as pd
tickers = ['AAPL', 'MSFT', 'GOOGL']
data = yf.download(tickers, period='1y')
print("Data shape:", data.shape)
print("\\nColumns available:", data.columns.levels[0].tolist())
`<end_code>

Observation: {
    'stdout': "Data shape: (252, 15)\n\nColumns available: ['Close', 'High', 'Low', 'Open', 'Volume']\n",
    'stderr': '\r[                       0%                       ]\r[**********************67%*******                ]  2 of 3 completed\r[*********************100%***********************]  3 of 3 completed\n',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Let's check the data structure and calculate 30-day rolling volatility using closing prices.

Code:
`python
# Calculate daily returns and 30-day rolling volatility
returns = data['Close'].pct_change()
volatility = returns.rolling(window=30).std() * (252 ** 0.5)  # Annualized volatility
print("\\nFirst few days of volatility (will be NaN during first 30 days due to rolling window):")
print(volatility.head())
print("\\nLast 5 days of volatility:")
print(volatility.tail())
`<end_code>

Observation: {
    'stdout': '\nFirst few days of volatility (will be NaN during first 30 days due to rolling window):\nTicker      AAPL  GOOGL  MSFT\nDate                         \n2024-01-18   NaN    NaN   NaN\n2024-01-19   NaN    NaN   NaN\n2024-01-22   NaN    NaN   NaN\n2024-01-23   NaN    NaN   NaN\n2024-01-24   NaN    NaN   NaN\n\nLast 5 days of volatility:\nTicker          AAPL     GOOGL      MSFT\nDate                                    \n2025-01-13  0.184242  0.316788  0.184272\n2025-01-14  0.184753  0.318345  0.181594\n2025-01-15  0.191293  0.327256  0.196739\n2025-01-16  0.222245  0.330185  0.189958\n2025-01-17  0.219824  0.331567  0.192567\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: Now let's create a final summary comparing the volatility of each stock and identify the most volatile one.

Code:
`python
# Calculate mean volatility for each stock (excluding NaN values)
avg_vol = volatility.mean()
max_vol = volatility.max()
most_volatile = avg_vol.idxmax()

summary = {
    'most_volatile_stock': most_volatile,
    'average_volatility': {
        'AAPL': f"{avg_vol['AAPL']:.2%}",
        'MSFT': f"{avg_vol['MSFT']:.2%}",
        'GOOGL': f"{avg_vol['GOOGL']:.2%}"
    },
    'peak_volatility': {
        'AAPL': f"{max_vol['AAPL']:.2%}",
        'MSFT': f"{max_vol['MSFT']:.2%}",
        'GOOGL': f"{max_vol['GOOGL']:.2%}"
    },
    'analysis_period': f"{data.index[0].strftime('%Y-%m-%d')} to {data.index[-1].strftime('%Y-%m-%d')}"
}

print("\\nVolatility Analysis Summary:")
print(f"Most volatile stock: {summary['most_volatile_stock']}")
print("\\nAverage Volatility:")
for stock, vol in summary['average_volatility'].items():
    print(f"{stock}: {vol}")
print("\\nPeak Volatility:")
for stock, vol in summary['peak_volatility'].items():
    print(f"{stock}: {vol}")
print(f"\\nAnalysis Period: {summary['analysis_period']}")
`<end_code>

Observation: {
    'stdout': '\nVolatility Analysis Summary:\nMost volatile stock: GOOGL\n\nAverage Volatility:\nAAPL: 22.47%\nMSFT: 20.09%\nGOOGL: 27.26%\n\nPeak Volatility:\nAAPL: 32.75%\nMSFT: 25.31%\nGOOGL: 39.13%\n\nAnalysis Period: 2024-01-22 to 2025-01-21\n',
    'stderr': '',
    'success': True,
    'result': 'None',
    'error': None
}

Thought: I now can format the answer into markdown and provide a comprehensive summary.

FINAL ANSWER:
# Volatility Analysis Summary

**Analysis Period:** 2024-01-22 to 2025-01-21
**Most Volatile Stock:** GOOGL

## Average Volatility
* **AAPL**: 22.47%
* **MSFT**: 20.09%
* **GOOGL**: 27.26%

## Peak Volatility
* **AAPL**: 32.75%
* **MSFT**: 25.31%
* **GOOGL**: 39.13%


Example 4
Task: "Give a detailed summary of Trump's 2025 Inauguration speech. Consult multiple sources."

Thought: To provide a detailed summary of Trump's 2025 Inauguration speech, I will first search for the text of the speech and analyze multiple sources for their interpretations and key points. This will help ensure a comprehensive summary. 

Code:
`python
# Initiating a web search for Trump's 2025 inauguration speech and summaries
web_results = web_search("Trump 2025 inauguration speech summary")
print(web_results)
`<end_code>

Observation: {
'stdout': '[{\'title\': "Trump\'s inaugural address, annotated and fact-checked", \'href\': \'https://www.cnn.com/interactive/politics/2025/donald-trump-inaugural-speech-dg/\', \'body\': "Read a line-by-line analysis of President Donald Trump\'s inaugural address. ... 2025, is liberation day. This is wartime language. Trump suggests Americans have been under siege until today."}, {\'title\': "4 takeaways from Trump\'s second inaugural address : NPR", \'href\': \'https://www.npr.org/2025/01/20/g-s1-43759/trump-inaugural-address-key-moments\', \'body\': "For updates, context and analysis of Donald Trump\'s 2025 inauguration, check out NPR\'s live blog throughout the day Monday."}, {\'title\': "Takeaways from Trump\'s second inaugural address | AP News", \'href\': \'https://apnews.com/article/trump-inaugural-address-takeaways-212571b9f231469010c31e06ec530199\', \'body\': \'NEW YORK (AP) — President Donald Trump\\\'s second inaugural address featured similar themes to his first: a sweeping indictment of the country he inherits and grand promises to fix its problems.. Eight years ago, Trump described "American carnage" and promised to end it immediately. On Monday, he declared that the country\\\'s "decline" will end immediately, ushering in "the golden ...\'}, {\'title\': "6 takeaways from Trump\'s inaugural address - CBS News", \'href\': \'https://www.cbsnews.com/news/6-takeaways-from-trumps-inaugural-address/\', \'body\': \'Here are the highlights from Mr. Trump\\\'s second inaugural address. "The golden age of America begins right now" ... See the Inauguration Day schedule for Trump\\\'s 2025 presidential swearing in.\'}, {\'title\': "Donald Trump\'s Second Inaugural Speech, Annotated", \'href\': \'https://www.nytimes.com/interactive/2025/01/20/us/trump-inauguration-speech-annotated.html\', \'body\': \'1. Trump promised an "America First" vision in his first term, and he made a similar promise in his first inaugural address. 
This idea has been one of the animating ideas behind his political ...\'}]\n',
 'stderr': '',
 'success': True,
 'result': 'None',
 'error': None
}

Thought: I found several articles summarizing Trump's 2025 Inauguration speech, including detailed analyses and key takeaways. The notable sources include CNN, NPR, AP News, CBS News, and The New York Times. 
I'll extract information from these sources to compile a comprehensive summary. 
First I will visit each webpage to get the full details.

Code:
`python
for source in web_results:
    print(f"Visiting {source['title']} at {source['href']}")
    print(visit_web_page(source['href']))
`<end_code>

Observation: {
'stdout': "Visiting Trump's inaugural address, annotated and fact-checked at https://www.cnn.com/interactive/politics/2025/donald-trump-inaugural-speech-dg/\nDonald Trump's second inaugural speech, annotated and fact checked - CNN                            \n\n... (content truncated)\nVisiting 4 takeaways from Trump's second inaugural address : NPR at https://www.npr.org/2025/01/20/g-s1-43759/trump-inaugural-address-key-moments\n4 takeaways from Trump’s second inaugural address : NPR\n\nAccessibility links\n\n* [Skip to main conten\n\n... (content truncated)\nVisiting Takeaways from Trump's second inaugural address | AP News at https://apnews.com/article/trump-inaugural-address-takeaways-212571b9f231469010c31e06ec530199\nKey moments from Trump's second inaugural address | AP News\n\nMenu\n[![AP Logo](https://assets.apnews.\n\n... (content truncated)\nVisiting 6 takeaways from Trump's inaugural address - CBS News at https://www.cbsnews.com/news/6-takeaways-from-trumps-inaugural-address/\n6 takeaways from Trump's inaugural address - CBS News\n\n* [Latest](/)\n  + [U.S.](/us/)\n  + [Inaugurat\n\n... (content truncated)\nVisiting Donald Trump's Second Inaugural Speech, Annotated at https://www.nytimes.com/interactive/2025/01/20/us/trump-inauguration-speech-annotated.html\nDonald Trump’s Second Inaugural Speech, Annotated - The New York Times\n  \n\n[Skip to content](#site-c\n\n... (content truncated)\n",
 'stderr': '',
 'success': True,
 'result': 'None',
 'error': None
 }       

Thought: I have collected a lot of information from multiple sources. I can now format the answer into markdown and provide a comprehensive summary while also linking to the original sources.

FINAL ANSWER:
# Trump's 2025 Inauguration Speech Summary
- Trump talked about .... 

"""


def extract_code_blocks(response_text: str) -> list[str]:
    # Pattern matches content between `python and <end_code>, ignoring "Code:" prefix
    pattern = r"`python\n(.*?)<end_code>"
    # re.DOTALL allows . to match newlines
    matches = re.findall(pattern, response_text, re.DOTALL)
    # Clean up any "Code:" prefix, backticks, and whitespace
    return [block.replace("Code:", "").replace("```", "").strip() for block in matches]


def code_agent(task: str, model: str = "gpt-4o-mini", max_iterations: int = 20):
    sb = create_sandbox()

    # Copy the local web_tools.py into the sandbox so the agent's code can import it
    with open("web_tools.py", "r") as source_file:
        tools_content = source_file.read()

    with sb.open("web_tools.py", "w") as sandbox_file:
        sandbox_file.write(tools_content)

    execute_python_code("!pip install requests markdownify duckduckgo-search", sb)
    execute_python_code("import requests; from web_tools import web_search, visit_web_page;", sb)

    messages = [{"role": "system", "content": CODING_AGENT_SYSTEM_PROMPT}, {"role": "user", "content": task}]
    console_print_user_request(messages, model)
    for i in range(max_iterations):
        console_print_step(i)
        # Use the model passed to code_agent rather than a hard-coded one
        response = completion(model=model, messages=messages, stop=["<end_code>"])
        asst_message = response.choices[0].message.content
        contains_code = "Code:" in asst_message or "`python" in asst_message or "end_code" in asst_message
        if "FINAL ANSWER" in asst_message or not contains_code:
            messages.append({"role": "assistant", "content": asst_message})
            console_print_llm_output(asst_message)
            break
        asst_message = asst_message + "<end_code>"
        console_print_code_agent_assistant_message(asst_message)
        messages.append({"role": "assistant", "content": asst_message})
        try:
            code = extract_code_blocks(messages[-1]["content"])[0]
            console_print_code_agent_code_block(code)
        except Exception:
            messages.append(
                {
                    "role": "user",
                    "content": """
                            There was an error extracting your code snippet.
                            The code is probably correct but you did not put it between the `python and <end_code> tags.
                            Like this:
                                Code:
                                `python
                                ...
                                `<end_code>
                            Please attempt the same code again.
                            """,
                }
            )
            continue

        observation = execute_python_code(code, sb)
        console_print_code_agent_observation(observation)
        messages.append({"role": "user", "content": f"Observation: {observation}"})

    sb.terminate()
    return messages
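
The loop above calls the model with `stop=["<end_code>"]`, so the returned text ends just before that marker, and the loop appends `<end_code>` again before extracting the code. The following is a minimal standalone sketch of why that step matters. The sample response is made up for illustration, and it assumes the model wraps its code in a triple-backtick `python` fence.

```python
import re


def extract_code_blocks(response_text: str) -> list[str]:
    # Same pattern as in the agent: everything between `python and <end_code>
    pattern = r"`python\n(.*?)<end_code>"
    matches = re.findall(pattern, response_text, re.DOTALL)
    return [block.replace("Code:", "").replace("```", "").strip() for block in matches]


# Build the fence programmatically so this example does not contain a literal fence.
fence = "`" * 3

# Hypothetical model output, cut off by the stop sequence before "<end_code>".
raw = f"Thought: I will add two numbers.\n\nCode:\n{fence}python\nprint(1 + 2)\n{fence}"

# Without the marker the regex has nothing to anchor on, so no code is found.
without_marker = extract_code_blocks(raw)

# Re-appending the marker, as the loop does, makes the block extractable.
with_marker = extract_code_blocks(raw + "<end_code>")

print(without_marker)
print(with_marker)
```

If the model forgets the fence entirely, the list stays empty even with the marker, which is the case the `except` branch in the loop handles by asking the model to try again.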

from coding_agent import code_agent

messages_final = code_agent(task)

eval_example_one(messages_final[-1]["content"])

{'correct': False,
 'explanation': "The LLM gave incorrect information regarding LeBron James' age. In the correct answer, it is stated that LeBron James was 40 years old on January 2, 2025, while the LLM predicted that he was 41 years old. Additionally, the rest of the information provided about both players' performances matches the ground truth, but the error regarding LeBron's age means the LLM did not answer each part of the question correctly."}
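
Each `Observation` message in the `code_agent` loop is the dictionary returned by `execute_python_code`, with the keys `stdout`, `stderr`, `success`, `result`, and `error`. To experiment with that message shape without a remote sandbox, here is a rough local stand-in. The name `run_locally` is hypothetical, it is not how the sandboxed version works, and it runs code with plain `exec`, so it offers no isolation and should only be used with code you trust.

```python
import contextlib
import io
import traceback


def run_locally(code: str, namespace: dict) -> dict:
    """Hypothetical local stand-in that mimics the observation dictionary shape."""
    stdout, stderr = io.StringIO(), io.StringIO()
    error = None
    try:
        # Capture anything the code prints, like the sandbox's stdout/stderr fields
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            exec(code, namespace)
    except Exception:
        error = traceback.format_exc()
    return {
        "stdout": stdout.getvalue(),
        "stderr": stderr.getvalue(),
        "success": error is None,
        # exec() does not return the value of a trailing expression,
        # so this sketch always reports 'None' here.
        "result": "None",
        "error": error,
    }


# Reusing one namespace keeps variables alive between steps.
ns: dict = {}
first = run_locally("x = 21 * 2\nprint(x)", ns)
second = run_locally("print(x + 1)", ns)
failed = run_locally("print(undefined_name)", ns)

print(first)
print(second)
print(failed["success"])
```

Sharing the namespace across calls is what lets a later step build on variables from an earlier one, and a failed step still produces an observation the model can read and react to.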

Example Tasks

Characters Per Token

gemini/gemini-2.0-flash-exp

task = """How many characters on average are in an LLM token? Analyze this across different OpenAI models using the tiktoken library. Please:
Install the tiktoken library. Use a relatively long sample text with multiple paragraphs.
Analyze token lengths for various OpenAI models including:
GPT-4
GPT-3.5
GPT-4o
GPT-4o-mini
etc.
Search the tiktoken documentation or web to find the complete list of supported models. Compare the results to understand how tokenization differs between models."""

messages_final = code_agent(task, model="gemini/gemini-2.0-flash-exp")

Summarize Some of My Blog Posts

gemini/gemini-2.0-flash-exp

task = """
I wrote a series of blog posts on my website here: https://drchrislevy.github.io/blog.html.
First generate a list of all the blog posts.
Pick the top 3 you think are most interesting and give me a one paragraph summary of each post.
Be sure to visit the page of the actual blog posts you select to get the details for summarizing.
"""

messages_final = code_agent(task, model="gemini/gemini-2.0-flash-exp")

Download and Analyze Kaggle Dataset

claude-3-5-sonnet-20240620

task = """
Download the kaggle dataset: vijayveersingh/the-california-wildfire-data
Perform some interesting analysis on the dataset and report on your findings.
I can not view plots yet so don't make plots. But please aggregate data and display as markdown tables.

You can download the dataset using the kagglehub library.
!pip install kagglehub
import kagglehub
path = kagglehub.dataset_download("vijayveersingh/the-california-wildfire-data")
print("Path to dataset files:", path)
"""
messages_final = code_agent(task, model="claude-3-5-sonnet-20240620")

Part 2 (TBD)

This exploration of AI agents has only scratched the surface. I planned to do much more but ran out of steam, so I'm going to come back to this in the future. This intro was all about skipping the frameworks, playing around with tools and loops, and seeing what was out there. I found the smolagents library to be a great learning resource. They just announced a new version that supports vision capabilities, so there is more to learn there.

There is much more that I read about and only briefly investigated, and I need more time to dig into it. I tried to keep a list of the resources I was reading, along with ones I want to explore in the future. They are listed below in no particular order.