With websearch, our agents had opened a window to the world. But an expert researcher doesn't just read: they analyze data, perform calculations, interact with other systems and, when necessary, consult other experts. To elevate our agents from simple "information gatherers" to true "digital analysts," we needed to drastically expand their toolbox.
The OpenAI Agents SDK classifies tools into three main categories, and our journey led us to implement them and understand their respective strengths and weaknesses.
The first category, the Function Tool, is the most common and powerful form of tool. It allows you to transform any Python function into a capability that the agent can invoke. The SDK magically takes care of analyzing the function signature, argument types, and even the docstring to generate a schema that the LLM can understand.
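In its simplest form, this is just a decorated function. Here is a minimal sketch of that native pattern, assuming the function_tool decorator and Agent class exported by the agents package; the get_city_population function is purely illustrative:

```python
from agents import Agent, function_tool

@function_tool
def get_city_population(city: str) -> int:
    """Return the most recently known population for the given city."""
    # Illustrative stub: a real tool would query an external data source.
    known = {"Rome": 2_749_000, "Milan": 1_371_000}
    return known.get(city, 0)

analyst = Agent(
    name="DemographicsAnalyst",
    instructions="Answer population questions using your tools.",
    tools=[get_city_population],
)
```

The decorator reads the signature and docstring shown above to build the schema the LLM sees; no manual JSON schema is needed.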
The Architectural Decision: A Central "Tool Registry" and Decorators
To keep our code clean and modular (Pillar #14), we implemented a central ToolRegistry. Any function anywhere in our codebase can be transformed into a tool simply by adding a decorator.
Reference code: backend/tools/registry.py and backend/tools/web_search_tool.py
# Example of a Function Tool
from .registry import tool_registry

@tool_registry.register("websearch")
class WebSearchTool:
    """
    Performs a web search using the DuckDuckGo API to get updated information.
    Essential for tasks that require real-time data.
    """
    async def execute(self, query: str) -> str:
        # Logic to call a search API...
        return "Search results..."
The SDK allowed us to cleanly define not only the action (execute), but also its "advertisement" to the AI through the docstring, which becomes the tool's description.
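The registry behind that decorator doesn't need to be complicated. A minimal sketch of what backend/tools/registry.py might contain (names and structure are assumptions, not the actual implementation):

```python
# backend/tools/registry.py -- illustrative sketch, not the actual implementation
from typing import Callable, Dict, Type


class ToolRegistry:
    """Central catalogue mapping tool names to tool classes."""

    def __init__(self) -> None:
        self._tools: Dict[str, Type] = {}

    def register(self, name: str) -> Callable[[Type], Type]:
        """Class decorator that adds a tool class to the registry under `name`."""
        def decorator(tool_cls: Type) -> Type:
            self._tools[name] = tool_cls
            return tool_cls
        return decorator

    def get(self, name: str):
        """Instantiate and return the tool registered under `name`."""
        return self._tools[name]()

    def list_tools(self) -> list[str]:
        return sorted(self._tools)


# Module-level singleton imported by the @tool_registry.register(...) decorators.
tool_registry = ToolRegistry()
```

The key design choice is that agents discover tools through the registry rather than importing them directly, which keeps tools decoupled from any specific agent.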
Some tools are so complex and require such specific infrastructure that it doesn't make sense to implement them ourselves. These are "Hosted Tools": services run directly on OpenAI's servers. The most important one for us was the CodeInterpreterTool.
The Challenge: The code_interpreter – A Sandboxed Analysis Laboratory
Many tasks required complex quantitative analysis. The solution was to give the AI the ability to write and execute Python code.
Reference code: backend/tools/code_interpreter_tool.py (integration logic)
As mentioned, our first encounter with the code_interpreter was traumatic. An agent generated dangerous code (rm -rf /*), teaching us the fundamental lesson about security.
The Lesson Learned: "Zero Trust Execution"
Code generated by an LLM must be treated as the most hostile input possible. Our security architecture is based on three levels:
| Security Level | Implementation | Purpose |
|---|---|---|
| 1. Sandboxing | Execution of all code in an ephemeral Docker container with minimal permissions (no access to network or host file system). | Completely isolate execution, making even the most dangerous commands harmless. |
| 2. Static Analysis | A pre-execution validator that looks for obviously malicious code patterns (os.system, subprocess). | A quick first filter to block the most obvious abuse attempts. |
| 3. Guardrail (Human-in-the-Loop) | An SDK Guardrail that intercepts code. If it attempts critical operations, it pauses execution and requests human approval. | The final safety net, applying Pillar #8 to tool security as well. |
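A condensed sketch of levels 1 and 2, assuming the docker Python SDK and a deliberately simple pattern blacklist (the real validator and container configuration are more thorough):

```python
# Illustrative sketch of sandboxing + static analysis; assumes the `docker` Python SDK.
import re
import docker

FORBIDDEN_PATTERNS = [r"\bos\.system\b", r"\bsubprocess\b", r"\brm\s+-rf\b"]


def static_check(code: str) -> None:
    """Level 2: reject code containing obviously malicious patterns."""
    for pattern in FORBIDDEN_PATTERNS:
        if re.search(pattern, code):
            raise ValueError(f"Blocked by static analysis: {pattern}")


def run_in_sandbox(code: str) -> str:
    """Level 1: execute code in an ephemeral container with no network access."""
    static_check(code)
    client = docker.from_env()
    output = client.containers.run(
        image="python:3.12-slim",
        command=["python", "-c", code],
        network_disabled=True,   # no network access
        read_only=True,          # no writes to the container filesystem
        mem_limit="256m",        # minimal resources
        remove=True,             # ephemeral: container is deleted afterwards
    )
    return output.decode("utf-8")
```

Level 3, the Guardrail with human approval, sits on top of this and is described in the dedicated chapter on Guardrails.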
The third category, using agents themselves as tools, is the most advanced technique and the one that truly transformed our system into a digital organization. Sometimes, the best "tool" for a task isn't a function, but another agent.
We realized that our MarketingStrategist shouldn't try to do financial analysis. It should consult the FinancialAnalyst.
The "Agent-as-Tools" Pattern:
The SDK makes this pattern incredibly elegant with the .as_tool() method.
Reference code: Conceptual logic in director.py and specialist.py
# Definition of specialist agents
financial_analyst_agent = Agent(name="Financial Analyst", instructions="...")
market_researcher_agent = Agent(name="Market Researcher", instructions="...")

# Creation of the orchestrator agent
strategy_agent = Agent(
    name="StrategicPlanner",
    instructions="Analyze the problem and delegate to your specialists using tools.",
    tools=[
        financial_analyst_agent.as_tool(
            tool_name="consult_financial_analyst",
            tool_description="Ask a specific financial analysis question.",
        ),
        market_researcher_agent.as_tool(
            tool_name="get_market_data",
            tool_description="Request updated market data.",
        ),
    ],
)
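Running the orchestrator is then a single call. A usage sketch that continues the block above, assuming the SDK's Runner entry point:

```python
import asyncio
from agents import Runner


async def main() -> None:
    result = await Runner.run(
        strategy_agent,
        "Evaluate whether we should enter the Brazilian market next quarter.",
    )
    # The planner decides when to call its specialist "tools" and aggregates their answers.
    print(result.final_output)


asyncio.run(main())
```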
This unlocked hierarchical collaboration. Our system was no longer a "flat" team, but a true organization where agents could delegate sub-tasks, request consultations, and aggregate results, just like in a real company.
✓ Choose the Right Tool Class: Not all tools are equal. Use Function Tools for custom capabilities, Hosted Tools for complex infrastructure (like the code_interpreter), and Agents as Tools for delegation and collaboration.
✓ Security is Not Optional: If you use powerful tools like code execution, you must design a multi-layered security architecture based on the "Zero Trust" principle.
✓ Delegation is a Superior Form of Intelligence: The most advanced agent systems aren't those where every agent knows how to do everything, but those where every agent knows who to ask for help.
Chapter Conclusion
With a rich and secure toolbox, our agents were now able to tackle a much broader range of complex problems. They could analyze data, create visualizations, and collaborate at a much deeper level.
This, however, made the role of our quality system even more critical. With such powerful agents, how could we be sure that their outputs, now much more sophisticated, were still high quality and aligned with business objectives? This brings us back to our Quality Gate, but with a new and deeper understanding of what "quality" means.
We wrote a test to verify that the tools were working.
Reference code: tests/test_tools.py
The test was simple: give an agent a task that clearly required a web search (e.g., "Who is the current CEO of OpenAI?") and verify that the websearch tool was called.
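The test itself isn't reproduced here; a sketch of the behavioral check, assuming pytest-asyncio and a run_agent fixture that pushes a task through the agent system (both are illustrative assumptions):

```python
# tests/test_tools.py -- illustrative sketch, not the actual test
import pytest

from backend.tools.web_search_tool import WebSearchTool


@pytest.mark.asyncio
async def test_agent_uses_websearch_for_current_facts(monkeypatch, run_agent):
    """The agent should call the websearch tool when a question needs fresh data."""
    calls: list[str] = []
    original_execute = WebSearchTool.execute

    async def spying_execute(self, query: str) -> str:
        calls.append(query)                      # record every invocation
        return await original_execute(self, query)

    monkeypatch.setattr(WebSearchTool, "execute", spying_execute)

    # `run_agent` is an assumed fixture that runs a task through the agent system.
    await run_agent("Who is the current CEO of OpenAI?")

    assert calls, "Web search tool was not called"
```

Note that this is a behavioral test: it doesn't check the search results, only that the agent decided to use the tool at all.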
The first results were disconcerting: the test failed 50% of the time.
Disaster Logbook (July 27):
ASSERTION FAILED: Web search tool was not called.
AI Response: "As of my last update in early 2023, the CEO of OpenAI was Sam Altman."
The Problem: The LLM was "lazy." Instead of admitting it didn't have updated information and using the tool we had provided, it preferred to give an answer based on its internal knowledge, even if obsolete. It was choosing the easy way out, at the expense of quality and truthfulness.
The Lesson Learned: You Must Force Tool Usage
It's not enough to give a tool to an agent. You must create an environment and instructions that incentivize (or force) it to use it.
The solution was a refinement of our prompt engineering: tool usage became an explicit, non-negotiable requirement in the agent's instructions rather than a polite suggestion.
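The exact wording we settled on isn't reproduced here, but the refinement amounted to two moves: state the tool mandate bluntly in the instructions, and (where supported) force the model to call a tool. A sketch, assuming the SDK exposes a ModelSettings object with a tool_choice option:

```python
from agents import Agent, ModelSettings, function_tool


@function_tool
async def websearch(query: str) -> str:
    """Search the web for up-to-date information."""
    return "Search results..."  # stub; the real tool calls a search API


researcher = Agent(
    name="Researcher",
    instructions=(
        "You do NOT have reliable knowledge of current events. "
        "For ANY question about recent or time-sensitive facts you MUST call "
        "the websearch tool before answering. Never answer from memory alone."
    ),
    tools=[websearch],
    # Assumption: tool_choice="required" forces the model to call at least one tool.
    model_settings=ModelSettings(tool_choice="required"),
)
```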
These changes increased tool usage from 50% to over 95%, solving the "laziness" problem and ensuring our agents actively sought real data.
✓ Agents Need Tools: An AI system without access to external tools is a limited system destined to become obsolete.
✓ Centralize Tools in a Registry: Don't tie tools to specific agents. A modular registry is more scalable and maintainable.
✓ AI Can Be "Lazy": Don't assume an agent will use the tools you provide. You must explicitly instruct and incentivize it to do so.
✓ Test Behavior, Not Just Output: Tool tests shouldn't just verify that the tool works, but that the agent decides to use it when strategically correct.
Chapter Conclusion
With the introduction of tools, our agents finally had a way to produce reality-based results. But this opened a new Pandora's box: quality.
Now that agents could produce data-rich content, how could we be sure this content was high quality, consistent, and, most importantly, of real business value? It was time to build our Quality Gate.
Now that you understand specialist agent architecture, it's time to build the perfect toolkit to orchestrate them effectively.