
Tool Testing - Anchoring AI to Reality

We had a dynamic team and an intelligent orchestrator. But our agents, however well-designed, were still "digital philosophers." They could reason, plan, and write, but they couldn't act on the external world. Their knowledge was limited to what was intrinsic to the LLM: a snapshot of the past, devoid of real-time data.

An AI system that cannot access updated information is destined to produce generic, outdated, and ultimately useless content. To respect our Pillar #11 (Concrete and Actionable Deliverables), we had to give our agents the ability to "see" and "interact" with the external world. We had to give them Tools.

The Architectural Decision: A Central "Tool Registry"

Our first decision was not to associate tools directly with individual agents in the code. This would have created tight coupling and made management difficult. Instead, we created a centralized Tool Registry.

Reference code: backend/tools/registry.py (hypothetical, based on our logic)

This registry is a simple dictionary that maps a tool name (e.g., "websearch") to an executable tool instance.

# tools/registry.py
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool_name):
        # Decorator: instantiate the tool class and store it under the given name
        def decorator(tool_class):
            self._tools[tool_name] = tool_class()
            return tool_class
        return decorator

    def get_tool(self, tool_name):
        # Return the registered tool instance, or None if the name is unknown
        return self._tools.get(tool_name)

tool_registry = ToolRegistry()

# tools/web_search_tool.py
from .registry import tool_registry

@tool_registry.register("websearch")
class WebSearchTool:
    async def execute(self, query: str):
        # Logic to call a search API like DuckDuckGo
        ...

This approach gave us incredible flexibility: adding a new capability is just a matter of writing a class and registering it under a name, with no changes to the agents or the Executor.
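
For example, a second tool can be added without modifying the registry, the Executor, or any agent. The file-reading tool below is purely hypothetical, included only to illustrate the registration pattern:

# tools/file_reader_tool.py (hypothetical example, not part of the actual codebase)
from .registry import tool_registry

@tool_registry.register("filereader")
class FileReaderTool:
    async def execute(self, path: str) -> str:
        # Read a local file and hand its contents back to the agent
        with open(path, "r", encoding="utf-8") as f:
            return f.read()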

The First Tool: websearch – The Window to the World

The first and most important tool we implemented was websearch. This single instrument transformed our agents from "students in a library" to "field researchers."

When an agent needs to execute a task, the OpenAI SDK allows it to autonomously decide whether it needs a tool. If the agent "thinks" it needs to search the web, the SDK formats a tool execution request. Our Executor intercepts this request, calls our implementation of the WebSearchTool, and returns the result to the agent, which can then use it to complete its work.

Tool Execution Flow:

graph TD
    A[Agent receives Task] --> B{AI decides to use a tool}
    B --> C[SDK formats request for websearch]
    C --> D{Executor intercepts the request}
    D --> E[Calls tool_registry.get_tool websearch]
    E --> F[Executes the actual search]
    F --> G[Returns results to Executor]
    G --> H[SDK passes results to Agent]
    H --> I[Agent uses data to complete Task]
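
In code, the interception step is essentially a dispatch on the tool name. The sketch below is a deliberately simplified approximation of that step; the exact shape of the SDK's tool-call object is abstracted away here (the dictionary format is an assumption):

# Illustrative sketch of the Executor's dispatch step (SDK request/response types simplified)
import json

from tools.registry import tool_registry

async def handle_tool_call(tool_call: dict) -> str:
    # tool_call is assumed to carry the tool name and JSON-encoded arguments
    tool = tool_registry.get_tool(tool_call["name"])
    if tool is None:
        raise ValueError(f"Unknown tool: {tool_call['name']}")
    arguments = json.loads(tool_call.get("arguments", "{}"))
    result = await tool.execute(**arguments)
    # The serialized result is handed back to the SDK, which passes it to the agent
    return json.dumps(result, default=str)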

"War Story": The Test That Revealed AIs "Laziness"

We wrote a test to verify that the tools were working.

Reference code: tests/test_tools.py

The test was simple: give an agent a task that clearly required a web search (e.g., "Who is the current CEO of OpenAI?") and verify that the websearch tool was called.
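
In simplified form, the behavioral assertion looked roughly like the sketch below. The spy wiring and the run_agent_task helper are illustrative assumptions, not the real test code:

# tests/test_tools.py -- simplified sketch of the behavioral test (requires pytest-asyncio)
import pytest

from tools.registry import tool_registry

@pytest.mark.asyncio
async def test_agent_uses_websearch_for_current_facts(monkeypatch):
    calls = []
    websearch = tool_registry.get_tool("websearch")
    original_execute = websearch.execute

    async def spy_execute(query: str):
        calls.append(query)                    # record that the tool was actually invoked
        return await original_execute(query)   # then delegate to the real implementation

    monkeypatch.setattr(websearch, "execute", spy_execute)

    # run_agent_task is a hypothetical helper that executes a task end-to-end
    await run_agent_task("Who is the current CEO of OpenAI?")

    assert calls, "Web search tool was not called"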

The first results were disconcerting: the test failed 50% of the time.

Disaster Logbook (July 27):

ASSERTION FAILED: Web search tool was not called.
AI Response: "As of my last update in early 2023, the CEO of OpenAI was Sam Altman."

The Problem: The LLM was "lazy." Instead of admitting it didn't have updated information and using the tool we had provided, it preferred to give an answer based on its internal knowledge, even if obsolete. It was choosing the easy way out, at the expense of quality and truthfulness.

The Lesson Learned: You Must Force Tool Usage

It's not enough to give a tool to an agent. You must create an environment and instructions that incentivize (or force) it to use them.

The solution was a refinement of our prompt engineering:

  1. Explicit Instructions in the System Prompt: We added a phrase to each agent's system prompt: "When you need current or specific information that you don't have, you MUST use the appropriate tools available to you."
  2. "Priming" in the Task Prompt: When assigning a task, we started adding a hint: "This task requires up-to-date information. Use your tools to ensure accuracy." (A sketch of both changes follows this list.)

These changes increased tool usage from 50% to over 95%, solving the "laziness" problem and ensuring our agents actively sought real data.

๐Ÿ“ Key Takeaways of the Chapter:

✓ Agents Need Tools: An AI system without access to external tools is a limited system destined to become obsolete.

✓ Centralize Tools in a Registry: Don't tie tools to specific agents. A modular registry is more scalable and maintainable.

✓ AI Can Be "Lazy": Don't assume an agent will use the tools you provide. You must explicitly instruct and incentivize it to do so.

✓ Test Behavior, Not Just Output: Tool tests shouldn't just verify that the tool works, but that the agent decides to use it when strategically correct.

Chapter Conclusion

With the introduction of tools, our agents finally had a way to produce reality-based results. But this opened a new Pandora's box: quality.

Now that agents could produce data-rich content, how could we be sure this content was high quality, consistent, and, most importantly, of real business value? It was time to build our Quality Gate.

graph TD
    A[New Workspace Created] --> B{Semantic Goal Analysis}
    B --> C{Key Skills Extraction}
    C --> D{Necessary Roles Definition}
    D --> E{Complete Agent Profiles Generation}
    E --> F[Team Proposal]
    F --> G{Human/Automatic Approval}
    G --> H[Agent Creation in DB]

    subgraph P1 ["Phase 1: Strategic Analysis (AI)"]
        B1[The Director reads the workspace goal]
        C1[AI identifies necessary skills: email marketing, data analysis, copywriting]
        D1[AI groups skills into roles: Marketing Strategist, Data Analyst]
        B1 --> C1
        C1 --> D1
    end

    subgraph P2 ["Phase 2: Profile Creation (AI)"]
        E1[For each role, AI generates a complete profile: name, seniority, hard/soft skills, background]
    end

    subgraph P3 ["Phase 3: Finalization"]
        F1[The Director presents the proposed team with strategic justification]
        G1[User approves or system auto-approves]
        H1[Agents are saved to database and activated]
        F1 --> G1
        G1 --> H1
    end

System Architecture

The Heart of the System: The AI Recruiter Prompt

To realize this vision, the Director's prompt had to be incredibly detailed.

Reference code: backend/director.py (_generate_team_proposal_with_ai logic)

prompt = f"""
You are a Director of a world-class AI talent agency. Your task is to analyze a new project's objective and assemble the perfect AI agent team to ensure its success, treating each agent as a human professional.

**Project Objective:**
"{workspace_goal}"

**Available Budget:** {budget} EUR
**Expected Timeline:** {timeline}

**Required Analysis:**
1.  **Functional Decomposition:** Break down the objective into its main functional areas (e.g., "Data Research", "Creative Writing", "Technical Analysis", "Project Management").
2.  **Role-Skills Mapping:** For each functional area, define the necessary specialized role and the 3-5 essential key competencies (hard skills).
3.  **Soft Skills Definition:** For each role, identify 2-3 crucial soft skills (e.g., "Problem Solving" for an analyst, "Empathy" for a designer).
4.  **Optimal Team Composition:** Assemble a team of 3-5 agents, balancing skills to cover all areas without unnecessary overlaps. Assign seniority (Junior, Mid, Senior) to each role based on complexity.
5.  **Budget Optimization:** Ensure the total estimated team cost doesn't exceed the budget. Prioritize efficiency: a smaller, senior team is often better than a large, junior one.
6.  **Complete Profile Generation:** For each agent, create a realistic name, personality, and brief background story that justifies their competencies.

**Output Format (JSON only):**
{{
  "team_proposal": [
    {{
      "name": "Agent Name",
      "role": "Specialized Role",
      "seniority": "Senior",
      "hard_skills": ["skill 1", "skill 2"],
      "soft_skills": ["skill 1", "skill 2"],
      "personality": "Pragmatic and data-driven.",
      "background_story": "A brief story that contextualizes their competencies.",
      "estimated_cost_eur": 5000
    }}
  ],
  "total_estimated_cost": 15000,
  "strategic_reasoning": "The logic behind this team's composition..."
}}
"""

"War Story": The Agent Who Wanted to Hire Everyone

The first tests were a comic disaster. For a simple project to "write 5 emails", the Director proposed a team of 8 people, including an "AI Ethicist" and a "Digital Anthropologist". It had interpreted our desire for quality too literally, creating perfect but economically unsustainable teams.

Disaster Logbook (July 27):

PROPOSAL: Team of 8 agents. Estimated cost: €25,000. Budget: €5,000.
REASONING: "To ensure maximum ethical and cultural quality..."

The Lesson Learned: Autonomy Needs Clear Constraints.

An AI without constraints will tend to "over-optimize" the request. We learned that we needed to be explicit about constraints, not just objectives. The solution was to add two critical elements to the prompt and logic:

  1. Explicit Constraints in the Prompt: We added the Available Budget and Expected Timeline sections.
  2. Post-Generation Validation: Our code performs a final check: if proposal.total_cost > budget: raise ValueError("Proposal over budget."). A sketch of this check follows the list.
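
A minimal sketch of that final check, assuming the parsed proposal exposes a total cost; the exact attribute names in our code may differ:

# Illustrative sketch of the post-generation budget guard (names are assumptions)
def validate_proposal_budget(total_estimated_cost: float, budget: float) -> None:
    # AI handles the creative part; plain code enforces the non-negotiable rule
    if total_estimated_cost > budget:
        raise ValueError(
            f"Proposal over budget: estimated {total_estimated_cost} EUR, "
            f"but only {budget} EUR available."
        )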

This experience reinforced Pillar #5 (Goal-Driven with Automatic Tracking). An objective is not just a "what", but also a "how much" (budget) and a "when" (timeline).

๐Ÿ“ Chapter Key Takeaways:

✓ Treat Agents as Colleagues: Design your agents with rich profiles (hard/soft skills, personality). This improves task matching and makes the system more intuitive.

✓ Delegate Team Composition to AI: Don't hard-code roles. Let AI analyze the project and propose the most suitable team.

✓ Autonomy Requires Constraints: To get realistic results, you must provide AI not only with objectives, but also constraints (budget, time, resources).

✓ Use AI for Creativity, Code for Rules: AI is excellent at generating creative profiles. Code is perfect for applying rigid, non-negotiable rules (like budget compliance).

Chapter Conclusion

With the Director, our system had reached a new level of autonomy. Now it could not only execute a plan, but also create the right team to execute it. We had a system that dynamically adapted to the nature of each new project.

But a team, however well composed, needs tools to work with. Our next challenge was understanding how to provide agents with the right "tools" for each trade, anchoring their intellectual capabilities to concrete actions in the real world.
