
Tool Testing - Anchoring AI to Reality

We had a dynamic team and an intelligent orchestrator. But our agents, however well-designed, were still "digital philosophers." They could reason, plan, and write, but they couldn't act on the external world. Their knowledge was limited to what was intrinsic to the LLM: a snapshot of the past, devoid of real-time data.

An AI system that cannot access updated information is destined to produce generic, outdated, and ultimately useless content. To respect our Pillar #11 (Concrete and Actionable Deliverables), we had to give our agents the ability to "see" and "interact" with the external world. We had to give them Tools.

The Architectural Decision: A Central "Tool Registry"

Our first decision was not to associate tools directly with individual agents in the code. This would have created tight coupling and made management difficult. Instead, we created a centralized Tool Registry.

Reference code: backend/tools/registry.py (hypothetical, based on our logic)

This registry is a simple dictionary that maps a tool name (e.g., "websearch") to an executable tool instance.

# tools/registry.py
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, tool_name):
        # Decorator: instantiate the tool class and store it under the given name
        def decorator(tool_class):
            self._tools[tool_name] = tool_class()
            return tool_class
        return decorator

    def get_tool(self, tool_name):
        # Return the registered tool instance, or None if the name is unknown
        return self._tools.get(tool_name)

tool_registry = ToolRegistry()

# tools/web_search_tool.py
from .registry import tool_registry

@tool_registry.register("websearch")
class WebSearchTool:
    async def execute(self, query: str):
        # Logic to call a search API like DuckDuckGo
        ...

This approach gave us incredible flexibility: adding a new capability is just a matter of writing a class and registering it under a name, with no changes to the agents or the Executor.
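
For example, a second tool can be added without modifying the registry, the Executor, or any agent. The file-reading tool below is purely hypothetical, included only to illustrate the registration pattern:

# tools/file_reader_tool.py (hypothetical example, not part of the actual codebase)
from .registry import tool_registry

@tool_registry.register("filereader")
class FileReaderTool:
    async def execute(self, path: str) -> str:
        # Read a local file and hand its contents back to the agent
        with open(path, "r", encoding="utf-8") as f:
            return f.read()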

The First Tool: websearch – The Window to the World

The first and most important tool we implemented was websearch. This single instrument transformed our agents from "students in a library" to "field researchers."

When an agent needs to execute a task, the OpenAI SDK allows it to autonomously decide whether it needs a tool. If the agent "thinks" it needs to search the web, the SDK formats a tool execution request. Our Executor intercepts this request, calls our implementation of the WebSearchTool, and returns the result to the agent, which can then use it to complete its work.

Tool Execution Flow:

graph TD
    A[Agent receives Task] --> B{AI decides to use a tool}
    B --> C[SDK formats request for websearch]
    C --> D{Executor intercepts the request}
    D --> E[Calls tool_registry.get_tool websearch]
    E --> F[Executes the actual search]
    F --> G[Returns results to Executor]
    G --> H[SDK passes results to Agent]
    H --> I[Agent uses data to complete Task]
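
In code, the interception step is essentially a dispatch on the tool name. The sketch below is a deliberately simplified approximation of that step; the exact shape of the SDK's tool-call object is abstracted away here (the dictionary format is an assumption):

# Illustrative sketch of the Executor's dispatch step (SDK request/response types simplified)
import json

from tools.registry import tool_registry

async def handle_tool_call(tool_call: dict) -> str:
    # tool_call is assumed to carry the tool name and JSON-encoded arguments
    tool = tool_registry.get_tool(tool_call["name"])
    if tool is None:
        raise ValueError(f"Unknown tool: {tool_call['name']}")
    arguments = json.loads(tool_call.get("arguments", "{}"))
    result = await tool.execute(**arguments)
    # The serialized result is handed back to the SDK, which passes it to the agent
    return json.dumps(result, default=str)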

"War Story": The Test That Revealed AIs "Laziness"

We wrote a test to verify that the tools were working.

Reference code: tests/test_tools.py

The test was simple: give an agent a task that clearly required a web search (e.g., "Who is the current CEO of OpenAI?") and verify that the websearch tool was called.
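
In simplified form, the behavioral assertion looked roughly like the sketch below. The spy wiring and the run_agent_task helper are illustrative assumptions, not the real test code:

# tests/test_tools.py -- simplified sketch of the behavioral test (requires pytest-asyncio)
import pytest

from tools.registry import tool_registry

@pytest.mark.asyncio
async def test_agent_uses_websearch_for_current_facts(monkeypatch):
    calls = []
    websearch = tool_registry.get_tool("websearch")
    original_execute = websearch.execute

    async def spy_execute(query: str):
        calls.append(query)                    # record that the tool was actually invoked
        return await original_execute(query)   # then delegate to the real implementation

    monkeypatch.setattr(websearch, "execute", spy_execute)

    # run_agent_task is a hypothetical helper that executes a task end-to-end
    await run_agent_task("Who is the current CEO of OpenAI?")

    assert calls, "Web search tool was not called"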

The first results were disconcerting: the test failed 50% of the time.

Disaster Logbook (July 27):

ASSERTION FAILED: Web search tool was not called.
AI Response: "As of my last update in early 2023, the CEO of OpenAI was Sam Altman."

The Problem: The LLM was "lazy." Instead of admitting it didn't have updated information and using the tool we had provided, it preferred to give an answer based on its internal knowledge, even if obsolete. It was choosing the easy way out, at the expense of quality and truthfulness.

The Lesson Learned: You Must Force Tool Usage

It's not enough to give a tool to an agent. You must create an environment and instructions that incentivize (or force) it to use them.

The solution was a refinement of our prompt engineering:

  1. Explicit Instructions in the System Prompt: We added a phrase to each agent's system prompt: "When you need current or specific information that you don't have, you MUST use the appropriate tools available to you."
  2. "Priming" in the Task Prompt: When assigning a task, we started adding a hint: "This task requires up-to-date information. Use your tools to ensure accuracy." (A sketch of both changes follows this list.)

These changes increased tool usage from 50% to over 95%, solving the "laziness" problem and ensuring our agents actively sought real data.

๐Ÿ“ Key Takeaways of the Chapter:

✓ Agents Need Tools: An AI system without access to external tools is a limited system destined to become obsolete.

✓ Centralize Tools in a Registry: Don't tie tools to specific agents. A modular registry is more scalable and maintainable.

✓ AI Can Be "Lazy": Don't assume an agent will use the tools you provide. You must explicitly instruct and incentivize it to do so.

✓ Test Behavior, Not Just Output: Tool tests shouldn't just verify that the tool works, but that the agent decides to use it when strategically correct.

Chapter Conclusion

With the introduction of tools, our agents finally had a way to produce reality-based results. But this opened a new Pandora's box: quality.

Now that agents could produce data-rich content, how could we be sure this content was high quality, consistent, and, most importantly, of real business value? It was time to build our Quality Gate.

graph TD
    A[New Workspace Created] --> B{Semantic Goal Analysis}
    B --> C{Key Skills Extraction}
    C --> D{Necessary Roles Definition}
    D --> E{Complete Agent Profiles Generation}
    E --> F[Team Proposal]
    F --> G{Human/Automatic Approval}
    G --> H[Agent Creation in DB]

    subgraph P1 ["Phase 1: Strategic Analysis (AI)"]
        B1[The Director reads the workspace goal]
        C1[AI identifies necessary skills: email marketing, data analysis, copywriting]
        D1[AI groups skills into roles: Marketing Strategist, Data Analyst]
        B1 --> C1
        C1 --> D1
    end

    subgraph P2 ["Phase 2: Profile Creation (AI)"]
        E1[For each role, AI generates a complete profile: name, seniority, hard/soft skills, background]
    end

    subgraph P3 ["Phase 3: Finalization"]
        F1[The Director presents the proposed team with strategic justification]
        G1[User approves or system auto-approves]
        H1[Agents are saved to database and activated]
        F1 --> G1
        G1 --> H1
    end

System Architecture

The Heart of the System: The AI Recruiter Prompt

To realize this vision, the Director's prompt had to be incredibly detailed.

Reference code: backend/director.py (_generate_team_proposal_with_ai logic)

prompt = f"""
You are a Director of a world-class AI talent agency. Your task is to analyze a new project's objective and assemble the perfect AI agent team to ensure its success, treating each agent as a human professional.

**Project Objective:**
"{workspace_goal}"

**Available Budget:** {budget} EUR
**Expected Timeline:** {timeline}

**Required Analysis:**
1.  **Functional Decomposition:** Break down the objective into its main functional areas (e.g., "Data Research", "Creative Writing", "Technical Analysis", "Project Management").
2.  **Role-Skills Mapping:** For each functional area, define the necessary specialized role and the 3-5 essential key competencies (hard skills).
3.  **Soft Skills Definition:** For each role, identify 2-3 crucial soft skills (e.g., "Problem Solving" for an analyst, "Empathy" for a designer).
4.  **Optimal Team Composition:** Assemble a team of 3-5 agents, balancing skills to cover all areas without unnecessary overlaps. Assign seniority (Junior, Mid, Senior) to each role based on complexity.
5.  **Budget Optimization:** Ensure the total estimated team cost doesn't exceed the budget. Prioritize efficiency: a smaller, senior team is often better than a large, junior one.
6.  **Complete Profile Generation:** For each agent, create a realistic name, personality, and brief background story that justifies their competencies.

**Output Format (JSON only):**
{{
  "team_proposal": [
    {{
      "name": "Agent Name",
      "role": "Specialized Role",
      "seniority": "Senior",
      "hard_skills": ["skill 1", "skill 2"],
      "soft_skills": ["skill 1", "skill 2"],
      "personality": "Pragmatic and data-driven.",
      "background_story": "A brief story that contextualizes their competencies.",
      "estimated_cost_eur": 5000
    }}
  ],
  "total_estimated_cost": 15000,
  "strategic_reasoning": "The logic behind this team's composition..."
}}
"""

"War Story": The Agent Who Wanted to Hire Everyone

The first tests were a comic disaster. For a simple project to "write 5 emails", the Director proposed a team of 8 people, including an "AI Ethicist" and a "Digital Anthropologist". It had interpreted our desire for quality too literally, creating perfect but economically unsustainable teams.

Disaster Logbook (July 27):

PROPOSAL: Team of 8 agents. Estimated cost: €25,000. Budget: €5,000.
REASONING: "To ensure maximum ethical and cultural quality..."

The Lesson Learned: Autonomy Needs Clear Constraints.

An AI without constraints will tend to "over-optimize" the request. We learned that we needed to be explicit about constraints, not just objectives. The solution was to add two critical elements to the prompt and logic:

  1. Explicit Constraints in the Prompt: We added the Available Budget and Expected Timeline sections.
  2. Post-Generation Validation: Our code performs a final check: if proposal.total_cost > budget: raise ValueError("Proposal over budget."). A sketch of this check follows the list.
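
A minimal sketch of that final check, assuming the parsed proposal exposes a total cost; the exact attribute names in our code may differ:

# Illustrative sketch of the post-generation budget guard (names are assumptions)
def validate_proposal_budget(total_estimated_cost: float, budget: float) -> None:
    # AI handles the creative part; plain code enforces the non-negotiable rule
    if total_estimated_cost > budget:
        raise ValueError(
            f"Proposal over budget: estimated {total_estimated_cost} EUR, "
            f"but only {budget} EUR available."
        )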

This experience reinforced Pillar #5 (Goal-Driven with Automatic Tracking). An objective is not just a "what", but also a "how much" (budget) and a "when" (timeline).

๐Ÿ“ Chapter Key Takeaways:

✓ Treat Agents as Colleagues: Design your agents with rich profiles (hard/soft skills, personality). This improves task matching and makes the system more intuitive.

✓ Delegate Team Composition to AI: Don't hard-code roles. Let AI analyze the project and propose the most suitable team.

✓ Autonomy Requires Constraints: To get realistic results, you must provide AI not only with objectives, but also constraints (budget, time, resources).

✓ Use AI for Creativity, Code for Rules: AI is excellent at generating creative profiles. Code is perfect for applying rigid, non-negotiable rules (like budget compliance).

Chapter Conclusion

With the Director, our system had reached a new level of autonomy. Now it could not only execute a plan, but also create the right team to execute it. We had a system that dynamically adapted to the nature of each new project.

But a team, however well composed, needs tools to work with. Our next challenge was understanding how to provide agents with the right "tools" for each trade, anchoring their intellectual capabilities to concrete actions in the real world.
