We had a well-defined SpecialistAgent, a clean architecture, and robust data contracts. We were ready to build the rest of the system. But we immediately ran into a problem as trivial as it was blocking: how do you test a system whose heart is a call to an external service that is as unpredictable and expensive as an LLM?
Every execution of our integration tests would involve:
- Monetary Costs: Real calls to OpenAI APIs.
- Slowness: Waiting seconds, sometimes minutes, for a response.
- Non-Determinism: The same input could produce slightly different outputs, making tests unreliable.
AI Provider Abstraction Layer
"War Story": The Commit That Saved the Budget (and the Project)
Evidence from Git Log: f7627da (Fix stubs and imports for tests)
This seemingly innocent change was one of the most important of the initial phase. Before this commit, our first integration tests, running in a CI environment, made real calls to OpenAI APIs.
On the first day, we consumed over $44 USD of our self-funded daily budget of $110 USD, simply because every push to a branch triggered a series of tests that called gpt-4 dozens of times.
The Financial Context: AI in a Self-Funded Learning Project
This wasn't just a technical concern. As a self-funded personal learning project, we had set a maximum API budget of $110 USD per day. As Tunguz's analysis highlights, AI is rapidly becoming one of the main R&D expense items, potentially reaching 10-15% of the total budget.
The lesson was brutal but fundamental: An AI system that cannot be tested economically and reliably is a system that cannot be developed sustainably.
Discovering OpenAI Tiers: Strategic Budget Planning
During the budget management challenge, we got to know the OpenAI usage tier system intimately:
- Tier 1 (after $5 spent): $100/month limit - perfect for prototyping
- Tier 2 ($50 spent + 7 days): $500/month - serious development
- Tier 3 ($100 spent + 7 days): $1,000/month - development team
- Tier 4 ($250 spent + 14 days): $5,000/month - SME production
- Tier 5 ($1,000 spent + 30 days): $200,000/month - enterprise scale
Our strategy: 95% mock tests for rapid and economical development, 5% real tests for final validation.
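Here is a minimal sketch of how such a split can be enforced with pytest (the `real_api` marker and the `RUN_REAL_API_TESTS` environment variable are illustrative names, not the project's actual configuration): tests that hit the live API are skipped by default and only run when explicitly enabled.

```python
# conftest.py -- illustrative sketch of the 95% mock / 5% real split
import os
import pytest

def pytest_configure(config):
    # Register the custom marker so pytest does not warn about it
    config.addinivalue_line("markers", "real_api: test that calls the live OpenAI API")

def pytest_collection_modifyitems(config, items):
    if os.getenv("RUN_REAL_API_TESTS") == "1":
        return  # the expensive 5%: enabled explicitly, e.g. before a release
    skip_real = pytest.mark.skip(reason="set RUN_REAL_API_TESTS=1 to run live API tests")
    for item in items:
        if "real_api" in item.keywords:
            item.add_marker(skip_real)
```

A test against the real model is then tagged with `@pytest.mark.real_api` and stays out of the default, mock-only run that every push triggers.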
The CLI Coding Effect: When Tests Multiply
Ironically, just as we managed to contain API costs with mocks, a new challenge emerged: the advent of AI-assisted coding CLIs. Tools like Claude Code, GitHub Copilot CLI, and Cursor revolutionized how we write code.
Where we used to manually write 10 tests per component, now with AI assistance we easily generate 100+ in minutes. The paradox: while the cost per test drops, the total volume of tests grows exponentially.
Implementing the AI Provider Abstraction Layer wasn't just a best practice; it was an economic survival decision (see the sketch after this list):
- Free: 99% of tests now run without API costs
- Fast: From 10 minutes to 30 seconds for the complete suite
- Reliable: Deterministic and repeatable tests
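To make the layer concrete, here is a minimal sketch of the shape it can take (class names like `AIProvider`, `OpenAIProvider`, and `MockAIProvider` are illustrative, not the project's actual identifiers): the agents depend only on a narrow interface, the production implementation is the single place that touches the OpenAI SDK, and tests inject a canned, deterministic stand-in.

```python
from typing import Protocol

class AIProvider(Protocol):
    """The narrow interface agents depend on, instead of the OpenAI SDK directly."""
    def complete(self, prompt: str) -> str: ...

class OpenAIProvider:
    """Production implementation: the only code that talks to the real API."""
    def __init__(self, client, model: str = "gpt-4"):
        self._client = client          # an openai.OpenAI() instance, injected
        self._model = model

    def complete(self, prompt: str) -> str:
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

class MockAIProvider:
    """Test implementation: free, instant, and deterministic."""
    def __init__(self, canned_response: str = "stubbed answer"):
        self.canned_response = canned_response
        self.prompts: list[str] = []   # lets tests assert on what was sent

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned_response
```

An agent built against `AIProvider` can be exercised with `MockAIProvider` in the vast majority of the suite, while a handful of final-validation tests wire in `OpenAIProvider`.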
End of the Third Movement
Isolating intelligence was the step that allowed us to transition from "experimenting with AI" to "doing software engineering with AI". It gave us the confidence and tools to build the rest of the architecture on solid and testable foundations.
With a robust single agent and reliable testing environment, we were finally ready to tackle the next challenge: making multiple agents collaborate. This led us to create the Orchestra Director, the beating heart of our AI team.