For the last 2 years I have built experiences with features that used AI agents (mostly using LangGraph), but I have learned more in the last 2 months embedded in the team migrating a digital support assistant from the pre-LLM AI world to a new agents-first architecture. And while the goal is to be a completely agentic system someday, building a hybrid system in which some subsystems leverage agents while others don’t has been an interesting challenge.
Choosing an agentic architecture
There is a good post on LangChain’s blog on architectures for multiple agents that explains the details, but generally the choice comes down to:
- A centralized “main” agent coordinating subagents (Fat Main Agent)
- A thin main agent essentially acting as a router to domain specific agents (Fat Subagents)
A big part of the decision comes down to how much of the experience you manage and how much of it is black boxes you transition to. Fat Subagents work if different teams manage different agents and agent-to-agent transitions are infrequent. In this model, agent-to-agent transfer also risks context loss, as subagents may only see part of the conversation. But, as with microservices, Fat Subagents solve an organizational challenge, not an experiential one. This is also a good option if you are a believer in the vision of A2A experiences, but really, we are a ways off from getting there.
In general, I would recommend the Fat Main Agent architecture unless you have a strong reason not to use it. Having a central planner and conversation manager that leverages subagents mostly as tools is a good pattern.
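To make the Fat Main Agent pattern concrete, here is a minimal sketch in plain Python. All names (`MainAgent`, `billing_agent`, `troubleshooting_agent`) are hypothetical, and the keyword rule in `plan` is only a stand-in for an LLM planning step; this is not how any particular framework implements it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical subagents, modeled as plain callables the main agent uses as tools.
def billing_agent(query: str) -> str:
    return f"[billing] handled: {query}"


def troubleshooting_agent(query: str) -> str:
    return f"[troubleshooting] handled: {query}"


@dataclass
class MainAgent:
    """Owns the full conversation history and decides which subagent to call."""

    tools: Dict[str, Callable[[str], str]]
    history: List[str] = field(default_factory=list)

    def plan(self, message: str) -> str:
        # Stand-in for an LLM planning step: a simple keyword rule picks the tool.
        return "billing" if "invoice" in message.lower() else "troubleshooting"

    def handle(self, message: str) -> str:
        self.history.append(f"user: {message}")
        tool_name = self.plan(message)
        reply = self.tools[tool_name](message)
        # The main agent records every turn, so no context is lost between tools.
        self.history.append(f"{tool_name}: {reply}")
        return reply


agent = MainAgent(
    tools={"billing": billing_agent, "troubleshooting": troubleshooting_agent}
)
print(agent.handle("My invoice looks wrong"))
print(agent.handle("The router keeps rebooting"))
```

The point of the sketch is that the conversation history lives in one place, while the subagents stay stateless helpers.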
A caveat on Fat Agents: In internal debates, there was a concern that creating an agent prompt with all the business logic for all behavior would confuse the agent, but that is not how these agents work. Instead, agents load skills on demand, which can be pretty efficient, especially if your agents generally only engage in one or two domains in a conversation.
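A rough sketch of what on-demand skill loading could look like, assuming skills are stored as named instruction snippets. The `SKILLS` dictionary, `BASE_PROMPT`, and `build_prompt` are hypothetical names for illustration, not the API of any specific agent framework.

```python
from typing import Dict, List

# Hypothetical skill library: name -> instructions that are loaded only when needed.
SKILLS: Dict[str, str] = {
    "billing": "Explain charges line by line. Never promise refunds.",
    "troubleshooting": "Ask for the device model before suggesting steps.",
    "returns": "Check the purchase date against the return window.",
}

BASE_PROMPT = "You are a support assistant. Load a skill before answering domain questions."


def build_prompt(active_skills: List[str]) -> str:
    """Compose the base prompt with only the skills this conversation has used."""
    sections = [BASE_PROMPT]
    for name in active_skills:
        sections.append(f"Skill: {name}\n{SKILLS[name]}")
    return "\n\n".join(sections)


# A conversation that only touched billing carries only the billing instructions.
prompt = build_prompt(["billing"])
print(prompt)
```

The prompt grows with the domains a conversation actually touches, rather than with the total number of domains the agent could handle.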
Shallow Agent Hierarchies
One of the challenges in writing software is deciding the granularity of each component that is part of a functional whole. For example, in Object Oriented Programming, you can have many small classes or a few giant ones, and both are anti-patterns. How much a component should do is a matter of good judgement.
In our system, we added way too many subagents, each with a very narrow set of responsibilities. As each agent transitioned control to the next one during a conversation turn, we paid the price of those transitions in latency and context loss.
In general, avoid architectures with deep agent-to-agent chains.
Prompt Engineering
While creating a prompt for an agent feels trivial, it can be a rabbit hole when you see bizarre responses in testing and try to figure out what happened. One of the funniest hallucinations we saw was the agent suddenly acting like the customer – talking about its kids and their day. An internal bug had basically reset it to its default behavior, that of a sentence completion agent, and it was helpfully extending the user’s incoming message.
Another thing I do a lot: copy the prompt and the unexpected output to another agent like Claude or ChatGPT and ask it why the agent responded the way it did. It turns out agents are pretty good at hardening prompt files.
Another interesting learning: it turns out agents are better behaved when prompts are structured as XML instead of raw text (I hadn’t realized that Anthropic recommends it as a best practice). Explicitly marking tools and behavior in labeled nodes keeps the agent better in check.
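As an illustration of labeled nodes, here is a small sketch that assembles a prompt with separate XML sections for role, tools, and behavior. The tag names (`agent`, `role`, `tools`, `behavior`) and the sample tools and rules are made up for this example; pick whatever labels fit your agent.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from xml.sax.saxutils import escape, quoteattr


def build_xml_prompt(role: str, tools: Dict[str, str], rules: List[str]) -> str:
    """Render role, tools, and behavior rules as labeled XML nodes."""
    tool_nodes = "\n".join(
        f"    <tool name={quoteattr(name)}>{escape(desc)}</tool>"
        for name, desc in tools.items()
    )
    rule_nodes = "\n".join(f"    <rule>{escape(rule)}</rule>" for rule in rules)
    return (
        "<agent>\n"
        f"  <role>{escape(role)}</role>\n"
        f"  <tools>\n{tool_nodes}\n  </tools>\n"
        f"  <behavior>\n{rule_nodes}\n  </behavior>\n"
        "</agent>"
    )


prompt = build_xml_prompt(
    role="Digital support assistant",
    tools={
        "lookup_order": "Fetch the status of an order by ID.",
        "open_ticket": "Create a ticket for human follow-up.",
    },
    rules=[
        "Only call tools listed in <tools>.",
        "Escalate if the wait time is > 10 minutes.",
    ],
)
print(prompt)

# Parsing the result confirms the prompt is well-formed XML.
root = ET.fromstring(prompt)
```

Escaping the text matters here: a stray `<` or `&` in a rule would otherwise blur the boundaries between sections.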
I need to look more into prompt-versioning, something we don’t really do right now as an independent thing. Each iteration of the prompt is considered a full agent change and we redeploy the entire agent when prompts change.
Keep Subagents independent
Keep all your routing behavior in your main agent (regardless of fat or thin). Domain-specific subagents should generally not call each other, but rather communicate to the central agent when they want to declare that they cannot respond usefully to a user query. For example, a technical troubleshooting agent should never invoke a human-customer-support agent – it should just let the main agent know the results of its processes and let the main agent decide, based on policy, if the human-customer-support agent needs to be invoked.
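The troubleshooting example might be sketched as follows. The `SubagentResult` type, the status values, and the `allow_human_handoff` policy flag are assumptions made for illustration; the idea is only that the subagent reports an outcome and the main agent owns the escalation decision.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RESOLVED = "resolved"
    CANNOT_HELP = "cannot_help"


@dataclass
class SubagentResult:
    status: Status
    summary: str


def troubleshooting_agent(query: str) -> SubagentResult:
    # Hypothetical subagent: it reports its outcome and never calls another agent.
    if "reboot" in query.lower():
        return SubagentResult(Status.RESOLVED, "Suggested a firmware update.")
    return SubagentResult(Status.CANNOT_HELP, "No known fix for this symptom.")


def human_support_agent(query: str) -> SubagentResult:
    return SubagentResult(Status.RESOLVED, "Ticket opened with a human agent.")


def main_agent(query: str, allow_human_handoff: bool = True) -> str:
    """All routing lives here; the policy flag stands in for real business rules."""
    result = troubleshooting_agent(query)
    if result.status is Status.CANNOT_HELP and allow_human_handoff:
        result = human_support_agent(query)
    return result.summary


print(main_agent("Router is stuck in a reboot loop"))
print(main_agent("Screen has turned purple"))
```

Because the handoff rule sits in one function, changing the escalation policy does not require touching any subagent.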
Memory is a non-trivial detail
Be strategic about what goes into long-term memory and what should stay at the session level. This also gets complicated when, as in our world, the subagents are built on different platforms and sharing session data does not work out of the box.
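One way to picture the split is a memory object with two stores and an explicit rule for what gets promoted to long-term storage. This is a simplified in-process sketch; the `Memory` class and the `LONG_TERM_KEYS` allowlist are hypothetical, and a real system would back these with persistent stores shared across platforms.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical policy: only these keys are worth keeping across sessions.
LONG_TERM_KEYS = {"preferred_language", "device_model"}


@dataclass
class Memory:
    long_term: Dict[str, str] = field(default_factory=dict)  # survives sessions
    session: Dict[str, str] = field(default_factory=dict)  # cleared per session

    def remember(self, key: str, value: str) -> None:
        # Route each fact to the store that matches the policy above.
        target = self.long_term if key in LONG_TERM_KEYS else self.session
        target[key] = value

    def end_session(self) -> None:
        self.session.clear()

    def context(self) -> Dict[str, str]:
        # Session facts override long-term ones when both define the same key.
        return {**self.long_term, **self.session}


memory = Memory()
memory.remember("device_model", "X200")
memory.remember("current_issue", "no wifi")
memory.end_session()
print(memory.context())
```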
Also, a great video on options for memory systems:
Evals and Mockable Tools
As we move towards probabilistic architectures, human QA might catch less and less of your system’s errors. Shipping without agent-level and system-level evals is just asking for trouble. And since so much of your agent’s behavior lies in how it leverages the tools available to it, being able to mock those tools is an essential part of the process. Agents are unpredictable enough, and trying to reason about their behavior when the input is also unpredictable is painful.
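A minimal sketch of tool mocking through dependency injection, assuming the agent receives its tools as a dictionary of callables. `OrderAgent`, `lookup_order_status`, and `mock_lookup` are hypothetical names; most agent frameworks have their own way to register tools, which this does not try to reproduce.

```python
from typing import Callable, Dict


def lookup_order_status(order_id: str) -> str:
    """Real tool (hypothetical): would call an order service over the network."""
    raise RuntimeError("network access is not available in tests")


class OrderAgent:
    """Receives its tools at construction time so tests can swap them out."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools

    def answer(self, order_id: str) -> str:
        status = self.tools["lookup_order_status"](order_id)
        return f"Order {order_id} is {status}."


def mock_lookup(order_id: str) -> str:
    # Deterministic stand-in: the eval controls exactly what the tool returns.
    return {"A1": "shipped"}.get(order_id, "not found")


agent = OrderAgent(tools={"lookup_order_status": mock_lookup})
print(agent.answer("A1"))
```

With the tool output pinned down, any surprising answer can be attributed to the agent rather than to the data it was given.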
For conversational agents, evals can also be used to simulate multi-turn conversations, which most QA scripts tend not to do. Our QA scripts are often structured in “if this then that” patterns, whereas evals simulate real users trying to achieve a goal, and the LLM evaluator may run many turns of a conversation (usually configurable) to try to achieve that goal. The conversation threads of those evals often tend to be very insightful.
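The shape of such a multi-turn eval loop can be sketched like this. Here both the agent and the simulated user are scripted stand-ins (`fake_agent`, `fake_user`) so the example runs on its own; in a real eval the user side would be an LLM pursuing a goal, and `run_eval`, `EvalResult`, and `max_turns` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalResult:
    goal_reached: bool
    turns_used: int
    transcript: List[str]


def run_eval(
    agent: Callable[[str], str],
    user: Callable[[Optional[str]], str],
    goal_check: Callable[[str], bool],
    max_turns: int = 5,
) -> EvalResult:
    """Alternate user and agent turns until the goal is met or turns run out."""
    transcript: List[str] = []
    agent_reply: Optional[str] = None
    for turn in range(1, max_turns + 1):
        user_msg = user(agent_reply)
        transcript.append(f"user: {user_msg}")
        agent_reply = agent(user_msg)
        transcript.append(f"agent: {agent_reply}")
        if goal_check(agent_reply):
            return EvalResult(True, turn, transcript)
    return EvalResult(False, max_turns, transcript)


def fake_agent(message: str) -> str:
    # Scripted stand-in for the agent under test.
    if "order" in message.lower() and "a1" in message.lower():
        return "Your refund for order A1 has been issued."
    return "Which order is this about?"


def fake_user(last_reply: Optional[str]) -> str:
    # Scripted stand-in for an LLM-driven simulated user seeking a refund.
    if last_reply is None:
        return "I want a refund."
    return "It is order A1."


result = run_eval(fake_agent, fake_user, lambda reply: "issued" in reply.lower())
print("\n".join(result.transcript))
```

Keeping the full transcript in the result is deliberate, since reading those threads is where much of the insight comes from.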
Observability
Agents will go wrong in a variety of ways in production, and having tools to reconstruct what happened is essential. Agents built on frameworks like ADK are usually deployed as FastAPI applications. And while traditional observability tools for web applications can be used, a whole new set of tools dedicated to agentic systems is emerging. Tools like Opik, Braintrust or LangSmith, or managed platforms like Google’s Agent Studio, not only let you view your application’s behavior in production but also close the loop by integrating evals and prompt versioning, allowing you to go from any observed abnormal behavior to a fix in a short time.
The challenges of hybrid systems
While a lot has been written on developing agentic systems in greenfield scenarios, evolving a non-agentic system into an agentic one has come with a lot of challenges. The crux of it comes down to underestimating the complexity of adapting tools that were meant to be driven by humans to being driven by an agentic interface sitting between the human and the non-agentic subsystem. Consolidating observability across two very different systems has been a particular challenge, and reconstructing a user journey across boundaries has been difficult. The user experience can also be jarring as you go from a conversational system to a non-conversational one in one session. The goal now is to accelerate the migration of the rest of the system to the new architecture.
Your mileage may vary based on how your specific non-agentic systems operate.







