Building Agentic Systems

For the last 2 years I have built experiences with features that used AI agents (mostly using LangGraph), but I have learned more in the last 2 months embedded in a team migrating a digital support assistant from the pre-LLM AI world to a new agents-first architecture. And while the goal is to be a completely agentic system someday, building a hybrid system in which some subsystems leverage agents while others don’t has been an interesting challenge.

Choosing an agentic architecture

There is a good post on LangChain’s blog on architectures for multiple agents that explains the details, but generally the choice comes down to:

  1. A centralized “main” agent coordinating subagents (Fat Main Agent)
  2. A thin main agent essentially acting as a router to domain specific agents (Fat Subagents)

A big part of the decision comes down to how much of the experience you manage yourself and how much is made up of black boxes you transition to. Fat Subagents work if different teams manage different agents and agent-to-agent transitions are infrequent. In this model, agent-to-agent transfer also risks context loss, as subagents may only see part of the conversation. But, as with microservices, Fat Subagents solve an organizational challenge, not an experiential one. This is also a good option if you are a believer in the vision of A2A experiences, but really, we are a ways off from getting there.

In general, I would recommend the Fat Main Agent architecture unless you have a strong reason not to use it. Having a central planner and conversation manager that leverages subagents mostly as tools is a good pattern.
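To make the Fat Main Agent idea concrete, here is a minimal, framework-free sketch. Everything in it is an assumption for illustration: the names MainAgent, billing_agent and troubleshooting_agent are hypothetical, and the keyword check in plan() is only a stand-in for what would be an LLM planning call in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical subagents, modeled as plain callables that the main agent uses as tools.
def billing_agent(query: str) -> str:
    return f"[billing] handled: {query}"


def troubleshooting_agent(query: str) -> str:
    return f"[troubleshooting] handled: {query}"


@dataclass
class MainAgent:
    """Central planner that owns the conversation and calls subagents as tools."""

    tools: Dict[str, Callable[[str], str]]
    history: List[str] = field(default_factory=list)

    def plan(self, query: str) -> str:
        # Stand-in for an LLM planning step (assumption): simple keyword routing.
        return "billing" if "invoice" in query.lower() else "troubleshooting"

    def handle(self, query: str) -> str:
        # The main agent keeps the full conversation, so no context is lost
        # when a subagent is invoked.
        self.history.append(f"user: {query}")
        tool_name = self.plan(query)
        result = self.tools[tool_name](query)
        self.history.append(f"{tool_name}: {result}")
        return result


agent = MainAgent(tools={"billing": billing_agent,
                         "troubleshooting": troubleshooting_agent})
reply = agent.handle("Where is my invoice?")
```

The point of the shape is that the conversation history lives in one place, and the subagents never need to know about each other.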

A caveat on Fat Agents: In internal debates, there was a concern that creating an agent prompt loaded with all the business logic for all behavior would confuse the agent, but that is not how these agents work. Instead, agents load skills on demand, which can be pretty efficient, especially if your agents generally only engage in one or two domains in a conversation.
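On-demand skill loading can be sketched roughly like this. It is not how any particular framework implements skills; SkillfulAgent, SKILLS and the skill texts are hypothetical, and the only idea being shown is that the prompt contains just the base instructions plus whichever skills the conversation has needed so far.

```python
from typing import Dict, List

BASE_PROMPT = "You are a support assistant."

# Hypothetical skill library: each entry is domain logic kept out of the base prompt.
SKILLS: Dict[str, str] = {
    "billing": "Billing skill: explain charges and refund eligibility.",
    "troubleshooting": "Troubleshooting skill: walk through device resets.",
    "shipping": "Shipping skill: look up delivery estimates.",
}


class SkillfulAgent:
    def __init__(self, base_prompt: str, skills: Dict[str, str]) -> None:
        self.base_prompt = base_prompt
        self.skills = skills
        self.loaded: List[str] = []  # skills pulled in during this conversation

    def load_skill(self, name: str) -> None:
        # Load a skill at most once, and only if it exists.
        if name in self.skills and name not in self.loaded:
            self.loaded.append(name)

    def build_prompt(self) -> str:
        # The prompt grows only with the domains this conversation touched.
        parts = [self.base_prompt] + [self.skills[n] for n in self.loaded]
        return "\n\n".join(parts)


agent = SkillfulAgent(BASE_PROMPT, SKILLS)
agent.load_skill("billing")
prompt = agent.build_prompt()
```

A conversation that only ever touches billing never pays for the troubleshooting or shipping instructions.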

Shallow Agent Hierarchies

One of the challenges in writing software is deciding the granularity of each component that is part of a functional whole. For example, in Object Oriented Programming, you can have many small classes or a few giant ones, and both are anti-patterns. How much a component should do is a matter of good judgement.

In our system, we added way too many subagents, each with a very narrow set of responsibilities. But as each agent transitioned control to the next one during a conversation turn, we paid the price of those transitions in latency and context loss.

In general, avoid architectures with deep agent-to-agent chains.

Prompt Engineering

While creating a prompt for an agent feels trivial, it can be a rabbit hole when you see bizarre responses in testing and try to figure out what happened. One of the funniest hallucinations we saw was the agent suddenly acting like the customer, talking about its kids and their day. An internal bug had basically reset it to its default behavior, that of a sentence completion agent, and it was helpfully extending the user’s incoming message.

Another thing I do a lot: copy the prompt and the unexpected output into another agent like Claude or ChatGPT and ask it why the agent responded the way it did. Turns out agents are pretty good at hardening prompt files.

Another interesting learning: it turns out agents are better behaved when prompts are structured as XML instead of raw text (I hadn’t realized that Anthropic recommends it as a best practice). Explicitly marking tools and behavior in labeled nodes keeps the agent better in check.
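As an illustration of what an XML-structured prompt could look like, here is a small sketch that builds one with Python's standard xml.etree.ElementTree. The tag names (agent, role, tools, tool, behavior, rule) and the example tool are my own assumptions, not a schema from Anthropic or from our system; the takeaway is only that tools and behavior sit in clearly labeled nodes.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_prompt(role: str, tools: Dict[str, str], rules: List[str]) -> str:
    """Build an XML-structured system prompt (tag names are illustrative)."""
    root = ET.Element("agent")
    ET.SubElement(root, "role").text = role

    # Each tool gets its own labeled node instead of being buried in prose.
    tools_el = ET.SubElement(root, "tools")
    for name, description in tools.items():
        tool_el = ET.SubElement(tools_el, "tool", name=name)
        tool_el.text = description

    # Behavioral rules are listed as separate nodes as well.
    behavior_el = ET.SubElement(root, "behavior")
    for rule in rules:
        ET.SubElement(behavior_el, "rule").text = rule

    return ET.tostring(root, encoding="unicode")


prompt = build_prompt(
    role="Digital support assistant",
    tools={"lookup_order": "Returns the status of an order by id."},
    rules=["Never invent order details.", "Ask for the order id if missing."],
)
```

Building the prompt through ElementTree rather than string concatenation also means special characters in tool descriptions are escaped for you.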

I need to look more into prompt-versioning, something we don’t really do right now as an independent thing. Each iteration of the prompt is considered a full agent change and we redeploy the entire agent when prompts change.

Keep Subagents independent

Keep all your routing behavior in your main agent (regardless of fat or thin). Domain-specific subagents should generally not call each other, but rather communicate to the central agent when they want to declare that they cannot respond usefully to a user query. For example, a technical troubleshooting agent should never invoke a human-customer-support agent – it should just let the main agent know the results of its processes and let the main agent decide, based on policy, if the human-customer-support agent needs to be invoked.
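A rough sketch of that contract, with hypothetical names throughout (Status, SubagentResult, troubleshooting_agent, main_agent): the subagent only reports what happened, and the escalation decision stays with the main agent. The escalation_allowed flag is a stand-in for whatever policy check a real system would use.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    RESOLVED = "resolved"
    CANNOT_HELP = "cannot_help"


@dataclass
class SubagentResult:
    status: Status
    summary: str


def troubleshooting_agent(query: str) -> SubagentResult:
    # Hypothetical domain logic: this toy agent only knows router problems.
    # Note that it never calls the human-support agent itself.
    if "router" in query.lower():
        return SubagentResult(Status.RESOLVED, "Walked user through a router reset.")
    return SubagentResult(Status.CANNOT_HELP, "No matching troubleshooting flow.")


def main_agent(query: str, escalation_allowed: bool = True) -> str:
    result = troubleshooting_agent(query)
    if result.status is Status.RESOLVED:
        return result.summary
    # Routing and policy live here, not in the subagent.
    if escalation_allowed:
        return "Escalating to human support: " + result.summary
    return "Sorry, I could not resolve this."


resolved = main_agent("My router keeps dropping")
escalated = main_agent("My TV shows a blank screen")
```

Because the subagent returns a result instead of making a call, the escalation policy can change without touching any domain agent.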

Memory is a non-trivial detail

Be strategic about what goes into long-term memory and what should stay in session-level state. This gets complicated when, as in our world, the subagents are often built on different platforms and sharing session data does not work out of the box.

Also, a great video on options for memory systems:

Evals and Mockable Tools

As we move towards probabilistic architectures, human QA might catch less and less of your system’s errors. Shipping without agent-level and system-level evals is just asking for trouble. And since so much of your agent’s behavior lies in how it leverages the tools available to it, being able to mock those tools is an essential part of the process. Agents are unpredictable enough, and trying to reason about their behavior when the input is also not predictable is painful.
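One simple way to keep tools mockable is to pass them into the agent rather than importing them directly. The sketch below assumes a hypothetical order_status_agent and a hypothetical lookup_order tool, and uses unittest.mock.Mock from the standard library to pin the tool's output so the agent's behavior can be checked against a known input.

```python
from typing import Callable, Dict
from unittest.mock import Mock


def order_status_agent(order_id: str, tools: Dict[str, Callable[[str], str]]) -> str:
    # Stand-in for agent logic (assumption): a real agent would let the LLM
    # decide when to call the tool; here the call is hard-coded.
    status = tools["lookup_order"](order_id)
    if status == "delayed":
        return f"Order {order_id} is delayed; offering a callback."
    return f"Order {order_id} is {status}."


# In a test or eval run, the real tool is swapped for a mock with a fixed answer.
mock_lookup = Mock(return_value="delayed")
reply = order_status_agent("A123", {"lookup_order": mock_lookup})

# The mock also records how the agent used the tool.
mock_lookup.assert_called_once_with("A123")
```

With the tool output fixed, any variation in the reply comes from the agent itself, which is the part you are trying to evaluate.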

For conversational agents, evals can also be used to simulate multi-turn conversations, which most QA scripts tend not to do. Our QA scripts are often structured in “if this then that” patterns; evals, however, simulate real users trying to achieve a goal, and the LLM evaluator may run many turns of a conversation (usually configurable) to try to achieve that goal. The conversation threads of those evals often tend to be very insightful.
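The loop behind a multi-turn eval looks roughly like the sketch below. In a real eval the simulated user and the goal check would be LLM-driven; here both are replaced with hypothetical scripted stand-ins (scripted_user, toy_agent, and a substring check) so that the control flow is easy to follow and deterministic.

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]


def run_simulation(
    agent: Callable[[str], str],
    user: Callable[[Optional[str]], str],
    goal_reached: Callable[[str], bool],
    max_turns: int = 5,
) -> Tuple[bool, Transcript]:
    """Run a user/agent conversation until the goal is met or turns run out."""
    transcript: Transcript = []
    message = user(None)  # opening message, before any agent reply exists
    for _ in range(max_turns):
        reply = agent(message)
        transcript.append((message, reply))
        if goal_reached(reply):
            return True, transcript
        message = user(reply)
    return False, transcript


def scripted_user(script: List[str]) -> Callable[[Optional[str]], str]:
    # Stand-in for an LLM-simulated user: replays a fixed script.
    lines = iter(script)
    return lambda _last_reply: next(lines, "I still need help.")


def toy_agent(message: str) -> str:
    # Stand-in for the agent under test.
    if "order number" in message.lower():
        return "Thanks, your refund has been issued."
    return "Can you share your order number?"


success, transcript = run_simulation(
    toy_agent,
    scripted_user(["I want a refund", "My order number is 42"]),
    goal_reached=lambda reply: "refund has been issued" in reply,
)
```

The returned transcript is the part worth reading afterwards, since it shows how the conversation got to (or failed to get to) the goal.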

Observability

Agents will go wrong in a variety of ways in production, and having tools to reconstruct what happened is essential. Agents built on frameworks like ADK are usually deployed as FastAPI applications. And while traditional observability tools for web applications can be used, a whole new set of tools dedicated to agentic systems is emerging. Tools like Opik, Braintrust or LangSmith, or managed platforms like Google’s Agent Studio, not only let you view your application’s behavior in production but also close the loop by integrating evals and prompt versioning, allowing you to go from any observed abnormal behavior to a fix in a short time.

The challenges of hybrid systems

While a lot has been written on developing agentic systems in greenfield scenarios, evolving a non-agentic system into an agentic one has come with a lot of challenges. The crux of it comes down to underestimating the complexity of adapting tools that were meant to be driven by humans to being driven by an agentic interface sitting between the human and the non-agentic subsystem. Consolidating observability across two very different systems has been a particular challenge, and reconstructing a user journey across boundaries has been difficult. User experience can also be jarring as you go from a conversational system to a non-conversational one in one session. The goal now is to accelerate the migration of the rest of the system to the new architecture.

Your mileage may vary based on how your specific non-agentic systems operate.

Categorizing Google’s “101 real-world Gen AI use-cases”

Google Cloud recently published a post titled “101 real-world gen AI use cases from the world’s leading organizations”. As with any list that long, my eyes started to glaze over by the time I reached the 10th bullet point. So I asked an LLM (well, ChatGPT, sorry Google Gemini) to categorize the list by grouping them into the themes that are in play. This was a lot more useful:

  1. Customer Service & Support:
    • ADT: Customer agent for home security setup.
  2. Travel & Hospitality:
    • Alaska Airlines: Personalized travel search experience.
    • IHG Hotels & Resorts: Generative AI chatbot in their app.
    • The Minnesota Division of Driver and Vehicle Services: Translation for non-English speakers.
  3. Retail & E-Commerce:
    • Etsy: AI to improve search recommendations and ads.
    • Target: AI solutions for personalized offers.
    • Victoria’s Secret: AI agents for in-store associate support.
    • Woolworths: Generative AI to improve communications.
  4. Healthcare & Life Sciences:
    • Bayer: Radiology platform for data analysis.
    • DaVita: Transform kidney care with AI.
    • Highmark Health and Freenome: AI for clinical trial planning.
    • Dasa: Faster detection of findings in test results.
  5. Finance & Banking:
    • ING Bank: AI chatbot for improved customer query answers.
    • Intesa Sanpaolo, Macquarie Bank, Scotiabank: Gen AI to transform banking operations.
  6. Marketing & Media:
    • Golden State Warriors: AI to improve fan experience content.
    • Los Angeles Rams: AI for content analysis and scouting.
    • Carrefour: AI for dynamic marketing campaigns.
    • Procter & Gamble: Generative AI platform for creative asset creation.
    • WPP: Google Cloud’s gen AI in their marketing system.
  7. Internal Knowledge & Productivity:
    • Cintas: Knowledge center for customer service and sales teams.
    • Uber: AI agents to improve productivity.
    • Box, Typeface, Glean: AI for marketing and financial services.
    • Workday: Vertex Search for data insights.
  8. Education & Research:
    • Pepperdine University: Real-time translation for students and faculty.
    • Mayo Clinic: Accelerated scientific research with data retrieval via Vertex AI.
    • BenchSci: AI for scientists to understand biological connections.
  9. Creative & Design:
    • Canva: Vertex AI for video editing.
    • Paramount: Streamline metadata and video summaries.
  10. Security & Compliance:
    • BBVA: Google SecOps for security threat detection.
    • Pfizer: Aggregating cybersecurity data sources.
    • Palo Alto Networks: AI-driven security operations platform.
  11. Data & Analytics:
    • Spotify: Dataflow for ML podcast previews.
    • AI21 Labs: BigQuery integration with Contextual Answers.
    • MSCI: ML with Vertex AI for climate-related risk insights.