Over the past year, we've helped companies put thousands of AI agents to work. These agents have handled everything from tier 1 phone and chat support to pulling insights from mountains of data to automating processes. Along the way, we've learned some critical lessons about what really works—and what doesn't.
Large language models (LLMs) are smart and getting smarter, but they don't inherently know your business, your data, your processes, or your intent. The single biggest factor in an agent's success is providing it with the right information and tools at the right time. Without proper context and system access, even the most capable model will struggle to deliver meaningful results.
This mental shift, from programming to delegating, changes everything. Instead of thinking "How do I program this?" you need to think "How do I delegate this effectively?" What context and resources does the agent need? What does success look like? How will I measure performance?
Focus more on prompts than logic. Remember: don't design, give context.
Start with generic tools for agents, like file systems and code sandboxes, rather than building specialized tools. For example, if you want an agent to keep track of subtasks, just add instructions to the prompt to create a file and check off items as they're done.
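To make that concrete, here's a minimal sketch of the idea: a plain file tool plus prompt instructions, instead of a purpose-built task tracker. The prompt wording and tool functions are illustrative, not any particular framework's API.

```python
# A minimal sketch of the "generic tools" approach: give the agent a plain
# file tool and tell it in the prompt how to track subtasks with it.
from pathlib import Path

SYSTEM_PROMPT = """\
You are a support-operations agent.
Before starting, write your subtasks to todo.md as a markdown checklist.
After finishing each subtask, rewrite todo.md with that item checked off.
"""

def read_file(path: str) -> str:
    """Generic file tool: return the file's contents (empty if missing)."""
    p = Path(path)
    return p.read_text() if p.exists() else ""

def write_file(path: str, content: str) -> str:
    """Generic file tool: overwrite the file with new content."""
    Path(path).write_text(content)
    return f"wrote {len(content)} characters to {path}"

# The agent, not our code, decides what the checklist contains:
write_file("todo.md", "- [ ] pull last week's tickets\n- [ ] summarize top issues\n")
print(read_file("todo.md"))
```

The payoff is flexibility: the same two file functions can track subtasks, store notes, or cache intermediate results, depending only on the prompt.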
Start with frontier models, then optimize. Models are rapidly becoming cheaper, faster, and smarter. Don't waste time getting a small LLM working on a task until you've validated that a larger model can do it. With good evaluations (see below), you can do this easily.
When an agent doesn't perform as expected (and it will happen), you need to understand why. Was it the prompt? The data it accessed? The tool it tried to use? Without tools to trace the agent's "thought process"—its inputs, the LLM's responses, the actions it took—you're just guessing.
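One lightweight way to get that visibility is to record every model call and tool call as a structured trace you can replay later. The sketch below stubs out the actual model and tool calls; the model name and tool name are placeholders, and the shape of the trace is what matters.

```python
# A minimal tracing sketch: log each step of an agent run (prompt, response,
# tool call) to a JSON file so failures can be inspected after the fact.
import json, time, uuid

class Trace:
    def __init__(self, agent_name: str):
        self.run_id = str(uuid.uuid4())
        self.agent_name = agent_name
        self.events = []

    def record(self, kind: str, **data):
        self.events.append({"ts": time.time(), "kind": kind, **data})

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump({"run_id": self.run_id, "agent": self.agent_name,
                       "events": self.events}, f, indent=2)

trace = Trace("ticket-triage")
prompt = "Classify this ticket: 'Cannot log in after password reset.'"
trace.record("llm_request", prompt=prompt, model="frontier-model")   # placeholder model name
response = "category: authentication"                                # stubbed LLM output
trace.record("llm_response", response=response)
trace.record("tool_call", tool="zendesk.update_ticket", args={"category": "authentication"})
trace.save("trace.json")  # inspect later when the agent misbehaves
```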
AI agents are non-deterministic: you can't just write a unit test and call it a day. Having robust evaluations pays dividends and forms the foundation for reliable AI systems. Use an 'LLM as a judge' and specialized RAG evaluations to assess an agent's output for accuracy, helpfulness, and relevance, and to detect hallucinations and potential harm. This is a new discipline and vital for trust.
Deploy evaluation systems in two complementary ways: offline, as a test suite of known scenarios you run before shipping changes, and online, by scoring a sample of live interactions in production.
Without evaluations, you can't measure, and if you can't measure, you can't optimize or improve performance.
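As a rough illustration of the 'LLM as a judge' pattern, the sketch below asks a second model to grade an agent's answer against the context it retrieved. `call_llm` is a placeholder for whatever model client you use, and the grading criteria are just examples.

```python
# A minimal "LLM as a judge" sketch: ask a second model to grade an agent's
# answer for accuracy and groundedness against the retrieved context.
import json

JUDGE_PROMPT = """\
You are grading an AI agent's answer.
Question: {question}
Retrieved context: {context}
Agent answer: {answer}
Return JSON: {{"accurate": true/false, "grounded": true/false, "reason": "..."}}
"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model client here.
    return '{"accurate": true, "grounded": true, "reason": "Matches the context."}'

def judge(question: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return json.loads(raw)

verdict = judge(
    question="What is the refund window?",
    context="Policy: refunds within 30 days of purchase.",
    answer="You can request a refund within 30 days.",
)
print(verdict["accurate"], verdict["grounded"], verdict["reason"])
```

Run the same judge over a fixed scenario suite before each release and over a sample of production traffic, and you have a measurable quality signal instead of a gut feeling.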
Here's a common scenario: give an agent access to systems like Salesforce and Zendesk via API or MCP (Model Context Protocol), hand it an open-ended objective, and it sometimes works! However, it might take 30 minutes and $500 in LLM usage to figure things out (true story). Users expect quick responses and won't wait 15 minutes for routine queries or tasks.
Pre-process your data. Don't make agents fetch everything on demand. Index your business data to speed up queries and searches. This saves time and money.
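As a toy illustration, the sketch below builds a tiny keyword index over tickets ahead of time so the agent can search locally instead of pulling raw records through an API on every request; in practice you'd use a real search engine or vector store, but the indexing-first shape is the same.

```python
# A minimal pre-processing sketch: index tickets once, query the index later.
from collections import defaultdict

tickets = [
    {"id": 1, "subject": "Login fails after password reset"},
    {"id": 2, "subject": "Billing export missing invoices"},
    {"id": 3, "subject": "Password reset email never arrives"},
]

index = defaultdict(set)
for t in tickets:                       # done ahead of time, not per-request
    for word in t["subject"].lower().split():
        index[word].add(t["id"])

def search(query: str) -> list[int]:
    """Return ticket ids matching every query term."""
    words = query.lower().split()
    hits = [index[w] for w in words if w in index]
    return sorted(set.intersection(*hits)) if hits else []

print(search("password reset"))  # -> [1, 3]
```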
Implement agent reflection. Enable agents to learn from experience and apply those lessons. For example, if an agent discovers that tickets labeled with the "API" channel and "Product X" in the system are actually phone calls (based on human feedback or other context), it should remember this pattern and apply it when analyzing volume metrics.
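Here's one minimal way to implement that kind of memory: persist lessons as simple rules and apply them before computing metrics. The file name, field names, and rule format are illustrative.

```python
# A minimal reflection sketch: remember lessons (e.g. from human feedback)
# in a JSON file and apply them on later runs.
import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")

def load_lessons() -> list[dict]:
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else []

def remember(lesson: dict) -> None:
    lessons = load_lessons()
    lessons.append(lesson)
    MEMORY_PATH.write_text(json.dumps(lessons, indent=2))

# Human feedback teaches the agent that "API" + "Product X" tickets are phone calls.
remember({"if": {"channel": "API", "product": "Product X"},
          "then": {"channel": "phone"}})

def effective_channel(ticket: dict) -> str:
    """Apply remembered corrections before computing volume metrics."""
    for lesson in load_lessons():
        if all(ticket.get(k) == v for k, v in lesson["if"].items()):
            return lesson["then"]["channel"]
    return ticket["channel"]

print(effective_channel({"channel": "API", "product": "Product X"}))  # -> phone
```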
Use knowledge compression. Rather than having agents re-read entire call transcripts or email threads, implement preprocessing that creates summaries and extracts structured data. This allows agents to process and act on information orders of magnitude faster.
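A rough sketch of that preprocessing step: compress each transcript once, in a batch job, into a summary plus structured fields, and let agents work from the compact record. `call_llm` is again a placeholder for your model client, and the extracted fields are examples.

```python
# A minimal knowledge-compression sketch: summarize transcripts offline so
# agents read compact records instead of raw text.
import json

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your actual model client here.
    return json.dumps({"summary": "Customer locked out after password reset; resolved by manual unlock.",
                       "product": "Product X", "sentiment": "negative"})

def compress(transcript: str) -> dict:
    prompt = ("Summarize this call in two sentences and extract product and sentiment "
              f"as JSON with keys summary, product, sentiment:\n{transcript}")
    return json.loads(call_llm(prompt))

# Run once per transcript in a batch job; agents later read only the output.
record = compress("Full multi-page call transcript goes here...")
print(record["summary"])
```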
AI can generate a lot of content. But if that content isn't concise, actionable, and delivered at the right moment, it just adds to human cognitive load. The goal is to make humans move faster by surfacing insights, understanding user intent, and taking actions that reduce human effort, not just shift it.
What works for one or two agents can break down when you have dozens or hundreds. This is where you wish you had started with observability and evaluations built in. These problems compound. Planning for this from a management and governance perspective is crucial, even if you're starting small.
AI costs can spiral quickly with multiple agents running 24/7. Use smaller models for simple tasks and save expensive ones for complex work and planning. Implement circuit breakers for both spending and time limits.
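A minimal sketch of both ideas, assuming placeholder model names and made-up per-call costs: a circuit breaker that caps spend and wall-clock time, plus a simple router that sends routine tasks to a cheaper model.

```python
# A minimal cost-control sketch: cap an agent run by spend and time, and
# route simple tasks to a cheaper model. Prices and model names are made up.
import time

class BudgetExceeded(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_cost_usd: float, max_seconds: float):
        self.max_cost, self.max_seconds = max_cost_usd, max_seconds
        self.cost, self.start = 0.0, time.monotonic()

    def charge(self, cost_usd: float) -> None:
        self.cost += cost_usd
        if self.cost > self.max_cost:
            raise BudgetExceeded(f"spend limit hit: ${self.cost:.2f}")
        if time.monotonic() - self.start > self.max_seconds:
            raise BudgetExceeded("time limit hit")

def pick_model(task: str) -> str:
    # Cheap model for routine work, expensive model for complex work and planning.
    return "small-model" if task in {"classify", "summarize"} else "frontier-model"

breaker = CircuitBreaker(max_cost_usd=5.00, max_seconds=120)
for task in ["classify", "summarize", "plan"]:
    model = pick_model(task)
    breaker.charge(0.40 if model == "small-model" else 1.50)  # estimated call cost
    print(task, "->", model)
```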
AI agents create unique security challenges by operating across multiple systems and data sources. The key question: how do you verify proper user authorization when an agent acts on their behalf? Agents must inherit and maintain appropriate user permissions throughout their operations, with careful logging and auditing of all interactions.
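One way to approach this is to make every tool call go through a gate that checks the requesting user's permissions and writes an audit record. The permission names and tools below are illustrative, not a specific product's model.

```python
# A minimal permission-inheritance sketch: every action runs under the
# requesting user's permissions and is written to an audit log.
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
audit = logging.getLogger("agent.audit")

@dataclass
class UserContext:
    user_id: str
    permissions: set[str] = field(default_factory=set)

def act_as_user(user: UserContext, tool: str, required_permission: str, **args):
    """Refuse any action the requesting user could not perform themselves."""
    allowed = required_permission in user.permissions
    audit.info("user=%s tool=%s args=%s allowed=%s", user.user_id, tool, args, allowed)
    if not allowed:
        raise PermissionError(f"{user.user_id} lacks {required_permission}")
    # ... perform the real tool call here ...
    return "ok"

support_rep = UserContext("rep-42", {"tickets:read"})
act_as_user(support_rep, "zendesk.get_ticket", "tickets:read", ticket_id=123)
# act_as_user(support_rep, "zendesk.delete_ticket", "tickets:delete")  # would raise
```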
These are just some of the lessons we've learned so far. The field is evolving incredibly fast, and while these insights have helped us, I'm certain there's still much more to discover. I'd love to hear what you've learned in your own AI journey.