
Chatbots & AI Agents

Conversational AI and autonomous agents that handle real customer and internal workflows — built on LLMs, grounded in your knowledge base, with human handoff designed in.


What are chatbots and AI agents?

Chatbots and AI agents are conversational systems that handle customer or internal workflows on behalf of an organization. A chatbot answers questions; an AI agent takes actions — calling APIs, updating records, scheduling, or orchestrating multi-step workflows. Modern builds are almost always agents with a conversational interface.

We build agents that actually work — not the frustrating kind that loop users back to "I didn't understand that" and have to escalate every interaction. The difference is grounded retrieval, tight tool design, evaluation, and intentional human handoff. The technology to do this well exists; most production agents fail because the team skipped one of those four.

Key terms used on this page:

  • LLM (Large Language Model): The neural network — GPT-4, Claude, Llama 3, Gemini — that generates the agent's responses.
  • RAG (Retrieval-Augmented Generation): The pattern of fetching relevant documents from a knowledge base at query time and giving them to the LLM as context, so answers are grounded in your real data instead of model memory.
  • Tool use / function calling: The mechanism by which an agent calls APIs (look up an order, create a ticket, schedule a meeting) as part of generating a response.
  • Guardrails: Programmatic constraints on what the agent can say or do — input validation, output filtering, topic restrictions, and safety classifiers.
  • Deflection rate: The percentage of customer conversations the agent resolves without human handoff. The primary KPI for support agents.
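The RAG pattern defined above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `search_kb` is a hypothetical keyword lookup standing in for a vector-store similarity search, and the corpus is canned.

```python
# Minimal RAG sketch: retrieve passages, then force the model to answer
# only from them. `search_kb` is a hypothetical keyword stand-in for a
# vector-store similarity search (Pinecone, Weaviate, pgvector, ...).

CORPUS = {
    "returns": "Returns policy: items can be returned within 30 days of delivery.",
    "shipping": "Shipping: standard delivery takes 3-5 business days.",
}

def search_kb(query: str, k: int = 3) -> list[str]:
    words = query.lower().split()
    return [text for text in CORPUS.values()
            if any(w in text.lower() for w in words)][:k]

def build_prompt(question: str) -> str:
    # The retrieved passages become the only facts the model may use.
    context = "\n".join(f"- {p}" for p in search_kb(question))
    return (
        "Answer ONLY from the context below. If the answer is not there, "
        "say you don't know and offer to connect a human.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("What is your returns policy?"))
```

The prompt that reaches the LLM carries both the grounding context and the refusal instruction, which is what keeps answers tied to your real data.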

How does a customer support agent work?

A modern support agent is an LLM orchestrating four things: a retriever over your knowledge base, a set of tools (look up orders, create tickets, escalate), a system prompt defining tone and policy, and an evaluation harness that catches regressions. The user sees a chat window; behind it is a small system, not a model.

  • Retrieval. When a user asks a question, the system searches your knowledge base, help center, past tickets, and product docs for relevant passages and feeds them to the LLM as context.
  • Tool use. The agent can call APIs to look up an order status, check inventory, issue a refund within a cap, or create a ticket. Each tool has a strict schema and authorization.
  • Guardrails. A system prompt defines what the agent will and won't do. A separate classifier blocks obviously off-policy responses before they reach the user.
  • Handoff. When the agent is uncertain, the topic is sensitive, or the user is frustrated, the conversation routes to a human with full context preserved.

A well-built support agent on a clean knowledge base typically deflects 40–70% of tier-1 tickets with first-contact accuracy comparable to a junior human agent.
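The four components above hang together in a small orchestration loop. The sketch below is illustrative only: `llm_decide` stands in for a real function-calling LLM, and the tool implementations and order ID are canned stubs.

```python
# Sketch of the agent loop: each turn, the model either answers, calls a
# tool, or the system escalates. `llm_decide` stands in for a real
# function-calling LLM; tools here are canned stubs.

TOOLS = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "create_ticket": lambda summary: {"ticket_id": "T-1001", "summary": summary},
}

def llm_decide(message: str) -> dict:
    # A real system sends the message, retrieved context, and tool schemas
    # to the model and parses its function call. Canned for illustration.
    if "order" in message.lower():
        return {"action": "tool", "name": "lookup_order",
                "args": {"order_id": "A123"}}
    return {"action": "answer", "text": "Here is what our docs say..."}

def handle(message: str) -> dict:
    decision = llm_decide(message)
    if decision["action"] == "tool":
        tool = TOOLS.get(decision["name"])
        if tool is None:
            # Unknown tool name: escalate rather than guess.
            return {"handoff": True, "reason": "unknown tool"}
        try:
            result = tool(**decision["args"])
        except Exception:
            # Backend failure: escalate rather than fabricate a status.
            return {"handoff": True, "reason": "tool failure"}
        return {"handoff": False,
                "reply": f"Your order status: {result['status']}"}
    return {"handoff": False, "reply": decision["text"]}

print(handle("I need help with my order"))
```

Note that the failure paths return a handoff rather than a guessed answer; that escalation-by-default posture is the structural difference between an agent and a raw LLM call.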

When should you build a custom agent versus buy a packaged product?

Buy a packaged product (Intercom Fin, Ada, Zendesk AI, Salesforce Agentforce) when:

  • The use case is generic customer support deflection.
  • Your knowledge base is structured and lives in a help-center tool the vendor integrates with.
  • You want to ship in weeks, not months, and you accept that the vendor's pricing scales with conversation volume.

  • Your team doesn't have the engineering capacity to maintain a custom agent.

Build a custom agent (on LangChain, LlamaIndex, CrewAI, AutoGen, or a hand-rolled orchestrator) when:

  • The agent must take actions in your systems — order management, scheduling, CRM updates, billing — that packaged products don't integrate with.
  • The workflows are specific to your business and you need fine control over tools, prompts, and routing.
  • You have or want to own the IP and the model choice (avoiding lock-in to a vendor's foundation model decisions).
  • Volume is high enough that per-conversation pricing on a packaged tool exceeds the cost of running it yourself by a wide margin.

The middle path we often recommend: use a packaged tool for tier-1 deflection, build custom agents for the workflows that need real action-taking and integration.

How do you make sure the agent doesn't hallucinate?

Hallucination — the model confidently making up facts — is the single biggest risk in production agents. We layer four defenses:

  • Retrieval-augmented generation. The agent cites only facts retrieved from your knowledge base. The system prompt explicitly forbids answers not grounded in retrieved context.
  • Tight tool schemas. Tools have strict input and output schemas, so the agent can't invent invalid order IDs or hallucinated user data.
  • Output filtering. A classifier or rules layer checks responses for off-policy content (specific products not offered, prices not in the system, claims not in the docs) before they reach the user.
  • Evaluation suite. A library of test conversations runs against every change, with assertions on accuracy, citation, and safety. Regressions block deploy.

Hallucination is a solvable engineering problem when you treat it as one. The agents that fail are the ones built on raw LLM calls without retrieval and without evaluation.
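The evaluation-suite defense can be sketched concretely. This is a hypothetical harness, not our actual tooling: `run_agent` is a canned stand-in for the agent under test, and the two cases illustrate the accuracy and citation assertions mentioned above.

```python
# Sketch of an evaluation harness: test conversations with assertions
# that gate deployment. `run_agent` is a stand-in for the real agent.

def run_agent(question: str) -> dict:
    # Illustrative canned agent; a real harness calls the live pipeline.
    answers = {
        "What is the return window?": {
            "text": "Items can be returned within 30 days.",
            "citations": ["kb/returns-policy"],
        },
    }
    return answers.get(question, {"text": "I don't know.", "citations": []})

EVAL_CASES = [
    {"q": "What is the return window?", "must_contain": "30 days",
     "needs_citation": True},
    {"q": "Can you give me legal advice?", "must_contain": "I don't know",
     "needs_citation": False},
]

def run_suite() -> list[str]:
    failures = []
    for case in EVAL_CASES:
        out = run_agent(case["q"])
        if case["must_contain"] not in out["text"]:
            failures.append(f"accuracy: {case['q']}")
        if case["needs_citation"] and not out["citations"]:
            failures.append(f"citation: {case['q']}")
    return failures  # CI blocks the deploy if this list is non-empty

print("PASS" if not run_suite() else run_suite())
```

In practice the case library grows from real logged conversations, so every incident becomes a permanent regression test.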

How do you handle handoff to humans?

Every agent we ship has explicit handoff. The triggers are tuned per use case but typically include:

  • Low model confidence. When a confidence signal — retrieval similarity, token-level log probabilities, or a self-assessment score — falls below a threshold.
  • Sensitive topics. Billing disputes, account security, complaints, or regulated questions (legal, financial, medical advice) route to a human by default.
  • Customer frustration. Sentiment classifiers detect signals like repeated rephrasing, profanity, or explicit "talk to a human" requests.
  • Tool failures. When a backend API call fails or returns ambiguous results, the agent escalates rather than guessing.

The handoff carries the full conversation transcript and any tool calls already made, so the human agent doesn't have to start over. This is the detail most brands skip and the reason most customers say they hate chatbots.

Should you build, buy, or partner for chatbots and AI agents?

  • Buy a support-deflection product (Intercom Fin, Ada, Zendesk AI). Best for: tier-1 deflection, structured KB, fast launch. Speed: 2–6 weeks. Differentiation: low — competitors get the same agent. Cost (3-yr TCO): USD 80K–500K depending on volume tier. Lock-in: high — pricing scales with conversations, IP stays with the vendor.
  • Buy an enterprise platform (Cognigy, Kore.ai, Salesforce Agentforce). Best for: multi-channel, voice + chat, complex routing. Speed: 6–12 weeks. Differentiation: medium — configurable workflows. Cost (3-yr TCO): USD 200K–1M+. Lock-in: high — proprietary flow engine.
  • Use a low-code builder (Voiceflow, Botpress). Best for: quick prototypes, internal tools, mid-complexity flows. Speed: weeks. Differentiation: medium. Cost (3-yr TCO): USD 30K–150K. Lock-in: medium — flows are portable, the runtime is sticky.
  • Build custom on LangChain / LlamaIndex / CrewAI. Best for: action-taking agents, deep integrations, IP ownership. Speed: 8–16 weeks. Differentiation: high — built on your systems. Cost (3-yr TCO): USD 80K–300K build + USD 5K–30K/mo runtime. Lock-in: low — you own the code.
  • Partner with a custom shop (our model). Best for: differentiated workflows, no in-house team, want to own the IP. Speed: 6–16 weeks. Differentiation: high. Cost (3-yr TCO): predictable, paid back in 6–12 months on support deflection. Lock-in: low — you own the code.
A common pattern: ship Intercom Fin or Ada for general support deflection in month one, partner on a custom agent for the high-value workflow (order recovery, scheduling, internal HR helpdesk) in months two through four, and converge on owning the differentiated parts.

How do you choose between OpenAI, Anthropic, and open-source models?

The model is downstream of the requirements. We typically test 2–3 in week one and let the evaluation suite decide.

  • OpenAI (GPT-4, GPT-4o, GPT-4o mini). Strong general-purpose performance, excellent function calling, broad ecosystem. Default for many use cases.
  • Anthropic Claude (Sonnet, Opus, Haiku). Stronger long-context handling, often better at following complex instructions and refusing off-policy requests. Default for agents with long knowledge bases or strict policies.
  • Google Gemini. Competitive on cost, strong multimodal, native to Vertex AI for teams already on Google Cloud.
  • Open-source (Llama 3, Mistral, Qwen). Strong fit when data residency, cost at high volume, or fine-tuning on proprietary data drives the choice. Operational overhead is real — plan for a platform engineer.

We pick by evaluation results on your actual conversations, not vendor benchmarks.
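Letting the evaluation suite decide can be made mechanical. This sketch assumes a fixed eval set and canned model stubs; the candidate names, costs, and answers are hypothetical, and a real run calls each provider's API on your logged conversations.

```python
# Sketch of eval-driven model selection: run the same cases against each
# candidate and pick by pass rate, breaking ties on cost. All candidate
# names, costs, and answers below are hypothetical stubs.

EVAL = [
    ("What is the return window?", "30 days"),
    ("Shipping time?", "3-5 business days"),
]

CANDIDATES = {
    # name: (cost in USD per 1K conversations, stub answering function)
    "model-a": (90.0, lambda q: "Items can be returned within 30 days."
                if "return" in q else "Standard shipping takes 3-5 business days."),
    "model-b": (20.0, lambda q: "Returns are accepted within 30 days."),
}

def pass_rate(answer_fn) -> float:
    hits = sum(expected in answer_fn(q) for q, expected in EVAL)
    return hits / len(EVAL)

def pick_model() -> str:
    # Sort key: highest pass rate first, then lowest cost.
    scored = [(pass_rate(fn), -cost, name)
              for name, (cost, fn) in CANDIDATES.items()]
    return max(scored)[2]

print(pick_model())
```

Here the cheaper model loses because it fails a case; the tie-break on cost only matters once two models clear the same quality bar.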

What does a chatbot or AI agent engagement look like with us?

A typical engagement runs 6 to 16 weeks. Support deflection agents on a clean knowledge base ship at the low end; multi-tool agents with deep CRM, scheduling, or billing integrations run longer. We start with a 1-week scoping sprint that defines the handoff policy and success metrics, then iterate weekly with live conversation reviews.

We charge hourly with a cap, so the budget is predictable and scope can flex. Outcomes are measured against business metrics — deflection rate, customer satisfaction (CSAT) on bot-handled conversations, conversion lift, or cycle-time reduction — and instrumented from day one in a shared dashboard. We never propose a success-fee model because it incentivizes gaming the deflection metric, which is the easiest way to wreck CSAT.

What do chatbots and AI agents cost?

Realistic ranges based on the engagements we run:

  • Support deflection agent (RAG over a knowledge base, 1–2 tools, single channel): USD 60,000 to 120,000.
  • Multi-tool agent (deep CRM, scheduling, or billing integration, 4–8 tools): USD 120,000 to 280,000.
  • Multi-agent system (orchestrated workflow across departments, voice + chat, advanced routing): USD 250,000 to 600,000.

Runtime costs are typically USD 0.02 to 0.30 per conversation in LLM inference depending on model and conversation length. At 10,000 conversations per month, expect USD 200 to 3,000 in API costs plus a few hundred in vector database (Pinecone, Weaviate, pgvector) and observability (LangSmith, Langfuse, Helicone) tooling.
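The runtime arithmetic above is simple enough to sanity-check. This back-of-envelope helper uses the per-conversation range from the text and assumes a flat USD 300/month tooling placeholder for the vector database and observability stack (your actual tooling bill will vary).

```python
# Back-of-envelope runtime cost using the per-conversation ranges above.
# The flat USD 300/month tooling figure is an assumed placeholder.

def monthly_cost(conversations: int, cost_per_conv: float,
                 tooling: float = 300.0) -> float:
    """LLM inference plus flat vector-DB / observability tooling, USD/month."""
    return conversations * cost_per_conv + tooling

# 10,000 conversations/month at each end of the USD 0.02-0.30 range:
print(monthly_cost(10_000, 0.02))  # ~USD 500/month
print(monthly_cost(10_000, 0.30))  # ~USD 3,300/month
```

The spread between the cheap and expensive ends is roughly 6x, which is why model choice (and routing easy queries to a small model) dominates the runtime budget.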

Annual maintenance runs 15–25% of build cost for prompt tuning, knowledge-base updates, evaluation, and incident response.

For pricing on the strategy work that often precedes a build, see our AI Consulting page. For platform and engagement pricing details, see Pricing.

Frequently asked questions about chatbots and AI agents

What is the difference between a chatbot and an AI agent?

A chatbot answers questions in a conversation. An AI agent takes actions — it can call APIs, update systems, schedule meetings, or chain together multi-step workflows. The line has blurred with the arrival of GPT-4-class models, and most modern builds are agents with conversational interfaces.

Should we use Intercom Fin, Ada, or build a custom agent?

Use a packaged product like Intercom Fin or Ada when you need a customer-support deflection bot fast and your knowledge base is structured. Build custom when the agent must take actions in your systems, handle workflows specific to your business, or operate where lock-in to a vendor's pricing model is unacceptable.

How accurate are LLM-based chatbots in production?

With proper retrieval-augmented generation (RAG) and guardrails, a well-built support agent typically deflects 40–70% of tier-1 tickets with accuracy comparable to a junior agent. Without RAG, accuracy drops sharply and hallucinations become a liability. RAG is non-negotiable for any agent that cites facts.

How do you prevent the chatbot from hallucinating or going off-script?

Three layers: retrieval-augmented generation grounds answers in your real documents, system prompts and tool restrictions limit what the agent can do, and an evaluation suite catches regressions before deploy. We also log every conversation and review samples weekly during launch.

How long does it take to build and deploy an AI agent?

A focused customer-support agent with RAG over a knowledge base ships in 6–10 weeks. Multi-tool agents that take actions in CRM, scheduling, or billing systems typically take 10–16 weeks because the integrations carry most of the work.

How much does an AI agent cost to run?

LLM API costs typically run USD 0.02–0.30 per conversation depending on length and model choice (GPT-4 vs. Claude Sonnet vs. Haiku vs. open-source). At 10,000 conversations per month, expect USD 200–3,000 in inference costs. Hosting, vector database, and observability add a few hundred more.

Can the agent hand off to a human when it gets stuck?

Yes, and it should. We design every agent with explicit handoff triggers — low confidence, sensitive topics, customer frustration, or specific keywords — and route the conversation to a human with full context preserved. Agents without good handoff are how brands get viral support disasters.

Ready to Transform Your Business with AI?

Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.

Schedule a Consultation