Introduction
I want to be clear: I'm not an AI guru. I'm just a developer running experiments with "agentic" programming in Go, trying to see what actually works once you move past the "Hello World" phase.
After finishing this course about Langchain and LangGraph, I wanted to find a practical way to build an agent. Since most of my experience is in Go, I started exploring the Go ecosystem and came across the EINO framework.
Almost immediately, I hit a wall: how do you actually keep track of the steps, actions, and results in a system where more than one actor is involved? I started with the usual approach—debugging and following logs—but I quickly realized that logs weren't enough to see the full picture.
To help me see what was happening under the hood, I built two small projects:
- agent_monitor: A Go observability playground for inspecting and running agentic pipelines.
- agentmeter: A library specifically designed to track token usage and reasoning traces.
I'm not mentioning these to show off the code, but just because I used these projects to run different experiments.
This is the first in a series of articles where I'll share what I'm learning about building agents in Go (though the logic applies to any language). In this post, we're looking at MCP tool optimization. In the next one, I'll dive into a movie reflection system I built to help agents double-check their own decisions and limit hallucinations.
The "Aha!" Moment
While using these tools to inspect my own agents, I noticed something. Initially, I took the easy route: I injected a raw MCP client connection directly into the Agent. Since an MCP connection is designed to expose all available tools, I figured, "Let the agent have everything; it's smart enough to handle it."
I was wrong. After running tests with the Tavily Search and MapBox MCPs, I realized that giving an agent raw, unfiltered access to an MCP connection is usually a bad idea.
If you're an expert in the field, this first post might seem trivial. But if you're just starting to approach MCP, I hope these findings save you some time. Even for a simple pipeline, you must consider:
- A Tool Filter: To control exactly which tools the agent can see for a specific task.
- A Tool Overlay: A custom layer that uses a "Tolerant Reader" approach to prune the tool's response before the LLM ever sees it.
Let's dive into the Tool Overlay. Tool filtering matters too: it keeps you from overloading the agent's context with tools it doesn't need.
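To make the Tool Filter idea concrete, here is a minimal, framework-agnostic sketch: an allowlist applied to the tool list before it ever reaches the agent. The `Tool` struct and `filterTools` helper are illustrative stand-ins (your framework's tool descriptor, e.g. EINO's tool info type, would take their place):

```go
package main

import "fmt"

// Tool is a stand-in for your framework's tool descriptor.
type Tool struct {
	Name string
}

// filterTools keeps only the MCP tools whose names appear in the
// allowlist, so the agent never sees the rest.
func filterTools(all []Tool, allow map[string]bool) []Tool {
	kept := make([]Tool, 0, len(all))
	for _, t := range all {
		if allow[t.Name] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// An MCP server typically exposes every tool it has...
	all := []Tool{{Name: "tavily_search"}, {Name: "tavily_extract"}, {Name: "tavily_crawl"}}
	// ...but this task only needs search.
	allow := map[string]bool{"tavily_search": true}
	fmt.Println(len(filterTools(all, allow))) // only tavily_search survives
}
```

The point is where the filter runs: between the MCP client and the agent, so the pruned tools never consume context-window space.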
The Setup: A Simple Test
I set up an LLM travel agent with one job: "Suggest 10 places to visit in Florence." I used the Tavily MCP search tool to fetch the data.
I ran the experiment twice:
- Run 1 (The Lazy Way): I let the agent use the MCP client to access the tool directly.
- Run 2 (The Structured Approach): I added a layer between the tool and the LLM to parse the response and remove the fields I didn't need.
Run 1: The Raw MCP Response
MCP tools return verbose responses by design. Here is a snippet of what Tavily actually sends back for a single search:
```json
{
  "query": "top 10 places to visit in Florence",
  "results": [
    {
      "title": "Top 10 Must See Places in Florence..",
      "url": "https://www.romecabs.com/blog/docs/top-10-must-see-places-rome/",
      "content": "**Piazza del Duomo**, the **Gallery of the Academy**, **Uffizi** **Gallery**",
      "score": 0.80405265,
      "raw_content": "...",
      "favicon": "..."
    },
    ...
  ],
  "response_time": "1.67",
  "auto_parameters": { "topic": "travel", "search_depth": "basic" },
  "usage": { "credits": 1 },
  "request_id": "123e4567-e89b-12d3-a456-426614174111"
}
```
The "Tolerant Reader" Overlay
In my second run, I defined a custom tool using the same Tavily MCP connection but overrode the search function, unmarshaling only the content and score fields.
```go
// TavilyResult holds only the fields we care about.
type TavilyResult struct {
	Content string  `json:"content"`
	Score   float64 `json:"score"`
}

// TavilySearchResponse wraps the results slice of the raw payload;
// every other top-level field is silently dropped on unmarshal.
type TavilySearchResponse struct {
	Results []TavilyResult `json:"results"`
}

// ... inside the tool call ...
var resp TavilySearchResponse
// Unmarshal only the essential fields
if err := sonic.UnmarshalString(mcpRawResponse, &resp); err != nil {
	return "", err
}
out, err := sonic.Marshal(resp.Results)
if err != nil {
	return "", err
}
return string(out), nil
```
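As a sanity check, here is the same tolerant-reader step as a self-contained program. It uses the standard library's `encoding/json` instead of sonic so it runs without extra dependencies, and the sample payload is a made-up miniature of Tavily's response:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TavilyResult keeps only the fields the agent needs.
type TavilyResult struct {
	Content string  `json:"content"`
	Score   float64 `json:"score"`
}

// tavilySearchResponse mirrors just the results slice; every other
// top-level field in the raw payload is dropped on unmarshal.
type tavilySearchResponse struct {
	Results []TavilyResult `json:"results"`
}

// pruneResponse is the tolerant-reader step: parse the raw MCP payload,
// keep content and score, and re-serialize the slim version.
func pruneResponse(raw string) (string, error) {
	var resp tavilySearchResponse
	if err := json.Unmarshal([]byte(raw), &resp); err != nil {
		return "", err
	}
	out, err := json.Marshal(resp.Results)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	raw := `{"query":"q","results":[{"title":"t","url":"u","content":"Uffizi","score":0.8,"favicon":"f"}],"request_id":"123"}`
	slim, err := pruneResponse(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(slim) // [{"content":"Uffizi","score":0.8}]
}
```

The LLM only ever sees the output of `pruneResponse`; the title, url, favicon, and request_id never reach the context window.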
Analyzing the Results
When I looked at the output, the difference was significant—especially considering this is just a single, simple interaction.
| Metric | Raw MCP (tavily_raw) | Parsed (tavily_parsed) | Delta |
|---|---|---|---|
| Tool payload | 12,184 bytes | 5,115 bytes | 2.4x smaller (−58%) |
| Tokens in | 4,157 | 1,446 | 2.9x fewer |
| Cost | $0.0102 | $0.0051 | 50% cheaper |
Test: "Suggest 10 places to visit in Florence" — model: gpt-4.1
The MCP Overhead in Numbers
I repeated the experiment several times, and the results were consistent. The 7,069 bytes stripped per tool call are not just wasted bandwidth—they are converted directly into input tokens that the LLM must read, that you pay for, and that eat into the context window.
The raw Tavily response carries fields the agent simply doesn't need for this task: title, url, favicon, raw_content, request_id, response_time, auto_parameters, and usage. Once you remove them, the payload drops by 58%, mapping almost perfectly to the 2.9x token reduction.
Why a "Small" Saving Matters
In a complex system, a 2.9x token reduction compounds quickly. If your agent makes 10 tool calls in a single session, you aren't just saving a few cents—you are effectively preventing your context window from exploding. By keeping the input lean, you leave more room for the actual reasoning and long-term memory the agent needs to finish the job.
Real-world systems work in loops: the agent searches, reasons, calls another tool, summarizes, and then responds. A 50% cost reduction on one call might look like pocket change, but we rarely stop at one.
Conclusion
MCP servers return everything because they are built for interoperability. They don’t know if your agent is a travel bot or a data scientist, so they send the "kitchen sink" to be safe.
However, as a developer, the space between that server and your Agent is your responsibility. There is a clear tradeoff here: if you prune too much data, you might limit the agent's capacity to find unexpected connections. But for most specific tasks, forcing an LLM to read favicon URLs and request_ids is just paying a "tax for noise."
This becomes even more critical in an enterprise environment where you might be wrapping your own internal APIs with an MCP server. It is increasingly evident that MCP tools should be wrapped and executed in a layer outside the agent's direct context.
If you want to dive deeper into the "Tool Overload" problem, these articles were instrumental in my research:
- https://www.anthropic.com/engineering/code-execution-with-mcp
- https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents
I'm still just scratching the surface of agentic programming in Go, but this was an important lesson: don't just "plug and play" your MCP tools. Apply a layer between MCP and agents and save the tokens for the reasoning that actually matters.
In the next post, I'll dive into how I built a "movie reflection system" to improve agent accuracy and reduce hallucinations.
Check my work
The code for both variants is available in agent_monitor.
You can use it to run the pre-filled sample use-cases, or to write your own with EINO.