However, this use case is "too valuable" to give up on. It would be an MSP's wildest dream to ask AI "was Ironwood Manufacturing profitable last year?" and get an accurate breakdown of invoiced revenue vs. labor cost. We HAVE TO make this dream a reality.
Attempt 1: API filter
We first attempted to solve this problem at the API level. Most APIs support query parameters that narrow the response to only the relevant data. For example, 5,000 tickets may have been processed in 2025, but only 250 of them involve Ironwood. If we fetch only those 250 tickets, we cut the data volume by 20x, and we can layer on additional filters to shrink the response further.
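As a rough sketch, here's what that API-level filtering looked like. The endpoint, parameter names, and response shape below are hypothetical stand-ins, not any specific vendor's actual API:

```python
import requests

# Hypothetical PSA endpoint and query parameters -- real vendor APIs
# use different names, but the shape of the idea is the same.
BASE_URL = "https://api.example-psa.com/v1/tickets"

def fetch_tickets_for_company(company_id: str, year: int) -> list[dict]:
    """Fetch only the tickets matching one company and one year,
    instead of pulling every ticket and filtering client-side."""
    tickets, page = [], 1
    while True:
        resp = requests.get(
            BASE_URL,
            params={
                "companyId": company_id,            # e.g. Ironwood Manufacturing
                "createdAfter": f"{year}-01-01",
                "createdBefore": f"{year}-12-31",
                "page": page,
                "pageSize": 100,
            },
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()["items"]                # assumed response shape
        if not batch:
            break
        tickets.extend(batch)
        page += 1
    return tickets

# 5,000 tickets in 2025, only ~250 of them Ironwood's: a ~20x reduction.
ironwood_2025 = fetch_tickets_for_company("ironwood-mfg", 2025)
```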
This approach sounded feasible, but it broke down quickly once we put it in front of customers. As the system became more powerful, users' appetite grew: instead of asking only about 2025, they started asking about the past 5 years. That meant 5x more data to process, which inevitably blew past the context window limit again.
Attempt 2: Custom tool
We then attempted to build custom tools that supported pagination. These tools could process large amounts of data in-memory but return only the relevant information to the Large Language Model (LLM). Imagine a ticket has 50 fields; the custom tool fetches just 1 of those fields and performs the aggregation before returning, so the LLM only ever sees the post-aggregation result. A minimal sketch of one such tool is shown below.
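This sketch reuses the hypothetical fetch_tickets_for_company helper from the earlier example; the laborCost field name is also an assumption for illustration:

```python
def total_labor_cost(company_id: str, year: int) -> dict:
    """Custom tool: pages through every matching ticket in memory,
    touches a single field, and returns only the aggregate to the LLM."""
    total, count = 0.0, 0
    for ticket in fetch_tickets_for_company(company_id, year):
        # A ticket may carry ~50 fields; we read exactly one of them.
        total += ticket.get("laborCost", 0.0)       # assumed field name
        count += 1
    # The LLM never sees raw tickets, only this tiny post-aggregation summary.
    return {"ticketCount": count, "totalLaborCost": round(total, 2)}
```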
This solution worked, but it quickly became difficult to maintain. Each new question required its own custom tool, and our codebase ballooned. Our development team couldn't write code as fast as users could ask questions. We needed a more scalable approach.
Final solution: Trust the LLM
When working with new technology, we need to be open-minded about change. The final solution we landed on involves a mindset shift: instead of relying on humans to build custom tools ahead of time, we give the language model the ability to decide what information it wants to see. A user may ask a question that requires aggregating data across Projects, Tickets, and Invoices, and the LLM handling the request knows best what information it needs to answer it. Our job as developers is to trust the LLM to make the right decision, and to give it the ability to answer any question.
We created a new interface that lets the LLM declare what information it wants before the API calls are made to fetch it. This works beautifully: if the LLM only needs ticket IDs, it asks for ticket IDs alone, and it can hold tens of thousands of them in its context window without any problem. If it needs to dig into a specific ticket, it loads the full ticket detail, which alone may take up the majority of the context window.
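Here's a minimal sketch of that interface, assuming an OpenAI-style tool schema; the tool name, schema, and fetch_records helper are illustrative, not our production code:

```python
# Hypothetical tool schema: the LLM picks the entity AND the exact fields
# it wants back, so it controls its own context budget.
QUERY_TOOL = {
    "name": "query_records",
    "description": "Fetch records of a given type, returning only the requested fields.",
    "parameters": {
        "type": "object",
        "properties": {
            "entity": {
                "type": "string",
                "enum": ["tickets", "projects", "invoices"],
            },
            "fields": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Only these fields are returned, e.g. ['id'] for a cheap scan.",
            },
            "filters": {
                "type": "object",
                "description": "Optional key/value filters, e.g. {'companyId': 'ironwood-mfg'}.",
            },
        },
        "required": ["entity", "fields"],
    },
}

def query_records(entity: str, fields: list[str], filters: dict | None = None) -> list[dict]:
    """Server-side handler: fetch matching records, then project them down to
    the requested fields before anything reaches the model's context."""
    records = fetch_records(entity, filters or {})  # hypothetical data-access helper
    return [{f: r.get(f) for f in fields} for r in records]
```

Because the model controls the fields parameter, a cheap scan over tens of thousands of IDs and a deep dive into one full record go through the same tool, with no per-question engineering required.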
This approach let us break through the context window barrier in a scalable way and enabled Bumblebee to answer questions that involve massive amounts of data aggregation. More importantly, we are no longer constrained by the quirks of upstream vendor APIs; the same interface gives us both leverage and scalability.
The takeaway - A leap of faith
In the world of AI, best practices evolve constantly. Sometimes a great solution requires us to rethink how we approach problems and reinvent ourselves. We've been taught NOT TO TRUST large language models. To get this breakthrough, we had to re-wire our own brains and start trusting LLMs with certain decisions.
TL;DR - Lessons learned
- API-level filtering doesn't scale with user appetite. It works until users widen the time window or broaden the question. Then you're back at the context window limit, just slower.
- Hand-rolling a custom tool per question is a treadmill. Each new question burns engineering time, and users come up with new questions faster than you can build tools for them.
- Let the LLM decide what data it needs. Give it an interface to request only the fields it cares about. It can hold tens of thousands of IDs in context, and it only pulls full records when it actually needs them.
- The hardest part was the mindset shift. "Don't trust the LLM" is a useful default for output quality, but it's the wrong default for data selection. Trusting the model to choose its own context is what unlocked the breakthrough.