Bumblebee

Lessons learned from pushing an agent-based invoice audit workflow to its limits

Recurring invoices are the backbone of MSP revenue. You buy licenses from a distributor at cost, mark them up in your billing platform, and bill your clients monthly. Get the markup right and it's predictable margin. Get it wrong — a seat count drifts, a proration doesn't carry over, a price change slips through — and you silently lose money every single month until someone catches it. The same error compounds across every billing cycle it goes undetected.

One of our MSP partners has spent two years perfecting this process. Even so, it takes five full days every month: over a thousand line items, hundreds of companies, one person cross-referencing everything manually. Worse, as the company grows, the audit workload grows with it.

Objective

We set out to automate this with Bumblebee's agent-based workflows. The invoice audit boils down to a comparison problem. On the 7th of each month, Pax8 sends a CSV with every line item they billed the MSP for. The PSA has the scheduled invoices synced from Pax8. The job is to make sure those two match, with the correct markup.
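To make the comparison concrete, here is a minimal sketch of the per-line-item check. The row shapes, field names, and flat percentage markup are illustrative assumptions; the actual Pax8 export and PSA invoice schemas differ.

```python
def compare_line_items(pax8_rows, psa_rows, markup_pct):
    """Compare distributor billing against PSA scheduled invoices.

    Hypothetical row shape: {"sku": str, "qty": int, "unit_price": float}.
    Returns a list of discrepancy records, one per problem found.
    """
    psa_by_sku = {r["sku"]: r for r in psa_rows}
    issues = []
    for row in pax8_rows:
        psa = psa_by_sku.get(row["sku"])
        if psa is None:
            # Billed by the distributor but never scheduled in the PSA.
            issues.append({"sku": row["sku"], "type": "data_not_found"})
            continue
        if psa["qty"] != row["qty"]:
            issues.append({"sku": row["sku"], "type": "quantity_mismatch"})
        # Expected client price = distributor cost plus the agreed markup.
        expected = round(row["unit_price"] * (1 + markup_pct / 100), 2)
        if abs(psa["unit_price"] - expected) > 0.01:
            issues.append({"sku": row["sku"], "type": "price_mismatch"})
    return issues
```

In the real workflow the model does this reasoning from raw CSV text rather than pre-parsed rows, which is what makes fuzzy cases (prorations, renamed SKUs) tractable.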

The Setup

We used GPT-OSS-120B as the reasoning model because it's an open-weight model that can be hosted on a private inference provider, which gives us privacy guarantees. It is also among the most price-performant models available today.

It has a 130K-token input limit and a 32K-token output limit, enough context to hold a meaningful chunk of invoice data and reason about mismatches. The workflow runs inside Bumblebee, which handles the orchestration: pulling data from Pax8 and the PSA, feeding it to the model, and collecting structured results.

Over the course of four months, we improved the workflow every month. Here are the limits we ran into:

Limit 1: A Single LLM Can't Process the Entire Invoice

A thousand-line CSV doesn't fit neatly into a single LLM call, even with a maximum 2M-token context window. As the input approaches the context limit, accuracy drops when you ask a model to reason over that many rows at once.

Solution: Group by company. We break the Pax8 CSV into one file per customer, using Claude for the splitting step. This turns one impossible task into hundreds of manageable ones. Each company's line items get their own workflow run.

To make this practical, we built a bulk run feature in Bumblebee. Upload the per-customer CSVs, and the platform kicks off up to 250 workflows in a single batch. Thousands of line items become hundreds of companies become one button click.
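The article uses Claude for the splitting step; the logic itself is simple enough to sketch deterministically. This assumes the export has a `companyName` column, which is an illustrative guess at the actual Pax8 field name:

```python
import csv
from collections import defaultdict
from pathlib import Path

def split_by_company(pax8_csv: str, out_dir: str,
                     company_col: str = "companyName") -> list[Path]:
    """Split a monthly Pax8 export into one CSV per customer."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    groups: dict[str, list[dict]] = defaultdict(list)
    with open(pax8_csv, newline="") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames
        for row in reader:
            groups[row[company_col]].append(row)
    written = []
    for company, rows in groups.items():
        # Sanitize the company name so it is safe to use as a filename.
        safe = "".join(c if c.isalnum() else "_" for c in company)
        path = out / f"{safe}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

Each output file then becomes the input to one workflow run in the bulk batch.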

Limit 2: Concurrency and Rate Limits

Running 250 workflows in parallel sounds great until you start getting throttled. Pax8's API has rate limits. The PSA's API has rate limits. Bumblebee also enforces LLM usage limits at the user level. When we fired too many workflows at once, the LLM limit was breached first and everything stalled.

Solution: Stagger with retry. We configured a 5-second wait between workflow launches and built auto-retry into the agent to recover from transient API failures. The total audit run time comes to about 20 minutes for 250 workflows. The intuition here is Little's Law: the number of in-flight requests at any time equals the arrival rate times the processing time. By controlling the wait, we control the concurrency.

One nuance worth noting: the right wait depends on how many API calls each workflow makes. For our scenario, a 5-second wait is more than enough buffer, since each workflow run makes just two tool calls.
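The stagger-plus-retry loop can be sketched as follows. `start_fn` is a hypothetical stand-in for the API call that launches one Bumblebee workflow run, and the backoff parameters are illustrative:

```python
import time

def launch_staggered(workflows, start_fn, wait_s: float = 5.0,
                     max_retries: int = 3, base_delay: float = 1.0):
    """Launch workflows with a fixed stagger; retry transient failures.

    By Little's Law, in-flight runs ~= (per-run duration) / wait_s,
    so a longer wait_s directly lowers concurrency.
    """
    results = []
    for wf in workflows:
        for attempt in range(max_retries + 1):
            try:
                results.append(start_fn(wf))
                break
            except Exception:
                if attempt == max_retries:
                    raise  # give up after exhausting retries
                # Exponential backoff before retrying a throttled call.
                time.sleep(base_delay * 2 ** attempt)
        # Fixed stagger between launches keeps concurrency bounded.
        time.sleep(wait_s)
    return results
```

In production this logic lives inside Bumblebee's bulk-run feature rather than a user-side script, but the control knobs are the same.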

Limit 3: Organizing the Output

After 200+ workflows complete, you have 200+ individual results. There's no easy way to scan all of that for the things that actually matter.

Solution: Standardize and post-process. We structured every workflow to output results in a consistent format, then piped all results into Claude for post-processing. Claude consolidates the individual outputs into a single structured summary — organized by discrepancy type, sorted by dollar impact, ready for a human to review and act on.

This post-processing step required the largest time investment. We spent tens of hours adjusting the prompt that organizes output into categories: data not found, quantity mismatch, price mismatch, one-time charge, and so on. To validate prompt changes, we ran the audit multiple times a month to make sure the output lined up with our expectations.
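The categorization itself is done by the LLM; the deterministic half of the consolidation (grouping by discrepancy type, sorting by dollar impact) looks roughly like this. The result-record shape is a hypothetical standardized format, not Bumblebee's actual output schema:

```python
from collections import defaultdict

def consolidate(results: list[dict]) -> dict[str, list[dict]]:
    """Group per-workflow findings by discrepancy type.

    Assumed record shape:
      {"company": str, "type": str, "impact_usd": float}
    Within each type, findings are sorted by dollar impact,
    largest first, so a reviewer sees the costliest issues on top.
    """
    by_type: dict[str, list[dict]] = defaultdict(list)
    for r in results:
        by_type[r["type"]].append(r)
    for findings in by_type.values():
        findings.sort(key=lambda r: r["impact_usd"], reverse=True)
    return dict(by_type)
```

Standardizing the per-workflow output format is what makes this step mechanical; without it, every consolidation run is a fresh parsing problem.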

What Worked That Surprised Us

The markup calculations turned out remarkably accurate. Going into this, we assumed math would be the weak link — LLMs have a bad reputation for being unreliable with arithmetic. We expected to spend significant time building guardrails around price comparisons and percentage calculations. That concern turned out to be outdated. GPT-OSS-120B handled markup verification, proration math, and seat-count comparisons with consistent accuracy across hundreds of invoices. We didn't need to bolt on a calculator tool or offload math to a separate step. This was the pleasant surprise of the project — the part we expected to fail quietly worked from the start.
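For reference, the core of what the model verifies per line item is simple arithmetic. A deterministic sanity check is cheap to keep alongside the LLM; the signature and one-cent tolerance below are illustrative:

```python
def markup_ok(cost: float, billed: float, markup_pct: float,
              tol: float = 0.01) -> bool:
    """Check that billed price = cost * (1 + markup), within tolerance.

    A small tolerance absorbs rounding differences between the
    distributor's billing and the PSA's stored prices.
    """
    expected = cost * (1 + markup_pct / 100)
    return abs(billed - expected) <= tol
```

The surprise was that the model performed this arithmetic reliably on its own; a check like this served as validation, not as a crutch.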

What's Still Broken

Two limits remain beyond what we can solve today.

Large single-client invoices. One client has roughly 500 endpoints, which generates an invoice large enough to exceed the model's context window even after isolating that single company. Theoretically we could break the invoice down further, but the return is not worth the cost. For now, that client still gets audited manually.

Reasoning token limits. GPT-OSS-120B has a 32K output token limit, and all reasoning happens within that budget. When the line item count grows, the model runs out of room to think before it runs out of room to write. The input might fit in 130K tokens, but the model can't reason over all of it within its output budget. This is a fundamental constraint of the current architecture — bigger context windows alone won't fix it without proportionally larger output limits.

As of today, we are keeping the audit manual for these two cases. They represent only around 3% of the client base, so it's not a huge time sink.

The ROI After 4 Months

By the end of the fourth month, we reduced total audit process time (pre-processing, Bumblebee workflows, human review) from 5 days to 6 hours. The number of invoice discrepancies found has also dropped to 20% of the first month's total, indicating that the process is maturing and upstream billing accuracy is improving. More importantly, we recovered around 0.5% of top-line recurring revenue and reduced potential reputational damage.

What's Next

NVIDIA released Nemotron 3 Super earlier this month — a 120B parameter open-source model with 12B active parameters and a native 1M token context window. It directly benchmarks against GPT-OSS-120B with comparable accuracy and up to 2.2x higher inference throughput. Since it's open-weight and available through multiple inference providers, it fits our zero-day data retention requirement.

We're planning to run next month's invoice audit on Nemotron 3 Super alongside GPT-OSS-120B and compare results head-to-head: accuracy on known discrepancies, handling of the large client that currently requires manual audit, and total run cost. We'll share the results.

TL;DR — Lessons Learned

  1. Break the problem into company-sized chunks. Don't feed a 1,000-row CSV into one LLM call. Group by customer, run each as its own workflow, and use bulk execution to scale.
  2. LLMs can calculate markup reliably now. Markup calculations, proration math, and seat-count comparisons worked accurately out of the box. The "LLMs are bad at math" assumption is outdated for current reasoning models — don't over-engineer around it.
  3. Add wait time between runs to respect rate limits. Stagger workflow launches with a fixed wait interval and build auto-retry into the agent. The right interval depends on how many API calls each workflow makes.
  4. Output formatting is the hardest part. Getting the LLM to produce structured, categorized output (not just "here are the mismatches") took more iteration than any other piece. Expect to run the full audit multiple times just to validate prompt changes.
  5. Some problems need bigger models, not better prompts. Large single-client invoices and reasoning token limits are architectural constraints. We chose to keep 3% of the audit manual rather than over-engineer around limits that new models will solve.

Hire Bumblebee now

Start using Bumblebee to build and scale your automations

Schedule a Demo