From Automotive Diagnostics to Banking Complaints: What I Learned Building a Local LLM Evaluation Platform

A few weeks ago I shared what I found when I tested a local language model against automotive warranty workflows.

Fine-tuning outperformed RAG. RAG reduced hallucinations but did not improve usefulness. Combining them made things worse, not better.

But that experiment left me with an uncomfortable question I could not shake.

Was the result about the workflow? The model? The retrieval design? Or some combination of all three?

I did not know. And the honest answer was that I could not figure it out by running the same experiment again on the same data.

So I built a second one, in a completely different domain, with a different model and a different dataset, specifically designed to challenge the conclusions I had already drawn.

That decision changed the entire direction of this project.

Building the platform first

Before I get to the banking results, I want to share something about how this work evolved, because it shaped everything that followed.

What started as a single experiment became a reusable local LLM evaluation platform.

The same orchestration pipeline, the same scoring framework, and the same evaluation engine run across both domains. When I moved from automotive to banking, the only things that changed were the dataset, the prompts, the scoring rubric, and the output schema.

That turned out to be one of the most valuable architectural decisions I made.

When the results differed across domains, I could be reasonably confident the difference was real, not just a measurement artifact from comparing two completely different setups.

Why I started with automotive

The first domain I chose was automotive warranty and service diagnostics.

Toyota Camry, model years 2018 to 2024.

Technician notes and customer complaints in this space are messy by nature: abbreviated, inconsistent, sometimes contradictory. The model had to take that messy input and produce structured JSON covering likely component, severity, probable root cause, recommended workflow, and escalation guidance.

For this I used Qwen 2.5 7B. Not because of benchmark rankings, but because it was strong at instruction following, reliable at structured JSON generation, and lightweight enough for local iteration.

Structurally, Qwen performed well. Valid JSON almost every time. Responses looked polished and convincing on the surface.

But once I started scoring the outputs systematically, the gap became clear.

Severity was overstated. Recommendations were inconsistent. Hallucination was high. The model looked operationally correct while frequently being operationally wrong.

When I compared baseline, RAG, fine-tuning, and fine-tuned plus RAG, fine-tuning improved workflow behavior more than anything else.

RAG reduced hallucinations meaningfully but barely moved usefulness. Adding RAG on top of the fine-tuned model caused performance to regress.

The model had learned internal patterns for structuring outputs and making decisions. When retrieved context did not align cleanly with those patterns, the two signals competed. The outputs became unstable in ways that were difficult to predict.

That result felt significant. But I still did not know if I was seeing something about workflows in general, or something specific to automotive troubleshooting.

So I kept going.

Moving to banking

For the second experiment I chose banking complaint triage: consumer complaint narratives converted into structured JSON for routing, escalation, and categorization workflows.

I made three deliberate changes.

New domain.

New dataset: 400 public consumer complaint records loaded into a database and transformed into a reusable evaluation format.

A different model.

Instead of Qwen, I used Hermes 3 running on Llama 3.1 8B. The reason was not benchmarks. I wanted a model with more conservative language behavior, lower hallucination tendency, and a more consistent enterprise tone.

The difference was noticeable right away. Compared to Qwen, the Hermes/Llama baseline felt safer. Hallucinations were rare. Outputs looked closer to what you would expect in a professional context.

But once I started evaluating how the model actually behaved inside the workflow, a different problem appeared.

The model understood the complaints semantically. It just did not align with how an actual organization would route and categorize them. Urgency was consistently overstated. Recommendations sounded reasonable but were not something a team could actually act on.

The model was behaving like a helpful assistant.

Not like a workflow.

The baseline numbers reflected that gap clearly.

Where retrieval behaved completely differently

This is where the banking experiment started to push back on what I thought I understood.

In automotive diagnostics, RAG barely moved the needle on usefulness. In banking, retrieval had a dramatically larger impact.

The model started aligning better with product categories. Routing became more consistent. The outputs felt less generic and more grounded in the actual patterns of the domain.

I had to sit with this for a while.

The same retrieval approach that felt underwhelming in automotive was genuinely useful in banking.

The difference, I think, comes down to the nature of the workflow itself.

Banking complaints are semantically repetitive: billing disputes, debt collection, credit reporting, late fees. Many complaints cluster around the same recurring patterns. Retrieval was giving the model something real to anchor to.

Automotive troubleshooting depends more on causal reasoning and symptom interpretation. Retrieved context in that environment introduced noise as often as it introduced signal.

Fine-tuning changed the behavior, not just the scores

Adding fine-tuning produced the largest shift in both experiments.

But what stood out was not just the numbers. It was the qualitative difference in how the outputs felt.

The fine-tuned model did not just score better. It behaved differently.

It learned routing behavior, escalation thresholds, taxonomy alignment, and recommendation structure. The baseline model understood the complaint. The fine-tuned model started behaving like the workflow.

That distinction matters more than I initially expected.

Fine-tuning is not simply adding knowledge. It is something closer to teaching a model how an organization actually operates.

At that point, it looked like retrieval and fine-tuning might naturally complement each other in banking workflows.

That assumption turned out to be wrong.

Then the experiment took a turn I did not expect

After seeing fine-tuning perform that well, I added RAG back in.

Performance collapsed.

Not slightly, but completely.

The retrieved complaints were technically relevant. The context was correct.

So why did adding more information make everything dramatically worse?

I kept turning this over. And I think the answer is not about retrieval quality.

It is about retrieval shape.

The fine-tuned model had internalized routing behavior, escalation calibration, label alignment, and how the workflow was supposed to feel. But the raw retrieved complaint narratives brought noisy emotional language, inconsistent terminology, conflicting escalation patterns, and examples that were semantically similar but practically different.

Now the model was trying to reconcile learned behavior, retrieved narrative, prompt instructions, and the current complaint all at once.

It could not hold them together cleanly.

The workflow became unstable in ways that were hard to debug and harder to trust.

Why Label-RAG changed everything

After the regression, I tried a different approach.

Instead of retrieving full complaint narratives, I retrieved only compact operational signals: product category, issue type, sub-issue, routing team, and escalation hints.

Structured labels instead of narrative context.

Narrative RAG versus Label-RAG

Label-RAG gave the model compact operational signals that reinforced learned workflow behavior instead of competing with it.

The results shifted immediately.

Retrieval was not providing more context. It was providing better-shaped context, signals that worked with the model’s learned behavior instead of pulling against it.

That changed the question I was asking entirely.

It stopped being “does retrieval help?”

It became something more like: what kind of retrieval structure actually supports a model behaving consistently inside a real workflow?

That feels like a much more useful question to be working with.

What I think these experiments are actually showing

Across both experiments, a pattern is becoming clearer to me.

Different workflows respond differently to retrieval. Different models fail differently under the same orchestration setup. Retrieval can either stabilize or destabilize learned workflow behavior depending entirely on how evidence is structured when it enters the prompt.

The most effective configurations were not the ones with the most context.

They were the ones where retrieval structure, learned behavior, and the signals the model needed were all reinforcing each other rather than competing.

I started this project trying to answer a fairly simple question about small models and enterprise workflows.

What I have ended up with is something more interesting: a clearer sense of how the architecture of a system, not just the model inside it, determines whether that system actually behaves like the workflow it is meant to support.

That feels like the right problem to keep working on.