From Baseline to RAG to Fine-Tuning: What Actually Works for Automotive AI

I wanted to answer a simple question.

Can a general-purpose language model help with real automotive service workflows?

Not in theory. In a practical, measurable way.

So I built a proof of concept around automotive warranty and service diagnostics.

The problem

Every service organization deals with the same challenge.

Technicians and customers describe issues in messy, inconsistent ways. Warranty analysts then try to map those descriptions to known patterns, recalls, service bulletins, and past cases.

That mapping step is where time is lost. And where inconsistency creeps in.

The goal of this experiment was to reduce that friction and understand what it actually takes to turn messy inputs into structured, usable outputs.

Scope

To keep this grounded, I focused on one vehicle.

Toyota Camry, model years 2018 to 2024.

The dataset included customer complaints, recalls, manufacturer communications, and investigation summaries. The idea was to simulate how a technician note or complaint could eventually be compared against known automotive records.

This was not about building a full system. It was about understanding behavior under controlled conditions.

What I built

I created a small evaluation pipeline that let me test architecture changes in a consistent way.

It organizes automotive safety and service data, converts complaint-style text into structured service cases, generates repeatable test inputs, runs the same inputs across different model setups, and scores outputs against structured diagnostic expectations.

Figure: Automotive AI evaluation pipeline

The same test cases were run across each architecture so the comparison stayed grounded.

This made it possible to compare baseline, RAG, and fine-tuned approaches using the same dataset.
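A minimal sketch of that kind of harness, not the actual pipeline code. The names `ServiceCase`, `score`, and `evaluate` are hypothetical, the exact-match scoring rule is an assumption, and the two stub functions stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ServiceCase:
    case_id: str
    complaint: str   # messy, complaint-style input text
    expected: dict   # structured diagnostic expectation


# A "setup" is any callable that maps complaint text to a structured prediction.
Setup = Callable[[str], dict]


def score(pred: dict, expected: dict) -> float:
    """Fraction of expected fields that the prediction matches exactly."""
    if not expected:
        return 0.0
    hits = sum(1 for key, value in expected.items() if pred.get(key) == value)
    return hits / len(expected)


def evaluate(cases: List[ServiceCase], setups: Dict[str, Setup]) -> Dict[str, float]:
    """Run every setup on the same cases and return its mean score."""
    results = {}
    for name, run in setups.items():
        scores = [score(run(case.complaint), case.expected) for case in cases]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


# Toy stand-ins for real model calls (hypothetical, for illustration only).
def always_brakes(text: str) -> dict:
    return {"component": "brakes", "severity": "high"}


def keyword_stub(text: str) -> dict:
    component = "brakes" if "brake" in text.lower() else "engine"
    return {"component": component, "severity": "medium"}


cases = [
    ServiceCase("c1", "Brake pedal feels soft", {"component": "brakes", "severity": "medium"}),
    ServiceCase("c2", "Engine stalls at idle", {"component": "engine", "severity": "medium"}),
]
report = evaluate(cases, {"stub_a": always_brakes, "stub_b": keyword_stub})
```

The point of the design is that `cases` is built once and passed unchanged to every setup, so any difference in the report comes from the setup and not from the inputs.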

Part 1: Baseline model

I started with a local Qwen 2.5 7B Instruct model. No retrieval. No fine-tuning. Just the model.

I tested 50 representative service cases with messy, real-world-style inputs.

The model handled structure surprisingly well. It produced valid JSON in almost all cases and followed the expected schema consistently. It could understand symptoms and generate plausible explanations.
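A structural check of this kind can be sketched with the standard library alone. The field names in `REQUIRED_FIELDS` are an assumed schema for illustration, not the one used in the experiment.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"component": str, "severity": str, "recommendation": str}


def check_output(raw: str) -> list:
    """Return a list of structural problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems
```

Note what this does not check: an output can pass every test here and still name the wrong component. Structural validity and diagnostic correctness have to be scored separately.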

But the outputs were not reliable enough for a real workflow.

Component classification was weak. The hallucination rate was high. Severity was frequently overstated, which matters in any workflow where prioritization is critical. Recommendation quality was the biggest gap: very few outputs were actionable.
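Severity overstatement is easy to measure once severity is treated as an ordinal scale. A minimal sketch, assuming a four-level scale; the labels in `SEVERITY_RANK` and the function name `severity_bias` are hypothetical.

```python
# Assumed ordinal scale; the labels used in the experiment may differ.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def severity_bias(pairs):
    """Given (predicted, expected) severity pairs, return the share of
    predictions that are overstated, understated, or exact."""
    counts = {"overstated": 0, "understated": 0, "exact": 0}
    for predicted, expected in pairs:
        diff = SEVERITY_RANK[predicted] - SEVERITY_RANK[expected]
        if diff > 0:
            counts["overstated"] += 1
        elif diff < 0:
            counts["understated"] += 1
        else:
            counts["exact"] += 1
    total = len(pairs)
    return {key: (value / total if total else 0.0) for key, value in counts.items()}
```

Splitting errors by direction matters for triage: an overstated severity wastes analyst attention, while an understated one can hide a safety issue, so a single accuracy number hides the difference.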

The key realization was simple.

The model was very good at looking structured. It was not good at being correct.

Part 2: RAG vs fine-tuning

After establishing the baseline, I tested two approaches.

RAG to ground the model in real automotive data.

Fine-tuning to improve how the model behaves.

Figure: Model comparison across baseline, RAG, fine-tuning, and combined approaches

RAG helped factual grounding, while fine-tuning had the strongest impact on useful behavior.

On this test set, RAG eliminated hallucinations completely. That is a meaningful result, especially for safety-related workflows.

But it did not improve usefulness. Component accuracy increased only slightly. Recommendation quality stayed weak. The overall score dropped.

Fine-tuning had the largest impact. It improved classification, consistency, severity calibration, and recommendation quality.

This is where the model stopped being a sophisticated text generator and started behaving like a system: one with reliable structure, calibrated judgment, and outputs a technician could actually act on.

What this reveals

One thing became very clear.

Grounding and usefulness are not the same thing.

RAG made the model more factual. It did not make it more useful. A model can eliminate hallucinations entirely and still fail to produce actionable output.

These are separate problems that require separate solutions.

Why fine-tuned + RAG regressed

This was the most instructive part of the experiment.

Adding RAG to the fine-tuned model reduced performance across multiple dimensions. JSON consistency dropped. Component accuracy fell. Recommendation quality declined.

The likely reason goes deeper than retrieval quality.

Fine-tuning changes how the model behaves. It builds internal patterns about how to structure outputs and make decisions. When RAG injects external context that does not align cleanly with those patterns, the model has to reconcile two competing signals.

That reconciliation is not always stable. The result is drift, not toward better answers, but toward less consistent ones.

The issue is not just what you retrieve. It is how retrieved context integrates with the model’s learned behavior. That is a prompt architecture problem as much as a retrieval problem.
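One way to treat this as a prompt architecture problem is to keep the task instruction and output contract identical with or without retrieval, and to confine retrieved text to a single delimited section. This is a sketch under that assumption, not the prompt used in the experiment; `TASK`, `build_prompt`, and the section labels are hypothetical.

```python
# Hypothetical instruction text; the wording used in the experiment is not shown here.
TASK = (
    "Classify the service complaint. Respond with one JSON object with keys "
    "component, severity, recommendation."
)


def build_prompt(complaint: str, snippets: list, max_snippets: int = 3) -> str:
    """Assemble a prompt whose task and output contract never change.

    Retrieved text, when present, goes into one clearly delimited section
    placed before the complaint, and is capped at max_snippets entries.
    """
    parts = [TASK]
    kept = [s.strip() for s in snippets if s.strip()][:max_snippets]
    if kept:
        numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(kept, start=1))
        parts.append("Reference records (use only if relevant):\n" + numbered)
    parts.append("Complaint:\n" + complaint.strip())
    return "\n\n".join(parts)
```

With no snippets, the prompt reduces to the task plus the complaint. In principle that keeps the retrieval-free case close to whatever format the model was fine-tuned on, so retrieval adds a section without reshaping the rest of the prompt.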

A practical way to think about it

Use RAG when the problem is knowledge-heavy and context-driven.

Use fine-tuning when consistency and structure are non-negotiable.

Combine them only when retrieval quality and prompt design are both under control.

For workflows like warranty triage, consistency is the foundation. That makes fine-tuning the right starting point, not an optimization step.

A note on scope

This experiment was intentionally narrow. Toyota Camry, model years 2018 to 2024.

That makes the results clean and measurable, but it also raises a fair question about generalization to other makes, other model years, and edge-case complaint patterns that fall outside the training distribution.

That is the next thing I would test.

What I would improve next

The next step is not adding more data. It is improving the system design.

Better retrieval relevance starts with metadata filtering and tighter scoping, not broader context injection.

Evidence coverage needs to increase before RAG can contribute meaningfully to structured decisions. The way retrieved context enters the prompt needs to be controlled precisely, ideally through selective retrieval rather than passing everything available.
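Metadata filtering followed by selective retrieval can be sketched as below. Token overlap is used here only as a stand-in for a real similarity measure such as embedding distance. The `Record` fields, the `min_score` threshold, and the cap `k` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    make: str
    model: str
    year: int
    kind: str  # e.g. "recall" or "complaint"


def overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a placeholder for a real similarity score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve(query, records, make, model, year, min_score=0.2, k=3):
    """Filter by vehicle metadata first, then keep at most k records above min_score."""
    scoped = [r for r in records if (r.make, r.model, r.year) == (make, model, year)]
    scored = sorted(
        ((overlap(query, r.text), r) for r in scoped),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [record for s, record in scored if s >= min_score][:k]
```

The threshold means the function can return nothing at all. That is deliberate: an empty result lets the prompt fall back to the no-context path instead of injecting weakly related records.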

Only once those pieces are in place does combining fine-tuning with targeted RAG become worth attempting.

Final thought

RAG is often presented as a silver bullet.

It is not.

In this experiment, fine-tuning delivered the biggest improvement. RAG improved safety by eliminating hallucinations. Combining them requires more care than most implementations assume.

The model that looked most useful was not the one with the most knowledge.

It was the one that had learned how to behave.

That is where the real work is.