
From multilingual semantic search to virtual assistants at Bosch Digital

From sprawling PDFs to a fast, factual conversational assistant.

Image credit: Alexandra Francis

This article is licensed under a CC BY-SA license.

An e-bike rider types “reset Kiox 300 display,” and the answer must land in a heartbeat—not as a 200-page manual or a dozen near-miss FAQ links. The same expectation applies to mechanics updating brake firmware in a noisy workshop and to sales reps hunting torque specs on weak showroom Wi-Fi. Bosch eBike Systems, an independent business division within the Bosch Group, serves millions of pages of manuals, release notes, and CAD drawings in twenty-seven languages. Roughly five percent of that content changes every month. For Bosch eBike Systems, this wasn’t just about efficiency; it was about elevating the customer experience and ensuring seamless support for riders, dealers, and service partners worldwide. Meeting expectations like these forced us at Bosch Digital to leave plain keyword search behind and build a retrieval engine that understands intent across languages, keeps costs predictable, and still answers in under a second.

Why did keyword search crumble?

Let's talk about why the old approach just couldn’t keep up. The world of bikes—and bike documentation—is wild with synonyms, part nicknames, and shifting terminology. “Display,” “NYON2,” or “BUI350” might all mean the same thing to a rider, but a bag-of-words search engine treats each as a stranger. Recall falls off a cliff unless you’re willing to hand-craft endless synonym lists.

Typographical quirks and voice-to-text slip-ups don’t help. Real-world queries show up as “Kioxx 300,” “réinitialiser kios,” or, thanks to muffled microphones, as voice-recognition garble like “reset chaos 300.” Exact-token searches? They just shrug and show “No results.” In contrast, embedding-based search is far more forgiving of noisy input.

Intent also gets lost in translation, especially for complex or constraint-laden queries. Someone might type, “Update brake firmware without a laptop” or “max torque under rain mode only.” Keyword search latches onto negated words (“laptop”) and dredges up the wrong docs. Modern transformer models, by contrast, grasp what the user really meant and rank results accordingly.

Combine all these headaches—synonyms, noisy input, intent confusion, rapidly changing languages—and you’ve got the main reasons keyword search kept missing the mark. For Bosch Digital, moving to a vector-based, multilingual SmartSearch wasn’t an upgrade. It was survival.

Designing a smarter way to search

Once we mapped out every pitfall of traditional keyword search, it was time to rethink the pipeline from the ground up. Today, every answer SmartSearch delivers takes a precise, three-step journey from raw document to ranked result: a journey engineered for speed, accuracy, and multilingual scale.

Step one: Crawling. Our self-developed Rust-based crawler zips through about 25 webpages per second, swiftly navigating vast documentation libraries while remaining polite enough never to trip rate limits—a digital librarian who reads fast but never ruffles feathers.
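
The crawler itself is written in Rust, but the politeness idea is simple enough to sketch in a few lines of Python: fetch concurrently, yet never exceed a fixed request budget per second. Everything below (the URL list, the budget constant) is illustrative rather than production code.

```python
import asyncio
import aiohttp

REQUESTS_PER_SECOND = 25  # illustrative budget, roughly matching the crawler's pace

async def fetch(session: aiohttp.ClientSession, url: str) -> tuple[str, str]:
    async with session.get(url) as resp:
        return url, await resp.text()

async def crawl(urls: list[str]) -> list[tuple[str, str]]:
    results = []
    async with aiohttp.ClientSession() as session:
        # Work through the URL list one "second-sized" batch at a time.
        for start in range(0, len(urls), REQUESTS_PER_SECOND):
            batch = urls[start:start + REQUESTS_PER_SECOND]
            t0 = asyncio.get_running_loop().time()
            results += await asyncio.gather(*(fetch(session, u) for u in batch))
            # Sleep off the remainder of the second so rate limits are never tripped.
            elapsed = asyncio.get_running_loop().time() - t0
            await asyncio.sleep(max(0.0, 1.0 - elapsed))
    return results
```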

Step two: Chunking before embedding. HTML gets dissected to separate titles from contents, and semantically coherent topics are stitched together using LLMs. Then come embeddings. Thanks to OpenAI's Ada 002 model (with a hefty 1536 dimensions), every content chunk lands accurately in semantic space. If a query quacks like “reset Kiox 300,” our system surfaces the right answers, even if the document's actual wording (or language) is wildly different.
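
The LLM-assisted chunking is hard to compress into a snippet, but the embedding step itself is a single API call. A minimal sketch, assuming the OpenAI Python client with an API key in the environment; the chunk text is a made-up example:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed pre-chunked text with Ada-002; each vector has 1536 dimensions."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=chunks,
    )
    return [item.embedding for item in response.data]

# One chunk, stitched together from an HTML section title and its body text.
chunk = "Kiox 300: resetting the display\nPress and hold the power button ..."
vector = embed_chunks([chunk])[0]
assert len(vector) == 1536
```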

Step three: Rank results with a hybrid approach. Semantic search isn’t always the best tool on its own. Dense vectors live in a vector database, while BM25 keeps classic keyword search in the mix. At query time, we blend the two—70% semantic, 30% sparse—then run the finalists through a MiniLM cross-encoder for the decisive sort. The result? Answers typically appear in about 750 ms, with 95% delivered in under a second and a half—even during those infamous firmware launch stampedes.
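
As a rough sketch of that fusion (not the production code), here is the 70/30 blend plus cross-encoder re-sort in Python, assuming dense similarity scores already come back from the vector store and using the public ms-marco MiniLM checkpoint as a stand-in for our reranker:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def hybrid_rank(query: str, docs: list[str], dense_scores: list[float], top_k: int = 5):
    # Sparse leg: classic BM25 over whitespace-tokenized documents.
    bm25 = BM25Okapi([d.split() for d in docs])
    sparse_scores = bm25.get_scores(query.split())

    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo + 1e-9) for x in xs]

    # Weighted fusion: 70% dense, 30% sparse.
    dense_n, sparse_n = normalize(dense_scores), normalize(sparse_scores)
    fused = [0.7 * d + 0.3 * s for d, s in zip(dense_n, sparse_n)]

    # Shortlist by fused score, then let the cross-encoder make the final call.
    finalists = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)[:top_k * 4]
    ce_scores = reranker.predict([(query, docs[i]) for i in finalists])
    ranked = sorted(zip(finalists, ce_scores), key=lambda p: p[1], reverse=True)
    return [docs[i] for i, _ in ranked[:top_k]]
```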

But all this performance wasn’t without pain. Building SmartSearch meant slamming into hard limits: a 10-million-vector-per-collection cap, painful re-indexing every time we added metadata, storage bills bloated by 32-bit floats, and no elegant way to compress, quantize, or tier storage out to cheaper SSDs. Scale much beyond eight million vectors and everything slowed to a crawl.

SmartSearch forced us to evolve—crawl, structure, represent, and rank—leaving the constraints of generic search infrastructure behind. The result is nimble, cost-effective, and fluent in every dialect your e-bike manuals throw at it.

When search becomes chat

With search bars, ten imperfect links might do. But for Bosch eBike Systems to deploy this as a conversational assistant for its global user base, there’s no room for error—the bot usually only has one shot. The very first retrieval must be laser-accurate, because every token we hand off to an LLM costs real money—and user trust evaporates if the bot’s opening statement misses the mark. Chat also explodes the data scale. Now we're not just retrieving from documentation, but juggling vast conversational histories and real-time follow-ups. Hundreds of thousands of chat snippets, forming a short-term and long-term memory, need to be stored, searched, and surfaced in milliseconds. Here, the cracks in our previous vector store yawned open: hard vector-count limits, glacial re-index times, zero support for quantization or built-in multi-stage queries, and an insistence on keeping all vectors on disk—bloating budgets and bottlenecking speed. Every shortcoming of the old architecture was amplified by chat’s relentless demand for cheaper, smarter, and scalable retrieval.

Enter Qdrant. After pitting several vector databases against our most punishing workloads, Qdrant won hands-down. On a 25k-query, multilingual test set, it delivered recall above 0.96 with quantization, kept p95 latency under 120 ms with 400 concurrent chats, and cut storage costs for our 10M-vector dataset by 16x through quantization. Qdrant didn’t just handle chat’s challenges—it thrived on them. Suddenly, lightning-fast, chat-scale retrieval was not only possible, it was affordable.
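
For flavor, this is roughly what a quantization-enabled collection looks like with the qdrant-client Python package; the endpoint, the collection name, and the choice to park the full-precision vectors on disk are illustrative, not our exact production settings:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # example endpoint

client.create_collection(
    collection_name="smartsearch-chunks",            # hypothetical name
    vectors_config=models.VectorParams(
        size=1536,                                    # Ada-002 dimensionality
        distance=models.Distance.COSINE,
        on_disk=True,                                 # full-precision originals on disk
    ),
    quantization_config=models.BinaryQuantization(
        binary=models.BinaryQuantizationConfig(always_ram=True),  # compressed vectors stay in RAM
    ),
)
```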

Slimming the brain, not the brains

Our first prototype spoke fluent relevance but was a glutton for storage. Every text chunk wrapped itself in a massive 1536-dimensional Ada-002 vector—millions of high-precision floats devouring our SSDs by the rackful. Something had to give.

The breakthrough came with Jina Embeddings v3. Flip a flag and you get binary-quantized embeddings on a 1024-dimension vector; flip another and Matryoshka Representation Learning shrinks that vector down to as few as 64 dimensions. After extensive internal testing on recall quality, we found the best performance-to-quality ratio at 256 dimensions. Overnight, the footprint dropped by ninety-eight percent, and search quality even crept up over Ada-002. In recent evaluations, this setup outperformed Ada-003 and left a few MTEB chart-toppers in the dust (we will evaluate the Qwen3 embeddings model next). And thanks to our fine-tuned ModernBERT re-ranker, any minuscule loss vanishes completely.
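
A minimal sketch of that recipe, assuming the open-weights checkpoint is loaded through sentence-transformers, which exposes both Matryoshka truncation (truncate_dim) and a binary quantization helper; the chunk text is a placeholder:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Matryoshka truncation to our 256-dimension sweet spot.
model = SentenceTransformer(
    "jinaai/jina-embeddings-v3",
    trust_remote_code=True,
    truncate_dim=256,
)

chunks = ["Kiox 300: resetting the display ..."]
embeddings = model.encode(chunks, normalize_embeddings=True)   # shape: (1, 256)

# Binary quantization: one bit per dimension, so 256 dims pack into 32 bytes per vector.
binary_embeddings = quantize_embeddings(embeddings, precision="binary")
```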

Qdrant turns those slimmed vectors into lightning answers. Because it natively understands multi-stage retrieval, we now run a two-stage search: a blistering-fast 256-dimension recall phase fused with BM25, then a fine-tuned reranker based on ModernBERT for pinpoint precision. This is what an ultra-lean operation should look like.

Most importantly, Qdrant’s tiered storage lets us keep hot shards in RAM and cold vectors chilling on SSD, cutting storage again for a total reduction of 5x while p95 latency remains well below 400 ms. Hybrid search? Dense scores blend seamlessly with BM25 in the same API call, so typo-riddled and perfect queries get equal love.
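
Sketched with the qdrant-client Query API, the dense-plus-sparse prefetch and fusion look roughly like this; the vector names, collection name, and RRF fusion are illustrative (the production blend weights dense and sparse scores 70/30), and the shortlist then goes to the ModernBERT reranker outside the database:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # example endpoint

dense_query = [0.12] * 256                              # 256-dim query embedding (placeholder values)
sparse_query = models.SparseVector(indices=[102, 4051], values=[1.2, 0.7])

hits = client.query_points(
    collection_name="smartsearch-chunks",               # assumes named "dense" and "bm25" vectors
    prefetch=[
        models.Prefetch(query=dense_query, using="dense", limit=100),
        models.Prefetch(query=sparse_query, using="bm25", limit=100),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # fuse both candidate sets in one call
    limit=20,
)
# hits.points now go to the fine-tuned ModernBERT cross-encoder for the final sort.
```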

The result: the answer to “reset Kiox 300” flashes onto a rider’s screen before the traffic light turns green—lighter vectors today, headroom for even slimmer tomorrow, and no compromises in quality. This is SmartSearch at chat speed: fast, frugal, and fiercely precise, a perfect backbone for our assistant.

Names, not guesses: How GLiNER supercharged recognition

By now, our assistant could find relevant facts with impressive speed and accuracy—but it still stumbled where it mattered most: names. “My Kiox 300 flashes 503 after the v1.7.4-B update” and “Nyon freezes on boot” appeared almost identical to a language model that didn’t truly see products, error codes, or firmware versions—just a blur of nouns and verbs. Context got lost; precision suffered. And bringing in a multi-billion-parameter AI hammer for this problem was pure overkill.

The breakthrough came from an unexpected place—a doomscroll through LinkedIn. There it was: GLiNER, promising general, lightweight NER (named-entity recognition). Few-shot learning, CPU-fast inference, and a footprint small enough (800 MB) to fit in our Docker image—GLiNER checked every box we didn’t even know we had.

It wasn’t just “easy”—it was transformative. With only a handful of annotated examples—just two for products, two for error codes, and two for firmware—GLiNER learned our entire domain in minutes. Inference was nearly instant: less than 30 ms per paragraph, even on a single laptop core.
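
Using the open-source GLiNER package, the whole integration fits in a few lines; the checkpoint name and threshold below are illustrative, and the labels mirror our domain set:

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi-v2.1")  # example multilingual checkpoint

labels = ["product", "error code", "firmware version"]
text = "My Kiox 300 flashes 503 after the v1.7.4-B update"

# Zero-/few-shot extraction: just pass the label names, no fine-tuning required.
for entity in model.predict_entities(text, labels, threshold=0.4):
    print(entity["text"], "->", entity["label"])
# e.g. "Kiox 300 -> product", "503 -> error code", "v1.7.4-B -> firmware version"
```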

With labels persisting across chat turns, context sticks. So when a rider says, “Kiox 300 shows 503 after v1.7.4-B,” then follows up with, “Does it also hit CX Gen4?” the assistant keeps every product, error, and firmware straight. Each answer is routed with surgical precision, no more mistaking a Kiox for a Nyon, no more guesswork.

All because of a LinkedIn scroll, an 800 MB model, and a few lines of labeled text. Names matter. Now, finally, the assistant knows them cold.

From answers to actions: Agentic workflows for the next-gen assistant

Finding the right paragraph is one thing. For the Bosch eBike Systems assistant, tasked with supporting diverse user needs from simple inquiries to complex troubleshooting, carrying out a real-world task—filing a warranty claim, collecting the latest firmware links for three different drive units, or guiding a mechanic step-by-step through a “display reset” in chat—demands something more. A simple pipeline falls short: modern assistants need to reason, plan, coordinate, and act, not just retrieve.

This is where agentic workflows come in.

Instead of funneling every query through a single, monolithic language model (and hoping it never drops a detail), our platform orchestrates a team of specialized AI agents, each with a defined responsibility. Picture a user asking, “My Kiox 300 flashes error 503. Can you check if my firmware is out of date, tell me how to fix it, and draft a message to support if that doesn’t work?” In the old days, that threw a tangle of ambiguous instructions at a black-box chatbot. Now, agentic workflows break the request into manageable, coordinated steps—each agent picking up what it does best.

The process starts with an orchestrator agent that parses user intent into subtasks: error code lookup, firmware verification, troubleshooting guide retrieval, and, if needed, support ticket drafting. Each subtask is routed to a specialist agent—for example, a custom reasoning workflow keyed to product variants and their corresponding information. These agents consult our retrieval backbone (built for precision, even with noisy queries), gather facts, cross-check versions, and piece together the findings.
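
The real orchestrator is LLM-driven, but its skeleton can be sketched as plain routing code; every name below is hypothetical and stands in for a specialist agent backed by the retrieval layer:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    kind: str      # e.g. "error_lookup", "firmware_check", "ticket_draft"
    payload: dict  # entities from GLiNER, prior chat context, etc.

# Hypothetical specialist agents; in production each is its own reasoning workflow.
AGENTS: dict[str, Callable[[Subtask], str]] = {
    "error_lookup": lambda t: f"Error {t.payload['code']}: see troubleshooting guide ...",
    "firmware_check": lambda t: f"Latest firmware for {t.payload['product']} is ...",
    "ticket_draft": lambda t: "Draft support message: ...",
}

def orchestrate(subtasks: list[Subtask]) -> list[str]:
    results = []
    for task in subtasks:
        agent = AGENTS.get(task.kind)
        if agent is None:
            # No specialist available: escalate rather than guess.
            results.append(f"No specialist for '{task.kind}', handing over to a human.")
            continue
        results.append(agent(task))   # each step also lands in the scratchpad log
    return results
```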

The upshot? Agentic workflows let our assistant go beyond answering “what”—they let it do “how” and “what’s next,” chaining knowledge, actions, and even human handover, seamlessly. Whether it’s a simple spec lookup, a multi-step troubleshooting procedure, or orchestrating real-world follow-ups, agentic workflows are the connective tissue behind our assistant’s leap from search box to conversational partner.

We’ve found that this modular, transparent approach doesn't just improve speed; it brings new peace of mind. When something breaks, the scratchpad log shows exactly what was done (and why). If a process hits a wall, the orchestrator pivots—never leaving the user in limbo, and never letting important details fall through the cracks.

The result: tasks handled start to finish, user intent actually understood, and the confidence that, under the hood, every answer isn’t just the luck of a generative roll but the well-planned output of agents working in concert. That’s the agentic workflow in action: the step change from answers to real assistance.

What we’d do again—and what we wouldn’t

Scars teach deeper than trophies, so here are the three that still itch (in all the right ways):

Polish the pages before you pamper the model

We once spent a solid week deduplicating near-identical paragraphs, chopping out boilerplate (“© 2021 Bosch eBike Systems. All rights reserved.”), and flattening FAQ echo chambers until they stopped swallowing fresh questions whole. The improvement in search quality? Bigger than any new encoder, model drop, or clever agent could manage—by a mile. Lesson learned: a clean, well-structured corpus is the cheapest upgrade you’ll never find on Hugging Face, and it makes every downstream agent that much sharper.

Bake shrinkage into your day-one plans

Binary quantization and dimension-slimming saved a small fortune on storage and inference. But we bolted those features on after launch, which meant re-encoding 10 million chunks while users were searching live—a gnarly headache nobody needs. Next time, the compression and size targets go on the first whiteboard, right up there with recall, latency, and now, agent handoff compatibility. Diets work better before the group photo. And it’s not just storage: your embedding model, vector database, chunking strategy, and, yes, agent workflows and communication schemes all need to work together from the start.

Complex queries mean complex agent designs

LLMs are both a blessing and a budget breaker—latency, cost, and “intelligence” all become make-or-break variables in a multi-agent system. As workflows get agentic—planning, delegating, keeping state—the challenge shifts from “Can we answer this?” to “Can we coordinate this, auditably and efficiently?” Keep the data clean, plan your storage and compute diet early, and never skimp on people who can read between the lines and handle the edge cases. Everything else is just another line on a model card or, now, an agent manifest. This collaborative endeavor, made possible by the strategic investment and close partnership with Bosch eBike Systems, has truly reshaped how information is accessed and used within their ecosystem.

In the end, it’s the painful lessons—not just the pretty graphs—that shaped SmartSearch into the system it is now. And with each round of learning, our answers get a little faster, a little sharper, and maybe, one day, just a little closer to perfect.
