
An enterprise-grade, intent-aware hybrid RAG chatbot with ML-guided metadata filtering and local LLM fallback, optimized for academic department queries.
I built this production-ready Retrieval-Augmented Generation (RAG) chatbot to help students, faculty, and visitors query university department profiles, faculty rosters, lab facilities, and course syllabi in plain English. Most basic RAG setups struggle in production: they hallucinate details, pull in off-topic information (for example, confusing the Computer Science HOD with the IT HOD because the queries sound so similar), or break down when cloud APIs hit rate limits or go offline. To fix these issues, I designed a retrieval pipeline that combines ML-guided metadata filtering, hybrid vector and keyword ranking, and a local LLM fallback.
This chatbot isn't just a basic wrapper around an LLM. It's a complete, production-grade application equipped with features designed for speed, accuracy, and reliability.
To make the architecture clean and decoupled, I divided the system into two major boundaries: a lightweight, secure Web Gateway (built in Express.js) that handles public rate limiting and admin JWT verification, and a high-performance FastAPI backend hosting the ML models and retrieval logic.
Structured department files and unstructured textbooks are processed and indexed into ChromaDB. For structured files, I designed an intent-aware splitter that organizes data into semantic, count, and detail groups instead of raw character slices.
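To make the splitter idea concrete, here is a minimal sketch of how one structured roster could be turned into list, count, and detail chunks. The `Chunk` class, the `split_roster` function, the metadata key names (`type`, `category`, `topic`, `intent`), and the wording of the count sentence are all assumptions for illustration, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_roster(type_, category, topic, items):
    """Turn one structured roster into list, count, and detail chunks.

    `items` maps each entry name to a short description.
    All names and metadata keys here are hypothetical.
    """
    base = {"type": type_, "category": category, "topic": topic}
    chunks = []

    # "list" chunk: pre-formatted bullet points for the whole roster.
    bullets = "\n".join(f"- {name}" for name in items)
    chunks.append(Chunk(bullets, {**base, "intent": "list"}))

    # "count" chunk: synthetic helper sentence for counting queries.
    count_text = f"Total {topic} entries: {len(items)}"
    chunks.append(Chunk(count_text, {**base, "intent": "count"}))

    # "detail" chunks: one context-rich paragraph per entry.
    for name, description in items.items():
        chunks.append(Chunk(f"{name}: {description}", {**base, "intent": "detail"}))
    return chunks
```

Because the roster is never cut mid-entry, a question such as "how many faculty members are there?" can match the count chunk directly instead of relying on the LLM to count bullets.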
Here is a deep technical breakdown of the architectural solutions I engineered to overcome the limitations of naive RAG:
Splitting documents blindly by character counts destroys structured academic lists (like faculty rosters).
Type (e.g., department) → Category (e.g., computer_eng) → Topic (e.g., faculty / lab / syllabus)
list: Pre-formatted bullet points for rosters or course offerings.
count: Synthetic helper chunks (e.g., "Total faculty members: 15") designed specifically to satisfy counting-related queries.
detail: Context-rich paragraphs describing the items in detail.
To prevent "department bleed", where vector similarity confuses near-identical queries between different departments (e.g., "HOD of CE" vs. "HOD of IT"), I built an offline-trained query routing engine:
Stacked Feature Vector = [ Dense Sentence-Embedding (384d) || Sparse TF-IDF Vector ]
LogisticRegression models predict the query labels. High-confidence predictions are locked in as hard database filters:
Lower confidence tier: Type: 0.40, Category: 0.40, Topic: 0.50, Intent: 0.60
Higher confidence tier: Type: 0.65, Category: 0.65, Topic: 0.70
Category and Topic are joined with an OR condition (category OR topic) to avoid dropping relevant documents.
ChromaDB Filter = Type AND Intent AND (Category OR Topic)
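A minimal sketch of how classifier outputs could be turned into that filter. It builds a plain dict in the style of Chroma's `where` operators (`$and`, `$or`, `$eq`). The function name `build_where`, the shape of `predictions`, the metadata key names, and the choice of the 0.40 / 0.40 / 0.50 / 0.60 values as the cut-off for hard filtering are assumptions for illustration.

```python
# Assumed per-field cut-offs (the 0.40/0.40/0.50/0.60 values listed above).
THRESHOLDS = {"type": 0.40, "category": 0.40, "topic": 0.50, "intent": 0.60}


def build_where(predictions, thresholds=THRESHOLDS):
    """Build a filter dict: Type AND Intent AND (Category OR Topic).

    `predictions` maps field name -> (label, confidence). Fields below
    their threshold are left out, so a weak guess never becomes a filter.
    """
    def confident(name):
        label, conf = predictions.get(name, (None, 0.0))
        return label if conf >= thresholds[name] else None

    clauses = []
    for name in ("type", "intent"):
        label = confident(name)
        if label is not None:
            clauses.append({name: {"$eq": label}})

    # Category and Topic are OR-ed so one wrong guess cannot hide a document.
    either = [
        {name: {"$eq": confident(name)}}
        for name in ("category", "topic")
        if confident(name) is not None
    ]
    if len(either) == 1:
        clauses.append(either[0])
    elif either:
        clauses.append({"$or": either})

    if not clauses:
        return None        # nothing confident: search unfiltered
    if len(clauses) == 1:
        return clauses[0]  # $and / $or need at least two sub-clauses
    return {"$and": clauses}
```

The returned dict (or `None`) would then be passed as the metadata filter of the similarity search, so low-confidence fields simply widen the search instead of excluding documents.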
Running vector databases and lexical search engines (like Elasticsearch) on separate clusters is slow and resource-heavy. I resolved this by engineering a Zero-DB-Roundtrip Hybrid Pipeline:
┌─────────────────────────────────────────┐
│           Raw Ingested Query            │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 1. Similarity Search with ML Filters    │
│    - Fetches top-15 candidate chunks    │
└────────────────────┬────────────────────┘
                     │
                     ├────────────────────────┐
                     ▼ (Extract docs)         ▼ (Vector Ranks)
┌─────────────────────────────────────────┐   │
│ 2. Dynamic BM25 Retriever               │   │
│    - Rebuilds corpus of 15 candidates   │   │
│    - Performs fast keyword re-indexing  │   │
└────────────────────┬────────────────────┘   │
                     │                        │
                     ▼ (BM25 Ranks)           │
┌─────────────────────────────────────────┐   │
│ 3. Reciprocal Rank Fusion (RRF)         │◄──┘
│    - Fuses ranks using weights (60/40)  │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 4. Metadata Boost & Exact Title Boost   │
│    - Multiplies score on classifier hit │
│    - Adds +0.1 per word found in title  │
└────────────────────┬────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────┐
│ 5. Threshold Filter (>=0.4) & Top-5     │
└─────────────────────────────────────────┘
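Step 2 of the pipeline above can be sketched in plain Python. This is an illustrative Okapi BM25 scorer over the already-fetched candidates, not the retriever the project actually uses; the function name `bm25_rank`, the whitespace tokenizer, and the `k1` / `b` defaults are assumptions.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank candidate texts against `query`; return indices, best first.

    The index is rebuilt from scratch on every call, which is cheap
    because the corpus is only the handful of candidates from step 1.
    """
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many candidates each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    # Python's sort is stable, so ties keep their vector-search order.
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

Since the keyword index only covers candidates that are already in memory, no second database or search cluster is queried, which is the point of the "zero roundtrip" design.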
RRF_Score(doc) = (w_bm25 * (1 / (k_rrf + rank_bm25))) + (w_vector * (1 / (k_rrf + rank_vector)))
(Weights: BM25 $0.6$, Vector $0.4$, smoothing constant $k_{\text{rrf}} = 60$.)
Final_Score = Fused_Score + (matching_words_in_title * 0.1)
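The two formulas above translate into a few lines of Python. This is a sketch under stated assumptions: ranks are taken to be 1-based, both rankings cover the same document ids, and the function names `fuse` and `title_boost` are hypothetical. The classifier-hit multiplier from step 4 is left out because its value is not given here.

```python
def fuse(vector_ranked, bm25_ranked, w_vector=0.4, w_bm25=0.6, k_rrf=60):
    """Weighted Reciprocal Rank Fusion over two rankings of doc ids.

    Each argument lists doc ids best first; ranks are assumed 1-based.
    """
    scores = {}
    for weight, ranking in ((w_vector, vector_ranked), (w_bm25, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k_rrf + rank)
    return scores


def title_boost(score, query, title, per_word=0.1):
    """Final_Score = Fused_Score + matching_words_in_title * 0.1."""
    title_words = set(title.lower().split())
    query_words = set(query.lower().split())
    matches = sum(1 for word in query_words if word in title_words)
    return score + matches * per_word
```

With $k_{\text{rrf}} = 60$ the fused score alone cannot exceed $(0.6 + 0.4) / 61 \approx 0.016$, so under these assumptions the boosts (or some normalization step not described here) decide whether a chunk clears the 0.4 threshold.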
To achieve high availability, the backend manages a robust, local GGUF fallback mechanism:
Primary model: gemini-2.5-flash-lite for fast generation, strong reasoning, and clean structured summaries.
Rather than guessing at search quality, I built two evaluation endpoints directly into the REST API:
POST /api/v1/rag/test: Batches expected results to calculate Hit Rate and Mean Reciprocal Rank (MRR). It also calculates Doc Noise Rate (the proportion of incorrect source files retrieved) to measure prompt dilution:
MRR = (1 / total_queries) * Sum(1 / rank_of_correct_chunk)
POST /api/v1/rag/test_classifier: Evaluates classification accuracy using scikit-learn and reports Precision, Recall, and F1-Score for every predicted metadata field.
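The retrieval metrics can be computed with a short helper. This is a sketch rather than the endpoint's actual code: the function name `retrieval_metrics`, the test-case keys, and the use of source file names as a stand-in for "the correct chunk" are assumptions.

```python
def retrieval_metrics(cases):
    """Compute Hit Rate, MRR, and Doc Noise Rate over a batch of test cases.

    Each case is a dict with (hypothetical keys):
      "expected_source":   the source file that should be retrieved
      "retrieved_sources": source files of the retrieved chunks, best first
    """
    hits, rr_sum, noisy, total_retrieved = 0, 0.0, 0, 0
    for case in cases:
        expected = case["expected_source"]
        retrieved = case["retrieved_sources"]
        if expected in retrieved:
            hits += 1
            # Reciprocal rank of the first correct result (1-based rank).
            rr_sum += 1.0 / (retrieved.index(expected) + 1)
        # Every retrieved chunk from a different file counts as noise.
        noisy += sum(1 for src in retrieved if src != expected)
        total_retrieved += len(retrieved)

    n = len(cases)
    return {
        "hit_rate": hits / n if n else 0.0,
        "mrr": rr_sum / n if n else 0.0,
        "doc_noise_rate": noisy / total_retrieved if total_retrieved else 0.0,
    }
```

A query whose correct file never appears contributes 0 to both Hit Rate and MRR, so the two numbers diverge only when correct results are found but ranked low.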