AI · Distributed Architecture

IntelliChat AI

Microservice-based AI orchestration platform with dynamic tool routing across RAG, web retrieval, and conversational services.

Role Lead Engineer
Stack BERT · RAG · Spring Boot · Microservices
Repository GitHub ↗
IntelliChat AI Architecture

The Challenge

Intelligent
prompt routing.

Modern conversational AI systems face a fundamental architectural challenge: a single monolithic model cannot optimally serve every query type. Document-grounded questions demand retrieval-augmented generation, factual queries benefit from real-time web search, and open-ended conversations require different inference parameters entirely.

The challenge was to engineer a system that autonomously classifies user intent in real time, routes each query to the optimal service, and orchestrates responses across multiple specialized backends, all with sub-second latency.

The Architecture

Multi-service
orchestration.

The system is decomposed into independent microservices, each responsible for a single domain of intelligence:

01

Intent Classification Layer

Fine-tuned BERT model that analyzes incoming prompts and classifies them into intent categories (RAG, web search, general chat) with confidence scoring. Fallback logic ensures graceful degradation.
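The confidence-gated routing with fallback can be sketched as follows. This is a minimal illustration, not the production code: the classifier output is stubbed, and the intent labels, threshold, and type names are assumptions standing in for the fine-tuned BERT model's actual output.

```python
from dataclasses import dataclass

INTENTS = ("rag", "web_search", "general_chat")
CONFIDENCE_THRESHOLD = 0.75  # assumed cutoff for trusting the classifier

@dataclass
class IntentPrediction:
    """Stand-in for the BERT classifier's output: a label plus a confidence score."""
    label: str
    confidence: float

def route(prediction: IntentPrediction) -> str:
    """Pick a backend service; fall back to general chat on low confidence."""
    if prediction.label in INTENTS and prediction.confidence >= CONFIDENCE_THRESHOLD:
        return prediction.label
    return "general_chat"  # graceful degradation when the classifier is unsure

# Usage
route(IntentPrediction("rag", 0.92))         # confident -> "rag"
route(IntentPrediction("web_search", 0.41))  # low confidence -> "general_chat"
```

Falling back to the conversational service, rather than rejecting the query, is what keeps degradation graceful: a misrouted query still gets an answer, just without retrieval grounding.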

02

RAG Service

Document ingestion with semantic chunking, vector embeddings via FAISS, and retrieval-augmented response generation with source attribution and context window management.
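The retrieval step reduces to nearest-neighbor search over chunk embeddings. The production service uses FAISS for this at scale; the numpy cosine-similarity sketch below shows the same principle, with random vectors standing in for a real embedding model and toy strings standing in for ingested chunks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: in the real pipeline these come from semantic chunking,
# and the vectors from an embedding model indexed in FAISS.
chunks = ["chunk about billing", "chunk about refunds", "chunk about login"]
chunk_vecs = rng.normal(size=(len(chunks), 8))  # stand-in embeddings

def top_k(query_vec: np.ndarray, k: int = 2) -> list[tuple[str, float]]:
    """Return the k chunks most similar to the query by cosine similarity."""
    a = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = a @ q
    order = np.argsort(scores)[::-1][:k]  # highest similarity first
    return [(chunks[i], float(scores[i])) for i in order]

results = top_k(rng.normal(size=8))
```

Returning the chunk text alongside its score is what enables source attribution downstream: the generator cites exactly the chunks it was conditioned on.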

03

Web Retrieval Service

Real-time web search integration with result parsing, relevance filtering, and LLM-based synthesis to generate factual, sourced responses for time-sensitive queries.
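The relevance-filtering stage can be illustrated with a deliberately crude scoring function. Everything here is an assumption for clarity: the result shape, the threshold, and the token-overlap score, which stands in for whatever relevance model the service actually applies before LLM synthesis.

```python
def relevance(query: str, snippet: str) -> float:
    """Crude token-overlap score standing in for a learned relevance model."""
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / max(len(q), 1)

def filter_results(query: str, results: list[dict], threshold: float = 0.3) -> list[dict]:
    """Drop low-relevance results, then rank the survivors best-first."""
    kept = [r for r in results if relevance(query, r["snippet"]) >= threshold]
    return sorted(kept, key=lambda r: relevance(query, r["snippet"]), reverse=True)

# Hypothetical search results for a time-sensitive query
results = [
    {"url": "https://example.com/a", "snippet": "latest GDP growth figures today"},
    {"url": "https://example.com/b", "snippet": "recipe for banana bread"},
]
kept = filter_results("GDP growth today", results)  # only the first survives
```

Keeping the URL with each surviving result is what lets the synthesis step emit sourced, attributable answers rather than unsupported claims.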

04

Spring Boot Orchestrator

Central coordinator managing request lifecycle, service health, load balancing, and response aggregation. REST API gateway for frontend consumption.
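The orchestrator itself is a Spring Boot service; its request lifecycle can be sketched language-agnostically. In the Python sketch below, the handler functions and health flags are stand-ins for real service clients and health checks, chosen only to show the dispatch-with-fallback control flow.

```python
# Assumed service registry: in production these would be HTTP clients
# behind the Spring Boot gateway, with live health probes.
HEALTHY = {"rag": True, "web_search": False, "general_chat": True}

HANDLERS = {
    "rag": lambda q: f"[rag] answer for: {q}",
    "web_search": lambda q: f"[web] answer for: {q}",
    "general_chat": lambda q: f"[chat] answer for: {q}",
}

def handle_request(query: str, intent: str) -> dict:
    """Route to the intended backend if healthy; otherwise fall back to chat."""
    service = intent if HEALTHY.get(intent) else "general_chat"
    return {"service": service, "response": HANDLERS[service](query)}

out = handle_request("what's new in AI?", "web_search")
# web_search is marked unhealthy here, so the request degrades to general_chat
```

Centralizing this decision in the orchestrator keeps the backends unaware of each other, which is what allows each one to scale and deploy independently.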

The Outcome

Autonomous
intelligence.

The resulting platform autonomously routes queries with high classification accuracy. The microservice architecture enables independent scaling of each service — the RAG pipeline handles document-heavy workloads independently of the conversational service, and each component can be updated without system-wide downtime.

The BERT classifier's confidence-based routing eliminates the need for users to manually specify their intent, creating a seamless conversational experience regardless of query type.