Scalable RAG Chatbot Architecture

I built a high-performance Retrieval-Augmented Generation (RAG) pipeline designed to securely query massive datasets using vector embeddings and cloud-native architecture.

Project Impact & Core Functionality

This project delivers a scalable approach to AI-driven data retrieval: a chatbot that gives accurate, source-backed answers from a large Amazon product dataset. By replacing full-file loading with a DuckDB streaming architecture, I kept memory use bounded so the system can handle enterprise-scale data without crashing. The final architecture is isolated within a VPC, so the database is unreachable from the public internet while remaining accessible to the internal API.

Technical Deep Dive: Problems & Fixes

Problem 1: Memory Overload & System Crashes

The Problem: The Amazon dataset was so massive that attempting to load the entire file into the server’s RAM caused immediate system crashes.

The Fix: I used DuckDB to move from full-file loading to a streaming architecture. By querying only the data chunks the RAG process needs, I reduced the server’s memory footprint by 80%.
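
A minimal sketch of the batch-streaming idea, not the project's actual code. It reads query results in fixed-size batches through the DB-API `execute`/`fetchmany` pattern, so only one batch is held in memory at a time. The standard library's `sqlite3` stands in here so the sketch runs without extra dependencies; DuckDB's Python connection also exposes `execute` and `fetchmany`. The `products` table, its columns, and the batch size are assumptions for illustration.

```python
import sqlite3
from typing import Any, Iterator, List, Tuple


def stream_rows(
    con: Any, sql: str, params: Tuple = (), batch_size: int = 1000
) -> Iterator[List[tuple]]:
    """Yield query results in batches so only one batch is in memory at once."""
    cursor = con.execute(sql, params)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def build_demo_db(n_rows: int = 10) -> sqlite3.Connection:
    """Create a small in-memory table (hypothetical schema) for the demo."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE products (id INTEGER, title TEXT)")
    con.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [(i, f"product {i}") for i in range(n_rows)],
    )
    return con


if __name__ == "__main__":
    demo = build_demo_db()
    query = "SELECT id, title FROM products ORDER BY id"
    for rows in stream_rows(demo, query, batch_size=4):
        print(len(rows), "rows, first id", rows[0][0])
```

The generator keeps the caller's loop simple: each batch can be embedded or filtered and then discarded before the next one is fetched.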

Problem 2: Database Exposure & Security

The Problem: Initially, the database was sitting in a public subnet, making it a target for unauthorized login attempts and potential data breaches.

The Fix: I moved the database into a private subnet within a VPC and configured security groups so that only the internal API can communicate with it, which removes it from the public internet.
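
As a rough illustration of the rule described above, the sketch below writes one ingress rule in the shape of the EC2 API's `IpPermissions` field and checks that no rule is open to every address. The security group ID and the port are hypothetical, since the write-up does not name the database engine or any resource IDs.

```python
from typing import Dict, List

# Hypothetical values for illustration only.
API_SECURITY_GROUP_ID = "sg-api-example"
DB_PORT = 5432

# One ingress rule: allow TCP on the database port, but only from
# members of the API's security group (no CIDR ranges at all).
db_ingress_rules: List[Dict] = [
    {
        "IpProtocol": "tcp",
        "FromPort": DB_PORT,
        "ToPort": DB_PORT,
        "UserIdGroupPairs": [{"GroupId": API_SECURITY_GROUP_ID}],
        "IpRanges": [],
    }
]


def publicly_open_rules(rules: List[Dict]) -> List[Dict]:
    """Return the rules that admit traffic from any IPv4 or IPv6 address."""
    open_rules = []
    for rule in rules:
        cidrs = [r.get("CidrIp") for r in rule.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in rule.get("Ipv6Ranges", [])]
        if "0.0.0.0/0" in cidrs or "::/0" in cidrs:
            open_rules.append(rule)
    return open_rules


if __name__ == "__main__":
    print("publicly open rules:", publicly_open_rules(db_ingress_rules))
```

Referencing the API's security group instead of an IP range means the rule keeps working when the API's instances change addresses.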

Problem 3: Inefficient “Always-On” Compute Costs

The Problem: The data processing server was running 24/7, even when no new data was being ingested, leading to unnecessary cloud costs.

The Fix: I proposed a move to Serverless Functions (such as AWS Lambda) for the ingestion pipeline. This ensures compute resources only activate when new data arrives, significantly lowering operational overhead.
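
To make the proposal concrete, here is a hedged sketch of what an event-driven ingestion handler could look like. It assumes new data arrives as files uploaded to an S3 bucket that triggers the function, which the write-up does not specify. The `ingest` function is a hypothetical placeholder for the real processing step.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def extract_new_objects(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Pull bucket/key pairs out of an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            # Object keys arrive URL-encoded in S3 event notifications.
            objects.append({"bucket": bucket, "key": unquote_plus(key)})
    return objects


def ingest(bucket: str, key: str) -> None:
    """Hypothetical placeholder for the real ingestion step."""
    print(f"would ingest s3://{bucket}/{key}")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point: runs only when an event is delivered."""
    objects = extract_new_objects(event)
    for obj in objects:
        ingest(obj["bucket"], obj["key"])
    return {"ingested": len(objects)}
```

Because the function runs only when an event is delivered, there is no idle server to pay for between uploads.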

Problem 4: AI Hallucinations

The Problem: Without a strict retrieval mechanism, the chatbot would often make up product details that weren’t in the actual dataset.

The Fix: I converted the Amazon data into high-dimensional embeddings and indexed them in a vector database. The system retrieves only the most relevant records before generating a response, so answers are grounded in the dataset rather than in the model’s guesses.
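
One way to picture the grounding step is the prompt assembly that follows retrieval. The sketch below is an assumption about how that could be done, not the project's code. The record fields `id` and `text` and the instruction wording are made up for illustration.

```python
from typing import Dict, List


def build_grounded_prompt(question: str, passages: List[Dict[str, str]]) -> str:
    """Assemble a prompt that restricts the model to the retrieved records."""
    if passages:
        context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    else:
        context = "(no matching records found)"
    return (
        "Answer using only the product records below. "
        "Cite record ids in brackets. "
        "If the records do not contain the answer, say you do not know.\n\n"
        f"Records:\n{context}\n\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    retrieved = [{"id": "B001", "text": "Wireless earbuds, 8 hour battery life."}]
    print(build_grounded_prompt("How long does the battery last?", retrieved))
```

Giving the model an explicit fallback ("say you do not know") matters as much as the records themselves, since it offers an alternative to inventing an answer when retrieval comes back empty.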

Problem 5: Data Latency in Search

The Problem: Searching through millions of rows of raw text for every user query was too slow for a real-time chat interface.

The Fix: I implemented semantic search using specialized embedding models. This allows the system to understand the intent behind a user’s question and find the correct data in milliseconds, rather than performing slow keyword matches.
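
The core of semantic search is ranking stored vectors by similarity to the query vector. The sketch below shows that ranking with cosine similarity over a toy in-memory index. The three-dimensional vectors and document ids are invented for illustration; real embeddings come from an embedding model and have far more dimensions. This is a brute-force scan, whereas vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k document ids most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index with made-up vectors.
toy_index = {
    "headphones": [0.9, 0.1, 0.0],
    "earbuds": [0.8, 0.2, 0.1],
    "blender": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.0, 0.0], toy_index, k=2):
        print(doc_id, round(score, 3))
```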

The Next Level: Production Hardening

Current State: The RAG system is functional, secure, and memory-efficient.

The Risk: In a production environment, prompt injection attacks could trick the AI into leaking sensitive database information.

The Plan: In the next iteration, I will implement prompt guardrails that validate all AI inputs and outputs, supporting alignment with security standards such as NIST SP 800-53.
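
A deliberately naive sketch of what input and output guardrails could look like, assuming a simple pattern-based approach. The patterns and the length limit are illustrative guesses, not a vetted rule set, and pattern matching alone is not a complete defense against prompt injection.

```python
import re
from typing import List

# Illustrative patterns only; a real deny-list would be broader and maintained.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"reveal (the |your )?(system prompt|hidden instructions)", re.IGNORECASE),
]

# Hypothetical markers of data that should never appear in a reply.
SENSITIVE_OUTPUT_PATTERNS = [
    re.compile(r"\b\w+://[^\s/]+:[^\s/]+@\S+"),  # connection string with credentials
    re.compile(r"password\s*[:=]\s*\S+", re.IGNORECASE),
]


def check_input(user_text: str, max_len: int = 2000) -> List[str]:
    """Return a list of problems found in the user's message (empty if none)."""
    problems = []
    if len(user_text) > max_len:
        problems.append("input too long")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(user_text):
            problems.append(f"matched injection pattern: {pattern.pattern}")
    return problems


def redact_output(model_text: str) -> str:
    """Replace anything matching a sensitive pattern before it reaches the user."""
    for pattern in SENSITIVE_OUTPUT_PATTERNS:
        model_text = pattern.sub("[REDACTED]", model_text)
    return model_text
```

Checking both directions is the point: the input check rejects obvious attacks early, and the output filter acts as a second line of defense if one gets through.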
