EXECUTIVE SUMMARY

Multi-Tenant RAG Chatbot System
Full Stack AI Engineering

1. Overview

Developed a scalable, context-aware AI chatbot widget embeddable into existing websites using Retrieval-Augmented Generation (RAG). The system uses a hybrid stack: a Python-based ingestion pipeline (Playwright, OpenAI embeddings) and a Next.js 15 frontend backed by NeonDB and pgvector.

2. Backend: Data Ingestion Pipeline

Constructed a six-stage pipeline: Scraping, Chunking, Embedding, Deduplication, Storage, and Retrieval. Used Playwright to handle JavaScript-rendered content, with a "smart content detection" fallback.

  • Semantic Chunking: Split content into ~1,200-character chunks that preserve heading context to maintain meaning.
  • Metadata Scoring: Assigned priority scores (e.g., Pricing=75, Info=100) to weight retrieval logic.
  • Deduplication: Used cosine similarity (>0.95) to discard redundant chunks, achieving a 7.2% reduction.
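The chunking and deduplication stages can be sketched without dependencies (function names and the section representation here are illustrative; in the actual pipeline the vectors come from OpenAI embeddings):

```python
import math

def chunk_sections(sections, max_chars=1200):
    """Split (heading, text) sections into ~max_chars chunks,
    prefixing each chunk with its heading to preserve context."""
    chunks = []
    for heading, text in sections:
        body = text.strip()
        step = max_chars - len(heading) - 1  # leave room for the heading line
        for i in range(0, len(body), step):
            chunks.append(f"{heading}\n{body[i:i + step]}")
    return chunks

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dedupe(embeddings, threshold=0.95):
    """Keep the first of any pair of chunks whose embeddings exceed
    the cosine-similarity threshold; return indices of survivors."""
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept
```

This greedy first-wins pass is O(n²) in chunk count, which is acceptable at ingestion time for per-site corpora.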

3. Algorithm Design & Optimization

Designed a priority boosting algorithm where BoostedScore = Similarity × (1 + PriorityScore / 100). This ensured critical business details took precedence over generic content. Optimized retrieval parameters by increasing Top-K to 8 and setting a minimum similarity threshold of 0.5.
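The boosting and filtering logic above reduces to a few lines (candidate shape and field names are illustrative; similarities would come from the pgvector query):

```python
def retrieve(candidates, top_k=8, min_similarity=0.5):
    """Rank candidates by BoostedScore = Similarity * (1 + PriorityScore / 100),
    after discarding anything below the similarity floor."""
    eligible = [c for c in candidates if c["similarity"] >= min_similarity]
    ranked = sorted(
        eligible,
        key=lambda c: c["similarity"] * (1 + c["priority"] / 100),
        reverse=True,
    )
    return ranked[:top_k]
```

For example, a pricing chunk at similarity 0.8 with priority 75 scores 0.8 × 1.75 = 1.4, outranking a generic chunk at similarity 0.9 with priority 0.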

Automated QA & VoIP Architecture
Telephony & Network Engineering

1. Objective & Challenge

Engineered an automated QA pipeline to grade agents on 26 criteria. Identified a critical blocker: legacy G.729 compression and mono audio from the Infinity/Amtelco switch prevented accurate speaker diarization and transcription required for AI analysis.

2. Infrastructure Discovery

Mapped the telephony path: SIP Trunk Providers → Adtran (Protocol Converter) → Infinity Switch (PRI/TDM). The legacy PRI setup was preserved for stability but caused the data quality bottleneck.

3. Architecture Re-engineering

Implemented a "bump in the wire" solution using a Network TAP before the Adtran converter. This allowed for traffic duplication: one stream to business operations, the second to a dedicated server running VoIPmonitor and AI analysis tools.

  • Outcome: Successfully captured stereo, uncompressed audio without impacting live call flow.
  • Next Steps: Reconfiguring Adtran to force G.711 negotiation for standardized AI-ready audio.

Real-Time CDC Pipeline
Data Engineering & Distributed Systems

1. The Problem

Needed to associate legacy metadata (Agent IDs, Call Scripts) with the new stereo audio stream. Traditional polling caused database strain and latency (up to 5 seconds), which was unacceptable for real-time AI processing.

2. The Solution: Change Data Capture

Implemented an event-driven architecture using Debezium (Kafka Connect) to stream changes from MS SQL Server to a local PostgreSQL instance via Apache Kafka.
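Debezium wraps every row change in a standard envelope (op codes "c"/"u"/"d"/"r" plus before/after row images). A sketch of how a consumer unpacks one; the row contents here are hypothetical, mirroring the evaluations table named below:

```python
import json

def unpack_change(message_value: str):
    """Extract the operation and relevant row image from a Debezium envelope."""
    payload = json.loads(message_value)["payload"]
    op = payload["op"]  # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    row = payload["before"] if op == "d" else payload["after"]
    return op, row

# A simplified update event as Debezium would emit it to Kafka.
event = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 42, "score": 80},
        "after": {"id": 42, "score": 95},
        "source": {"table": "evaluations"},
    }
})
```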

3. Key Implementation Details

SQL Server Agent: Leveraged standard "capture jobs" and shadow tables (e.g., cdc.dbo_evaluations_CT) to read the transaction log without locking the live DB.

Dual-Listener Kafka: Configured Docker networking to expose port 9092 for internal traffic and 9093 for the external Node.js consumer.
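The dual-listener split can be expressed in the broker's environment variables (a sketch against the common Confluent-style Docker images; the `kafka` hostname is illustrative):

```yaml
# Internal clients (Debezium / Kafka Connect) resolve the broker as kafka:9092;
# the external Node.js consumer connects through the published localhost:9093.
KAFKA_LISTENERS: INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka:9092,EXTERNAL://localhost:9093
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
```

The key detail is that advertised listeners must match how each client can actually reach the broker, which is why a single listener fails once traffic crosses the Docker network boundary.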

Idempotency: Implemented ON CONFLICT DO UPDATE logic to handle potential duplicate message delivery, ensuring data integrity.
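The upsert pattern can be sketched against SQLite for self-containment (the production target is PostgreSQL, where the same ON CONFLICT clause applies; table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE evaluations (id INTEGER PRIMARY KEY, score INTEGER)")

def apply_change(row):
    """Insert-or-update keyed on id, so redelivered Kafka messages
    converge to the same final state instead of duplicating rows."""
    conn.execute(
        "INSERT INTO evaluations (id, score) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET score = excluded.score",
        (row["id"], row["score"]),
    )

# Delivering the same message twice leaves exactly one row.
apply_change({"id": 42, "score": 95})
apply_change({"id": 42, "score": 95})
```

This gives at-least-once delivery the effect of exactly-once processing, since replays are harmless.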

  • Result: Achieved sub-second latency with negligible load on the legacy SAN.

Distributed Multi-LLM Inference
HPC & ML Infrastructure

1. Overview

Tasked with configuring a cluster of two NVIDIA DGX Spark units to serve a "zoo" of LLMs (Mistral 7B, MiniMax, GPT-OSS 120B) via a unified API.

2. Hardware Constraints & Solution

Discovered the DGX units were edge-optimized with only one Blackwell GPU per node, making it impossible to run all models simultaneously in FP16. Developed a strategy using Ray Serve for orchestration and vLLM for inference.

3. Optimization Techniques

  • FP4 Quantization: Reduced GPT-OSS 120B memory footprint from >200GB to ~80GB, fitting it on a single node.
  • Multiplexing: Used Ray Serve's @serve.multiplexed API to treat GPU memory as a dynamic cache with LRU eviction.
  • Orchestration: Abandoned external Nginx load balancers in favor of Ray's internal routing between Head and Worker nodes.
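In Ray Serve the multiplexing behavior comes from the @serve.multiplexed decorator; the cache semantics it provides amount to an LRU over loaded models, which can be sketched without the framework (model loading is stubbed here):

```python
from collections import OrderedDict

class ModelMultiplexer:
    """Keep at most max_models loaded; evict the least recently used,
    mirroring the cache semantics of Ray Serve's @serve.multiplexed."""

    def __init__(self, load_fn, max_models=1):
        self.load_fn = load_fn
        self.max_models = max_models
        self.cache = OrderedDict()  # model_id -> loaded model, oldest first

    def get(self, model_id):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)  # mark as recently used
        else:
            if len(self.cache) >= self.max_models:
                self.cache.popitem(last=False)  # drop LRU, freeing GPU memory
            self.cache[model_id] = self.load_fn(model_id)
        return self.cache[model_id]

mux = ModelMultiplexer(load_fn=lambda mid: f"weights:{mid}", max_models=2)
mux.get("mistral-7b")
mux.get("gpt-oss-120b")
mux.get("mistral-7b")   # refreshes recency
mux.get("minimax")      # evicts gpt-oss-120b, the least recently used
```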

Outcome: Transformed a limited 2-GPU cluster into a flexible engine capable of serving enterprise-grade models on demand.
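The FP4 figure above is consistent with back-of-the-envelope weight math (a sketch only; real footprints add KV cache, activations, and quantization scales, which is roughly where the gap between 60GB of weights and the observed ~80GB goes):

```python
def weight_gb(params_billion, bits_per_param):
    """Approximate weights-only memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

fp16 = weight_gb(120, 16)  # 240.0 GB of weights: cannot fit one node
fp4 = weight_gb(120, 4)    # 60.0 GB of weights; runtime overhead brings it near 80 GB
```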