EXECUTIVE SUMMARY
1. Overview
Developed a scalable, context-aware AI chatbot widget embeddable into existing websites using Retrieval-Augmented Generation (RAG). The system utilizes a hybrid stack with a Python-based ingestion pipeline (Playwright, OpenAI embeddings) and a Next.js 15 frontend backed by NeonDB and pgvector.
2. Backend: Data Ingestion Pipeline
Constructed a six-stage pipeline: Scraping, Chunking, Embedding, Deduplication, Storage, and Retrieval. Utilized Playwright for handling JavaScript-rendered content with a "smart content detection" fallback.
- Semantic Chunking: Implemented ~1200 char chunks preserving heading context to maintain meaning.
- Metadata Scoring: Assigned priority scores to content categories (e.g., Pricing=75, Info=100) that later weight the retrieval ranking.
- Deduplication: Used cosine similarity (>0.95) to discard redundant chunks, achieving a 7.2% reduction.
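The deduplication pass can be sketched in a few lines of plain Python. This is an illustrative version, not the pipeline's actual code: it assumes each chunk carries its OpenAI embedding as a list of floats, and uses the greedy keep-first policy with the 0.95 threshold stated above.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def deduplicate(chunks: list[dict], threshold: float = 0.95) -> list[dict]:
    """Keep a chunk only if it is not a near-duplicate (similarity > threshold)
    of any chunk already kept. Each chunk: {"text": ..., "embedding": [...]}."""
    kept: list[dict] = []
    for chunk in chunks:
        if all(cosine_similarity(chunk["embedding"], k["embedding"]) <= threshold
               for k in kept):
            kept.append(chunk)
    return kept
```

Note the pairwise comparison is quadratic in chunk count, which is acceptable for a few thousand chunks; larger corpora would want an approximate-nearest-neighbor index instead.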
3. Algorithm Design & Optimization
Designed a priority boosting algorithm where BoostedScore = Similarity × (1 + PriorityScore / 100). This ensured critical business details took precedence over generic content. Optimized retrieval parameters by increasing Top-K to 8 and setting a minimum similarity threshold of 0.5.
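The boosting formula and the tuned retrieval parameters above can be combined into a short sketch. Function and variable names here are illustrative, not the project's actual identifiers; the formula, Top-K of 8, and 0.5 threshold come from the summary.

```python
def boosted_score(similarity: float, priority_score: float) -> float:
    """BoostedScore = Similarity x (1 + PriorityScore / 100)."""
    return similarity * (1 + priority_score / 100)

def retrieve(matches: list[tuple[str, float, float]],
             top_k: int = 8, min_similarity: float = 0.5):
    """matches: (chunk_text, cosine_similarity, priority_score) triples.
    Drops weak matches, then re-ranks the rest by boosted score."""
    candidates = [
        (text, boosted_score(sim, prio))
        for text, sim, prio in matches
        if sim >= min_similarity
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_k]
```

With this weighting, a pricing chunk at similarity 0.6 and priority 75 scores 1.05 and outranks a generic chunk at similarity 0.7 and priority 0, which scores 0.7.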
1. Objective & Challenge
Engineered an automated QA pipeline to grade agents on 26 criteria. Identified a critical blocker: legacy G.729 compression and mono audio from the Infinity/Amtelco switch prevented accurate speaker diarization and transcription required for AI analysis.
2. Infrastructure Discovery
Mapped the telephony path: SIP Trunk Providers → Adtran (Protocol Converter) → Infinity Switch (PRI/TDM). The legacy PRI setup was preserved for stability but caused the data quality bottleneck.
3. Architecture Re-engineering
Implemented a "bump in the wire" solution using a Network TAP before the Adtran converter. This allowed for traffic duplication: one stream to business operations, the second to a dedicated server running VoIPmonitor and AI analysis tools.
- Outcome: Successfully captured stereo, uncompressed audio without impacting live call flow.
- Next Steps: Reconfiguring Adtran to force G.711 negotiation for standardized AI-ready audio.
1. The Problem
Needed to associate legacy metadata (Agent IDs, Call Scripts) with the new stereo audio stream. Traditional polling caused database strain and latency (up to 5 seconds), which was unacceptable for real-time AI processing.
2. The Solution: Change Data Capture
Implemented an event-driven architecture using Debezium (Kafka Connect) to stream changes from MS SQL Server to a local PostgreSQL instance via Apache Kafka.
3. Key Implementation Details
- SQL Server Agent: Leveraged standard "capture jobs" and shadow tables (e.g., cdc.dbo_evaluations_CT) to read transaction logs without locking the live DB.
- Dual-Listener Kafka: Configured Docker networking to expose port 9092 for internal traffic and 9093 for the external Node.js consumer.
- Idempotency: Implemented ON CONFLICT DO UPDATE logic to handle potential duplicate message delivery, ensuring data integrity.
- Result: Achieved sub-second latency with negligible load on the legacy SAN.
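The idempotency step can be demonstrated in isolation. The production sink is PostgreSQL, but SQLite (3.24+) supports the same ON CONFLICT ... DO UPDATE upsert syntax, so the sketch below runs in-memory; the table and column names are hypothetical, not the project's actual schema.

```python
import sqlite3

# In-memory stand-in for the PostgreSQL sink; same upsert syntax.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE evaluations (
        evaluation_id INTEGER PRIMARY KEY,
        agent_id      TEXT NOT NULL,
        score         REAL NOT NULL
    )
""")

def apply_cdc_event(event: dict) -> None:
    """Upsert one change event. Replaying the same Kafka message
    leaves the row unchanged instead of raising a duplicate-key error."""
    conn.execute(
        """
        INSERT INTO evaluations (evaluation_id, agent_id, score)
        VALUES (:evaluation_id, :agent_id, :score)
        ON CONFLICT(evaluation_id) DO UPDATE SET
            agent_id = excluded.agent_id,
            score    = excluded.score
        """,
        event,
    )

event = {"evaluation_id": 1, "agent_id": "A42", "score": 91.5}
apply_cdc_event(event)
apply_cdc_event(event)  # redelivered by Kafka: no error, same end state
```

Because Kafka's default delivery guarantee is at-least-once, making the consumer's write path idempotent like this is what turns duplicate deliveries into harmless no-ops.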
1. Overview
Tasked with configuring a cluster of two NVIDIA DGX Spark units to serve a "zoo" of LLMs (Mistral 7B, MiniMax, GPT-OSS 120B) via a unified API.
2. Hardware Constraints & Solution
Discovered the DGX units were edge-optimized with only one Blackwell GPU per node, making it impossible to run all models simultaneously in FP16. Developed a strategy using Ray Serve for orchestration and vLLM for inference.
3. Optimization Techniques
- FP4 Quantization: Reduced GPT-OSS 120B memory footprint from >200GB to ~80GB, fitting it on a single node.
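The quantization figures are consistent with weights-only arithmetic; the gap between raw weight size and the ~80GB footprint is runtime overhead (KV cache, activations, CUDA buffers), which is an assumption in this sketch rather than a measured breakdown.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

fp16 = weight_memory_gb(120, 16)  # 240.0 GB of weights -- beyond one node
fp4 = weight_memory_gb(120, 4)    # 60.0 GB of weights; cache and buffers
                                  # push the serving footprint toward ~80 GB
```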
- Multiplexing: Used Ray Serve's @serve.multiplexed API to treat GPU memory as a dynamic cache with LRU eviction.
- Orchestration: Abandoned external Nginx load balancers in favor of Ray's internal routing between Head and Worker nodes.
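Ray Serve's @serve.multiplexed decorator manages the per-replica model cache itself; the LRU behavior it provides can be illustrated with a standalone sketch. The class and loader below are hypothetical stand-ins, not Ray APIs, and the loader substitutes for pulling model weights into GPU memory via vLLM.

```python
from collections import OrderedDict

class ModelMultiplexer:
    """Toy illustration of LRU model multiplexing: at most `capacity`
    models resident at once; loading a non-resident model evicts the
    least recently used one."""

    def __init__(self, loader, capacity: int = 2):
        self.loader = loader            # stand-in for vLLM weight loading
        self.capacity = capacity        # models that fit in GPU memory
        self.cache: OrderedDict[str, object] = OrderedDict()

    def get_model(self, model_id: str):
        if model_id in self.cache:
            self.cache.move_to_end(model_id)  # mark as most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the LRU model
            self.cache[model_id] = self.loader(model_id)
        return self.cache[model_id]
```

In the real deployment the same idea is expressed by decorating an async loader method with @serve.multiplexed(max_num_models_per_replica=...) and reading the requested model from serve.get_multiplexed_model_id() inside the deployment's call handler.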
- Outcome: Transformed a limited 2-GPU cluster into a flexible engine capable of serving enterprise-grade models on demand.