Expanding Access to Data Through the APLC Framework
Neha Save¹, Abhishek¹, Omkar Sadekar¹, Keerthika Kanagaraj¹, Liran Nachman², John Zinky, Ph.D.², Christopher Tate³, Taj Salavu³, and Hema Seshadri, Ph.D.¹,²
¹ Northeastern University · ² Akamai Technologies · ³ Red Hat
Data is entering a new phase. The most consequential decisions in higher education (which colleges to fund, which programs to expand, which students to recruit) depend on data that already exists in federal databases. The problem has never been the data. The problem is that extracting it requires writing SQL, and most of the people who need the answers cannot write SQL.
We built Text-to-SQL Conversational Intelligence to close that gap. By text-to-SQL, we mean a system that takes a plain English question, identifies the relevant tables in a structured database, generates a syntactically correct SQL query, executes it, and returns a human-readable answer. The user never sees the query. The user never touches the database. They ask a question and get an answer.
This post is the first in a multi-part blog series. It covers why the data access problem matters, how we structured the project using the APLC framework, what domain knowledge the system needs to function, and the infrastructure we assembled to make it work. Subsequent posts cover the full system architecture and prompt engineering, testing and evaluation, the migration to Red Hat OpenShift AI, and the future roadmap.
Dashboards are dead
For decades, the answer to “how do we make data accessible?” has been dashboards. Dashboards pre-compute common questions into fixed visualizations. They work well for the questions they were designed to answer. They do not work for the questions nobody anticipated.
A student researching financial aid in Texas cannot wait for someone to build a dashboard for that specific question. A policy researcher comparing enrollment trends across 6,000 institutions should not need to understand SQL joins. A financial aid counselor trying to decode a column named SCUGFFN should not need a data dictionary and a database client. These are not edge cases. They are the norm. Most people who need data from structured databases cannot get it, because the tools we built assume expertise that most people do not have.
Search engines return documents, not data. Business intelligence platforms require training, licensing, and configuration. Natural language interfaces built on keyword matching break the moment a user phrases a question in a way the system does not expect. None of these approaches handle the open-ended, exploratory questions that students, researchers, and policymakers actually ask.
Generative AI changes the calculus. As covered in the DataWorksAI GenAI blog series, large language models can parse ambiguous questions, generate syntactically correct code for schemas they have never seen, and adapt their behavior to a specific domain through prompting alone. This is the capability that makes text-to-SQL viable at scale. The evolution from dashboards to conversational intelligence is part of a broader shift in how AI systems are built and deployed, one that the companion Internet of AI Agents blog series explores in the context of multi-agent architectures.
How we structured the project: the APLC framework
Before writing any code, we used the Analytics Product Life Cycle (APLC) framework to structure the project into four quadrants. The APLC framework, developed by Hema Seshadri, Ph.D., and detailed in her book Analytics for Business Success: A Guide to Analytics Fitness, forces teams to think through the business problem, the data engineering, the modeling, and the software engineering as distinct concerns before merging them into a working system (Fig. 1).

Figure 1: The Analytics Product Life Cycle™ (APLC) framework. Seshadri, H., Analytics for Business Success.
The business quadrant defined the success criteria: a non-technical user should receive an accurate, data-backed answer to a question about college data within five seconds.
The data engineering quadrant covered acquisition, cleaning, and structuring of the knowledge base. For this project, that meant loading the IPEDS federal dataset into PostgreSQL, building a structured JSON metadata file containing table descriptions, column names, data types, and encoded value mappings, and storing those embeddings in a vector database for semantic retrieval via RAG.
The modeling quadrant involved building and iterating on the text-to-SQL pipeline: experimenting with LangChain versus LlamaIndex as orchestration frameworks, implementing retrieval-augmented generation and few-shot learning, and measuring accuracy across 1,330 test queries.
The software engineering quadrant packaged everything into a deployable system: Docker Compose for local development, Red Hat OpenShift AI on the New England Research Cloud (NERC) for production, GitHub Actions for automated regression testing, LangSmith for observability, and Streamlit as the user-facing interface.
Domain knowledge: RAG and the IPEDS dataset
A generic large language model has no intuition about the structure of a specific database. It does not know which tables exist, what the columns mean, or how tables relate to one another. Retrieval-augmented generation (RAG), as detailed in the DataWorksAI GenAI blog series, solves this by injecting relevant metadata into the prompt at inference time, giving the model domain knowledge without modifying its weights. If the schema changes, we update a JSON file, not the model.
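To make this concrete, here is a sketch of what one entry in such a metadata file might look like and how it could be flattened into prompt text at inference time. The field names, the column description, and the encoded value mapping are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of one table's entry in the structured metadata file
# that RAG retrieves and injects into the prompt. Field names and
# descriptions are illustrative, not the project's actual data dictionary.
table_metadata = {
    "table": "financial_aid",
    "description": "Student financial aid awards by institution and year.",
    "columns": [
        {"name": "SCUGFFN", "type": "integer",
         "description": "Illustrative description of a cryptic IPEDS column"},
    ],
    "encoded_values": {"CONTROL": {1: "Public", 2: "Private nonprofit"}},
}

def metadata_to_prompt_text(meta: dict) -> str:
    """Flatten one metadata entry into prompt-ready text for the LLM."""
    lines = [f"Table {meta['table']}: {meta['description']}"]
    for col in meta["columns"]:
        lines.append(f"  {col['name']} ({col['type']}): {col['description']}")
    return "\n".join(lines)
```

Because the model sees this text rather than learned weights, a schema change means editing the JSON and regenerating embeddings, nothing more.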
To validate a text-to-SQL approach, we needed a domain with real complexity. We chose IPEDS, the Integrated Postsecondary Education Data System, maintained by the National Center for Education Statistics (NCES). IPEDS surveys every accredited postsecondary institution in the United States, roughly 6,000 of them, collecting data on enrollment, graduation rates, financial aid, tuition, admissions, programs offered, and institutional characteristics. The dataset we used contains 16 tables, 1,480 columns, and over 460,000 rows.
Every year, millions of students make consequential decisions about where to apply, where to enroll, and how to finance their education. IPEDS data could directly inform those decisions, if anyone could access it. Column names are cryptic codes. Tables must be joined on shared institutional identifiers. The same concept (financial aid, for example) spans multiple tables with different granularities. Getting correct answers requires not just valid SQL, but SQL that reflects a genuine understanding of the data domain.
Infrastructure requirements
The technical foundation for Text-to-SQL Conversational Intelligence is a combination of open-source frameworks and proprietary cloud infrastructure. The following sections describe each component of the architecture and its role, organized in the same order as the architecture diagram (Fig. 2).

Figure 2: Components of the architecture.
LLM provider
The LLMProvider class abstracts the underlying language model entirely. The system supports multiple LLM providers through a single environment variable, allowing the team to switch between GPT-3.5-Turbo, Ollama, and Gemini without changing any application code. This abstraction allowed the team to iterate locally with Ollama at zero API cost during development and switch to GPT-3.5-Turbo for production deployment.
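The shape of that abstraction can be sketched as follows. The class body and factory function here are illustrative stand-ins, not the project's actual `LLMProvider` implementation; only the environment-variable switch reflects the design described above:

```python
import os

# Illustrative sketch of provider abstraction behind a single environment
# variable; the real LLMProvider class wraps actual model clients.
class LLMProvider:
    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> str:
        raise NotImplementedError  # real subclasses call the model API

def get_provider() -> LLMProvider:
    # LLM_PROVIDER selects the backend without touching application code.
    name = os.environ.get("LLM_PROVIDER", "ollama")
    if name not in {"openai", "ollama", "gemini"}:
        raise ValueError(f"Unsupported provider: {name}")
    return LLMProvider(name)
```

The payoff is operational: `LLM_PROVIDER=ollama` for free local iteration, `LLM_PROVIDER=openai` for production, with no code diff between the two.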
Orchestration
LangChain provides the orchestration layer, a composable framework for building LLM-powered pipelines with full observability and debugging support. The orchestration layer composes five discrete processing steps into a single chain: semantic table retrieval, prompt construction, SQL generation, query execution, and response synthesis. Each step is a named, independently testable component. Retrieval-augmented generation (RAG) addresses the schema comprehension problem. The dataset’s column names are cryptic, and a generic LLM has no intuition about what they mean. RAG lets us inject table and column metadata, including names, descriptions, data types, and encoded value mappings, into the prompt at inference time, giving the model the domain knowledge it needs without fine-tuning. The next post covers the five-step chain and the prompt engineering in detail.
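The five-step chain can be illustrated with a framework-free sketch. Each step body below is a stub (the real steps call the retriever, the LLM, and the database through LangChain runnables), but the composition pattern, where each named step transforms a shared state and can be tested in isolation, is the point:

```python
# Pure-Python sketch of the five-step chain. Step bodies are stubs standing
# in for the retriever, the LLM, and the database; the real pipeline
# composes these as LangChain runnables.
def retrieve_tables(question: str) -> dict:
    return {"question": question, "tables": ["institutions"]}

def build_prompt(state: dict) -> dict:
    state["prompt"] = f"Tables: {state['tables']}\nQuestion: {state['question']}"
    return state

def generate_sql(state: dict) -> dict:
    state["sql"] = "SELECT COUNT(*) FROM institutions;"  # stub for the LLM call
    return state

def execute_query(state: dict) -> dict:
    state["rows"] = [(6000,)]  # stub for the database call
    return state

def synthesize_answer(state: dict) -> str:
    return f"The answer is {state['rows'][0][0]}."

STEPS = [retrieve_tables, build_prompt, generate_sql,
         execute_query, synthesize_answer]

def run_chain(question: str):
    state = question
    for step in STEPS:  # each step is a named, independently testable unit
        state = step(state)
    return state
```

Because each step is a plain function, a regression test can target any single stage, for example asserting that retrieval selects the right tables, without running the whole chain.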
Embeddings
Embeddings give the system its semantic retrieval capability. The data dictionary's text, including table descriptions, column metadata, and encoded value mappings, is converted into numeric vectors that capture meaning rather than exact wording. These embeddings are pre-computed from the structured JSON metadata file and stored in the vector database, so at query time the system can match a user's question against the tables whose descriptions are semantically closest, instead of relying on brittle keyword matching.
Database
PostgreSQL stores the IPEDS data, chosen for its SQLAlchemy compatibility, its ability to handle concurrent queries at scale, and its pgAdmin tooling for development. The SQLAlchemy connection is established once at startup and reused across requests. A read-only database role is enforced, meaning the system can only execute SELECT queries against the data.
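On top of the read-only database role, a simple application-level check can reject anything that is not a single SELECT statement before it ever reaches PostgreSQL. The sketch below is an illustrative defense-in-depth guard, not the project's actual guardrail code:

```python
# Illustrative application-level guard complementing the read-only role:
# accept only a single SELECT statement. A sketch, not the project's code.
FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE", "CREATE"}

def is_safe_query(sql: str) -> bool:
    # Reject multi-statement payloads such as "SELECT 1; DROP TABLE x".
    statements = [s.strip() for s in sql.strip().rstrip(";").split(";") if s.strip()]
    if len(statements) != 1:
        return False
    tokens = statements[0].upper().split()
    return tokens[:1] == ["SELECT"] and not FORBIDDEN.intersection(tokens)
```

The database role remains the real enforcement boundary; a check like this just fails fast with a clearer error before a round trip to the database.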
ChromaDB serves as the vector database, storing the pre-computed embeddings for semantic table retrieval. The data dictionary, including table descriptions, column metadata, and encoded value mappings, is converted from the structured JSON metadata file into vector embeddings and stored in ChromaDB for fast similarity search during query processing. MongoDB Atlas Vector Search provides an alternative for distributed deployments.
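The similarity search itself reduces to comparing vectors. The toy example below stands in for ChromaDB's nearest-neighbor lookup; the embedding values are made up for illustration, and real embeddings have hundreds of dimensions:

```python
import math

# Toy cosine-similarity retrieval over pre-computed table embeddings,
# standing in for ChromaDB's similarity search. Vectors are illustrative.
EMBEDDINGS = {
    "institutions": [0.9, 0.1, 0.0],
    "financial_aid": [0.1, 0.9, 0.2],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest_table(query_vec):
    # Return the table whose embedding is closest to the question's embedding.
    return max(EMBEDDINGS, key=lambda t: cosine(query_vec, EMBEDDINGS[t]))
```

A question about aid embeds near the `financial_aid` vector, so that table's metadata, not all 16 tables' worth, is what gets injected into the prompt.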
Memory
Redis provides conversation memory, enabling follow-up questions to work naturally, so a user can ask about graduation rates and then immediately follow up with “how does that compare in California?” without restating context. Conversation history is accessed through a custom RedisChatMessageHistory class that implements LangChain’s BaseChatMessageHistory interface. Each session is identified by a composite key combining a user ID and a conversation ID, with a seven-day TTL. Parallel conversation isolation ensures that multiple users can query the system simultaneously without interfering with each other’s context.
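The composite-key scheme is simple to sketch. The exact key format below is an assumption for illustration; only the (user ID, conversation ID) pairing and the seven-day TTL come from the design described above:

```python
# Sketch of the composite Redis session key and TTL described above.
# The "chat:" prefix and delimiter are illustrative, not the exact format.
SEVEN_DAYS_SECONDS = 7 * 24 * 60 * 60  # TTL applied to each session key

def session_key(user_id: str, conversation_id: str) -> str:
    # One key per (user, conversation) pair keeps parallel sessions isolated.
    return f"chat:{user_id}:{conversation_id}"
```

Because every conversation lives under its own key, two users (or one user with two open conversations) can never read or clobber each other's history.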
API
FastAPI combined with LangServe serves the LangChain pipeline over HTTP. A session factory reads the user_id from the request cookie and routes it to the appropriate conversation history in Redis. The API exposes a playground endpoint at port 8001 for development and testing.
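The session factory's routing logic can be shown without the web framework. The function below is a framework-free sketch (in the real system, FastAPI supplies the cookies and the history lives in Redis, not an in-process dict):

```python
# Framework-free sketch of the session factory: read user_id from the
# request cookies and route to that user's history. In production the
# cookies come from FastAPI and the histories live in Redis.
HISTORIES: dict[str, list[str]] = {}

def get_history(cookies: dict) -> list[str]:
    user_id = cookies.get("user_id", "anonymous")
    return HISTORIES.setdefault(user_id, [])
```

Two requests carrying the same `user_id` cookie resolve to the same history object, which is exactly the behavior the Redis-backed factory provides across processes.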
UI
Streamlit provides the user-facing chat interface as a separate container, connecting to FastAPI on port 8001. The user never sees SQL. They type a natural language question and receive a formatted, human-readable answer.
Observability
Observability refers to the ability to understand the internal state of a system by examining its outputs. In an LLM-powered pipeline where each step transforms the input in non-deterministic ways, observability is not optional; it is the primary mechanism for diagnosing failures. LangSmith traces every RunnableLambda step in the chain during production. This provides visibility into retrieval quality, which tables were selected and why, token usage per call, latency at each stage, and the exact prompt and SQL generated for every query. Without this level of tracing, silent failures such as wrong-table retrievals that produce syntactically valid but semantically incorrect SQL would be invisible.
Deployment
The entire stack was developed locally via Docker Compose and deployed in production on the New England Research Cloud (NERC) using Red Hat OpenShift AI. The production environment runs on 9 CPUs and 18 GB RAM with Kubernetes pod management, automatic pod restarts on failure, rolling deployments, health checks, and namespace isolation. The migration from local Docker to OpenShift involved installing PostgreSQL inside the cluster via Helm, migrating 460,000 rows via oc rsync, and cutting over six Kubernetes Secrets fields. Blog 4 in this series covers the full migration story.
The result is Text-to-SQL Conversational Intelligence, deployed as the University Explorer chatbot, where users type questions in plain English and receive data-backed answers in seconds (Fig. 3).

Figure 3: University Explorer chatbot interface showing a query and response.
In this post, we introduced the motivation for building Text-to-SQL Conversational Intelligence, described the APLC framework that structured our approach, explained the role of RAG and the IPEDS dataset as domain knowledge, and outlined the infrastructure components. The next post dives into the five-step LangChain chain, the prompt engineering that guides SQL generation, the guardrails that keep the system safe, and the design decisions behind every component.
References
- IPEDS Data Center. National Center for Education Statistics (NCES).
- Text-to-SQL Conversational Intelligence GitHub Repository. https://github.com/DataWorksAI-com/ipedsllm_text2sql_III
- Seshadri, H. (2023). Analytics for Business Success: A Guide to Analytics Fitness. https://a.co/d/ecQF7bo
- DataWorksAI GenAI Blog Series. https://dataworksai.com/wp-blog-overview/
- New England Research Cloud (NERC). Massachusetts Green High Performance Computing Center. https://nerc.mghpcc.org/
- LangChain Documentation. https://python.langchain.com/
- Red Hat OpenShift AI Documentation. https://docs.redhat.com/en/documentation/red_hat_openshift_ai_cloud_service