Architecture and Implementation of Text-to-SQL Conversational Intelligence
Neha Save¹, Abhishek¹, Omkar Sadekar¹, Keerthika Kanagaraj¹, Liran Nachman², John Zinky, Ph.D.², Christopher Tate³, Taj Salavu³, and Hema Seshadri, Ph.D.¹,²
¹ Northeastern University · ² Akamai Technologies · ³ Red Hat
The previous post introduced the motivation for Text-to-SQL Conversational Intelligence, the APLC framework that structured the project, the IPEDS dataset as the data domain, and the infrastructure components. This post covers the software architecture and implementation: how each component works, how data flows through the system, and the design decisions behind each step.
Text-to-SQL Conversational Intelligence is a controlled, multi-step AI system built on a modular architecture with clear separation of responsibilities. Five components handle the end-to-end workflow: (1) data domain and semantic retrieval, (2) prompt construction and prompt engineering, (3) SQL generation via LLM, (4) query execution, and (5) response synthesis and conversation memory. Each component has a bounded role, a distinct failure mode, and can be tested or replaced independently. The following sections walk through each component, referencing the architecture diagram and the Miro diagrams that detail the internal flow (Fig. 1).

Figure 1: Basic workflow of Text-to-SQL Conversational Intelligence.
Data domain
The data domain component encompasses the IPEDS dataset, the structured metadata that describes it, and the vector database that enables semantic retrieval. As introduced in Part I, IPEDS column names are cryptic codes: GROFFER, LOCALE, GRTOTLT. A generic LLM has no intuition about what these mean without domain context. Retrieval-augmented generation (RAG) addresses this by injecting relevant metadata into the prompt at inference time, giving the model the knowledge it needs without modifying its weights.
The data dictionary for the IPEDS dataset, including table descriptions, column names, data types, and encoded value mappings, is stored in a structured JSON metadata file (data/data_for_embedding/tableinfo.json in the project repository). At startup, the DocumentRetriever converts these descriptions into vector embeddings using a custom sentence-transformer model and stores them in ChromaDB for fast similarity search. PostgreSQL stores the IPEDS data itself, with a SQLAlchemy connection established once at startup and a read-only role enforced.
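To make the flow concrete, here is a minimal sketch of the shape such a metadata file might take and how table descriptions could be flattened into per-table documents ready for embedding. The field names (`description`, `columns`, `type`) and the sample table are illustrative assumptions, not the repository's actual `tableinfo.json` keys:

```python
import json

# Hypothetical excerpt of a tableinfo.json-style file: one entry per IPEDS
# table, mapping cryptic column codes to human-readable descriptions.
TABLEINFO_JSON = """
{
  "adm2022": {
    "description": "Admissions and test scores for the 2022 cycle.",
    "columns": {
      "unitid": {"type": "integer", "description": "Unique institution ID"},
      "admssn": {"type": "integer", "description": "Total number of admissions"}
    }
  }
}
"""

def build_embedding_documents(raw_json: str) -> list[str]:
    """Flatten each table's metadata into one text document per table,
    ready to be encoded by a sentence-transformer and stored in ChromaDB."""
    tables = json.loads(raw_json)
    docs = []
    for name, info in tables.items():
        cols = ", ".join(
            f"{col} ({meta['type']}): {meta['description']}"
            for col, meta in info["columns"].items()
        )
        docs.append(f"Table {name}: {info['description']} Columns: {cols}")
    return docs

docs = build_embedding_documents(TABLEINFO_JSON)
```

One document per table keeps the retrieval unit aligned with what the retriever actually selects: whole tables, not individual columns.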
Step 1: Semantic Retrieval
With 16 tables and 1,480 columns, injecting the full schema into every prompt is expensive and confusing. The DocumentRetriever (defined in apps/langchain_bot/phases/context/ in the project repository) loads a custom sentence-transformer model, fine-tuned on IPEDS table descriptions and hosted on HuggingFace, and pre-computes embeddings for all 16 table descriptions. At query time, it encodes the incoming prompt, computes cosine similarity against all 16 pre-computed vectors, and returns the top three most relevant tables as formatted context (Fig. 2).
The retriever uses semantic similarity, not keyword matching. A prompt about “acceptance rates” correctly retrieves the admissions table even though the user never used the word “admissions.” Of all the architectural decisions in the system, semantic retrieval has the greatest impact on accuracy. If the retriever selects the wrong table, downstream SQL queries the wrong columns and returns incorrect results, regardless of how well the prompt is engineered.

Figure 2: Step 1 – Semantic table retrieval.
Step 2: Prompt construction and prompt engineering
Once the retriever has identified the relevant tables, step 2 assembles a prompt that gives the LLM enough context to generate correct SQL. This is where prompt engineering plays a central role. The prompt is not a static template. It is dynamically constructed on every call, combining the user’s prompt, the conversation history from Redis (deployed as a container service within the same NERC cluster), and the table and column metadata retrieved by RAG into a single structured instruction (Fig. 3).
The prompt enforces a PostgreSQL expert persona, injects only the retrieved table schemas as context, and includes explicit rules that constrain the LLM’s behaviour. These rules are the product of iterative testing against the IPEDS dataset. The actual system prompt used in the codebase, defined in the create_final_prompt function in the project repository, reads as follows:
You are a PostgreSQL expert. Based on the input question, generate a syntactically correct PostgreSQL query that strictly adheres to the following rules:
– Return only the generated query, with no additional text.
– Do NOT include backticks before and after the query.
Context: Use only the table names and column names provided in: {context}.
Guidelines:
1. If the context contains multiple tables, use a JOIN operation based on the column unitid.
2. Avoid querying columns that are not present in the tables.
3. Use aliases only when necessary.
4. Pay close attention to which columns belong to which table to ensure correctness.
User question: {question}
History message: {messages}
Every design choice in this prompt addresses a specific failure mode observed during testing. The instruction to join only on unitid prevents the LLM from guessing join keys. The prohibition on backticks prevents formatting that breaks downstream SQL execution. The requirement to use only columns present in the provided context prevents the model from hallucinating column names from its training data. The history message injection enables follow-up prompts. Each of these rules was added in response to a real failure observed in LangSmith traces, not as a precaution.
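The assembly itself can be sketched as a plain format string. The function name `create_final_prompt` comes from the repository, but this signature and the abbreviated template are assumptions for illustration; the actual implementation may differ:

```python
SYSTEM_TEMPLATE = """You are a PostgreSQL expert. Based on the input question, \
generate a syntactically correct PostgreSQL query that strictly adheres to the \
following rules:
- Return only the generated query, with no additional text.
- Do NOT include backticks before and after the query.
Context: Use only the table names and column names provided in: {context}.
User question: {question}
History message: {messages}"""

def create_final_prompt(context: str, question: str, messages: list[str]) -> str:
    """Combine retrieved schema context, the user's question, and the
    conversation history into one structured instruction for the LLM."""
    return SYSTEM_TEMPLATE.format(
        context=context,
        question=question,
        messages="\n".join(messages),
    )

prompt = create_final_prompt(
    context="Table adm2022: admissions data (unitid, admssn)",
    question="What was the total number of admissions in 2022?",
    messages=["user: hi", "assistant: hello"],
)
```

Nothing in the template is static except the rules: context, question, and history are all substituted fresh on every call.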

Figure 3: Step 2 – Prompt Construction.
Step 3: SQL generation
The assembled prompt is passed to the LLMProvider class (defined in apps/langchain_bot/utils/llm_provider/ in the project repository). This class abstracts the underlying model entirely. The system supports three LLM providers: GPT-3.5-Turbo via the OpenAI API, llama3.1 running locally via Ollama, and Gemini 1.5 Flash via the Google API. Switching between them requires changing a single environment variable. No application code changes are needed (Fig. 4).
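A provider-selection layer of this kind typically reduces to a small factory keyed on an environment variable. The variable name `LLM_PROVIDER` and the mapping below are illustrative assumptions, not the repository's actual configuration:

```python
import os

# Map provider names to the model each backend serves.
PROVIDERS = {
    "openai": "gpt-3.5-turbo",
    "ollama": "llama3.1",
    "gemini": "gemini-1.5-flash",
}

def select_model(env=None) -> tuple[str, str]:
    """Read the provider from the environment and return the
    (provider, model) pair, defaulting to OpenAI."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").lower()
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider}")
    return provider, PROVIDERS[provider]

pair = select_model({"LLM_PROVIDER": "ollama"})
```

Keeping the mapping in one place is what makes the swap a one-variable change: application code asks for "the model" and never names a vendor.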
We also implemented dynamic few-shot learning: prompt-and-SQL example pairs selected by semantic similarity using FAISS. When a prompt arrives, the most similar examples are retrieved and injected into the prompt alongside the table context, giving the model concrete demonstrations of IPEDS-specific query patterns, particularly for multi-table joins and domain-specific aggregations.
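The selection logic can be sketched without the FAISS dependency: a brute-force nearest-neighbour search over example embeddings, which is exactly what FAISS accelerates at scale. The example pairs and toy vectors below are invented for illustration:

```python
import math

# Hypothetical question-and-SQL example pairs; in the real system these are
# embedded once and indexed with FAISS for fast nearest-neighbour lookup.
EXAMPLES = [
    {"vec": [1.0, 0.0], "q": "How many students enrolled in 2022?",
     "sql": "SELECT SUM(efytotlt) FROM effy2022;"},
    {"vec": [0.0, 1.0], "q": "Which states have the most public colleges?",
     "sql": "SELECT stabbr, COUNT(*) FROM hd2022 GROUP BY stabbr;"},
]

def select_few_shot(query_vec, examples, k: int = 1):
    """Brute-force stand-in for the FAISS index: return the k example
    pairs whose embeddings are most similar to the incoming question."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))
    return sorted(examples, key=lambda e: cos(query_vec, e["vec"]),
                  reverse=True)[:k]

shots = select_few_shot([0.9, 0.1], EXAMPLES, k=1)
```

The retrieved pairs are then formatted into the prompt as demonstrations, so the model sees a worked IPEDS query similar to the one it is about to write.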

Figure 4: Step 3 – SQL Generation via LLM.
Step 4: Query execution
The SQL string returned by the LLM is passed directly to LangChain’s QuerySQLDataBaseTool, which executes it against the live PostgreSQL database via the SQLAlchemy connection established once at startup. We deliberately kept SQL generation and execution as two completely separate steps. Combining them into a single LLM call consistently produced hallucinations: the model would sometimes skip query execution entirely and fabricate a plausible-sounding but invented answer. Keeping execution deterministic and separate eliminated this failure mode (Fig. 5).
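The two-step split can be sketched as follows, with SQLite standing in for the read-only PostgreSQL connection and a stub standing in for the LLM call. The point of the structure is that stage 2 is plain deterministic code, so the answer can only come from the database:

```python
import sqlite3

def generate_sql(question: str) -> str:
    """Stage 1 (normally an LLM call): return a SQL string only - never
    a result. This stub stands in for the model."""
    return "SELECT COUNT(*) FROM institutions WHERE state = 'MA';"

def execute_sql(conn, sql: str):
    """Stage 2: deterministic execution against the database. The model
    has no opportunity to fabricate a result at this step."""
    return conn.execute(sql).fetchall()

# SQLite stands in for the read-only PostgreSQL connection (toy data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE institutions (unitid INTEGER, state TEXT)")
conn.executemany("INSERT INTO institutions VALUES (?, ?)",
                 [(1, "MA"), (2, "MA"), (3, "CA")])

rows = execute_sql(conn, generate_sql("How many institutions are in MA?"))
```

Because the boundary between the stages is a plain string, each stage can also be tested in isolation: generation against expected SQL, execution against expected rows.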

Figure 5: Step 4 – Query Execution.
Step 5: Response synthesis and conversation memory
The raw database result is passed back to the LLM for a second call. This time, the prompt provides the original user prompt, the SQL query that was executed, and the result set, and asks the model to produce a plain English answer. Conversation memory is handled by a custom RedisChatMessageHistory class that implements LangChain’s BaseChatMessageHistory interface. Each user session is identified by a composite key combining a user ID and a conversation ID, stored as a Redis list with a seven-day TTL. This is what makes follow-up prompts work, so a user can ask “how does that compare in California?” and the model already has the context of the previous exchange (Fig. 6).
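The memory mechanics can be sketched with a dict-backed stand-in for Redis. The composite key scheme mirrors the description above, but the exact key format (`chat:{user}:{conversation}`) is an assumption, not the repository's actual scheme:

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60  # session TTL in seconds

class InMemoryChatHistory:
    """Dict-backed stand-in for the RedisChatMessageHistory class: each
    session is a message list keyed by user ID + conversation ID."""

    def __init__(self):
        self.store: dict[str, list[str]] = {}
        self.expiry: dict[str, float] = {}

    @staticmethod
    def session_key(user_id: str, conversation_id: str) -> str:
        # Composite key; the real system stores this as a Redis list key.
        return f"chat:{user_id}:{conversation_id}"

    def add_message(self, user_id: str, conversation_id: str, msg: str):
        key = self.session_key(user_id, conversation_id)
        self.store.setdefault(key, []).append(msg)   # analogous to RPUSH
        self.expiry[key] = time.time() + SEVEN_DAYS  # analogous to EXPIRE

    def messages(self, user_id: str, conversation_id: str) -> list[str]:
        return self.store.get(self.session_key(user_id, conversation_id), [])

history = InMemoryChatHistory()
history.add_message("u42", "c1", "user: acceptance rate at MIT?")
history.add_message("u42", "c1", "assistant: about 4% in the 2022 cycle.")
```

Scoping the key to both user and conversation is what lets one user hold multiple independent threads, each with its own expiring context window.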

Figure 6: Step 5 – Response synthesis and conversation memory.
Role of Retrieval-Augmented Generation (RAG)
The DocumentRetriever stores not just table names and descriptions, but the full column metadata for each table: column names, data types, and human-readable descriptions of what each column means and what values it accepts. All of this metadata is injected into the prompt alongside the table schema, so the model knows not only which columns exist but also what they represent in the real world (Fig. 7).

Figure 7: RAG implementation with LangChain.
Why RAG over fine-tuning? Fine-tuning would have required labeled prompt-and-SQL pairs at scale, would have fixed the knowledge to the training data, and would have made updates expensive. RAG lets us inject knowledge at inference time from a structured JSON metadata file we already had. For a domain-specific dataset with well-documented schemas, RAG was faster to implement, cheaper to operate, and easier to update. If the IPEDS schema changes next year, we update the JSON file, not the model.
Guardrails: securing the system
Any system that generates and executes database queries from untrusted natural language input introduces security risks, including prompt injection, unauthorized data access, and unintended schema modifications. Guardrails are the controls that mitigate these risks. We structured our protections across three levels: input controls, tool-level controls, and output controls.
At the input level, the system addresses both prompt injection and SQL injection risks. User inputs are passed through the LLM as natural language, not concatenated into SQL strings. There is no direct string interpolation between user input and the database. The prompt itself is structured with explicit instructions that override any attempt to manipulate the LLM into generating unintended queries. Input validation ensures that only well-formed natural language prompts are processed. In a production deployment, this layer could be extended to include personally identifiable information (PII) redaction, filtering prompts that contain PII before they reach the LLM.
At the tool level, the system enforces structured schemas and constrained execution. The prompt explicitly restricts the LLM to SELECT statements. The database connection uses a read-only PostgreSQL role, meaning that even if the LLM were to generate a destructive query such as DELETE or DROP, the database itself would reject the operation. The retriever limits context to at most three of sixteen tables per prompt, enforcing schema boundary constraints. The model cannot reference a table that was not included in the retrieved context.
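The read-only role is the hard backstop, but a lightweight application-level check can reject non-SELECT statements before they ever reach the database. A minimal defence-in-depth sketch (an illustration of the idea, not the repository's actual code):

```python
import re

# Write/DDL keywords that should never appear in a generated query.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant)\b",
    re.IGNORECASE,
)

def is_safe_select(sql: str) -> bool:
    """Accept a single SELECT statement; reject anything that contains a
    write/DDL keyword or stacks multiple statements."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # stacked statements, e.g. "SELECT 1; DROP ..."
        return False
    if not stripped.lower().startswith("select"):
        return False
    return not FORBIDDEN.search(stripped)

ok = is_safe_select("SELECT unitid, admssn FROM adm2022 WHERE admssn > 1000")
bad = is_safe_select("DROP TABLE adm2022")
stacked = is_safe_select("SELECT 1; DELETE FROM adm2022")
```

A keyword filter like this is intentionally crude and would need refinement for production (quoted strings, comments), which is why the read-only database role remains the authoritative control.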
At the output level, the system validates results before presenting them to the user. The two-stage architecture, where SQL generation and response synthesis are separate LLM calls, acts as an implicit output check: the rephrasing step can only summarise data that was actually returned by the database, not data the model imagined. In future iterations, this layer could be extended to include explicit response validation, safety checks for sensitive content, and confidence scoring to flag answers where the system is uncertain.
The data redundancy problem
One of the hardest challenges had nothing to do with the LLM. Multiple IPEDS tables contain overlapping data: similar columns with slightly different names, different granularities, or different time windows. When the retriever surfaced two tables with similar column names, the LLM would sometimes reference a column from one table using an alias from another, producing a query that looked correct but returned incorrect results. Database design directly impacts LLM accuracy. Redundant data across tables is not just a storage problem; it is a retrieval and generation problem.
In this post, we covered the software architecture and implementation of Text-to-SQL Conversational Intelligence: the data domain and how RAG injects schema knowledge, the five-step chain from semantic retrieval through response synthesis, the prompt engineering that constrains SQL generation, and the guardrails that secure the system. The next post covers how we tested this system across 1,330 queries, what the numbers showed, the five failure modes we kept seeing, and the iteration loop that improved accuracy from a 69% baseline.
References
1. LangChain Expression Language (LCEL). https://python.langchain.com/docs/concepts/lcel/
2. LangChain SQL Database Toolkit. https://python.langchain.com/docs/integrations/toolkits/sql_database/
3. DataWorksAI Embedding Model. HuggingFace. https://huggingface.co/dataWorksAI/embedding_model_tabledescriptions_questions_iped6tables
4. ChromaDB Documentation. https://docs.trychroma.com/
5. FAISS: Efficient Similarity Search. Meta Research. https://github.com/facebookresearch/faiss
6. Redis Documentation. https://redis.io/docs/
7. LangSmith Documentation. https://docs.smith.langchain.com/
8. FastAPI Documentation. https://fastapi.tiangolo.com/
9. DataWorksAI GenAI Blog Series. https://dataworksai.com/wp-blog-overview/