E-SQL: Direct Schema Linking via Question Enrichment in Text-to-SQL
October 7, 2024
Abstract
Translating natural language queries into Structured Query Language (Text-to-SQL or NLQ-to-SQL) is a critical task studied extensively by both the natural language processing and database communities. Its aim is to provide a natural language interface to databases (NLIDB) and to lower the barrier for non-experts. Despite recent advances made with Large Language Models (LLMs), significant challenges remain. These include handling complex database schemas, resolving ambiguity in user queries, and generating SQL queries with intricate structures that accurately reflect the user's intent. In this work, we introduce E-SQL, a novel pipeline designed to address these challenges through direct schema linking and candidate predicate augmentation. E-SQL enriches the natural language query by incorporating relevant database items (i.e., tables, columns, and values) and conditions directly into the question, bridging the gap between the query and the database structure. The pipeline uses candidate predicate augmentation to mitigate erroneous or incomplete predicates in generated SQL queries. We further investigate the impact of schema filtering, a technique widely explored in previous work, and demonstrate its diminishing returns when applied alongside advanced large language models. Comprehensive evaluations on the BIRD benchmark show that E-SQL achieves competitive performance, particularly on complex queries, with 66.29% execution accuracy on the test set. All code required to reproduce the reported results is publicly available in our GitHub repository. For more details, please see our paper.
Figure 1: Overview of the proposed E-SQL pipeline, which includes candidate predicate generation, question enrichment, and SQL refinement modules and omits the schema filtering module.
Method
The E-SQL pipeline consists of the following key components:
- Candidate SQL Generation (CSG): Generates initial SQL queries based on the natural language question.
- Candidate Predicate Generation (CPG): Extracts and incorporates likely predicates from the database.
- Question Enrichment (QE): Enhances the natural language query by linking relevant database items and conditions.
- SQL Refinement (SR): Refines generated SQL queries by correcting minor errors and ensuring execution correctness.
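The four components can be read as a single flow from question to final SQL. The following is a minimal sketch of that flow, not the project's actual implementation: the function name `e_sql_sketch`, the `LLM` callable type, the bracketed prompt tags, and the `find_predicates` and `try_execute` helpers are all hypothetical names introduced here for illustration, and the real prompts are considerably more detailed.

```python
from typing import Callable, List, Optional

# Hypothetical type: any function that maps a prompt string to model output.
LLM = Callable[[str], str]


def e_sql_sketch(
    question: str,
    schema: str,
    llm: LLM,
    find_predicates: Callable[[str], List[str]],
    try_execute: Callable[[str], Optional[str]],
) -> str:
    """Illustrative CSG -> CPG -> QE -> SR flow (assumed structure)."""
    # CSG: draft an initial SQL query from the question and the full schema.
    candidate_sql = llm(f"[CSG]\nSchema:\n{schema}\nQuestion: {question}")

    # CPG: collect likely predicates, starting from the draft query.
    predicates = find_predicates(candidate_sql)
    predicate_text = "\n".join(predicates)

    # QE: rewrite the question so it names database items and conditions.
    enriched = llm(
        f"[QE]\nSchema:\n{schema}\nPredicates:\n{predicate_text}\n"
        f"Question: {question}"
    )

    # SR: refine the draft, passing along any execution error (None if it ran).
    error = try_execute(candidate_sql)
    return llm(
        f"[SR]\nSchema:\n{schema}\nEnriched question: {enriched}\n"
        f"Candidate SQL: {candidate_sql}\nExecution error: {error}\n"
        f"Predicates:\n{predicate_text}"
    )
```

Passing the model and database helpers in as arguments keeps the sketch independent of any particular LLM client or database driver. Note that the full schema is passed to every step, reflecting the choice to leave out schema filtering.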
While schema filtering has been widely adopted in previous research, our experiments show that it can lead to performance degradation when used alongside advanced large language models (LLMs). Instead, direct schema linking through question enrichment and candidate predicate augmentation proves to be more effective, particularly on complex queries.
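To make candidate predicate augmentation more concrete, here is a rough sketch of one way it could work against a SQLite database. It assumes that string literals are pulled from the `=` and `LIKE` comparisons of a candidate query with a simple regular expression and are then matched against every text column. The function name `candidate_predicates`, the regular expression, and the column-scanning strategy are illustrative assumptions, not the pipeline's actual code.

```python
import re
import sqlite3
from typing import List


def candidate_predicates(
    conn: sqlite3.Connection, candidate_sql: str, limit: int = 5
) -> List[str]:
    """Suggest predicates whose stored values resemble literals in the query."""
    # Assumption: literals appear as = '...' or LIKE '...' in the candidate SQL.
    literals = re.findall(
        r"(?:=|LIKE)\s*'([^']*)'", candidate_sql, flags=re.IGNORECASE
    )
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    ]
    predicates: List[str] = []
    for literal in literals:
        needle = literal.strip("%")
        if not needle:
            continue
        for table in tables:
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            for _cid, name, col_type, *_rest in columns:
                if "TEXT" not in col_type.upper():
                    continue  # only scan text columns in this sketch
                rows = conn.execute(
                    f'SELECT DISTINCT "{name}" FROM "{table}" '
                    f'WHERE "{name}" LIKE ? LIMIT ?',
                    (f"%{needle}%", limit),
                ).fetchall()
                for (value,) in rows:
                    escaped = value.replace("'", "''")
                    predicates.append(f"{table}.{name} = '{escaped}'")
    return predicates
```

For example, if a hypothetical `member` table stores the name as `Sacha` while the candidate query filters on `'sacha'`, the function proposes a predicate with the stored spelling. Such predicates can then be supplied to the later steps so the final query uses values that actually exist in the database.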
Below is the E-SQL execution flow for the question with question ID 1448 in the development set:
Figure 2: E-SQL execution flow for the question with question ID 1448 in the development set
Hasan Alp Caferoglu © 2024