HACKER Q&A
📣 mambrogi

How are companies building LLM to SQL database features in production?


How are products enabling users to ask questions against their database data?

I've been exploring different ways to do this. Nothing has felt prod-worthy.

Here are a few examples of what I mean. Each of these features lives in a sidebar of an existing SaaS app.

Ramp Intelligence chat feature: "What was our advertising spend by department?"

Intuit Assist: "What were my top selling products last month?"

Digits AI: "What's my burn rate?" (I no longer see this feature on the site)

These appear to allow a user to ask any question of their data. Unlike traditional RAG, this data lives in a SQL database. It's like having a simplified https://julius.ai/ in your product, but with the added complexity of figuring out what data to fetch before you analyze it. It's likely these companies limit the acceptable query space, though I haven't tried them.

Putting aside the question of whether these are useful features, I would love to hear opinions on how they might be built. I can't find any write-ups on this.

I've explored a handful of patterns, all of which rely on tool calling.

(1) Text to SQL. It feels risky to have an LLM write queries against even a read-only copy of a database, but it would of course provide the most utility and flexibility: you can write a single SQL query to answer most questions. You can mitigate the risk by hard-coding important values like user id and only allowing SELECT statements. Even then it feels sketchy. Am I overthinking the risk?
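For illustration, here's a rough sketch of the guardrails I have in mind (SQLite and the table/column names are just placeholders, and the wrapper assumes the generated query exposes a user_id column):

    import re
    import sqlite3

    FORBIDDEN = re.compile(
        r"\b(insert|update|delete|drop|alter|create|attach|pragma|grant)\b", re.I
    )

    def run_guarded_query(db_path: str, llm_sql: str, user_id: int, limit: int = 500):
        # Only a single SELECT statement gets through; everything else is rejected.
        sql = llm_sql.strip().rstrip(";")
        if ";" in sql or not sql.lower().startswith("select") or FORBIDDEN.search(sql):
            raise ValueError("Only a single SELECT statement is allowed")

        # Never trust the model to scope by tenant: wrap its query ourselves.
        # (Assumes the generated query exposes a user_id column; in practice you
        # might template the WHERE clause into a known query shape instead.)
        scoped = (
            f"SELECT * FROM ({sql}) AS llm_result "
            "WHERE user_id = :user_id LIMIT :limit"
        )

        # Read-only connection as a second line of defense.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            cur = conn.execute(scoped, {"user_id": user_id, "limit": limit})
            cols = [d[0] for d in cur.description]
            return [dict(zip(cols, row)) for row in cur.fetchall()]
        finally:
            conn.close()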

I've seen promising related tools, like PromptQL. What it does well is give the user a trace of the data it accessed.

(2) Broad fetching tools + a code interpreter for analysis. You build a tool corresponding to each table, which fetches data via an API call or SQL query. You include query parameters like start_date, end_date, etc.
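Something like this is what I picture for one of those per-table tools, using an OpenAI-style function schema (the names and the run_read_only_query helper are made up):

    # One fetch tool per table, exposed to the model as an OpenAI-style function
    # schema. Table, parameter, and helper names are illustrative.
    transactions_tool = {
        "type": "function",
        "function": {
            "name": "get_transactions",
            "description": "Fetch raw transactions for the current user, filtered by date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "start_date": {"type": "string", "format": "date"},
                    "end_date": {"type": "string", "format": "date"},
                    "category": {"type": "string", "description": "Optional expense category"},
                },
                "required": ["start_date", "end_date"],
            },
        },
    }

    def get_transactions(user_id: int, start_date: str, end_date: str, category: str | None = None):
        # Server-side implementation: a parameterized query the LLM never sees.
        # user_id comes from the session, not from the model.
        sql = (
            "SELECT id, date, amount, category, description "
            "FROM transactions WHERE user_id = ? AND date BETWEEN ? AND ?"
        )
        params = [user_id, start_date, end_date]
        if category:
            sql += " AND category = ?"
            params.append(category)
        return run_read_only_query(sql, params)  # hypothetical helper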

But after you call the tool you still need to manipulate the data to answer the question. Perhaps you fetched all of last month's transactions but the user asked for the sum. The LLM then calls an analysis tool (LangChain's pandas tool is one reference point) which writes some code to find the right answer.
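Roughly, the analysis step looks like this; a real deployment would run the model-written code in an isolated sandbox rather than a bare exec():

    import pandas as pd

    def run_analysis(rows: list[dict], model_code: str):
        # Load the tool output into a DataFrame and run the model-written snippet.
        # A real deployment would execute this inside an isolated sandbox
        # (container, microVM, etc.), not a bare exec().
        df = pd.DataFrame(rows)
        namespace = {"df": df, "pd": pd}
        exec(model_code, namespace)  # the model's code is expected to set `result`
        return namespace.get("result")

    # e.g. for "what was the sum of last month's transactions?" the model might write:
    # result = df["amount"].sum()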

I've explored this and it works fairly well. It's safer for production (run the analysis in a sandbox). But this pattern introduces a lot of complexity and latency, and it's also very hard to eval. What do we test for? Correct tool selection? The analysis tool itself?
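One way I can imagine scoping the eval is to score the routing step alone against a small golden set, separate from the final numeric answer (run_agent and the expected values here are hypothetical):

    # Score only the routing step (which tool, which arguments) against a small
    # golden set. run_agent and the expected values here are hypothetical.
    GOLDEN_CASES = [
        {
            "question": "What were my total expenses last month?",
            "expected_tool": "get_transactions",
            "expected_args": {"start_date": "2024-01-01", "end_date": "2024-01-31"},
        },
    ]

    def eval_tool_selection(run_agent, cases=GOLDEN_CASES):
        # run_agent(question) is assumed to return the first tool call the model
        # made, as a (tool_name, args) pair.
        correct = 0
        for case in cases:
            tool_name, args = run_agent(case["question"])
            if tool_name == case["expected_tool"] and args == case["expected_args"]:
                correct += 1
        return correct / len(cases)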

(3) A large number of pointed tools. This makes much less sense to me, but people I respect have suggested it. You build a large number of tools, each corresponding to a controlled query, trying to cover most questions a user might ask: a sum-transactions tool, an average-transaction-cost tool, a last-month's-sales tool, etc. Then you can hopefully get away without the analysis tool.
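A sketch of what that tool surface might look like (again, names and the run_read_only_query helper are made up):

    from datetime import date, timedelta

    # A few narrow, fully controlled tools, each mapping to a fixed query.
    # Function names and the run_read_only_query helper are illustrative.

    def sum_transactions(user_id: int, start_date: str, end_date: str) -> float:
        rows = run_read_only_query(
            "SELECT COALESCE(SUM(amount), 0) AS total FROM transactions "
            "WHERE user_id = ? AND date BETWEEN ? AND ?",
            [user_id, start_date, end_date],
        )
        return rows[0]["total"]

    def average_transaction_cost(user_id: int, start_date: str, end_date: str) -> float:
        rows = run_read_only_query(
            "SELECT COALESCE(AVG(amount), 0) AS avg_cost FROM transactions "
            "WHERE user_id = ? AND date BETWEEN ? AND ?",
            [user_id, start_date, end_date],
        )
        return rows[0]["avg_cost"]

    def last_months_sales(user_id: int) -> float:
        today = date.today()
        last_month_end = today.replace(day=1) - timedelta(days=1)
        last_month_start = last_month_end.replace(day=1)
        return sum_transactions(user_id, str(last_month_start), str(last_month_end))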

It's a simple place to start, but it feels like a fool's errand to try to anticipate everything a user might ask.

Furthermore, this pattern seems most at odds with the belief that LLMs will only get better. Imposing rigid structure and anticipating every tool in advance feels more like a giant if/else block than utilizing the intelligence of these models.

There are other variations I won't touch on. Let me know what I'm missing.

All of these patterns feel sketchy and hard to evaluate. There's a high probability of misinterpreting a user query and returning the wrong data. Example: if I ask for the total of last month's expenses on February 29th, do I mean all expenses in February, or in January? If people are making decisions based on your response, it's critical to get it right. Exposing the reasoning and the data used can help here. Responding in visual blocks rather than plain text, like Ramp does, is also promising. But these feel like mitigations for a fundamentally unpredictable process.
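One mitigation I can imagine is returning the interpretation alongside the answer, so the user can catch a mismatch like the February 29th case (this shape is just a sketch):

    from dataclasses import dataclass

    @dataclass
    class GroundedAnswer:
        # Return the interpretation alongside the answer so the user can catch
        # a mismatch (e.g. "last month" resolved to January vs. February-to-date).
        answer: str
        interpreted_period: tuple[str, str]  # the date range the system actually used
        query_or_tool: str                   # SQL or tool call that produced the data
        row_count: int

    # e.g. GroundedAnswer(
    #     answer="Total expenses last month were $12,430.",
    #     interpreted_period=("2024-01-01", "2024-01-31"),
    #     query_or_tool="sum_transactions(start_date='2024-01-01', end_date='2024-01-31')",
    #     row_count=87,
    # )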

What patterns do you think people are using to reliably do this type of data fetching and analysis for users?


  👤 morkalork Accepted Answer ✓
I don't think such tools will be reliable until they're backed by knowledge of the codebase that populates the data. Just because a column has a certain name and plausible-looking data in it doesn't mean that's what's really in there, especially if there are many tables and columns with similar names and data, where the nuances that differentiate them live in code and in the heads of developers. I would not trust any LLM right now to do financial reporting on any of the production data warehouses I've seen; it would be a recipe for shooting yourself in the dick.

👤 Mbranconier
I don't have a good answer to this, but I suspect the second approach, with the API call, would work well. I haven't done this in production to know. I do think any solution where you can't be 100% confident in the output would benefit from citations. Then if your Feb 29th example does return the wrong data, the user can say "ah, the model wasn't wrong, I was wrong because I was unclear".