The unmet need from generative AI
In the last year, generative models have garnered significant attention for their remarkable capabilities in processing and understanding unstructured data. From generating human-like text to creating stunning visuals, these models have displayed their prowess across various domains. Yet for most businesses, the adoption of Generative AI remains limited to narrow use cases in content generation and customer experience. The vast majority of the data businesses own across all their functions is structured, mostly relational, sometimes graph or key-value, and it will remain central to business operations and decision-making. Despite its outsized value and volume in the business world, this data has received comparatively little attention from the Generative AI community.
Structured data also poses an accessibility challenge (unlike text, images, or video): users must learn querying and/or programming languages to get at it. This gap has created an entire BI industry primarily aimed at bridging data access for non-technical users, such as sales teams or executives. Democratization of data, while embraced in principle, runs into the same challenge. Even technical users spend an inordinate amount of their time fighting database systems rather than thinking about the broader business context.
At Fractal, we recognize this disparity and have been working to unlock the emergent reasoning abilities of generative models specifically to work with structured data. By steering AI models through specialized tools and carefully crafted natural language prompts, we find that these models can be surprisingly successful in providing insights from structured data using only natural language conversation.
Avalok.ai has been made generally available to Fractal employees, and our business analysts, data scientists, and data engineers have been putting it through its paces. The alpha release has seen enough success for us to publish these findings.
The tool has three modules designed to enhance data analysis and deliver valuable insights.
IntelliSql, the first module, enables users to query and explore relational and tabular data using natural language. It can query a database, disambiguating between the relevant tables and the columns within a table to write queries matching the user's intent. It can also quickly generate joins between tables, apply column renames and transformations on the fly, and make the code and results available for users to download and use elsewhere.
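One way such a module can disambiguate tables and columns is to serialize the schema into the prompt before asking for SQL. The sketch below illustrates that framing only; the function name, schema, and prompt wording are our invented examples, not Avalok.ai's actual code.

```python
def build_sql_prompt(schema: dict[str, list[str]], question: str) -> str:
    """Serialize table/column metadata into an instruction prompt so the
    model can pick the right tables and columns before writing SQL."""
    lines = ["You are a SQL assistant. Only use the tables and columns below."]
    for table, columns in schema.items():
        lines.append(f"Table {table}({', '.join(columns)})")
    lines.append(f"Question: {question}")
    lines.append("Respond with a single SQL query and nothing else.")
    return "\n".join(lines)

# Hypothetical schema for illustration
schema = {
    "orders": ["order_id", "customer_id", "amount", "created_at"],
    "customers": ["customer_id", "name", "region"],
}
prompt = build_sql_prompt(schema, "Total order amount by region last month")
```

Grounding the prompt in the actual schema is what lets the model choose between similarly named columns instead of guessing.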
The second module allows users to generate expressive, interactive JS plots using natural language. The bot can figure out which columns are needed for plotting; it can be prompted to handle missing values, generate aggregates and transforms before plotting, and even label the axes and color the plots to the user's liking. For BI analysts, generating plots quickly without wading through dense plotting-library manuals can be a considerable productivity boost.
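A common pattern for this kind of flow, sketched below with invented names (not Avalok.ai's implementation), is to have the LLM emit a narrow JSON chart spec that is then validated and rendered into a Plotly.js call; constraining the model's output to a small spec makes validation straightforward.

```python
import json

# Hypothetical LLM output: a compact chart spec rather than raw JS
llm_output = '{"kind": "bar", "x": "region", "y": "total_amount", "title": "Sales by region"}'

def spec_to_plotly_js(spec: dict, data_var: str = "rows") -> str:
    """Render a validated chart spec into a Plotly.js snippet."""
    assert spec["kind"] in {"bar", "line", "scatter"}, "unsupported chart kind"
    trace = {
        "type": spec["kind"],
        "x": f"{data_var}.{spec['x']}",
        "y": f"{data_var}.{spec['y']}",
    }
    layout = {
        "title": spec["title"],
        "xaxis": {"title": spec["x"]},
        "yaxis": {"title": spec["y"]},
    }
    return f"Plotly.newPlot('chart', [{json.dumps(trace)}], {json.dumps(layout)});"

js = spec_to_plotly_js(json.loads(llm_output))
```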
The third module provides the same capabilities as IntelliSql, but over a Neo4j graph database, generating Cypher queries instead of SQL.
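To make the SQL/Cypher distinction concrete, here is the same hypothetical user intent rendered in both languages (the schema, labels, and relationship names are invented for illustration):

```python
question = "Which customers bought product 'X12'?"

# What IntelliSql might generate against a relational schema
sql = (
    "SELECT c.name FROM customers c "
    "JOIN orders o ON o.customer_id = c.customer_id "
    "WHERE o.product_code = 'X12';"
)

# What the graph module might generate against a Neo4j graph
cypher = (
    "MATCH (c:Customer)-[:PLACED]->(:Order)-[:CONTAINS]->(p:Product {code: 'X12'}) "
    "RETURN c.name;"
)
```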
Along with this, each module also supports:
- Error recovery at the code generation step – if the code generation step fails, the agent automatically captures the stack trace and reattempts, knowing what failed the last time.
- Invoking LLMs to detect if user questions are toxic, inappropriate, or irrelevant based on the data.
- Figuring out whether it should respond with an inline text, a markdown table, or a file download depending on the output size generated.
- Retaining the context of the conversation and building on earlier queries (with mixed success in our current testing).
- Providing the users with a tracing of its thought process and tool use – often offering the exact hint necessary to improve a failing query to make it successful.
- Saving and resuming a conversation later.
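The error-recovery behavior in the first bullet can be sketched as a retry loop that feeds the captured stack trace back into the next generation attempt. This is a simplified illustration with invented function names, not Avalok.ai's actual implementation:

```python
import traceback

def run_with_recovery(generate_code, execute, question, max_attempts=3):
    """On failure, capture the stack trace and regenerate, so the model
    knows what went wrong on the previous attempt."""
    error_context = ""
    for _ in range(max_attempts):
        code = generate_code(question, error_context)
        try:
            return execute(code)
        except Exception:
            error_context = traceback.format_exc()
    raise RuntimeError("All attempts failed")

# Stand-ins for the LLM and the execution sandbox
attempts = []
def fake_generate(question, error_context):
    attempts.append(error_context)
    return "bad" if not error_context else "good"

def fake_execute(code):
    if code == "bad":
        raise ValueError("syntax error near 'bad'")
    return 42

result = run_with_recovery(fake_generate, fake_execute, "total sales?")
```

In the stand-in run, the first attempt fails, the second attempt sees the `ValueError` trace in its context and succeeds.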
We have integrated user feedback into our data collection process, including prompts, responses, and thought traces. This data can be used to fine-tune the LLMs in the future.
Here is a quick demo to see Avalok.ai in play.
Avalok.ai is powered by LangChain, albeit with heavily customized tooling, chaining, and prompting that matches our preferred workflow of interacting with structured datasets instead of the default agents that ship out of the box. We found several ideas for improvement over LangChain agents that we have been able to test in Avalok.ai that make the workflow more predictable and resilient – two things that are hard to do with LLM applications.
The current LLMs powering the app are gpt-3.5-turbo and text-davinci-003 from OpenAI, but our architecture can swap between LLMs. We tested several open-source models and found that most lag far behind OpenAI's models at following highly structured instructions, meeting output-formatting requirements, and reasoning. We are nonetheless keeping an eye out for other LLMs.
Interestingly, behavior varies sharply even within the OpenAI family of models. We find gpt-3.5-turbo to be both more capable and more volatile, doing either great or unhelpful things. DaVinci, on the other hand, is more reliable but tends to deteriorate as user instructions become overly demanding.
Our testing has also exposed several limitations that we are working on:
Inconsistent behavior
It is possible to see an LLM fail to answer something it has answered nine times before, or label a question as inappropriate that it was happily answering five minutes ago. We intend to build thought-parallelization, voting mechanisms, and similar ideas to improve this.
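One of the ideas mentioned above, a voting mechanism, can be sketched very simply: sample the model several times and keep the most common answer, smoothing over run-to-run inconsistency. The stand-in sampler below is invented for illustration; this is not shipped in Avalok.ai yet.

```python
from collections import Counter
from itertools import cycle

def majority_vote(sample_fn, n=5):
    """Call the (stochastic) model n times and return the majority answer."""
    votes = Counter(sample_fn() for _ in range(n))
    answer, _ = votes.most_common(1)[0]
    return answer

# Stand-in for a stochastic LLM call that occasionally misfires
samples = cycle(["42", "42", "41", "42", "7"])
result = majority_vote(lambda: next(samples), n=5)
```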
Hallucinations
The LLMs can still hallucinate an answer, but by forcing them to follow a clearly defined chain and toolkit, we are finding it possible to have the LLM fail rather than hallucinate, which is the safer scenario even if a poorer UX. We are running ongoing tests on our chains, tools, and prompts to improve this further.
LLM output parsing errors
The LLM is often doing the right thing but responding in a format that leads to fatal errors; a Google search for LangChain OutputParserException will underline this issue. We have completely rewritten our output parser to deal with it and have seen big improvements over what ships with LangChain. We can now pinpoint the root cause of a parsing error and give the LLM specific, detailed instructions for recovering from it.
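The flavor of that approach can be sketched as a lenient parser that extracts the payload even when the model wraps it in prose or a code fence, and on failure returns a precise hint the model can act on. This is an invented simplification, not our production parser:

```python
import json
import re

def parse_agent_output(text):
    """Return (parsed, error_hint). Tolerates prose and code fences around
    the JSON; on failure, pinpoints the problem for the retry prompt."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None, "No JSON object found. Respond with exactly one JSON object."
    try:
        return json.loads(match.group()), None
    except json.JSONDecodeError as e:
        return None, (
            f"JSON invalid at line {e.lineno}, column {e.colno}: {e.msg}. "
            "Fix only that and resend the same structure."
        )

# A fenced response still parses cleanly
ok, _ = parse_agent_output('Sure!\n```json\n{"action": "run_sql", "action_input": "SELECT 1"}\n```')
# A malformed response yields a targeted recovery hint instead of a crash
_, hint = parse_agent_output('{"action": run_sql}')
```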
Question appropriateness and relevance
The AI may consider some questions inappropriate or irrelevant even when they are not. This interacts with the AI's memory of the conversation in unforeseen ways: we find that once a question gets labeled inappropriate, even rephrasing it tends to fail, because the bot overweights its memory of the earlier decision.
The big ideas we are currently working on are the following:
- A MongoDB agent for NoSQL databases, with which we will cover all the paradigms businesses use to organize structured data: relational, key-value, and graph.
- Adding semantic caching to improve UX and reduce latency and costs.
- Thought-parallelization, tree-of-thoughts, backtracking, and other non-linear reasoning mechanisms to improve reliability and resilience.
- Testing GPT-4, for which we are awaiting access, to compare performance; its larger context windows should also help with long conversations.
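Of the ideas above, semantic caching is the most self-contained to illustrate: cache answers keyed by the question and reuse them when a new question is close enough, skipping an LLM call entirely. In this toy sketch (invented, not Avalok.ai's implementation), string similarity via `difflib` stands in for a real embedding model:

```python
from difflib import SequenceMatcher

class SemanticCache:
    """Reuse answers for near-duplicate questions to cut latency and cost."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (question, answer)

    def lookup(self, question):
        for cached_q, answer in self.entries:
            similarity = SequenceMatcher(
                None, question.lower(), cached_q.lower()
            ).ratio()
            if similarity >= self.threshold:
                return answer  # cache hit: skip the LLM call
        return None

    def store(self, question, answer):
        self.entries.append((question, answer))

cache = SemanticCache()
cache.store("total sales by region last month", "reuse: sales_by_region.csv")
hit = cache.lookup("Total sales by region last month?")  # near-duplicate question
```

A production version would use vector embeddings and a similarity index rather than pairwise string matching, but the cache-hit logic is the same.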