Everyone — techie or otherwise — is talking about ChatGPT, GPT-4 and other generative AI tools built on large language models (LLMs). At their core, LLMs are designed to understand natural language and generate text that is almost indistinguishable from what a human might write. This makes them powerful tools for creating digital products and services that can interact with people in a more human-like way.
One of the exciting things I do at Thoughtworks is use technology to create digital public goods that improve people's daily lives. As part of our work at OpenNyAI, we developed Jugalbandi, a chatbot that helps users find information about government welfare schemes. That challenge, however, sounded too simple, so we embarked on making it speak in the users' native language. Here's what we learned from that experience.
Prompt engineering is fundamental
Prompt engineering is the process of issuing instructions, context and data to LLMs to obtain the desired outcome. However, given that generative AI is still nascent, learning how to write good prompts can be cumbersome and involve endless trial and error.
So, learn from others. Following leading innovators in the field of LLMs on social media has helped me quite a bit. You can also use repositories like this to stay updated.
Learn to write model-agnostic prompts. Given the rapidly evolving nature of generative AI, make sure your prompts can be used with multiple LLMs to ensure flexibility.
Exercise caution. Slight changes in prompts can change model behavior dramatically. For instance, adding a full stop at the end of an instruction conveys to the LLM that the sentence is complete and does not need to be autocompleted. Adding "let's think step by step" to the prompt can help the model reason its way to an answer.
To handle non-deterministic LLMs, it's essential to use the right parameters and write prompts correctly. Setting the temperature to zero and choosing a sensible maximum output length helps. I found this cookbook immensely useful in improving the reliability of LLM outputs.
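Here's a minimal sketch of what this can look like in practice, assuming the OpenAI Python client; the model name, prompt and token limit below are illustrative choices, not the exact ones we used in Jugalbandi.

```python
# A minimal sketch: call an LLM with deterministic-leaning parameters.
# Assumes the OpenAI Python client; model, prompt and limits are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "You are a helpful assistant for government welfare schemes.\n"
    "Question: Which schemes support girl-child education?\n"
    "Let's think step by step."  # nudges the model to reason before answering
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,   # removes most of the randomness from the output
    max_tokens=256,  # caps the output length
)
print(response.choices[0].message.content)
```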
Hallucinations, hallucinations, hallucinations…
While LLMs are very good at generating natural language text, they tend to 'hallucinate' — generate content that is completely wrong. All LLMs suffer from this problem. While it's impossible to eliminate hallucinations entirely, you can reduce their extent. To do that, we first need to understand how LLMs work.
There are two sources of knowledge for an LLM: parametric knowledge and the prompt. Parametric knowledge is what the model learns during pre-training from sources like Common Crawl and Wikipedia. The prompt is where we can insert authentic knowledge, such as expert-reviewed documents. One way to reduce hallucination is to rely less on parametric knowledge and instead insert authentic knowledge from curated documents directly into the prompt.
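As a rough illustration, a grounded prompt might look like the sketch below; the exact wording and the `build_prompt` helper are hypothetical, not what Jugalbandi uses.

```python
# A hypothetical prompt template that forces the model to lean on the knowledge
# supplied in the prompt rather than its parametric knowledge.
GROUNDED_PROMPT = """Answer the question using ONLY the documents below.
If the answer is not present in the documents, say "I don't know".

Documents:
{documents}

Question: {question}
Answer:"""

def build_prompt(documents: list[str], question: str) -> str:
    return GROUNDED_PROMPT.format(documents="\n\n".join(documents), question=question)
```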
However, the amount of information you can insert into a prompt is restricted by the context length, which typically varies from 4k tokens for GPT-3.5 (roughly six pages) to 32k tokens for GPT-4 (roughly 50 pages). This is why it's vital to select the right information for the prompt, so the model generates accurate text based on authentic documents.
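One practical consequence is checking your token budget before you put documents into the prompt. A quick sketch, assuming the tiktoken library; the 4,096-token budget corresponds to GPT-3.5's limit.

```python
# Check whether a piece of text fits within the model's context window.
# Assumes the tiktoken library; the default budget is GPT-3.5's 4k limit.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_in_context(text: str, budget: int = 4096) -> bool:
    return len(encoding.encode(text)) <= budget
```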
Using retrieval techniques to write better prompts
The core idea is to index all the documents that hold the authentic knowledge for a given task (e.g. welfare scheme information) using an embeddings model. Then convert the user's context or query into a vector using the same embeddings model and perform a nearest-neighbor search to retrieve the most relevant documents. You can then inject this information into the LLM prompt and let the LLM generate the relevant answer. In the case of the Jugalbandi chatbot, the schemes most relevant to the user's need are retrieved and their information is shown to the user.
This approach is known as retrieval-augmented language modeling. In the Jugalbandi project, we used FAISS to search over embeddings created by the OpenAI embeddings model and find the government schemes relevant to a given user. We then instructed the LLM to use only the retrieved information, with the temperature set to zero. We saw that this significantly reduced hallucinations. The source code that does this is now open source.
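For illustration, the retrieval step can be sketched as follows. This is a simplified, assumption-laden version rather than the Jugalbandi source code: the scheme texts, embedding model and number of neighbors are placeholders.

```python
# A simplified sketch of retrieval over scheme documents using FAISS and
# OpenAI embeddings. Document texts, model name and k are illustrative.
import numpy as np
import faiss
from openai import OpenAI

client = OpenAI()
scheme_docs = [
    "Scheme A: scholarships for students from low-income families ...",
    "Scheme B: housing subsidies for rural households ...",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

# Index the scheme documents once.
doc_vectors = embed(scheme_docs)
index = faiss.IndexFlatL2(doc_vectors.shape[1])
index.add(doc_vectors)

# At query time, embed the user's question and fetch the nearest documents.
query = "I need financial help for my daughter's education"
_, ids = index.search(embed([query]), 2)
retrieved = [scheme_docs[i] for i in ids[0]]

# The retrieved documents are then injected into a grounded prompt (as in the
# template earlier) and the LLM is called with temperature set to zero.
```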
Testing LLMs is tricky
As LLMs move from proof of concept to production, you must test them like any other software. But there is a catch: traditional test cases assume software is deterministic. In other words, for a given input, a predefined output is expected. The non-deterministic nature of LLMs makes them difficult to test this way: given the same input, they tend to generate different outputs. This means you may have to rely on one AI system to test another, because a correct output will often be a paraphrase of the expected answer rather than an exact match. This is an active area of research, and better metrics will evolve.
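One way this can look in practice is comparing expected and actual answers by semantic similarity instead of exact string match. The sketch below assumes the sentence-transformers library; the model name and threshold are arbitrary choices.

```python
# Test non-deterministic outputs by comparing meaning rather than exact strings.
# Assumes the sentence-transformers library; model and threshold are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantically_similar(expected: str, actual: str, threshold: float = 0.7) -> bool:
    embeddings = model.encode([expected, actual], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item() >= threshold

# The two sentences differ in wording but should be judged equivalent.
print(semantically_similar(
    "The scheme offers a scholarship of Rs 10,000 per year.",
    "Eligible students receive an annual scholarship of Rs 10,000 under this scheme.",
))
```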
Choosing the right LLM is a critical decision
When choosing an LLM, there are many considerations, such as cost, data privacy, response time, accuracy, task complexity and training data availability.
Since the Jugalbandi chatbot dealt with public information and we did not need users' personal information, we decided to use OpenAI-hosted models.
Like any software, trade-offs are inevitable. Here are a couple of things to remember while making that decision.
Bigger isn’t always better. While it’s easy to experiment with big models like GPT-4 to build a proof of concept, they may not be the best option for deployment. If the task needs sophisticated reasoning and has high variability, bigger models may be better. For simpler tasks, however, a 175B-parameter model might be overkill. Moreover, fine-tuning larger models is computationally expensive.
In such scenarios, it’s better to use large models to generate training data using few-shot or zero-shot techniques and then use this data to train smaller models. Smaller models mean faster responses and lower costs, and give you the ability to host them inside your organization.
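As a rough sketch of this pattern, you might generate labelled examples with a large model and then train a smaller model on them; the prompt, labels and output format below are illustrative assumptions, not the Jugalbandi setup.

```python
# Use a large model to generate training data for a smaller model.
# Assumes the OpenAI Python client; prompt, labels and format are illustrative.
import json
from openai import OpenAI

client = OpenAI()

SEED_PROMPT = (
    "Generate 5 user questions about government welfare schemes, each labelled "
    "with the scheme category it belongs to. Return JSON: "
    '[{"question": "...", "category": "..."}]'
)

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": SEED_PROMPT}],
    temperature=0.7,  # some diversity is useful when generating training data
)

# In practice you would validate the output before using it.
examples = json.loads(resp.choices[0].message.content)
# `examples` can now be used to fine-tune a much smaller classifier or seq2seq model.
```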
Hosting is a multi-parameter decision. Issues around data privacy or response time may mean you have to host your LLMs inside the organization or on a private cloud. However, most popular LLMs, such as GPT-4 or Bard, aren’t open and can be accessed only through an API.
On the other hand, it’s getting cheaper to fine-tune your own LLM. For example, a quantized version of Stanford Alpaca can be hosted on a decent-sized laptop. Hugging Face has created a library for parameter-efficient fine-tuning on consumer-grade GPUs. The literature shows that fine-tuning LLMs on domain-specific data can improve performance — Med-PaLM, for instance, outperforms PaLM on medical tasks. The key is having a domain-specific dataset that can be used for fine-tuning. This data can also be generated synthetically using LLMs, as in the Self-Instruct framework.
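To give a flavor of how accessible this has become, here is a minimal sketch of parameter-efficient fine-tuning with LoRA using Hugging Face's peft library; the base model and hyperparameters are illustrative, and 8-bit loading requires the bitsandbytes package.

```python
# A minimal sketch of parameter-efficient fine-tuning (LoRA) on a consumer GPU.
# Base model, LoRA hyperparameters and quantization choice are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-1.3b"  # any causal LM that fits on your GPU
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, load_in_8bit=True, device_map="auto")

lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights will be trained

# From here, train as usual (e.g. with transformers.Trainer) on your
# domain-specific dataset.
```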
Conclusion
There are many exciting practical applications of LLMs, especially in conversational AI. They demonstrate the power of combinatorial innovation and show how AI can make an impact at the bottom of the pyramid. The learnings on limiting hallucinations using retrieval-augmented language models are useful in many other applications where conversations must be grounded in external knowledge the LLM has never seen.
Disclaimer: The statements and opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Thoughtworks.