Using Large Language Models Effectively
What 10 Cutting-Edge Companies Have Learned From Building on Top of LLMs
Since the release of ChatGPT, Large Language Models (LLMs) have taken the world by storm. At Redpoint we’ve seen an explosion of interest in using these tools. It seems every company we know has at least one team playing around with potential use cases. As companies ponder whether and how to adopt these models, we wanted to share the early lessons that companies on the cutting edge of implementing them have already learned.
We talked with ten founders and executives who have deployed AI features on top of LLMs. These included features like Hex Magic (automatically generates SQL queries and Python code), Descript’s Overdub (text to speech for videos), Canva’s Magic Write (AI text generator for designs and docs), GitHub’s Copilot, Sourcegraph’s Cody (part of a set of AI technologies integrated into Sourcegraph), Neeva’s NeevaAI (search), Mem’s AI Assistant, and support products from Elementl, Ada and Forethought.
Across our conversations, a few key themes stood out:
Getting Started
How much of an investment is this?
We were surprised by how nimble companies could be in adding these capabilities. Every company we spoke with described staffing these efforts with a small team that was rapidly experimenting. Our partner, Jason Warner, helped oversee the Copilot team as CTO at GitHub. Copilot is one of the most widely used implementations of LLMs to date, with 1.2M software engineers using it over a 12-month period. And yet, the team that originally shipped it was six people.
Companies like Hex have found that using third-party LLMs from providers like OpenAI, Anthropic, Cohere and AI21 is a quick way to get started and test the feasibility of different products. Regardless of whether companies use third-party LLMs or open source models, many cited the importance of a rapid pace of experimentation. “You want to have very low friction getting from the idea stage to validating if something works. If you don’t, there are so many ideas that you end up not trying out,” said Forethought CTO Sami Ghoche.
What different options exist?
Companies can choose between:
Using a third-party LLM company – the company sends input queries via API to a general or fine-tuned model and receives output back. This is probably the simplest option and requires the least internal ML expertise, though it can be more expensive (we’ll be exploring this further in a future post).
Taking an open source LLM and fine-tuning it – the company downloads a model with the exact weights, fine-tunes it and deploys it themselves.
Training their own model – the company trains a model from scratch on its own data.
Companies are taking different approaches today, but the considerations we routinely heard for deciding among them included:
Cost
Latency
Quality bar
Comfort exporting data
Need to run locally
Internal capabilities
LLMs have well-documented deficiencies today, including limited context windows, a need for prompt engineering, factual inaccuracies and problematic outputs. But the companies we’ve talked to have found clever ways to get around these issues and deploy widely used products.
Eight Key Lessons For Building With LLMs
1. Engage users in co-creation
One theme that kept coming up was that the user experience is as important as, if not more important than, the underlying model.
For example, Descript put a lot of thought into how they designed their product experience around AI. “It is forgiving if the AI makes a mistake. If our text to speech feature makes a mistake you can hit a shortcut and just retype,” says Kundan Kumar, the Head of AI at Descript. They have similar escape hatches built into a lot of their AI products. In their sound studio, you can use a dial to adjust how much you want the AI to clean up the sound.
This approach resonated with what Canva’s Ahmad Iqbal shared with us:
“One thing we’ve learned from our user interviews is that people like feeling in control of what they are creating. They don’t want an AI to create the exact things for them. They’d much prefer an AI give suggestions and feedback rather than just do the job.” - Ahmad Iqbal, AI product lead at Canva
In his mind, it harkens back to an often-cited business case study from Betty Crocker: when the company originally tried to introduce an easy-to-make cake mix that just required water, it wasn’t popular. But adding a step that required cracking eggs, a bit more co-creation, allowed the product to sell.
This co-creation with easy escape hatches is exactly how Jason Warner and the GitHub team approached Copilot. The key user experience decision for them was how often to surface these suggestions given the latency impact of more frequent suggestions (they ultimately decided to do it on the function level).
One other user experience tip we picked up from companies like Neeva and Mem is leveraging what ChatGPT does: streaming answers one word at a time while the output generates rather than waiting for the entire output to be done before sharing. This helps users be less frustrated with latency issues.
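The streaming idea can be sketched in a few lines of Python. This is a minimal illustration, not how any of the companies above implement it: `fake_model_stream` is a hypothetical stand-in for a model API that yields output incrementally, and the delay is simulated.

```python
import time
from typing import Iterator


def fake_model_stream(answer: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a model API that yields one word at a time."""
    for word in answer.split():
        time.sleep(delay_s)  # simulates per-word generation latency
        yield word


def render_streaming(chunks: Iterator[str]) -> str:
    """Display each chunk as soon as it arrives, then return the full text."""
    shown = []
    for chunk in chunks:
        print(chunk, end=" ", flush=True)  # the user sees partial output right away
        shown.append(chunk)
    print()
    return " ".join(shown)


if __name__ == "__main__":
    render_streaming(fake_model_stream("Streaming hides most of the wait", delay_s=0.05))
```

The total generation time is unchanged; what changes is that the first word appears after one delay rather than after all of them.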
2. Start with lower stakes tasks
Yochai Konig, the head of AI at automated customer support company Ada, has found starting with a lower stakes environment can be helpful. One clever way Ada has found to leverage GPT-3 is to have it power chatbots on sites where an unauthenticated user is visiting an FAQ page.
Being able to ask informational questions vs reading a webpage is a better experience and the data gathered allows models to be further trained on company-specific conversations. It also helps Ada’s customers get more comfortable with the models before rolling them out into higher stakes situations. Yochai shared more about Ada’s approach here and here.
3. Add additional trust and safety layers
Companies were quite focused on ensuring their features didn’t produce toxic output. Canva is at the forefront of trust and safety for AI. “We spent the same amount of time with trust and safety as we did with actually implementing the product,” says Ahmad at Canva. Canva introduced a multi-step process before even sending an input to an LLM to ensure the output was aligned with their brand guidelines and safety. This included a word blacklist and fine-tuning to make sure they wouldn’t be opining on any medical, legal or financial topics.
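As a rough illustration of screening input before it ever reaches a model, here is a minimal sketch. It is not Canva’s actual pipeline: the word lists, topic keywords and the `screen_input` function are all hypothetical, and a production system would likely use far larger reviewed lists plus a trained classifier.

```python
import re
from typing import Tuple

# Hypothetical lists for illustration only.
BLOCKED_WORDS = {"slur1", "slur2"}
RESTRICTED_TOPICS = {
    "medical": {"diagnosis", "dosage", "symptoms"},
    "legal": {"lawsuit", "sue", "contract"},
    "financial": {"invest", "stocks", "mortgage"},
}


def screen_input(user_text: str) -> Tuple[bool, str]:
    """Return (allowed, reason) before any text is sent to the model."""
    words = set(re.findall(r"[a-z0-9']+", user_text.lower()))
    if words & BLOCKED_WORDS:
        return False, "blocked word"
    for topic, keywords in RESTRICTED_TOPICS.items():
        if words & keywords:
            return False, f"restricted topic: {topic}"
    return True, "ok"


if __name__ == "__main__":
    print(screen_input("Write a birthday poem for my sister"))
    print(screen_input("What dosage should I take?"))
```

Keyword matching like this is crude and will produce false positives, which is one reason to treat it as a first layer rather than the whole safety process.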
Logan Kilpatrick, OpenAI’s first Developer Advocate, added that companies have seen success constraining prompts so the user isn’t entering the prompt text directly. Instead the prompt is constructed from a specific set of options or on the backend. Constraining prompts this way can improve both trust and safety and the quality of results. “It can take some creativity on the user side to figure out what you should really ask. So if a company can front-load the work so each user doesn’t have to do this, that will lead to the best product experience,” says Logan. “People can be turned off when their first prompt doesn’t immediately get the response they wanted.”
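One way to picture a constrained prompt is a backend function that assembles the text from a fixed menu of options. The sketch below is an assumption-laden example: the option lists, the template wording and `build_prompt` are all hypothetical.

```python
# Hypothetical menus a product might expose as dropdowns instead of a free-text box.
ALLOWED_TONES = ("friendly", "formal", "playful")
ALLOWED_FORMATS = ("caption", "headline", "product description")


def build_prompt(tone: str, output_format: str, subject: str,
                 max_subject_chars: int = 80) -> str:
    """Assemble the prompt on the backend from validated options."""
    if tone not in ALLOWED_TONES:
        raise ValueError(f"unsupported tone: {tone!r}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {output_format!r}")
    # Collapse whitespace and cap the length of the only free-text field.
    cleaned = " ".join(subject.split())[:max_subject_chars]
    if not cleaned:
        raise ValueError("subject must not be empty")
    return f"Write a {tone} {output_format} about: {cleaned}"


if __name__ == "__main__":
    print(build_prompt("formal", "headline", "  spring   sale "))
```

The user only picks a tone and a format and supplies a short subject, so every request reaching the model follows a template the team has already tested.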
4. Leverage embeddings
Companies we spoke with had clever ways to add additional context. Mem co-founder Dennis Xu described using embeddings (vector representations of data) to provide further context for their features. This allows them to take advantage of the knowledge graph created in their products. Take a task like “send an email to John,” which an OpenAI model would not normally know how to handle. Using embedded data from Mem’s customers, the model can search for context on who John is and take a more specific action. OpenAI offers an embeddings API, and the resulting vectors can be stored in a vector database.
Ada similarly stores chat responses to previously inputted conversations. If a current conversation is similar (as measured by semantic search) they respond with that specific response, helping ensure their models don’t hallucinate.
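Both uses described above come down to a nearest-neighbor lookup over vectors. The sketch below shows the mechanics with hand-written three-dimensional vectors. In practice the vectors would come from an embeddings model and live in a vector database; the `STORE` contents, the function names and the 0.9 threshold are all hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query_vec: Sequence[float],
            store: List[Tuple[str, Sequence[float]]]) -> Tuple[str, float]:
    """Return the stored text most similar to the query, with its score.

    The store must be non-empty.
    """
    scored = ((text, cosine_similarity(query_vec, vec)) for text, vec in store)
    return max(scored, key=lambda pair: pair[1])


def answer_from_store(query_vec: Sequence[float],
                      store: List[Tuple[str, Sequence[float]]],
                      threshold: float = 0.9) -> Optional[str]:
    """Reuse a stored response only when the match is close enough."""
    text, score = nearest(query_vec, store)
    return text if score >= threshold else None


# Toy data: real embeddings have hundreds or thousands of dimensions.
STORE = [
    ("John Smith, john@example.com, met at the Q3 offsite", [0.9, 0.1, 0.0]),
    ("Reset your password from the account settings page", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    print(answer_from_store([1.0, 0.0, 0.0], STORE))  # close match, text is returned
    print(answer_from_store([0.5, 0.5, 0.5], STORE))  # weak match, falls through to None
```

Returning `None` on a weak match is the design choice that matters here: the caller can then fall back to the model or to a human rather than forcing an unrelated stored answer.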
Logan at OpenAI echoed the usefulness of embeddings:
“Embeddings are the most underrated use case. They are going to provide the most unique experience using the APIs bringing in additional datasets from public and private sources to supplement the models.” - Logan Kilpatrick, Developer Advocate at OpenAI
5. Feed models more context
Context can also be increased by expanding prompts. Ahmad at Canva mentioned that no matter what the user inputs into Magic Write, “we feed the LLM more metadata around the type of document, code, project or presentation the user is working on to make the output better.”
Hex took this approach as well, feeding in the context of what the user is working on.
“There's also a ton going on behind the scenes on prompt generation… there's some parts of how Hex works that gives us a unique ability to construct the right context on the fly.” - Caitlin Colgrove, CTO at Hex.
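To make the idea of enriching a prompt with metadata concrete, here is a minimal sketch. The field names, the template and `build_contextual_prompt` are hypothetical and are not taken from Canva’s or Hex’s products.

```python
from typing import Dict


def build_contextual_prompt(user_request: str, metadata: Dict[str, str]) -> str:
    """Prepend known facts about the user's current work to their request."""
    # Skip empty fields and sort keys so the prompt is stable across calls.
    context_lines = [f"{key}: {value}" for key, value in sorted(metadata.items()) if value]
    context = "\n".join(context_lines)
    return f"Context about the user's current work:\n{context}\n\nRequest: {user_request}"


if __name__ == "__main__":
    print(build_contextual_prompt(
        "Write an intro",
        {"document_type": "presentation", "title": "Q3 review", "audience": ""},
    ))
```

The user types only “Write an intro”; the application supplies the rest, so the model is not guessing what kind of document it is writing for.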
6. Augment these models’ power with new tools
When Elementl CEO Pete Hunt set out to build a GitHub support bot, he wanted to combine GPT-3 with the knowledge encoded in Dagster’s documentation. Doing this required providing contextual data to augment the knowledge of an LLM. Pete leveraged a new company, LangChain, to do this (as detailed here). More generally, a whole class of tools is emerging that companies are using to improve the way they prompt and chain together models with each other, internal data and the outside world (including Fixie, Dust, Humanloop, Promptable, Cognosis, EveryPrompt, GPT Index and others).
7. Use hybrid approaches
Many of the companies we talked with had to be flexible about the solutions they used given different customer requirements and model quality. Beyang Liu (CTO and co-founder) and Rok Novosel (Software Engineer) described how Sourcegraph thought about leveraging third-party LLMs for some customers while offering an open source model for others, because some of their customers are self-hosted and don’t want to send data out to third-party companies.
Sridhar Ramaswamy, CEO and co-founder of search engine Neeva, shared that NeevaAI returns different outputs based on the search relevance score. “If we’re confident we’ve found a perfect page for that query we’ll just summarize it and return an answer but if the relevance score is low we’ll return that we don’t have an answer rather than make something up,” he said (for more detail on NeevaAI, check out Sridhar’s thread here).
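The relevance-gating logic Sridhar describes can be sketched as follows. This is an illustration under stated assumptions, not Neeva’s code: the 0.8 threshold, the result format and the `first_sentence` stand-in for a summarization model are all hypothetical.

```python
from typing import Callable, Dict, List

NO_ANSWER = "Sorry, we don't have a confident answer for that query."


def answer_query(results: List[Dict], summarize: Callable[[str], str],
                 min_relevance: float = 0.8) -> str:
    """Summarize the best page only when its relevance score is high enough."""
    if not results:
        return NO_ANSWER
    best = max(results, key=lambda r: r["relevance"])
    if best["relevance"] < min_relevance:
        return NO_ANSWER  # decline rather than make something up
    return summarize(best["text"])


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for a summarization model."""
    return text.split(".")[0].strip() + "."


if __name__ == "__main__":
    pages = [
        {"text": "Paris is the capital of France. It is a large city.", "relevance": 0.95},
        {"text": "France borders Spain.", "relevance": 0.40},
    ]
    print(answer_query(pages, first_sentence))
    print(answer_query(pages[1:], first_sentence))
```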
8. Have larger models train smaller models
Some of the largest language models today can be expensive to run and have relatively high latency. These largest models didn’t work for NeevaAI, as Sridhar wanted the product to start returning answers in under 1.5 seconds. But this didn’t mean he couldn’t leverage models like GPT-3. Sridhar wanted to train a smaller model to summarize web pages. To do this, he ran a large set of input pages through OpenAI’s most powerful DaVinci model to generate summaries. He then used those input and summary pairs to train a smaller model for Neeva that met their latency requirements (check out his much more detailed thread on how Neeva further reduced latency).
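The data-generation half of this approach can be sketched as below. It covers only building the training set: `stub_teacher` is a hypothetical stand-in for a call to a large hosted model, the `input`/`target` field names are assumptions, and actually training the smaller model is out of scope for the example.

```python
import json
from typing import Callable, Dict, Iterable, List


def build_distillation_set(pages: Iterable[str],
                           teacher: Callable[[str], str]) -> List[Dict[str, str]]:
    """Label each page with the large model's summary to create training pairs."""
    return [{"input": page, "target": teacher(page)} for page in pages]


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize the pairs one JSON object per line, a common fine-tuning format."""
    return "\n".join(json.dumps(example) for example in examples)


def stub_teacher(page: str) -> str:
    """Hypothetical stand-in for a call to a large hosted model."""
    return page.split(".")[0].strip() + "."


if __name__ == "__main__":
    pages = ["Cats sleep a lot. They also purr.", "Rust is a systems language. It is fast."]
    print(to_jsonl(build_distillation_set(pages, stub_teacher)))
```

The large model is paid for once, at dataset-building time; the smaller model trained on its outputs is the one that serves live traffic.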
What still needs to be solved?
The products that have already been shipped leveraging LLMs are incredible. But it’s clear we’re just at the beginning. Companies and researchers are still trying to figure out all the capabilities of these models.
And the wishlist for future generations continues to grow. We frequently heard that functionality like the ability to change action states in a program or update an order in an internal database would unlock many future use cases.
Pricing for these features is still unclear. Companies incur compute charges every time these models are run (whether through their own cloud bills or a provider like OpenAI), but in most of their products users pay standard SaaS fees rather than consumption-based pricing.
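A quick back-of-the-envelope calculation shows why this mismatch matters. Every number below is made up for illustration and does not reflect any provider’s real prices or any company’s real usage; `monthly_margin` is a hypothetical helper.

```python
def monthly_margin(seat_price: float, requests_per_user: int,
                   tokens_per_request: int, price_per_1k_tokens: float) -> float:
    """Flat subscription revenue minus usage-based compute cost, per user per month."""
    compute_cost = requests_per_user * tokens_per_request / 1000 * price_per_1k_tokens
    return seat_price - compute_cost


if __name__ == "__main__":
    # Illustrative numbers only: a $20 seat and $0.02 per 1,000 tokens.
    typical = monthly_margin(20.0, requests_per_user=500,
                             tokens_per_request=1000, price_per_1k_tokens=0.02)
    heavy = monthly_margin(20.0, requests_per_user=2000,
                           tokens_per_request=1000, price_per_1k_tokens=0.02)
    print(f"typical user margin: {typical:.2f}")
    print(f"heavy user margin: {heavy:.2f}")
```

Under these assumed numbers, a user making four times as many requests turns a positive margin into a loss while paying the same flat fee.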
In future posts, we will dive into further lessons companies are learning and other areas like the tooling used by some of these companies, how to think about pricing for AI features, and the limits of LLMs.
But one thing’s for sure: the space is clearly exciting. Our portfolio company Hex probably summed it up best: “there are millions of ideas we have for how we could use these models. Now we just have to try them.”
Huge thanks to Caitlin Colgrove and Barry McCardel at Hex, Dennis Xu at Mem, Jason Warner, Beyang Liu and Rok Novosel at Sourcegraph, Yochai Konig at Ada, Sami Ghoche at Forethought, Kundan Kumar at Descript, Ahmad Iqbal at Canva, Pete Hunt at Elementl, Sridhar Ramaswamy at Neeva and Logan Kilpatrick at OpenAI for their amazing insights on applying AI and LLMs in the real world.
For those interested in working on these products many of these companies are hiring. We’ve linked to the careers pages above.
And for those interested in exploring further ways people are using LLMs a few resources we’ve found helpful are: https://github.com/openai/openai-cookbook
If you’re building on top of LLMs and have thoughts on lessons we should feature in future pieces please reach out to us at jacob@redpoint.com and patrick@redpoint.com.