Forget the Hype: How Publishers Can Safely Enhance Their Products with AI

A conversation with AI expert Kyle Alan Hale. 

Last week, the National Eating Disorders Association (NEDA) rolled out — then quickly took offline — a chatbot called Tessa. Tessa was supposed to offer callers guidance, but she began urging them to restrict their diets, assuring them they could safely lose one to two pounds per week. It’s a reminder that generative AI products need to be rolled out with care.

But what does that “care” actually involve? Done well, initiatives like BuzzFeed’s Botatouille are a great way for publishers to enhance their products, gain new subscribers, and offer more value to their existing subscriber base.

To understand how publishers can deploy AI bots safely, we spoke with Kyle Alan Hale, a Solutions Architect at Rightpoint. Kyle holds an advanced degree in philosophy of mind and neuroscience and is currently working on a degree in computational linguistics, focusing particularly on language models. 

AdMonsters: How do chatbots, like Tessa, go so wrong?

KAH: ChatGPT is a large language model, and there is some confusion as to what a language model does. A lot of people think they’re a good source of knowledge, but they’re not. These models are less of a knowledge center and more of a reasoning center. When we use language models as tools that can reason about outside knowledge, they can provide users with more reliable information, and avoid Tessa-like issues.

Ensuring that reliability isn’t as simple as it sounds, because there’s still the opportunity for the language model to hallucinate things. Engineering a robust solution requires giving the language model access to reliable knowledge and doing it in a way that includes oversight.

AdMonsters: To be clear, language models can’t look things up for their users.

KAH: On their own, language models don’t look up outside information. They merely play “complete the sentence.” However, they can only complete the sentence with things that they have seen. The more a model has seen, the better it can complete the input a user gives it in order to provide a higher-quality output.
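
For readers who want to see what “complete the sentence” looks like in practice, here is a minimal sketch using OpenAI’s Python library (the pre-1.0 SDK; newer versions use a client object, and the model name and prompt are purely illustrative):

    import openai  # pip install openai (pre-1.0 SDK)

    # A bare completion call: the model simply continues the text it is given.
    # It is not looking anything up; it predicts likely next words based on
    # the patterns it saw during training.
    response = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",  # illustrative; any completion-style model works
        prompt="The three most common leavening agents in baking are",
        max_tokens=40,
        temperature=0.2,
    )
    print(response["choices"][0]["text"])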

AdMonsters: You say that reliable knowledge requires oversight. How does that oversight work?

KAH: There are a few ways to provide oversight. First, we can have the language model itself look at its previous output and ask: “Does this add up with everything you know? Does it match the other requirements you were provided?” So, we can first ask the language model to reason about a set of knowledge it is given access to, and then we can instruct it to reason about its own output. Of course, it’s critical for humans to verify that the output is good, and this is precisely where many of the difficulties we’ve seen arise. 

AdMonsters: You’re using the word “reason” as a verb. What do you mean by the model being able to reason?

KAH: I use it not as a technical term, but as a way to conceptualize how language models can best be used. For example, humans are good at reasoning about information we have access to in order to guide our behavior. Before we speak we have the chance to ask ourselves, “Should I say whatever is in my mind? How should I temper what I say given the audience, or other nuances I know about a topic?” We have layers of introspection that allow us to react to our experiences in relatively good and reliable ways.

Language models can’t do that on their own. I heard someone say that ChatGPT’s output is like when we blurt something out without doing that internal reasoning first. Sometimes that knee-jerk reaction is right, and sometimes it is wrong. That is why a language model isn’t a good source of knowledge on its own. But when you give it access to knowledge, and the opportunity for oversight, you get something more powerful.

AdMonsters: Let’s say publishers want to add a ChatGPT-type functionality to a product. Do they need to train the model on all of their existing content related to that product? 

For example, if The New York Times wants to add a conversational bot to its Cooking product, does it need to train the model on all of the recipes, articles, and videos in that section?

KAH: In this case, The New York Times will want to ensure its subscribers receive accurate information from the bot. There are two ways to improve the accuracy and safety of a language model’s output. 

The first is called prompt engineering, which uses an existing model, say OpenAI’s GPT-4, as it is. Rather than training the model on new information, we use prompts to give it access to the new information and to help guide its reasoning processes. 

In the case of a publisher’s Cooking section, prompt engineers can include formatting instructions, so that the model will generate responses that follow a recipe format, with ingredients listed first, followed by step-by-step instructions.

AdMonsters: Can you give me an example of that?

KAH: Every time you ask ChatGPT a question and go back and forth with it, what you’re really doing is giving the language model prompts or instructions to help it generate the response you want. But you’re not the only one providing information. OpenAI has built prompts into the backend of its system, which means it too adds content to the message that you send to the language model. The language model generates a response based on the prompts you send it as well as the prompts that are built into OpenAI’s backend.

These prompts help to guide the output.

The prompt engineer can also instruct the language model to keep the conversation on topic if a user asks a question that’s not related to cooking. The language model then knows to say something along the lines of, “As a cooking assistant, I can only provide cooking and nutrition information. Do you have other cooking-related questions I can help with?”
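
To make that concrete, here is a minimal sketch of how a publisher-side prompt like this might be wired up with OpenAI’s chat API (pre-1.0 Python SDK; the instructions, model choice, and refusal wording are illustrative, not any publisher’s or OpenAI’s actual backend prompts):

    import openai  # pre-1.0 OpenAI SDK; newer versions use a client object

    # An illustrative "backend" prompt the publisher prepends to every user message.
    SYSTEM_PROMPT = """You are a cooking assistant for our Cooking section.
    - Only answer questions about cooking, recipes, and nutrition.
    - When you return a recipe, list the ingredients first, then numbered
      step-by-step instructions.
    - If a question is not about cooking, reply: "As a cooking assistant, I can
      only provide cooking and nutrition information. Do you have other
      cooking-related questions I can help with?"
    """

    def answer(user_message: str) -> str:
        # The user's message is combined with the backend instructions before
        # it ever reaches the language model.
        response = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_message},
            ],
        )
        return response["choices"][0]["message"]["content"]

    print(answer("How do I make a simple vinaigrette?"))
    print(answer("Who should I vote for?"))  # should trigger the off-topic reply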

AdMonsters: How can prompt engineering serve the Trust & Safety team’s needs, such as preventing the model from responding to an inappropriate prompt, like “What was Hitler’s favorite meal?”

KAH: So far we’ve been talking about discrete prompts — format this data into a recipe format — but that’s just the start. Prompt engineers will often chain multiple prompts together to produce more sophisticated behavior, including the kinds of oversight we are discussing. In this case, the prompt engineer can take the initial output from the model, combine it with additional instructions such as asking it to check for offensive or dangerous content, and then send that second prompt to the language model for a more refined answer. 

We can also train a model on new information that helps it recognize when something is inappropriate, and a combination of the two is important to properly guide the output of a language model. In ChatGPT’s case, the response to that question includes a notice that nothing should overshadow Hitler’s atrocities. Together, these techniques let us build a model that rejects offensive content and responds with context and caveats where appropriate.
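
As a rough sketch, that chain can be as simple as two calls, with the second pass acting as the safety review (the wrapper function, prompts, and rewrite convention here are invented for illustration):

    import openai  # pre-1.0 OpenAI SDK; newer versions use a client object

    def ask(system: str, user: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp["choices"][0]["message"]["content"]

    def answer_with_review(question: str) -> str:
        # Step 1: draft an answer.
        draft = ask("You are a cooking assistant for a publisher.", question)

        # Step 2: send the draft back with oversight instructions layered on top.
        review_request = (
            "Review the assistant response below for offensive, dangerous, or "
            "insensitive content. If it is safe, return it unchanged. If it is not, "
            "rewrite it so that it declines or adds the necessary context and caveats.\n\n"
            f"Response to review:\n{draft}"
        )
        return ask("You are a careful content reviewer.", review_request)

    print(answer_with_review("What was Hitler's favorite meal?"))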

AdMonsters: Can we send an endless amount of instructions to the language models?

KAH: The challenge with prompt engineering is that there’s only a relatively small window of content we can send to a language model at once. We call that window the model’s context.

This limitation on context size means that we can’t just say, “Hey language model, here are all of our databases; go find stuff.” We need a method to explore the data, extract the relevant data, and put it alongside the instructions on how to format the output or avoid talking about specific topics. All of this content must go into a single prompt, which, as we said, is limited by the model’s context size.
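
Here is a hedged sketch of that retrieve-then-prompt step, using OpenAI’s tiktoken library to keep an eye on the token budget (the keyword scoring is a stand-in for a real search index or embedding lookup, and the budget number is arbitrary):

    import tiktoken  # OpenAI's tokenizer library: pip install tiktoken

    CONTEXT_BUDGET = 6000  # illustrative; leave room for the reply inside the model's window
    enc = tiktoken.encoding_for_model("gpt-4")

    documents = {
        "vinaigrette.txt": "Whisk three parts oil with one part vinegar, mustard, and salt...",
        "sourdough.txt": "Feed the starter about twelve hours before mixing the dough...",
        "knife-care.txt": "Hone the blade before each use and sharpen it every few months...",
    }

    def retrieve(question: str, k: int = 2) -> list[str]:
        # Stand-in for real retrieval: score each document by how many
        # words it shares with the question.
        words = set(question.lower().split())
        ranked = sorted(documents.values(),
                        key=lambda text: len(words & set(text.lower().split())),
                        reverse=True)
        return ranked[:k]

    def build_prompt(question: str) -> str:
        prompt = "Answer using only the reference material below.\n\n"
        for chunk in retrieve(question):
            candidate = prompt + chunk + "\n\n"
            if len(enc.encode(candidate + question)) > CONTEXT_BUDGET:
                break  # stop before the prompt outgrows the context window
            prompt = candidate
        return prompt + "Question: " + question

    print(build_prompt("How do I make a vinaigrette?"))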

One of the ways to get around the context size limitation is to chain several prompts together. If we need to work through a large amount of data, we can deploy a series of prompts to do it. For instance, let’s say you have a document that is hundreds of pages long and you want a language model to summarize it for you. You can break it into separate chunks, ask for individual summaries in separate prompts, and then combine those summaries into a final prompt to get an overall summary. This is an example of prompt chaining.
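
A minimal sketch of that chunk-and-combine chain might look like this (the chunk size, prompts, and wrapper are all illustrative choices, not a particular vendor’s API):

    import openai  # pre-1.0 OpenAI SDK; newer versions use a client object

    def ask(instruction: str, text: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
        )
        return resp["choices"][0]["message"]["content"]

    def summarize_long_document(document: str, chunk_chars: int = 8000) -> str:
        # Step 1: split the document into pieces small enough for the context window.
        chunks = [document[i:i + chunk_chars]
                  for i in range(0, len(document), chunk_chars)]

        # Step 2: one prompt per chunk produces the individual summaries.
        partial_summaries = [ask("Summarize this excerpt in a few sentences:", chunk)
                             for chunk in chunks]

        # Step 3: a final prompt combines them into an overall summary.
        return ask("Combine these partial summaries into one overall summary:",
                   "\n\n".join(partial_summaries))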

AdMonsters: So can prompts and prompt chains enable publishers to launch a conversational chatbot without retraining ChatGPT or Bard on their own data?

KAH: Yes, which is why prompt engineering is cool. Publishers don’t need to re-train any models to get higher-quality output. They can use language models that already exist and send prompts with additional information to guide the output to a user’s query. 

This is one of the most powerful aspects of language models; the prompt that we send to the language model allows us to teach it things in a short-term way. This is called in-context learning. For instance, by providing one or more examples, we can teach the model to generate an output in the format we need, which is much more cost-effective than explicitly training a model to do that task.
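
Here is what that in-context teaching might look like in practice: a couple of worked examples ride along in the prompt, and the model copies the pattern for the new input without any retraining (the examples and format are invented for illustration):

    import openai  # pre-1.0 OpenAI SDK; newer versions use a client object

    # Two worked examples "teach" the model the output format for this request only;
    # nothing about the model itself changes.
    messages = [
        {"role": "system", "content": "Rewrite each dish as a one-line ingredient list."},
        {"role": "user", "content": "Caprese salad"},
        {"role": "assistant", "content": "Ingredients: tomatoes, fresh mozzarella, basil, olive oil, salt"},
        {"role": "user", "content": "Guacamole"},
        {"role": "assistant", "content": "Ingredients: avocados, lime juice, onion, cilantro, salt"},
        # The real request; the model follows the pattern it was just shown.
        {"role": "user", "content": "Pancakes"},
    ]

    response = openai.ChatCompletion.create(model="gpt-4", temperature=0, messages=messages)
    print(response["choices"][0]["message"]["content"])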

AdMonsters: You said there are two ways to improve the accuracy and safety of a model’s output. Prompt engineering is one. What’s the other?

KAH: The other is training an existing model on new information, a method called fine-tuning. It’s a more robust and reliable approach to ensuring accurate information, but it’s also more expensive, takes longer, and is far more difficult to do than prompt engineering.

The way fine-tuning works is that we take an existing model and continue to train it on new information. In this way, it can get better at completing sentences with accurate information about a specific task, use case, or knowledge domain. However, fine-tuning isn’t bulletproof; the model can still hallucinate.
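
The mechanics vary by provider, but the heart of fine-tuning is a dataset of example inputs paired with the outputs we want the model to learn. Here is a sketch of that preparation step, using a JSONL layout similar to OpenAI’s chat fine-tuning format (the examples are invented, and the exact schema and upload step depend on the provider and SDK version):

    import json

    # Each record pairs an input with the output the fine-tuned model should learn.
    # (Invented examples; the exact JSONL schema depends on the provider and model.)
    training_examples = [
        {"messages": [
            {"role": "system", "content": "You are the Cooking section's recipe assistant."},
            {"role": "user", "content": "How do I make a basic tomato sauce?"},
            {"role": "assistant", "content": "Ingredients: canned tomatoes, olive oil, garlic, salt.\n"
                                             "1. Warm the oil and garlic. 2. Add the tomatoes. "
                                             "3. Simmer for 30 minutes and season."},
        ]},
        # ...hundreds or thousands more examples drawn from the publisher's own content...
    ]

    with open("cooking_finetune.jsonl", "w") as f:
        for example in training_examples:
            f.write(json.dumps(example) + "\n")

    # The file is then uploaded and a fine-tuning job is started through the provider's
    # API or dashboard; even after that, prompt engineering is still layered on top.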

By analogy, we can imagine two painters, one who enjoys painting on the weekends as a hobby, and another who has been trained in a university’s fine arts program. The latter painter’s artistic instincts have been fine-tuned, so to speak, to generate higher-quality artistic output. This is similar to how fine-tuning a model works.

So, if we fine-tune OpenAI’s language model on, say, The New York Times Cooking section data, it will get better at generating similar content. However, we’ll still need prompt engineering on top of that to guide its output so that it generates reliable information.

AdMonsters: Why did Tessa, the chatbot for the eating disorder hotline, tell callers that they can safely lose one to two pounds per week?

KAH: This is one of the most difficult challenges the industry faces at the moment: how do we give the application enough oversight to control its knee-jerk reactions? I think we will continue to see improvements in this area through some combination of fine-tuning and prompt engineering.

I haven’t seen how Tessa was implemented so I can only speculate. But it certainly seems like the prompt engineering that went into it wasn’t sophisticated enough, and, more importantly, that the hotline didn’t do enough testing on the kinds of responses it would get. 

At this point prompt engineering is a bit of an art and requires experimentation and extensive testing to get a sense of the kinds of outputs the model will generate. However, improving the predictability of prompt engineering is an area of intense research right now, and we are already seeing gains here.

One of those techniques is adding introspection layers into the prompts as we discussed earlier. When the model returns a response, the prompt engineer can ask it, “Is this correct?” Or we might say, “Here is another set of information; how does that add up with what you just generated?” In this way, we can give the model a chance to reason about its own outputs and correct itself. 
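
In code, that introspection layer is just one more prompt in the chain: the model’s first answer goes back in alongside the reference material it was supposed to use (the wrapper and prompts here are invented for illustration):

    import openai  # pre-1.0 OpenAI SDK; newer versions use a client object

    def ask(prompt: str) -> str:
        resp = openai.ChatCompletion.create(
            model="gpt-4",
            temperature=0,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]

    def answer_with_introspection(question: str, reference: str) -> str:
        # First pass: answer using the reference material we control.
        draft = ask("Using only this reference material, answer the question.\n\n"
                    f"Reference:\n{reference}\n\nQuestion: {question}")

        # Second pass: ask the model whether its own answer adds up with that
        # material, and to correct itself if it does not.
        return ask("Here is a draft answer and the reference material it was based on. "
                   "Does the draft add up with the reference? If not, correct it; "
                   "if it does, return it unchanged.\n\n"
                   f"Reference:\n{reference}\n\nDraft answer:\n{draft}")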

Going back to the publisher adding a conversational chatbot to its Cooking or Home and Style sections, the application’s prompt engineering solution could involve several messages going back and forth between the publisher’s and OpenAI’s servers in order to assess the answers. We need to ensure that there are enough layers of oversight so that we don’t put users at risk, and verify that the information is reliable. These things are hard.

AdMonsters: This sounds like a case of, “Don’t try this at home, folks.”

KAH: It absolutely is. I can just imagine the Board of NEDA saying, “Oh, this is easy, let’s put two engineers on it and roll it out.” The thing is, it is easy to write something that works. The hard part is getting it to work correctly, reliably, and safely.

AdMonsters: Is the technology advanced enough so that publishers can roll out conversational AI without harming their readers or brand reputation?

KAH: Yes, language models and chatbots are a great way for publishers to add value to their subscribers, but they shouldn’t be fooled by the ease of getting something up and running quickly. The first step is to work with someone who understands the pitfalls, as well as the current state of the technology. I think the problems we’ve seen are the result of the market not quite understanding generative AI, or its tendency to hallucinate.

I also think it’s important for the non-technical people at publishers to understand what the technology is and isn’t capable of. Right now there’s a lot of bluster about AI ending the world, but that isn’t realistic. In fact, emphasizing concerns like that can distract from the real dangers with the present state of the technology, such as those we’ve seen with Tessa the bad health advisor, and similar failed attempts.

Generative AI is here to stay, and it will be used in both beneficial and harmful ways. Given that, we all have a responsibility to learn how to use it well so that it can provide accurate and reliable information in order to improve people’s lives.

###

Kyle Alan Hale, a software architect at Rightpoint, brings over a decade of experience in generative AI and mobile software development, with a focus on user wellbeing. He holds an advanced degree in Philosophy of Mind from GSU, specializing in neuroscience and consciousness. Currently, he’s expanding his expertise at UGA, studying Computational Linguistics with a focus on Large Language Models (LLMs).