Archived: Lessons Learned Building Products Powered by Generative AI

This is a simplified archive of the page at https://tech.buzzfeed.com/lessons-learned-building-products-powered-by-generative-ai-7f6c23bff376

This article was co-authored with Archi Mitra.

Illustration by Alex Gervais

Though BuzzFeed has been incorporating Generative AI into its products for the last couple of years, the last three months have been exhilarating for anyone who operates in that space. And while we are only scratching the surface of what can be done with Generative AI, we thought it would be useful to share some of the lessons we learned and some of the things we are thinking about for the future.

Lesson #1: Get the technology into the hands of your employees, especially the creative ones.

At BuzzFeed we believe that AI is going to bring about a new era of creativity. We think that it will open up brand new content formats, new ideas, and novel ways for content creators to interact with their audience. But, like any groundbreaking technology, until you try it yourself it is hard to grasp how AI can complement your skillset. That’s why we spent a lot of time creating a safe and collaborative environment where content creators could start to understand for themselves how language models work, and what their benefits and limitations are.

To do so, we took the following steps:

  • We gave everyone at the company access to OpenAI’s Playground interface. This gave our staff an intuitive, easily accessible sandbox and exposed them to the various levers OpenAI provides to shape the model’s output (things like choosing the right model and the appropriate temperature level).
  • We integrated this technology right into Slack, which enabled people to collaborate on prompts and witness how others were using the tools. Last month, for example, we released a Slack bot that lets anyone generate images from a text prompt using Stable Diffusion (currently for internal use only).
  • We also integrated new AI features directly into our CMS. Today anyone at BuzzFeed can create Infinity Quizzes directly from our article editor.
Screenshot of BuzzFeed’s internal AI Quiz builder

Lesson #2: Good and effective prompts are the result of close collaboration between writers and engineers.

Good prompts take time and dedication. It’s the rise of the prompt engineer, the “most important job skill of this century.” Blah blah blah. You’ve heard it all.

At BuzzFeed we often obsess over a specific frame or format until we get it right. So iterating dozens of times over a text prompt felt very natural to our content creators. Very quickly though we realized that we could improve the output of our prompts by having Machine Learning engineers co-write the prompts with the editorial team. After all, some of our Machine Learning engineers have been playing with Generative AI for years and are self-proclaimed “pedantic assholes” (lol, not really), which arguably makes them the best prompt engineers in the world.

Here is a concrete example of how teams collaborated on our latest experiment — a mini simulation game called Under the Influencer:

For context, our guiding principle throughout the development of that game was to let the AI own the conversation, but do so within a creative framework that we design to be the most fun and entertaining. Here are 4 key phases we went through:

1. Find a compelling creative idea: We wanted to create a game that allowed people to chat with an “all knowing, all powerful AI.” So we came up with 4–5 scenarios and translated them into a few quick, basic prompts to test their feasibility. This was primarily driven by the editorial and creative folks on the team.

2. Develop the idea further and find the limits of the language model: Once we saw the AI “influencer coach” idea could be compelling, we started expanding the prompt with things like voice, behaviors that the AI should encourage or discourage, and with the mechanics of the game (how the game starts and ends, how the scoring works, etc). The result of this was a prompt that demanded a lot from the language model and unsurprisingly caused it to struggle to “remember” all the instructions. At this point we had started prototyping the UI of the game and engineers joined in to tweak the prompt as iterating was becoming easier.

3. Bring in ML expertise to optimize prompt effectiveness: In the thick of things now, we started keeping track of which “instructions” the model was and wasn’t following, and we nailed down all the non-negotiable features we wanted out of the game. At this point, the ML engineers stepped in and improved the prompt design by decoupling and parallelizing instructions, grouping and templating prompt sections, and inserting parsable “tokens”.

4. Refine and have fun: Now equipped with a prompt structure that gave us reliable outcomes, the editors and product managers found opportunities to be even more flexible with the core mechanics of the game, enabling it to twist in infinite ways.
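The prompt structure that came out of phase 3 can be sketched roughly like this. The section names, rule text, and `[SCORE]` token format below are illustrative stand-ins, not our actual production prompt:

```python
import re

# Hypothetical templated prompt: instructions are decoupled into named
# sections, and the model is asked to wrap structured output in
# parsable "tokens" such as [SCORE]...[/SCORE].
PROMPT_TEMPLATE = """\
VOICE:
{voice}

RULES:
{rules}

GAME MECHANICS:
{mechanics}

Always end your reply with the player's current score wrapped in
tokens, like [SCORE]42[/SCORE]."""

def build_prompt(voice, rules, mechanics):
    """Assemble the full prompt from decoupled, templated sections."""
    return PROMPT_TEMPLATE.format(
        voice=voice,
        rules="\n".join(f"- {r}" for r in rules),
        mechanics=mechanics,
    )

def parse_score(reply):
    """Extract the score the model wrapped in parsable tokens."""
    match = re.search(r"\[SCORE\](\d+)\[/SCORE\]", reply)
    return int(match.group(1)) if match else None
```

Grouping instructions into sections makes it easy for editors and engineers to iterate on one section without disturbing the others, and the parsable tokens let the application read game state (like the score) out of otherwise freeform text.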

Here is the final result:

Lesson #3: Moderation is essential. Build guardrails into your prompt.

Large Language Models (LLMs) are probabilistic models that are governed by freeform prompts. Products that use this technology need to find the right balance between encouraging the language model to produce useful/entertaining outputs and limiting the risk of generating undesirable content. At BuzzFeed, we firmly believe in our users’ own creativity and want our AI experiences to be co-authored with our audience, “BuzzFeed + AI + You”. We give users control over parts of the prompts. Hence we have had to design the following moderation tactics:

1. Brand Moderation: In order to control for brand safety, we decorate every prompt with “IMPORTANT” instructions such as “MAKE SURE TO ONLY GENERATE TEXT THAT DOES NOT INCLUDE ANY HATEFUL, VIOLENT, […] CONTENT.”

2. AI Moderation:

  • Instruct techniques: We found out certain Instruct designs worked better than others. For example, including a “RULES” section in the system prompt of the ChatGPT API works better than adding it to the body of the prompt.
  • Weighting with logit_bias: We also wanted LLM responses to avoid certain words. We used logit_bias, a parameter you can pass through the OpenAI API to inject large negative weights into certain tokens, forcing the LLM’s logit values for those words so low that they cannot appear in the response even at high temperature values.

3. User Moderation: We used a set of external banned words databases to limit users from inserting undesirable words into the prompts.
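Put together, the three layers above might look something like the sketch below for a chat-style request. The banned-word list, rule text, and logit_bias token IDs are placeholders for illustration, not our production values:

```python
BANNED_WORDS = {"badword1", "badword2"}  # placeholder for an external banned-words database

SYSTEM_PROMPT = """You are the game host.
RULES:
- IMPORTANT: MAKE SURE TO ONLY GENERATE TEXT THAT DOES NOT INCLUDE
  ANY HATEFUL, VIOLENT, OR OTHERWISE UNSAFE CONTENT."""

def moderate_user_input(text):
    """User moderation: reject prompts containing banned words."""
    words = set(text.lower().split())
    return not (words & BANNED_WORDS)

def build_request(user_text):
    """Build a chat request carrying all three guardrail layers."""
    if not moderate_user_input(user_text):
        raise ValueError("Input contains a banned word")
    return {
        "model": "gpt-3.5-turbo",
        "messages": [
            # Brand + AI moderation: RULES live in the system prompt,
            # which we found the model follows more reliably.
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        # Large negative weights make these token IDs effectively
        # impossible to sample (the IDs here are made up).
        "logit_bias": {1234: -100, 5678: -100},
        "temperature": 0.9,
    }
```

The resulting dict is what gets sent to the chat completions endpoint; keeping the guardrails in one builder function means every product surface inherits the same moderation by default.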

Lesson #4: LLMs are not dark magic. Demystifying the technical concepts behind this technology can lead to better application of those tools.

Language models are probabilistic in nature: they generate predictions based on probabilities rather than deterministic rules. That’s hard for the human brain to grasp. But that doesn’t mean that the technology underpinning those generative models cannot be explained to your staff in broad strokes.

Last month, our Machine Learning team held a few educational workshops aimed at explaining some of the technical concepts behind generative AI models.

Screenshot of “Demystifying Generative AI” — a presentation by the BuzzFeed ML team.

These classes helped some of our product teams get a better understanding of the capabilities and limitations of these models.

One of the things that became obvious to our staff once they learned more about the mechanics of language models was that LLMs are pretty good at telling coherent stories. And when you introduce more randomness in the generation of those stories (the “temperature” parameter in OpenAI) those stories can become pretty entertaining.
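Temperature itself is easy to demonstrate: it rescales the model’s logits before sampling, so higher values flatten the probability distribution and make unlikely tokens more likely to appear. A toy illustration with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into sampling probabilities at a given temperature."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # toy next-token logits for three candidate tokens

low = softmax_with_temperature(logits, 0.2)   # near-deterministic
high = softmax_with_temperature(logits, 2.0)  # much more random

# At low temperature the top token takes almost all the probability
# mass; at high temperature the mass spreads out, which is what makes
# the generated stories more surprising and entertaining.
```

This is the same knob OpenAI exposes as the `temperature` parameter: the model’s logits are fixed, and only the sampling distribution over them changes.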

Lesson #5: Integrating with OpenAI and scaling your usage to thousands of requests per minute is easy, but be prepared for some downtime.

The simplicity of OpenAI’s API is astonishing! Send text instructions in, get text completions back. Any developer can integrate with their services in minutes. Use one of their standard libraries, generate an API key, and start sending requests.

It’s also very easy to scale up your usage. Just make sure to talk to their sales team if you need higher quotas and make sure to implement proper backoff logic to release pressure on OpenAI’s infrastructure when you get a rate limit error.
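A minimal version of that backoff logic looks something like the sketch below. It is library-agnostic: the `RateLimitError` class stands in for whatever rate-limit exception your client library raises, and the retry counts and delays are arbitrary choices:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the rate-limit error your client library raises."""

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` with exponential backoff and jitter on rate limits."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Exponential backoff with jitter releases pressure on the
            # API instead of hammering it in lockstep with other clients.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Wrapping every completion call in something like `with_backoff(lambda: client_call(...))` keeps the retry policy in one place as your request volume grows.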

That being said, OpenAI has certainly been a victim of its own success when it comes to uptime resilience. We have experienced multiple outages over the last few weeks, which rendered our AI-powered products unusable. Here are a couple of lessons we learned about dealing with those outages:

  • OpenAI outages are often limited to specific models. During a text-davinci-003 incident we switched to the previous version of that model, text-davinci-002. That was a mistake. We anticipated worse responses and believed that it was a good trade off compared to no responses at all. But text-davinci-002 ended up not being able to follow some of the instructions included in our prompts, leading to unintelligible responses.
  • Since then, our strategy to handle extended outages has been to message our users accordingly and temporarily turn off submissions.
Error screen during an extended OpenAI outage. Illustration by Alex Gervais.
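Our resulting outage strategy can be sketched as a thin wrapper: rather than silently falling back to an older model, surface a disabled state to the product. The exception type and message below are illustrative stand-ins:

```python
class OpenAIOutageError(Exception):
    """Stand-in for the API/connection errors raised during an outage."""

MAINTENANCE_MESSAGE = (
    "Our AI partner is experiencing an outage. "
    "Submissions are temporarily disabled."
)

def generate_or_disable(call_model):
    """Return the model's response, or a disabled state during outages."""
    try:
        return {"status": "ok", "text": call_model()}
    except OpenAIOutageError:
        # Don't fall back to an older model -- it may ignore instructions
        # the prompt depends on. Message users and pause submissions.
        return {"status": "disabled", "text": MAINTENANCE_MESSAGE}
```

The frontend can then key off the `status` field to show the error screen and turn off the submission UI until the upstream service recovers.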

Lesson #6: The economics of using Generative AI can be tough, especially for ad-supported business models

The unit economics of leveraging hosted LLM services (like OpenAI) can put significant pressure on already thin ad margins. We embarked on this journey with that knowledge firmly in mind and devised a set of strategies to make sure we have levers to control those costs.

  1. LLMs can learn from other LLMs: Hosted LLM services expose general-purpose models that are the Swiss Army knives of language tasks, usable directly and easily across a large set of use cases. Hypothesizing that our use cases were narrow enough for a smaller fine-tuned language model to produce comparable performance, we took open-source Instruct-capable LLMs (e.g. FLAN-T5), fine-tuned them with the LoRA adapter technique on data generated by general-purpose LLMs (e.g. OpenAI’s DaVinci), and reduced costs by ~80%.
  2. Counting (& controlling) tokens: We found ourselves carefully tuning the token usage of our prompts and responses. At the scale of billions of requests, every token saved translates to hundreds of thousands of dollars. Instructions to limit response length, rewriting prompts to be succinct, and max_token count adjustments helped us tremendously.
  3. Embracing open-source: The cost contrast between hosted and open-source in the Gen Image AI space is even more extreme. For example, while Dall-E costs $0.018 per image, a self-hosted, well-tuned (negative prompts included) version of Stable Diffusion on GCP’s ML infrastructure can drive the cost of image generation down to $0.0004/image with comparable performance.
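Back-of-the-envelope token accounting makes the stakes of point 2 concrete. The $0.002-per-1K-tokens rate below is purely illustrative (hosted-model prices vary by model and change over time):

```python
def cost_usd(tokens_per_request, requests, usd_per_1k_tokens=0.002):
    """Estimate spend for a given per-request token budget."""
    return tokens_per_request * requests * usd_per_1k_tokens / 1000

BILLION = 1_000_000_000

# Trimming a prompt by just 10 tokens, at a billion requests,
# saves cost_usd(10, BILLION) dollars at this illustrative rate.
savings = cost_usd(10, BILLION)
```

Running the numbers this way is a quick sanity check before shipping any prompt change at scale: a wording tweak that adds a handful of tokens is effectively a budget decision.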

Lesson #7: There are a lot of good tools and resources out there to help you experiment with this technology.

Here are a few that we like:

HuggingFace Transformers and Diffusers: We are big fans of HuggingFace’s libraries, specifically their Transformers and Diffusers libraries. They make experimenting with, and importantly packaging, cutting-edge research very easy. We still remember when, just a couple of years ago, the approach of the latest research papers had to be painfully re-implemented with no guarantee that the implementation was faithful. Now it is almost assured that, if ML researchers want their paper’s contribution to be widely used, they will work with HuggingFace to make their codebase available through those libraries’ interfaces. We leverage PEFT for fine-tuning and Diffusers for Gen Image AI.

FLAN-T5: In terms of open-source LLMs, we really like FLAN-T5 because of its pre-training on multi-task instructions. It generalizes well to simple Instruct-based prompts in “Few-Shot” and “Fine-Tuning” regimes but does struggle in “Zero-Shot” cases. Coupled with LoRA techniques for fine-tuning and TorchServe for serving, we were able to fine-tune and serve these models with fairly low effort and compute costs.
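The core idea behind the LoRA technique mentioned above is simple enough to sketch: instead of updating a full weight matrix W during fine-tuning, you train two small low-rank matrices A and B and add their scaled product to W. A toy pure-Python illustration (the shapes, values, and scaling convention here are for demonstration only):

```python
def matmul(a, b):
    """Multiply two matrices represented as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]

def lora_adapted(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A), the LoRA-adapted weights."""
    delta = matmul(B, A)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

# Toy example: W is 2x2, adapter rank r = 1, so A is 1x2 and B is 2x1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]   # r x d_in
B = [[1.0], [2.0]] # d_out x r
W_adapted = lora_adapted(W, A, B, alpha=1.0, r=1)
```

Only A and B are trained (roughly 2·d·r parameters instead of d² per matrix), which is why LoRA fine-tuning is so much cheaper in memory and compute than updating the full model.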

GCP Vertex AI Matching Engine: LLMs by themselves are revolutionary but coupled with domain or external knowledge can become brand defining. We are already using Vertex AI’s Matching Engine for Approximate Nearest Neighbor search, and we plan to use it to power BuzzFeed-aware Generative AI systems in the future using Gen Q&A techniques.

Vercel: Most Generative AI ideas get hacked together in a Colab notebook or something similar, but for the uninitiated that can be cumbersome, and the lack of UI/UX limits its ability to showcase what the idea can do. Enter Vercel Apps. They have ready-made, low-code templates that you can use to quickly spin up a demo app and give decision-makers and broader stakeholders a concrete view of what your Generative AI idea can be.

OpenAI: Without doubt the primary instigator in putting Generative AI front and center in the world. We really like their API stability, exposure to model controls (e.g. stop sequences, temperature, logit_bias), and playgrounds. As mentioned before, we gave everybody at BuzzFeed access to their playgrounds as soon as we could, which allowed AI ideas to flow freely across all departments and further fueled the appetite for Generative AI.

Stable Diffusion v2 and ControlNet: ControlNet’s ability to maintain the outlines of objects in a base image while interpreting a prompt to modify the image is quite powerful and has a variety of use cases. Similarly, Stable Diffusion v2’s advancements in producing high-quality generations with fairly low numbers of steps make large-scale, real-time image generation a reality.