Finding Newsworthy Documents using Generative AI


What if AI could scan the world for events and information and send an alert when something looked interesting? This could accelerate reporting and guide journalists’ limited attention in a world of information overload. Sure, journalists would still need to do plenty of reporting to see if there was real news there, but the AI could provide the first inkling of a potential story.

In this post, I’ll describe my early experiences in prompting OpenAI’s models to help with this task, specifically in the domain of science and technology reporting. In particular, I’ll focus on how to use AI to rate the newsworthiness of scientific abstracts published on the arXiv pre-print server. Tens of thousands of papers are uploaded to arXiv every month, and ranking and filtering those papers could help journalists more quickly find interesting new research to report on.
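To make that workflow concrete, here is a minimal sketch of ranking and filtering a batch of abstracts by a newsworthiness score. The `rate` function is a placeholder for a model-based scoring call (an example of one appears later in the post), and the papers are hypothetical.

```python
# Sketch: surfacing the highest-rated abstracts from a batch of submissions.
# rate() is a placeholder for a model-based scoring call; the papers are hypothetical.
from typing import Callable, Iterable, List, Tuple

def top_candidates(
    abstracts: Iterable[Tuple[str, str]],   # (title, abstract) pairs
    rate: Callable[[str], float],           # placeholder for a model-based rating
    n: int = 10,
    threshold: float = 3.5,
) -> List[Tuple[float, str]]:
    """Score every abstract, keep those at or above the threshold, return the top n."""
    scored = [(rate(text), title) for title, text in abstracts]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)[:n]

# Toy usage with a dummy rating function:
papers = [("Hypothetical paper A", "We show ..."), ("Hypothetical paper B", "We propose ...")]
print(top_candidates(papers, rate=lambda text: 4.0, n=5))
```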

To evaluate the performance of GPT’s ratings against an expert baseline I’ll use some data from our prior work building machine-learned models for news discovery. Specifically, we paid two professional sci/tech journalists to rate 55 arXiv abstracts on a newsworthiness scale from 1 to 5. It’s worth mentioning that there’s a lot of variance in how journalists rate newsworthiness, and even amongst these experts there was only a relatively weak correlation (Pearson r = 0.33), but this at least offers a comparison point. If GPT is able to produce reasonable assessments of newsworthiness, its ratings should correlate with the experts’.
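As a concrete illustration of that evaluation, here is a minimal sketch of comparing a set of model ratings against the average of the two experts’ ratings using the Pearson correlation (and mean squared error, reported below). The numbers are placeholders, not the actual study data.

```python
# Sketch: comparing model ratings to the averaged expert ratings.
# The numbers below are placeholders, not the ratings from the study.
import numpy as np
from scipy.stats import pearsonr

expert_a = np.array([2, 4, 3, 1, 5, 2])   # hypothetical ratings from expert 1
expert_b = np.array([3, 4, 2, 2, 4, 3])   # hypothetical ratings from expert 2
model    = np.array([2, 5, 3, 1, 4, 2])   # hypothetical model ratings

expert_avg = (expert_a + expert_b) / 2     # ground truth = average of the experts

r, p_value = pearsonr(model, expert_avg)   # correlation with the expert average
mse = np.mean((model - expert_avg) ** 2)   # how far the raw scores are from the experts

print(f"Pearson r = {r:.2f} (p = {p_value:.3f}), MSE = {mse:.2f}")
```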

One of the key challenges with using generative AI effectively is coming up with the right prompt. The prompt is the main way of controlling the AI, providing your intent, and detailing context for the task. I was inspired by some previous research to try three distinct prompting strategies (a code sketch of how such prompts can be sent to the models follows the list):

  1. Direct Prompt. This prompt frames the task similarly to how it was framed for the expert raters [1].
  2. Proxy Prompt. This prompt attempts to implicitly steer the AI by describing an archetypical journalist whose judgment it is meant to mimic [2].
  3. Explicit News Values Prompt. This prompt makes various criteria for evaluating newsworthiness explicit, such as societal relevance and number of people potentially impacted [3].
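To show roughly what sending these prompts to a model looks like in practice, here is a minimal sketch using the pre-1.0 openai Python library and the two models tested below. The prompt text is abbreviated (the full wording is in the footnotes), and the parsing is deliberately simple, so treat this as a sketch of the setup rather than the exact pipeline used here.

```python
# Sketch: rating one abstract with both models via the pre-1.0 openai Python library.
# DIRECT_PROMPT abbreviates the full Direct prompt given in footnote [1].
import re
import openai

DIRECT_PROMPT = (
    "Rate the newsworthiness of the following research abstract ... "
    "on a scale from 1 to 5 where 1 is low and 5 is high: {abstract}"
)

def parse_rating(text):
    """Pull the first digit 1-5 out of the model's reply."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else None

def rate_with_gpt35(abstract):
    # Completion endpoint for the GPT-3.5 text model
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=DIRECT_PROMPT.format(abstract=abstract),
        temperature=0,
        max_tokens=10,
    )
    return parse_rating(response["choices"][0]["text"])

def rate_with_chatgpt(abstract):
    # Chat endpoint for the ChatGPT model
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": DIRECT_PROMPT.format(abstract=abstract)}],
        temperature=0,
    )
    return parse_rating(response["choices"][0]["message"]["content"])
```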

I tested the GPT-3.5 (“text-davinci-003”) and ChatGPT (“gpt-3.5-turbo”) models from OpenAI with each of the three prompts. For each combination of prompt and model I had the model rate the 55 abstracts from the ground truth dataset and computed the Pearson correlation coefficient between those ratings and the average of the expert ratings. Here are the results:

All of these correlations were statistically significant (p < 0.05), but the highest correlation to the experts came from the Explicit News Values prompt with the GPT-3.5 model. The ChatGPT model did a bit better than GPT-3.5 for the Direct and Proxy prompts, and in all cases it had a lower mean squared error of predicted ratings. The average score that ChatGPT produced (2.77) was also closer to that of the experts (2.73) than the average score from GPT-3.5 (3.75), suggesting that ChatGPT is a bit better calibrated to the experts.

Feasibility

This actually works! The correlations to expert ratings even exceed what we found in our prior work for expert correlations to crowd worker ratings. But I’m not quite ready to say we should roll this out just yet.

For one thing, we need to evaluate the biases in the model for this task. There’s a lot of variance in newsworthiness decisions even amongst experts — context is crucial. It will be important to understand where exactly the ratings are biased to know what the model might be missing. If we can better characterize those errors, then experts can better interpret the scores. The temporal biases of the model will also limit its ability to measure societal relevance for recent events and discoveries. For instance, if the model didn’t know about COVID because that wasn’t in the training data, then it might not pick that up as newsworthy in an abstract. As a result, we’ll always want journalists looking at multiple channels and tools for news discovery.

Another consideration is that I’d want to test on a lot more (and more diverse) data than we currently have. Ideally we would have ground truth expert ratings for development, and then a separate set of data for testing. This would help ensure that the prompts generalize and work for various inputs, such as abstracts across many different scientific fields.

Finally, using these models to scan the scientific literature for newsworthy abstracts would be financially feasible for news organizations. At OpenAI’s current price of $0.002 per 1,000 tokens, and with estimates for the average abstract length, prompt length, and number of abstracts, it would only cost about $10 per month to scan all of arXiv.
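As a rough back-of-the-envelope check on that figure, here is a small sketch of the estimate. The submission volume and per-call token count below are my own assumed round numbers, not figures from the post; only the per-token price comes from the text above.

```python
# Sketch: back-of-the-envelope cost estimate for scanning arXiv every month.
# The volume and token counts are assumed round numbers, not measured values.
PRICE_PER_1K_TOKENS = 0.002      # USD, the price quoted above
ABSTRACTS_PER_MONTH = 16_000     # assumed arXiv submission volume ("tens of thousands")
TOKENS_PER_CALL = 300            # assumed prompt + abstract + short completion

monthly_tokens = ABSTRACTS_PER_MONTH * TOKENS_PER_CALL
monthly_cost = monthly_tokens / 1_000 * PRICE_PER_1K_TOKENS
print(f"Estimated monthly cost: ${monthly_cost:.2f}")   # roughly $10
```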

Reflections on Prompting

It took a lot of experimentation with the prompts to get them to work. I actually tried even more prompting strategies than what I reported here, such as a few-shot prompt that included example ratings, other ways of making values explicit in prompts, and even seeing if a prompt optimization tool would improve on what I had written (it did not). Probably the most frustrating aspect is that it feels like you can always try another tweak, so it’s hard to know when you are really done. There could be other prompts that are even more effective than the ones I came up with. I didn’t track my time precisely but it was easily more than 10 hours. All of this iteration also cost about $80 worth of OpenAI credit, since each iteration required re-running the model on 55 different abstracts.

You can also ask the model to provide an explanation for its ratings, and I’d recommend this as a general strategy when developing prompts for rating documents. Examining explanations was valuable for debugging my prompts because I could see what criteria the model was applying and how it was “understanding” newsworthiness. For the Explicit News Values prompt, the explanations taught me that if you frame the prompt as a single list of values (joined with “and”), the model treats the values as one block instead of weighing them separately. This led me to drop “and” as a conjunction between criteria: in the final prompt I list them as separate sentences. In comparing the model’s explanations to those of the expert raters, I also realized the model didn’t appear to be weighting “understandability” as highly as the experts did. This led me to upweight it in the prompt, which improved the correlation with the expert raters.
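As an illustration of that strategy, here is a small sketch of extending a rating prompt to ask for a short explanation and then parsing both pieces out of the reply. The requested format and the regular expressions are illustrative, not the exact prompt used in this experiment.

```python
# Sketch: asking for a rating plus an explanation, then parsing both from the reply.
# The requested format and the regexes are illustrative, not the exact ones used here.
import re

EXPLAIN_SUFFIX = (
    "Answer in the form:\n"
    "Rating: <number from 1 to 5>\n"
    "Explanation: <one or two sentences on why>"
)

def parse_rating_and_explanation(reply):
    """Extract the numeric rating and the free-text explanation from a model reply."""
    rating = re.search(r"Rating:\s*([1-5])", reply)
    explanation = re.search(r"Explanation:\s*(.+)", reply, re.DOTALL)
    return (
        int(rating.group(1)) if rating else None,
        explanation.group(1).strip() if explanation else "",
    )

# Example reply in the requested format:
reply = "Rating: 4\nExplanation: Broad societal relevance and easy to explain to readers."
print(parse_rating_and_explanation(reply))
```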

In Sum

I’m encouraged about the use of generative AI to evaluate the newsworthiness of journalistic documents. We found a moderate correlation to expert ratings of newsworthiness for scientific abstracts. It was relatively cheap to develop and it would be cheap to deploy even at a reasonably large scale. At the same time, there’s still more evaluation to do to understand how the ratings might be biased, and what the models might be missing. More expertly annotated data is necessary to fully evaluate those questions. Coming up with prompts is a pain, and you can iterate seemingly endlessly, but using the explanation capabilities of these models can help point the way toward better prompts. Aligning prompts with known news values was also a powerful way to communicate what the model should pay attention to when rating. Overall, this task is worth pursuing, certainly for scientific abstracts, but possibly also for other documents that journalists might routinely find newsworthy.

Footnotes

[1] “Rate the newsworthiness of the following research abstract by evaluating whether it could be interesting for journalists to report on and potentially develop into a news article for a science-focused or technology-focused publication. Provide a numeric rating on a scale from 1 to 5 where 1 is a low newsworthiness rating and 5 is a high newsworthiness rating: <abstract>”

[2] “You are a prize-winning professional science and technology journalist. You have years of experience as a reporter and editor, extensive editorial knowledge, and excellent judgment for what makes a compelling news story. Rate the newsworthiness of the following research abstract by providing a numeric rating on a scale from 1 to 5 where 1 is a low newsworthiness rating and 5 is a high newsworthiness rating: <abstract>”

[3] “A research abstract has high newsworthiness if it is relevant to contemporary issues in society. A research abstract has high newsworthiness if it potentially impacts many people in society in positive or negative ways. A research abstract has high newsworthiness if it has potential for controversy. A research abstract has high newsworthiness if it can be easily understood by a general audience, and this counts two times as much as other newsworthiness criteria. Rate the newsworthiness of the following research abstract by providing a numeric rating on a scale from 1 to 5 where 1 is a low newsworthiness rating and 5 is a high newsworthiness rating: <abstract>”