LLMs Might Be Ruining Data Analysis

Entertainment Strategy Guy
Aug 22, 2025
(Welcome to the Entertainment Strategy Guy, a newsletter on the entertainment industry and business strategy. I write a weekly Streaming Ratings Report and a bi-weekly strategy column, along with occasional deep dives into other topics, like today’s article. Please subscribe.)

In the first few years of this blog/website-turned-newsletter, I used to write lengthy pieces of analysis, then I would follow those articles up with “Director’s Cuts” articles, sharing all the stray thoughts that didn’t make it into the original piece.

I tried to keep Wednesday’s article focused on just one question (“Are elite screenwriters using AI?”), but, of course, I had other thoughts and takeaways that didn’t fit, especially since that article was already over 3,000 words long.

You see, this is an AI story, but not in the way you think. While testing that hypothesis, I uncovered another example of how AI/LLMs can go wrong. People need to learn about AI’s strengths and opportunities, but also its limitations. In particular, here’s my primary worry:

AI has SIGNIFICANT flaws when used in data analysis, from collection and categorization to analysis and maybe even visualization.

Today, I want to explain how to mitigate those problems, along with looking at my larger worries.

Let’s dive right in!

LLMs Are Terrible At Data Analysis

Let me ask you, have you read any data analysis articles that mention “training an AI model” to run data analysis? I sure have. Without naming names, some of the biggest “data journalists” out there use this technique all the time. And boy, I’m skeptical now anytime I read about that.

If you’d like a really fresh example, well, let’s just head over to a Variety article from just a couple of days ago:

Here’s the article’s intro:

“On July 2, Amazon launched “Heads of State,” an action-comedy film starring Idris Elba and John Cena as the respective leaders of the United Kingdom and the United States. But the viewership data indicated that this was no traditional two-hander. Representatives for Priyanka Chopra Jonas, who plays a senior MI6 agent and the film’s third lead, used AI tools including Grok and ChatGPT to measure and analyze viewer sentiment; they found that their client was the main driver of the success of the movie — the fourth-most-watched Amazon MGM Studios film of all time on the platform — on the talent side.

...“I don’t think she would normally be credited for [the film’s success] because she’s not the lead. She’s not a ‘head of state.’ But in this case, the data doesn’t lie,” says Anjula Acharia, who is Chopra Jonas’ manager.”

I have so many questions, to say the least.

I don’t want to say that I’ve never written “the data doesn’t lie” or offered some similar sentiment (because, often, data disproves other people’s arguments), but as anyone who works with data knows, data can be massaged, manipulated, tweaked, and so on. Data can’t lie, but it sure can mislead, and no one chart represents “truth”. (Often, when I do data dives, I try to use as many different data cuts as I can; check out my post on Netflix’s waning Top Ten chart dominance for an example.) At the very least, that Variety article seems to conflate “viewership data” with social media interest; those two things are not the same.

But the bigger issue is the source of this data: LLMs.

As I pointed out in Wednesday’s article, I had my LLM develop an “Is this AI?” test based on the features that typically flag an AI-written article or, in this case, screenplay. Then I had it run that test on the 2024 Black List screenplays. Then I had it run the analysis again. And it gave me two different results.

Let me repeat this because I want it to sink in:

I had an LLM run an analysis on a data set, and it provided two different sets of results.

This is, in my opinion, the single biggest flaw in using AIs. Due to their structure and their tendency to hallucinate, the same prompt can produce different outputs. That means someone could either re-run an analysis until they get the result they want, or unknowingly end up with flawed data.
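To make this concrete, here’s a minimal sketch of how you might check for this yourself. It assumes the OpenAI Python SDK and a made-up scoring prompt (not my actual test); the point is simply to run the identical prompt several times and see whether the answers match.

```python
# A rough sketch, not my actual methodology: ask the same question about the
# same text several times and check whether the model gives the same answer.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "On a scale of 0 to 10, how likely is this screenplay excerpt to be "
    "AI-written? Reply with the number only.\n\n{excerpt}"
)

def score_excerpt(excerpt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT.format(excerpt=excerpt)}],
        temperature=0,  # even at temperature 0, answers can vary between runs
    )
    return response.choices[0].message.content.strip()

excerpt = "INT. COFFEE SHOP - DAY. Two strangers reach for the same cup..."
scores = [score_excerpt(excerpt) for _ in range(5)]
print(scores)
print("Stable across runs?", len(set(scores)) == 1)
```

If the five scores don’t match, no single run deserves to be treated as “the” result.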


That’s the first issue. The second issue was that my LLM couldn’t really explain how it derived its result. Further digging just made me less and less confident in the result, and the LLM was only able to (sort of) explain how it calculated the scores on three of its five metrics. It’s a true black box. I could see this happening in the social media example above: “Hey Grok, build a model determining which star of the film drove the most interest using social media.” The user would have little idea how Grok built the model it used to answer that question. And since the Variety article didn’t provide any actual data, we can’t fact-check it.

Third, LLMs want to “satisfy” their users. The worry is that Priyanka Chopra’s managers could ask ChatGPT or Grok if Chopra was popular on social media (again, a questionable metric that deserves its own asterisk), and the model could try to optimize its answer to please them. I’m not saying that happened here, but I worry about it constantly.

Fourth, I see a lot of folks leveraging/creating “synthetic” data or using AI for classification or data analysis. In my experience, AIs are often bad at these tasks. Just this week, my editor/researcher compiled a list of romcoms by US box office (for my article on the genre in the Ankler), then he asked the LLM to identify movies that were bad examples of romcoms. The LLM screwed up both ways, labelling actual romcoms as not romcoms and missing obvious examples of non-romcoms that made the list. (Gnomeo and Juliet isn’t a romcom, but Music and Lyrics obviously is; my LLM didn’t know this.) It had an error rate of well over 30%.
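For what it’s worth, the fix here isn’t complicated. Here’s a rough sketch (with illustrative titles and hypothetical labels, not our actual dataset) of the kind of spot-check that catches this: compare the LLM’s labels against a small hand-labeled sample and compute the disagreement rate before trusting the rest.

```python
# Illustrative only: compare an LLM's genre labels against a small
# hand-checked sample and compute the error rate. The labels below are
# examples, not our actual romcom dataset.
hand_labels = {                  # ground truth, checked by a human
    "Music and Lyrics": True,    # clearly a romcom
    "Gnomeo and Juliet": False,  # animated Shakespeare riff, not a romcom
    "The Proposal": True,
    "50 First Dates": True,
}

llm_labels = {                   # what the model returned (hypothetical values)
    "Music and Lyrics": False,   # the kind of miss described above
    "Gnomeo and Juliet": True,
    "The Proposal": True,
    "50 First Dates": True,
}

misses = [title for title in hand_labels if llm_labels[title] != hand_labels[title]]
error_rate = len(misses) / len(hand_labels)
print("Disagreements:", misses)
print(f"Error rate on the hand-checked sample: {error_rate:.0%}")
```

A sample this small proves nothing on its own, but even a quick hand-check like this is what surfaces a 30%-plus error rate before the numbers end up anywhere near a chart.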

Finally, LLMs/AI have a huge problem with hallucinations, or making stuff up, as happened to me. And some computer scientists have argued that more advanced training runs are only making newer AI models more likely to hallucinate, not less.

To sum up, when using AI/LLMs to handle data, I see five issues:

  1. AIs' analyses can have (sometimes wildly) different results on different runs.

  2. LLMs creating bespoke analysis can’t output their models, meaning they are true black boxes.

  3. LLMs can try to please/satisfy their users, leading to bias.

  4. LLMs often make a lot of mistakes when classifying data or making synthetic data.

  5. LLMs can hallucinate, simply making up false data.

These days, I keep going back to the Mugatu GIF I shared the last time I wrote about LLMs: I feel like I’m taking crazy pills. When I use LLMs, the mistakes are legion. The consistency is awful. They can’t learn. Again, you can run an LLM’s analysis twice and get different results. Or they get basic facts and analysis wrong.

But countless data scientists are using LLMs to get black box results they can’t verify or test. If you can’t verify the data, I really wouldn’t trust it. If you do still want to use it, you need to explain very, very clearly how you...

...accounted for the five potential flaws above.
...double-checked the accuracy of your data.

That last point is the most crucial. Trust me, for any and all data analysis on this website, my team or I have hand-checked every result we get.

If you see analysis derived from AIs/LLMs where the author can’t say the same thing, I wouldn’t trust it.

How AI/LLMs Can Be Useful

Now, to be clear, I did use LLMs to help with the research for Wednesday’s article, and they were wonderful. But what matters is how I used them.
