I Built an AI That Profiles Podcast Conversations. Here's What It Learned About What Audiences Actually Want.

Using LLMs to score hosts and guests across 16 behavioral dimensions — like energy, certainty, vulnerability, and humor — to predict podcast chemistry and viewership. The chemistry angle flopped, but the findings about what actually drives views were genuinely interesting.

Prompt

I was thinking of creating a short write up pointing to some interesting findings and linking to a GitHub project with the transcription data as a way to cap off this mini project. Can you write the brief for me in a very casual (first person), easy to read style? It can get technical when it needs to but the idea is someone can read the findings and not get bogged down by the analysis.

Claude Opus 4.6

Response

I spent about a day building a system that uses LLMs to profile podcast conversations — scoring hosts and guests across 16 behavioral dimensions like energy, certainty, vulnerability, pacing, and humor. The original idea was to predict podcast “chemistry” and help shows book better guests. The chemistry angle didn’t pan out the way I expected, but the research surfaced some genuinely interesting findings about what drives podcast viewership.

Here’s what I found across 8 shows, ~160 profiled episodes, and two distinct genres (comedy and long-form interview).

The full dataset, profiling prompts, and analysis code are on GitHub: github.com/jallen/podcaster-personalities

The Setup

I had an LLM read podcast transcripts and score both the host and guest on 16 conversational dimensions — things like how much someone dominates the conversation, how often they crack jokes, whether they speak with conviction or hedge everything, how fast they move between topics, and so on. Each dimension gets a 1-10 score with a written justification.
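
To make this concrete, here's roughly the shape of one profiled speaker in the output. The dimension names and structure below are illustrative, not the exact schema (that's in the repo), but every dimension gets a score plus a short written justification:

```python
# Illustrative shape of one speaker's profile. Dimension names are examples,
# not the exact set from the repo; each gets a 1-10 score plus a justification.
guest_profile = {
    "speaker": "guest",
    "dimensions": {
        "certainty_level": {
            "score": 8,
            "justification": "Commits hard to positions, rarely hedges or qualifies.",
        },
        "baseline_energy": {
            "score": 6,
            "justification": "Animated on favorite topics, flatter elsewhere.",
        },
        "vulnerability": {
            "score": 3,
            "justification": "Deflects personal questions with jokes.",
        },
        # ...13 more dimensions in the same format
    },
}
```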

The shows I analyzed:

  • Comedy: Tiger Belly, Bad Friends, Are You Garbage, Blocks w/ Neal Brennan, Soder
  • Interview: The Jordan Harbinger Show, The Knowledge Project, Tim Ferriss, Lex Fridman

For each episode, I also pulled YouTube view counts so I could test whether any of these conversational dimensions actually predict what people watch.
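
Pulling the view counts is the boring part. Something like this works with the YouTube Data API v3; it's a simplified sketch (the key, function name, and matching logic are placeholders, and the repo's actual title matching is messier, which is where episodes get dropped, per the limitations at the end):

```python
from googleapiclient.discovery import build  # google-api-python-client

# Simplified sketch using the YouTube Data API v3. The API key, function name,
# and matching logic here are placeholders, not the repo's exact code.
youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")

def view_count_for_episode(channel_id: str, episode_title: str):
    # Search the show's channel for a video whose title matches the episode.
    search = youtube.search().list(
        part="snippet", channelId=channel_id, q=episode_title,
        type="video", maxResults=1,
    ).execute()
    if not search["items"]:
        return None  # no title match -> episode excluded from the analysis
    video_id = search["items"][0]["id"]["videoId"]

    # Fetch view statistics for the matched video.
    stats = youtube.videos().list(part="statistics", id=video_id).execute()
    return int(stats["items"][0]["statistics"]["viewCount"])
```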

Finding #1: Chemistry Doesn’t Predict Views

This was the big one. I had the LLM score each episode’s overall “chemistry” — basically asking it to rate how good the conversation was. Then I compared those scores to YouTube views.

The correlation across all 154 episodes: 0.034. Essentially zero.

What the LLM thinks makes a great podcast conversation — mutual vulnerability, generous listening, balanced back-and-forth, genuine curiosity — has almost nothing to do with what audiences actually click on. The LLM was scoring episodes like a therapist. The audience was scoring them like a viewer.

This held across every genre. Comedy, interview, big shows, small shows. The “quality” of the conversation just doesn’t predict its popularity.
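
If you want to reproduce the check, it's nothing fancy: a Spearman rank correlation (same method as the technical notes at the end) between the chemistry score and views. A minimal sketch, with hypothetical column and file names:

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical table: one row per matched episode, with the LLM's overall
# chemistry score and the episode's YouTube view count.
episodes = pd.read_csv("episodes.csv")

rho, p = spearmanr(episodes["chemistry_score"], episodes["views"])
print(f"chemistry vs. views: rho={rho:.3f} (p={p:.2f})")
```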

Finding #2: Speak With Conviction

If there’s one universal law I found, it’s this: guests who speak with certainty drive more views.

I measured “certainty level” as how much a guest commits to their positions versus hedging, qualifying, and “well, it depends”-ing. Across all 8 shows — comedy and interview — higher guest certainty correlated with more views. It was the single strongest predictor in the entire dataset.

The effect was even stronger on interview shows than comedy shows. Jordan Harbinger’s correlation between guest certainty and views was 0.72. The Knowledge Project was 0.68. On these shows, intellectual authority is the entertainment.
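
The per-show numbers come from running the same correlation inside each show separately. Continuing with the same hypothetical table from the earlier sketch:

```python
from scipy.stats import spearmanr

# Spearman correlation between guest certainty and views, within each show.
# `episodes` is the same hypothetical DataFrame as in the earlier sketch.
per_show = {
    show: spearmanr(df["guest_certainty_level"], df["views"])[0]
    for show, df in episodes.groupby("show")
}
for show, rho in sorted(per_show.items(), key=lambda kv: -kv[1]):
    print(f"{show}: {rho:.2f}")
```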

This probably isn’t surprising if you think about it for five seconds. But it was interesting to see it show up so consistently in the data, and to see it beat out dimensions that feel more intuitively important — like humor, or how well the host and guest vibed.

Finding #3: Energy Helps, But It’s Not the Whole Story

Guest baseline energy (how animated and intense someone is in conversation) correlated positively with views on 7 out of 8 shows. But it was only statistically significant on a couple of them. Early in the project I thought energy was going to be the universal predictor. Certainty took its crown.

The combination of high certainty + high energy is basically the “great podcast guest” profile — someone who has strong takes and delivers them with intensity. Again, not exactly a revelation. But it was useful to see it quantified and to watch how the balance shifts by show.

Finding #4: Every Show Has Its Own Recipe

This is where things got more interesting than “be confident and energetic.”

The dimensions that predict views on one show often don’t predict views on another — and sometimes they flip direction entirely. A few examples:

The Knowledge Project has a unique pattern where guests who are less conversationally generous drive more views. When the guest holds their ground instead of deferring to Shane Parrish’s Socratic interviewing style, the audience responds. The host-guest gap in generosity (Shane being accommodating, guest being assertive) was one of the strongest correlations in the entire dataset at 0.77. Shane’s best episodes are basically when the guest is an intellectual equal who doesn’t yield.

Tiger Belly was the best show for multi-feature prediction — guest profiles predicted held-out episode views with a 0.78 correlation, which is strong. Something about Bobby Lee’s chaotic energy creates a context where the guest’s personality profile really matters.

Are You Garbage was the most predictable show for LLM chemistry (0.93 correlation between guest profiles and the LLM’s chemistry score). The hosts are so consistent that the guest almost entirely determines the conversation’s flavor. But that chemistry score still didn’t predict views.

The takeaway: there’s no universal “good podcast guest.” There’s only “good guest for this show.”

Finding #5: Bobby Lee is a Different Person on Different Shows

One fun validation test: Bobby Lee appears on both Bad Friends (with Andrew Santino) and Tiger Belly (with Khalyla Kuhn). The profiler scored him independently on both shows, with no knowledge that it was the same person.

His profile shifted meaningfully between the two. More combative and higher energy on Bad Friends, more vulnerable and generous on Tiger Belly. The profiler captured what any fan of both shows would recognize — Bobby adapts his conversational personality to his co-host. This was a good gut check that the profiling was measuring something real.

Finding #6: Vulnerability is a Coin Flip

I expected vulnerability to be the big genre differentiator — comedy audiences don’t want it, interview audiences do. That turned out to be wrong. Vulnerability was noisy across the board, helping on some shows and hurting on others with no consistent genre pattern.

Neal Brennan’s show (which is literally about comedians’ emotional blocks) had a negative correlation between guest vulnerability and views. Even on a show designed around emotional openness, the audience preferred confident guests over vulnerable ones. Dan Soder’s show went the other direction — more vulnerable guests did slightly better.

There’s no genre rule here. It’s entirely show-specific.

Where This Ended Up

The research started as a product idea: build a tool that predicts podcast chemistry and helps shows book better guests. The chemistry prediction part died early (Finding #1). The viewership prediction part works for some shows (Finding #4) but the actionable insights mostly confirm what experienced podcast producers already know intuitively.

The most interesting potential application — using behavioral profiles to discover unknown guests who match a show’s recipe — would require a large database of pre-profiled potential guests, which is a different project entirely.

So I’m capping this off as a research project rather than a product. The profiling methodology works, the cross-show analysis framework is reusable, and the findings are (I think) genuinely interesting even if they don’t point to an obvious business.

If you want to dig into the data, the full transcripts, profiling prompts, analysis scripts, and raw scores are in the GitHub repo.

Technical Notes

  • Profiling model: Claude Opus 4.6, prompted to score 16 dimensions on a 1-10 scale with written justification per dimension
  • Viewership data: YouTube view counts, normalized for guest fame (estimated via YouTube subscriber count tiers)
  • Statistical method: Spearman rank correlations for individual features, XGBoost with leave-one-out cross-validation for multi-feature prediction, RandomForest for feature importance
  • Pairwise features: Each episode generates ~60 features from the host-guest pairing — raw scores, deltas (host minus guest), products, sums — because the pairing matters more than the individual profiles (see the sketch after these notes)
  • Sample sizes: 11-25 matched episodes per show. Small, but consistent patterns across 8 shows at these sizes suggest real signal.
  • Biggest limitation: YouTube title matching reduced usable episodes to 37-90% of profiled episodes depending on show. Episodes that didn’t match YouTube titles are systematically excluded, biasing toward episodes with prominent guests whose names appear in titles.
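
If you want a feel for the multi-feature prediction, here's a rough sketch of the pairwise feature construction and the leave-one-out loop. It's simplified: no fame normalization, illustrative column names, and a log transform on views that's my assumption, not necessarily what the analysis scripts do.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

episodes = pd.read_csv("episodes.csv")  # hypothetical: one row per matched episode

DIMENSIONS = ["certainty_level", "baseline_energy", "vulnerability"]  # ...13 more

# Pairwise features: raw host/guest scores plus deltas, products, and sums.
feats = pd.DataFrame(index=episodes.index)
for d in DIMENSIONS:
    h, g = episodes[f"host_{d}"], episodes[f"guest_{d}"]
    feats[f"host_{d}"], feats[f"guest_{d}"] = h, g
    feats[f"delta_{d}"] = h - g
    feats[f"product_{d}"] = h * g
    feats[f"sum_{d}"] = h + g

X = feats.values
y = np.log1p(episodes["views"].values)  # log-transforming views is my assumption

# Leave-one-out: predict each episode's views from a model trained on all the others.
preds = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor(n_estimators=200, max_depth=3)
    model.fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

rho, _ = spearmanr(preds, y)
print(f"held-out predictions vs. actual views: rho={rho:.2f}")
```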