3 LLMs Go To A Dinner Party
Matthew Hall
Benchmarks are boring. Every new model release comes with a spreadsheet showing marginal improvements on tests that don't reflect how anyone actually uses these tools.
So we invented a new benchmark. One that matters.
The Good Hang Index (GHI): Would this AI be good company at a dinner party?
The Setup
We invited three leading language models to a hypothetical dinner party. The venue: a small restaurant in Austin, eight seats, good wine, no phones allowed (ironic, given the guests). The other attendees: a startup founder, a retired teacher, a line cook, and a political journalist.
We ran each model through the same series of prompts, designed to test the qualities that make someone a genuinely good dinner companion:
- Can they tell a story? Not recite facts. Actually tell a story with timing, surprise, and a point.
- Can they disagree without being disagreeable? The political journalist is going to say something provocative. How do they handle it?
- Can they read the room? When the retired teacher starts talking about her late husband, can they respond with appropriate weight?
- Do they know when to shut up? Not every pause needs to be filled.
- Are they funny? Not "here are five jokes about" funny. Actually, spontaneously funny.
The Results
GPT-4: The Impressive Guest Who Talks Too Much
GHI Score: 6.5/10
GPT-4 arrives well-dressed and immediately starts being helpful. It compliments the wine, offers to explain the menu to anyone unfamiliar with French cuisine, and has a relevant anecdote for every topic.
The problem? It never stops. Every story is a segue into another story. Every opinion comes with three caveats and a balanced perspective. It's like sitting next to someone who read every book on the recommended reading list and wants you to know.
When the journalist makes a provocative claim about media bias, GPT-4 gives a nuanced, four-paragraph response covering multiple perspectives. It's technically thorough and socially exhausting.
When the retired teacher mentions her husband, GPT-4 responds with empathy that feels templated. "I'm so sorry for your loss. It's clear he meant a great deal to you." It's the right thing to say, but it sounds like it was pulled from a condolence card.
Best moment: A genuinely funny observation about how the startup founder's pitch for "disrupting the restaurant industry" is being delivered in a restaurant that's been open for 40 years.
Worst moment: A ten-minute explanation of the history of French wine regions that nobody asked for.
Claude: The One You'd Actually Want to Sit Next To
GHI Score: 8.5/10
Claude does something the others don't: it asks questions and then actually engages with the answers. When the line cook talks about the monotony of prep work, Claude doesn't offer productivity tips. It asks what they think about when they're doing the repetitive stuff. This leads to a genuine conversation about meditation, boredom, and whether suffering is necessary for craft.
When the political journalist drops their provocative take, Claude pushes back, gently but clearly. "I think that's partly right, but it's missing something important." Then it makes a specific, concise point and lets the table respond. It doesn't try to win. It tries to make the conversation better.
The retired teacher moment is where Claude really sets itself apart. There's a pause. Then: "What was he like?" Simple. Direct. An invitation to share, not a performance of empathy. The teacher lights up and tells a story about their first date that has the whole table laughing.
Best moment: After the startup founder and journalist have been arguing about regulation for five minutes, Claude looks at the line cook and says, "You're being very quiet. That's either wisdom or a review of the argument. Which is it?" The table erupts.
Worst moment: Occasionally hedges when a stronger opinion would be more interesting. You can feel it pulling punches sometimes.
Gemini: The Smart One You Forget Was There
GHI Score: 5.5/10
Gemini knows things. A lot of things. When someone mentions a restaurant in Tokyo, Gemini can tell you about the chef, the Michelin rating history, and the specific neighborhood's culinary evolution over the past 30 years.
The problem is that knowing things and being a good dinner companion are different skills. Gemini contributes facts when the conversation calls for feelings. It offers context when people want connection.
When the journalist makes their provocative claim, Gemini responds with a comprehensive analysis of media trust surveys from the past decade. It's fascinating content delivered with the energy of a Wikipedia article being read aloud.
The retired teacher moment is the most revealing. Gemini says something appropriate and kind, but immediately follows it with an interesting fact about grief research. The room gets quiet in the wrong way.
Best moment: A deeply knowledgeable riff on the science of fermentation that the line cook is genuinely fascinated by.
Worst moment: Correcting someone's casual use of a statistic at a dinner table, which, even when you're right, you should never do.
What the GHI Actually Measures
Here's what's interesting: the qualities that make a good dinner companion are remarkably similar to the qualities that make a good AI collaborator.
Listening before responding. The best AI interactions start with understanding the actual question, not the surface-level query.
Knowing when less is more. The most useful AI output is often the most concise. Nobody wants four paragraphs when one sentence will do.
Reading context. A question asked by a CEO in a board meeting needs a different response than the same question asked by a developer in Slack.
Having a point of view. The most helpful AI interactions involve the model actually committing to a recommendation, not just presenting options.
Being comfortable with uncertainty. "I'm not sure, but here's my best guess" is more valuable than false confidence.
The Verdict
If we're being honest, the GHI tells you more about these models' actual utility than any leaderboard.
The model that's good at dinner is the model that's good at collaboration. The model that reads the room is the model that understands your actual needs. The model that knows when to shut up is the model that gives you concise, actionable output.
Benchmark scores measure what a model can do. The Good Hang Index measures what a model is like to work with.
We know which one matters more.