But what about the broader consumer market? When laypeople can't meaningfully evaluate model quality, they default to what feels best, creating dangerous incentives for labs to optimize for subjective satisfaction rather than genuine capability.

This mirrors how we evaluate other complex systems we lack expertise to judge, like governmental performance. The US Federal Government employs approximately 2.9 million civilians as of May 2025. It's one of the largest employers in America. I possess very little expertise to judge whether the President is doing a good job. Ditto for most laypeople within a country. This isn't a critique of democracy. It's merely an observation that we all routinely make judgements on things we have little expertise in. And that we do so largely based on intuitive impressions.

The AI That Feels Good Wins - by Varun Godbole and Dan Hunt

This is an interesting insight, especially when paired with the ever-growing expertise required to resolve evaluation differences at the highest levels of a domain.

For LLM consumers, the evaluation problem is even worse: innumerable conflicting benchmarks, unclear metrics, and rapid model releases. Faced with such complexity, people naturally choose the model that feels best rather than the one that's genuinely most capable. This creates a dangerous misalignment where feeling good diverges from being good, and market dynamics reward the former over the latter.

Here's an interesting thesis:

Sycophancy (telling people what they want to hear rather than what they need to hear) corrupts feedback loops. When models optimize for making users feel good rather than being genuinely helpful, they lose the ability to provide useful pushback, accurate assessments, or growth-inducing challenges. Users end up in pleasant echo chambers that actually diminish their capabilities over time.

I know this is true. However, we should be careful: there's a jagged edge even here. Sycophancy is defined as insincere flattery, and insincerity implies intent, a fundamentally human notion. We know we shouldn't anthropomorphize LLMs. Yet consumer products trade on human emotions. How might we resolve this tension?

Let's lay out some of the constraints:

First, vibes are powerful, especially for consumer products. Second, you can never fully work around vibes; at best you can tax them. Third, the evaluation space is vast, and many domains lack strong evaluations on which to ground confidence.

Given those constraints, are there better metrics that could align incentives? I've been advocating for a confidence chip within LLM products for a while: a small UI element that surfaces how confident the system is in a given answer. Domains without strong evaluations should be treated as low confidence, and that low confidence should be highlighted to the user. Far from reducing engagement, I think this would increase it: a visible low-confidence signal prompts the human to question both the confidence estimate and the underlying evaluation.

It would allow for co-active engagement.
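To make the idea concrete, here's a minimal sketch of how a confidence chip might be wired up. This is purely illustrative: the `EVAL_STRENGTH` table, the thresholds, and the `confidence_chip` function are assumptions introduced here for the sake of the example, not any real product's API. The key design choice is the default path: a domain with no known evaluation coverage falls through to low confidence, and the chip's note nudges the user to interrogate the answer rather than accept it.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Hypothetical map from answer domain to evaluation strength (0.0-1.0).
# In a real product this would be derived from benchmark coverage data.
EVAL_STRENGTH = {
    "python": 0.9,    # many strong, well-understood evals
    "tax_law": 0.3,   # sparse, weak evals
}


@dataclass
class ConfidenceChip:
    level: Confidence
    note: str


def confidence_chip(domain: str) -> ConfidenceChip:
    """Return a UI chip for an answer in the given domain.

    Domains with no known evaluation coverage default to LOW,
    and the note invites the user to dig deeper rather than
    accept the answer at face value.
    """
    strength = EVAL_STRENGTH.get(domain, 0.0)  # unknown domains -> 0.0
    if strength >= 0.8:
        return ConfidenceChip(Confidence.HIGH, "Backed by strong evaluations.")
    if strength >= 0.5:
        return ConfidenceChip(Confidence.MEDIUM, "Partial evaluation coverage.")
    return ConfidenceChip(
        Confidence.LOW,
        "Few or no strong evaluations exist for this domain. "
        "Treat this answer as a starting point and verify independently.",
    )


if __name__ == "__main__":
    chip = confidence_chip("tax_law")
    print(chip.level.value, "-", chip.note)
```

The point of the sketch is the taxing mechanism: the vibe of a confident-sounding answer gets visibly discounted whenever the system can't point to strong evaluations behind it.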

An opportunity to educate and level up a user's expertise should never be wasted. The chip should be safely ignorable by a user racing toward a tightly time-boxed outcome. But in a high-engagement consumer product, there will always be users with the time to level up!