3 Comments

Excellent take!

Just to push back a little on one of the claims (I know what you meant, but I'm taking the chance to bring up some debate): there are definitely completely objective ways to evaluate an LLM.

Just pick a task, build a dataset, and run the LLM. Take code generation, for example. It is relatively easy to build a dataset of coding challenges and evaluate the LLM completely objectively by running its output against test cases.
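For concreteness, here is a rough sketch of what that kind of test-case evaluation could look like in Python. The `generate` function is a hypothetical stand-in for whatever LLM call you'd actually use; the rest is plain Python.

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call that returns generated code."""
    return "def add(a, b):\n    return a + b"

# A tiny illustrative dataset of coding challenges with test cases.
challenges = [
    {
        "prompt": "Write a function add(a, b) that returns a + b.",
        "tests": [("add(2, 3)", 5), ("add(-1, 1)", 0)],
    },
]

def passes(code: str, tests) -> bool:
    scope: dict = {}
    try:
        exec(code, scope)                         # run the generated code
        return all(eval(expr, scope) == expected  # check every test case
                   for expr, expected in tests)
    except Exception:
        return False                              # crashes count as failures

pass_rate = sum(passes(generate(c["prompt"]), c["tests"])
                for c in challenges) / len(challenges)
print(f"pass rate: {pass_rate:.0%}")              # a fully objective number
```

Nothing subjective in the loop: the model's code either passes the tests or it doesn't.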

What you mean, of course, is that there is no single objective measure to claim one LLM is better than the rest, *in general*. And there never will be, simply because these models have many uses that imply inherent tradeoffs. What we have is an infinite number of meaningful objective metrics, depending on the task you want to evaluate. And of course, we also have many subjective metrics, such as those from the Chatbot Arena. That's why there isn't one objective measure to rule them all.

So I just wanted to bring that up because it's an interesting discussion, and I'm sure you agree with me.

That’s exactly what I meant. Thank you for the clarification!

Don't mention it. ☺️☺️
