Humanity’s Last Exam

Benchmarks are interesting.

Here’s the deep thought: at what point in the overall benchmark process will AI inject bias into the benchmark itself? And to what end? Maybe not so deep a thought after all.

Humanity’s Last Exam has been bandied about extensively. Here’s a great place to catch up on it: Humanity’s Last Exam

My musings: check out the crazy difficulty of the questions.

So: 2,500 questions of this caliber of difficulty. The top AI models hit about 20% accuracy in answering them.


I would also note the Calibration Error, which affirms that “Given low performance on Humanity’s Last Exam, models should be calibrated, recognizing their uncertainty rather than confidently provide incorrect answers, indicative of confabulation/hallucination. To measure calibration, we prompt models to provide both an answer and their confidence from 0% to 100%.” The better-performing models, OpenAI o3, o4-mini, and Gemini 2.5 Pro, also have better Calibration Error numbers.
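To make the idea concrete, here is a minimal sketch of one common way to score calibration: bucket the model's stated confidences into bins and compare each bin's average confidence to its actual accuracy (expected calibration error). This is an illustration of the general technique, not necessarily the exact metric the HLE leaderboard reports.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error.

    confidences: floats in [0, 1] (the model's stated confidence)
    correct: bools (whether the model's answer was actually right)
    Returns a value in [0, 1]; 0 means perfectly calibrated.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Include the right edge (confidence == 1.0) in the last bin.
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(1 for i in in_bin if correct[i]) / len(in_bin)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(in_bin) / n) * abs(avg_conf - accuracy)
    return ece

# A model that claims 90% confidence but is right only 20% of the time
# is badly calibrated -- exactly the confabulation pattern the quote
# above describes.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9, 0.9],
    [True, False, False, False, False])
print(overconfident)
```

The intuition: low accuracy alone isn't damning on an exam this hard, but a model that answers wrongly while reporting high confidence gets penalized here, whereas a model that says "I'm 20% sure" and is right 20% of the time scores near zero.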

