Humanity’s Last Exam

Benchmarks are interesting.

Here’s the deep thought: at what point in the overall benchmark process will AI inject bias into the benchmark test? And to what end? Maybe not so deep a thought.

Humanity’s Last Exam has been bandied about extensively. Here’s a great place to catch up on it: Humanity’s Last Exam

My musings: check out the crazy difficulty of the questions:

So that’s 2,500 questions of this caliber of difficulty. The top AI models hit 20% accuracy in answering them.

I would also note the Calibration Error. As the site puts it: “Given low performance on Humanity’s Last Exam, models should be calibrated, recognizing their uncertainty rather than confidently provide incorrect answers, indicative of confabulation/hallucination. To measure calibration, we prompt models to provide both an answer and their confidence from 0% to 100%.” The better-performing models (OpenAI o3 and o4-mini, and Gemini 2.5 Pro) also have better Calibration Error numbers.
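
For the curious, here is a minimal sketch of how those answer-plus-confidence pairs could be boiled down to a single calibration number. It assumes a binned root-mean-square gap between stated confidence and actual accuracy, which is one common approach and not necessarily the exact formula the benchmark uses. The function name, the ten equal-width bins, and the sample data are all assumptions for illustration.

```python
def calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error (illustrative, not the benchmark's code).

    confidences: stated confidence per answer, as fractions in [0, 1]
    correct:     1 if the answer was right, 0 if it was wrong
    """
    if len(confidences) != len(correct):
        raise ValueError("need one correctness flag per confidence")
    if not confidences:
        raise ValueError("need at least one answer")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, hit))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's squared gap by its share of all answers.
        weighted_sq_gap += (len(members) / total) * (avg_conf - accuracy) ** 2
    return weighted_sq_gap ** 0.5


# Made-up data: both models get 2 of 10 right, but one claims 90% confidence
# on every answer and the other claims 20%.
results = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
overconfident = calibration_error([0.9] * 10, results)
well_calibrated = calibration_error([0.2] * 10, results)
print(round(overconfident, 3), round(well_calibrated, 3))
```

In this toy case both models have the same 20% accuracy, but the one claiming 90% confidence scores far worse on calibration. A low score means a model knows when it doesn't know.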

Rollups best left to fruit

The Neuron newsletter served up this TechCrunch article about AI-powered rollups. Read the article: https://techcrunch.com/2025/06/01/early-ai-investor-elad-gil-finds-his-next-big-bet-ai-powered-rollups/. Here's a quote from the article: "The idea is to identify opportunities to buy...

Bill Gates, not a jerk billionaire

Bill Gates has announced he's giving away $200B over the next 20 years to help address 3 moonshot global needs. Here's the announcement from the Gates Foundation. There are a lot of people who idolize billionaires and I am not one of them. I have an internal screed about...

Gemini = Lazy Google?

Now my mind is just equating Gemini to Lazy Google. For example, I wanted to modify the footer of the tquist.com website. But which footer, and where do I modify it? In the pre-Gemini (or pre-LLM) days I would have typed my question into Google search and looked for...

AI Musings 5-5-2025

I am struggling to determine the best way to compare the big three LLMs. A side-by-side comparison of the same prompt is logical but yeesh, not sure I’ll have the time for such detailed analysis. Another thought I’ve had is to use a different one for time periods and...

AI Musings 5-2-2025

I have been trying to use the top 3 (in my mind at least) LLMs regularly. I feel most likely to grab Copilot, as Microsoft has so nicely put a cool-looking, rainbow-colored icon on the Windows taskbar. I scratch my head a little as I don't recall allowing Microsoft...

AI limitations – let’s change a Google Voice number

TQuist had a Google Voice number for a long time. I never used it well, and then I neglected it, and then it was released back into the wild. I confirmed earlier that my previous number has been assigned to someone else, which makes sense given my inactivity. So I want a...
