Humanity’s Last Exam

Benchmarks are interesting.

Here’s the deep thought – at what point in the overall benchmark process will AI inject bias into the benchmark test?  And to what end?  Maybe not so deep a thought.

Humanity’s Last Exam has been bandied about extensively. Here’s a great place to catch up on it: Humanity’s Last Exam

My musings: check out the crazy difficulty of the questions:

So that's 2,500 questions of this caliber. The top AI models hit 20% accuracy.

I would also note the Calibration Error, which the authors explain as follows: “Given low performance on Humanity’s Last Exam, models should be calibrated, recognizing their uncertainty rather than confidently provide incorrect answers, indicative of confabulation/hallucination. To measure calibration, we prompt models to provide both an answer and their confidence from 0% to 100%.” The better-performing models – OpenAI o3 and o4-mini and Gemini 2.5 Pro – also have better Calibration Error numbers.
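
To make the calibration idea concrete, here is a minimal sketch of one common way to score it: a binned expected calibration error over (confidence, correct) pairs. This is an illustration only. The function name, the bin count, and the binning scheme are my assumptions, and the benchmark's exact formula may differ.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error (illustrative, not the benchmark's own code).

    confidences: stated confidences in [0, 1], one per answer
    correct:     booleans, True where the answer was right
    Returns the size-weighted mean of |accuracy - mean confidence| per bin.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(accuracy - mean_conf)
    return error


# A model that says "90% sure" on every answer but is right only 2 times in 10
overconfident = expected_calibration_error([0.9] * 10, [True] * 2 + [False] * 8)

# A model that says "20% sure" and is right 2 times in 10
calibrated = expected_calibration_error([0.2] * 10, [True] * 2 + [False] * 8)
```

Both toy models have the same 20% accuracy, but the first scores far worse because its stated confidence is nowhere near its hit rate. That gap is what a calibration number is meant to surface.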

AI Action plan, and stuff

Here ya go folks - this is the current administration's AI Action Plan:  https://www.ai.gov/action-plan Here are some words from the current administration about preventing "woke AI" in the federal government...

CMS sites revisited

My recent research for work includes a CMS review. I haven't looked into what is out there in a long time. The quick search hits showed lots of familiar faces and a couple of new ones. I found this post informative:...

Congratulations Impact Makers

Impact Makers was awarded "Best for the World" by B Lab. Here's a Richmond Times-Dispatch mention of the award. I met Michael Pirron shortly after moving to Richmond in 2004, and have watched him methodically and conscientiously build Impact Makers into a...

DropBox Security

Oh, the line between security and convenience is harsh. While reading TechRepublic I found this interesting article on DropBox security. I love the convenience of cloud technologies, and use DropBox like lots of people, including the article author (Michael Kassner)....

GOOG perspective on links

Analytics is an area that I will be focusing on quite a bit in the coming months. I read this article about linking and Penguin 2.0 changes on a website called "Search Engine Watch." There are many pieces to the content puzzle that websites have to face, and...

Google Apps v. Office 365

The TechRepublic newsletter is worth a scan every day.  I found this gem recently and it's worth a read if you use, or might use in the future, either Google Apps or Microsoft Office. Here it is: Google Apps v. Office 365: Head-to-head comparison of features...

UltraEdit

I first used UltraEdit sometime in the late 90s.  I loved it back then. When I went to work for TQuist in 2003 I purchased another copy. I just now retired my last XP system, and so I've decided to get the latest copy.  I hope it's aged well. They've changed...
