Artificial Analysis: Independent LLM Evals as a Service - with George Cameron and Micah Hill-Smith - Latent Space: The AI Engineer Podcast Recap

Podcast: Latent Space: The AI Engineer Podcast

Published: 2026-01-08

Duration: 1 hr 18 min

Guests: George Cameron, Micah Hill-Smith

Summary

George Cameron and Micah Hill-Smith discuss their platform, Artificial Analysis, which offers independent AI benchmarking services. They explore the intricacies of model evaluation, the importance of maintaining independence, and future directions in AI model capabilities.

What Happened

George Cameron and Micah Hill-Smith recount how Artificial Analysis started as a side project to independently evaluate AI models, addressing a lack of third-party benchmarking. Initially built to solve their own needs, the platform quickly gained traction and became a valuable resource for developers and enterprises navigating AI model decisions.

The conversation highlights the business model of Artificial Analysis, which remains free for public use but offers private benchmarking services for companies. The platform distinguishes itself by not accepting payments for better results, maintaining strict independence in its evaluations.

George and Micah discuss how they assess AI models through various metrics, including speed, cost, and accuracy, while emphasizing the importance of running benchmarks consistently across all models to ensure fairness. They note that their evaluations have become crucial as the number of AI models has exploded.
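To make the "run everything the same way" idea concrete, here is a minimal sketch of what such a harness could look like. It assumes hypothetical OpenAI-compatible endpoints; the endpoint URLs, model names, prompt set, and pricing figures are placeholders for illustration, not Artificial Analysis's actual methodology or prompt sets.

```python
import time
from openai import OpenAI  # assumes the `openai` package for OpenAI-compatible endpoints

# Hypothetical endpoints and prices, for illustration only.
ENDPOINTS = {
    "model-a": {"base_url": "https://api.example-a.com/v1", "api_key": "KEY_A", "usd_per_1m_output_tokens": 10.0},
    "model-b": {"base_url": "https://api.example-b.com/v1", "api_key": "KEY_B", "usd_per_1m_output_tokens": 3.0},
}

# One fixed prompt set, run identically against every model for fairness.
PROMPTS = [
    {"question": "What is 17 * 24?", "answer": "408"},
]

def evaluate(model_name: str, cfg: dict) -> dict:
    """Run the shared prompt set against one model and record accuracy, latency, and cost."""
    client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
    correct, latencies, output_tokens = 0, [], 0
    for item in PROMPTS:
        start = time.perf_counter()
        resp = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": item["question"]}],
            temperature=0,  # identical settings for every model
        )
        latencies.append(time.perf_counter() - start)
        output_tokens += resp.usage.completion_tokens
        if item["answer"] in resp.choices[0].message.content:
            correct += 1
    return {
        "accuracy": correct / len(PROMPTS),
        "median_latency_s": sorted(latencies)[len(latencies) // 2],
        "est_cost_usd": output_tokens / 1_000_000 * cfg["usd_per_1m_output_tokens"],
    }

if __name__ == "__main__":
    for name, cfg in ENDPOINTS.items():
        print(name, evaluate(name, cfg))
```

The point of the sketch is the structure: the same prompts, the same decoding settings, and the same scoring applied to every model, so differences in the numbers reflect the models rather than the harness.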

The guests delve into the challenges of benchmarking, such as variance in model performance and the difficulty of keeping results accurate across different model configurations and settings. They stress the importance of transparency and independence in the evaluation process.

The episode also covers the evolution of their Intelligence Index, which now includes new metrics like hallucination rates and agentic capabilities. They aim to provide a comprehensive view of AI model capabilities beyond traditional QA benchmarks.
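As a rough illustration of how such a composite index might be assembled, the sketch below combines normalized benchmark scores with a weighted average, inverting hallucination rate so that lower rates raise the score. The benchmark names, weights, and example scores are invented for this example and are not Artificial Analysis's actual Intelligence Index methodology.

```python
# Hypothetical weights for illustration only.
WEIGHTS = {
    "knowledge_qa": 0.3,
    "math": 0.2,
    "coding": 0.2,
    "agentic_tasks": 0.2,
    "hallucination": 0.1,  # lower hallucination rate should raise the index
}

def intelligence_index(scores: dict) -> float:
    """Combine per-benchmark scores in [0, 1] into a single 0-100 index."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = scores[name]
        if name == "hallucination":
            value = 1.0 - value  # invert: a hallucination rate of 0.07 contributes 0.93
        total += weight * value
    return 100 * total

print(intelligence_index({
    "knowledge_qa": 0.82, "math": 0.74, "coding": 0.68,
    "agentic_tasks": 0.55, "hallucination": 0.07,
}))
```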

George and Micah highlight how the AI landscape has become more competitive, with multiple labs contributing open-source models and innovations. They discuss the future of AI benchmarking, including the inclusion of agentic tasks and the need for continuous updates to reflect emerging capabilities.

The episode concludes with insights into the broader implications of AI evaluation, such as the impact on enterprise decision-making and the importance of evolving benchmarks to keep pace with rapid advancements in AI technology.

Key Insights