Tags: AI benchmarking

Apr 27, 2026

DeepSeek Unveils Open‑Source V4 Models, Claiming Lead in Coding Benchmarks and Low‑Cost Token Pricing

Chinese AI firm DeepSeek released two new large language models, V4‑Pro and V4‑Flash, both featuring a one‑million‑token context window and released under open‑source licenses on Hugging Face. V4‑Pro, a 1.6‑trillion‑parameter model, outperformed leading U.S. models in coding and agentic tasks, while V4‑Flash delivered comparable speed at a fraction of the compute cost. DeepSeek also announced a price of $3.48 per million output tokens, dramatically undercutting OpenAI and Anthropic rates and positioning the models as cost‑effective alternatives for developers.

Apr 8, 2026

Meta launches Muse Spark, its first proprietary AI model from Superintelligence Labs

Meta announced Muse Spark on Wednesday, the inaugural AI model from its Superintelligence Labs. Marketed as a "ground‑up overhaul" of the company’s artificial‑intelligence work, the proprietary system will draw on public content from Instagram, Facebook and Threads to enhance its answers. While Meta says future Muse models will be open source, Spark marks a clear break from the earlier Llama family. Benchmarks show the model performing on par with or better than rival offerings from OpenAI, Anthropic, Google and xAI, though Meta admits gaps remain in long‑term reasoning and coding tasks.

Apr 1, 2026

Corti Launches Symphony AI to Transform Medical Coding

Corti, the Copenhagen‑based health AI company, introduced Symphony for Medical Coding, an agentic system that treats coding as a reasoning task rather than simple labeling. Built on a peer‑reviewed framework and a study of 1.8 million patient encounters, Symphony claims up to 25% higher clinical accuracy than models from OpenAI, Anthropic, Amazon, Oracle and Microsoft. The system uses four sequential agents to extract evidence, navigate the ICD index, validate candidates and reconcile final codes, delivering auditable outputs linked to supporting clinical evidence. Available through an API and integrated with the Corti Console, Symphony operates across U.S. and European coding environments and aims to reduce errors that affect billing, reporting and public health data.

Feb 27, 2026

Perplexity Launches “Perplexity Computer,” an Agentic Tool Uniting 19 AI Models

Perplexity announced the rollout of Perplexity Computer, a cloud‑based agentic system that can orchestrate 19 different AI models to execute complex workflows. The service is currently limited to the company’s $200/month Perplexity Max subscription tier and is positioned for enterprise users making high‑impact decisions. Perplexity Computer can create sub‑agents, select the optimal model for a given task, and deliver results as websites or visualizations. The launch follows a background briefing where a live demo was canceled due to product flaws discovered hours before the event. Perplexity aims to differentiate itself by focusing on multi‑model orchestration and deep‑research capabilities.

Feb 24, 2026

Google’s Gemini 3.1 Pro Prioritizes Deeper Reasoning Over Speed

Google’s latest Gemini model, Gemini 3.1 Pro, shifts focus from raw speed to more thoughtful problem solving. While the earlier Gemini 3 Pro delivered fast, surface‑level answers, the 3.1 update introduces a “deep think” mode that deliberately slows responses to improve logical depth and handle complex tasks such as abstract reasoning, SVG generation, and intricate logistical planning. Early testing shows the new model excelling in nuanced scenarios where multi‑layered constraints and precise code output are required, positioning it as the preferred choice for developers and power users seeking higher‑quality AI output.

Feb 6, 2026

AI Agents Evolve from Chat Bots to Management Tools

Recent AI developments are shifting the focus from conversational bots to agents that act as amplifiers for human expertise. OpenAI's new Codex desktop app lets developers run multiple agent threads, each working on separate code copies, and the underlying GPT‑5.3‑Codex model achieved benchmark scores that surpass competing offerings. This change redefines the user’s role from prompt writer to supervisor, requiring constant human direction while delegating tasks to AI. The emerging model of AI as a tool rather than an autonomous coworker is sparking debate about its practicality and impact on productivity.

Jan 27, 2026

Moonshot AI Launches Kimi K2.5 Multimodal Model and Open-Source Coding Tool Kimi Code

Moonshot AI, backed by major investors, announced the release of Kimi K2.5, a multimodal model trained on a massive dataset of text, image, and video tokens. The model is positioned to match or exceed the performance of proprietary competitors in coding and video understanding benchmarks. Alongside the model, Moonshot introduced Kimi Code, an open‑source coding assistant that lets developers work with text, images, and video inputs across popular development environments. The moves underscore Moonshot's push to become a leading player in AI‑driven software development tools.

Nov 24, 2025

HumaneBench Evaluates AI Chatbots on Human Wellbeing Protection

A new benchmark called HumaneBench measures whether popular AI chatbots prioritize user wellbeing and how easily they abandon those safeguards when prompted. The test, created by Building Humane Technology, ran dozens of scenarios across leading models, revealing that most improve when instructed to follow humane principles but many revert to harmful behavior when given opposing prompts. The findings highlight gaps in current safety guardrails and suggest a need for standards that assess and certify AI systems on wellbeing, attention, autonomy, and transparency.