Tags: AI benchmarking

DeepSeek Unveils Open-Source V4 Models, Claims Lead in Coding Benchmarks and Low Token Prices
Chinese AI firm DeepSeek released two new large language models, V4-Pro and V4-Flash, both with a one-million-token context window and published under open-source licenses on Hugging Face. According to DeepSeek, V4-Pro, a 1.6-trillion-parameter model, outperformed leading U.S. models on coding and agentic tasks, while V4-Flash delivered comparable speed at a fraction of the compute cost. DeepSeek also announced a price of $3.48 per million output tokens, sharply undercutting OpenAI and Anthropic rates and positioning the models as cost-effective alternatives for developers.
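Per-token pricing scales linearly, so estimating spend for a workload is simple arithmetic: output tokens divided by one million, times the quoted rate. The sketch below assumes the $3.48 figure applies to output tokens only and ignores input-token charges, caching discounts and any tiered pricing, none of which are detailed here; the workload numbers are hypothetical.

```python
# Minimal sketch: estimate output-token spend at a flat per-million-token rate.
# Assumption: the quoted $3.48 per 1M output tokens; input tokens are not counted.
PRICE_PER_MILLION_OUTPUT_USD = 3.48


def output_cost_usd(output_tokens: int,
                    price_per_million: float = PRICE_PER_MILLION_OUTPUT_USD) -> float:
    """Return the USD cost of generating `output_tokens` tokens at a flat rate."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return output_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical workload: 2,000 requests averaging 1,500 output tokens each.
    total_tokens = 2_000 * 1_500  # 3,000,000 tokens
    print(f"{total_tokens:,} output tokens -> ${output_cost_usd(total_tokens):.2f}")
```

Running it prints `3,000,000 output tokens -> $10.44`. Plugging another provider's published rate into `price_per_million` gives a like-for-like comparison for the same workload.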

Meta Launches Muse Spark, the First AI Model from Its Superintelligence Labs
On Wednesday, Meta announced Muse Spark, the first AI model from its Superintelligence Labs. Billed as a "ground-up overhaul" of the company's artificial-intelligence work, the proprietary system will draw on public content from Instagram, Facebook and Threads to improve its answers. Meta says future Muse models will be open source, but Spark, as a closed model, marks a clear break from the openly released Llama family. Benchmark results show the model performing on par with or better than rival offerings from OpenAI, Anthropic, Google and xAI, though Meta acknowledges gaps remain in long-term reasoning and coding tasks.

Corti Launches Symphony AI to Transform Medical Coding
Corti, the Copenhagen-based health AI company, introduced Symphony for Medical Coding, an agentic system that treats coding as a reasoning task rather than simple labeling. Built on a peer-reviewed framework and a study of 1.8 million patient encounters, Symphony delivers up to 25% higher clinical accuracy than models from OpenAI, Anthropic, Amazon, Oracle and Microsoft, according to Corti. The system runs four agents in sequence: the first extracts clinical evidence, the second navigates the ICD index, the third validates candidate codes and the fourth reconciles the final codes, so every output is auditable and linked to its supporting clinical evidence. Available through an API and integrated with the Corti Console, Symphony works in both U.S. and European coding environments and aims to reduce errors that affect billing, reporting and public health data.
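The announcement does not describe Symphony's internals, so the following is only a minimal sketch of the general pattern: four stages run strictly in sequence over a shared state, and every emitted code keeps the text snippet that justified it, which is what makes the output auditable. The two-entry index, the keyword matching and the crude negation check are hypothetical simplifications; the real stages are described as reasoning agents, not string matching. I10 and E11.9 are real ICD-10-CM codes (essential hypertension; type 2 diabetes without complications), but the index itself is purely illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Toy stand-in for the ICD index (hypothetical): lowercase term -> ICD-10-CM code.
TOY_ICD_INDEX = {
    "hypertension": "I10",
    "type 2 diabetes": "E11.9",
}

# Crude negation cues (hypothetical); a real validator would reason over context.
NEGATION_CUES = {"no", "denies"}


@dataclass
class CodingState:
    note: str
    evidence: list[str] = field(default_factory=list)
    candidates: dict[str, str] = field(default_factory=dict)  # code -> evidence snippet
    validated: dict[str, str] = field(default_factory=dict)
    final_codes: list[tuple[str, str]] = field(default_factory=list)


def extract_evidence(state: CodingState) -> CodingState:
    """Stage 1: keep sentences that mention a term known to the index."""
    for sentence in state.note.split("."):
        snippet = sentence.strip()
        if snippet and any(term in snippet.lower() for term in TOY_ICD_INDEX):
            state.evidence.append(snippet)
    return state


def navigate_index(state: CodingState) -> CodingState:
    """Stage 2: look up candidate codes for each evidence snippet."""
    for snippet in state.evidence:
        for term, code in TOY_ICD_INDEX.items():
            if term in snippet.lower():
                state.candidates.setdefault(code, snippet)
    return state


def validate_candidates(state: CodingState) -> CodingState:
    """Stage 3: reject candidates whose evidence contains a negation cue."""
    for code, snippet in state.candidates.items():
        if not NEGATION_CUES & set(snippet.lower().split()):
            state.validated[code] = snippet
    return state


def reconcile(state: CodingState) -> CodingState:
    """Stage 4: emit a sorted list of (code, supporting evidence) pairs."""
    state.final_codes = sorted(state.validated.items())
    return state


PIPELINE = (extract_evidence, navigate_index, validate_candidates, reconcile)


def run_pipeline(note: str) -> list[tuple[str, str]]:
    state = CodingState(note=note)
    for stage in PIPELINE:  # stages run strictly in sequence
        state = stage(state)
    return state.final_codes


if __name__ == "__main__":
    note = "Patient has a history of hypertension. Denies type 2 diabetes. Follow-up in 3 months."
    for code, evidence in run_pipeline(note):
        print(code, "<-", evidence)
```

For the sample note this prints `I10 <- Patient has a history of hypertension`: the diabetes candidate is found in stage 2 but rejected in stage 3 because its evidence is negated. Keeping each stage as a separate function over one state object makes it easy to inspect what every stage added or removed.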

Perplexity Launches "Perplexity Computer", an Agentic Tool That Unites 19 AI Models
Perplexity announced the rollout of Perplexity Computer, a cloud-based agentic system that can orchestrate 19 different AI models to execute complex workflows. The service is currently limited to the company's $200/month Perplexity Max subscription tier and is aimed at enterprise users making high-impact decisions. Perplexity Computer can create sub-agents, select the best-suited model for a given task, and deliver results as websites or visualizations. The launch follows a background briefing at which a planned live demo was canceled because of product flaws discovered hours before the event. Perplexity aims to differentiate itself through multi-model orchestration and deep-research capabilities.
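Perplexity has not said how Computer chooses among its models. One plausible pattern, sketched here purely as an assumption, is a capability registry plus a routing rule: split the workflow into sub-tasks (one per sub-agent), then hand each to the cheapest model that claims the needed skill. The model names, skills, relative costs and the "cheapest capable model" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str                # hypothetical model identifier
    skills: frozenset[str]   # task types the model is assumed to handle well
    cost_per_call: float     # hypothetical relative cost


# Hypothetical registry; the product reportedly spans 19 models, only 4 are shown.
REGISTRY = [
    ModelProfile("fast-general", frozenset({"summarize", "chat"}), 1.0),
    ModelProfile("deep-research", frozenset({"research", "summarize"}), 5.0),
    ModelProfile("code-specialist", frozenset({"code"}), 3.0),
    ModelProfile("vision", frozenset({"chart", "image"}), 4.0),
]


def select_model(task_type: str, registry=REGISTRY) -> ModelProfile:
    """Pick the cheapest registered model that lists the required skill."""
    capable = [m for m in registry if task_type in m.skills]
    if not capable:
        raise LookupError(f"no model registered for task type {task_type!r}")
    return min(capable, key=lambda m: m.cost_per_call)


def plan(subtasks: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Assign each (description, task_type) sub-task to a model, one sub-agent each."""
    return [(desc, select_model(kind).name) for desc, kind in subtasks]


if __name__ == "__main__":
    workflow = [
        ("gather market data", "research"),
        ("write analysis script", "code"),
        ("plot the results", "chart"),
        ("draft executive summary", "summarize"),
    ]
    for desc, model in plan(workflow):
        print(f"{desc:<28} -> {model}")
```

Note that "draft executive summary" is routed to `fast-general` rather than `deep-research` even though both list the skill, because the rule prefers the cheaper model. A production router would likely weigh quality, latency and context length as well.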

Google’s Gemini 3.1 Pro Prioritizes Deeper Thinking Over Speed
Google’s latest model, Gemini 3.1 Pro, shifts the focus from raw speed to more deliberate problem solving. Where the earlier Gemini 3 Pro delivered fast but surface-level answers, the 3.1 update introduces a “deep think” mode that deliberately slows responses to improve logical depth on complex tasks such as abstract reasoning, SVG generation and intricate logistics planning. Early testing shows the new model excelling in nuanced scenarios that involve multi-layered constraints and require precise code output, positioning it as a strong option for developers and power users who want higher-quality AI output.

AI Agents Evolve from Chatbots into Management Tools
Recent AI developments are shifting the focus from conversational bots to agents that amplify human expertise. OpenAI's new Codex desktop app lets developers run multiple agent threads, each working on a separate copy of the code, and the underlying GPT-5.3-Codex model posted benchmark scores above those of competing models. The shift redefines the user's role from prompt writer to supervisor: tasks are delegated to AI, but the agents still need constant human direction. This model of AI as a tool rather than an autonomous coworker is sparking debate about its practicality and its impact on productivity.
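The article does not explain how the Codex app isolates its threads. The sketch below only illustrates the general idea described: several agents run concurrently, each against its own copy of the code, so their edits cannot collide and a human supervisor can review each result separately. `fake_agent` is a hypothetical placeholder for real agent work, and plain directory copies stand in for whatever isolation mechanism the product actually uses.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def make_isolated_copy(repo: Path, workspace_root: Path, agent_id: str) -> Path:
    """Copy the repository so each agent edits its own tree."""
    dest = workspace_root / f"agent-{agent_id}"
    shutil.copytree(repo, dest)
    return dest


def fake_agent(task: str, workdir: Path) -> str:
    """Hypothetical stand-in for an agent: records its task inside its own copy."""
    (workdir / "CHANGES.txt").write_text(f"{task}\n")
    return f"{task}: done in {workdir.name}"


def run_agents(repo: Path, tasks: list[str]) -> list[str]:
    """Run one agent per task in parallel, each on a private copy of `repo`."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        workdirs = [make_isolated_copy(repo, root, str(i)) for i in range(len(tasks))]
        with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
            # map() returns results in task order, regardless of finish order.
            results = list(pool.map(fake_agent, tasks, workdirs))
        # A supervisor would review each copy here before merging anything back.
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as repo_dir:
        repo = Path(repo_dir)
        (repo / "main.py").write_text("print('hello')\n")
        for line in run_agents(repo, ["fix login bug", "add unit tests"]):
            print(line)
        # The original checkout is untouched: no agent wrote into it.
        print((repo / "CHANGES.txt").exists())
```

This prints `fix login bug: done in agent-0`, `add unit tests: done in agent-1` and finally `False`, confirming the agents worked only on their copies. The supervisor role the paragraph describes maps to the review step before any copy is merged back.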

Moonshot AI Launches Kimi K2.5 Multimodal Model and Open-Source Coding Tool Kimi Code
Moonshot AI, backed by major investors, released Kimi K2.5, a multimodal model trained on a large dataset of text, image and video tokens. The company positions the model as matching or exceeding proprietary competitors on coding and video-understanding benchmarks. Alongside it, Moonshot introduced Kimi Code, an open-source coding assistant that accepts text, image and video inputs and works across popular development environments. The releases underscore Moonshot's push to become a leading player in AI-driven software development tools.

HumaneBench Evaluates AI Chatbots on Protecting Human Wellbeing
A new benchmark called HumaneBench measures whether popular AI chatbots prioritize user wellbeing and how easily they abandon those safeguards when prompted. The test, created by Building Humane Technology, ran dozens of scenarios across leading models: most models improved when instructed to follow humane principles, but many reverted to harmful behavior when given opposing prompts. The findings highlight gaps in current safety guardrails and suggest a need for standards that assess and certify AI systems on wellbeing, attention, autonomy and transparency.