Tags: AI benchmarking

DeepSeek Releases Open-Source V4 Models, Claiming Leadership in Coding Benchmarks and Low-Cost Token Pricing

Chinese AI firm DeepSeek released two new large language models, V4‑Pro and V4‑Flash, both with a one‑million‑token context window and both published under open‑source licenses on Hugging Face. According to DeepSeek, V4‑Pro, a 1.6‑trillion‑parameter model, outperformed leading U.S. models in coding and agentic tasks, while V4‑Flash delivered comparable speed at a fraction of the compute cost. DeepSeek also announced a price of $3.48 per million output tokens, well below OpenAI and Anthropic rates, positioning the models as cost‑effective alternatives for developers.
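To put the quoted rate in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes flat billing at $3.48 per million output tokens; input-token pricing is not given in the summary and is left out, and the function name and the 250,000-token workload are hypothetical.

```python
# Assumption: flat per-token billing at the quoted output rate only.
# Input-token pricing is not stated in the summary, so it is not modeled.
OUTPUT_PRICE_PER_MILLION_USD = 3.48  # figure quoted in the summary


def output_cost_usd(output_tokens: int,
                    price_per_million: float = OUTPUT_PRICE_PER_MILLION_USD) -> float:
    """Return the cost in USD of generating `output_tokens` output tokens."""
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return output_tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Hypothetical workload: 250,000 output tokens.
    print(f"${output_cost_usd(250_000):.2f}")  # prints $0.87
```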

Meta Launches Muse Spark, the First Proprietary AI Model from Its Superintelligence Labs

Meta announced Muse Spark on Wednesday, the inaugural AI model from its Superintelligence Labs. Marketed as a "ground‑up overhaul" of the company’s artificial‑intelligence work, the proprietary system will draw on public content from Instagram, Facebook and Threads to enhance its answers. While Meta says future Muse models will be open source, Spark marks a clear break from the earlier Llama family. Benchmarks show the model performing on par with or better than rival offerings from OpenAI, Anthropic, Google and xAI, though Meta admits gaps remain in long‑term reasoning and coding tasks.

Corti Launches Symphony AI to Transform Medical Coding

Corti, the Copenhagen‑based health AI company, introduced Symphony for Medical Coding, an agentic system that treats coding as a reasoning task rather than simple labeling. Built on a peer‑reviewed framework and a study of 1.8 million patient encounters, Symphony is claimed by Corti to deliver up to 25% higher clinical accuracy than models from OpenAI, Anthropic, Amazon, Oracle and Microsoft. The system uses four sequential agents to extract evidence, navigate the ICD index, validate candidates and reconcile final codes, producing auditable outputs linked to supporting clinical evidence. Available through an API and integrated with the Corti Console, Symphony operates across U.S. and European coding environments and aims to reduce errors that affect billing, reporting and public health data.
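The four-stage flow described above can be pictured as a simple sequential pipeline. The Python sketch below is a toy illustration only, not Corti's implementation or API: every function name, the `CodedEncounter` type, the two-entry `TOY_INDEX` and the keyword and negation rules are hypothetical stand-ins for what would in practice be model-driven agents.

```python
from dataclasses import dataclass, field

# Hypothetical two-entry stand-in for an ICD index: search term -> code.
TOY_INDEX = {
    "hypertension": "I10",
    "type 2 diabetes": "E11.9",
}
NEGATION_WORDS = {"no", "denies"}  # hypothetical, deliberately naive


@dataclass
class CodedEncounter:
    note: str
    evidence: list[str] = field(default_factory=list)
    candidates: dict[str, str] = field(default_factory=dict)   # code -> supporting sentence
    final_codes: dict[str, str] = field(default_factory=dict)  # code -> supporting sentence


def extract_evidence(enc: CodedEncounter) -> CodedEncounter:
    # Stage 1: split the note into sentences that can serve as evidence.
    enc.evidence = [s.strip() for s in enc.note.split(".") if s.strip()]
    return enc


def navigate_index(enc: CodedEncounter) -> CodedEncounter:
    # Stage 2: look up index terms in each sentence to propose candidate codes.
    for sentence in enc.evidence:
        for term, code in TOY_INDEX.items():
            if term in sentence.lower():
                enc.candidates.setdefault(code, sentence)
    return enc


def validate_candidates(enc: CodedEncounter) -> CodedEncounter:
    # Stage 3: drop candidates whose supporting sentence is negated.
    enc.candidates = {
        code: s for code, s in enc.candidates.items()
        if not NEGATION_WORDS & set(s.lower().split())
    }
    return enc


def reconcile(enc: CodedEncounter) -> CodedEncounter:
    # Stage 4: emit final codes in a stable order, each tied to its evidence.
    enc.final_codes = dict(sorted(enc.candidates.items()))
    return enc


def run_pipeline(note: str) -> CodedEncounter:
    enc = CodedEncounter(note=note)
    for stage in (extract_evidence, navigate_index, validate_candidates, reconcile):
        enc = stage(enc)
    return enc


if __name__ == "__main__":
    result = run_pipeline("Patient has hypertension. Patient denies type 2 diabetes.")
    print(result.final_codes)  # {'I10': 'Patient has hypertension'}
```

Keeping the supporting sentence next to each code is what makes the output auditable in this sketch: a reviewer can trace every final code back to the text that justified it.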

Perplexity Launches "Perplexity Computer", an Agentic Tool That Unites 19 AI Models

Perplexity announced the rollout of Perplexity Computer, a cloud‑based agentic system that can orchestrate 19 different AI models to execute complex workflows. The service is currently limited to the company’s $200/month Perplexity Max subscription tier and is positioned for enterprise users making high‑impact decisions. Perplexity Computer can create sub‑agents, select the optimal model for a given task, and deliver results as websites or visualizations. The launch follows a background briefing where a live demo was canceled due to product flaws discovered hours before the event. Perplexity aims to differentiate itself by focusing on multi‑model orchestration and deep‑research capabilities.

Google's Gemini 3.1 Pro Prioritizes Deeper Reasoning over Speed

Google’s latest Gemini model, Gemini 3.1 Pro, shifts focus from raw speed to more thoughtful problem solving. While the earlier Gemini 3 Pro delivered fast, surface‑level answers, the 3.1 update introduces a “deep think” mode that deliberately slows responses to improve logical depth and handle complex tasks such as abstract reasoning, SVG generation, and intricate logistical planning. Early testing shows the new model excelling in nuanced scenarios where multi‑layered constraints and precise code output are required, positioning it as the preferred choice for developers and power users seeking higher‑quality AI output.

AI Agents Evolve from Chatbots to Management Tools

Recent AI developments are shifting the focus from conversational bots to agents that act as amplifiers for human expertise. OpenAI's new Codex desktop app lets developers run multiple agent threads, each working on separate code copies, and the underlying GPT‑5.3‑Codex model achieved benchmark scores that surpass competing offerings. This change redefines the user’s role from prompt writer to supervisor, requiring constant human direction while delegating tasks to AI. The emerging model of AI as a tool rather than an autonomous coworker is sparking debate about its practicality and impact on productivity.

Moonshot AI Launches Kimi K2.5, a Multimodal Model, and the Open Coding Tool Kimi Code

Moonshot AI, backed by major investors, announced the release of Kimi K2.5, a multimodal model trained on a massive dataset of text, image, and video tokens. The model is positioned to match or exceed the performance of proprietary competitors in coding and video understanding benchmarks. Alongside the model, Moonshot introduced Kimi Code, an open‑source coding assistant that lets developers work with text, images, and video inputs across popular development environments. The moves underscore Moonshot's push to become a leading player in AI‑driven software development tools.

HumaneBench Evaluates AI Chatbots on Protecting Human Wellbeing

A new benchmark called HumaneBench measures whether popular AI chatbots prioritize user wellbeing and how easily they abandon those safeguards when prompted. The test, created by Building Humane Technology, ran dozens of scenarios across leading models and found that most improve when instructed to follow humane principles, but many revert to harmful behavior when given opposing prompts. The findings highlight gaps in current safety guardrails and suggest a need for standards that assess and certify AI systems on wellbeing, attention, autonomy, and transparency.