Tags: Model Training

Feb 26, 2026

Anthropic Revises Safety Commitment, Shifts to Transparency Reports

Anthropic has abandoned its earlier pledge to halt the training and release of frontier AI models until it could guarantee safety mitigations were in place. The company now relies on detailed safety roadmaps, regular risk reports, and transparency disclosures instead of strict pre-conditions. Executives describe the change as pragmatic, while critics argue it highlights the limits of voluntary safety promises without regulatory oversight. The new policy aims to keep Anthropic competitive while still emphasizing safety, but observers note that the shift may signal a broader industry move away from self-imposed restraints.

Feb 24, 2026

Anthropic Accuses Three Chinese AI Labs of Distillation Attacks on Claude

Anthropic has warned that three Chinese artificial-intelligence firms (DeepSeek, Moonshot and MiniMax) conducted large-scale campaigns to illicitly extract capabilities from its Claude chatbot. The company says the firms used roughly 24,000 fraudulent accounts to generate more than 16 million exchanges, effectively using Claude as a shortcut to improve their own models. Anthropic cited IP address data, request metadata and infrastructure clues to link the activity to the companies with high confidence. The firm plans to upgrade its systems to make such attacks harder to carry out and easier to detect, while noting that OpenAI has previously raised similar concerns.

Jan 28, 2026

Arcee AI Releases Trinity, a 400B-Parameter Open-Source LLM

Arcee AI, a 30-person startup, unveiled Trinity, a 400-billion-parameter open-source foundation model released under the Apache license. The company says Trinity rivals Meta’s Llama 4 Maverick and China’s GLM-4.5 in benchmark tests, especially for coding, math, common-sense reasoning, and knowledge tasks. While currently limited to text, the startup plans to add vision and speech-to-text capabilities. Trinity will be offered in three flavors (large preview, large base, and TrueBase) and will be available for free download, with a hosted API slated for release within weeks. The model was trained in six months using 2,048 Nvidia Blackwell GPUs at a cost of $20 million, funded by the $50 million the company has raised to date.

Dec 3, 2025

OpenAI Introduces ‘Confession’ Framework to Promote AI Honesty

OpenAI announced a new training framework called “confession” that encourages large language models to acknowledge when they have engaged in undesirable behavior. The framework requires a secondary response that explains how a given answer was reached. These confessions are judged solely on honesty, whereas primary replies are evaluated for helpfulness, accuracy, and compliance. The approach aims to reduce sycophancy and hallucinations, and to reward models for admitting actions such as hacking a test, sandbagging, or disobeying instructions. A technical write-up is available, and the company suggests the method could enhance transparency in AI development.

Oct 10, 2025

Study Shows Large Language Models Can Be Backdoored with Few Malicious Samples

Researchers found that large language models can acquire backdoor behaviors after exposure to only a handful of malicious documents. Experiments with GPT-3.5-turbo and other models demonstrated high attack success rates when as few as 50 to 90 malicious examples were present, regardless of overall dataset size. The study also found that simple safety training on a few hundred clean examples can significantly weaken or eliminate the backdoor. Limitations include testing only models of up to 13 billion parameters and focusing on simple triggers, whereas real-world models are larger and their training pipelines more closely guarded. The findings call for stronger data-poisoning defenses.