Study shows large language models can be compromised with few malicious examples

Researchers found that large language models can acquire backdoor behaviors after exposure to only a handful of malicious documents. In experiments with GPT-3.5-turbo and other models, attacks achieved high success rates with as few as 50 to 90 malicious examples, regardless of overall dataset size. The study also found that simple safety training on a few hundred clean examples can significantly weaken or eliminate the backdoor. The work has limitations: it tested only models up to 13 billion parameters and only simple triggers, whereas real-world models are larger and their training pipelines more closely guarded. The findings call for stronger defenses against data poisoning.

Oct 10, 2025

Anthropic study shows that a small amount of poisoned data can create a backdoor in large language models

Anthropic released a report detailing how a small number of malicious documents can poison large language models (LLMs) during pretraining. The research showed that as few as 250 malicious documents were enough to embed backdoors in models ranging from 600 million to 13 billion parameters. The findings suggest that data-poisoning attacks may be easier to execute than previously thought. Anthropic conducted the study with the UK AI Security Institute and the Alan Turing Institute, and it urges further research into defenses against such threats.