
They said that the benchmark contains 310 work environments across 52 professional domains, including coding, crystallography, genealogy, and sheet music notation. Each environment consists of real documents totaling around 15K tokens in length, and 5 to 10 complex editing tasks that a user might ask an LLM to perform.
As they stated in the paper's abstract: "Our evaluation shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interactions."
These errors are significant, they said. "The findings show that current LLMs introduce substantial errors when editing work documents, with frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4) losing an average of 25% of document content over 20 delegated interactions, and an average degradation across all models of 50%."
Benchmark exercise receives a thumbs up
Brian Jackson, principal research director at Info-Tech Research Group, found the findings very interesting. "Putting a list of LLMs to the test across different work domains yields lots of useful insights," he said. "I think this kind of benchmark exercise could be helpful to enterprise developers who want to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved."