Darlene Newman recently wrote a great article that makes it abundantly clear why you CANNOT safely use LLMs for contracts or any other document with any legal implications whatsoever!
Not only can you not train out hallucinations, because they are a fundamental function of the technology, but every time the LLM touches the document, it can (and likely will) corrupt something that was already correct (and reviewed) before.
In other words, you collect all your reference documents, ask it to generate a contract that contains all of your mandatory clauses, addresses all the risks, incorporates the schedule, specifies the requirements, etc. etc. etc., and get back a 50-page document where the section, paragraph, and sentence quality ranges from masterpiece to monkey on crack. You then spend hours (to days) fixing everything and ask the LLM to simply correct spelling and grammar and ensure key requirements are met in the new/changed sections only (giving it the original document for comparison). The LLM spits out a cleaned-up copy, you review all the sections you updated, it looks good, and you send it out.
Little do you know that, because you added an article in one section, shortened a sentence in another, and improved the grammar in a third, it decided to rewrite half of those sections for you, because it decided the specific requirements you called out for the new sections weren’t addressed enough. In the process, other key requirements are dropped, risk mitigations are written out, and the contract now heavily favours the other side when something goes wrong. Not at all what you intended, but that’s what you got because you didn’t review all 50 pages with care.
Maybe not too bad if nothing goes wrong, and maybe devastating if it does.
But nothing goes wrong in the short term, so your Legal team decides to use it to help defend a claim against your company. This is where it goes from bad to much, much worse. You upload the brief, outline your counterpoints, upload your supporting documents (including the relevant law and cases you know of), ask it to find more law and cases relevant to your defense, and ask it to create your first response. You let it chug, go to lunch, and come back to a 60-page, 220-point response with a dozen statutes and two dozen cited cases.
You go through all the law, realize that only 8 of the statutes are (somewhat) relevant, and remove the 3 that aren’t along with the fake one the LLM found on the internet. Then you go through all the cases, realize that only 14 are actually supporting, 7 are not relevant, and 3 were completely hallucinated, and make the corrections. You mark the paragraphs that are okay, the ones that need updates, and what updates are needed. You get sign-off on what’s good and what needs updates, and push it through again.
It comes back with a couple of new potential statutes, another 8 potential cases, and updates to multiple paragraphs, and you review again. You find one of the statutes potentially relevant, 4 of the cases real and usable, and half of the paragraphs good. You mark all this, make the updated correction lists, get sign-off, and send it back to the LLM. You don’t notice that it also changed 5 of the paragraphs you were completely happy with, changed some quotes to non-existent quotes, and replaced an approved reference with a hallucinated one.
This goes on for a few more iterations in which key clauses and references are not rechecked. You end up with a 70-page document with a dozen hallucinations, 3 non-existent cases, and faulty logic, despite review by multiple senior partners. No one rechecked what they were happy with in the previous iteration, because they had explicitly told the LLM not to change it and expected it to comply.
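The blind spot in this scenario is that paragraphs approved in one round were never compared against the next round's output. A minimal sketch of such a comparison follows, assuming paragraphs are separated by blank lines and that reviewers record which paragraph indices they signed off on. The function names and the `approved` set are hypothetical, invented for illustration. It only flags approved text that no longer appears verbatim; it says nothing about whether the rest of the document is correct.

```python
import hashlib
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and collapse internal whitespace in each paragraph."""
    return [" ".join(p.split()) for p in re.split(r"\n\s*\n", text) if p.strip()]


def fingerprint(paragraph: str) -> str:
    """Stable hash of a whitespace-normalised paragraph."""
    return hashlib.sha256(paragraph.encode("utf-8")).hexdigest()


def changed_approved(before: str, after: str, approved: set[int]) -> list[int]:
    """Return indices (into `before`) of approved paragraphs that do not
    appear verbatim anywhere in `after`. `approved` is a hypothetical
    record of what reviewers signed off on in the previous round."""
    after_hashes = {fingerprint(p) for p in split_paragraphs(after)}
    before_paras = split_paragraphs(before)
    return [
        i
        for i in sorted(approved)
        if i < len(before_paras) and fingerprint(before_paras[i]) not in after_hashes
    ]


if __name__ == "__main__":
    before = "Clause A stays.\n\nClause B stays.\n\nClause C is new."
    after = "Clause A stays.\n\nClause B stays, mostly.\n\nClause C is new and improved."
    # Paragraphs 0 and 1 were approved; only paragraph 2 was meant to change.
    print(changed_approved(before, after, {0, 1}))
```

Every index this returns is a paragraph that someone has to read again before the next sign-off, whatever the LLM was told to leave alone.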
Unlike an intern, who is naturally lazy, tired of working 84- to 112-hour weeks for peanuts, happy to ignore anything you tell him to ignore, and intelligent (when he chooses to be), the dumber-than-a-doornail LLM recomputes the meaning of its inputs on every request. It has the same chance of messing up every time, the same chance of understanding the request but predicting that you were being facetious and actually want it to rewrite the paragraphs chock-full of hallucinations, and so on. You don’t notice, submit the brief with $1,000/hour senior partner sign-off, and make a mockery of your firm with all the AI slop (as well as securing it a massive fine from a p!ssed off judge tired of AI slop).
And there’s no way to stop it. It doesn’t matter how detailed your instructions are. It doesn’t matter how much effort you put into locking parts of the document down with automated input and output checks and re-dos when the LLM screws up. Every time the LLM touches the document, something will get corrupted. The only unknown is how detrimental the corruption will be.
As per Darlene’s post,
Microsoft Research tested 19 AI models across 310 professional documents. They gave each model a document editing task, then another, then another … for 20 interactions in total. Frontier models corrupted 25% of document content by the end.
25%! That’s a lot of corruption of good content. And enough to ensure you get AI slop every time!
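To put the 25% figure in per-edit terms, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that every interaction corrupts the same fraction of the still-intact content, independently of the others. The quoted summary does not say the damage is spread that evenly, so treat the numbers as an order-of-magnitude guide only. The function names are hypothetical.

```python
def per_edit_loss(total_loss: float, rounds: int) -> float:
    """Per-round loss p such that (1 - p) ** rounds == 1 - total_loss.
    Assumes a constant, independent loss rate per round (an illustration,
    not something the quoted study states)."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / rounds)


def intact_after(per_edit: float, rounds: int) -> float:
    """Fraction of content still intact after `rounds` edits at rate `per_edit`."""
    return (1.0 - per_edit) ** rounds


if __name__ == "__main__":
    rate = per_edit_loss(0.25, 20)
    print(f"implied loss per interaction: {rate:.2%}")
    for n in (1, 5, 10, 20):
        print(f"after {n:2d} interactions: {1 - intact_after(rate, n):.1%} corrupted")
```

Under that assumption the implied loss is roughly 1.4% per interaction: small enough to slip past a reviewer on any single pass, and large enough to compound into a quarter of the document over 20 passes.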

