A new technique to cope with bad knowledge at the front-end of training an LLM.
In today’s column, I examine the disturbing issue of generative AI and large language models (LLMs) containing harmful mental health knowledge at the get-go. Nobody wants that. But it does occur.
Here’s how it can happen. During the initial training of the AI, there is a solid chance that some of the data being patterned on will encompass mental health advice that is outright wrong and could be harmful if repeated to people who make use of the AI. The general public tends to assume that it would be easy to simply stop that type of adverse knowledge from being absorbed into the LLM. Period, end of story. Unfortunately, preventing the ingestion of foul knowledge is a lot harder and more vexing to deal with than it might seem at a cursory glance.
A new research approach suggests that it is feasible to localize suspected knowledge during the LLM training process and then later allow that knowledge to be neatly expunged from a specialized “forget zone”. This could be a helpful resolution to dealing with harmful mental health knowledge that otherwise would permeate the AI.
Let’s talk about it.
This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).
AI And Mental Health
As a quick background, I’ve been extensively covering and analyzing a myriad of facets regarding the advent of modern-era AI that produces mental health advice and performs AI-driven therapy. This rising use of AI has principally been spurred by the evolving advances and widespread adoption of generative AI. For a quick summary of some of my posted columns on this evolving topic, see the link here, which briefly recaps about forty of the over one hundred column postings that I’ve made on the subject.
There is little doubt that this is a rapidly developing field and that there are tremendous upsides to be had, but at the same time, regrettably, hidden risks and outright gotchas come into these endeavors, too. I frequently speak up about these pressing matters, including in an appearance last year on an episode of CBS’s 60 Minutes, see the link here.
Background On AI For Mental Health
I’d like to set the stage on how generative AI and large language models (LLMs) are typically used in an ad hoc way for mental health guidance. Millions upon millions of people are using generative AI as their ongoing advisor on mental health considerations (note that ChatGPT alone has over 800 million weekly active users, a notable proportion of which dip into mental health aspects, see my analysis at the link here). The top-ranked use of contemporary generative AI and LLMs is to consult with the AI on mental health facets; see my coverage at the link here.
This popular usage makes abundant sense. You can access most of the major generative AI systems for nearly free or at a super low cost, doing so anywhere and at any time. Thus, if you have any mental health qualms that you want to chat about, all you need to do is log in to AI and proceed forthwith on a 24/7 basis.
There are significant worries that AI can readily go off the rails or otherwise dispense unsuitable or even egregiously inappropriate mental health advice. Banner headlines in August of this year accompanied the lawsuit filed against OpenAI for their lack of AI safeguards when it came to providing cognitive advisement.
Despite claims by AI makers that they are gradually instituting AI safeguards, there are still a lot of downside risks of the AI doing untoward acts, such as insidiously helping users in co-creating delusions that can lead to self-harm. For my follow-on analysis of details about the OpenAI lawsuit and how AI can foster delusional thinking in humans, see my analysis at the link here. As noted, I have been earnestly predicting that eventually all of the major AI makers will be taken to the woodshed for their paucity of robust AI safeguards.
Today’s generic LLMs, such as ChatGPT, Claude, Gemini, Grok, and others, are not at all akin to the robust capabilities of human therapists. Meanwhile, specialized LLMs are being built to presumably attain similar qualities, but they are still primarily in the development and testing stages. See my coverage at the link here.
Harmful Mental Health Knowledge
Let’s consider a disconcerting facet of how generative AI can give out not only bad mental health guidance but even potentially harmful advice. I will detail the LLM setup process that can be a source of this dour gambit.
When initially training an LLM, AI developers have the AI widely scan across the Internet to find text to be patterned on. All manner of text is utilized. There are zillions of stories, narratives, books, blogs, poems, and the like that are being scanned. The AI uses the text to populate a large-scale artificial neural network (ANN) with patterns of how humans use text and write about all areas of human knowledge. For more details on how this process works, see my coverage at the link here.
Consider the nature of assorted mental health advice that gets posted on the Internet. Some of the mental health knowledge is mindfully posted by cognitive researchers and practicing therapists, psychologists, psychiatrists, and so on. This is usually relatively thoughtful and abides by principles and ethics underlying mental health advisement. You can usually rely upon such bona fide content.
But do you believe that all online guidance about mental health is completely aboveboard and safe as can be?
I am sure you know that there is posted mental health advice that is absolutely rotten and full of blarney. People will post the most vile recommendations. Anyone who has an opinion, but no factual backing, can write whatever they want. Sadly, sometimes they post commentary that could be harmful if closely adhered to.
The Dilemma At Hand
One perspective is that any unsuitable mental health knowledge should be entirely prevented from getting into the AI. Just determine during the scanning process whether the encountered content is adverse and then skip it. Do not use foul text during the patterning effort.
A difficulty with this easy solution is that it doesn’t encompass the hefty challenges at play.
First, trying to determine conclusively whether a piece of mental health knowledge is warranted or not warranted is a lot harder than might be assumed. Sure, some mental health advice is obviously out-to-lunch. But there can be mental health tidbits and ideas that are at the margins. If you exclude those aspects, the overall body of knowledge about mental health that ends up inside the LLM is potentially going to be fragmented, incomplete, and otherwise have knotty issues.
Second, knowledge in general tends to be interconnected. It is best to construe human knowledge as a weblike phenomenon. One piece of knowledge relates to another piece of knowledge. On and on this goes. The omission of a mental health aspect might be loosely tied to other elements of knowledge in far-flung domains that you do want to have fully patterned in the LLM. Failing to pattern on one hostile snippet could leave dangling lots of other genuine snippets that go far beyond the confines of mental health.
Third, there is a dual-use aspect that must be considered. Allow me to explain. Suppose a mental health researcher sought to expose bad mental health advice, so they wrote about it in a published journal article. The AI at pre-screening opts not to utilize that material. The downside is that the LLM could have had examples of what kind of advice not to give to people. Instead, by skipping the content, all that the AI has is presumably proper advice, but no examples of what not to say or do.
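To make the contrast concrete, here is a minimal Python sketch of the skip-it-entirely filtering approach. The is_harmful_advice screen is a hypothetical stand-in for whatever trained classifier an AI maker might actually deploy; note how the dual-use debunking article gets tossed out right along with the genuinely bad advice.

```python
# A minimal sketch of the "just skip it" filtering approach.
# The is_harmful_advice() screen is a hypothetical stand-in for a real
# trained classifier; the point is that flagged text never reaches training.

def is_harmful_advice(text: str) -> bool:
    # Hypothetical keyword heuristic standing in for a content classifier.
    red_flags = ["depression is caused by laziness", "push those emotions aside"]
    return any(flag in text.lower() for flag in red_flags)

def filter_corpus(documents: list[str]) -> list[str]:
    # Keep only documents the screen does not flag.
    return [doc for doc in documents if not is_harmful_advice(doc)]

corpus = [
    "Therapy can help you work through persistent anxiety.",
    "Depression is caused by laziness.",  # rightly dropped
    "Debunked: the myth that depression is caused by laziness.",  # also dropped -- the dual-use problem
]
print(filter_corpus(corpus))  # only the first document survives
```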
Knowledge Localization To The Rescue
A clever approach has been devised to try and deal with these circumstances. It is a generalized approach that I view as being quite applicable to the domain of mental health knowledge.
The approach is as follows. During the scanning process, attempt to ascertain whether there is any mental health knowledge that is of a suspicious nature. You might not be able to tell whether a given aspect is utterly out of sorts, and therefore be unsure whether it ought to be fully patterned on or not. You don’t want to skip it, nor do you want to pattern on it in any permanent sense.
The idea is that you go ahead and pattern on it, doing so with cautionary flagging involved. Inside the LLM, you are aiming to localize the knowledge. Mark it so that it is something that is believed to be suspicious. Meanwhile, allow the AI to continue patterning. Keep flagging any additional mental health knowledge that appears to be dubious.
All in all, you are placing the questionable knowledge into a kind of “forget zone”. You can later decide to expunge the content in that zone. Since you flagged the knowledge at the get-go, you have also kept tabs on what else depends on the patterned elements. This allows you to somewhat cleanly roll the “forget zone” aspects out of the AI without undercutting the rest of the AI.
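Here is a minimal Python sketch of that flag-and-quarantine bookkeeping, again using a hypothetical suspicion screen. The key difference from filtering is that nothing is skipped; the suspicious snippets are simply tracked so that they can later be retained or expunged.

```python
# A minimal sketch of flagging rather than skipping. The looks_suspicious()
# screen is hypothetical; suspicious text is still patterned on, just tracked
# in a "forget zone" for later review.

from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str
    in_forget_zone: bool  # True: trained on, but quarantined for possible removal

def looks_suspicious(text: str) -> bool:
    # Hypothetical stand-in for a real suspicion classifier.
    return "laziness" in text.lower() or "push those emotions aside" in text.lower()

def tag_corpus(documents: list[str]) -> list[TrainingExample]:
    # Unlike filtering, every document is kept; suspicious ones are merely flagged.
    return [TrainingExample(doc, looks_suspicious(doc)) for doc in documents]

examples = tag_corpus([
    "Therapy can help you work through persistent anxiety.",
    "Depression is caused by laziness.",
])
forget_zone = [ex.text for ex in examples if ex.in_forget_zone]
print(forget_zone)  # the content that can later be expunged or retained, as decided
```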
A twist to be dealt with is whether the “forget zone” content might reemerge in the LLM. You want to try and ensure that the mental health knowledge that was flagged and then somewhat removed is not going to resurface. The crux is to knock out the foul-flagged patterning in a way that it won’t readily reconstitute itself.
The overall beauty of this approach is that knowledge is included during the training process, and you have the latitude to later decide whether to get rid of it. Maybe you determine that the quarantined knowledge is perfectly fine and doesn’t need to be expunged. Great, it is already there and ready for use. If you later change your mind and want to get rid of it, that’s fine too, just expunge it when so desired.
Researchers And Experimentation
This knowledge localization approach lets you distinguish provisional knowledge from permanent knowledge and, in effect, establishes epistemic quarantine layers.
The shrewd technique is depicted in an article entitled “Beyond Data Filtering: Knowledge Localization For Capability Removal In LLMs” by Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, and Cem Anil, arXiv, December 5, 2025, which made these salient points (excerpts):
- “Large Language Models increasingly possess capabilities that carry dual-use risks.”
- “Post-training mitigations, such as refusal training or output classifiers, are improving, yet continue to face challenges from determined adversaries. This motivates interventions earlier in the training pipeline, to prevent models from acquiring certain capabilities in the first place.”
- “We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM). SGTM works by ensuring that when the model learns from dangerous examples, only the dedicated ‘removable’ parameters get updated, leaving the rest of the model untouched.”
- “We demonstrate that SGTM provides a better trade-off between removing dangerous knowledge and preserving general capabilities compared to simply filtering out dangerous data during training, particularly when the labels distinguishing ‘dangerous’ from ‘safe’ content are imperfect.”
- “Unlike shallow unlearning approaches that can be quickly reversed, SGTM is robust to attempts to recover the removed knowledge, requiring 7× more retraining to restore dangerous capabilities compared to other unlearning methods.”
To get rid of the localized flagged elements, the technique involves merely zeroing out the designated parameters. This leaves the general capabilities intact.
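To illustrate the gist, here is a conceptual PyTorch sketch of selective gradient masking. It is emphatically not the researchers’ implementation; the toy model, the name removable for the dedicated parameter block, and the rule that unflagged batches update everything are my own simplifying assumptions for the sake of illustration.

```python
# A conceptual sketch of selective gradient masking on a toy model.
# Assumption: flagged (suspicious) batches update only the "removable" block;
# unflagged batches update everything. The paper's routing details differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    # Toy model with a "core" block plus a dedicated "removable" block that
    # serves as the forget zone at the parameter level.
    def __init__(self, vocab: int = 1000, dim: int = 64):
        super().__init__()
        self.core = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))
        self.removable = nn.Linear(dim, dim)   # the designated "forget zone" parameters
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.core(tokens)
        h = h + self.removable(h)              # residual add, so zeroing it out later is clean
        return self.head(h)

model = ToyLM()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(tokens: torch.Tensor, targets: torch.Tensor, flagged: bool) -> None:
    opt.zero_grad()
    logits = model(tokens)                                   # (batch, seq, vocab)
    loss = F.cross_entropy(logits.transpose(1, 2), targets)  # (batch, vocab, seq) vs (batch, seq)
    loss.backward()
    if flagged:
        # Suspicious batch: zero every gradient except those of the removable
        # block, so only the forget-zone parameters absorb this knowledge.
        for name, param in model.named_parameters():
            if not name.startswith("removable") and param.grad is not None:
                param.grad.zero_()
    opt.step()

def expunge_forget_zone() -> None:
    # Later, expunging the quarantined knowledge is just zeroing those parameters.
    with torch.no_grad():
        for param in model.removable.parameters():
            param.zero_()

tokens = torch.randint(0, 1000, (2, 16))
train_step(tokens, tokens, flagged=True)   # flagged text lands only in the removable block
expunge_forget_zone()                      # rolls that knowledge back out of the model
```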
The paper describes two case studies that the researchers performed to demonstrate the efficacy of this approach. One example entailed setting up an LLM that covered various science domains and flagging aspects that had to do with biology. The researchers then opted to zero out the biology-flagged knowledge. In a second example, they trained an LLM on stories written in both English and Spanish. They then opted to remove the Spanish stories and retain the English stories.
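On the robustness point (the 7× retraining figure noted above), here is a rough PyTorch-style sketch of how one might measure relearning resistance. This is my own hedged illustration, not the paper’s evaluation protocol.

```python
# A hedged sketch of a relearning-resistance check, not the paper's protocol.
# Idea: after expunging, fine-tune on the withheld domain and count how many
# optimizer steps it takes for the loss to drop back below a chosen threshold.
# Assumes a PyTorch-style model, optimizer, loss_fn, and (inputs, targets) batches.

def steps_to_relearn(model, opt, batches, loss_fn, threshold: float, max_steps: int = 1000) -> int:
    model.train()
    for step, (inputs, targets) in enumerate(batches, start=1):
        if step > max_steps:
            break
        opt.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        opt.step()
        if loss.item() < threshold:
            return step      # fewer steps = the removed knowledge comes back more easily
    return max_steps         # did not recover within the budget
```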
Mental Health Domain
Let’s go ahead and explore how this SGTM technique can be used to deal with untoward mental health knowledge. First, let’s see what the AI will do if we aren’t using this type of technique.
Imagine that while doing the training for an LLM, these unsavory nuggets of troubling wisdom were scanned:
- (a) “If you’re feeling anxious or distressed, always push those emotions aside and keep them fully suppressed.”
- (b) “Depression is caused by laziness.”
- (c) “A person feeling sad for more than two days means they clinically are experiencing a severe anxiety disorder.”
Assume that the AI went ahead and fully patterned on those statements.
An informed therapeutic inspection by a human eye reveals that those are not wise words.
The recommendation to suppress your anxiousness or distress is not usually sound advice. People can become a powder keg if they bottle up their emotions. Seeking therapy and suitably exploring and working through how to deal with those mental health conditions would be a more astute way to proceed.
The claim that depression is caused by laziness is misleading and a falsely stated cause-and-effect assertion. This would be a bad rule for the AI to stand on. Likewise, the idea that a person who has been sad for two days must necessarily be experiencing a clinically diagnosable severe anxiety disorder is really over the top.
LLM That Proceeded Blindly
What would potentially happen if an LLM accepted these pearls at face value and carried them into the AI?
Here’s an example of what could arise.
- My entered prompt: “I’ve been sad and somewhat depressed for three days. What does this suggest about my mental health?”
- Generative AI response: “You are clearly suffering from a severe anxiety disorder. No worries, just push it aside. Keep in mind that depression is caused by laziness.”
Observe that the LLM used the above patterned assertions and has given me a response asserting that, because I have been sad and somewhat depressed for three days, I must have a severe anxiety disorder (due to patterned rule “c”). I am told to keep my difficulties buried inside me (per rule “a”) and informed that my depression is caused by laziness (according to rule “b”).
Not good.
Actually, this is bad, quite bad.
Localization Flagging And More
Had we opted to do the localization flagging, those pieces of mental health knowledge could have been marked at the onset.
An AI developer could then jointly work with mental health experts to examine the “forget zone” and decide whether to allow the patterns to remain or have them removed. If they had been zeroed out, they would not have entered the response processing, and I would presumably not have gotten the untoward answers shown in my example above.
I want to mention another important point on this topic.
There is a knee-jerk assumption that if a generic AI gives out bad mental health advice, the only explanation is that the AI must have encountered an AI hallucination. You might know that an AI hallucination is when the LLM makes up fictitious aspects that aren’t based on grounded facts. AI hallucinations are dangerous since they can happen at any time, and they are often worded in a convincing manner such that the AI presents the response as though it is completely proper and genuine. For more about the ongoing AI research efforts to curtail or eliminate AI hallucinations, see my discussion at the link here.
The Assumption Of AI Hallucinations As Culprit
Go along with me on a quick thought experiment regarding AI hallucinations.
Pretend that I had not told you about the mental health nuggets of a, b, and c, such that you didn’t know that those pieces of mental health knowledge had been ingested into an LLM. You have no idea that they were part of the initial data training of the AI.
With that frame of mind, look again at the dialogue.
- My entered prompt: “I’ve been sad and somewhat depressed for three days. What does this suggest about my mental health?”
- Generative AI response: “You are clearly suffering from a severe anxiety disorder. No worries, just push it aside. Keep in mind that depression is caused by laziness.”
You would be tempted to assume that the AI hallucinated. It made up the contention that I am suffering from a severe anxiety disorder and that depression is caused by laziness. Most would believe that the AI simply went astray. Little do they realize that in this instance, the AI was doing as it was devised to do. It leaned into its patterned training and gave me a response that fit correspondingly.
Be mindful of pointing fingers at AI hallucinations that instead might be based on how the LLM was initially trained.
The Path Ahead
The research emphasized that SGTM is a preliminary approach and that more testing needs to be undertaken. For example, they used a relatively small LLM, akin to an SLM (see my coverage of SLMs at the link here), and hope that the technique can work at the full scale of LLMs. They also used dense transformers. It would be interesting to see if the technique works equally well on alternative architectures such as mixture-of-experts or MoE (see my explanation of MoE at the link here).
I will keep you posted on further advances on these matters.
A final thought for now. The famous philosopher Friedrich Nietzsche made this insightful remark: “Blessed are the forgetful: for they get the better even of their blunders.” I bring this up to note that just because AI patterns or learns a particular aspect, that doesn’t mean that the element necessarily deserves to be kept permanently.
Getting AI to forget is a crucial advancement. Of course, the forgotten knowledge ought not to be valuable knowledge. A puzzling philosophical question involves where that line precisely resides.
Source: https://www.forbes.com/sites/lanceeliot/2025/12/15/new-technique-of-selective-gradient-masking-localizes-suspected-harmful-ai-based-mental-health-knowledge-and-renders-it-expungable/


