You're not just using a tool — you're co-authoring the science.
This README is an absolute headache, filled with AI writing, terminology that doesn't exist or is being used improperly, and unsound ideas. For example, it focuses a lot on doing "ablation studies", by which it means removing random layers of an already-trained model, to find the source of the refusals(?), which is an absolute fool's errand because such behavior is trained into the model as a whole and would not be found in any particular layer. I can only assume somebody vibe-coded this and spent way too much time being told "You're absolutely right!" while bouncing the worst ideas back and forth.
Strategy            What it does
------------------  ------------------------------------
layer_removal       Zero out entire transformer layers
head_pruning        Zero out individual attention heads
ffn_ablation        Zero out feed-forward blocks
embedding_ablation  Zero out embedding dimension ranges
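Mechanically, every strategy in that table boils down to the same operation: zero whole parameter tensors selected by name. A minimal sketch of that operation, using a toy dict-of-arrays "model" (the parameter names here are invented for illustration, not this repo's or any library's API):

```python
import numpy as np

# Toy "model": parameter name -> weight array. A real tool would walk
# a transformer's named parameters; these names are made up.
params = {
    "layers.0.attn.w": np.ones((4, 4)),
    "layers.0.ffn.w": np.ones((4, 4)),
    "layers.1.ffn.w": np.ones((4, 4)),
}

def zero_matching(params, prefix):
    """Zero every parameter whose name starts with `prefix` -- the
    common core of layer_removal, ffn_ablation, and the rest."""
    hit = sorted(k for k in params if k.startswith(prefix))
    for k in hit:
        params[k][:] = 0.0
    return hit

zero_matching(params, "layers.0.")  # "layer_removal" of layer 0
```

Which is exactly why critics in this thread expect broken models: nothing here is targeted, it just wipes out everything under a name prefix.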
Went through the README but still have no idea how well this works, in terms of removing the censorship while minimally degrading the quality of responses. Well, to be honest, I can't tell if this works at all or is just an idea.
"Getting high on your own supply" is exactly what I'd expect from those immersed in this new AI stuff.
I don't know if this particular tool/approach is legit, but LLM ablation is definitely a thing: https://arxiv.org/abs/2512.13655
Doesn't look legit to me. You are talking about abliteration, which is real. But the OP linked tool is doing novel and very dumb ablation: zeroing out huge components of the network, or zeroing out isolated components in a way that indicates extreme ignorance of the basic math involved.
Compared to abliteration, none of this tool's ablation approaches make even half a whit of sense if you understand even the most basic aspects of, e.g., a Transformer LLM architecture, so my guess is this is BS.
Hmm, pliny is amazing - if you kept up with him on social media you’d maybe like him https://x.com/elder_plinius
I don't know. I scrolled through his recent Tweets and he's sharing things like this $900 snake oil device that "finds nearby microphones" and "sends out AI-generated cancellation signals" to make them unable to record your voice : https://x.com/aidaxbaradari/status/2028864606568067491
Try to think for a moment about how a device would "find nearby microphones" or how it would use an AI-generated signal to cancel out your voice at the microphone. This should be setting off BS alarms for anyone.
It seems the Twitter AI edgy poster guy is getting meta-trolled by another company selling fake AI devices
Ultrasound microphone jammers seem to be a real thing, so it's possible it does to some extent work.
The parent comment makes no reference to or comment on the author of the README.
It just says "the README sucks." Which, I'm inclined to agree, it does.
LLM-generated text has no place in prose -- it yields a negative investment balance between the author and aggregate readers.
Amazing as in his stuff actually works?
I just hear him promoting OBLITERATUS all day long and trying to get models to say naughty things
Yeah but i think the philosophy is to show how precarious the guardrails are
If this qualifies as "amazing" in 2026 then Karpathy and Gerganov must be halfway to godhood by now.
I dont think anyone is going to dispute this
I just don't think many people will be "amazed" by their output, as you claim.
I just said pliny was amazing, fwiw - i like that he's hacking on these and posts about it. I rushed to defend; i wish more people were taking old school anarchist cookbook approaches to these things
Smoke banana peel?
I had such a godawful headache from that. Also tried the peanut shells, equally awful. I was a dumb teenager.
gasoline and styrofoam was fun tho
It's not just a headache, it's bad
> For example, it focuses a lot on doing "ablation studies", by which it means removing random layers of an already-trained model, to find the source of the refusals(?), which is an absolute fool's errand because such behavior is trained into the model as a whole and would not be found in any particular layer.
That doesn't mean there couldn't be a "concept neuron" that is doing the vast majority of heavy lifting for content refusal, though.
Thats not what it means at all. It uses SVD[0] to map the subspace in which the refusal happens. Its all pretty standard stuff with some hype on top to make it an interesting read.
Its basically using a compression technique to figure out which logits are the relevant ones and then zeroing them.
[0] https://en.wikipedia.org/wiki/Singular_value_decomposition
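For readers who haven't seen the technique: the core of abliteration is a "refusal direction" estimated from activations and then projected out of the weights. A toy numpy illustration with synthetic data (a real implementation reads residual-stream activations from a transformer via hooks; nothing below is this repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Pretend "refusing" activations carry an extra hidden component.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
complied = rng.normal(size=(200, d))
refused = rng.normal(size=(200, d)) + 3.0 * true_dir

# 1. Estimate the refusal direction as a difference of means.
r = refused.mean(axis=0) - complied.mean(axis=0)
r /= np.linalg.norm(r)

# 2. Ablate it: left-multiply a weight matrix that writes into the
#    residual stream by (I - r r^T), so its output can no longer
#    carry any component along r.
W = rng.normal(size=(d, d))
W_ablated = (np.eye(d) - np.outer(r, r)) @ W

x = rng.normal(size=d)
assert abs(r @ (W_ablated @ x)) < 1e-9  # output orthogonal to r
```

The point of contrast with the repo's strategies: this removes a single rank-1 direction and leaves everything else untouched, rather than zeroing entire tensors.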
You are also not quite correct, IMO. See my comment at https://news.ycombinator.com/item?id=47283197.
What you are talking about is abliteration. What OBLITERATUS seems to be claiming to do is much more dumb, i.e. just zeroing out huge components (e.g. embedding dimension ranges, feed-forward blocks; https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...) of the network as an "Ablation Study" to attempt to determine the semantics of these components.
However, all these methods are marked as "Novel", i.e., maybe just BS made up by the author. IMO I don't see how they can work, based on how they are named; they are way too dumb and clunky. But proper abliteration like you mentioned can definitely work.
You got me there. I missed the wackier antics further down. Mea culpa.
So did I initially until I saw a few more things from others here.
Ironic to see this comment when Pliny, the author of this codebase, is one of the most sophisticated LLM jailbreakers/red-teamers today. So presumptive and arrogant!
> "ablation studies", by which it means removing random layers of an already-trained model, to find the source of the refusals(?)
This is not what an ablation study is. An ablation study removes and/or swaps out ("ablates") different components of an architecture (be it a layer or set of layers, all activation functions, backbone, some fixed processing step, or any other component or set of components) and/or in some cases other aspects of training (perhaps a unique / different loss function, perhaps a specialized pre-training or fine-tuning step, etc) in order to attempt to better understand which component(s) of some novel approach is/are actually responsible for any observed improvements. It is a very broad research term of art.
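To make the term concrete: the generic recipe is just "rebuild the system with one component disabled at a time and compare a metric against the full system." A schematic sketch, where `build_model` and `evaluate` are placeholder names, not any particular framework's API:

```python
def ablation_study(components, build_model, evaluate):
    """Generic ablation-study loop: disable one component at a time
    and record how much the evaluation metric drops."""
    baseline = evaluate(build_model(disabled=set()))
    deltas = {}
    for c in components:
        score = evaluate(build_model(disabled={c}))
        deltas[c] = baseline - score  # contribution attributed to c
    return deltas

# Toy usage: a "model" that scores 1 point per enabled component.
build = lambda disabled: {"aux_loss", "pretrain_step"} - disabled
score = lambda model: len(model)
print(ablation_study(["aux_loss", "pretrain_step"], build, score))
# {'aux_loss': 1, 'pretrain_step': 1}
```

The key part is that it compares systems built with and without a component, as a research method; zeroing weights inside one already-trained checkpoint is a different (and much cruder) operation.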
That being said, the "Ablation Strategies" [1] the repo uses, and a Ctrl+F for "ablation" in the README, do not fill me with confidence that the kind of ablation being done here really achieves what the author claims. All the "ablation" techniques are marked "Novel" in his table [2], i.e., they are unpublished / maybe not publicly or carefully tested, and could easily not work at all.
From later tables, I am not convinced I would want to use these ablations, as they ablate rather huge portions of the models, and so probably do result in massively broken models (as some commenters have noted elsewhere in this thread). EDIT: Also, in other cases [1], they ablate (zero out) architecture components in a way that just seems incredibly braindead if you have even a basic understanding of the linear algebra and dependencies between components of a transformer LLM. There is clearly nothing sound about this, in contrast to e.g. abliteration [3].
[1] https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-file#ablation-strategies
[2] https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...
EDIT: As another user mentions, "ablation" has an additional, narrower meaning in some refusal analyses, or when looking at making guardrails / changing response vectors and such. It is just a specific kind of ablation, and really should be called "abliteration", not "ablation" [3].
[3] https://huggingface.co/blog/mlabonne/abliteration, https://arxiv.org/abs/2512.13655.
Alternatively, it's intentional. It very effectively filters out people with your mindset. You can decide if that's a good thing or not.
Why would a tool that works need to dissuade skeptics from trying it?
Based on his twitter he may just like irony/meta posting a little too much like a lot of modern culture
I immediately read it as intentional, as a sort of attempt at ironic / nihilistic humour re: LLM-generation, given what the tool claims to do.
"Ablation studies" are a real thing in LLM development, but in this context it serves as a shibboleth by which members of the group of people who believe that models are "woke" can identify each other. In their discourse it serves a similar purpose to the phrase "gain of function" among COVID-19 cranks. It is borrowed from relevant technical jargon, but is used as a signal.
You don't know what you are talking about. Obviously refusal circuitry does not live in one layer, but the repo is built on a paper with sound foundations from an Anthropic scholar working with a DeepMind interpretability mentor: https://scholar.google.com/citations?view_op=view_citation&h...
Reviews of the tool on twitter indicate that it completely nerfs the models in the process. It won't refuse, but it generates absolutely stupid responses instead.
This is my experience with abliterated models.
I use Berkley Sterling from 2024 because I can trick it. No abliteration needed.
When you look at how monstrously large (and obviously not thought through at all, if you understand even the most minimal basics of the linear algebra and math of a transformer LLM) the components are that are ablated (weights set to zero) in his "Ablation Strategies" section, it is no surprise.
https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...
This is vibecoded garbage that the “author” probably didn't even test themselves since making this yesterday, so it's not surprising that it's broken.
Also, as I said in a top level comment, what this project wants to achieve has been done for a while and it's called Heretic: https://github.com/p-e-w/heretic
(Not vibecoded by a twitter influgrifter)
Hate to have to be the one to stick up for pliny here, but hes concerned about forcing frontier labs to focus more on model guardrails - he demonstrates results that are crazy all the time
https://x.com/elder_plinius
Thanks for this link, and mentioning this info some times in this overall thread.
It also seems the influgrifter has a lot of bots (or perhaps cultists) working this thread...
We will eventually arrive at a new equilibrium involving everyone except the most stupid and credulous applying a lot more skepticism to public claims than we did before.
And yeah, doing stuff like deleting layers or nulling out whole expert heads has a certain ice pick through the eye socket quality.
That said, some kind of automated model brain surgery will likely be viable one day.
Link?
It's interesting that people are writing tools that go inside the weights and do things. We're getting past the black box era of LLMs.
That may or may not be a good thing.
Whether or not the linked tool uses a good approach, manipulating models like you mention is already fairly well established, see: https://huggingface.co/blog/mlabonne/abliteration .
I believe that this is already done to several models. Some that I've come across are the JOSIEfied models from Gökdeniz Gülmez. I downloaded one or two and tried them on a local ollama setup. It does generate potentially dangerous output. Turning on thinking for the Qwen series shows how it arrives at its conclusions, and it's quite disturbing.
However, after a few rounds of conversation, it gets into loops and just repeats things over and over again. The main JOSIE models worked the best of all and were still useful even after abliteration.
I didn't use this tool, but I did try out abliterated versions of Gemma and yes, it lost about 100% of its ability to produce a useful response once I did it
the default heretic with only 100 samples isn't very good; you really need your own, larger dataset to do a proper abliteration. the best abliteration roughly matches a very careful decensor SFT
I guess it's kind of like a lobotomy tool.
I guess it proves you cannot unlobotomize a hole in the head.
Everyone says that abliteration destroys the model. That's the trope phrase everyone who doesn't know anything but wants to participate says. If someone says it to you, ignore them.
This is for local models right? I can't use it on, say my glm-5 subscription connected to opencode?
Correct, local models only.
Already censored for sharing on FB Messenger?
Don't use this 2 days old vibe coded bullshit please.
p-e-w's Heretic (https://news.ycombinator.com/item?id=45945587) is what you're looking for if you're looking for an automatic de-censoring solution.
Didn't make it past the first paragraph of AI slop in the README. Have some respect for your readers and put actual information in it, ideally human generated. At least the first paragraph! Otherwise you may as well name it IGNOREME.
Does anyone offer a live (paid) LLM chatbot / video generation / etc that is completely uncensored? Like not requiring doing any work except just paying for it?
Nous Hermes was built from the ground up to be uncensored. No abliteration required.
It's not a frontier model, but it will give you a feel for what it's like.
Grok was one of the closest, with expected results: bad PR from the obvious use cases that come with little censorship.
This is another instance of avant-garde "art".
Never stopped to ask if they should...