ProofOfThought: LLM-based reasoning using Z3 theorem proving

(github.com)

73 points | by barthelomew 2 hours ago

34 comments

nextos an hour ago
This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
- barthelomew an hour ago
  Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.
  Although that work is not public, you can play with the generally available product here!
  [1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
- CuriouslyC 44 minutes ago
  Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities improve, LLM-as-a-judge will artificially cap your performance; you need to graduate to either expert human judges or deterministic oracles.
  - jebarker 7 minutes ago
    Why does this have to be true? For example, if the judging LLM is different from the one being judged, then their biases could at least be different. Also, as their reasoning abilities improve, wouldn't LLM judges approach the abilities of human judges?
LASR an hour ago
This is an interesting approach.
My team has been prototyping something very similar, encoding business operations policies in Lean. We have some internal knowledge bases (Google Docs / wiki pages) that we first convert to Lean using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
Can't say it moved beyond the prototyping stage though, since the Lean conversion does require some engineers to look through it at least.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
- barthelomew an hour ago
  The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper: https://arxiv.org/abs/2505.20047
  If you ever feel like discussing more details, happy to chat!
- viraptor an hour ago
  Could you share an example of such a policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.
ivanbakel an hour ago
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.
Unless I’m wrong, this is mainly an API for getting an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is the ability to literally read the logical statement being evaluated, and to run the solver to see whether it holds.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
- barthelomew 25 minutes ago
  Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
  E.g., in https://arxiv.org/pdf/2505.20047 (Tab. 1) we compare performance on text-only vs. SMT-only reasoning. o3-mini does pretty well at mirroring its text reasoning in its SMT, compared to Gemini Flash 2.0.
  Illustrations of this can be seen in Figs. 14 and 15 on page 29.
  In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
  This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t. the policy, it is right more than 99% of the time.
  https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
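The facts / rules / goal structure ivanbakel describes above can be illustrated without a solver. This is a minimal sketch, assuming a toy propositional encoding: a goal follows from the premises exactly when "premises AND NOT goal" has no satisfying assignment. The brute-force truth-table check below is only a stand-in for what Z3 does at scale; the `entails` helper and the penguin knowledge base are hypothetical and not taken from the ProofOfThought repo.

```python
from itertools import product

def entails(variables, premises, goal):
    """Return True if every assignment satisfying all premises also satisfies goal.

    Same as checking that (premises AND NOT goal) is unsatisfiable, done here
    by enumerating all truth assignments (exponential, so toy-sized only).
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not goal(env):
            return False  # counterexample: premises hold but goal does not
    return True

# Hypothetical toy knowledge base: one fact and two rules.
variables = ["is_bird", "is_penguin", "can_fly"]
premises = [
    lambda e: e["is_penguin"],                               # fact
    lambda e: (not e["is_penguin"]) or e["is_bird"],         # penguin -> bird
    lambda e: (not e["is_penguin"]) or (not e["can_fly"]),   # penguin -> not can_fly
]

def goal_bird(e):
    return e["is_bird"]

def goal_fly(e):
    return e["can_fly"]

follows_bird = entails(variables, premises, goal_bird)
follows_fly = entails(variables, premises, goal_fly)

# An unsound extra rule (bird -> can_fly) makes the premises contradictory,
# and a contradictory premise set entails every goal vacuously.
unsound_rule = lambda e: (not e["is_bird"]) or e["can_fly"]
follows_fly_unsound = entails(variables, premises + [unsound_rule], goal_fly)

print(follows_bird, follows_fly, follows_fly_unsound)
```

The last check is the failure mode raised in the comment: once a generated rule contradicts the others, the solver will "prove" anything, so a consistency check on the premises alone is worth running before trusting a verdict.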
dehsge 20 minutes ago
LLMs and their output are bounded by Rice's theorem. This is not going to ensure correctness; it’s just going to validate that the model can produce an undecidable result.
sigmoid10 an hour ago
I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured JSON format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema-compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can write inference engines that adhere to arbitrary schemas (i.e. not limited to JSON behind the scenes) yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.
- barthelomew 41 minutes ago
  Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
  When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047), we are able to do this with open-source LLMs.
  - sigmoid10 31 minutes ago
    [flagged]
    - dang 25 minutes ago
      > What are you talking about?
      Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
      Your comment would be fine without that bit.
      - sigmoid10 22 minutes ago
        This is not meant as snide; I'm literally confused about whether I might have misunderstood the problem here, because the solution would be so obvious.
- atrus an hour ago
  I wouldn't find it amazing. There are so many new models, features, and ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed you by.
  - sigmoid10 an hour ago
    I would agree if you are a normal dev who doesn't work in the field. But even then reading the documentation once a year would have brought you insane benefits regarding this particular issue. And for ML researchers there is no excuse for stuff like that at this point.
- jssmith an hour ago
  I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM but validation is still up to the caller. Wondering if others see this too.
  - barthelomew 39 minutes ago
    Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!
  - sigmoid10 37 minutes ago
    Depends on how strictly you define your types. Are you using pydantic to pass the information to the API? There are a few pitfalls with this, because not everything is fully supported and it gets turned into JSON behind the scenes. But in principle, the autoregressive engine will simply not allow tokens that break the supplied schema.
    - striking 32 minutes ago
      Not sure if I've been using it wrong, but I've tried using the Zod-to-structured-output helper with GPT-5 and often gotten weird stuff like trailing commas that break a parse, or multiple JSON responses in the same response.
      Ultimately there are still going to be bugs. For this reason and several others you'll still need it wrapped in a retry.
      - sigmoid10 26 minutes ago
        Yeah that sounds 100% like a user or middleware issue. Don't bother with these wrappers, they are always outdated anyways. Learn how to use the API directly, it will save you a ton of headaches. And it's really not that hard.
        - striking 16 minutes ago
          No, we're using the OpenAI vendored version of zod-to-json-schema via https://github.com/transitive-bullshit/openai-zod-to-json-sc..., and applying it directly to the `json_schema` field of the OpenAI API. Maybe we have a subtle bug somewhere but I'd expect a 400 response if we were genuinely sending a malformed request.
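The retry wrapper striking mentions might look roughly like this. It is a minimal sketch, assuming a caller-supplied `generate` callable that returns raw model text (hypothetical, standing in for whatever API client is in use) and a caller-side check for required keys. No real OpenAI call is made, and `parse_with_retry` is an illustrative name, not a library function.

```python
import json

def parse_with_retry(generate, required_keys, max_attempts=3):
    """Call generate() until its output parses as a JSON object with required_keys.

    Returns (parsed_dict, attempts_used); raises RuntimeError if every
    attempt fails, chaining the last validation error as the cause.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # e.g. a trailing comma or two concatenated objects
            continue
        if not isinstance(data, dict):
            last_error = ValueError("expected a JSON object")
            continue
        missing = [key for key in required_keys if key not in data]
        if missing:
            last_error = ValueError(f"missing keys: {missing}")
            continue
        return data, attempt
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error

# Hypothetical stand-in for a model: the first reply has a trailing comma.
replies = iter(['{"answer": "yes",}', '{"answer": "yes"}'])
result, attempts = parse_with_retry(lambda: next(replies), ["answer"])
print(result, attempts)
```

Validating on the caller side costs little even when the provider enforces a schema, since it also catches middleware bugs of the kind discussed above.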
- IanCal 38 minutes ago
  I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not Z3's native structure) than writing code in something else. Decent models can write pretty good code and can fix a lot of their mistakes, plus you get testing etc. setups for free.
- retinaros an hour ago
  Yes, this can also improve said reasoning.
  - sigmoid10 an hour ago
    The secret the big companies don't want to tell you is that you can turn all their models into reasoning models that way. You even have full control over the reasoning process and can make it adhere to a specific format, e.g. the ones used in legal settings. I've built stuff like that using plain old gpt-4o and it was even better than the o series.
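The token-masking idea sigmoid10 describes earlier in this thread (the engine refusing tokens that would break the schema) can be sketched in a few lines. This is a toy illustration under stated assumptions: a hypothetical five-token vocabulary, made-up scores standing in for model logits, and a "schema" given as a finite set of allowed strings. Real engines compile a JSON schema or grammar into an automaton rather than enumerating outputs.

```python
def constrained_greedy_decode(score_fn, vocab, allowed_outputs, max_steps=10):
    """Greedy decoding restricted to tokens that keep the text a valid prefix.

    At each step, tokens that cannot lead to any string in allowed_outputs
    are masked out, and the highest-scoring remaining token is appended.
    Stops at the first complete allowed output (fine for this toy, where no
    allowed string is a prefix of another).
    """
    out = ""
    for _ in range(max_steps):
        if out in allowed_outputs:
            return out
        scores = score_fn(out)
        valid = [tok for tok in vocab
                 if any(s.startswith(out + tok) for s in allowed_outputs)]
        if not valid:
            raise ValueError(f"no valid continuation after {out!r}")
        out += max(valid, key=lambda tok: scores[tok])
    if out in allowed_outputs:
        return out
    raise ValueError("max_steps reached before a complete output")

# Hypothetical vocabulary and fixed scores; the "model" prefers the invalid token.
vocab = ['{"ok":', "true", "false", "}", "maybe"]
fixed_scores = {"maybe": 5.0, "true": 2.0, "false": 1.0, '{"ok":': 0.5, "}": 0.1}
allowed = {'{"ok":true}', '{"ok":false}'}

decoded = constrained_greedy_decode(lambda prefix: fixed_scores, vocab, allowed)
print(decoded)
```

The mask guarantees well-formed output, not correct content: the model here would have said "maybe" if allowed to, and the constraint only forces it to pick among schema-valid answers.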
nakamoto_damacy 41 minutes ago
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post-hoc, you're not "ensuring the correctness of your LLM's reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not, where the LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
- barthelomew 19 minutes ago
  Yep, this is a genuine problem, and it is what we term the autoformalization gap in our follow-up paper (https://arxiv.org/abs/2505.20047).
  Some LLMs are more consistent between text and SMT, while others are not (Tab. 1, Figs. 14 and 15).
  You can do uncertainty quantification with selective verification to reduce the "risk", as shown e.g. by the Area Under the Risk Coverage Curve in Tab. 4.
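For readers unfamiliar with the risk-coverage curve mentioned above, here is a minimal sketch of one common discrete definition: rank items by confidence, answer only the top k, and record the error rate at each coverage level. The data is made up and the function names are hypothetical; the paper's exact computation may differ.

```python
def risk_coverage_curve(confidences, correct):
    """Return [(coverage, risk)] when answering only the k most confident items.

    coverage = k / n, risk = error rate among those k items, for k = 1..n.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty lists of equal length")
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    curve, errors = [], 0
    for k, i in enumerate(order, start=1):
        if not correct[i]:
            errors += 1
        curve.append((k / len(order), errors / k))
    return curve

def aurc(confidences, correct):
    """Area under the risk-coverage curve as the mean risk over all coverage levels."""
    curve = risk_coverage_curve(confidences, correct)
    return sum(risk for _, risk in curve) / len(curve)

# Made-up example: four formalizations with a confidence score each,
# and whether the solver verdict turned out to be right.
confidences = [0.9, 0.8, 0.6, 0.4]
correct = [True, True, False, True]

curve = risk_coverage_curve(confidences, correct)
area = aurc(confidences, correct)
print(curve)
print(area)
```

A lower area means the confidence score is doing useful work: abstaining on the least confident items removes errors faster than it removes correct answers.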
measurablefunc an hour ago
This is proof of verifiable logic. Computers cannot think, so calling it proof of thought misrepresents what's actually happening.
- aSanchezStern 44 minutes ago
  I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.
  - measurablefunc 26 minutes ago
    If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent. If you don't fall for the marketing hype & study both the philosophical & mathematical literature on computation then it is obvious that computers (or any mechanical gadget for that matter) cannot qualify for any reasonable definition of "thinking" unless you agree that all functionally equivalent manifestations of arithmetic must be considered "thinking", including cascading dominoes that correspond to the arithmetic operations in an LLM.
    - bobxmax 5 minutes ago
      And your definition of thinking is?
zwnow 40 minutes ago
Reasoning? LLMs cannot reason; why is it always assumed that they do? They mimic reasoning.
- elcomet 36 minutes ago
  How can you know?