For the Special Issue on NeSy 2025:
We submit for your consideration an extended version of our paper "Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations," which was accepted at NeSy 2025.
First and foremost, we wish to acknowledge the NeSy reviewers' careful reading and critiques of the original conference submission. Their comments provided the impetus to address, during the period between the NeSy 2025 second submission deadline and the conference in Santa Cruz, what was to our mind the most significant shortcoming of the original submission: the lack of an implementation of a reasoner based on the theoretical framework, as acknowledged in the original paper's Limitations section.
For this extended version, we have added the following.
1. Full reasoner implementation (Section 6): As stated above, we have added a comprehensive proof-of-concept implementation section that was not present in the conference version. This includes a description of the tableau reasoning system for ACrQ with LLM integration, an example demonstrating the command-line interface to the reasoner, and a discussion of implementation design features, including the caching mechanism and tree connectivity (a minimal caching sketch follows this list). In addition, working code for both the bilateral factuality evaluation capability and the tableau reasoner has been made available in GitHub repositories and as PyPI packages.
2. Improved discussion of the verbalization function (Sections 2 and 3): We have significantly expanded our treatment of the verbalization function \delta, which is central to our approach. This includes a description of three implementation approaches (direct syntactic, template-based, and LLM-generated; a template-based sketch follows this list), a discussion of the philosophical implications of interpreting logical symbols as natural language terms, and a connection to prior work. While verbalization quality affects accuracy, it does not affect the formal properties: soundness and completeness are preserved regardless, because they depend on logical structure rather than on the interpretation of content. Ambiguity introduced by verbalization is handled gracefully by the paraconsistent framework, with inconsistencies becoming gluts and incompleteness becoming gaps.
3. Clear demarcation of logical reasoning versus LLM reasoning (Sections 4 and 6): The tableau reasoner handles all logical reasoning (quantification, conjunction, disjunction), while LLMs evaluate only atomic formulas; a toy illustration of this division of labor follows this list. This separation is fundamental: LLMs serve as knowledge sources within formal reasoning, not as reasoning engines themselves. The tableau reasoner handles proof construction, witness generation, and multi-hop inference based on the axioms provided to it. We introduce a new Theory Manager user interface, which demonstrates this division clearly, showing how complex statements are processed through tableau expansion, with LLM evaluation occurring only at the atomic level.
4. Deeper discussion of the trustworthiness of LLM valuations (Section 3): Our use of LLMs to evaluate atomic formulas draws on the externalist tradition in epistemology and philosophy of language, which locates meaning and truth in community linguistic practices rather than internal mental states. Following semantic externalism and the social epistemology of testimony, we treat LLMs as aggregators of linguistic patterns that encode collective factuality judgments—without requiring them to possess beliefs, understanding, or intentionality. The bilateral evaluation function $\zeta_c$ operationalizes how linguistic communities verify and refute claims, while the caching mechanism ensures consistency within reasoning sessions. This approach provides an explicit operationalization of what has always been implicit in formal systems: interpretation functions require some form of judgment about atomic facts, traditionally provided by humans who construct knowledge bases and interpret the results of inference.
5. A revised treatment of uncertainty management (Sections 3 and 4): The choice of paraconsistent logic (AC) specifically addresses both LLM limitations and the practical reality that knowledge sources often contradict themselves. When LLMs produce contradictory evaluations, the bilateral evaluation captures verification and refutation separately, distinguishing "I know this is true" from "I cannot prove this false" and thereby mirroring human epistemic distinctions. Inconsistencies are represented as gluts that do not cause logical explosion (a short sketch of the resulting generalized truth values follows this list). We discuss how the caching mechanism $\zeta_c$ we employ ensures consistency within reasoning sessions, while acknowledging that different sessions might yield different valuations.
6. Evidence of practicality and efficiency (Sections 5 and 7): API costs scale with the number of atomic formula evaluations (2k calls per atomic formula under k-sample majority voting), while caching ensures that each formula is evaluated exactly once per session; a back-of-the-envelope cost bound follows this list. The implementation demonstrates effectiveness through standard tableau optimizations, with cache utility particularly high in description logic reasoning, where atomic formulas are frequently reused across branches. Section 5's evaluation shows that the bilateral approach achieves better accuracy than unilateral evaluation at approximately 50% coverage. This reduced coverage is a feature, not a bug: when the LLM cannot definitively verify or refute a claim, the system correctly abstains rather than guessing. The epistemic transparency provided by the generalized truth values yields valuable information for trustworthiness assessment that is lost in classical binary valuations or probabilistic approaches. In other words, a system that explicitly signals "I don't know" or "I have contradictory evidence" enables more informed decision-making than one that forces binary choices or provides confidence scores without distinguishing types of uncertainty.
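
To make the session-level caching referenced in item 1 concrete, the following is a minimal sketch in Python; the names (BilateralValue, CachedBilateralEvaluator) are ours for illustration only and do not reflect the API of the released packages.

    # Hypothetical names for illustration; not the released package API.
    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass(frozen=True)
    class BilateralValue:
        verified: bool   # could the LLM verify the verbalized claim?
        refuted: bool    # could the LLM refute the verbalized claim?

    class CachedBilateralEvaluator:
        """Sends each verbalized atomic formula to the LLM at most once per session."""
        def __init__(self, evaluate: Callable[[str], BilateralValue]):
            self._evaluate = evaluate                    # e.g., a k-sample LLM judgment
            self._cache: Dict[str, BilateralValue] = {}

        def __call__(self, verbalized_atom: str) -> BilateralValue:
            if verbalized_atom not in self._cache:
                self._cache[verbalized_atom] = self._evaluate(verbalized_atom)
            return self._cache[verbalized_atom]

Repeated occurrences of the same atom across tableau branches then hit the cache rather than the API.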
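For item 2, here is a sketch of the template-based approach to the verbalization function \delta; the predicates and templates below are illustrative assumptions rather than examples taken from the paper.

    # Illustrative templates only; the predicate vocabulary is an assumption.
    TEMPLATES = {
        "CapitalOf": "{0} is the capital of {1}",
        "AuthorOf":  "{0} is the author of {1}",
    }

    def delta(predicate: str, *args: str) -> str:
        """Map an atomic formula P(a1, ..., an) to a natural-language sentence."""
        template = TEMPLATES.get(predicate)
        if template is None:
            # Fall back to a direct syntactic rendering.
            return f"{predicate}({', '.join(args)}) holds"
        return template.format(*args)

    # delta("CapitalOf", "Paris", "France") -> "Paris is the capital of France"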
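For item 3, a deliberately simplified toy (it omits negation, quantifiers, and the tableau machinery of ACrQ) illustrates the division of labor: connectives are decomposed symbolically, and the LLM-backed evaluator is consulted only when an atomic formula is reached.

    # Toy illustration of the separation in item 3; not the ACrQ tableau calculus.
    from typing import Callable, Union

    Atom = str                          # e.g., "Paris is the capital of France"
    Formula = Union[Atom, tuple]        # ("and", f, g) or ("or", f, g) or an Atom

    def holds(formula: Formula, llm_verifies: Callable[[str], bool]) -> bool:
        if isinstance(formula, str):                    # atomic case: ask the LLM
            return llm_verifies(formula)
        op, left, right = formula                       # compound case: pure logic
        if op == "and":
            return holds(left, llm_verifies) and holds(right, llm_verifies)
        if op == "or":
            return holds(left, llm_verifies) or holds(right, llm_verifies)
        raise ValueError(f"unsupported connective: {op}")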
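For item 5, a small sketch of how a bilateral (verification, refutation) pair can be read as one of four generalized truth values; the enum names are our own labels.

    # Sketch of the bilateral-to-generalized-truth-value mapping; names are ours.
    from enum import Enum

    class TruthValue(Enum):
        TRUE = "verified only"
        FALSE = "refuted only"
        GLUT = "both verified and refuted"    # contradictory evidence, no explosion
        GAP = "neither verified nor refuted"  # the system abstains

    def generalize(verified: bool, refuted: bool) -> TruthValue:
        if verified and refuted:
            return TruthValue.GLUT
        if verified:
            return TruthValue.TRUE
        if refuted:
            return TruthValue.FALSE
        return TruthValue.GAP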
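Finally, for item 6, a back-of-the-envelope bound under the stated assumptions (2k calls per distinct atomic formula, one evaluation per formula per session); the numbers in the comments are illustrative, not measurements from the paper.

    # Illustrative cost bound; the parameters are assumptions, not reported results.
    def max_api_calls(distinct_atoms: int, k: int) -> int:
        # k verification samples + k refutation samples per distinct atom,
        # each evaluated at most once per session thanks to caching.
        return 2 * k * distinct_atoms

    # Example: 40 distinct atoms with k = 3 -> at most 240 calls per session,
    # however often those atoms recur across tableau branches.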
We believe these extensions provide a significant new set of results and insights beyond those in the conference version.
We look forward to your feedback,
Bradley P. Allen (on behalf of the authors)