Empirical Analysis of Chain-of-Thought and Solver-Augmented Large Language Models for Deductive Reasoning

Tracking #: 897-1908

Flag : Review Received

Authors:

YA WANG

Raja Havish Seggoju

Adrian Paschke

Responsible editor:

Guest Editors Trustworthy Regulated

Submission Type:

Article in Special Issue (note in cover letter)

Full PDF Version:

nai-paper-897.pdf

Cover Letter:

Dear Editors,

We are pleased to submit our manuscript entitled “Empirical Analysis of Chain-of-Thought and Solver-Augmented Large Language Models for Deductive Reasoning” for consideration in the Special Issue on Trustworthy Neurosymbolic AI in Regulated Domains: Advances, Challenges, and Applications of the Neurosymbolic AI Journal.

This article systematically compares chain-of-thought (CoT) and solver-augmented approaches for deductive reasoning with large language models, evaluating their performance on established benchmarks and controlled synthetic datasets. The work directly aligns with the theme of the special issue, contributing to the trustworthiness, robustness, and verifiability of neurosymbolic AI systems in regulated domains.

We confirm that this manuscript is original, has not been published previously, and is not under consideration for publication elsewhere. All authors have approved the submission.

We kindly ask you to consider this manuscript for inclusion in the special issue. Thank you very much for your time and consideration. We look forward to hearing from you.

Sincerely,
Ya Wang
Fraunhofer Institute FOKUS & Freie Universität Berlin

Approve Decision:

Approved

Revised Version:

Empirical Analysis of Chain-of-Thought and Solver-Augmented Large Language Models for Deductive Reasoning

Tags:

Reviewed

Decision:
Major Revision

Solicited Reviews:

Review #1 submitted on 28/Jan/2026

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average

Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good

Detailed Comments:

The following points should be addressed:
1- The claim of benchmark saturation is not convincingly supported without analysis of possible training data overlap or contamination.
2- The CoT prompting setup is minimal, while the solver approach uses dataset-specific prompt engineering and examples, which weakens the fairness of the comparison.
3- The authors should add a table comparing evaluation protocols (single-pass vs. multi-pass, refinement, self-consistency) to quantify how much the chosen protocol suppresses solver-augmented performance relative to prior work.
4- The paper does not report confidence intervals or statistical significance for benchmark results, even though the performance differences are small. The authors should report both.
5- Execution accuracy is presented as an “upper bound,” but the semantic correctness of the generated FOL is not independently validated.
6- The synthetic dataset generation relies on ProverQA filtering, which may bias results toward solver-friendly structures.
7- The analysis of hallucination in CoT outputs is qualitative and lacks systematic measurement or annotation.
8- The conclusion generalises to “safety-critical applications” without a corresponding evaluation or a safety-critical case study.
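
To illustrate comment 4, one possible way to attach an interval to each reported accuracy is a Wilson score interval over the per-benchmark correct/total counts. The sketch below is an assumption about how this could be done, not the authors' procedure; the function name `wilson_interval` and the example counts are hypothetical, and a paired test on per-item outcomes would still be needed to establish significance between two methods.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    correct: number of correctly answered items (hypothetical input)
    total:   number of evaluated items
    z:       normal quantile; 1.96 corresponds to a 95% interval
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")

    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    low, high = wilson_interval(correct=90, total=100)
    print(f"accuracy 0.90, 95% interval [{low:.3f}, {high:.3f}]")
```

If the intervals of two methods on the same benchmark overlap substantially, the reported difference should not be interpreted as a clear win for either method.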

Review #2 submitted on 30/Jan/2026

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average

Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes

Detailed Comments:

The authors present “Empirical Analysis of Chain-of-Thought and Solver-Augmented Large Language Models for Deductive Reasoning”, which includes a comparative analysis against a multi-pass or solver-augmented baseline. The paper offers some good analytical insights; however, the following issues need to be addressed.
- The related work does not cover the wider context of the topic. The cited works are from 2023 or earlier, although there have since been notable advances in benchmarking and reasoning, for example work on agentic systems for similar tasks. This section should be revised to broaden the review, identify the gaps, and explain why this work remains relevant.
- Why is a synthetic dataset needed? How is it generated and validated? More details are needed.
- The controlled factors need justification. On what basis are the reasoning depth levels categorised: randomly, from prior work, or from experiments? For Noise 1, the text states that it is based on prior work, but no reference is given.
- The choice of evaluation parameters should be justified: explain why those parameters are used.
- Is any average chain length reported from the experiments to support the claim that CoT-augmented models frequently hallucinate in longer reasoning chains by introducing irrelevant or non-existent premises?
- Table 1 shows 100% for all PrOntoQA results (no room for improvement), yet PrOntoQA is included in the Avg column (last column). This is an analytical flaw: PrOntoQA contributes nothing to any difference between methods but is still used in the average calculation.
- Overfitting is raised in the discussion, but only briefly. More detail is needed on how it was handled.
- The choice of a single-pass approach, and its impact on the generalisability of the reported accuracy, needs to be better justified and reflected in the conclusions.
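
The point about the Avg column can be made concrete with a small sketch that reports the macro-average both with and without benchmarks on which every method is at the ceiling. Everything here is an assumption for illustration: the helper names, the method and benchmark labels other than PrOntoQA, and all numbers are hypothetical and are not taken from Table 1.

```python
def saturated_benchmarks(table, ceiling=100.0):
    """Return the benchmarks on which every method reaches the ceiling.

    table: {method_name: {benchmark_name: accuracy_in_percent}}
    """
    shared = set.intersection(*(set(row) for row in table.values()))
    return {b for b in shared if all(row[b] >= ceiling for row in table.values())}


def averages(table, skip=frozenset()):
    """Macro-average per method, ignoring any benchmark listed in skip."""
    result = {}
    for method, row in table.items():
        kept = [acc for bench, acc in row.items() if bench not in skip]
        if not kept:
            raise ValueError(f"no benchmarks left for {method}")
        result[method] = sum(kept) / len(kept)
    return result


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    example_table = {
        "method_a": {"PrOntoQA": 100.0, "bench_x": 80.0, "bench_y": 60.0},
        "method_b": {"PrOntoQA": 100.0, "bench_x": 70.0, "bench_y": 50.0},
    }
    skip = saturated_benchmarks(example_table)
    print("saturated:", sorted(skip))
    print("average over all benchmarks:", averages(example_table))
    print("average without saturated:", averages(example_table, skip=skip))
```

Including a saturated benchmark pulls every method's average toward the ceiling and shrinks the apparent gap between methods, so reporting both averages side by side would make the comparison easier to interpret.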
