By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Average
Detailed Comments:
The paper presents "Empirical Analysis of Chain-of-Thought and Solver-Augmented Large Language Models for Deductive Reasoning", which includes a comparative analysis against a multi-pass or solver-augmented baseline. It offers some good analytical insights; however, the following issues need to be addressed.
- The related work does not cover the wider context of the topic. The cited works are from 2023 or earlier, yet there have since been notable advances in benchmarking and reasoning approaches; for example, work on agentic systems for similar tasks. This section should be revised to broaden the review, identify the gaps, and explain why this work remains relevant.
- Why do you need a synthetic dataset? How do you generate and validate it? More details are needed.
- The controlled factors need to be justified. On what basis are the reasoning depth levels categorised? Is the categorisation arbitrary, drawn from prior work, or based on experiments? For Noise 1, the paper states that it is based on prior work, but no reference is given.
- The choice of evaluation parameters needs to be justified: why were these particular parameters used?
- Regarding the claim that CoT-augmented models frequently hallucinate in longer reasoning chains (introducing irrelevant or non-existent premises): was an average chain length observed in the experiments at which this occurs?
- Table 1 shows 100% accuracy for all entries on PrOntoQA (no room for improvement), yet PrOntoQA is still included in the Avg column (last column). This is an analytical flaw: PrOntoQA contributes nothing to the differences between methods, but it is used in the average calculation.
- Overfitting is raised in the discussion, but only briefly. More detail is needed to clarify how it was handled.
- The choice of a single-pass approach, and its impact on the generalisability of the reported accuracy, needs to be better justified and addressed in the conclusions.