By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
This submission is an experimentally extended (more mature) version of the authors’ NeSy 2025 conference paper. It retains the core idea of “metatuning” - an iterative prompt-refinement process guided by “symbolic feedback” (sketched below) - but now supplements it with new experimental sections on Chain-of-Thought (CoT) and self-reflection prompting (Section 5.2) and an evaluation in a new, multimodal reasoning domain (CLEVRER, Section 5.3). The expanded literature review and clearer structure are welcome improvements.
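(For context, the metatuning loop, as I understand it from the paper, amounts to something like the following. This is my own paraphrase in hypothetical Python - `llm`, `symbolic_check`, and `refine_prompt` are placeholder names, not the authors' actual implementation.)

```python
# Hypothetical sketch of the "metatuning" loop described in the paper:
# iteratively refine a prompt using feedback from a symbolic checker.
# All names are placeholders supplied by the caller, not the authors' code.

def metatune(task, prompt, llm, symbolic_check, refine_prompt, max_iters=5):
    """Iterative prompt refinement guided by symbolic feedback."""
    answer = None
    for _ in range(max_iters):
        answer = llm(prompt, task)               # query the LLM with the current prompt
        feedback = symbolic_check(task, answer)  # e.g., a solver/verifier's error trace
        if feedback.is_correct:
            break                                # no symbolic errors left to repair
        prompt = refine_prompt(prompt, feedback) # fold the symbolic feedback into the prompt
    return prompt, answer
```

Note that nothing in such a loop guarantees monotonic improvement, which is exactly why the convergence claims discussed below need care.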
The new experiments represent a commendable effort and demonstrate scientific honesty; the results are transparently presented, including several negative findings. These additions meaningfully expand the scope and rigor of the paper. However, they also reveal important limitations of the proposed method. When combined with stronger prompting strategies (CoT, reflection), metatuning often yields negligible or even adverse effects (Tables 3 & 4). The CLEVRER evaluation similarly shows no measurable benefit (Table 5), indicating that the technique does not generalize beyond text-based tasks. I believe these findings are valuable and should be presented as such - i.e., as demonstrating the boundary conditions and practical limits of metatuning - rather than as evidence of a “new neurosymbolic paradigm” (the framing that currently spans the introduction).
Also, the main conceptual issues from the conference version remain. The introductory “model-grounded symbolic” framing is still largely the same and remains purely philosophical: no analysis of internal representations or “grounding” is performed to support the claim of symbolic-vector alignment (e.g., via representation probing in the style of mechanistic interpretability; a minimal illustration is sketched below). It remains a nice narrative rather than an empirically supported finding. The method itself remains heuristic, with no formal convergence analysis or cost-effectiveness evaluation (contrary to previous claims). The paper would also benefit from being situated more firmly within related work on in-context learning, meta-prompt optimization, and automated prompt tuning, which now form a rapidly developing comparative landscape.
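(To make the probing suggestion concrete: even a simple linear probe over hidden states would give the “grounding” claim some empirical footing. A minimal sketch follows, assuming a HuggingFace-style model; the model, layer choice, inputs, and labels are all illustrative, not prescriptive.)

```python
# Minimal sketch of a linear probe on hidden states, one standard way to test
# whether a symbolic property is linearly decodable from the model's vectors.
# Model, layer, inputs, and labels here are purely illustrative.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

texts = ["All birds can fly.", "Some birds cannot fly."]  # toy inputs
labels = [0, 1]  # hypothetical symbolic property (e.g., quantifier type)

feats = []
for t in texts:
    with torch.no_grad():
        out = model(**tok(t, return_tensors="pt"))
    feats.append(out.hidden_states[-1][0, -1].numpy())  # last-layer, last-token state

probe = LogisticRegression().fit(feats, labels)
# Above-chance accuracy on held-out data would be (weak but real) evidence
# that the symbolic property is represented; the paper currently offers nothing
# of this kind to back its alignment claim.
```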
Despite these weaknesses, the paper is clearly written, substantially extended, and thematically suited to NeSy AI. It provides a coherent exploration of how symbolic reasoning concepts might be reframed in LLM-based systems, and its (commendable) inclusion of negative results adds credibility and offers real value to other researchers testing similar ideas in the community.
---Overall---
The updated paper is now in an interesting state: it has a set of rigorous empirical findings (Sec. 5) that feel somewhat disconnected from its strong, speculative conceptual claims (Secs. 1-2). Specifically, the authors' own experiments now show the core method ("metatuning") to be a rather weak technique with limited applicability: its benefits are largely erased by standard CoT prompting, and it does not generalize well beyond text. This is a valid scientific finding, but the paper's "neurosymbolic" narrative in the introduction does not prepare the reader for this conclusion.
I believe this is fixable, though, if the paper aligns its claims with its evidence.
Particularly, I’d suggest:
- Reframe the conceptual claims - significantly tone down the "neurosymbolic" and "model-grounding" narrative in Sections 1 and 2. This framing is unsupported by the evidence and now seems overextended, given the method's limited empirical success. Present the paper instead as an empirical study of a specific prompt-refinement technique ("metatuning") and its limitations.
- Address ungrounded technical claims - soften or remove the unproven claims pointed out in the previous review(s):
  - The claim that the "cycle ensures that the model systematically reduces reasoning errors" (Page 7) is contradicted by the authors' own admission of "no convergence guarantees" (Page 8) and by the negative results in Table 4. This should be rephrased, I believe.
  - The "more (data-)efficient alternative" claim remains mostly unsubstantiated and should be removed or substantiated, I think.
- Strengthen the literature comparison - the addition of CoT baselines is good, but briefly situating "metatuning" within the broader literature on prompt optimization would strengthen the paper further.
Thus, the paper's main weakness is no longer a lack of experiments but a narrative that has not been fully updated to match the new, commendably honest results. If the authors re-frame the introduction more modestly to align with the empirical findings, I believe the paper will be ready for acceptance in NAI.