By Lia Morra
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
As other reviewers have also stated, despite not presenting new NeSy techniques or evaluations of existing ones, the paper has merit in providing a benchmark on which LLMs and VLMs struggle, as well as in showing which principles pose the greatest challenges. The benchmark is publicly available and synthetic, and thus easy to use and extend, and it investigates principles complementary to existing ones, as shown in the revised section “Comparison with Existing Datasets”.
I appreciated the additional clarifications in the revised version on task composition, the training/validation split, the training of transformer-based models, and the computational requirements.
Differences between the extended version and the conference version are also made clear in the rebuttal. I still think the technical contribution beyond the conference version is somewhat limited to new experiments. In any case, I suggest explicitly stating the differences in the introduction, especially since the reported results differ between the conference paper and the extended version: the differences are due to evolutions in the benchmark, but could be perceived as inconsistencies by readers.
However, there are still a few issues in the revised submission.
It seems to me that it is not sufficient for the model to recognize the Gestalt principle; it must also recognize the underlying rule, potentially conflating Gestalt principles with other forms of spatial reasoning. For example, in Figure 17, continuity is supposedly tested through intersecting splines, which are visible in both positive and negative examples; positive examples contain objects of only one shape, whereas negative examples contain objects of multiple shapes. Visually, however, both positive and negative examples appear as continuous splines: to solve the task, the model must recognize whether the objects share the same shape, the overall arrangement being irrelevant to the task in this example.
The fact that the VLMs are given the Gestalt principles as part of the prompt should be highlighted further, as it potentially favors the VLM over the ViT, which is trained without prior knowledge. It is also interesting to note that providing a verbal description of the Gestalt principle is insufficient to guide the VLM toward the correct solution. I wonder what would happen if the prompt simply included positive and negative examples.