By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
This study proposes the new ELVIS benchmark, which embodies the famous Gestalt principles in order to test and assess the related capacities of computer vision algorithms and machine learning models. The paper is well written, relatively self-contained, and largely free from omissions; I have, however, identified a few missing details (see my remarks below).
I like the conclusion the authors arrive at in the "Concept Level Analysis" section, namely:
> none of the models systematically leverage concept-specific clues
as this is what I expected from all those architectures. This clearly opens the door for neurosymbolic systems, which should in principle perform better on this benchmark.
All in all, I find this contribution valuable, despite its limitations listed below, among other reasons because it may help increase awareness of the limitations of mainstream DL architectures and LLMs, and foster interest in neurosymbolic systems.
Detailed remarks
================
There are multiple statements in the paper which may suggest that ELVIS is meant only for neurosymbolic systems, like this one on p. 4:
> The Gestalt Vision Benchmark (ELVIS) evaluates the ability of neuro-symbolic models to detect and reason over grouping-based structures, moving beyond object-level perception toward more holistic and human-aligned reasoning.
or this one on p. 5:
> The benchmark thus provides a challenging yet principled environment for testing neuro-symbolic models, encouraging them to capture the same perceptual strategies that humans naturally use when organizing visual input into meaningful structures.
In my understanding, there are no obstacles to applying any kind of image understanding/scene analysis method to ELVIS, whether symbolic, neural, or neurosymbolic. I suggest removing these statements, as they suggest an excessive narrowing of the perspective.
The way in which the tasks are posed in ELVIS strongly resembles the Bongard problems; I find it essential that the authors cite Bongard's work, which is highly relevant here (even if, to my knowledge, Bongard did not explicitly refer to Gestalt principles):
https://en.wikipedia.org/wiki/Bongard_problem
It's a bit disappointing that, with all the rich conceptual apparatus proposed by the authors in order to generate Gestalt-related tasks, the tasks themselves ultimately boil down to simple binary classification. Binary classification (and classification in general) seems excessively limiting here. Recall that classification, as an ML task, adopts a closed-world perspective: each example faced by the learner is assumed to belong to one and only one decision class, and no 'other' class is assumed to exist (at least in the basic classification setting). Last but not least, models trained on classification tasks tend to be overly specialized and provide little insight into more general scene interpretation.
I would find it more interesting if the authors considered other ways of posing their tasks -- for instance, completing a missing part of the pattern, as in self-supervised training.
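To make the suggestion concrete, here is a minimal sketch of the kind of formulation I have in mind; it is purely illustrative (all names are hypothetical, and this is not the authors' code): instead of a binary label, the model reconstructs a masked region of the rendered pattern, in the spirit of masked-prediction training.

```python
# Hypothetical sketch, not the authors' code: a pattern-completion objective
# in which the model must reconstruct a hidden region of the Gestalt pattern.
import torch
import torch.nn.functional as F

def completion_loss(model, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (B, C, H, W) rendered pattern; mask: (B, 1, H, W) binary map,
    1 where the pattern has been hidden from the model."""
    corrupted = image * (1 - mask)      # hide the masked region of the pattern
    reconstruction = model(corrupted)   # the model predicts the full image
    # Score only the hidden part: completing it correctly requires grasping
    # the grouping rule (e.g. good continuation), not just local appearance.
    return F.mse_loss(reconstruction * mask, image * mask)
```

Such a formulation would avoid the closed-world assumption criticized above and would make the models' grasp of the grouping rule more directly observable.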
The authors say:
> Models were trained and evaluated independently for each task.
Does that mean that the massive LLM-based models were trained too? Or fine-tuned on a given task? That does not sound realistic (even the fine-tuning), given that there are thousands of tasks in the authors' proposed suite. My guess is that only the ViT was trained here from scratch (or perhaps fine-tuned), while for the remaining models the authors used some form of prompting.
Relatedly, I suppose that the remaining models returned textual answers to the presented patterns.
The details of how the LLM-based models were interacted with (what the prompt was, how the model's response was interpreted, etc.) should be provided in the paper.
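For clarity, this is the level of detail I would expect; the sketch below is a hypothetical illustration only (the prompt wording and the `query_vlm` helper are placeholders for whatever API and phrasing the authors actually used), showing how an image is turned into a prompt and how the textual reply is mapped back to a binary decision.

```python
# Hypothetical sketch only: query_vlm stands for whatever VLM interface the
# authors actually used, and the prompt wording is invented for illustration.
PROMPT = (
    "Do the objects in this image form groups according to the rule shown "
    "in the example images? Answer with 'yes' or 'no' only."
)

def classify_with_vlm(query_vlm, image) -> int:
    reply = query_vlm(image=image, prompt=PROMPT).strip().lower()
    # Naive parsing; the paper should state how ambiguous or verbose replies
    # were handled, and whether any retries were allowed.
    if reply.startswith("yes"):
        return 1
    if reply.startswith("no"):
        return 0
    return -1  # ambiguous or refused answer
```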
I find the experiment concerning image resolution rather irrelevant -- it is quite obvious that, as long as the models can 'decipher' the shapes and colors of individual objects from the image, they are left with the core of the task. Improving the image resolution cannot help a model come up with, e.g., the concept of good continuation.
Minor remarks:
================
> Effect of Training Number
This section title sounds odd; consider replacing it with 'Effect of the size of the training set'.
irrelated -> unrelated