By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
This study proposes the new ELVIS benchmark, which embodies the famous Gestalt principles in order to test and assess the related capacities of computer vision algorithms and machine learning models. The paper is well written, relatively self-contained, and largely free from omissions; I have, however, identified a few missing details (see my remarks below).
I like the conclusion the authors arrive at in the "Concept Level Analysis" section, namely:
> none of the models systematically leverage concept-specific clues
as this is what I expected from all those architectures. This clearly opens the door for neurosymbolic systems, which should in principle perform better on this benchmark.
All in all, I find this contribution valuable, despite its limitations listed below, among other reasons because it may help increase awareness of the limitations of mainstream DL architectures and LLMs, and foster interest in neurosymbolic systems.
Detailed remarks
================
There are multiple statements in the paper which may suggest that ELVIS is meant only for neurosymbolic systems, like this one on p. 4:
> The Gestalt Vision Benchmark (ELVIS) evaluates the ability of neuro-symbolic models to detect and reason over grouping-based structures, moving beyond object-level perception toward more holistic and human-aligned reasoning.
or this one on p. 5:
> The benchmark thus provides a challenging yet principled environment for testing neuro-symbolic models, encouraging them to capture the same perceptual strategies that humans naturally use when organizing visual input into meaningful structures.
In my understanding, there are no obstacles to applying any kind of image understanding/scene analysis method to ELVIS, whether symbolic, neural, or neurosymbolic. I suggest removing these statements, as they suggest an excessive narrowing of the perspective.
The way in which the tasks are posed in ELVIS strongly resembles the Bongard problems; I find it essential that the authors cite Bongard's work, which is highly relevant here (even if, to my knowledge, Bongard did not explicitly refer to Gestalt principles):
https://en.wikipedia.org/wiki/Bongard_problem
It's a bit disappointing that, with all the rich conceptual apparatus proposed by the authors in order to generate Gestalt-related tasks, the tasks themselves ultimately boil down to simple binary classification. Binary classification (and classification in general) seems excessively limiting here. Recall that classification, as an ML task, adopts a closed-world perspective: each example faced by the learner is assumed to belong to one and only one decision class, and no 'other' class is assumed to exist (at least in the basic classification setting). Last but not least, models trained on classification tasks tend to be overly specialized and provide little insight into more general scene interpretation.
I would find it more interesting if the authors considered other ways of posing their tasks -- for instance, completing a missing part of the pattern, as in self-supervised training.
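To make the suggestion concrete, here is a minimal sketch of the kind of formulation I have in mind; it is purely illustrative (all names are hypothetical, and this is not the authors' code): instead of a binary label, the model reconstructs a masked region of the rendered pattern, in the spirit of masked-prediction training.

```python
# Hypothetical sketch, not the authors' code: a pattern-completion objective
# in which the model must reconstruct a hidden region of the Gestalt pattern.
import torch
import torch.nn.functional as F

def completion_loss(model, image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """image: (B, C, H, W) rendered pattern; mask: (B, 1, H, W) binary map,
    1 where the pattern has been hidden from the model."""
    corrupted = image * (1 - mask)      # hide the masked region of the pattern
    reconstruction = model(corrupted)   # the model predicts the full image
    # Score only the hidden part: completing it correctly requires grasping
    # the grouping rule (e.g. good continuation), not just local appearance.
    return F.mse_loss(reconstruction * mask, image * mask)
```

Such a formulation would avoid the closed-world assumption criticized above and would make the models' grasp of the grouping rule more directly observable.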
The authors say:
> Models were trained and evaluated independently for each task.
Does that mean that the massive LLM-based models were trained too? Or fine-tuned on a given task? That does not sound realistic (even the fine-tuning), given that there are thousands of tasks in the authors' proposed suite. My guess is that only the ViT was trained here from scratch (or perhaps fine-tuned), while for the remaining models the authors used some form of prompting.
Relatedly, I suppose that the remaining models returned textual answers to the presented patterns.
The details of how the LLM-based models were interacted with (what the prompt was, how the model's response was interpreted, etc.) should be provided in the paper.
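For clarity, this is the level of detail I would expect; the sketch below is a hypothetical illustration only (the prompt wording and the `query_vlm` helper are placeholders for whatever API and phrasing the authors actually used), showing how an image is turned into a prompt and how the textual reply is mapped back to a binary decision.

```python
# Hypothetical sketch only: query_vlm stands for whatever VLM interface the
# authors actually used, and the prompt wording is invented for illustration.
PROMPT = (
    "Do the objects in this image form groups according to the rule shown "
    "in the example images? Answer with 'yes' or 'no' only."
)

def classify_with_vlm(query_vlm, image) -> int:
    reply = query_vlm(image=image, prompt=PROMPT).strip().lower()
    # Naive parsing; the paper should state how ambiguous or verbose replies
    # were handled, and whether any retries were allowed.
    if reply.startswith("yes"):
        return 1
    if reply.startswith("no"):
        return 0
    return -1  # ambiguous or refused answer
```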
I find the experiment concerning image resolution rather irrelevant -- it is quite obvious that, as long as the models can 'decipher' the shapes and colors of individual objects from the image, they are left with the core of the task. Improving the image resolution cannot help a model come up with, e.g., the concept of good continuation.
Minor remarks:
================
> Effect of Training Number
This section title sounds odd; consider replacing it with 'Effect of the size of the training set'.
irrelated -> unrelated