By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Average
Detailed Comments:
The paper presents an extensive survey of works on visual reasoning with scene graphs and common-sense knowledge, classifying them with respect to the architecture used, the tasks, the knowledge graphs, loose vs. tight coupling, and the evaluation metrics. The paper is relevant, since this is certainly an important gap in the survey literature to fill.
Moreover, I find the tight-coupling and loose-coupling classification useful.
- Language: The manuscript's language is certainly satisfactory (no narrative mistakes or typos, in general), but stylistically quite dry. Many sections devote a single monotone sentence per citation, stating what the work does and moving on to the next (e.g., Section 3.1). This could easily be fixed.
- Technical style: I believe the paper's style could be improved by introducing each problem technically, or at least half-formally, under definitions or in boxes. I consider this a must, as it would add substance: what is the task, what is the "Input", what is the "Output"? Moreover, major network architectures such as RNNs and GNNs all lack a citation to the paper that introduced them. (Ideally, I would also suggest a figure or an input-output schema for each of them, but this is optional.) Likewise, try to define "knowledge graph" instead of giving only a verbal example. When you say "For instance, a KG can provide information that 'a bird is likely to be found in a tree'", the word "likely" is not natural for a knowledge graph; it suggests a statistical prior instead, for which you have a separate section ("in general" would serve better). Also, if possible, start the introduction with an example; it would really help to hook and keep the reader.
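To make concrete what I mean by a half-formal definition box, here is a minimal sketch (my own notation, purely illustrative; the authors should adapt it):

    \begin{definition}[Knowledge graph]
    A knowledge graph is a triple $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$, where
    $\mathcal{E}$ is a set of entities, $\mathcal{R}$ a set of relations, and
    $\mathcal{T} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$ a set of facts
    $(h, r, t)$, e.g., $(\mathit{bird}, \mathit{locatedIn}, \mathit{tree})$.
    \end{definition}

    \begin{definition}[Visual question answering]
    \textbf{Input:} an image $I$ and a natural-language question $q$.
    \textbf{Output:} an answer $\hat{a} = \arg\max_{a \in \mathcal{A}} \, p(a \mid I, q)$ over an answer set $\mathcal{A}$.
    \end{definition}

Even one such box per task would give the reader a precise anchor before the per-citation discussion starts.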
- Missing major literature: The survey completely disregards two important directions in the literature:
1) Causality-based approaches. I think causality needs its own part under Section 2.2, or next to the statistical priors, as a tool for common-sense reasoning or knowledge (e.g., Liu et al. 2022, "Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering"; Liu et al., "Show, Deconfound and Tell: Image Captioning with Causal Inference"; Zhou and Yang 2021, "Relation Network and Causal Reasoning for Image Captioning"). There are others for other tasks (I do not expect the survey to be fully exhaustive, of course). These approaches are also inherently NeSy, and they could in addition make the challenges section more interesting (see the backdoor-adjustment sketch after this list).
2) Hyperbolic embedding approaches (which take either a KG or a taxonomy into account), relevant to common-sense reasoning, e.g., Xiong et al. 2022, "Hyperbolic Embedding Inference for Structured Multi-Label Prediction", and, relevant to their Section 2.3.2 on hierarchical semantic segmentation, Ghadimi Atigh et al. 2022, "Hyperbolic Image Segmentation". There is actually a great survey by Mettes et al. 2022, "Hyperbolic Deep Learning in Computer Vision: A Survey" (see the distance formula sketched after this list).
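To clarify the suggestion in 1): the cited causal works typically model dataset or common-sense bias as a confounder $c$ and replace the observational $p(a \mid I, q)$ with an interventional estimate via backdoor adjustment (a standard formulation, sketched here as an illustration, not quoted from those papers):

    p\bigl(a \mid \mathrm{do}(I, q)\bigr) \;=\; \sum_{c} p(a \mid I, q, c)\, p(c).

A half-formal statement of this kind is exactly what would fit a causality part next to the statistical priors in Section 2.2.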
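Similarly for 2), the appeal of hyperbolic embeddings for hierarchies can be conveyed in one formula: in the Poincaré ball, the distance

    d(x, y) \;=\; \operatorname{arcosh}\!\left(1 + 2\,\frac{\lVert x - y \rVert^{2}}{\bigl(1 - \lVert x \rVert^{2}\bigr)\bigl(1 - \lVert y \rVert^{2}\bigr)}\right)

makes the available volume grow exponentially with the radius, so tree-like taxonomies embed with low distortion. One sentence of this kind would motivate the connection to the hierarchical semantic segmentation of Section 2.3.2.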
- ML architecture classification reads as exhaustive: I am not sure I would present the deep-learning architecture classification as exhaustive (also in the figures), because new architectures may appear, and this would make the survey more obsolete than it should be as time passes. A subsection "Other" could acknowledge this fact. I leave it to the authors' judgement.
Minor issues:
- It is not clear whether the performance evaluation, for instance in Table 4, was carried out by the authors themselves or transferred from the original papers. This should be clarified.
- Many occurrences of "top K" -> "top-$K$".
- MNM -> MMN
- Section 1.2, lines 46 to 51, reads redundant: the phrase "deep learning, common sense knowledge and NeSy integration for scene representation and visual reasoning" appears twice.
- I would expect Figure 2 to read from left to right. (But I guess the authors want us to compare the bottom left to the bottom right.) Still something to reconsider (optional).