By Ivan Donadello
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Weak
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Weak
Detailed Comments:
The paper presents a NeSy method for the problem of slice discovery (SD) in Computer Vision (CV). SD aims at mining the input data (images in CV) for semantically meaningful groups of data on which a CV predictive model performs poorly. The proposed solution is composed of several modules: image generation, model classification, scene graph generation, labelling of positive and negative scene graph examples, mining of rules that describe these positive and negative scene graph examples with Inductive Logic Programming (ILP) methods, and image generation from these rules for data augmentation and model mending, thus creating a sort of loop.
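For concreteness, this is how I reconstructed the loop while reading. The following Python sketch is purely illustrative; every component name in it is a placeholder of mine, not the authors' API:

    from typing import Callable, Tuple

    # My reading of the closed loop, as a sketch. Each module is passed in as a
    # callable; none of these names come from the paper.
    def slice_discovery_loop(
        generate: Callable[[dict], Tuple[list, list]],   # spec -> (images, ground truth)
        predict: Callable[[list], list],                 # images -> model predictions
        to_scene_graphs: Callable[[list], list],         # images -> scene graphs
        label: Callable[[list, list, list], Tuple[list, list]],  # -> (pos, neg) examples
        mine_rules: Callable[[list, list], list],        # noise-robust ILP step
        rules_to_spec: Callable[[list], dict],           # rules -> generator specification
        mend: Callable[[list, list], None],              # fine-tune the model on new images
        initial_spec: dict,
        iterations: int = 1,
    ) -> list:
        spec, rules = initial_spec, []
        for _ in range(iterations):
            images, truth = generate(spec)               # image generation
            preds = predict(images)                      # model classification / detection
            graphs = to_scene_graphs(images)             # scene graph generation
            pos, neg = label(graphs, preds, truth)       # pos/neg scene graph examples
            rules = mine_rules(pos, neg)                 # ILP rule mining
            spec = rules_to_spec(rules)                  # rules -> image generator spec
            new_images, new_truth = generate(spec)       # data augmentation
            mend(new_images, new_truth)                  # model mending
        return rules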
I find the problem very interesting, and I really like the technique of mining knowledge (with ILP systems that are robust to noise) from pos/neg examples to perform data augmentation. I also appreciate the use of a taxonomy for rare slice generation, which gives better control over visually similar objects that nevertheless belong to different subclasses, as stated at the beginning of Section 5.
Despite the potential of the work, some important concerns make me vote for a major revision. My concerns are listed below.
-------------------------------------------
Problem Definition
-------------------------------------------
- I imagine that in [16] the definition of a slice contains the sentence “the model performs poorly”, but this is ambiguous, and a criterion should be defined/adopted. This can be seen in Fig. 9 and Section 6.2.1, where no reason is given why those particular classes of objects are rare slices. Which criterion was chosen? The 5 worst-performing classes? Classes with a recall lower than a threshold (an illustrative sketch of such a criterion follows this list)? I think that this part should be made clearer with a better definition/criterion for rare slices.
- Regarding Fig. 9 again, I would have expected lower performance for the rare slices. I would not call a class with 80% recall a rare slice.
- In hierarchies 1 and 2 of the VT taxonomy there are no rare slices “as expected”. But why did you expect this?
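To illustrate the threshold option mentioned above, a rare-slice criterion could be as simple as the following sketch; the function name and the per-class recall numbers are made up by me and are not from the paper:

    # Illustrative only: "a class is a rare slice if its recall is below a threshold".
    def rare_slices(per_class_recall: dict, threshold: float = 0.5) -> list:
        """Return the classes whose recall falls below the threshold."""
        return [cls for cls, recall in per_class_recall.items() if recall < threshold]

    # Example with invented numbers (not results from the paper):
    print(rare_slices({"dirtbike": 0.35, "pickup truck": 0.82, "articulated bus": 0.47}))
    # -> ['dirtbike', 'articulated bus']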
-------------------------------------------
Paper Contextualization
-------------------------------------------
- The related work focuses on slice discovery methods and ILP. While the former is the focus of the paper, the latter is just background that (in my opinion) should be reduced and moved to the background section.
- In the related work about slice discovery methods, a comparison with other works showing how the proposal addresses previously open problems would help the reader better contextualize the paper.
- The paper would be improved if contextualized with respect to the kinds of approaches it uses, that is, Neurosymbolic AI and the mining of discriminative knowledge from pos/neg examples. Regarding Neurosymbolic AI, this approach is quite different from approaches that embed logic in neural networks or embed some differentiable function in logic systems. A discussion of what kind of NeSy integration this paper proposes would be interesting. To this end, Kautz’s taxonomy could be helpful, see Section 2 of the paper at https://arxiv.org/pdf/2105.05330. Regarding the mining of discriminative knowledge from pos/neg examples, a recent paper does something similar for characterizing pos/neg examples of temporal traces, see the paper titled “Making Sense of Temporal Event Data: A Framework for Comparing Techniques for the Discovery of Discriminative Temporal Patterns” (Di Francescomarino et al., CAiSE 2024). It would be interesting to position the present paper with respect to this line of research.
-------------------------------------------
Results
-------------------------------------------
- The results are measured only in terms of recall (if I understand correctly, as it is not specified in the confusion matrices), but no justification is given for why only recall has been chosen. I would also like to see the precision and F1 results (the standard definitions I have in mind are recalled after this list), since after model mending a higher recall could correspond to a lower precision.
- I would discuss more the impact of the rule head penalty, as, for some classes, certain values lead to wrong rules or no rules at all. The exception ratio, instead, does not seem to impact the discovery. Please elaborate on this.
- The paper shows the confusion matrix only for the recall of VT hierarchy 4. Other results (precision, recall, F1 for all the hierarchies VT 3, VT 4 and PP 1, before and after model mending) would make the paper more self-contained if included as appendices.
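For reference, these are the standard definitions I have in mind when asking for precision and F1 in addition to recall:

    \[
    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    \]

A model mended to boost recall (more true positives at the cost of more false positives) can lose precision, which is exactly why reporting all three metrics matters.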
-------------------------------------------
Limitations
-------------------------------------------
- I find the method highly tailored to the Super-Clever image generator and to the scene graph generation task. There is no discussion of how the extracted rules could be used with other synthetic-image generation methods to produce images of a different domain. In addition, it seems to me that the method is applicable only to images from which a scene graph can be extracted. Therefore, it can be hard to extend the method to, for example, medical images such as radiographs, which cannot always be translated into a scene graph. If this is the case, the example of chest X-rays in the introduction can be misleading and should be changed.
- In addition, a section showing the limitations of the approach would help the reader understand what the method does not address (or addresses with difficulty) and could foster further research.
-------------------------------------------
Presentation
-------------------------------------------
The structure of the paper is good, but sometimes I feel lost without a proper running example. There are some examples, but they are not always connected to one another. I strongly suggest using a running example.
Other concerns about the presentation:
- Section 3.2: What is the expressivity of the language of B, h and E? Fully propositional? First-order? Please specify it for all three methods.
- Page 6: two different symbols are used for bounding boxes: b and \mathcal{B}, please adjust.
- Section 4.1: the class dirtbike can have many “root classes” according to Fig. 6. Why has the class “motorcycle” been chosen? I am afraid I missed something.
- Page 7: How did you select positive and negative images? I guess there is a ground-truth label for the whole image, but I cannot find its description. However, at the beginning of Section 4.2 an object detection problem is described; therefore, I do not understand what an (in)correctly classified image is.
- Section 4.4: the difference between GE+ and E+ILP is not clear to me. Why are they assembled together?
- In Section 4.4 I get lost when Figure 3 is described. Here I feel the need for a running example to better understand the pos/neg examples. In general, the background knowledge (BK) should state general common-sense information, such as “if A is next to B, then B is next to A” (see the small illustration after this list), but here there is an undefined “contains(19, 0)”. The mode bias seems a more suitable candidate for BK but, unfortunately, there is no definition of what a mode bias is for non-experts in ILP.
- Page 9: What do you mean by “… applying it, with appropriate adjustments, in similar applications settings is suggestive”? Please use more precise and formal wording, as required in a scientific paper.
- Page 9, bullet 4: How are other attributes (e.g., material, shape, color) chosen? Randomly?
- Caption of Table 1: why did you test only the models trained for 160 epochs and not the models trained for 80 and 320 epochs? It seems that these models are never used. Why did you use the plural “neural network models”? I thought you trained only one YOLOv5 model for all the classes in the hierarchies. If this is not the case, please state it explicitly. What is the criterion for wrong rules that you used for the X symbol?
- Page 14, second-to-last line: Popper fails with offroad car, offroad vehicles and specialized vehicles (Table 1), not with pickup truck and articulated bus as stated in the sentence.
- What is the “native scene graph” mentioned on page 16? Please clarify this.
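Regarding the comment on Section 4.4 above: what I would expect as general common-sense background knowledge is a universally quantified rule such as

    \[
    \forall A, B.\; \mathit{next\_to}(A, B) \rightarrow \mathit{next\_to}(B, A)
    \]

(my own example, not taken from the paper), whereas a ground fact like contains(19, 0) describes a single scene instance; the paper should make explicit which of these two roles the BK and the mode bias play.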
-------------------------------------------
Other (not minor) technicalities
-------------------------------------------
- Two concerns regard the model mending. In Section 6.1, in the model mending paragraph, only 12 new images are generated. To me, 12 new images per slice are negligible in a training set of 10K images. If 12 images are sufficient, this is a result that deserves more discussion. The second concern regards the extracted rules. My understanding of Fig. 5 is that hard(V0) is described by the first rule OR the second OR the third. Is this the case? If so, how did you encode this OR in the image generator? In general, there is no detailed description of how the mined rules are translated into specifications for the image generator (see the illustrative sketch at the end of this review).
- Fig. 2 shows the system architecture, which is interesting as it is a closed loop. It would therefore be interesting to run experiments with more cycles of this loop (at least 2) and see whether performance increases.
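To make the point above about rule translation concrete, here is the kind of description I am missing, written as a purely hypothetical sketch: if hard(V0) is the disjunction of three mined rule bodies, one natural encoding is to sample, for each augmented image, a single rule body and turn it into generator constraints, so that the OR is covered across the generated set as a whole. The rule bodies and constraint names below are invented by me and do not come from the paper:

    import random

    # Hypothetical bodies of the three mined clauses for hard(V0); invented for illustration.
    RULE_BODIES = [
        {"class": "dirtbike", "relation": "occluded_by", "other": "truck"},
        {"class": "dirtbike", "size": "small", "color": "gray"},
        {"class": "dirtbike", "relation": "next_to", "other": "articulated bus"},
    ]

    def sample_generator_specs(n_images: int, seed: int = 0) -> list:
        """Each generated image satisfies ONE clause; the disjunction is covered
        by the augmented set as a whole rather than by any single image."""
        rng = random.Random(seed)
        return [rng.choice(RULE_BODIES) for _ in range(n_images)]

    for spec in sample_generator_specs(12):  # e.g., the 12 mended images per slice
        print(spec)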