By Ivan Donadello
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Weak
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Weak
Detailed Comments:
The paper presents a NeSy method for the problem of slice discovery (SD) in Computer Vision (CV). SD aims at mining the input data (images in CV) for semantically meaningful groups of data on which a CV predictive model performs poorly. The proposed solution is composed of several modules: image generation, model classification, scene graph generation, labelling of positive and negative scene graph examples, mining of rules that describe these positive and negative scene graph examples with Inductive Logic Programming (ILP) methods, and image generation from these rules for data augmentation and model mending, thus creating a sort of loop.
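For concreteness, this is how I reconstructed the loop while reading. The following Python sketch is purely illustrative; every component name in it is a placeholder of mine, not the authors' API:

    from typing import Callable, Tuple

    # My reading of the closed loop, as a sketch. Each module is passed in as a
    # callable; none of these names come from the paper.
    def slice_discovery_loop(
        generate: Callable[[dict], Tuple[list, list]],   # spec -> (images, ground truth)
        predict: Callable[[list], list],                 # images -> model predictions
        to_scene_graphs: Callable[[list], list],         # images -> scene graphs
        label: Callable[[list, list, list], Tuple[list, list]],  # -> (pos, neg) examples
        mine_rules: Callable[[list, list], list],        # noise-robust ILP step
        rules_to_spec: Callable[[list], dict],           # rules -> generator specification
        mend: Callable[[list, list], None],              # fine-tune the model on new images
        initial_spec: dict,
        iterations: int = 1,
    ) -> list:
        spec, rules = initial_spec, []
        for _ in range(iterations):
            images, truth = generate(spec)               # image generation
            preds = predict(images)                      # model classification / detection
            graphs = to_scene_graphs(images)             # scene graph generation
            pos, neg = label(graphs, preds, truth)       # pos/neg scene graph examples
            rules = mine_rules(pos, neg)                 # ILP rule mining
            spec = rules_to_spec(rules)                  # rules -> image generator spec
            new_images, new_truth = generate(spec)       # data augmentation
            mend(new_images, new_truth)                  # model mending
        return rules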
I find the problem very interesting, and I really like the technique of mining knowledge (with ILP systems that are robust to noise) from pos/neg examples to perform data augmentation. I also appreciate the use of a taxonomy for rare slice generation, which gives better control over visually similar objects that nevertheless belong to different subclasses, as stated at the beginning of Section 5.
Despite the potential of the work, some important concerns make me vote for a major revision. My concerns are listed below.
-------------------------------------------
Problem Definition
-------------------------------------------
- I imagine that in [16] the definition of a slice contains the sentence “the model performs poorly”, but this is ambiguous, and a criterion should be defined/adopted. This can be seen in Fig. 9 and Section 6.2.1, where no reason is given why those particular classes of objects are rare slices. Which criterion was chosen? The 5 worst-performing classes? Classes with a recall lower than a threshold (an illustrative sketch of such a criterion follows this list)? I think that this part should be made clearer with a better definition/criterion for rare slices.
- Regarding Fig. 9 again, I would have expected lower performance for the rare slices. I would not call a class with 80% recall a rare slice.
- In hierarchies 1 and 2 of the VT taxonomy there are no rare slices “as expected”. But why did you expect this?
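To illustrate the threshold option mentioned above, a rare-slice criterion could be as simple as the following sketch; the function name and the per-class recall numbers are made up by me and are not from the paper:

    # Illustrative only: "a class is a rare slice if its recall is below a threshold".
    def rare_slices(per_class_recall: dict, threshold: float = 0.5) -> list:
        """Return the classes whose recall falls below the threshold."""
        return [cls for cls, recall in per_class_recall.items() if recall < threshold]

    # Example with invented numbers (not results from the paper):
    print(rare_slices({"dirtbike": 0.35, "pickup truck": 0.82, "articulated bus": 0.47}))
    # -> ['dirtbike', 'articulated bus']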
-------------------------------------------
Paper Contextualization
-------------------------------------------
- The related work focuses on slice discovery methods and ILP. While the former is the focus of the paper, the latter is just background that (in my opinion) should be reduced and moved to the background section.
- In the related work about slice discovery methods, a comparison with other works showing how the proposal addresses previously open problems would help the reader better contextualize the paper.
- The paper would be improved if contextualized with respect to the kinds of approaches it uses, that is, Neurosymbolic AI and the mining of discriminative knowledge from pos/neg examples. Regarding Neurosymbolic AI, this approach is quite different from approaches that embed logic in neural networks or embed some differentiable function in logic systems. A discussion of what kind of NeSy integration this paper proposes would be interesting. To this end, Kautz’s taxonomy could be helpful, see Section 2 of the paper at https://arxiv.org/pdf/2105.05330. Regarding the mining of discriminative knowledge from pos/neg examples, a recent paper does something similar for characterizing pos/neg examples of temporal traces, see the paper titled “Making Sense of Temporal Event Data: A Framework for Comparing Techniques for the Discovery of Discriminative Temporal Patterns” (Di Francescomarino et al., CAiSE 2024). It would be interesting to position the present paper with respect to this line of research.
-------------------------------------------
Results
-------------------------------------------
- The results are measured only in terms of recall (if I understand correctly, as it is not specified in the confusion matrices), but no justification is given for why only recall has been chosen. I would also like to see the precision and F1 results (the standard definitions I have in mind are recalled after this list), since after model mending a higher recall could correspond to a lower precision.
- I would discuss more the impact of the rule head penalty, as, for some classes, certain values lead to wrong rules or no rules at all. The exception ratio, instead, does not seem to impact the discovery. Please elaborate on this.
- The paper shows the confusion matrix only for the recall of VT hierarchy 4. Other results (precision, recall, F1 for all the hierarchies VT 3, VT 4 and PP 1, before and after model mending) would make the paper more self-contained if included as appendices.
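For reference, these are the standard definitions I have in mind when asking for precision and F1 in addition to recall:

    \[
    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    \]

A model mended to boost recall (more true positives at the cost of more false positives) can lose precision, which is exactly why reporting all three metrics matters.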
-------------------------------------------
Limitations
-------------------------------------------
- I find the method highly tailored to the Super-Clever image generator and to the scene graph generation task. There is no discussion of how the extracted rules could be used with other synthetic-image generation methods to produce images of a different domain. In addition, it seems to me that the method is applicable only to images from which a scene graph can be extracted. Therefore, it can be hard to extend the method to, for example, medical images such as radiographs, which cannot always be translated into a scene graph. If this is the case, the example of chest X-rays in the introduction can be misleading and should be changed.
- In addition, a section showing the limitations of the approach would help the reader understand what the method does not address (or addresses with difficulty) and could foster further research.
-------------------------------------------
Presentation
-------------------------------------------
The structure of the paper is good, but sometimes I feel lost without a proper running example. There are some examples, but they are not always connected to one another. I strongly suggest using a running example.
Other concerns about the presentation:
- Section 3.2: What is the expressivity of the language of B, h and E? Fully propositional? First-order? Please specify it for all three methods.
- Page 6: two different symbols are used for bounding boxes: b and \mathcal{B}, please adjust.
- Section 4.1: the class dirtbike can have many “root classes” according to Fig. 6. Why has the class “motorcycle” been chosen? I am afraid I missed something.
- Page 7: How did you select positive and negative images? I guess there is a ground-truth label for the whole image, but I cannot find its description. However, at the beginning of Section 4.2 an object detection problem is described; therefore, I do not understand what an (in)correctly classified image is.
- Section 4.4: the difference between GE+ and E+ILP is not clear to me. Why are they assembled together?
- In Section 4.4 I get lost when Figure 3 is described. Here I feel the need for a running example to better understand the pos/neg examples. In general, the background knowledge (BK) should state general common-sense information, such as “if A is next to B, then B is next to A” (see the small illustration after this list), but here there is an undefined “contains(19, 0)”. The mode bias seems a more suitable candidate for BK but, unfortunately, there is no definition of what a mode bias is for non-experts in ILP.
- Page 9: What do you mean by “… applying it, with appropriate adjustments, in similar applications settings is suggestive”? Please use more precise and formal wording, as required in a scientific paper.
- Page 9, bullet 4: How are other attributes (e.g., material, shape, color) chosen? Randomly?
- Caption of Table 1: why did you test only the models trained for 160 epochs and not the models trained for 80 and 320 epochs? It seems that these models are never used. Why did you use the plural “neural network models”? I thought you trained only one YOLOv5 model for all the classes in the hierarchies. If this is not the case, please state it explicitly. What is the criterion for wrong rules that you used for the X symbol?
- Page 14, second-to-last line: Popper fails with offroad car, offroad vehicles and specialized vehicles (Table 1), not with pickup truck and articulated bus as stated in the sentence.
- What is the “native scene graph” mentioned on page 16? Please clarify this.
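Regarding the comment on Section 4.4 above: what I would expect as general common-sense background knowledge is a universally quantified rule such as

    \[
    \forall A, B.\; \mathit{next\_to}(A, B) \rightarrow \mathit{next\_to}(B, A)
    \]

(my own example, not taken from the paper), whereas a ground fact like contains(19, 0) describes a single scene instance; the paper should make explicit which of these two roles the BK and the mode bias play.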
-------------------------------------------
Other (not minor) technicalities
-------------------------------------------
- Two concerns regard the model mending. In Section 6.1, in the model mending paragraph, only 12 new images are generated. To me, 12 new images per slice are negligible in a training set of 10K images. If 12 images are sufficient, this is a result that deserves more discussion. The second concern regards the extracted rules. My understanding of Fig. 5 is that hard(V0) is described by the first rule OR the second OR the third. Is this the case? If so, how did you encode this OR in the image generator? In general, there is no detailed description of how the mined rules are translated into specifications for the image generator (see the illustrative sketch at the end of this review).
- Fig. 2 shows the system architecture, which is interesting as it is a closed loop. It would therefore be interesting to run experiments with more cycles of this loop (at least 2) and see whether performance increases.
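To make the point above about rule translation concrete, here is the kind of description I am missing, written as a purely hypothetical sketch: if hard(V0) is the disjunction of three mined rule bodies, one natural encoding is to sample, for each augmented image, a single rule body and turn it into generator constraints, so that the OR is covered across the generated set as a whole. The rule bodies and constraint names below are invented by me and do not come from the paper:

    import random

    # Hypothetical bodies of the three mined clauses for hard(V0); invented for illustration.
    RULE_BODIES = [
        {"class": "dirtbike", "relation": "occluded_by", "other": "truck"},
        {"class": "dirtbike", "size": "small", "color": "gray"},
        {"class": "dirtbike", "relation": "next_to", "other": "articulated bus"},
    ]

    def sample_generator_specs(n_images: int, seed: int = 0) -> list:
        """Each generated image satisfies ONE clause; the disjunction is covered
        by the augmented set as a whole rather than by any single image."""
        rng = random.Random(seed)
        return [rng.choice(RULE_BODIES) for _ in range(n_images)]

    for spec in sample_generator_specs(12):  # e.g., the 12 mended images per slice
        print(spec)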