By Luca Bergamin
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The paper discusses a collection of techniques for understanding which high-level concepts a neuron attends to, in the context of common convolution-based architectures. In particular, the authors leverage both symbolic, ontology-based techniques and LLM-based adaptations. Their findings provide experimental evidence comparing the strengths and weaknesses of LLM-based and symbolic reasoners. The paper extends several conference papers published by the authors at NeSy 24.
The work is indisputably valuable and of interest to this journal. Nonetheless, it requires careful presentation: the sizable number of contributions across multiple aspects makes the paper quite hard to digest at first, particularly for readers who may need a refresher on the concepts used throughout.
In the following points, I refer to specific parts of the paper using P11,r22, where 11 is the page and 22 is the row.
1. The literature review would benefit from being presented in table or graph form, comparing the main axes (such as neural/symbolic nature, the degree of supervision required, etc.) and possibly proposing a proper taxonomy of the cited works.
2. Also, the relationship between the cited related works and the present work could be discussed further, e.g., by explaining in which respects your proposal addresses the weaknesses of each method.
3. Regarding the discussion of “explaining a neural network through concepts” (cf. p3,r11), reporting some works related to “having a neuron active for many concepts at once” could be beneficial. To this end, the literature on disentangled representations (Bengio et al., 2013; Locatello et al., 2019) could be useful. Another useful keyword is “polysemantic neurons” (i.e., neurons that fire under multiple stimuli).
4. While I understand the utility of structuring the notions related to each section so that the needed background appears at the beginning of that section, some common preliminary notions could be moved to a background section before Section 3. This section could also provide a visual example to help the reader understand all the inputs/outputs involved in the system. In my opinion, this would make the paper less a collection of existing published papers and more a comprehensive work on the topic.
5. (cf. p26,r36) It is quite strange that only ResNet50V2 achieved high validation accuracy, while the other architectures show a large gap with respect to their training accuracy, especially when early stopping is used. Do other metrics (e.g., top-k accuracy) highlight this issue as well? Could you compare the confusion matrices? Also, is patience=3 / learning rate=0.001 sufficient/necessary for fine-tuning on this task? Fine-tuning usually gives better results with lower learning rates and/or more epochs (a minimal sketch of what I have in mind is given after this list). While I understand the argument that high accuracy is not strictly needed, the explanations should be computed on a sufficiently reliable/performant model, and I cannot see why ResNet50V2 has such a wide margin over the classic ResNet50.
6. Regarding the statistical testing: at p13,r23 you state that you use the Mann-Whitney U test because it does not require normally distributed data. It is unclear to me whether this test should be corrected for the multiple analyses performed, and why or why not (see the sketch after this list). Also, you mention there is no reason to assume that activation values follow a normal distribution; can you show an example?
7. Some formalization of the described methods would help make the paper self-contained, even at the cost of redundancy with already published material. In particular, the statistical analysis tools used extensively throughout the paper could be introduced; it is unclear how ECII works without consulting the cited literature; and some details of the inner workings of CAV/CAR could be provided as well.
8. I am not sure of the usefulness of Tables 6-8. In particular, they show the raw performance in both the training and test settings. Would a chart not be more informative, especially when comparing the results of GPT/CLIP/Concept Induction? These tables could be moved to an appendix if possible. Also, I am unsure of the utility of reporting the training accuracy at all if it is not discussed in the paper.
9. Regarding the “Further discussion” subsection, there are a few claims that could be discussed in more depth:
9a. P27,r3: “it is unclear how to craft the pool of candidate concepts”; can you expand on this topic?
9b. P27,r5: “tailored to the application scenario”; can you provide an example?
9c. P27,r9: “it is equally vital to thoughtfully design this pool”; could you better explain the risks of a poorly designed pool?
10. How would this extend to other datasets? Can you give an example?
11. The limitations of the work could be summarized in a dedicated section at the end of the paper (e.g., activation patterns involving more than one neuron, the requirement of labeled data, the single-dataset analysis, concept formation across multiple layers). Mitigations and/or suggestions for implementing these improvements could be reported as well.
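To illustrate point 5, a minimal, hypothetical Keras-style sketch of the fine-tuning setup I have in mind (lower learning rate, longer early-stopping patience, a top-k metric); the placeholder data and class count are mine, not the paper's, and would need to be replaced by the authors' pipeline:

# Hypothetical fine-tuning sketch (not the authors' exact setup).
import numpy as np
import tensorflow as tf

num_classes = 20                                 # placeholder: the dataset's class count
# Placeholder random images/labels only to make the sketch self-contained.
x = np.random.rand(32, 224, 224, 3).astype("float32")
y = np.random.randint(0, num_classes, size=32)
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)
val_ds = train_ds

base = tf.keras.applications.ResNet50V2(include_top=False, pooling="avg", weights="imagenet")
model = tf.keras.Sequential([base, tf.keras.layers.Dense(num_classes, activation="softmax")])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),   # lower than the reported 1e-3
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy", tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)],
)
model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=50,                                                 # room to converge; early stopping trims it
    callbacks=[tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                patience=10,   # more forgiving than patience=3
                                                restore_best_weights=True)],
)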
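To clarify what I mean in point 6, a small Python sketch of a per-neuron two-sample test with a multiplicity correction and a quick normality probe; the activations here are synthetic placeholders, whereas in the paper they would come from the trained network:

# Hypothetical sketch: one Mann-Whitney U test per neuron, corrected for multiplicity.
import numpy as np
from scipy.stats import mannwhitneyu, shapiro
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Placeholder activations: rows = images, columns = neurons.
act_concept = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 64))   # images showing the concept
act_other = rng.lognormal(mean=0.2, sigma=1.0, size=(200, 64))     # images not showing it

p_values = [mannwhitneyu(act_concept[:, j], act_other[:, j]).pvalue
            for j in range(act_concept.shape[1])]

# One test per neuron -> correct for multiplicity (Holm controls the family-wise error rate).
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print("significant neurons after correction:", int(reject.sum()))

# A Shapiro-Wilk test on one neuron is the kind of evidence I would expect for the
# claim that activation values are not normally distributed.
print("Shapiro-Wilk p-value (neuron 0):", shapiro(act_concept[:, 0]).pvalue)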
Minors:
1. Regarding the CAR non-linear kernel, some details are missing, e.g., the value chosen for the bandwidth of the RBF kernel (see the sketch after this list).
2. The Levenshtein string similarity metric is undefined (p29,r41); a one-line definition (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other) would suffice.
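For minor point 1, the detail I am referring to is the bandwidth of the RBF kernel; assuming a scikit-learn-style classifier is used for the CAR concept regions (an assumption on my part), it enters through the gamma argument, whose default silently depends on the data:

# Hypothetical sketch of where the RBF bandwidth appears in a CAR-like classifier.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # placeholder activation vectors
y = (X[:, 0] > 0).astype(int)           # placeholder concept labels

# gamma is the RBF bandwidth parameter; sklearn's default "scale" is 1 / (n_features * X.var()),
# so the learned concept region changes with an explicit choice such as gamma=0.1.
car_like = SVC(kernel="rbf", gamma="scale").fit(X, y)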
Grammar and general layout:
1. P3,r11: “Neural Network through concepts is be a two-step process” -> “[...] is a two-step process”
2. p5,r43: there should be a brief discussion before subsubsection 3.1.1 starts, to avoid an empty subsection.
3. P20,r28: “k-fold cross validation” vs. p22,r37: “K-fold cross validation”; keep the notation consistent.
4. P37,r44: necessitate -> necessitates
5. P29,r3: beforew -> before