By Luca Bergamin
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
Greetings,
Thank you for your detailed response to my review comments. My reply to each point follows, using this notation: Qn is my original question, Rn is your answer, and An is my new comment. If a point from my original review does not appear below, it means I was satisfied with the answer provided and have no additional requests or objections.
Q1: The literature review would benefit from being presented in a table/graph form, comparing the main axes (such as the neural/symbolic nature, the degree of supervision required, etc.) and possibly proposing a proper taxonomy of the works cited.
R1: We thank the reviewer for the valuable suggestion to present the literature review in a table/graph form. We considered this option; however, we opted for a narrative format in order to provide detailed qualitative insights into the strengths and limitations of each method. Our approach allows us to discuss nuances—such as the degree of supervision required, the dynamic versus static nature of the concept pools, and the neural versus symbolic components—in a cohesive, contextual manner that a table might not capture. That said, we are open to providing a supplementary summary table if the reviewer believes it is needed to further enhance clarity.
A1: I would still strongly advise adding a table that sums up the literature review. If you believe there are nuances a table cannot capture, you can point them out in any cell that needs it. What I envision is something along the lines of Table 1 of "Improving rule-based classifiers by Bayes point aggregation" (Bergamin et al., 2025). As you mentioned, there are several axes (degree of supervision, concept pools, neural vs. symbolic, etc.), each of which can become a column of the table; see the illustrative sketch below. Personally, I would prefer the table to be in the main text.
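To make the suggestion concrete, here is a purely illustrative sketch of the layout I have in mind; the method names and every cell value are placeholders, not claims about any of the cited works:

    Method         | Neural/Symbolic | Supervision       | Concept pool | Nuances
    ---------------|-----------------|-------------------|--------------|--------------------------------
    Method A [ref] | Neural          | Concept labels    | Static       | e.g., caveats can be noted here
    Method B [ref] | Symbolic        | Unsupervised      | Static       | ...
    Method C [ref] | Hybrid          | Class labels only | Dynamic      | ...

Each nuance you mention in the narrative can then become either a dedicated column or a short note in the relevant cell.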
Q2: While I understand the utility of having the notions related to each section structured to give the background needed at the beginning of each section, some common preliminary notions could be moved to a background section before entering Section 3. This section could also help provide a visual example to help understand all the inputs/outputs involved in the system. In my opinion, this would help to make the paper less of a collection of existing published papers and more of a comprehensive work on the topic.
R2: [answer omitted] However, we are open to incorporating a brief background subsection if the reviewer believes it is necessary to further improve readability and can be done without disrupting the narrative flow.
A2: I think this is a very good idea that should be incorporated to make the paper more accessible to readers. I advise the authors to incorporate this point.
Q3: (cf. p26, r36) It is quite strange that only ResNet50V2 achieved high validation accuracy scores, while the other architectures show a big gap with the training accuracy, especially when using early stopping. Do other metrics (e.g., top-k accuracy) highlight this issue as well? Could you compare the confusion matrices? Also, is patience = 3 / learning rate = 0.001 sufficient/necessary to fine-tune this task? Usually, you can get better results in fine-tuning with lower learning rates and/or more epochs. While I understand the argument that high accuracy is not strictly needed, the explanations should be made on a sufficiently reliable/performant model, and I cannot see why ResNet50V2 has such a wide margin compared to the classic ResNet50.
R3: We thank the reviewer for the valuable feedback regarding model performance and hyperparameter choices. In our extensive experiments, we evaluated several architectures (including VGG16, InceptionV3, Resnet50, Resnet50V2, Resnet101, and Resnet152V2) and tested various hyperparameter configurations (different learning rates, patience values, and number of epochs). Ultimately, Resnet50V2 achieved the best overall performance, with consistent training and validation accuracy levels. Although our primary focus is on generating and interpreting explanations rather than maximizing classification accuracy, the 87% validation accuracy achieved by Resnet50V2 is robust enough for our purposes. The choices of a patience of 3 and a learning rate of 0.001 were derived from extensive preliminary tuning; while further fine-tuning (e.g., lower learning rates or more epochs) might yield incremental improvements, such modifications were not necessary given that our task prioritizes explanation fidelity. Overall, our current approach strikes a sufficient balance between model performance and the reliability of the generated explanations.
A3: I still have doubts about the soundness of this part of the experimental setting, due to the lack of systematic hyperparameter tuning (hyperparameters were set ad hoc) and the lack of additional evidence such as other metrics, confusion matrices, or even just training loss plots. As you are very well aware, this can lead to unwanted under-/overfitting and other uncontrolled model behavior. I still believe this is a weaker side of the paper, but I agree it was not the focus to begin with. Therefore, I have no explicit requests for this point (though the authors are welcome to improve it if they deem it necessary); a sketch of the kind of systematic check I have in mind follows below.
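For concreteness, a minimal sketch of such a sweep, assuming a Keras/TensorFlow setup; NUM_CLASSES and the synthetic train_ds/val_ds are placeholders for your actual task and data pipeline, not references to your code:

    import tensorflow as tf
    from tensorflow.keras.applications import ResNet50V2

    NUM_CLASSES = 10  # placeholder: the number of classes in your task

    # Synthetic stand-in data so the sketch runs end to end; replace with
    # your real tf.data pipelines.
    images = tf.random.uniform([16, 224, 224, 3])
    labels = tf.random.uniform([16], maxval=NUM_CLASSES, dtype=tf.int32)
    train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)
    val_ds = train_ds

    def build_model(learning_rate):
        # ImageNet-pretrained backbone, fully trainable for fine-tuning.
        base = ResNet50V2(weights="imagenet", include_top=False, pooling="avg")
        outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
        model = tf.keras.Model(base.input, outputs)
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate),
            loss="sparse_categorical_crossentropy",
            # Top-k accuracy alongside plain accuracy, as suggested in Q3.
            metrics=["accuracy", tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)],
        )
        return model

    # Sweep lower learning rates with more epochs and a larger patience,
    # restoring the best weights according to validation accuracy.
    for lr in (1e-3, 1e-4, 1e-5):
        model = build_model(lr)
        model.fit(train_ds, validation_data=val_ds, epochs=50,
                  callbacks=[tf.keras.callbacks.EarlyStopping(
                      monitor="val_accuracy", patience=10,
                      restore_best_weights=True)])

Even a small grid like this, reported together with the resulting training/validation curves, would address most of my concern.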
Q4: I am not sure of the usefulness of Tables 6-7-8. In particular, they show the raw performance in both training and test settings. Wouldn't a chart be more informative, especially while comparing the results of GPT/CLIP/Concept Induction? Those tables could be moved to an Appendix if possible. Also, I am unsure of the utility of having the training accuracy reported as well, if not discussed in the paper.
R4: Thank you for the suggestion. We considered adding charts; however, we were unable to come up with a good and meaningful way to visualize the data in the tables without adding unnecessary redundancy and length. We would be happy to receive concrete suggestions.
A4: Thank you for your response to my comment regarding Tables 6, 7, and 8. I understand that you were unable to devise a meaningful way to visualize the data without adding redundancy or length to the paper. One option could be to craft a bar chart for each row, sorted by a target metric (e.g., either CAR or CAV test accuracy); to improve readability, the charts could be split across multiple columns to reduce length (see the sketch below). Another option could be to show a summary table instead, reporting the mean accuracy and standard deviation for each category, and move the full tables to the supplementary materials. In essence, to be useful, the tables need to visually convey what you want to compare. If an external reader takes the tables in isolation, they show that CAR sometimes works better than CAV under the test accuracy metric and sometimes does not; I am not sure this should be their purpose. Could you briefly comment on what you believe their purpose is? That way, I could provide more precise advice on their presentation.
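To illustrate the first option, a minimal sketch; the column names (car_test_acc, cav_test_acc) and the numbers are placeholders for whatever Tables 6-8 actually report:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder data standing in for one of Tables 6-8; replace with real values.
    df = pd.DataFrame({
        "concept": ["concept_1", "concept_2", "concept_3", "concept_4"],
        "car_test_acc": [0.91, 0.78, 0.85, 0.66],
        "cav_test_acc": [0.88, 0.81, 0.79, 0.70],
    })

    # Sort rows by the target metric so the CAR/CAV comparison is visually obvious.
    df = df.sort_values("car_test_acc", ascending=False)

    ax = df.plot.bar(x="concept", y=["car_test_acc", "cav_test_acc"],
                     rot=45, figsize=(8, 3))
    ax.set_ylabel("Test accuracy")
    plt.tight_layout()
    plt.show()

    # The summary-table alternative: mean and standard deviation per method.
    print(df[["car_test_acc", "cav_test_acc"]].agg(["mean", "std"]))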
Q5: 9c. P27, r9: "it is equally vital to thoughtfully design this pool"; could you better explain what the risks of a poorly designed pool are?
R5: P27, r9: The manuscript says: "it is equally vital to thoughtfully design this pool. Neglecting this aspect could result in overlooking crucial concepts essential for gaining insights into hidden layer computations." Our approach offers a way to integrate rich background knowledge and extract meaningful concepts from it.
A5: As the manuscript says, "neglecting this aspect results in overlooking crucial concepts essential for gaining insights into hidden layer computations." To an external reader, this sentence seems fuzzy and not precise enough; my request was simply to expand this explanation with additional context to make it more intuitive.
Q6: The limitations of the work could be summed up in a specific section at the end of the paper (e.g., activation patterns involving more than one neuron, requirement of labeled data, single-dataset analysis, concept formation across multiple layers). Mitigations and/or suggestions for implementing these improvements could be reported as well.
R6: If the reviewer deems it necessary, we can add a short “Limitations and Future Work” section to summarize these points more explicitly. However, we believe our existing “Further Discussion” already captures the essence of these limitations (e.g., focusing on the dense layer, the need for labeled data, single dataset use) and outlines how we plan to address them in future research.
A6: I believe it would be helpful to have such a section at the very end of the paper (before the conclusions) to sum up concisely all the limitations of the methods presented. It should cover all the preceding sections.
Q7: 1. Regarding the CAR non-linear kernel, some details (e.g., the value chosen for the bandwidth of the RBF kernel) are missing.
R7: ?
A7: I could not find an updated reference in the paper (I may have missed it, since it was not pointed out by the authors in their answer). I advise the authors to fully disclose the hyperparameters of their kernel methods to enhance the reproducibility of their work; a brief example of what I mean follows below.
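For example, if the CAR classifier is built on scikit-learn's SVC (an assumption on my part; adapt to whatever library you actually use), the RBF bandwidth corresponds to the gamma parameter, and the paper should state its concrete value or the heuristic used to set it:

    from sklearn.svm import SVC

    # The RBF bandwidth is controlled by `gamma`; report the value used, or the
    # rule, e.g. gamma="scale" means 1 / (n_features * X.var()).
    car_classifier = SVC(kernel="rbf", gamma="scale", C=1.0)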