By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The proposed KEA Explain is a neurosymbolic method for LLM hallucination detection and explanation. Its main strengths include the adoption of graph-based heuristics, specifically the Weisfeiler-Lehman subtree kernel, for robust structural comparison of knowledge graphs. Furthermore, the generation of contrastive explanations is a defining feature, directly addressing the critical "lack of explainability" limitation of existing methods by detailing discrepancies between the claim and the ground truth facts.
Strengths:
- Adopting graph-based heuristics to evaluate hallucinations is a novel and compelling approach. Using a subtree kernel for structural comparison seems a powerful way to move beyond simple triple-matching methods and incorporate the wider context of the knowledge graph structure (a small illustrative sketch follows this list).
- The framework's ability to generate contrastive explanations is its defining strength, directly addressing the critical "lack of explainability" limitation of existing methods. The explanations detail not only why a statement is hallucinatory but also what change would correct it.
- The method is relevant for the NeSy community, as it combines the strengths of symbolic components (KGs, graph kernels) with neural techniques (SBERT embeddings for semantic clustering). This allows the symbolic comparison to account for the semantic similarity of labels, making the comparison robust.
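To make the point about structural comparison concrete for other readers, the following is a minimal, self-contained sketch of a Weisfeiler-Lehman subtree kernel over two toy labelled graphs. It is my own illustration, not the authors' implementation; the node-labelling scheme and the toy graphs are invented for the example.

```python
# Minimal WL subtree kernel sketch (reviewer's illustration, not the paper's code).
from collections import Counter

def wl_histogram(adj, labels, iterations=2):
    """Accumulate the multiset of WL labels over several refinement rounds.
    adj: {node: [neighbours]}, labels: {node: initial label string}."""
    hist = Counter(labels.values())
    for _ in range(iterations):
        # WL relabelling: compress (own label, sorted neighbour labels) into a new label.
        labels = {
            node: labels[node] + "|" + ",".join(sorted(labels[n] for n in neighbours))
            for node, neighbours in adj.items()
        }
        hist.update(labels.values())
    return hist

def wl_kernel(adj_a, lab_a, adj_b, lab_b, iterations=2):
    """Kernel value = dot product of the two graphs' WL label histograms."""
    h_a = wl_histogram(adj_a, lab_a, iterations)
    h_b = wl_histogram(adj_b, lab_b, iterations)
    return sum(h_a[label] * h_b[label] for label in h_a.keys() & h_b.keys())

# Two toy graphs with different entities but identical local structure and type labels:
# the kernel rates them as similar, which exact triple matching would miss.
claim_adj = {"Paris": ["France"], "France": ["Paris"]}
claim_lab = {"Paris": "city", "France": "country"}
truth_adj = {"Rome": ["Italy"], "Italy": ["Rome"]}
truth_lab = {"Rome": "city", "Italy": "country"}
print(wl_kernel(claim_adj, claim_lab, truth_adj, truth_lab))
```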
Weaknesses:
- I think the performance of the approach relies heavily on the quality of the constructed ground-truth KG, which in turn depends on several possible failure points: the SentenceBERT embeddings, the spaCy entity linker, and the empirically chosen similarity thresholds. The comparison with baselines suggests that the model's value may ultimately lie more in its structured explanations than in detection performance alone.
- Evaluating open-domain hallucination detection only on the WikiBio dataset is somewhat limiting, given its modest size. Furthermore, while competitive, the detection performance appears suboptimal compared to certain baselines.
- The Related Work section needs a deeper discussion of the interpretability limitations of existing models and a more robust justification for preferring the proposed approach over them.
Additional comments:
- Regarding Algorithm 1, the condition "attributes differ" needs clarification. I assumed it refers to a discrepancy in the relation label (the r in an (h, r, t) triple) even when the head and tail entities are identical; however, in the discussion the authors refer to a conflict at the entity level (Paris vs. Rome, i.e. differing tail entities).
- The paper clearly states that an LLM generates the natural language explanation, yet it specifies neither which LLM is used for this task nor an example of the prompt used for explanation generation.
- The ground-truth KG is filtered by selecting, for each claim KG triple, the ground-truth triple that maximizes the cosine similarity of their SBERT embeddings (arg max). From my understanding, this reliance on arg max is a concern because it can select irrelevant context triples as long as they are the closest available in the embedding space, potentially retaining triples that are only loosely related (similar domain) but not directly tied to the entities being evaluated. A strict similarity threshold applied alongside the arg max would make this filtering step more robust (a small sketch follows at the end of these comments).
- The reliance on empirically chosen graph-kernel similarity thresholds, which vary significantly by task, adds complexity. This sensitivity to domain and task means that real-world deployment would likely require re-optimizing the threshold before use.
- The framework incurs a significant practical computational burden. I think that programmatically constructing ground truth KGs and, for open-domain tasks, retrieving relevant facts from Wikidata on the fly via SPARQL queries can be time-consuming, which could be a limitation for real-time applications. Is this limitation worth discussing?
- The observation that explanation quality monotonically declines as hallucination severity decreases is important. It suggests the current method struggles with subtle inconsistencies, as it is too reliant on finding concrete conflicting triples, which are sparser in nuanced hallucinations.
- For the triple notation, e.g., (h,r,t), using italics to distinguish variable names from prose is recommended for clarity.
- This is more of a suggestion: the structured nature of the contrastive explanation (based on graph edit operations) is ideally suited to generating follow-up prompts (e.g. to the same LLM that generated the hallucinated sentence) to actively correct the hallucination. This could be a valuable direction for future exploration.
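To make the filtering suggestion above concrete, here is a minimal sketch of arg-max retrieval guarded by a strict similarity threshold. It is my own illustration, not the authors' implementation; the SBERT model choice, function names, example triples, and the threshold value are all hypothetical.

```python
# Reviewer's sketch: arg-max triple retrieval plus a strict similarity threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT model would do

def filter_ground_truth(claim_triples, gt_triples, threshold=0.6):
    """For each claim triple, keep its best-matching ground-truth triple,
    but only when the cosine similarity clears the threshold."""
    claim_emb = model.encode([" ".join(t) for t in claim_triples], convert_to_tensor=True)
    gt_emb = model.encode([" ".join(t) for t in gt_triples], convert_to_tensor=True)
    sims = util.cos_sim(claim_emb, gt_emb)      # shape: (n_claim, n_gt)
    kept = []
    for i in range(len(claim_triples)):
        j = int(sims[i].argmax())               # arg max, as in the paper
        if float(sims[i][j]) >= threshold:      # extra guard suggested above
            kept.append(gt_triples[j])
    return kept

# Hypothetical example
claim = [("Paris", "capital_of", "Italy")]
ground_truth = [("Rome", "capital_of", "Italy"), ("Paris", "capital_of", "France")]
print(filter_ground_truth(claim, ground_truth))
```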