Knowledge Graphs and Explainable AI for Drug Repurposing on Rare Diseases

Tracking #: 810-1801

Flag : Review Received

Authors:

Pablo Perdomo-Quinteiro

Responsible editor:

Alessandra Mileo

Submission Type:

Regular Paper

Full PDF Version:

nai-paper-810.pdf

Supplementary Files:

nai-supplementary-810.pdf

Cover Letter:

Dear editor, Please find our manuscript entitled "Knowledge graphs and explainable AI for drug repurposing on rare diseases" for your consideration. A key open challenge in ML-based drug-disease prediction is how to provide a human-understandable explanation that can aid biologists in the generation of testable hypotheses in the lab. We developed rd-explainer, a novel method that utilises knowledge graphs in combination with cutting-edge graph ML and XAI tools to provide semantic graphs as explanations supporting predictions. Graph neural networks are among the most used algorithms in drug repurposing, but how to combine them with background knowledge and XAI tools for better interpretability is barely explored, especially for the underrepresented group of rare diseases. We developed a novel interpretable ML algorithm that allows graph neural networks to provide semantic explanations that resemble human reasoning, and we combine this neuro-symbolic method with disease-specific knowledge graphs. Our approach is generic: it can be applied to different rare diseases and can be enhanced by disease-specific background knowledge. Using several evaluation tests and specific use cases, we demonstrate that our method can substantially improve the performance of drug-phenotype prediction. We believe that rd-explainer, as well as the underlying method combining knowledge representation and graph-based ML and XAI, will have a broad impact in ML-based biomedical discovery, both in the specific application of drug repurposing prediction and in related areas such as rare disease research. Therefore, we believe our work is highly suitable for Neurosymbolic Artificial Intelligence. Please do not hesitate to contact us should you require any further information. With kind regards, The authors.

Approve Decision:

Approved

Revised Version:

Knowledge Graphs and Explainable AI for Drug Repurposing on Rare Diseases

Tags:

Reviewed

Decision:
Major Revision

Solicited Reviews:

Review #1 submitted on 24/Jan/2025

By Janna Hastings
Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good

Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good

Detailed Comments:

The paper describes an implemented knowledge graph drug repurposing framework that uses a graph traversal-based explainability technique to offer explanations as chains of edges between drugs, diseases and mechanisms. The approach seems promising and the evaluation shows good performance.

One question I would have for the drug repurposing part is the extent to which the system returns false positives specifically. These types of repurposing recommendations could be imagined to have a relatively high rate of false positives. Of course, clinical validation of a given recommendation for drug repurposing would be out of scope for the current study, but perhaps clinical validation could be approximated by determining which drugs had failed in investigations for particular indications and using that as a quasi-validation dataset.

I didn't fully understand whether the chosen interpretability method is reproducible and, if not, what steps are taken to mitigate stochastic variability in performance.

Why use the 2021 versions of the source databases Monarch/DrugCentral/TTD without updates to more recent versions? Are significant changes expected in the underlying databases since then that might affect the outcome of the analysis?

"this explanation is classified into complete..." <-- maybe "classified as complete" would work better?

Review #2 submitted on 15/Jan/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak

Content:
Technical Quality of the paper: Weak
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Unsatisfactory
Overall presentation: Weak

Detailed Comments:

This paper introduces a new approach for link prediction and explanation in the context of drug repurposing for rare diseases. The paper is interesting, and I do believe the task is valuable and important.

However, I currently have a few issues with this paper. There are quite a few writing problems and missing details that prevent me from building a complete picture of this work. I also have some more general questions regarding the methodology.

At the present stage, I do not think the paper is ready for publication. I'd recommend rethinking the presentation. In this review I'll try to give some indications of what I think the authors should improve (I'll provide what I think are relevant examples, but I'd suggest that the authors "propagate" the feedback to other sections).

Ultimately, I do believe that this can become an interesting and valuable paper, but not before some additional work.

** On The Methodology **

The method, as I understand it, is at least partially incremental. The authors put together three pre-existing components (edge2vec, a GNN, and GNNExplainer) and only use a different dataset. I guess that the major contribution here could be the domain and the task, which I agree is very valuable and important.

I have a few questions regarding the setup, but I do apologize if I missed something.

It seems to me that edge2vec is used before training the GNN, and that it is run on the entire graph at that point. Doesn't this process leak information into the second step? That is, when the authors remove links to create the testing data, won't the node embeddings still encode those links? My assumption here is that the authors are actually using edge2vec as pre-initialization for the GNN.
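
To make the concern concrete, here is a minimal sketch of a leakage-free ordering, assuming the held-out edges are removed before any embedding is learned. The toy edge list, the helper names `split_edges` and `build_adjacency`, and the 80/20 ratio are illustrative assumptions and are not taken from the paper or its repository; the embedding step itself (edge2vec in the paper) is only indicated by a comment.

```python
import random


def split_edges(edges, test_fraction=0.2, seed=0):
    """Shuffle the edges and split them into training and held-out test sets."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_adjacency(edges):
    """Build an undirected adjacency map from exactly the edges passed in."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    return adjacency


# Hypothetical toy graph; node names are placeholders.
edges = [
    ("drug_a", "gene_1"), ("drug_b", "gene_2"), ("gene_1", "symptom_x"),
    ("gene_2", "symptom_y"), ("drug_a", "symptom_x"), ("drug_b", "symptom_y"),
    ("gene_1", "gene_2"), ("drug_a", "gene_2"), ("drug_b", "gene_1"),
    ("symptom_x", "symptom_y"),
]

# 1. Hold out the test edges first.
train_edges, test_edges = split_edges(edges)

# 2. Build the graph that the embedding method is allowed to see.
train_graph = build_adjacency(train_edges)

# 3. Random walks and embeddings would be computed on train_graph only here.

# Sanity check: no held-out edge is present in the graph used for embeddings.
leaked = [(s, t) for s, t in test_edges if t in train_graph.get(s, set())]
```

If `leaked` is non-empty, or if the embeddings are learned from `edges` rather than `train_edges`, the held-out links can influence the node features used at test time.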

There are many things that would require more details. For example, how is edge2vec used for link prediction in Table 5? How was the optimal value for edge2vec selected (see Table S5)?

Was optimization with Ray Tune also run for the baselines (Table 5)? In addition to this, the authors' model was tuned with Ray Tune, but was the tuning also run during cross-validation?

I could not find the model architecture, so I looked at the code, and it seems complicated enough (a few graph layers plus batch norms) that it might be worth adding to the paper for reproducibility purposes.

** On Writing **

In general, writing would benefit from more work as there are a few sentences that are unclear:

"Next, it is the link prediction for each drug-symptom node embeddings pair by using the dot product as scoring function" (page 4).
>> I think this sentence is missing a verb or something similar (there are a few typos in the paper).

"in the training dataset the supervision edges and the message passing edges are the same; in the validation dataset the message passing edges are the training edges (message and supervision) ... that are different from the training and validation supervision edges" (page 6).
>> This is very long and would benefit from being split and rephrased.

I'd recommend that the authors improve the general presentation and add more details. In particular, the related work never introduces the methods in detail (e.g., GraphSAGE). Table 1 is a bit too "high-level" as a summary of an entire field (which I understand you don't need to cite all of, but I'd still argue that some polishing here could improve the paper). I often found myself scrolling up and down to retrieve information from previous parts of the paper.

There are several minor writing problems (e.g., typos) that, when taken together, become a bit distracting.
The paper sometimes uses British and sometimes American English. "The code is freely accessible with an open license at https://github.com/PPerdomoQ/rare-disease-explainer" appears at least 3 times. Versions of the Python packages are reported in multiple sections where they could probably be appendix material.

First I read "GraphSAGE and GNNExplainer were implemented using PyTorch Geometric version 2.0.4" (page 4) but then "The GraphSAGE model was created using the DeepSNAP library" (page 6).
>> I guess this is because DeepSNAP uses PyTorch Geometric, but again, this is something that could have been polished.

Another issue comes from reference [3]: "and only in Europe about 36 million people suffer from rare diseases." When I click on the link, the actual figure is "between 27 and 36 million people live with a rare disease," which I understand might look like a minor thing, but I think this should be reported correctly.

The bibliography contains many typos, papers with missing journal names, etc. Also, there are >150 references but the paper seems to have fewer than 70?

The quality of the images seems low; I'd try exporting them at a higher resolution.

Review #3 submitted on 19/Feb/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak

Content:
Technical Quality of the paper: Weak
Originality of the paper: Yes
Adequacy of the bibliography: No

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Good

Detailed Comments:

- In Section 3.2: The overview of explainable AI in graph ML is a bit empty. Many more methods have been proposed for this task (e.g., GRETEL, PGExplainer), and it would be useful to have a table for them in the style of Table 1. Then an explanation of why GNNExplainer was picked out of all the others should be provided. Once all the pros of this model have been identified, then, for the sake of transparency, the disadvantages should also be given. For example, it should be highlighted that GNNExplainer is a post-hoc explainer, hence the explanations might not always be faithful. For example, if the GNN is trained with noisy data, GNNExplainer may highlight irrelevant edges or nodes simply because they correlate with predictions. Or again, GNNExplainer often focuses on the local neighborhood of a node, while some GNNs (e.g., Graph Attention Networks) might base predictions on long-range dependencies.

- In Section 4.1: the authors write that they used the dot product as a scoring function for each drug-symptom node embeddings pair. Why was this operator used and not some other (e.g., cosine similarity)?

- In Section 4.2: How is the information gathered from these three databases? Were they compatible? Which features does a node have and which features does an edge have? The paper might benefit from some restructuring in this sense, as some of this information is introduced in this section but more can then be found in Section 5.1. In general, it would be nice to have a figure where at most 5 rows are considered for each database, showing how the graph for each of the mini-databases is built and then how they are merged.

- General Question: in order to train a GNN for link prediction, negative examples are needed and how the negative sampling is performed has a great impact on the final results. Could you please elaborate on this aspect?

- Section 4.3.5: The authors write: "Regarding the parameters of GNNExplainer, because the graphs are highly connected, explanations were generated by using the 1-hop neighborhood around the graph." This is a very tight neighbourhood, was any ablation study conducted to support this choice?

- Minor comments: 1. The figures occupy much more space than needed, with some being vertically aligned when there was enough space for a horizontal alignment, and with a suboptimal allocation of the parts in the figure (e.g., Figure 5). I strongly encourage the authors to move figures and tables around to minimise the waste of space. 2. Sometimes contractions are used (e.g., can't).
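
The Section 4.1 question and the negative-sampling question above can be illustrated with a small sketch. It assumes toy two-dimensional embeddings and a uniform random choice of non-edges as negatives; these values, the helper names, and the sampling scheme are illustrative assumptions, not the paper's actual setup. The first part shows that the dot product rewards embedding norm while cosine similarity does not; the second shows one simple way to draw negative drug-symptom pairs.

```python
import math
import random


def dot(u, v):
    """Dot-product score: sensitive to both direction and norm."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine score: sensitive to direction only."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Hypothetical embeddings: both symptoms point in the same direction as the
# drug, but symptom_long has a larger norm.
drug = [1.0, 0.0]
symptom_unit = [1.0, 0.0]
symptom_long = [3.0, 0.0]

dot_scores = (dot(drug, symptom_unit), dot(drug, symptom_long))
cos_scores = (cosine(drug, symptom_unit), cosine(drug, symptom_long))


def sample_negatives(drugs, symptoms, positives, k, seed=0):
    """Uniformly sample k drug-symptom pairs that are not known positives."""
    rng = random.Random(seed)
    known = set(positives)
    candidates = [(d, s) for d in drugs for s in symptoms if (d, s) not in known]
    return rng.sample(candidates, k)


# Hypothetical node names and positive edges.
drugs = ["drug_a", "drug_b"]
symptoms = ["symptom_x", "symptom_y", "symptom_z"]
positives = [("drug_a", "symptom_x"), ("drug_b", "symptom_y")]

negatives = sample_negatives(drugs, symptoms, positives, k=2)
```

Here the dot product ranks `symptom_long` above `symptom_unit` purely because of its norm, whereas cosine similarity scores them equally. Uniform sampling is only one option; degree-aware or type-constrained negatives would produce a different and often harder evaluation, which is why the choice should be reported.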
