By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Weak
Content:
Technical Quality of the paper: Weak
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Unsatisfactory
Overall presentation: Weak
Detailed Comments:
This paper introduces a new approach for link prediction and explanation in the context of drug repurposing for rare diseases. The paper is interesting, and I do believe the task is valuable and important.
However, I currently have a few issues with this paper. There are quite a few writing problems and missing details that prevent me from building a complete picture of this work. I also have some more general questions regarding the methodology.
At the present stage, I do not think the paper is ready for publication. I'd recommend rethinking the presentation. In this review I'll try to give some indications of what I think the authors should improve (I'll provide what I think are relevant examples, but I'd suggest that the authors "propagate" the feedback to other sections).
Ultimately, I do believe that this can become an interesting and valuable paper, but not before some additional work.
** On The Methodology **
The method, as I understand it, is at least partially incremental. The authors put together three pre-existing components (edge2vec, a GNN, and GNNExplainer) and only use a different dataset. I guess that the major contribution here could be the domain and the task, which I agree is very valuable and important.
I have a few questions regarding the setup, but I do apologize if I missed something.
It seems to me that edge2vec is run before training the GNN, and that it is run on the entire graph. Doesn't this leak information into the second step? That is, when the authors remove links to create the test data, won't the node embeddings still encode those links? My assumption here is that the authors are actually using edge2vec as pre-initialization for the GNN.
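To make the concern concrete, here is a minimal sketch of the leakage-free ordering I have in mind: hold out the test edges first, then compute embeddings from the training edges only. The function `neighbor_embeddings` is a hypothetical stand-in for edge2vec (it simply uses each node's row of the training adjacency matrix), and the toy graph is invented for illustration; none of this is taken from the authors' code.

```python
import random


def split_edges(edges, test_fraction=0.25, seed=0):
    """Hold out a fraction of the edges BEFORE any embedding is computed."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def neighbor_embeddings(edges, num_nodes):
    """Hypothetical stand-in for edge2vec (assumption, not the real method).

    Each node's vector is its row of the adjacency matrix built from the
    given edges only, so nothing outside `edges` can influence it.
    """
    vectors = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        vectors[u][v] = 1.0
        vectors[v][u] = 1.0
    return vectors


# Toy undirected graph, invented for illustration.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5), (1, 4)]

train_edges, test_edges = split_edges(edges)
embeddings = neighbor_embeddings(train_edges, num_nodes=6)

# Sanity check: no held-out edge is visible in the embeddings.
leaked = [(u, v) for u, v in test_edges if embeddings[u][v] != 0.0]
```

If the embeddings were instead computed from `edges` (the full graph), every held-out pair would show up in `leaked`, which is the situation I am asking the authors to rule out.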
There are many things that would require more details. For example, how is edge2vec used for link prediction in Table 5? How was the optimal value for edge2vec selected (see Table S5)?
Was optimization with Ray Tune also run for the baselines (Table 5)? In addition, the authors' model was tuned with Ray Tune, but was this tuning also run at cross-validation time?
I could not find the model architecture in the paper, so I looked at the code. It seems complicated enough (a few graph layers plus batch norms) that it might be worth adding to the paper for reproducibility purposes.
** On Writing **
In general, the writing would benefit from more work, as several sentences are unclear:
"Next, it is the link prediction for each drug-symptom node embeddings pair by using the dot product as scoring function" (page 4).
>> I think this sentence is missing a verb or something similar (there are a few typos in the paper).
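My reading of the intended meaning is that each drug-symptom pair is scored by the dot product of the two node embeddings. A minimal sketch of that reading follows; the vectors are invented, and the sigmoid that maps the score to a probability is my assumption, since the quoted sentence only mentions the dot product.

```python
import math


def dot(a, b):
    """Dot product of two equal-length embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def link_probability(drug_vec, symptom_vec):
    """Map the dot-product score to (0, 1) with a sigmoid (assumed here)."""
    return 1.0 / (1.0 + math.exp(-dot(drug_vec, symptom_vec)))


# Invented embeddings for one drug node and one symptom node.
drug = [0.5, -1.0, 2.0]
symptom = [1.0, 0.5, 0.25]

score = dot(drug, symptom)             # 0.5 - 0.5 + 0.5
prob = link_probability(drug, symptom)
```

Stating the scoring function explicitly in this way (as an equation in the paper) would remove the ambiguity in the sentence.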
"in the training dataset the supervision edges and the message passing edges are the same; in the validation dataset the message passing edges are the training edges (message and supervision) ... that are different from the training and validation supervision edges" (page 6).
>> This is very long and would benefit from being split and rephrased.
I'd recommend that the authors improve the general presentation and add more details. In particular, the related work section never introduces the methods in detail (e.g., GraphSAGE). Table 1 is a bit too "high-level" as a summary of an entire field (I understand you don't need to cite all of it, but I'd still argue that some polishing here could improve the paper). I often found myself scrolling up and down to retrieve information from previous parts of the paper.
There are several minor writing problems (e.g., typos) that, when taken together, become a bit distracting.
The paper sometimes uses British and sometimes American English. "The code is freely accessible with an open license at https://github.com/PPerdomoQ/rare-disease-explainer" appears at least 3 times. Versions of the Python packages are reported in multiple sections where they could probably be appendix material.
First I read "GraphSAGE and GNNExplainer were implemented using PyTorch Geometric version 2.0.4" (page 4) but then "The GraphSAGE model was created using the DeepSNAP library" (page 6).
>> I guess this is because DeepSNAP uses PyTorch Geometric, but again, this is something that could have been polished.
Another issue comes from reference [3]: "and only in Europe about 36 million people suffer from rare diseases." When I click on the link, the actual figure is "between 27 and 36 million people live with a rare disease," which I understand might look like a minor thing, but I think this should be reported correctly.
The bibliography contains many typos, papers with missing journal names, etc. Also, there are >150 references but the paper seems to have fewer than 70?
The quality of the images seems low; I'd try exporting them at a higher resolution.