Semantic-based Data Augmentation for Machine Learning Prediction Enhancement

Tracking #: 801-1792

Flag : Review Received

Authors:

Majlinda Llugiqi

Responsible editor:

Guest Editors NeSy 2024

Submission Type:

Article in Special Issue (note in cover letter)

Full PDF Version:

nai-paper-801.pdf

Cover Letter:

Dear Editors, We are happy to submit our manuscript to the NeSy 2024 special issue of the Neuro-Symbolic AI Journal. In this paper, we extend our work presented at the NeSy 2024 conference by formalizing and expanding our methodology for integrating knowledge graph (KG) embeddings into machine learning (ML) pipelines. We have developed and tested eight different approaches for augmenting the training set with semantic knowledge, incorporating two additional embedding techniques beyond those previously utilized, for two domains. We believe that our enhanced methodology and comprehensive evaluation provide significant contributions to the field of neuro-symbolic AI, particularly in the context of data augmentation with semantic knowledge within ML models. Thank you for considering our manuscript. We appreciate your time and look forward to your feedback. Best regards, The author team

Approve Decision:

Approved

Revised Version:

Semantic-based Data Augmentation for Machine Learning Prediction Enhancement

Tags:

Reviewed

Decision:
Minor Revision

Solicited Reviews:

Review #1 submitted on 09/Jan/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Good

Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Good

Detailed Comments:

The paper explores the enhancement of machine learning (ML) predictions in data-scarce environments through semantic-based data augmentation leveraging knowledge graphs (KGs). It enriches tabular datasets with various KG-derived embeddings and evaluates their impact on the predictive performance of ML models (e.g., KNN, SVM, XGBoost, Neural Networks) across different embedding techniques and augmentation strategies. The methodology is applied to binary classification tasks for heart disease and chronic kidney disease using public datasets. The findings demonstrate notable improvements, particularly when distance-based KG features are incorporated, with XGBoost and Neural Networks showing the most significant gains.

The paper fits well within the scope of the Neuro AI journal. It is well-written and clear. Therefore, I recommend accepting the paper with minor revisions. Below are some suggested improvements.

Presentation:
- Include a summary of the main changes in this extended version compared to the NeSy 2024 paper.
- Add a summary table in the related works section to visualize the relationship between the current work and NeSy-related studies.
- Position tables summarizing the results in the relevant sections of the main text.
- Move all algorithms to an appendix for better readability.
- Utilize Figure 14 in the main text to reference all approaches instead of having separate figures.

Experiments and Results:

- Section 6.2: While the paper reports an averaged performance across three embedding dimensions to ensure robustness, it is common practice to also average the results of multiple runs for each experiment and report the standard deviation.

- Table 2: Provide details on how the hyperparameters for each embedding method were selected.

- Section 7: Clarify how the impact of KGs was computed. Since the ontologies are used to build the KGs and subsequently implement the described approaches, specify whether the reported value represents an average across all these approaches.
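
The multi-run reporting suggested for Section 6.2 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `run_experiment` is a hypothetical stand-in that returns a synthetic score, and the seeds are arbitrary.

```python
import random
import statistics


def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one train/evaluate run.

    A real run would train a model with this seed and return its F2 score;
    here a synthetic score near 0.80 is returned for illustration only.
    """
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)


def summarize(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


seeds = [0, 1, 2, 3, 4]  # assumed number of repetitions
scores = [run_experiment(s) for s in seeds]
mean, std = summarize(scores)
print(f"F2 = {mean:.3f} +/- {std:.3f} over {len(seeds)} runs")
```

Reporting the spread alongside the mean lets a reader judge whether a difference between two augmentation strategies exceeds run-to-run noise.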

Minor comments:

p10, line 44: approachess -> approaches
p10, line 51: c_j for each target class is computed. How is the centroid computed?
p11, line 19: and no classes -> and noDisease classes.
p11 in Alg3, $\vec{v_i}$ is not defined
p12, line 43: In this approach, referred to as EmbedClusterAugTab -> In this approach, referred to as ClusterAugTab
p12, line 45: Algorithm 6 -> Algorithm 5
p17, line 33: we only used on the third approach -> we only used the third approach
p17, line 39: Detailed descriptions -> I would rather say 'An overview of these models is provided in Section 2.'
p19, line 34: in the other hand -> on the other hand
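
On the centroid question (p10, line 51), one common reading is that each class centroid c_j is the element-wise mean of the embeddings belonging to that class, with Euclidean distances to the centroids added as features. The sketch below assumes exactly that; the paper may define it differently, and the function names and toy labels are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(embeddings, labels):
    """Assumed definition: centroid = element-wise mean of a class's embeddings."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [a + b for a, b in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [x / counts[lab] for x in total] for lab, total in sums.items()}


def distance_features(vec, centroids):
    """Euclidean distance from one embedding to each class centroid."""
    return {lab: math.dist(vec, c) for lab, c in centroids.items()}


# Toy 2-dimensional embeddings with hypothetical class labels.
embeddings = [[0.0, 0.0], [2.0, 0.0], [0.0, 4.0], [0.0, 6.0]]
labels = ["disease", "disease", "noDisease", "noDisease"]

centroids = class_centroids(embeddings, labels)
features = distance_features([1.0, 0.0], centroids)
print(centroids)
print(features)
```

Stating the definition this explicitly in the paper would resolve the ambiguity about whether a mean, a medoid, or another aggregate is used.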

Review #2 submitted on 06/Jan/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Good

Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good

Detailed Comments:

[Paper Summary]
The paper explores the integration of Knowledge Graphs into Machine Learning pipelines to address challenges in data-scarce domains. The authors hypothesize that enriching training datasets with semantic information derived from KGs can improve ML models' predictive capabilities.
The paper explores three primary research objectives: identifying optimal methods for integrating KG-derived features into ML pipelines; analyzing the impact of various KG embedding techniques on model performance; and comparing the effectiveness of ML algorithms when augmented with KG information. To address these objectives, the authors tested five sub-hypotheses across eight approaches and conducted experiments on binary classification tasks focused on predicting heart and chronic kidney diseases. The results indicate substantial improvements when models are augmented with distance features derived from KG embeddings.

[Review]
I reviewed this paper for NeSy 2024, and I am pleased to note that it has significantly improved since then. The authors have addressed almost all my previous comments, resulting in a much stronger and more refined submission. The methodology and results are now presented more clearly, and the contributions are well-articulated. Additionally, the paper has been expanded considerably with new content.
The paper in its current form is well-prepared and demonstrates substantial progress. However, I still have a few minor comments that could further enhance the quality and clarity of the work:
- The paper briefly mentions the types of features used in the datasets but does not provide detailed specifications. It would be beneficial to clarify whether the features are categorical, continuous, or integers. Including a table in the appendix that lists each feature along with its type and any relevant characteristics would enhance clarity and reproducibility.
- It is unclear whether literals are considered part of the entity set E. If literals are not included in E, the graph should be formally represented as KG = (E, L, R', Tr), where L denotes the set of literals. Additionally, the paper should explain how literals are mapped into the embedding space, as their representation may influence the model's effectiveness.
- The notation for the embedding function requires more consistency. Initially, it is defined as ϕ : E ∪ R → R^d, mapping entities and relations into a d-dimensional space. Later, the notation shifts to ϕ : P → R^d, where P ⊆ E.
- The meaning of the yellow sections in the "augmented tabular data" block is not explained in Figures 3–10. Please provide a legend or annotation within the figures.
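
The two notational points above could be reconciled along the following lines. This is only a sketch: the typing of the triple set and the treatment of the later map as a restriction of ϕ are assumptions, not definitions taken from the paper.

```latex
% Assumed typing: literals may appear only in the object position of a triple.
KG = (E, L, R', Tr), \qquad Tr \subseteq E \times R' \times (E \cup L)

% One embedding function; the later usage is its restriction to a subset P of entities.
\phi : E \cup R \to \mathbb{R}^d, \qquad
\phi|_{P} : P \to \mathbb{R}^d \quad \text{for } P \subseteq E
```

Writing the later map as a restriction keeps a single symbol ϕ throughout while making the smaller domain explicit.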

[Minor things]
- While the paper primarily focuses on accuracy and F2 scores, incorporating recall as an evaluation metric could offer valuable insights.
- In the abstract and introduction, you mention a focus on accuracy and F2 score without explaining why. The explanation is provided later in section 4.2. Include a brief sentence in the introduction explaining this focus or reference where the explanation can be found.
- Rescale the y-axis in your figures for clarity. For example, Fig 12 should have a y-axis range of 0.6 to 0.8. This will better illustrate performance differences.
- Typos:
  - Page 3, line 42: R capital letter (R subset of CxC)
  - Page 19, line 14 (H1.2 Analysis): missing capital letter
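
Regarding the recall suggestion above, the relation between recall and the F2 score can be illustrated with the standard F-beta definition. The confusion-matrix counts below are made up for illustration and are not results from the paper.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Standard F-beta score; beta = 2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Illustrative counts only (not from the paper).
tp, fp, fn = 8, 4, 2
print(f"recall    = {recall(tp, fn):.3f}")
print(f"precision = {precision(tp, fp):.3f}")
print(f"F2        = {f_beta(tp, fp, fn):.3f}")
```

Because F2 already emphasizes recall, reporting recall separately shows how much of an F2 gain comes from fewer missed positive cases rather than from precision.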

Review #3 submitted on 09/Jan/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Excellent

Content:
Technical Quality of the paper: Excellent
Originality of the paper: Yes
Adequacy of the bibliography: Yes

Detailed Comments:

This paper presents an impressive and innovative contribution to the field of machine learning, addressing the critical challenge of improving model performance in data-scarce or sensitive scenarios. The authors' hypothesis that semantic enrichment through knowledge graph (KG) integration can enhance predictive power is both compelling and highly relevant. The introduction of novel neuro-symbolic approaches and the systematic exploration of KG embedding techniques highlight the authors' dedication to advancing the state of the art. Their rigorous evaluation across multiple ML algorithms and KG embedding methods showcases the robustness of their approach. The focus on real-world applications, such as heart disease and chronic kidney disease prediction, further emphasizes the practical significance of their work. The results are particularly noteworthy, demonstrating remarkable improvements in F2 scores, such as a dramatic boost in XGBoost performance for heart disease prediction. These findings convincingly illustrate the potential of KG-based augmentation to transform ML performance, especially in binary classification tasks. The clear, data-driven methodology and the emphasis on accuracy and F2 scores provide valuable insights for both researchers and practitioners. This reviewer emphasises that this paper is a significant step forward in the integration of symbolic reasoning with ML techniques, paving the way for more context-aware, robust, and effective predictive models. It is a must-read for anyone interested in enhancing ML performance through innovative data augmentation strategies. In summary, the work presented is interesting, relevant, important, well presented, well written, and fits well into this journal.
For all these reasons, this reviewer argues for acceptance of this work and offers just one minor suggestion to further enhance its usefulness to the reader. Page 5, last paragraph: in addition to the excellent work of Bhatt et al. (2020), a very recent and closely related paper should be mentioned here: Kraisnikovic, C. 2025. Fine-tuning language model embeddings to reveal domain knowledge: An explainable artificial intelligence perspective on medical decision making. Engineering Applications of Artificial Intelligence, 139, 109561, doi:10.1016/j.engappai.2024.109561.
