By Joel Carbonera
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Average
Detailed Comments:
This paper explores the integration of explicit syntactic graphs with BERT models in Neural Machine Translation (NMT) systems, aiming to enhance translation quality. The idea is based on using graph attention networks (GAT) to introduce explicit syntactic knowledge, complementing the implicit learning of pre-trained language models, such as BERT.
The work presented by the authors is based on syntactic dependency relations. However, the authors do not provide a good explanation of what these relations are, what types exist, or what they represent. This knowledge is fundamental to understanding the whole paper, and its absence negatively impacts the understanding of the analyses and results. I recommend including a discussion that gives readers the minimum background on syntactic dependency relations needed to follow the paper.
The paper presents a promising idea and offers an interesting scientific contribution. Up to Section 3, the work is generally well conducted. After Section 3, however, some of the explanations are unclear, and the discourse lacks the connective tissue that would let readers follow the argument. In several places the underlying reasoning is left implicit, which makes it difficult to understand the results and to evaluate the quality of the authors' contribution. I suggest a careful revision of the paper's discursive structure so that all of the reasoning behind the discourse is made explicit. The goal should be that readers can follow the paper and reproduce the authors' results in full; I suggest using this principle as a guide for restructuring the paper.
Regarding the acronym “SGB”, it should be spelled out clearly at its first mention. This avoids confusion and improves accessibility for all audiences.
For readers who are not experts on the task, it is not clear what the output of the Universal Dependencies-based parser is. The parsing apparently provides information about the syntactic dependencies between words, but the text does not make clear how this information is structured. Is it a directed acyclic graph whose edges represent different types of syntactic dependency? How are the different dependency types represented in that case? What does the resulting node adjacency matrix look like? This should be explained in more detail. It would be more informative if the authors walked through the parsing process for one or two specific example sentences and showed the information it generates.
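To illustrate the kind of example I have in mind, here is a minimal sketch of a dependency parse and the adjacency matrix one could derive from it, using spaCy as a stand-in for the authors' Universal Dependencies-based parser (the sentence, the label set, and the matrix encoding are purely illustrative and not taken from the paper; spaCy's English labels are close to but not identical to UD):

```python
# Sketch only: not the authors' pipeline. Requires the en_core_web_sm model.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")

n = len(doc)
adj = np.zeros((n, n), dtype=int)        # node adjacency matrix over tokens
labels = {}                              # (head, dependent) -> relation type

for token in doc:
    if token.dep_ != "ROOT":
        adj[token.head.i, token.i] = 1   # directed edge: head -> dependent
        labels[(token.head.i, token.i)] = token.dep_

for (h, d), rel in labels.items():
    print(f"{doc[h].text} --{rel}--> {doc[d].text}")
print(adj)
```

An example along these lines, with the actual parser and relation labels used in the paper, would answer most of the questions above.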
The text in section "3.2. Metrics for Machine Translation Evaluation" is confusing. I suggest explaining more clearly the roles of BLEU, COMET, and TransQuest in the evaluation process. It appears that COMET and TransQuest are used to generate quality estimation (QE) metrics. If so, it is not clear why the authors used COMET in some evaluations and TransQuest in others. TransQuest seems well suited to evaluating translations when no reference translations are available, but the text does not say whether this is why it was used in some evaluations instead of COMET alone. It would be helpful to state explicitly why more than one QE strategy was used.
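To make the distinction I am asking for concrete: BLEU (and COMET) score a hypothesis against reference translations, while a TransQuest-style QE model scores the (source, hypothesis) pair with no reference at all. A minimal sketch, in which the QE function is a placeholder rather than the actual TransQuest call:

```python
# Hedged sketch of reference-based vs. reference-free evaluation; the QE
# function below is a dummy placeholder, not the library call used in the paper.
import sacrebleu

def reference_free_qe(source: str, hypothesis: str) -> float:
    """Placeholder for a TransQuest-style QE model: it sees only the source
    and the hypothesis, never a reference. A real model would be loaded from
    a checkpoint; here we just return a dummy score."""
    return 0.0

src = "Der Hund schläft."
hyp = ["The dog is sleeping."]
ref = ["The dog sleeps."]

# Reference-based: BLEU compares the hypothesis against human references.
bleu = sacrebleu.corpus_bleu(hyp, [ref]).score

# Reference-free: QE scores the (source, hypothesis) pair directly, which is
# why it can be applied where no reference translations exist.
qe = reference_free_qe(src, hyp[0])

print(f"BLEU = {bleu:.1f}, QE = {qe:.2f}")
```

Stating in the paper which metrics are reference-based and which are reference-free, and why each is used where it is, would resolve my confusion.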
Table 1. In my view, the caption of this table does not help the reader understand the information it presents. In general, a table caption should be objective and focused on describing the content and structure of the table, so that the reader knows what it presents without being given specific interpretations or results. The current caption contains interpretations of the results, which should be left to the discussion in the main text, not placed in the caption. I suggest the authors change the caption to a description of the data presented in the table and discuss the conclusions drawn from those data in the main text. I also suggest that the authors follow this principle throughout the paper, for all tables and figures.
It is not clear what the column "size" in Table 1 means. Is it the size of the dataset? This is the kind of information I expect to be made explicit in a table caption, for example.
"1 million (M) sentence pairs are selected as the training set for each language, with 6 thousand (K) and 5K sentence pairs for the validation and test sets, respectively."
If the column "size" in Table 1 refers to the dataset size, how does this sentence relate to the information presented in Table 1?
The authors split the data into training, validation, and test sets, but it is not clear how the sentences for each set were selected. Was random selection used?
Besides that, the role of the validation set in the methodology is not clear. Did the authors use an early stopping method driven by the validation set, or was this set used only to monitor the model's performance during training on data not used to adjust the parameters (the training data)?
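For concreteness, this is the kind of detail I would expect to be documented. The sketch below shows a seeded random split and a validation-driven early-stopping rule, with a scaled-down toy corpus (1000/6/5 instead of 1M/6K/5K) and an invented loss curve; none of this is the authors' actual setup.

```python
# Hedged sketch only: seed, split sizes, and loss curve are illustrative.
import random

random.seed(42)
pairs = [(f"src-{i}", f"tgt-{i}") for i in range(1011)]   # toy parallel corpus
random.shuffle(pairs)                                     # random selection of sentence pairs
train, valid, test = pairs[:1000], pairs[1000:1006], pairs[1006:]

# Early stopping monitors the validation loss; a toy loss curve stands in for training.
val_losses = [2.1, 1.7, 1.5, 1.51, 1.52, 1.53]
best, bad, patience = float("inf"), 0, 2
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad = loss, 0
    else:
        bad += 1
        if bad >= patience:
            print(f"early stop after epoch {epoch}")      # validation set drives stopping
            break
```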
I suggest including examples for a qualitative analysis of the results: instances with low BLEU and high QE, instances where both are low, instances where both are high, and instances with a high BLEU score and low QE. Discussing such examples (see the sketch below) builds knowledge about the model's behavior, helping to reveal phenomena, explanatory hypotheses, and justifications, and promoting the emergence of new ideas from the knowledge gathered.
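As a concrete, purely illustrative sketch of the quadrant-based selection I am suggesting (the scores and thresholds are invented):

```python
# Bucket test instances into the four BLEU/QE quadrants so representative
# examples of each can be inspected. Toy data and thresholds only.
examples = [
    {"id": 1, "bleu": 12.0, "qe": 0.85},
    {"id": 2, "bleu": 10.0, "qe": 0.20},
    {"id": 3, "bleu": 55.0, "qe": 0.90},
    {"id": 4, "bleu": 60.0, "qe": 0.25},
]
BLEU_T, QE_T = 30.0, 0.5   # illustrative thresholds

def quadrant(ex):
    hi_bleu = ex["bleu"] >= BLEU_T
    hi_qe = ex["qe"] >= QE_T
    return {(False, True): "low BLEU / high QE",
            (False, False): "low BLEU / low QE",
            (True, True): "high BLEU / high QE",
            (True, False): "high BLEU / low QE"}[(hi_bleu, hi_qe)]

for ex in examples:
    print(ex["id"], quadrant(ex))
```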
"...outperforms SGBD in handling certain syntactic relations, including "discourse:sp," "orphan," and "csubj.""
It is important to describe what these relations mean, even if only briefly.
Section 4.3
Section 4.3 needs improvement. It would be more informative to state the assumptions and the reasoning behind the discussion explicitly so the reader can follow along. This would also help the reader better understand the information presented in Table 4, which should be explained further.
Section 5.1
In Section 5.1, the explanation of the proposed model could be more detailed. The authors introduce a model to investigate the types of syntactic knowledge GAT is capable of learning, but the model is hard to understand as described. It is not clear what its inputs and outputs are, and without an explicit, detailed explanation it is hard to follow the underlying reasoning. Presenting example inputs and outputs in a diagram would make this section much easier to understand. Furthermore, the authors should better characterize the learning problem this model solves. Is the goal to classify the edges between words into classes of syntactic dependency relations? If so, this should be stated explicitly. Since it is a classification problem, it is also unclear whether it is binary, multiclass, or multilabel, and whether the dataset considered is balanced. Finally, it is not clear how this discussion relates to the models proposed by the authors; making all of this reasoning explicit would help the reader understand why the discussion is being presented.
"The F1-score is used as the evaluation metric."
Since I was unable to determine the exact characteristics of the learning problem the introduced model must solve, I cannot assess whether this metric is suitable. If it is a binary classification problem, stating that F1 is used is fine. If the problem is multiclass, however, the authors need to state which aggregation they use (macro, micro, weighted, etc.), and the appropriate aggregation depends on whether the classes are balanced (see the sketch below). Furthermore, if the model classifies all edges in the graph, it is not clear how a single F1 value is produced for the entire dataset. It is essential to explain this in detail. A clearer explanation of Section 5.1 would also benefit the understanding of the results in Table 5.
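To illustrate why the aggregation choice matters if this is a multiclass edge-labelling problem, here is a small sketch with toy labels (not the authors' data); macro, micro, and weighted averaging generally give different numbers when the relation classes are imbalanced:

```python
# Toy multiclass example: the same predictions yield different F1 values
# depending on the averaging scheme, so the scheme must be reported.
from sklearn.metrics import f1_score

y_true = ["nsubj", "obj", "nsubj", "root", "obj", "nsubj"]
y_pred = ["nsubj", "nsubj", "nsubj", "root", "obj", "obj"]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```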
The caption for Figure 2 needs to be significantly improved. It does not describe the data presented. Actually, I strongly suggest that the authors review the captions for all figures and tables in the paper. Many captions present conclusions rather than describe the data presented in the figures and tables. A good example of this is the caption for Figure 6. Captions for figures and tables should be concerned exclusively with describing the structure of the data presented in the figures and tables to allow the reader to consume the information and draw their own conclusions based on that description. The conclusions that the authors draw from the data presented should be placed in the main body of the text, along with an explicit and carefully articulated analysis that represents how the authors drew their conclusions from the data. I suggest following this principle in all figures and tables of the paper.
Section 5.2 is also difficult to follow. What are the inputs and outputs of the model trained in this section? How is the learning problem characterized? The information is presented without the necessary details and without an explicit explanation of the reasoning behind it. I suggest presenting these ideas in a way that makes it clear why each idea is introduced, how the analyses were performed, what arguments are being made, and how the conclusions follow.
Section 6.1
The analysis in section 6.1 is difficult to follow. It is not clear what exactly is used as input to the models. For example, consider this statement:
"The source sentences corresponding to the 300 low-quality translations are divided according to the type of dependency relations as the stimulus. Given the current dependency relation is x, the source sentences of low-quality translations containing x are all composed
into one group stimulus."
It is not clear whether the stimulus is the sentences themselves, the types of dependency relations, or the groups of sentences. Examples of the input would be very enlightening; one possible reading is sketched below. I suggest that the authors provide examples and make all the reasoning behind this process explicit. The goal is to allow readers to understand and reproduce the experiments in detail, and the explanation provided should make this possible. Also, consider this statement:
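One possible reading of the quoted passage, shown only to make my question concrete (the sentences and relations are invented), is that each dependency relation defines one stimulus group containing every low-quality source sentence whose parse includes that relation:

```python
# Sketch of one possible interpretation of the "group stimulus" construction.
from collections import defaultdict

# (sentence, set of relations its parse contains) — stand-ins for the 300
# low-quality source sentences and their UD parses.
parsed = [
    ("The cat chased the mouse", {"nsubj", "obj", "det"}),
    ("She gave him a book", {"nsubj", "obj", "iobj", "det"}),
    ("Run!", {"root"}),
]

stimulus_groups = defaultdict(list)
for sentence, relations in parsed:
    for rel in relations:
        stimulus_groups[rel].append(sentence)   # one group per relation type

print(stimulus_groups["nsubj"])   # the "nsubj" stimulus: all sentences containing nsubj
```

If this is not what the authors mean, an explicit example in the paper would clear it up immediately.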
"Table 8 presents selected results from an RSA analysis, comparing Baseline BERT with SGB engines based on syntactic prediction scores by GAT (full results are in Appendix A)."
I could not understand what the idea was at this point. I suggest more detailed and explicit explanations. Also, consider the following statement:
"Specifically, layers 3-5 for Chinese and Russian, and layers 5-8 for German, exhibit the lowest RSA scores."
The authors discuss layers here, but up to this point it was not clear that the analyses would be performed per layer. It is important to establish exactly how the analysis was conducted before presenting and discussing the results; without a clear understanding of its purpose and methodology, the results cannot be understood or evaluated.
I think that the analysis in section 6.1 is as follows:
-Initially, the authors select 300 low-quality translations to perform representation similarity analysis. The source sentences of these translations are divided based on the specific syntactic dependency relations they contain. This means that for each type of dependency relation (e.g., "root", "nsubj", etc.), the sentences containing that specific relation are grouped together. These groups of sentences form what the paper calls the "stimulus" for the analysis.
-Then, the internal BERT representations of these sentences are extracted. Representations are collected from each BERT layer for both model configurations: the baseline model (which uses only BERT) and the SGB models (BCMS and DBMS), which integrate GAT. The idea is to compare the representations between the baseline model and the models that use syntactic graphs.
-For each layer of the model, the sentence representations are organized into a similarity matrix. To do this, the similarity between all combinations of sentence pairs within the "stimulus" group is calculated. The similarity is measured using cosine similarity, which allows us to quantify how close the representations of sentences in a specific layer are.
-Once the similarity matrices are constructed for each layer of each model (baseline and SGB), the representations of the layers of one model can be compared with those of the other. The comparison is made using the Pearson correlation between the upper triangles of the similarity matrices (excluding the main diagonal, which represents the similarity of a sentence with itself). This correlation value is then interpreted as the degree of similarity between the representations of the same layer in the two models being compared (baseline and SGB).
But I am not sure whether this is the actual process; I arrived at it through some deductions and "interpolations". A more detailed and explicit explanation that fills these gaps would greatly benefit the paper.
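To make my reconstruction concrete (and easier for the authors to confirm or correct), here is a runnable sketch of the RSA procedure as I understood it; the sentence representations are random stand-ins for the real per-layer BERT outputs of one stimulus group:

```python
# Sketch of my reading of the RSA step, not the authors' code: per-layer
# cosine-similarity matrices, then Pearson correlation of their upper triangles.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_sentences, hidden = 20, 768

def rsa_score(layer_repr_a, layer_repr_b):
    """Pearson correlation between the upper triangles (diagonal excluded)
    of the two cosine-similarity matrices."""
    sim_a = cosine_similarity(layer_repr_a)
    sim_b = cosine_similarity(layer_repr_b)
    iu = np.triu_indices(n_sentences, k=1)
    return pearsonr(sim_a[iu], sim_b[iu])[0]

# Stand-ins for one layer's sentence representations under the two models.
baseline_layer = rng.normal(size=(n_sentences, hidden))
sgb_layer = baseline_layer + 0.1 * rng.normal(size=(n_sentences, hidden))

print(f"RSA score for this layer: {rsa_score(baseline_layer, sgb_layer):.3f}")
```

If the authors included a description at roughly this level of explicitness (even without code), the analysis would be reproducible.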
"This finding indicates that BERT is more instrumental in forming representations of source sentences and affecting translation quality in this hybrid approach."
It is not clear what this sentence means. I suggest more clarity in the interpretation of this result; the authors should structure their ideas better, filling in the gaps and spelling out their reasoning.
Overall, the paper brings relevant contributions to the field of neural machine translation and creatively explores the use of syntactic graphs combined with BERT. The suggestions above only seek to add clarity and depth to the explanations, strengthening the connection with the audience and allowing the study to reach its full potential.