By Joel Carbonera
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Average
Detailed Comments:
This paper explores the integration of explicit syntactic graphs with BERT models in Neural Machine Translation (NMT) systems, aiming to enhance translation quality. The idea is based on using graph attention networks (GAT) to introduce explicit syntactic knowledge, complementing the implicit learning of pre-trained language models, such as BERT.
The work presented by the authors is based on syntactic dependency relations. However, the authors do not provide a good explanation of what these relations are, what types exist, or what they represent. This knowledge is fundamental to understanding the whole paper, and its absence negatively impacts the understanding of the analyses and results. I recommend including a discussion that gives readers the minimum background on syntactic dependency relations needed to follow the paper.
The paper presents a promising idea and offers an interesting scientific contribution. Up to Section 3, the work is generally well conducted. After Section 3, however, some of the explanations are unclear, and the discourse lacks the connective tissue that would let readers follow the argument. In several places the underlying reasoning is left implicit, which makes it difficult to understand the results and to evaluate the quality of the authors' contribution. I suggest a careful revision of the paper's discursive structure so that all of the reasoning behind the discourse is made explicit. The goal should be that readers can follow the paper and reproduce the authors' results in full; I suggest using this principle as a guide for restructuring the paper.
Regarding the acronym “SGB”, it should be spelled out clearly at its first mention. This avoids confusion and improves accessibility for all audiences.
For readers who are not experts on the task, it is not clear what the output of the Universal Dependencies-based parser is. The parsing apparently provides information about the syntactic dependencies between words, but the text does not make clear how this information is structured. Is it a directed acyclic graph whose edges represent different types of syntactic dependency? How are the different dependency types represented in that case? What does the resulting node adjacency matrix look like? This should be explained in more detail. It would be more informative if the authors walked through the parsing process for one or two specific example sentences and showed the information it generates.
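To illustrate the kind of example I have in mind, here is a minimal sketch of a dependency parse and the adjacency matrix one could derive from it, using spaCy as a stand-in for the authors' Universal Dependencies-based parser (the sentence, the label set, and the matrix encoding are purely illustrative and not taken from the paper; spaCy's English labels are close to but not identical to UD):

```python
# Sketch only: not the authors' pipeline. Requires the en_core_web_sm model.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")

n = len(doc)
adj = np.zeros((n, n), dtype=int)        # node adjacency matrix over tokens
labels = {}                              # (head, dependent) -> relation type

for token in doc:
    if token.dep_ != "ROOT":
        adj[token.head.i, token.i] = 1   # directed edge: head -> dependent
        labels[(token.head.i, token.i)] = token.dep_

for (h, d), rel in labels.items():
    print(f"{doc[h].text} --{rel}--> {doc[d].text}")
print(adj)
```

An example along these lines, with the actual parser and relation labels used in the paper, would answer most of the questions above.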
The text in section "3.2. Metrics for Machine Translation Evaluation" is confusing. I suggest explaining more clearly the roles of BLEU, COMET, and TransQuest in the evaluation process. It appears that COMET and TransQuest are used to generate quality estimation (QE) metrics. If so, it is not clear why the authors used COMET in some evaluations and TransQuest in others. TransQuest seems well suited to evaluating translations when no reference translations are available, but the text does not say whether this is why it was used in some evaluations instead of COMET alone. It would be helpful to state explicitly why more than one QE strategy was used.
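To make the distinction I am asking for concrete: BLEU (and COMET) score a hypothesis against reference translations, while a TransQuest-style QE model scores the (source, hypothesis) pair with no reference at all. A minimal sketch, in which the QE function is a placeholder rather than the actual TransQuest call:

```python
# Hedged sketch of reference-based vs. reference-free evaluation; the QE
# function below is a dummy placeholder, not the library call used in the paper.
import sacrebleu

def reference_free_qe(source: str, hypothesis: str) -> float:
    """Placeholder for a TransQuest-style QE model: it sees only the source
    and the hypothesis, never a reference. A real model would be loaded from
    a checkpoint; here we just return a dummy score."""
    return 0.0

src = "Der Hund schläft."
hyp = ["The dog is sleeping."]
ref = ["The dog sleeps."]

# Reference-based: BLEU compares the hypothesis against human references.
bleu = sacrebleu.corpus_bleu(hyp, [ref]).score

# Reference-free: QE scores the (source, hypothesis) pair directly, which is
# why it can be applied where no reference translations exist.
qe = reference_free_qe(src, hyp[0])

print(f"BLEU = {bleu:.1f}, QE = {qe:.2f}")
```

Stating in the paper which metrics are reference-based and which are reference-free, and why each is used where it is, would resolve my confusion.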
Table 1. In my view, the caption of this table does not help the reader understand the information it presents. In general, a table caption should be objective and focused on describing the content and structure of the table, so that the reader knows what it presents without being given specific interpretations or results. The current caption contains interpretations of the results, which should be left to the discussion in the main text, not placed in the caption. I suggest the authors change the caption to a description of the data presented in the table and discuss the conclusions drawn from those data in the main text. I also suggest that the authors follow this principle throughout the paper, for all tables and figures.
It is not clear what the column "size" in Table 1 means. Is it the size of the dataset? This is the kind of information I expect to be made explicit in a table caption, for example.
"1 million (M) sentence pairs are selected as the training set for each language, with 6 thousand (K) and 5K sentence pairs for the validation and test sets, respectively."
If the column "size" in Table 1 refers to the dataset size, how does this sentence relate to the information presented in Table 1?
The authors split the data into training, validation, and test sets, but it is not clear how the sentences for each set were selected. Was random selection used?
Besides that, the role of the validation set in the methodology is not clear. Did the authors use an early stopping method driven by the validation set, or was this set used only to monitor the model's performance during training on data not used to adjust the parameters (the training data)?
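For concreteness, this is the kind of detail I would expect to be documented. The sketch below shows a seeded random split and a validation-driven early-stopping rule, with a scaled-down toy corpus (1000/6/5 instead of 1M/6K/5K) and an invented loss curve; none of this is the authors' actual setup.

```python
# Hedged sketch only: seed, split sizes, and loss curve are illustrative.
import random

random.seed(42)
pairs = [(f"src-{i}", f"tgt-{i}") for i in range(1011)]   # toy parallel corpus
random.shuffle(pairs)                                     # random selection of sentence pairs
train, valid, test = pairs[:1000], pairs[1000:1006], pairs[1006:]

# Early stopping monitors the validation loss; a toy loss curve stands in for training.
val_losses = [2.1, 1.7, 1.5, 1.51, 1.52, 1.53]
best, bad, patience = float("inf"), 0, 2
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad = loss, 0
    else:
        bad += 1
        if bad >= patience:
            print(f"early stop after epoch {epoch}")      # validation set drives stopping
            break
```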
I suggest including examples for a qualitative analysis of the results: instances with low BLEU and high QE, instances where both are low, instances where both are high, and instances with a high BLEU score and low QE. Discussing such examples (see the sketch below) builds knowledge about the model's behavior, helping to reveal phenomena, explanatory hypotheses, and justifications, and promoting the emergence of new ideas from the knowledge gathered.
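As a concrete, purely illustrative sketch of the quadrant-based selection I am suggesting (the scores and thresholds are invented):

```python
# Bucket test instances into the four BLEU/QE quadrants so representative
# examples of each can be inspected. Toy data and thresholds only.
examples = [
    {"id": 1, "bleu": 12.0, "qe": 0.85},
    {"id": 2, "bleu": 10.0, "qe": 0.20},
    {"id": 3, "bleu": 55.0, "qe": 0.90},
    {"id": 4, "bleu": 60.0, "qe": 0.25},
]
BLEU_T, QE_T = 30.0, 0.5   # illustrative thresholds

def quadrant(ex):
    hi_bleu = ex["bleu"] >= BLEU_T
    hi_qe = ex["qe"] >= QE_T
    return {(False, True): "low BLEU / high QE",
            (False, False): "low BLEU / low QE",
            (True, True): "high BLEU / high QE",
            (True, False): "high BLEU / low QE"}[(hi_bleu, hi_qe)]

for ex in examples:
    print(ex["id"], quadrant(ex))
```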
"...outperforms SGBD in handling certain syntactic relations, including "discourse:sp," "orphan," and "csubj.""
It is important to describe what these relations mean, even if only briefly.
Section 4.3
Section 4.3 needs improvement. It would be more informative to state the assumptions and the reasoning behind the discussion explicitly so the reader can follow along. This would also help the reader better understand the information presented in Table 4, which should be explained further.
Section 5.1
In Section 5.1, the explanation of the proposed model could be more detailed. The authors introduce a model to investigate the types of syntactic knowledge GAT is capable of learning, but the model is hard to understand as described. It is not clear what its inputs and outputs are, and without an explicit, detailed explanation it is hard to follow the underlying reasoning. Presenting example inputs and outputs in a diagram would make this section much easier to understand. Furthermore, the authors should better characterize the learning problem this model solves. Is the goal to classify the edges between words into classes of syntactic dependency relations? If so, this should be stated explicitly. Since it is a classification problem, it is also unclear whether it is binary, multiclass, or multilabel, and whether the dataset considered is balanced. Finally, it is not clear how this discussion relates to the models proposed by the authors; making all of this reasoning explicit would help the reader understand why the discussion is being presented.
"The F1-score is used as the evaluation metric."
Since I was unable to determine the exact characteristics of the learning problem the introduced model must solve, I cannot assess whether this metric is suitable. If it is a binary classification problem, stating that F1 is used is fine. If the problem is multiclass, however, the authors need to state which aggregation they use (macro, micro, weighted, etc.), and the appropriate aggregation depends on whether the classes are balanced (see the sketch below). Furthermore, if the model classifies all edges in the graph, it is not clear how a single F1 value is produced for the entire dataset. It is essential to explain this in detail. A clearer explanation of Section 5.1 would also benefit the understanding of the results in Table 5.
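To illustrate why the aggregation choice matters if this is a multiclass edge-labelling problem, here is a small sketch with toy labels (not the authors' data); macro, micro, and weighted averaging generally give different numbers when the relation classes are imbalanced:

```python
# Toy multiclass example: the same predictions yield different F1 values
# depending on the averaging scheme, so the scheme must be reported.
from sklearn.metrics import f1_score

y_true = ["nsubj", "obj", "nsubj", "root", "obj", "nsubj"]
y_pred = ["nsubj", "nsubj", "nsubj", "root", "obj", "obj"]

for avg in ("macro", "micro", "weighted"):
    print(avg, f1_score(y_true, y_pred, average=avg))
```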
The caption for Figure 2 needs to be significantly improved. It does not describe the data presented. Actually, I strongly suggest that the authors review the captions for all figures and tables in the paper. Many captions present conclusions rather than describe the data presented in the figures and tables. A good example of this is the caption for Figure 6. Captions for figures and tables should be concerned exclusively with describing the structure of the data presented in the figures and tables to allow the reader to consume the information and draw their own conclusions based on that description. The conclusions that the authors draw from the data presented should be placed in the main body of the text, along with an explicit and carefully articulated analysis that represents how the authors drew their conclusions from the data. I suggest following this principle in all figures and tables of the paper.
Section 5.2 is also difficult to follow. What are the inputs and outputs of the model trained in this section? How is the learning problem characterized? The information is presented without the necessary details and without an explicit explanation of the reasoning behind it. I suggest presenting these ideas in a way that makes it clear why each idea is introduced, how the analyses were performed, what arguments are being made, and how the conclusions follow.
Section 6.1
The analysis in section 6.1 is difficult to follow. It is not clear what exactly is used as input to the models. For example, consider this statement:
"The source sentences corresponding to the 300 low-quality translations are divided according to the type of dependency relations as the stimulus. Given the current dependency relation is x, the source sentences of low-quality translations containing x are all composed
into one group stimulus."
It is not clear whether the stimulus is the sentences themselves, the types of dependency relations, or the groups of sentences. Examples of the input would be very enlightening; one possible reading is sketched below. I suggest that the authors provide examples and make all the reasoning behind this process explicit. The goal is to allow readers to understand and reproduce the experiments in detail, and the explanation provided should make this possible. Also, consider this statement:
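One possible reading of the quoted passage, shown only to make my question concrete (the sentences and relations are invented), is that each dependency relation defines one stimulus group containing every low-quality source sentence whose parse includes that relation:

```python
# Sketch of one possible interpretation of the "group stimulus" construction.
from collections import defaultdict

# (sentence, set of relations its parse contains) — stand-ins for the 300
# low-quality source sentences and their UD parses.
parsed = [
    ("The cat chased the mouse", {"nsubj", "obj", "det"}),
    ("She gave him a book", {"nsubj", "obj", "iobj", "det"}),
    ("Run!", {"root"}),
]

stimulus_groups = defaultdict(list)
for sentence, relations in parsed:
    for rel in relations:
        stimulus_groups[rel].append(sentence)   # one group per relation type

print(stimulus_groups["nsubj"])   # the "nsubj" stimulus: all sentences containing nsubj
```

If this is not what the authors mean, an explicit example in the paper would clear it up immediately.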
"Table 8 presents selected results from an RSA analysis, comparing Baseline BERT with SGB engines based on syntactic prediction scores by GAT (full results are in Appendix A)."
I could not understand what the idea was at this point. I suggest more detailed and explicit explanations. Also, consider the following statement:
"Specifically, layers 3-5 for Chinese and Russian, and layers 5-8 for German, exhibit the lowest RSA scores."
The authors discuss layers here, but up to this point it was not clear that the analyses would be performed per layer. It is important to establish exactly how the analysis was conducted before presenting and discussing the results; without a clear understanding of its purpose and methodology, the results cannot be understood or evaluated.
I think that the analysis in section 6.1 is as follows:
-Initially, the authors select 300 low-quality translations to perform representation similarity analysis. The source sentences of these translations are divided based on the specific syntactic dependency relations they contain. This means that for each type of dependency relation (e.g., "root", "nsubj", etc.), the sentences containing that specific relation are grouped together. These groups of sentences form what the paper calls the "stimulus" for the analysis.
-Then, the internal BERT representations of these sentences are extracted. Representations are collected from each BERT layer for both model configurations: the baseline model (which uses only BERT) and the SGB models (BCMS and DBMS), which integrate GAT. The idea is to compare the representations between the baseline model and the models that use syntactic graphs.
-For each layer of the model, the sentence representations are organized into a similarity matrix. To do this, the similarity between all combinations of sentence pairs within the "stimulus" group is calculated. The similarity is measured using cosine similarity, which allows us to quantify how close the representations of sentences in a specific layer are.
-Once the similarity matrices are constructed for each layer of each model (baseline and SGB), the representations of the layers of one model can be compared with those of the other. The comparison is made using the Pearson correlation between the upper triangles of the similarity matrices (excluding the main diagonal, which represents the similarity of a sentence with itself). This correlation value is then interpreted as the degree of similarity between the representations of the same layer in the two models being compared (baseline and SGB).
But I am not sure whether this is the actual process; I arrived at it through some deductions and "interpolations". A more detailed and explicit explanation that fills these gaps would greatly benefit the paper.
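To make my reconstruction concrete (and easier for the authors to confirm or correct), here is a runnable sketch of the RSA procedure as I understood it; the sentence representations are random stand-ins for the real per-layer BERT outputs of one stimulus group:

```python
# Sketch of my reading of the RSA step, not the authors' code: per-layer
# cosine-similarity matrices, then Pearson correlation of their upper triangles.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_sentences, hidden = 20, 768

def rsa_score(layer_repr_a, layer_repr_b):
    """Pearson correlation between the upper triangles (diagonal excluded)
    of the two cosine-similarity matrices."""
    sim_a = cosine_similarity(layer_repr_a)
    sim_b = cosine_similarity(layer_repr_b)
    iu = np.triu_indices(n_sentences, k=1)
    return pearsonr(sim_a[iu], sim_b[iu])[0]

# Stand-ins for one layer's sentence representations under the two models.
baseline_layer = rng.normal(size=(n_sentences, hidden))
sgb_layer = baseline_layer + 0.1 * rng.normal(size=(n_sentences, hidden))

print(f"RSA score for this layer: {rsa_score(baseline_layer, sgb_layer):.3f}")
```

If the authors included a description at roughly this level of explicitness (even without code), the analysis would be reproducible.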
"This finding indicates that BERT is more instrumental in forming representations of source sentences and affecting translation quality in this hybrid approach."
It is not clear what this sentence means. I suggest more clarity in the interpretation of this result; the authors should structure their ideas better, filling in the gaps and spelling out their reasoning.
Overall, the paper brings relevant contributions to the field of neural machine translation and creatively explores the use of syntactic graphs combined with BERT. The suggestions above only seek to add clarity and depth to the explanations, strengthening the connection with the audience and allowing the study to reach its full potential.