By Heiko Paulheim
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The paper introduces LitKGE, an approach for improving the handling of numeric literals in knowledge graph embedding models that are already capable of using those literals. The authors show that improvements can be obtained in an experiment with three datasets and two variants of LiteralE.
The approach essentially densifies a graph by adding attributes of distant entities to an entity, as shown in the introductory example. As such, it can be understood as an approach for adding attributive triples to a graph, using new attribute relations that are composed from paths. It would be helpful to describe it that way consistently throughout the paper. This would make the paper easier to follow and eliminate some redundant nomenclature (e.g., tables 1 and 2 could use the exact same terminology and not two distinct ones).
The approach is interesting and shows some improvements. At the same time, there are some open questions, some of which can simply be fixed in the paper, while others might require additional experimental studies. Nevertheless, I think they would be worthwhile to investigate to make the paper more valuable.
What should be reported is the size of the relation/attribute networks. From what I understand, the number of nodes is rather small (at most 257+297 nodes for LitWK48k). Given that, I wonder why random walks are used instead of simply enumerating all walks of a given depth. Given the small size of the graphs, this should be easily possible (although at the expense of not utilizing the weights, but I feel they might be less interesting anyway, see below).
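To illustrate, exhaustive enumeration over networks of this size takes only a few lines of code. The following is a minimal sketch, assuming the network is stored as an adjacency dict (a hypothetical representation; the authors' actual data structures may differ):

# Sketch: enumerate all walks up to a fixed number of edges, instead of sampling.
def enumerate_walks(graph, start, max_depth):
    """Yield every walk (tuple of nodes) with 1..max_depth edges from start."""
    stack = [(start,)]
    while stack:
        walk = stack.pop()
        if len(walk) > 1:           # at least one edge traversed
            yield walk
        if len(walk) <= max_depth:  # still allowed to extend
            for succ in graph.get(walk[-1], ()):
                stack.append(walk + (succ,))

# Toy network: relation nodes pointing to relation/attribute nodes.
toy = {"locatedIn": ["population", "area"], "owner": ["deathdate"]}
walks = [w for node in toy for w in enumerate_walks(toy, node, max_depth=2)]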
As far as the weights of the relation/attribute network are concerned, I wonder how valuable they actually are. In an extreme case, each entity has a numeric attribute holding an identifier; consequently, the edge to the node "id" always has the maximum weight. Likewise, it could be that two relations are frequent in isolation (e.g., r=owner, d=deathdate), but their combination is nevertheless infrequent (in the given case, it only occurs when the owner of a company is a deceased natural person, which may be very rare). The walk approach will favor such unproductive combinations. Therefore, I feel that an unweighted approach (or even enumerating all paths, as discussed above) should be explored in an ablation study.
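A toy computation makes this concrete (all counts are made up for illustration):

# Both relations are frequent on their own, so a weighted walk scores the
# combination owner -> deathdate proportionally to the product of the
# marginal weights, which is large:
count_owner, count_deathdate = 10_000, 8_000
path_score = count_owner * count_deathdate
# Yet the actual joint occurrence (owners who are deceased natural persons)
# may be tiny, so the resulting feature covers almost no entities:
joint_count = 3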
Moreover, in real knowledge graphs, there might be different relations with similar semantics. Consider the four paths:
FIZ -> locatedIn -> Karlsruhe -> population -> 300,000
ZKM -> location -> Karlsruhe -> population -> 300,000
JPMorgan -> headquarter -> New_York_City -> population -> 8,800,000
UCLA -> campus -> Los_Angeles -> population -> 3,821,000
Here, we would create four different features, but they would essentially capture the same information (the population of the city where the organization is located). I wonder whether there are any plans to address this, e.g., by introducing walks with wildcards, or by clustering semantically similar relations upfront?
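As a sketch of the clustering idea: relations could be merged upfront when their tail entity sets overlap strongly. This is one hypothetical proxy for similar semantics; relation embeddings or schema alignments would be alternatives.

from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_relations(tails_by_relation, threshold=0.5):
    """Greedy single pass: alias r2 to r1 if their tail sets overlap enough.
    tails_by_relation: dict mapping each relation to its set of tail entities."""
    alias = {r: r for r in tails_by_relation}
    for r1, r2 in combinations(tails_by_relation, 2):
        if jaccard(tails_by_relation[r1], tails_by_relation[r2]) >= threshold:
            alias[r2] = alias[r1]
    return alias

# Toy example: "location" is aliased to "locatedIn", "owner" stays separate.
tails = {"locatedIn": {"Karlsruhe", "Mannheim", "Stuttgart"},
         "location": {"Karlsruhe", "Mannheim"},
         "owner": {"Elon_Musk"}}
aliases = merge_relations(tails)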
As far as the approach is concerned, there is a filtering step that discards features that are assigned to only a single entity. At the same time, Fig. 2 also depicts edges in the network with a weight of 1, which, as far as I understand, will automatically lead to features that are discarded whenever they are included in a walk; hence, I see no need to create those edges in the first place. Moreover, since the authors show that filtering has an effect, I wonder how more aggressive filtering would impact the approach. In addition, since the graphs differ considerably in size, it might be valuable to replace the fixed threshold with a percentage of the entities (e.g., removing features that exist for less than p% of all entities, or, if classes are given, for less than p% of all entities in a class).
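A coverage-based filter would be a one-liner; a sketch, assuming features are stored as a mapping from each feature to the set of entities it is assigned to (hypothetical names):

def filter_by_coverage(entities_by_feature, num_entities, p=0.01):
    """Keep features covering at least a fraction p of all entities,
    with a floor of 2, matching the existing single-entity filter."""
    min_count = max(2, int(p * num_entities))
    return {f: ents for f, ents in entities_by_feature.items()
            if len(ents) >= min_count}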
I was also wondering about the impact of the average aggregation when extracting literals. Averages are very susceptible to outliers; did the authors face any issues with that? An ablation with other aggregators (min/max/median) might be interesting.
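A two-line example with toy numbers shows how strongly a single outlier distorts the mean compared to the median:

import statistics

# Populations reached via the same walk feature; one metropolis among towns.
values = [300_000, 310_000, 8_800_000]
statistics.mean(values)    # about 3,136,667, dominated by the outlier
statistics.median(values)  # 310,000, robust against it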
In the algorithm description section, it might be good to explain the three cases using a concrete example.
The experiment section also leaves some open questions:
* First, the parameters for the random walks are not given. How many random walks are generated per entity? What is the maximum depth?
* Are the parameters (epochs etc.) the same for plain DistMult/ComplEx, DistMult-LiteralE, and ComplEx-LiteralE as for DistMult/ComplEx-LitKGE? If not, how comparable are the results?
* To what extent can the better performance of DistMult be attributed simply to training it for more epochs?
* The ablation with ConvE is nice, but it raises the question why a full set of experiments with ConvE is not presented. Moreover, what are the results for plain ConvE?
Minor points:
* The section on LiteralE (and LitKGE in 4.2) always assumes real-valued vectors, but they are complex-valued when the underlying model is ComplEx.
* In terms of notation, the paper could be more uniform. For example, section 3 uses (s,o,p) as a notation for triples, while later, the more common (s,p,o) notation is used, and the set of entities is sometimes notated as \mathcal{E}, sometimes simply as E.
* Fig. 2 has identical depictions for the network and the walks; the latter should probably be different
* In the paragraph "WeiDNeR-Extended", it says "the higher the number of common entities", but this should probably be "number of common pairs of entities", and refers to tail-head pairs
* The approach seems to find semantically related, not semantically similar, pairs of relations
* The random walk algorithm is first mentioned in the last paragraph of "Algorithm Description" without any prior introduction, which comes as a bit of a surprise
* The "2nd order biased random walk approach" should be explained
Overall, since a few of the points of critique may require additional experiments, I recommend a major revision.