By Heiko Paulheim
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The paper introduces LitKGE, an approach for improving the handling of numeric literals in knowledge graph embedding models that are already capable of using those literals. The authors show that improvements can be obtained in an experiment with three datasets and two variants of LiteralE.
The approach essentially densifies a graph by adding attributes of distant entities to an entity, as shown in the introductory example. As such, it can be understood as an approach for adding attributive triples to a graph, using new attribute relations that are composed from paths. It would be helpful to describe it that way consistently throughout the paper. This would make the paper easier to follow and eliminate some redundant nomenclature (e.g., tables 1 and 2 could use the exact same terminology and not two distinct ones).
The approach is interesting and shows some improvements. At the same time, there are some open questions, some of which can simply be fixed in the paper, while others might require additional experimental studies. Nevertheless, I think they would be worthwhile to investigate to make the paper more valuable.
What should be reported is the size of the relation/attribute networks. From what I understand, the number of nodes is rather small (at most 257+297 nodes for LitWK48k). Given that, I wonder why random walks are used instead of simply enumerating all walks of a given depth. Given the small size of the graphs, this should be easily possible (although at the expense of not utilizing the weights, but I feel they might be less interesting anyway, see below).
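To illustrate, exhaustive enumeration over networks of this size takes only a few lines of code. The following is a minimal sketch, assuming the network is stored as an adjacency dict (a hypothetical representation; the authors' actual data structures may differ):

# Sketch: enumerate all walks up to a fixed number of edges, instead of sampling.
def enumerate_walks(graph, start, max_depth):
    """Yield every walk (tuple of nodes) with 1..max_depth edges from start."""
    stack = [(start,)]
    while stack:
        walk = stack.pop()
        if len(walk) > 1:           # at least one edge traversed
            yield walk
        if len(walk) <= max_depth:  # still allowed to extend
            for succ in graph.get(walk[-1], ()):
                stack.append(walk + (succ,))

# Toy network: relation nodes pointing to relation/attribute nodes.
toy = {"locatedIn": ["population", "area"], "owner": ["deathdate"]}
walks = [w for node in toy for w in enumerate_walks(toy, node, max_depth=2)]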
As far as the weights of the relation/attribute network are concerned, I wonder how valuable they actually are. In an extreme case, each entity has a numeric attribute holding an identifier; consequently, the edge to the node "id" always has the maximum weight. Likewise, it could be that two relations are frequent in isolation (e.g., r=owner, d=deathdate), but their combination is nevertheless infrequent (in the given case, it only occurs when the owner of a company is a deceased natural person, which may be very rare). The walk approach will favor such unproductive combinations. Therefore, I feel that an unweighted approach (or even enumerating all paths, as discussed above) should be explored in an ablation study.
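A toy computation makes this concrete (all counts are made up for illustration):

# Both relations are frequent on their own, so a weighted walk scores the
# combination owner -> deathdate proportionally to the product of the
# marginal weights, which is large:
count_owner, count_deathdate = 10_000, 8_000
path_score = count_owner * count_deathdate
# Yet the actual joint occurrence (owners who are deceased natural persons)
# may be tiny, so the resulting feature covers almost no entities:
joint_count = 3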
Moreover, in real knowledge graphs, there might be different relations with similar semantics. Consider the four paths:
FIZ -> locatedIn -> Karlsruhe -> population -> 300,000
ZKM -> location -> Karlsruhe -> population -> 300,000
JPMorgan -> headquarter -> New_York_City -> population -> 8,800,000
UCLA -> campus -> Los_Angeles -> population -> 3,821,000
Here, we would create four different features, but they would essentially capture the same information (the population of the city where the organization is located). I wonder whether there are any plans to address this, e.g., by introducing walks with wildcards, or by clustering semantically similar relations upfront?
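As a sketch of the clustering idea: relations could be merged upfront when their tail entity sets overlap strongly. This is one hypothetical proxy for similar semantics; relation embeddings or schema alignments would be alternatives.

from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def merge_relations(tails_by_relation, threshold=0.5):
    """Greedy single pass: alias r2 to r1 if their tail sets overlap enough.
    tails_by_relation: dict mapping each relation to its set of tail entities."""
    alias = {r: r for r in tails_by_relation}
    for r1, r2 in combinations(tails_by_relation, 2):
        if jaccard(tails_by_relation[r1], tails_by_relation[r2]) >= threshold:
            alias[r2] = alias[r1]
    return alias

# Toy example: "location" is aliased to "locatedIn", "owner" stays separate.
tails = {"locatedIn": {"Karlsruhe", "Mannheim", "Stuttgart"},
         "location": {"Karlsruhe", "Mannheim"},
         "owner": {"Elon_Musk"}}
aliases = merge_relations(tails)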
As far as the approach is concerned, there is a filtering step that discards features that are assigned to only a single entity. At the same time, Fig. 2 also depicts edges in the network with a weight of 1, which, as far as I understand, will automatically lead to features that are discarded whenever they are included in a walk; hence, I see no need to create those edges in the first place. Moreover, since the authors show that filtering has an effect, I wonder how more aggressive filtering would impact the approach. In addition, since the graphs differ considerably in size, it might be valuable to replace the fixed threshold with a percentage of the entities (e.g., removing features that exist for less than p% of all entities, or, if classes are given, for less than p% of all entities in a class).
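A coverage-based filter would be a one-liner; a sketch, assuming features are stored as a mapping from each feature to the set of entities it is assigned to (hypothetical names):

def filter_by_coverage(entities_by_feature, num_entities, p=0.01):
    """Keep features covering at least a fraction p of all entities,
    with a floor of 2, matching the existing single-entity filter."""
    min_count = max(2, int(p * num_entities))
    return {f: ents for f, ents in entities_by_feature.items()
            if len(ents) >= min_count}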
I was also wondering about the impact of the average aggregation when extracting literals. Averages are very susceptible to outliers; did the authors face any issues with that? An ablation with other aggregators (min/max/median) might be interesting.
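A two-line example with toy numbers shows how strongly a single outlier distorts the mean compared to the median:

import statistics

# Populations reached via the same walk feature; one metropolis among towns.
values = [300_000, 310_000, 8_800_000]
statistics.mean(values)    # about 3,136,667, dominated by the outlier
statistics.median(values)  # 310,000, robust against it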
In the algorithm description section, it might be good to explain the three cases using a concrete example.
The experiment section also leaves some open questions:
* First, the parameters for the random walks are not given. How many random walks are generated per entity? What is the maximum depth?
* Are the parameters (epochs etc.) the same for plain DistMult/ComplEx, DistMult-LiteralE, and ComplEx-LiteralE as for DistMult/ComplEx-LitKGE? If not, how comparable are the results?
* To what extent can the better performance of DistMult be attributed simply to training it for more epochs?
* The ablation with ConvE is nice, but it raises the question why a full set of experiments with ConvE is not presented. Moreover, what are the results for plain ConvE?
Minor points:
* The section on LiteralE (and LitKGE in 4.2) always assumes real-valued vectors, but they are complex-valued when the underlying model is ComplEx.
* In terms of notation, the paper could be more uniform. For example, section 3 uses (s,o,p) as a notation for triples, while later, the more common (s,p,o) notation is used, and the set of entities is sometimes notated as \mathcal{E}, sometimes simply as E.
* Fig. 2 has identical depictions for the network and the walks; the latter should probably be different
* In the paragraph "WeiDNeR-Extended", it says "the higher the number of common entities", but this should probably be "number of common pairs of entities", and refers to tail-head pairs
* The approach seems to find semantically related, not semantically similar, pairs of relations
* The random walk algorithm is first mentioned in the last paragraph of "Algorithm Description" without any prior introduction, which comes as a bit of a surprise
* The "2nd order biased random walk approach" should be explained
Overall, since a few of the points of critique may require additional experiments, I recommend a major revision.