By Anonymous User
Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Weak
Content:
Technical Quality of the paper: Average
Originality of the paper: No
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Average
Detailed Comments:
The paper presents a pipeline for predicting poor student performance in a course. The authors argue for this variant of the grade-prediction problem, formalize an approach based primarily on embeddings with KNN inference, and report results which, by the authors' own admission, are not particularly strong.
I think the biggest strength of this paper is the positioning of the problem, which I found interesting. It is a different take on previous work (which the authors review thoroughly), and the imbalanced nature of predicting a poor grade provides an interesting challenge.
However, I see two major weaknesses that preclude acceptance to this journal. First, the approach does not appear to be neurosymbolic; it strikes me as a pure data mining approach, perhaps more appropriate for a venue like KDD's applied data science track. Second, the evaluation needs more work: the results are quite limited, both in significance and in thoroughness.
Let me start with the experiments.
1. First, the overall precision, recall, and F1 results are very low, so on the surface the results are not significant. However, I question whether these metrics are even the most important ones. This is a highly imbalanced problem that has not been shown to be solvable with other methods, so perhaps these numbers are not so bad after all.
2. The paper lacks baseline comparisons. The three that come to mind are (1) a standard dense neural network, (2) a simple modification of an approach from the grade-prediction literature, and (3) a random baseline. How well or poorly the proposed algorithm works would be much clearer in the context of other methods; a rough sketch of what I mean follows this list.
3. The hyperparameter exploration seems insufficient. The authors tried two values of k, but other hyperparameters or easy modifications could be explored as well: for example, thresholding distances, trying different distance functions, or trying different embeddings. It would be interesting to see whether the authors could develop variants with different properties (e.g., one tuned to improve precision and one to improve recall).
4. I also question whether precision and recall are the best metrics. While standard metrics such as these should certainly be reported, the authors should also consider developing an application-specific metric. Is there a way the algorithm can provide results that better serve the students? And how do the baselines perform on such a metric?
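To make the baseline point (comment 2) concrete, something along the following lines would already help. This is only a sketch using scikit-learn; the feature matrix and labels below are synthetic placeholders standing in for the authors' embeddings and "poor grade" labels, and the specific models are merely examples of a random baseline and a dense-network baseline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32))            # placeholder student embeddings
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positives, mimicking the class imbalance

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

baselines = {
    "random (stratified)": DummyClassifier(strategy="stratified", random_state=0),
    "dense NN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}
for name, clf in baselines.items():
    clf.fit(X_tr, y_tr)
    p, r, f1, _ = precision_recall_fscore_support(
        y_te, clf.predict(X_te), average="binary", zero_division=0
    )
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Reporting the proposed method's numbers alongside a table like this would let readers judge whether the low absolute scores are a limitation of the method or of the problem itself.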
My second major problem with the paper is that it cannot be considered neurosymbolic in its current form. I think there are some opportunities with this problem that are worth exploring:
1. The authors describe in Section 5.3 how the performance prediction works, which amounts to a series of case statements. This could be represented as a logic program (e.g., ASP, Prolog, PyReason, etc.), with the KNN results dumped into it as logical facts. The result would also give the user an explanation of why the system thinks they are likely to perform poorly in a class; a toy sketch of what I mean follows this list.
2. The distance functions include direct grade information, which could also be represented symbolically (e.g., in propositional logic), and such information could perhaps be used to characterize the distances with some kind of symbolic annotation. Again, this could improve explainability.
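To illustrate the first suggestion, here is a toy version of "KNN output as logical facts plus a rule", written in plain Python rather than ASP/Prolog/PyReason (any of which would express the same thing declaratively). Every predicate name, threshold, and data value here is hypothetical and not taken from the paper; the point is only that the case statements become an explicit rule whose firing yields an explanation.

```python
def knn_to_facts(student, neighbors):
    """Turn retrieved KNN neighbors into ground facts (toy predicates)."""
    facts = set()
    for n_id, grade, dist in neighbors:
        facts.add(("neighbor", student, n_id))
        if grade <= 2.0:                    # hypothetical "poor grade" cutoff
            facts.add(("poor_grade", n_id))
        if dist < 0.3:                      # hypothetical "close in embedding space" threshold
            facts.add(("close", student, n_id))
    return facts

def predict_at_risk(student, facts):
    """Rule: at_risk(S) :- close(S, N), poor_grade(N), for at least two distinct N."""
    support = [n for (_, s, n) in {f for f in facts if f[0] == "close"}
               if s == student and ("poor_grade", n) in facts]
    if len(support) >= 2:
        return True, f"at_risk({student}): close, poorly performing neighbors {support}"
    return False, f"no at_risk({student}): insufficient supporting neighbors"

# neighbors as (id, grade on a 0-4 scale, embedding distance) -- made-up values
neighbors = [("s17", 1.7, 0.21), ("s03", 3.5, 0.18), ("s42", 1.9, 0.27)]
verdict, why = predict_at_risk("s99", knn_to_facts("s99", neighbors))
print(verdict, "-", why)
```

Encoding the rules in a proper logic-programming system would additionally let the authors combine them with other symbolic knowledge (prerequisites, course difficulty, etc.), which is where the neurosymbolic contribution could actually emerge.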
In short, while I see some promise in this problem, the paper is not ready for acceptance to the journal. The authors would, at a minimum, need to address both of the points above; if they address only the first, the paper should be resubmitted to a data mining venue rather than a neurosymbolic one.