By Kai-Uwe Kühnberger
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The paper “Robust Long-Context Multilingual Retrieval and Reasoning Enabled by Combined Neural and Symbolic Techniques” attempts to address challenges for LLMs concerning multilingual information retrieval and reasoning over long multilingual contexts. The paper proposes a neural-symbolic framework that integrates multilingual information retrieval and symbolic reasoning. The retrieval process is based on “Cross-lingual Retrieval Optimized for Scalable Solutions” (CROSS), whereas the reasoning process uses “Neuro-Symbolic Augmented Reasoning” (NSAR), which produces executable Python code for reasoning tasks.
The proposed architecture is essentially hybrid. The NSAR module used for reasoning relies on an LLM to retrieve relevant facts from a given context, to represent these facts in the structured form FACT(entity, attribute, value), and to generate Python code to compute the answer. The underlying attributive logic is very weak and probably not appropriate for many real-world problems, where complex relational structures need to be considered. It would be interesting to know whether a more expressive logical framework could simply be plugged in or would create challenges for the approach. For example, complex reasoning can be expensive (questioning the cost-efficiency of the proposed approach), complex relational structures could challenge the bge-m3 RAG framework, and translating structured knowledge into Python code could become more error-prone. Nevertheless, the paper is an interesting step towards identifying relevant facts in tasks with large contexts.
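To make my concern about the weak attributive logic concrete, consider a minimal sketch of what I take the NSAR intermediate representation to look like: flat FACT(entity, attribute, value) triples plus generated Python that computes over them. The entities, attribute names, and query below are invented for illustration and do not come from the paper; note that nothing in this representation can express relations between two entities.

```python
# Hypothetical illustration of an NSAR-style pipeline stage (not the
# authors' actual code): facts extracted by an LLM as flat
# FACT(entity, attribute, value) triples, followed by generated Python
# that computes the answer. All names and values here are invented.
from collections import namedtuple

Fact = namedtuple("Fact", ["entity", "attribute", "value"])

# Facts an LLM might extract from a long multilingual context
facts = [
    Fact("Berlin", "population_millions", 3.7),
    Fact("Hamburg", "population_millions", 1.9),
    Fact("Munich", "population_millions", 1.5),
]

# Generated reasoning code for a query like
# "Which city has the largest population?"
def answer_largest(facts, attribute):
    relevant = [f for f in facts if f.attribute == attribute]
    return max(relevant, key=lambda f: f.value).entity

print(answer_largest(facts, "population_millions"))  # -> Berlin
```

A purely attributive scheme like this handles lookups and aggregations, but a query involving a binary relation (e.g. “which city lies on the same river as X?”) has no natural encoding as a single (entity, attribute, value) triple, which is the expressivity gap I point to above.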
The architecture uses several processes to compute a solution to a given task, and errors can occur in each of them. The embedding by the bge-m3 tool and the candidate selection by the RAG framework have rather low error rates, e.g. 4% in the 1-needle test and 5.7% in the 3-needle test. Relative to the dataset used, this sets an upper bound on performance. Other sources of error concern the LLMs' incorrect-answer failures and unanswerable failures. Whereas these errors are rather low for both tested LLMs in the 1-needle test, GPT-4o-mini in particular shows weak results in the 3-needle test (45% of total responses are classified as unanswerable). Concerning the ablation studies with respect to reasoning strategies (e.g. chain-of-thought, SelfReflection, NSAR), the differences between these strategies are not dramatically large for GPT-4o-mini and Llama 3.2 90B (with RAG-vanilla in the case of GPT-4o-mini as an exception). Nevertheless, NSAR for GPT-4o-mini and NSAR+3 for Llama 3.2 show the best performance. Overall, the potential errors arising from the different modules do not harm the approach.
The results of the study are in general good, and excellent in some cases. To what extent these results generalize to other datasets, to more complex reasoning tasks, and to LLMs that are highly optimized for reasoning is unclear to me. For example, Figure 11 shows that GPT-4o-mini has an LLM failure rate of 52.2% in the 3-needle scenario, whereas o1-mini has a failure rate of only 9.7% in the same scenario. It seems that specialized (but rather small) LLMs could be equally successful in reasoning tasks without NSAR.
The related work section seems rather short. There is a long research tradition in cross-lingual retrieval, in reasoning over large contexts, and in the combination of both. The reader would perhaps appreciate a deeper embedding of the study into this larger research context.
The paper’s focus is on LLMs and neurally inspired methods. Symbolic reasoning plays a role only in the NSAR module, which uses a very weak logic and automatically generated Python code for reasoning, without any deeper forms of deductive inference. Nevertheless, I would call this paper an interesting contribution to the field of neural-symbolic AI. It allows resource-aware extensions of small LLMs that significantly improve multilingual reasoning.
The paper is formally well written and clearly structured. Here are some remarks:
- I recommend improving the legends of several figures. For example, it would help the reader to be able to distinguish the red and blue dashed lines from the red and blue solid lines in Figures 1 to 5. A legend that reflects these distinctions would correspond to the figures better than the current one; the current legends are confusing.
- I further recommend unifying the use of highlighted words, numbers, and constituents (bold fonts) across sections: for example, the sections “Dataset: mLongRR-V2” and “Cross-Lingual Language Pairs and Needle Positioning” use bold words heavily, whereas most sections contain no highlighting at all. In several sections, only numbers are (sometimes) highlighted. The logic of this highlighting strategy is not clear to me.
- Perhaps it would be appropriate to explain the overall processing pipeline / architecture somewhere. Given the number of different modules, the reader could otherwise lose the general overview.
- I would unify the spelling of “bge-m3” vs. “BGE-M3”. In the section “Cost Efficiency Analysis” the name appears in capital letters, but not in other sections. In the same section, the symbol N needs to be specified.