By Kai-Uwe Kühnberger
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Good
Content:
Technical Quality of the paper: Good
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Good
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Good
Detailed Comments:
The paper “Robust Long-Context Multilingual Retrieval and Reasoning Enabled by Combined Neural and Symbolic Techniques” attempts to address challenges for LLMs concerning multilingual information retrieval and reasoning over long multilingual contexts. The paper proposes a neural-symbolic framework that integrates multilingual information retrieval and symbolic reasoning. The retrieval process is based on “Cross-lingual Retrieval Optimized for Scalable Solutions” (CROSS), whereas the reasoning process uses “Neuro-Symbolic Augmented Reasoning” (NSAR), which produces executable Python code for reasoning tasks.
The proposed architecture is essentially hybrid. The NSAR module used for reasoning relies on an LLM to retrieve relevant facts from a given context, to represent these facts in the structured form FACT(entity, attribute, value), and to generate Python code to compute the answer. The underlying attributive logic is very weak and probably not appropriate for many real-world problems, where complex relational structures need to be considered. It would be interesting to know whether a more expressive logical framework could simply be plugged in or would create challenges for the approach. For example, complex reasoning can be expensive (questioning the cost-efficiency of the proposed approach), complex relational structures could challenge the bge-m3 RAG framework, and translating structured knowledge into Python code could become more error-prone. Nevertheless, the paper is an interesting step towards identifying relevant facts in tasks with large contexts.
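To make my concern about the weak attributive logic concrete, consider a minimal sketch of what I take the NSAR intermediate representation to look like: flat FACT(entity, attribute, value) triples plus generated Python that computes over them. The entities, attribute names, and query below are invented for illustration and do not come from the paper; note that nothing in this representation can express relations between two entities.

```python
# Hypothetical illustration of an NSAR-style pipeline stage (not the
# authors' actual code): facts extracted by an LLM as flat
# FACT(entity, attribute, value) triples, followed by generated Python
# that computes the answer. All names and values here are invented.
from collections import namedtuple

Fact = namedtuple("Fact", ["entity", "attribute", "value"])

# Facts an LLM might extract from a long multilingual context
facts = [
    Fact("Berlin", "population_millions", 3.7),
    Fact("Hamburg", "population_millions", 1.9),
    Fact("Munich", "population_millions", 1.5),
]

# Generated reasoning code for a query like
# "Which city has the largest population?"
def answer_largest(facts, attribute):
    relevant = [f for f in facts if f.attribute == attribute]
    return max(relevant, key=lambda f: f.value).entity

print(answer_largest(facts, "population_millions"))  # -> Berlin
```

A purely attributive scheme like this handles lookups and aggregations, but a query involving a binary relation (e.g. “which city lies on the same river as X?”) has no natural encoding as a single (entity, attribute, value) triple, which is the expressivity gap I point to above.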
The architecture uses several processes to compute a solution to a given task, and errors can occur in each of them. The embedding by the bge-m3 tool and the candidate selection by the RAG framework have rather low error rates, e.g. 4% in the 1-needle test and 5.7% in the 3-needle test. Relative to the dataset used, this sets an upper bound on performance. Other sources of error concern the LLMs' incorrect-answer failures and unanswerable failures. Whereas these errors are rather low for both tested LLMs in the 1-needle test, GPT-4o-mini in particular shows weak results in the 3-needle test (45% of total responses are classified as unanswerable). Concerning the ablation studies with respect to reasoning strategies (e.g. chain-of-thought, SelfReflection, NSAR), the differences between these strategies are not dramatically large for GPT-4o-mini and Llama 3.2 90B (with RAG-vanilla in the case of GPT-4o-mini as an exception). Nevertheless, NSAR for GPT-4o-mini and NSAR+3 for Llama 3.2 show the best performance. Overall, the potential errors arising from the different modules do not harm the approach.
The results of the study are in general good, and excellent in some cases. To what extent these results generalize to other datasets, to more complex reasoning tasks, and to LLMs that are highly optimized for reasoning is unclear to me. For example, Figure 11 shows that GPT-4o-mini has an LLM failure rate of 52.2% in the 3-needle scenario, whereas o1-mini has a failure rate of only 9.7% in the same scenario. It seems that specialized (but rather small) LLMs could be equally successful in reasoning tasks without NSAR.
The related work section seems rather short. There is a long research tradition in cross-lingual retrieval, in reasoning over large contexts, and in the combination of both. The reader would perhaps appreciate a deeper embedding of the study into this larger research context.
The paper’s focus is on LLMs and neurally inspired methods. Symbolic reasoning plays a role only in the NSAR module, which uses a very weak logic and automatically generated Python code for reasoning, without any deeper forms of deductive inference. Nevertheless, I would call this paper an interesting contribution to the field of neural-symbolic AI. It allows resource-aware extensions of small LLMs that significantly improve multilingual reasoning.
The paper is formally well written and clearly structured. Here are some remarks:
- I recommend improving the legends of several figures. For example, it would help the reader to be able to distinguish the red and blue dashed lines from the red and blue solid lines in Figures 1 to 5. A legend that reflects these distinctions would correspond to the figures better than the current one; the current legends are confusing.
- I further recommend unifying the use of highlighted words, numbers, and constituents (bold fonts) across sections: for example, the sections “Dataset: mLongRR-V2” and “Cross-Lingual Language Pairs and Needle Positioning” use bold words heavily, whereas most sections contain no highlighting at all. In several sections, only numbers are (sometimes) highlighted. The logic of this highlighting strategy is not clear to me.
- Perhaps it would be appropriate to explain the overall processing pipeline / architecture somewhere. Given the number of different modules, the reader could otherwise lose the general overview.
- I would unify the spelling of “bge-m3” vs. “BGE-M3”. In the section “Cost Efficiency Analysis” the name appears in capital letters, but not in other sections. In the same section, the symbol N needs to be specified.