Revisiting Business Process Analysis through the lens of Large Language Models: Prompting experiments with BPMN process serializations

Tracking #: 788-1779

Flag : Review Received

Authors:

Damaris Dolha

Ana-Maria Ghiran

Robert Buchmann

Responsible editor:

Guest Editors Neuro-Symbolic AI and Conceptual Modeling 2024

Submission Type:

Article in Special Issue (note in cover letter)

Full PDF Version:

nai-paper-788.pdf

Cover Letter:

Dear Editors,

We hereby submit the manuscript titled "Revisiting Business Process Analysis through the lens of Large Language Models: Prompting experiments with BPMN process serializations", to be considered for publication in the Neuro-symbolic AI Journal, in the special issue on Neuro-Symbolic AI and Domain Specific Conceptual Modelling.

The paper reports on comparative prompting experiments with alternative process representations of the same BPMN models: the standard BPMN XML serialization (as exported from the Signavio toolset) and the non-standard semantic RDF graph serialization (available as an export format from the Bee-Up modeling toolkit). The ability of ChatGPT 4 to answer process queries is compared through a series of prompting experiments on full explicit process models and on minimalist BPMN patterns with generic labelling. Quality of answers is evaluated using the RAGAs framework. The work was motivated by a need to revisit the BPM lifecycle through the lens of what Large Language Models can bring to its different phases; for now we focus on the Process Analysis phase.

The submission extends a conference paper that was presented at BIR 2024 (Perspectives on Business Informatics Research) in September 2024: https://link.springer.com/chapter/10.1007/978-3-031-71333-0_2

The extension in this journal version pertains to:
- experiments on a variety of minimalist BPMN patterns (whereas in the conference paper only one end-to-end process exemplar was discussed)
- insight on the differences between the RDF and XML serializations of BPMN, both structurally and in how the structure is reflected in the linear text-based serialization

We look forward to hearing from you,
Damaris Dolha, Ana-Maria Ghiran, Robert Buchmann

Approve Decision:

Approved

Revised Version:

Revisiting Business Process Analysis through the lens of Large Language Models: Prompting experiments with BPMN process serializations

Tags:

Reviewed

Decision:
Major Revision

Solicited Reviews:

Review #1 submitted on 04/Dec/2024

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak

Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Poor
Level of English: Satisfactory
Overall presentation: Average

Detailed Comments:

- Although I intuitively find the question interesting as to how well LLMs may be able to understand BPMN, the precise motivation for the research is only partially clear to me - in my opinion, many of the queries that were tried out in the evaluation can best be answered by looking at a graphical model. While this may not always be so, I really miss a clear statement of the objective of this research.
- In terms of the evaluation, I have a number of doubts:
-> firstly, I wonder how the test queries were selected and what was the general strategy behind this selection. This should be more clearly described in the paper
-> secondly, I am not very convinced about the use of the RAGAs scores. For instance, Case I in Table 1 gets a faithfulness score of 1.0, which your discussion initially supports. Later, you say that the answer introduces additional details not supported by the context (e.g. the activity "activate bot"), which should lead to a lower faithfulness score. Another example is the discussion of answer similarity for Case I in Table 2, where you state that the high score is based on the comprehensiveness of the answer - but the ground truth is not at all comprehensive! In conclusion, I think you should discuss the scores more critically...
-> And please do not report them with 6 decimal places!
-> Lastly, since you are discussing all answers in detail anyway, why not rely on a human assessment of these (or other) metrics instead of automatically computed ones? For instance, it is somehow counterintuitive that answers that are identical to the ground truth receive very low values for some scores (e.g. faithfulness=0 for Case I in Tables 6 and 7, but faithfulness=1 in Table 8, in all cases answer is identical to ground truth).
- In terms of presentation, I find that the paper is far too long, given its rather minor contribution. In particular, it seems to me that Tables 7-19 rather consistently show how the RDF-based variant (Case I) delivers more accurate answers than Case II. Even if there are some subtleties / additional findings, I feel that you could summarise all main findings of these Tables in a rather short section and put the tables themselves into an appendix.

Overall, the contribution of the paper is not very big - based on a generally interesting, but rather vaguely formulated research question, it mainly shows how RDF representations of processes lead to more useful answers / summaries. This conclusion has been reached via a number of test questions whose selection is not very clearly motivated. The paper could be substantially shortened to avoid repetition of findings across test cases. When shortened, it might be suitable for a submission at a conference.

Review #2 submitted on 20/Nov/2024

By Andreas Martin
Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average

Content:
Technical Quality of the paper: Average
Originality of the paper: Yes
Adequacy of the bibliography: Yes, but see detailed comments

Presentation:
Adequacy of the abstract: No
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Unsatisfactory
Overall presentation: Weak

Detailed Comments:

The study is well-aligned with current trends in applying generative AI to business process analysis, but several significant issues limit the paper's clarity and overall contribution.

Key Strengths

- Timely Research: The paper addresses the intersection of generative AI and BPM, an area with considerable potential for advancing process analysis techniques.
- Visual Insights: Figures 5 and 6 are well-executed and hold practical value, especially for readers unfamiliar with BPM concepts.

Major Concerns and Recommendations

1. Uncertainty Regarding LLM Inferencing
- The methodology lacks clarity about how LLM inferencing was conducted. The abstract mentions the ChatGPT interface, but the manuscript inconsistently refers to “LLM services” and “GPT services,” leaving the reader unsure whether standard ChatGPT functionality or custom GPTs were used.
- Additionally, the specific models utilized are not disclosed. Clear documentation of the experimental setup, including the version of the LLM and its configuration, is essential for reproducibility and transparency.

2. State-of-the-Art and Background
- The section titled “Large Language Models and the BPM lifecycle” appears to function as a background chapter, but is far too brief for a journal article. It lacks critical engagement with existing literature and does not establish a clear research gap.
- A proper literature review should analyze the strengths and limitations of prior work, offering a critical stance and positioning the paper's contribution in the broader academic discourse.

3. Research Gap, Hypothesis, and Questions
- The paper fails to clearly articulate its research gap, hypotheses, or research questions. Without this foundation, the study's objectives and outcomes are ambiguous, making it difficult to evaluate its scientific merit.
- Explicitly stating these elements would provide focus and enable readers to assess the novelty and value of the research.

4. Presentation of Tables and Figures
- The tables are not adequately referenced in the text, and their presentation is suboptimal. Consider relocating tables to the appendix if they are not directly essential to the narrative.
- Figures 1–3 and potentially Figure 4 could also be moved to an appendix, as their content adds limited value to the main discussion. Conversely, Figures 5 and 6 could be expanded upon to better engage readers less familiar with BPM concepts.

5. Discussion Section
- The paper lacks a dedicated discussion section, which is crucial for interpreting findings and situating them within the context of existing work. The absence of a discussion makes it difficult for readers to gauge the added value of the study.
- A robust discussion should revisit the research questions (if stated) and critically evaluate the implications, limitations, and potential future directions of the findings.

6. Conclusion
- The conclusion does not sufficiently explore future research opportunities. For example, the potential of continuous pre-training on domain-specific knowledge graphs, meta-models, or XML/RDF schemas is an intriguing direction that is not mentioned but warrants inclusion.
- Introducing these perspectives would underline the study's relevance and inspire further exploration in the field.

Additional Recommendations

- Proofreading and Language Clarity: Issues such as incomplete sentences (e.g., the last sentence in the abstract) and unnecessarily long sentences hinder readability. Rigorous proofreading is needed.
- Terminology Regarding LLM Capabilities: Caution is advised when describing LLMs as “understanding” or “reasoning.” The paper should reflect recent literature, which highlights the limitations of LLMs in these areas.
- SPARQL Presentation: SPARQL snippets should not span across pages, as this disrupts the flow and readability of technical content.

Summary of Improvements Needed
1. Clarify the methodology, specifically the LLM interface and models used.
2. Expand the state-of-the-art section into a comprehensive literature review with a critical stance.
3. Explicitly define the research gap, hypotheses, or questions.
4. Improve the referencing and presentation of tables and figures, relocating less critical ones to the appendix.
5. Introduce a discussion section to interpret the findings and connect them to broader implications.
6. Enhance the conclusion by discussing future research directions, particularly pre-training on structured schemas.

Review #3 submitted on 10/Jan/2025

By Anonymous User
Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak

Content:
Technical Quality of the paper: Weak
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes

Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Weak

Detailed Comments:

The paper explores various experiments on how large language models, in particular ChatGPT, can be used in various phases of the BPM lifecycle. Four metrics are used to compare the answers of the LLM: faithfulness, relevancy, correctness and similarity. Two kinds of BPMN models were used: a realistic one and a set of process patterns. For the first one the content (labels) can be analysed by the LLM, while for the process patterns only structural and flow-based aspects can be taken into account, i.e. the LLM must only "know" the BPMN language but not the application domain.

A comparison is made between RDF export and XML. However, RDF and XML are on different levels. XML is a markup language that can be used for representing data in any domain by defining tags. RDF is a data format representing data as triples. There is an XML syntax for RDF, thus RDF/XML is an application of XML. It is not made explicit which XML schema is used. I assume RDF is compared with the ".bpmn" representation of the models. This should then be made explicit by stating that the RDF export is compared to the BPMN schema in XML.
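The difference in levels can be illustrated with a minimal sketch, assuming a toy two-task process in the standard BPMN 2.0 XML namespace. The predicate names (`ex:Task`, `ex:followedBy`) are hypothetical placeholders for illustration, not the vocabulary of the Bee-Up RDF export, and the model is not taken from the paper.

```python
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 model schema used by ".bpmn" files.
BPMN_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"

# Toy model (hypothetical): task A followed by task B.
BPMN_XML = f"""<definitions xmlns="{BPMN_NS}">
  <process id="p1">
    <task id="t1" name="A"/>
    <task id="t2" name="B"/>
    <sequenceFlow id="f1" sourceRef="t1" targetRef="t2"/>
  </process>
</definitions>"""


def bpmn_to_triples(xml_text):
    """Flatten tasks and sequence flows into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text)
    triples = []
    for task in root.iter(f"{{{BPMN_NS}}}task"):
        triples.append((task.get("id"), "rdf:type", "ex:Task"))
        triples.append((task.get("id"), "rdfs:label", task.get("name")))
    for flow in root.iter(f"{{{BPMN_NS}}}sequenceFlow"):
        # In the XML schema the flow is a separate element; as a triple it is a direct edge.
        triples.append((flow.get("sourceRef"), "ex:followedBy", flow.get("targetRef")))
    return triples


triples = bpmn_to_triples(BPMN_XML)
for t in triples:
    print(t)
```

In the XML form the connection between the two tasks is a separate element that refers to both by id, whereas in the triple form it is a single statement. A comparison of the two formats therefore needs to name the XML schema involved.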

The presentation of the experiment results should be clearer. It took me some time to identify that the prompts are in the first line above the tables. At least the labels "Prompt 1:" etc. should be written in bold, but I suggest not writing them only in the table. The table formatting is not clear either: there should be a label stating what the rows mean. And for the key metrics it is not necessary to mention them twice per row. One possibility would be to have three columns: the first column stating what the rows mean, and the second and third containing the values for each case.

It is not clear how the values for the key metrics are calculated. It is also suspicious that they are reported to six digits after the decimal point.
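For orientation, the RAGAs documentation describes faithfulness as the fraction of claims in an answer that are supported by the given context. The following is a minimal sketch of that ratio only, assuming the per-claim verdicts are already available; in RAGAs they are produced by an LLM judge, which is not reproduced here. The function name and the verdict values are hypothetical and are not results from the paper.

```python
def faithfulness(claim_supported):
    """Share of answer claims supported by the context (list of booleans)."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts: three supported claims and one unsupported detail.
verdicts = [True, True, True, False]
score = faithfulness(verdicts)

# Two decimal places are enough for a ratio over a handful of claims.
print(f"{score:.2f}")  # prints 0.75
```

With only a few claims per answer, the score can take only a few distinct values, so reporting it to six digits suggests a precision the metric does not have.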

The experiment consists of 19 prompts, but it is not clear why exactly these prompts were chosen.

For each of the 19 prompts an analysis is described, but the conclusion is very short. There should be a lessons-learned section synthesizing the results across the prompts, making clear for which kinds of prompts which key metrics are higher for RDF and which for BPMN XML.

At the end of Section 3.2 a reference to BPM Analyse is missing.

The research is interesting but the presentation is weak. I suggest that the authors revise the paper and resubmit it.
