Process Extraction from Text: a Reference Corpus, a Benchmark, and a Generative AI Approach

Patrizio Bellan

Business Process Management is a discipline concerned with the analysis of the business processes running in organizations. Business Process Management techniques often rely on explicit (business) process models. Unfortunately, the initial elicitation of a process model is a time-consuming and cost-intensive operation. While the extraction of process models from event log data is an active research field, the extraction of process models from documents is still at an early stage of development and presents crucial limitations that hamper its applicability in real-world scenarios.

At the beginning of my doctoral thesis, the analysis of the reference contributions revealed that this research field is scattered and mainly based on out-of-date technology. Indeed, none of the contributions adopted the newest NLP advancements, and none of them was tested using a standard benchmarking procedure on a common dataset. This highlighted limitations in three areas: the data available (in particular, the lack of a gold standard dataset), the evaluation performed, and the techniques adopted.

The advent of Large Language Models (LLMs) and generative AI is having a huge impact on the Natural Language Processing (NLP) field, opening the possibility of performing complex reasoning tasks with human-like performance. We decided to embrace this new technology to address the task of extracting process models from texts while coping with the scarcity of resources that characterizes this field of research.

In this thesis, we aim to achieve three main goals to address the limitations highlighted above:

• We aim to address the data limitation by creating the first gold standard dataset of process model descriptions annotated with process model elements and their relations. A reference dataset would allow for direct comparison of contributions and the development of efficient data-driven approaches based on machine learning.

• We aim to address the heterogeneity of pipelines presented in the reference contributions by introducing a benchmarking procedure. This procedure will allow for a comparative analysis of the reference contributions and their tools, highlighting the strengths and weaknesses of different research contributions and identifying future research directions.

• We aim to address the lack of novel NLP techniques by proposing a novel approach based on generative AI to extract process models from documents and represent them in knowledge graphs. This is an interesting research direction for supporting the automatic construction of knowledge graphs with Large Language Models: it helps to understand how much conceptual and relational knowledge they can extract, how much such knowledge differs from reality, and how they can be made more effective within specific contexts.
