Event-Role Semantics for Grounded LLM Reasoning via FST Scaffolding
We propose a hybrid architecture in which finite-state transducers extract event-role structured representations from unstructured text, providing a symbolic scaffold for downstream LLM reasoning. We evaluate the approach on temporal reasoning and relation extraction benchmarks.
Abstract
Large language models demonstrate impressive performance on many natural language understanding tasks but remain unreliable on tasks requiring structured temporal reasoning — tracking when events occurred, how they relate, and whether stated facts remain current. We argue this failure mode is partly architectural: the continuous, distributed representations learned during pretraining do not naturally support the discrete bookkeeping required for temporal consistency.
This paper proposes a hybrid architecture in which finite-state transducers (FSTs) extract event-role structured representations from unstructured text and maintain a symbolic store of typed facts with temporal scoping. The language model queries this store via a structured interface, receiving crisp symbolic answers to factual questions while retaining full natural language generation capabilities.
Approach
We build on the event-role semantic framework from frame semantics and FrameNet, extending it with temporal scoping predicates that encode when a fact was asserted, when it was retracted (if ever), and what its confidence and provenance are. The FST extraction pipeline is trained on a combination of crowdsourced annotations and silver-labelled data produced by a larger LM.
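To make the temporal scoping concrete, the following is a minimal sketch of what a scoped event-role fact might look like. The field names (`frame`, `role`, `filler`, and so on) are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScopedFact:
    """An event-role fact with temporal scoping: assertion time,
    optional retraction time, confidence, and provenance.
    All field names here are hypothetical, for illustration only."""
    frame: str                           # FrameNet-style frame, e.g. "Employment"
    role: str                            # role within the frame, e.g. "Employee"
    filler: str                          # entity filling the role
    asserted_at: str                     # ISO date the fact was asserted
    retracted_at: Optional[str] = None   # None if never retracted
    confidence: float = 1.0              # extraction confidence
    provenance: str = "unknown"          # where the fact came from

    def current(self, as_of: str) -> bool:
        """True if the fact was asserted by `as_of` and not yet retracted."""
        return self.asserted_at <= as_of and (
            self.retracted_at is None or self.retracted_at > as_of
        )

fact = ScopedFact("Employment", "Employee", "Alice",
                  asserted_at="2021-03-01", retracted_at="2023-06-15",
                  confidence=0.92, provenance="doc_17")
```

Representing dates as ISO strings keeps the comparison logic trivial (lexicographic order matches chronological order); a real store would likely use proper date types.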
The symbolic store is implemented in Prolog, allowing efficient querying via pattern-matching and enabling the composition of complex temporal queries from simpler primitives. The LLM interfaces with the store through a structured prompt format that has been optimised to minimise hallucination in the store-query generation step.
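The store itself is Prolog, but the pattern-matching query style it supports can be sketched with a small in-memory stand-in. Everything below (the tuple layout, the `query` signature, the wildcard convention) is a hypothetical illustration of the interface, not the paper's implementation:

```python
# Minimal in-memory stand-in for the Prolog fact store.
# Facts are (frame, role, filler, asserted_at, retracted_at) tuples;
# None in retracted_at means the fact was never retracted.
STORE = [
    ("Employment", "Employee", "Alice", "2021-03-01", "2023-06-15"),
    ("Employment", "Employee", "Bob",   "2022-01-10", None),
]

def query(frame=None, role=None, filler=None, as_of=None):
    """Return facts matching the given pattern (None acts as a wildcard,
    like an unbound Prolog variable), optionally restricted to facts
    current at the `as_of` date."""
    results = []
    for f, r, v, start, end in STORE:
        if frame is not None and f != frame:
            continue
        if role is not None and r != role:
            continue
        if filler is not None and v != filler:
            continue
        if as_of is not None:
            # Exclude facts not yet asserted, or already retracted.
            if start > as_of or (end is not None and end <= as_of):
                continue
        results.append((f, r, v, start, end))
    return results
```

Composing temporal queries from primitives then amounts to chaining such calls, e.g. asking who filled a role at one date and diffing against another date to detect retractions.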
Results
On the TempReason benchmark, the hybrid architecture outperforms both pure-LLM baselines and pure-retrieval baselines by 8.3 and 12.1 percentage points respectively on the hardest temporal reasoning tier. Ablations confirm that both the symbolic store and the FST extraction quality contribute independently to the improvement.