Completed August 2023

ClauseRelFST

A finite-state transducer pipeline for extracting and classifying relational clauses from unstructured legal and technical text. Achieves near-zero false-negative rates on known relation types.

Python · OpenFST · spaCy · pynini


Overview

ClauseRelFST is a finite-state transducer pipeline for extracting and classifying relational clauses from unstructured text, with particular attention to legal and technical documents. The system operates on tokenised, part-of-speech-tagged input and applies a cascade of weighted transducers to identify clause boundaries, extract argument structures, and assign relation types.

The choice of FSTs over neural sequence models was deliberate. Legal text exhibits highly regular syntactic patterns, and for known relation types the rule-based approach keeps false negatives very low (98.7% recall on the internal evaluation set reported under Results). This is a requirement in high-stakes contexts, where missing a contractual obligation or liability clause is not acceptable. FSTs are also fully transparent: every output can be traced to the specific transducer rule that produced it.
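The traceability property can be illustrated with a minimal sketch in plain Python. It is not the project's transducer code: the rule identifiers, trigger phrases, and the `Rule` and `Extraction` types are hypothetical, and simple token matching stands in for a compiled FST. The point is only that each extraction carries the identifier of the rule that fired.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str    # hypothetical identifier used for provenance
    relation: str   # relation type assigned when the rule matches
    trigger: tuple  # lowercase token sequence that signals the relation


@dataclass(frozen=True)
class Extraction:
    relation: str
    subject: str
    obj: str
    rule_id: str    # the rule that produced this extraction


# Hypothetical rules; the real grammar is not shown in this document.
RULES = [
    Rule("OBLIGATION_SHALL", "obligation", ("shall",)),
    Rule("LIABILITY_LIABLE_FOR", "liability", ("is", "liable", "for")),
]


def extract(tokens):
    """Return every extraction together with the id of the rule behind it."""
    lowered = [t.lower() for t in tokens]
    results = []
    for rule in RULES:
        n = len(rule.trigger)
        # Start at 1 and stop early so subject and object are never empty.
        for i in range(1, len(tokens) - n):
            if tuple(lowered[i:i + n]) == rule.trigger:
                results.append(Extraction(
                    relation=rule.relation,
                    subject=" ".join(tokens[:i]),
                    obj=" ".join(tokens[i + n:]),
                    rule_id=rule.rule_id,
                ))
    return results
```

Calling `extract("The Supplier shall deliver the Goods".split())` yields a single obligation whose `rule_id` field names the rule that matched, which is the kind of audit trail a neural sequence model does not provide directly.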

Design

The pipeline consists of four transducer stages. The first normalises the token stream: it collapses whitespace variants, standardises quotation characters, and resolves abbreviations common in legal text. The second applies syntactic chunk patterns to identify noun phrases and verb phrases. The third applies the core relation detection transducers, compiled from a manually curated grammar of roughly 200 relation patterns. The fourth is a post-processing layer that resolves coreference in the extracted argument spans.
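As a rough illustration of the cascade, the sketch below implements the first stage with ordinary string operations and leaves the other three as pass-through placeholders. It assumes plain Python functions rather than compiled transducers; the abbreviation table and all function names are hypothetical.

```python
import re
from functools import reduce

# Hypothetical abbreviation table; the project's actual list is not given here.
ABBREVIATIONS = {"w.r.t.": "with respect to", "incl.": "including"}

# Map typographic quotes to their plain ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',
    "\u2018": "'", "\u2019": "'",
})


def normalise(text):
    """Stage 1: standardise quotes, collapse whitespace, expand abbreviations."""
    text = text.translate(QUOTE_MAP)
    # \s also matches non-breaking spaces and other Unicode whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text


# Placeholders for stages 2 to 4; each simply passes its input through.
def chunk(value):
    return value


def detect_relations(value):
    return value


def resolve_coreference(value):
    return value


STAGES = [normalise, chunk, detect_relations, resolve_coreference]


def run_cascade(text, stages=STAGES):
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda value, stage: stage(value), stages, text)
```

Keeping each stage as a separate unit with a single input and output mirrors how transducers are composed: a stage can be tested or replaced without touching the others.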

Transducers are compiled from a human-readable grammar specification using pynini and stored in the OpenFST binary format. This allows fast composition and runtime execution without recompilation.
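The compile-once, load-at-runtime pattern can be sketched with the standard library alone. This is a stand-in, not the project's build step: the grammar format below is hypothetical, a token trie plays the role of the compiled transducer, and `pickle` plays the role of the OpenFST binary format. In pynini the analogous steps would be writing a compiled `Fst` to disk with its `write` method and loading it with `Fst.read`.

```python
import pickle

# Hypothetical human-readable grammar: "relation : trigger phrase" per line.
SPEC = """
# relation    : trigger phrase
obligation    : shall
obligation    : agrees to
liability     : is liable for
"""


def compile_spec(spec):
    """Compile the spec into a token trie: a deterministic automaton whose
    accepting states carry a relation label."""
    root = {"next": {}, "label": None}
    for line in spec.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        relation, phrase = (part.strip() for part in line.split(":", 1))
        state = root
        for token in phrase.split():
            state = state["next"].setdefault(token, {"next": {}, "label": None})
        state["label"] = relation
    return root


def find_relations(automaton, tokens):
    """Return (start_index, relation) for the longest match at each position."""
    hits = []
    for start in range(len(tokens)):
        state, label = automaton, None
        for token in tokens[start:]:
            state = state["next"].get(token.lower())
            if state is None:
                break
            label = state["label"] or label
        if label:
            hits.append((start, label))
    return hits


# Build step: compile and serialise once.
blob = pickle.dumps(compile_spec(SPEC))

# Runtime: load the compiled form; the spec is not parsed again.
automaton = pickle.loads(blob)
```

At runtime only the load and `find_relations` steps run, so edits to the grammar require a rebuild but ordinary execution does not.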

Results

On an internal evaluation set of 1,200 annotated clauses drawn from commercial contracts, ClauseRelFST achieves 94.2% precision and 98.7% recall on in-distribution relation types, which corresponds to a false-negative rate of 1.3%. Performance degrades gracefully on out-of-distribution patterns: the system falls back to a lower-confidence extraction rather than asserting a spurious relation. This failure mode is preferable in the target domain.
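The fallback behaviour might look something like the sketch below. The trigger table, the hint words, the confidence values, and the `Result` type are all assumptions made for illustration; the document does not say how the real system scores its extractions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    relation: Optional[str]
    confidence: float  # illustrative values, not the system's real weights
    source: str        # "grammar", "fallback", or "none"


# Hypothetical stand-ins for the curated grammar and the looser fallback cues.
KNOWN_TRIGGERS = {"shall": "obligation", "warrants": "warranty"}
MODAL_HINTS = {"must", "will", "may"}


def classify(tokens):
    """Prefer a grammar match; otherwise emit a flagged low-confidence guess."""
    lowered = [t.lower() for t in tokens]
    for tok in lowered:
        if tok in KNOWN_TRIGGERS:
            return Result(KNOWN_TRIGGERS[tok], 0.95, "grammar")
    if any(tok in MODAL_HINTS for tok in lowered):
        # Out-of-distribution: report that something relational is present
        # without committing to a specific known relation type.
        return Result("unclassified_relation", 0.40, "fallback")
    return Result(None, 0.0, "none")
```

A downstream reviewer can then filter on `source` or `confidence` to route fallback extractions to manual inspection rather than treating them as confirmed relations.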