in-progress May 2024

FOLD-RM for Knowledge Extraction: Inductive Rule Learning over Noisy Event Graphs

An investigation into applying FOLD-RM, a scalable algorithm for inducing answer set programs, to rule learning over knowledge graphs extracted from noisy, domain-specific corpora. We report preliminary results on a legal graph benchmark; results on a biomedical graph benchmark are pending.


Motivation

Inductive logic programming (ILP) offers a principled approach to learning symbolic rules from data, producing interpretable, compositional, and verifiable rules rather than opaque model weights. Despite renewed interest in ILP as a complement to neural approaches, most ILP algorithms struggle to scale to the size and noise levels of knowledge graphs extracted from real-world text.

FOLD-RM is a recent algorithm for inducing answer set programs that addresses the scalability problem by combining aggressive pruning with a greedy-with-backtracking search strategy. This work investigates whether FOLD-RM can be applied effectively to the messier setting of knowledge graphs extracted from noisy corpora, where entity mentions are ambiguous, relation types are inconsistently labelled, and the graph is highly incomplete.

Method

We apply FOLD-RM to knowledge graphs extracted from two domain-specific corpora: a collection of commercial legal contracts and a biomedical literature graph. Extraction is performed with ClauseRelFST for the legal corpus and with a domain-adapted NER/RE pipeline for the biomedical corpus. FOLD-RM then operates on the resulting typed graphs, with each learning target defined as a typed relation predicate.
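To make the setup concrete, the following is a minimal sketch of how a typed graph could be split into positive examples for one target relation predicate and background facts. It is an illustration only, not the pipeline used in this work: the Fact structure, the helper names, and all entity and relation names (party_to, governed_by, obligated_under) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One extracted edge of the typed graph: relation(head, tail)."""
    relation: str
    head: str
    head_type: str
    tail: str
    tail_type: str


def split_for_target(facts, target_relation):
    """Separate facts into positive examples for the target predicate
    and background facts that rule bodies may refer to."""
    positives = [f for f in facts if f.relation == target_relation]
    background = [f for f in facts if f.relation != target_relation]
    return positives, background


def to_atom(fact):
    """Render a fact as a logic-programming style atom."""
    return f"{fact.relation}({fact.head}, {fact.tail})"


# Hypothetical toy graph; none of these names come from the benchmarks.
facts = [
    Fact("party_to", "acme", "Organization", "contract_17", "Contract"),
    Fact("governed_by", "contract_17", "Contract", "new_york_law", "Jurisdiction"),
    Fact("obligated_under", "acme", "Organization", "new_york_law", "Jurisdiction"),
]

positives, background = split_for_target(facts, "obligated_under")
print([to_atom(f) for f in positives])
print([to_atom(f) for f in background])
```

In this toy setting, a learner would be asked to explain the obligated_under atoms in terms of the remaining party_to and governed_by facts.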

We introduce a noise-tolerance extension to the FOLD-RM search that uses a confidence threshold on examples rather than treating all training facts as ground truth. This is crucial for the noisy extraction setting, where a non-trivial fraction of extracted facts are incorrect.
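One simple way to picture a confidence threshold on examples is sketched below. This is an assumption-laden illustration rather than the extension itself: it presumes each extracted example carries an extractor confidence score in [0, 1], the names ScoredExample and split_by_confidence are hypothetical, and the threshold value 0.8 is arbitrary. How the threshold interacts with the FOLD-RM search is not shown here.

```python
from typing import NamedTuple


class ScoredExample(NamedTuple):
    atom: str          # e.g. "obligated_under(acme, new_york_law)"
    confidence: float  # extractor confidence in [0, 1] (assumed to be available)


def split_by_confidence(examples, threshold):
    """Partition examples into those trusted as training facts
    (confidence >= threshold) and those treated as uncertain."""
    trusted = [e for e in examples if e.confidence >= threshold]
    uncertain = [e for e in examples if e.confidence < threshold]
    return trusted, uncertain


# Hypothetical scored examples; the scores are made up for illustration.
examples = [
    ScoredExample("obligated_under(acme, new_york_law)", 0.93),
    ScoredExample("obligated_under(acme, contract_17)", 0.41),
    ScoredExample("obligated_under(globex, delaware_law)", 0.80),
]

trusted, uncertain = split_by_confidence(examples, threshold=0.8)
print([e.atom for e in trusted])
print([e.atom for e in uncertain])
```

Under this sketch, only the trusted examples would be treated as ground truth during rule induction, while the low-confidence fact would no longer force the learner to cover it.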

Preliminary Results

On the legal graph benchmark, FOLD-RM with noise tolerance achieves 78.4% F1 on held-out relation instances, compared with 61.2% for the standard FOLD-RM baseline, a gain of 17.2 percentage points. The induced rules are compact and human-readable, as confirmed by an independent evaluation with legal domain experts. Biomedical results are pending and will appear in the final paper.
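For reference, F1 over relation instances can be computed as the harmonic mean of precision and recall on sets of predicted and gold atoms. The sketch below assumes instance-level scoring over exact-match atoms, which may differ from the evaluation protocol used for the reported numbers; the function name and the toy atoms are hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between a set of predicted relation instances and a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out instances, for illustration only.
gold = {"r(a, b)", "r(c, d)", "r(e, f)"}
predicted = {"r(a, b)", "r(c, d)", "r(g, h)"}
score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```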