Math Reasoning Verifier
An LLM-assisted verifier for multi-step mathematical reasoning chains. Uses a fine-tuned process reward model to score intermediate steps and identify the first incorrect inference in a proof-like derivation.
Python · PyTorch · Hugging Face · Lean4
Overview
This project explores using process reward models (PRMs) to verify multi-step mathematical reasoning in LLM outputs. The core motivation is that outcome supervision — judging only whether a final answer is correct — provides a weak training signal that allows models to be right for the wrong reasons. Process supervision, by contrast, scores each intermediate step, providing much richer feedback.
The verifier takes a reasoning chain as input, scores each step using a fine-tuned PRM, and returns a verdict identifying whether the chain is valid and, if not, pinpointing the first erroneous step. This can be used either for filtering generated solutions at inference time or as a training signal for reinforcement learning from process feedback.
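The verifier interface described above can be sketched as follows. This is a minimal illustration, not the project's actual API: the names `Verdict`, `verify_chain`, the `score_step` callback, and the 0.5 threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Verdict:
    valid: bool
    first_error_step: Optional[int]  # 0-based index of first bad step; None if valid
    step_scores: List[float]

def verify_chain(steps: List[str],
                 score_step: Callable[[List[str], int], float],
                 threshold: float = 0.5) -> Verdict:
    """Score every step with the PRM, then flag the first step whose
    probability of correctness falls below the threshold."""
    scores = [score_step(steps, i) for i in range(len(steps))]
    for i, s in enumerate(scores):
        if s < threshold:
            return Verdict(valid=False, first_error_step=i, step_scores=scores)
    return Verdict(valid=True, first_error_step=None, step_scores=scores)
```

A filtering loop at inference time would call `verify_chain` on each sampled solution and keep only chains with `valid=True` (or rerank by score).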
Implementation
The PRM is based on a language model fine-tuned on the MATH dataset with step-level annotations generated through a semi-automated pipeline. Each step receives a probability-of-correctness score in [0, 1], and the overall chain verdict is derived from the product of the step scores.
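The product aggregation can be written down directly. One caveat worth noting: multiplying many probabilities underflows on long chains, so summing log-scores is the numerically safer equivalent. The function names here are illustrative, not from the codebase.

```python
import math

def chain_score(step_scores):
    """Overall chain score as the product of per-step correctness
    probabilities (treating step correctness as independent)."""
    return math.prod(step_scores)

def chain_log_score(step_scores):
    """Equivalent score in log space; avoids floating-point underflow
    when chains are long. Assumes all scores are strictly positive."""
    return sum(math.log(s) for s in step_scores)
```

Because the product is monotone in each factor, ranking chains by `chain_log_score` gives the same ordering as ranking by `chain_score`.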
An experimental integration with Lean4 allows formal verification of steps that can be expressed as Lean propositions. When formal verification is possible, it takes precedence over the learned model’s score, providing a ground-truth signal for that step. The hybrid approach — formal where possible, learned where not — reflects a broader design philosophy of using symbolic methods to anchor neural reasoning.
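The precedence rule is easy to express as a small dispatcher. This is a sketch of the control flow only: `try_formal_check` stands in for the Lean4 bridge (returning `True`/`False` when a step can be formalized and checked, or `None` when it cannot be expressed as a Lean proposition), and `prm_score` stands in for the learned model.

```python
def hybrid_step_score(step, try_formal_check, prm_score):
    """Prefer a formal verdict when the step is formalizable;
    otherwise fall back to the learned PRM score."""
    formal = try_formal_check(step)
    if formal is not None:
        # Formal verification is treated as ground truth for this step.
        return 1.0 if formal else 0.0
    return prm_score(step)
```

The collapse to exactly 0.0 or 1.0 on the formal path is what makes the symbolic signal dominate: a formally refuted step zeroes out the product-based chain score regardless of how plausible the PRM found it.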
Current Status
This project is experimental and ongoing. The step-annotation pipeline requires substantial manual effort to validate, and the Lean4 integration covers only a narrow fragment of the mathematical reasoning tasks in the benchmark. Results on the MATH test set are promising but not yet publishable.