AI-Assisted Generation of Difficult Questions

1Mila - Quebec AI Institute, 2Université de Montréal, 3Princeton University, 4CU Boulder
Main pipeline

AI-assisted question generation: We propose a principled, AI-powered, human-in-the-loop approach for creating increasingly difficult evaluation benchmarks for mathematical reasoning. The figure outlines a five-step pipeline for generating high-quality questions. (a) Skill Pair Validation: The model checks that the two given skills are distinct. (b) Question Generation: The model is asked to generate a question requiring both skills. (c) Attempted Solution: The model is asked to solve the question with a defeatist approach. (d) Question Validation: The question is assessed for correctness, rigor, clarity, etc. (e) Final Solution: Valid questions are re-solved using stronger techniques such as in-context prompting and majority voting. Finally, the questions generated by this AI-only pipeline are verified by human annotators.
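As a rough illustration, the five steps can be strung together as a sequence of LLM calls. The sketch below is a minimal Python outline under stated assumptions: llm() is a hypothetical helper wrapping an LLM API, and the prompt wording, validation criteria, and number of voting samples are placeholders rather than the exact prompts used in the pipeline.

# Minimal sketch of the five-step question-generation loop.
# `llm` is a hypothetical helper that sends a prompt to a strong LLM
# and returns its text response; all prompt wording is illustrative.
from collections import Counter

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred LLM API here")

def generate_question(skill_a: str, skill_b: str, n_votes: int = 5):
    # (a) Skill pair validation: discard pairs the model judges too similar.
    verdict = llm(f"Are '{skill_a}' and '{skill_b}' distinct math skills? Answer yes or no.")
    if "yes" not in verdict.lower():
        return None

    # (b) Question generation: ask for a question that needs BOTH skills.
    question = llm(f"Write a challenging math question that requires both '{skill_a}' and '{skill_b}'.")

    # (c) Attempted solution: a first solve attempt, used to surface flaws in the question.
    attempt = llm(f"Solve this question, pointing out any ambiguity or missing information:\n{question}")

    # (d) Question validation: check correctness, rigor, clarity, etc.
    check = llm(f"Question:\n{question}\nAttempted solution:\n{attempt}\n"
                "Is the question well-posed, rigorous, and clearly stated? Answer valid or invalid.")
    if "invalid" in check.lower():
        return None

    # (e) Final solution: re-solve with majority voting over several samples.
    answers = [llm(f"Solve step by step and end with 'Answer: <answer>'.\n{question}")
               for _ in range(n_votes)]
    final = Counter(a.split("Answer:")[-1].strip() for a in answers).most_common(1)[0][0]
    return {"question": question, "answer": final}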

Abstract

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging mathematics questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. First, leveraging LLM metacognition [Didolkar et al., 2024], a strong LLM is used to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel, difficult questions: the LLM is prompted with a random pair of core skills that must both be used in the question. Requiring two very different skills in each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced by further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH², a dataset of higher-quality math questions, as evidenced by (a) lower performance of all models on MATH² than on MATH, and (b) higher performance on MATH when MATH² questions are used as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight, where human experts evaluate highly capable AI models with AI assistance. Also of interest is a striking relationship between models' performance on the two datasets: the success rate on MATH² is approximately the square of the success rate on MATH. This suggests that successfully solving a question in MATH² requires a nontrivial combination of two distinct math skills.
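For concreteness, the snippet below sketches the skill-pairing step described above: sampling random pairs of previously extracted skills to seed question generation. The skill names and the sample_skill_pairs helper are illustrative assumptions, not the actual skills extracted from MATH.

import random

# Hypothetical examples of "core skills"; the actual skills are extracted
# from MATH by a strong LLM, and these names are illustrative only.
extracted_skills = [
    "modular arithmetic",
    "geometric probability",
    "polynomial factorization",
    "counting with inclusion-exclusion",
    "trigonometric identities",
]

def sample_skill_pairs(skills, n_pairs, seed=0):
    """Sample random pairs of distinct skills to seed question generation."""
    assert n_pairs <= len(skills) * (len(skills) - 1) // 2
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(skills, 2)
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

for skill_a, skill_b in sample_skill_pairs(extracted_skills, n_pairs=3):
    print(f"Generate a question combining '{skill_a}' and '{skill_b}'")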

Results

We create 210 challenging math questions using the proposed pipeline, which together comprise the MATH² dataset, and evaluate 25 open-source and proprietary models spanning a range of parameter counts. Models show a consistent drop in performance when evaluated on MATH² compared to MATH, attesting to the higher difficulty of MATH² as a result of skill recombination.

Main bar plot

Comparison of Zero-Shot Performance of Various Models on MATH and the New Dataset MATH² - This figure illustrates the zero-shot Chain of Thought (CoT) performance of both open-source and proprietary models on two datasets: MATH and MATH², our generated dataset. Across the board, models demonstrate consistently lower performance on MATH² than on MATH.

Relative drops in performance as compared to MATH

o1-preview demonstrates the smallest relative drop (10.89%), whereas MetaMath-13B shows the largest (97.33%).

Main comparison
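For reference, the relative drop is the percentage decrease from a model's MATH accuracy to its MATH² accuracy. The accuracies in the example below are hypothetical placeholders, not the measured values.

def relative_drop(acc_math: float, acc_math2: float) -> float:
    """Percentage drop in accuracy on MATH² relative to MATH."""
    return 100.0 * (acc_math - acc_math2) / acc_math

# Hypothetical accuracies, for illustration only:
print(round(relative_drop(0.90, 0.80), 2))  # 11.11 (small relative drop)
print(round(relative_drop(0.30, 0.01), 2))  # 96.67 (large relative drop)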

Surprising relationship with performance on MATH

We observe that the performance of models on MATH² follows, in most cases, an approximately perfect quadratic relationship (Y = X²) with their performance on MATH. This implies that how well a model generalizes to MATH² is largely agnostic to its training procedure. We hypothesize the following explanation. Suppose there are N skills and s_i denotes the success rate of the model at correctly applying the i-th skill. Then its X value should reflect the average of the s_i's. Furthermore, on a random question using the i-th and j-th skills, the probability that the model answers correctly should be s_i s_j, since it has to successfully apply both skills. If the questions are created from pairs of skills chosen randomly and independently, then the Y value will be the average of the s_i s_j's, which by independence will be roughly X². Some models do deviate non-trivially from this quadratic relationship, such as DeepSeek-R1-Distill-Llama-3-8B (highest positive deviation) and o1-preview, Deepseek-R1-Distill-Qwen-32B, and Claude-3.5 Sonnet (highest negative deviations).

Squared Relationship
Relation between the performance of models on MATH² (Y) and the square of their performance on MATH (X²). As can be seen from the plot, Y ≈ X².
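The independence argument above can be checked with a small Monte Carlo simulation: draw per-skill success rates s_i, build questions from independently sampled skill pairs, and compare the average pair success rate against the square of the average per-skill success rate. This is a toy sanity check of the argument with arbitrarily chosen numbers, not the paper's analysis.

import numpy as np

rng = np.random.default_rng(0)

n_skills = 50
s = rng.uniform(0.4, 0.9, size=n_skills)   # per-skill success rates s_i

X = s.mean()                               # proxy for accuracy on MATH

# Questions built from random, independently chosen skill pairs:
i = rng.integers(0, n_skills, size=100_000)
j = rng.integers(0, n_skills, size=100_000)
Y = (s[i] * s[j]).mean()                   # proxy for accuracy on MATH²

print(f"X^2 = {X**2:.4f},  Y = {Y:.4f}")   # the two values nearly coincide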

MATH² questions act as superior in-context examples for solving MATH

The table below compares the performance of models on MATH under three prompting strategies. MAmmoTH 4-shot CoT uses exemplars from the MAmmoTH (Yue et al., 2023) evaluation suite. Skill-Based 4-shot CoT (Didolkar et al., 2024) retrieves exemplars from MATH based on the required skills (identified by GPT-4). Proposed 4-shot CoT selects MATH² exemplars in which at least one skill matches the target question. Using MATH² exemplars improves performance, with gains of up to 13.72% over the baseline (Llama-3.1-70B-Instruct (Dubey et al., 2024)).

MATH^2 ICL
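A minimal sketch of the proposed exemplar selection, assuming each MATH² question is stored alongside the skill pair it was generated from and each target question has been annotated with its required skills (e.g., by GPT-4); the data structures and field names below are hypothetical.

import random

def select_exemplars(target_skills, math2_pool, k=4, seed=0):
    """Pick k MATH² exemplars sharing at least one skill with the target question."""
    rng = random.Random(seed)
    matches = [ex for ex in math2_pool
               if set(ex["skills"]) & set(target_skills)]
    if len(matches) < k:                      # fall back to the rest of the pool if needed
        matches += [ex for ex in math2_pool if ex not in matches]
    return rng.sample(matches, min(k, len(matches)))

# Hypothetical pool entries, for illustration only:
math2_pool = [
    {"question": "Q1 ...", "solution": "S1 ...", "skills": ["modular arithmetic", "geometry"]},
    {"question": "Q2 ...", "solution": "S2 ...", "skills": ["combinatorics", "algebraic manipulation"]},
    {"question": "Q3 ...", "solution": "S3 ...", "skills": ["modular arithmetic", "probability"]},
    {"question": "Q4 ...", "solution": "S4 ...", "skills": ["geometry", "trigonometry"]},
    {"question": "Q5 ...", "solution": "S5 ...", "skills": ["probability", "combinatorics"]},
]

exemplars = select_exemplars(["modular arithmetic"], math2_pool, k=4)
prompt = "\n\n".join(f"Q: {ex['question']}\nA: {ex['solution']}" for ex in exemplars)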

Performance on human-modified vs. unmodified subsets of MATH²

Models find it more difficult to solve the subset of MATH² questions that were modified by human verifiers before entering the dataset than the questions that were kept exactly as generated by the AI pipeline (i.e., not modified at all). In fact, the human-modified subset is even more difficult than Level-5 MATH, the most difficult level of MATH.

Human-modified vs. unmodified subsets

Citation

@article{shah2024ai,
    title={{AI}-assisted generation of difficult math questions},
    author={Shah, Vedant and Yu, Dingli and Lyu, Kaifeng and Park, Simon and Yu, Jiatong and He, Yinghui and Ke, Nan Rosemary and Mozer, Michael and Bengio, Yoshua and Arora, Sanjeev and others},
    journal={arXiv preprint arXiv:2407.21009},
    year={2024}
}