AI-Assisted Generation of Difficult Questions

1Mila - Quebec AI Institute, 2Université de Montréal, 3Princeton University, 4CU Boulder
Main pipeline

AI-assisted question generation: We propose a principled, AI-powered, human-in-the-loop approach for creating increasingly difficult evaluation benchmarks for mathematical reasoning. The figure outlines a five-step pipeline for generating high-quality questions. (a) Skill Pair Validation: The model checks that the two given skills are distinct. (b) Question Generation: The model is asked to generate a question requiring both skills. (c) Attempted Solution: The model is asked to solve the question with a defeatist approach. (d) Question Validation: The question is assessed for correctness, rigor, clarity, etc. (e) Final Solution: Valid questions are re-solved using stronger techniques such as in-context prompting and majority voting. Finally, the questions generated by this AI-only pipeline are verified by human annotators.
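As a rough illustration, the five steps can be strung together as a sequence of LLM calls. The sketch below is a minimal Python outline under stated assumptions: llm() is a hypothetical helper wrapping an LLM API, and the prompt wording, validation criteria, and number of voting samples are placeholders rather than the exact prompts used in the pipeline.

# Minimal sketch of the five-step question-generation loop.
# `llm` is a hypothetical helper that sends a prompt to a strong LLM
# and returns its text response; all prompt wording is illustrative.
from collections import Counter

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your preferred LLM API here")

def generate_question(skill_a: str, skill_b: str, n_votes: int = 5):
    # (a) Skill pair validation: discard pairs the model judges too similar.
    verdict = llm(f"Are '{skill_a}' and '{skill_b}' distinct math skills? Answer yes or no.")
    if "yes" not in verdict.lower():
        return None

    # (b) Question generation: ask for a question that needs BOTH skills.
    question = llm(f"Write a challenging math question that requires both '{skill_a}' and '{skill_b}'.")

    # (c) Attempted solution: a first solve attempt, used to surface flaws in the question.
    attempt = llm(f"Solve this question, pointing out any ambiguity or missing information:\n{question}")

    # (d) Question validation: check correctness, rigor, clarity, etc.
    check = llm(f"Question:\n{question}\nAttempted solution:\n{attempt}\n"
                "Is the question well-posed, rigorous, and clearly stated? Answer valid or invalid.")
    if "invalid" in check.lower():
        return None

    # (e) Final solution: re-solve with majority voting over several samples.
    answers = [llm(f"Solve step by step and end with 'Answer: <answer>'.\n{question}")
               for _ in range(n_votes)]
    final = Counter(a.split("Answer:")[-1].strip() for a in answers).most_common(1)[0][0]
    return {"question": question, "answer": final}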

Abstract

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging mathematics questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. First, leveraging LLM metacognition [Didolkar et al., 2024], a strong LLM is used to extract core "skills" from existing math datasets. These skills serve as the basis for generating novel, difficult questions: the LLM is prompted with a random pair of core skills that must both be used in the question. Requiring two very different skills in each question makes finding such questions an "out of distribution" task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multi-turn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced by further LLM interactions. Applying this pipeline to skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH², a dataset of higher-quality math questions, as evidenced by (a) lower performance of all models on MATH² than on MATH, and (b) higher performance on MATH when MATH² questions are used as in-context examples. Although focused on mathematics, our methodology seems applicable to other domains requiring structured reasoning, and potentially as a component of scalable oversight, where human experts evaluate highly capable AI models with AI assistance. Also of interest is a striking relationship between models' performance on the two datasets: the success rate on MATH² is approximately the square of the success rate on MATH. This suggests that successfully solving a question in MATH² requires a nontrivial combination of two distinct math skills.
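For concreteness, the snippet below sketches the skill-pairing step described above: sampling random pairs of previously extracted skills to seed question generation. The skill names and the sample_skill_pairs helper are illustrative assumptions, not the actual skills extracted from MATH.

import random

# Hypothetical examples of "core skills"; the actual skills are extracted
# from MATH by a strong LLM, and these names are illustrative only.
extracted_skills = [
    "modular arithmetic",
    "geometric probability",
    "polynomial factorization",
    "counting with inclusion-exclusion",
    "trigonometric identities",
]

def sample_skill_pairs(skills, n_pairs, seed=0):
    """Sample random pairs of distinct skills to seed question generation."""
    assert n_pairs <= len(skills) * (len(skills) - 1) // 2
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_pairs:
        a, b = rng.sample(skills, 2)
        pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

for skill_a, skill_b in sample_skill_pairs(extracted_skills, n_pairs=3):
    print(f"Generate a question combining '{skill_a}' and '{skill_b}'")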

Results

We create 210 challenging math questions using the proposed pipeline, which together comprise the MATH² dataset, and evaluate 25 open-source and proprietary models spanning a range of parameter counts. Models show a consistent drop in performance when evaluated on MATH² compared to MATH, attesting to the higher difficulty of MATH² as a result of skill recombination.

Main bar plot

Comparison of Zero-Shot Performance of Various Models on MATH and the New Dataset MATH² - This figure illustrates the zero-shot Chain of Thought (CoT) performance of both open-source and proprietary models on two datasets: MATH and MATH², our generated dataset. Across the board, models demonstrate consistently lower performance on MATH² than on MATH.

Relative drops in performance as compared to MATH

o1-preview demonstrates the smallest relative drop (10.89%), whereas MetaMath-13B shows the largest (97.33%).

Main comparison
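For reference, the relative drop is the percentage decrease from a model's MATH accuracy to its MATH² accuracy. The accuracies in the example below are hypothetical placeholders, not the measured values.

def relative_drop(acc_math: float, acc_math2: float) -> float:
    """Percentage drop in accuracy on MATH² relative to MATH."""
    return 100.0 * (acc_math - acc_math2) / acc_math

# Hypothetical accuracies, for illustration only:
print(round(relative_drop(0.90, 0.80), 2))  # 11.11 (small relative drop)
print(round(relative_drop(0.30, 0.01), 2))  # 96.67 (large relative drop)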

Surprising relationship with performance on MATH

We observe that the performance of models on MATH² follows, in most cases, an approximately perfect quadratic relationship (Y = X²) with their performance on MATH. This implies that how well a model generalizes to MATH² is largely agnostic to its training procedure. We hypothesize the following explanation. Suppose there are N skills and s_i denotes the success rate of the model at correctly applying the i-th skill. Then its X value should reflect the average of the s_i's. Furthermore, on a random question using the i-th and j-th skills, the probability that the model answers correctly should be s_i s_j, since it has to successfully apply both skills. If the questions are created from pairs of skills chosen randomly and independently, then the Y value will be the average of the s_i s_j's, which by independence will be roughly X². Some models do deviate non-trivially from this quadratic relationship, such as DeepSeek-R1-Distill-Llama-3-8B (highest positive deviation) and o1-preview, Deepseek-R1-Distill-Qwen-32B, and Claude-3.5 Sonnet (highest negative deviations).

Squared Relationship
Relation between the performance of models on MATH² (Y) and the square of their performance on MATH (X²). As can be seen from the plot, Y ≈ X².
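The independence argument above can be checked with a small Monte Carlo simulation: draw per-skill success rates s_i, build questions from independently sampled skill pairs, and compare the average pair success rate against the square of the average per-skill success rate. This is a toy sanity check of the argument with arbitrarily chosen numbers, not the paper's analysis.

import numpy as np

rng = np.random.default_rng(0)

n_skills = 50
s = rng.uniform(0.4, 0.9, size=n_skills)   # per-skill success rates s_i

X = s.mean()                               # proxy for accuracy on MATH

# Questions built from random, independently chosen skill pairs:
i = rng.integers(0, n_skills, size=100_000)
j = rng.integers(0, n_skills, size=100_000)
Y = (s[i] * s[j]).mean()                   # proxy for accuracy on MATH²

print(f"X^2 = {X**2:.4f},  Y = {Y:.4f}")   # the two values nearly coincide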

MATH² questions act as superior in-context examples for solving MATH

The table below compares the performance of models on MATH under three prompting strategies. MAmmoTH 4-shot CoT uses exemplars from the MAmmoTH (Yue et al., 2023) evaluation suite. Skill-Based 4-shot CoT (Didolkar et al., 2024) retrieves exemplars from MATH based on the required skills (identified by GPT-4). Proposed 4-shot CoT selects MATH² exemplars in which at least one skill matches the target question. Using MATH² exemplars improves performance, with gains of up to 13.72% over the baseline (Llama-3.1-70B-Instruct (Dubey et al., 2024)).

MATH^2 ICL
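A minimal sketch of the proposed exemplar selection, assuming each MATH² question is stored alongside the skill pair it was generated from and each target question has been annotated with its required skills (e.g., by GPT-4); the data structures and field names below are hypothetical.

import random

def select_exemplars(target_skills, math2_pool, k=4, seed=0):
    """Pick k MATH² exemplars sharing at least one skill with the target question."""
    rng = random.Random(seed)
    matches = [ex for ex in math2_pool
               if set(ex["skills"]) & set(target_skills)]
    if len(matches) < k:                      # fall back to the rest of the pool if needed
        matches += [ex for ex in math2_pool if ex not in matches]
    return rng.sample(matches, min(k, len(matches)))

# Hypothetical pool entries, for illustration only:
math2_pool = [
    {"question": "Q1 ...", "solution": "S1 ...", "skills": ["modular arithmetic", "geometry"]},
    {"question": "Q2 ...", "solution": "S2 ...", "skills": ["combinatorics", "algebraic manipulation"]},
    {"question": "Q3 ...", "solution": "S3 ...", "skills": ["modular arithmetic", "probability"]},
    {"question": "Q4 ...", "solution": "S4 ...", "skills": ["geometry", "trigonometry"]},
    {"question": "Q5 ...", "solution": "S5 ...", "skills": ["probability", "combinatorics"]},
]

exemplars = select_exemplars(["modular arithmetic"], math2_pool, k=4)
prompt = "\n\n".join(f"Q: {ex['question']}\nA: {ex['solution']}" for ex in exemplars)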

Performance on human-modified vs. unmodified subsets of MATH²

Models find it more difficult to solve the subset of MATH² questions that were modified by human verifiers before entering the dataset than the questions that were kept exactly as generated by the AI pipeline (i.e., not modified at all). In fact, the human-modified subset is even more difficult than Level-5 MATH, the most difficult level of MATH.

Human-modified vs. unmodified subsets

Citation

@article{shah2024ai,
    title={{AI}-assisted generation of difficult math questions},
    author={Shah, Vedant and Yu, Dingli and Lyu, Kaifeng and Park, Simon and Yu, Jiatong and He, Yinghui and Ke, Nan Rosemary and Mozer, Michael and Bengio, Yoshua and Arora, Sanjeev and others},
    journal={arXiv preprint arXiv:2407.21009},
    year={2024}
}