
Are AIs Smarter than a 5th Grader?

By Anne Newmarch

The Math Word Problem is a natural language processing (NLP) challenge that has seen exciting progress within the last few years. It requires a machine learning model to read a contextualised math problem, identify the relevant information, and produce an answer that would require multi-step reasoning from a human [1]. Most models are trained on primary school mathematics problems, with the aim of scaling to higher levels of complexity once high accuracy is achieved.


This article will discuss some recent academic papers addressing the Math Word Problem. While it is tempting to conclude that one paper is better than another based on accuracy rates, such rates are contingent on the dataset on which the model was trained. If an independent set of primary-school-level questions were given to these models, it is unclear which would outperform the others. Ultimately, it may depend more on the nature of the questions than on the models themselves.


One approach to solving the Math Word Problem is to generate an expression tree, from which computing the final answer is straightforward. Reading the tree from the bottom up reveals the multi-step reasoning needed to solve the problem.
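
To make the idea concrete, here is a minimal sketch (my own illustration, not code from any of the papers discussed) of an expression tree for a toy word problem, evaluated from the leaves up:

```python
# Toy word problem: "Sam has 3 bags with 4 apples each and eats 2 apples.
# How many apples are left?"  The tree (3 * 4) - 2 encodes the reasoning;
# evaluating it bottom-up yields the answer.
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    op: str                      # '+', '-', '*', or '/'
    left: Union["Node", float]   # child subtree, or a quantity from the question
    right: Union["Node", float]

def evaluate(node):
    """Recursively evaluate the expression tree from the leaves up."""
    if not isinstance(node, Node):
        return node  # a leaf: a quantity taken directly from the problem text
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    return ops[node.op](evaluate(node.left), evaluate(node.right))

tree = Node("-", Node("*", 3, 4), 2)
print(evaluate(tree))  # 10
```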


A paper from Singapore Management University in 2020 [1] generated this type of solution with their novel Graph2Tree model. Their fully supervised approach processed the input text and extracted the quantities and their related words. This information was then projected onto a graph capturing the relationships between the concepts in the question. A graph convolutional network (GCN) and a tree-based decoder were then used to produce an expression tree. The model was tested against questions from the MAWPS [2] and Math23K [3] datasets, achieving one of the highest accuracy scores for this problem at 77.4% on Math23K.
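
The full encoder is beyond the scope of this article, but the core propagation step of a GCN layer is compact. The sketch below runs a single generic GCN step over a made-up quantity graph with random features; it illustrates the mechanism only, and is not the graph-construction rules or architecture of Zhang et al. [1]:

```python
import numpy as np

# Hypothetical quantity graph for "3 bags with 4 apples each":
# nodes = [quantity "3", quantity "4", word "bags", word "apples"],
# with each quantity linked to its associated noun.
A = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

X = np.random.randn(4, 8)   # toy node features (e.g. word embeddings)
W = np.random.randn(8, 8)   # learnable weights of one GCN layer

# One graph-convolution step: add self-loops, symmetrically normalise the
# adjacency matrix, aggregate neighbour features, then apply the linear
# transform and a ReLU.
A_hat = A + np.eye(len(A))
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)
print(H.shape)  # (4, 8): updated node representations, fed to the tree decoder
```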


An example of two expression trees that both evaluate to the correct answer for a Math Word Problem. Hong et al. [4]


A more recent approach by Hong et al. in 2021 [4] also generates a solution tree, but instead uses weakly supervised learning. Fully supervised learning uses both the correct answer and the solution tree as the target of the learning algorithm; the authors argue that this restricts the variety of solutions, as only one way of reaching the correct answer is produced. There are many distinct approaches to solving these problems, so the study trained only on the correct answer, not the tree. This allowed the model to suggest a range of correct ways to arrive at the same solution. Furthermore, Hong et al. took an interesting new approach by programming the model to fix its own mistakes, trying out different values in an incorrect expression tree until it found the correct answer. This was intended to more closely imitate the way humans learn, and was coined by the researchers as ‘learning by fixing’ [4]. If a correct solution was reached, it was committed to memory to encourage more diverse solutions. The researchers showed that their model generated a range of different solutions to the same problem, at 45-60% accuracy on Math23K.
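
As a rough illustration of the fixing idea, the sketch below brute-forces small edits to a predicted expression until it evaluates to the gold answer, then returns the fixed expression as a pseudo-label. This is a drastically simplified stand-in for the tree-fixing procedure of Hong et al. [4], using made-up inputs:

```python
from itertools import product

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def fix_expression(quantities, predicted_ops, gold_answer):
    """Search for an operator assignment whose left-to-right evaluation
    hits the gold answer, trying the smallest edits to the model's
    (incorrect) prediction first."""
    candidates = sorted(
        product(OPS, repeat=len(predicted_ops)),
        key=lambda ops: sum(o != p for o, p in zip(ops, predicted_ops)),
    )
    for ops in candidates:
        try:
            value = quantities[0]
            for op, q in zip(ops, quantities[1:]):
                value = OPS[op](value, q)
        except ZeroDivisionError:
            continue
        if abs(value - gold_answer) < 1e-6:
            return ops  # a "fixed" expression, usable as a pseudo-label
    return None

# The model predicted 3 + 4 + 2 (= 9) but the gold answer is 10;
# the smallest fix found is 3 * 4 - 2.
print(fix_expression([3, 4, 2], ("+", "+"), 10))  # ('*', '-')
```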


A different approach to solving the Math Word Problem is to use a verifier to improve accuracy, as demonstrated by a paper from OpenAI led by Cobbe [5]. The researchers showed that a verifier, given a range of generated candidate solutions, could accurately estimate the probability that each was correct. The solution with the greatest probability of being correct was then chosen as the output. Cobbe et al. showed that this approach increased the accuracy of a fine-tuned model by as much as 20%.
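
The generate-then-verify loop can be sketched in a few lines. The `generator` and `verifier` callables below are hypothetical stand-ins for the fine-tuned language model and the trained verifier of Cobbe et al. [5]; only the selection logic is shown:

```python
import random

def solve_with_verifier(question, generator, verifier, n_candidates=100):
    """Sample many candidate solutions, score each with a verifier that
    estimates the probability the solution is correct, return the best."""
    candidates = [generator(question) for _ in range(n_candidates)]
    scored = [(verifier(question, sol), sol) for sol in candidates]
    best_score, best_solution = max(scored, key=lambda pair: pair[0])
    return best_solution, best_score

# Toy usage with stand-in models: the generator guesses, and the verifier
# happens to rank the correct answer highest.
best, score = solve_with_verifier(
    "What is 7 * 6?",
    generator=lambda q: str(random.choice([40, 41, 42, 43])),
    verifier=lambda q, s: 1.0 if s == "42" else 0.1,
    n_candidates=10,
)
print(best, score)  # typically: 42 1.0 (assuming "42" is sampled at least once)
```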


A comparison between fine-tuning and verification on 6B and 175B models. Given a large enough training set, the test solve rate of the verification model surpasses that of the fine-tuned one. Cobbe et al. [5]


However, all of this research appears to have been blown out of the water by a paper released this year in a joint effort from MIT, Columbia University, Harvard University, and the University of Waterloo [6]. Drori et al., in their self-professed ‘milestone’ paper [6], have produced a transformer-based model capable of solving math word problems at a university level with perfect accuracy. The model can also generate new university-level problems well enough that students cannot reliably tell whether a given problem was machine-generated or human-written. As the model requires no additional programming when switching between courses, the researchers state that it could be applied to any STEM course.


There is, however, a caveat to this, and it is not a small one: Drori et al. essentially solved a slightly different problem than the one previously discussed, as their model requires additional contextual information alongside the input text. They attribute their success, and the failures of previous research, to this fact.

The model [6] works as follows: an input question is tidied and given additional context, such as the mathematics topic and the relevant programming language and libraries. The researchers report that the majority of the questions required minor or no modification. A portion of the modifications could be done automatically, while the rest appears to have been done manually. The transformed question is then fed to the OpenAI Codex transformer [7], a highly successful machine learning model that takes in text and generates corresponding code. The produced program is then run to obtain the final answer.
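
A rough sketch of this pipeline is shown below. The `codex_complete` callable is a hypothetical wrapper around whichever Codex-style completion API is available, and the prompt format is illustrative rather than the exact augmentation used by Drori et al. [6]:

```python
def solve_via_program_synthesis(question, topic, codex_complete):
    """Augment the question with context, ask a code-generation model for a
    program, then execute that program to obtain the answer."""
    prompt = (
        f"Topic: {topic}\n"
        "Write a Python program using sympy that solves the following problem "
        "and stores the result in a variable named `answer`.\n\n"
        f"Problem: {question}\n"
    )
    program = codex_complete(prompt)   # hypothetical call returning Python source
    namespace = {}
    # Note: executing model-generated code should be sandboxed in practice.
    exec(program, namespace)
    return namespace.get("answer")
```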


The researchers argue that providing this additional key context is fair, as the students who take these courses also rely on implicit knowledge for their answers. Additionally, future research may allow the model to fully automate question modification.


This recent development has not been without backlash. In a paper published only 20 days after Drori et al., Biderman and Raff [8] take the stance that this type of machine learning research ‘has not engaged with the issues and implications of real-world application of their results’ [8]. They argue that machine learning models like that of Drori et al. will be used by students to cheat, especially given that the results are often not flagged by plagiarism detection tools.


They are correct that students will use these models to cheat if given mainstream access. However, this is not a new situation: a primary school student cheating on their times-tables homework with a calculator is not so functionally different from a university student cheating on their calculus assignment with this model [6]. The result is the same: neither student is likely to perform well under test conditions. And for online exams, a tool such as this disappears into the haze of the many methods students already use to cheat.


While Drori et al. [6] have found success, this is not the end of the road for the earlier research. The Math Word Problem is not only about solving the questions themselves – it is about learning how we can improve our machine learning techniques to facilitate reasoning. If we believe there are problems worth solving that require reasoning which cannot be directly programmed, then continued research into graphical representations of relationships and learning by fixing could be crucial to success. The progress made and the effort researchers have put into these methods remain valuable.


Each step of the pipeline, from the original question to its modified version, the program generated by Codex, and the output given as the answer. Drori et al. [6]


References


  1. J. Zhang et al., “Graph-to-tree learning to solve math word problems,” in Proc. 58th Annu. Meeting of the Association for Computational Linguistics, Jul. 2020, pp. 3928-3937. [Online]. Available: https://ink.library.smu.edu.sg/sis_research/5273

  2. R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, and H. Hajishirzi, “MAWPS: A math word problem repository,” in Proc. NAACL-HLT 2016, pp. 1152-1157. [Online]. Available: https://aclanthology.org/N16-1136.pdf

  3. Y. Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” in Proc. 2017 Conf. Empirical Methods Natural Language Processing, Sep. 2017, pp. 845-854, doi:10.18653/v1/D17-1088.

  4. Y. Hong, Q. Li, D. Ciao, S. Huang, and S. Zhu, “Learning by fixing: Solving math word problems with weak supervision,” in Proc. 35th AAAI Conf. Artificial Intelligence (AAAI-21), 2021. [Online]. Available: https://www.aaai.org/AAAI21Papers/AAAI-5790.HongY.pdf

  5. K. Cobbe et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, Oct. 2021. [Online]. Available: https://arxiv.org/pdf/2110.14168.pdf

  6. I. Drori et al., “A neural network solves and generates mathematics problems by program synthesis: Calculus, differential equations, linear algebra, and more,” arXiv preprint arXiv:2112.15594, Dec. 2021. [Online]. Available: https://arxiv.org/pdf/2112.15594.pdf

  7. M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, Jul. 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

  8. S. Biderman and E. Raff, “Neural language models are effective plagiarists,” arXiv preprint arXiv:2201.07406, Jan. 2022. [Online]. Available: https://arxiv.org/pdf/2201.07406.pdf

