How Many Grades Can BERT Reach? Seq2Seq Tackles Elementary Math Problems


Reprinted from|PaperWeekly

©PaperWeekly Original · Author|Su Jianlin

Affiliation|Zhuiyi Technology

Research Direction|NLP, Neural Networks

▲ The years of “Chicken and Rabbit in the Same Cage”

“Profit and Loss Problems”, “Age Problems”, “Tree Planting Problems”, “Cows Eating Grass Problems”, “Profit Problems”… Were you ever tortured by the many varieties of math word problems in elementary school? No worries: machine learning models can now solve these problems for us. Let’s see what grade level they can reach!

This article provides a baseline for solving elementary Math Word Problems: trained on the ape210k dataset [1], a Seq2Seq model directly generates executable mathematical expressions. The Large version of the model ultimately reaches an accuracy of 73%+, surpassing the result reported in the ape210k paper.

By “tackling head-on” we mean directly generating readable expressions close to human solutions, without any special transformations or template processing.


Data Processing

Let’s first look at what the ape210k data looks like:

{
    "id": "254761",
    "segmented_text": "小 王 要 将 150 千 克 含 药 量 20% 的 农 药 稀 释 成 含 药 量 5% 的 药 水 . 需 要 加 水 多 少 千 克 ?",
    "original_text": "小王要将150千克含药量20%的农药稀释成含药量5%的药水.需要加水多少千克?",
    "ans": "450",
    "equation": "x=150*20%/5%-150"
}

{
    "id": "325488",
    "segmented_text": "一 个 圆 形 花 坛 的 半 径 是 4 米 , 现 在 要 扩 建 花 坛 , 将 半 径 增 加 1 米 , 这 时 花 坛 的 占 地 面 积 增 加 了 多 少 米 * * 2 .",
    "original_text": "一个圆形花坛的半径是4米,现在要扩建花坛,将半径增加1米,这时花坛的占地面积增加了多少米**2.",
    "ans": "28.26",
    "equation": "x=(3.14*(4+1)**2)-(3.14*4**2)"
}
We mainly care about the original_text, equation, and ans fields: original_text is the problem statement, equation is the calculation process (usually starting with x=), and ans is the final answer. We want to train a model that generates equation from original_text, so that ans can then be obtained directly with Python’s eval function.

However, some preprocessing is needed, because the equation provided by ape210k is not always directly eval-able. For example, 150*20%/5%-150 is not a legal Python expression. The processing I did is as follows:

  1. Replace percentages like a% uniformly with (a/100);
  2. Replace mixed fractions like a(b/c) uniformly with (a+b/c);
  3. For proper fractions written as (a/b) in the problem text, drop the parentheses so they become a/b;
  4. Replace ratios written with a colon : uniformly with /.
After these substitutions, most equations become directly eval-able; we can then compare the eval result with ans and keep only the problems where the two agree.
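
For concreteness, here is a minimal sketch of rules 1, 2 and 4 as regex substitutions over the equation string (rule 3 operates on the problem text instead); the exact patterns and the function name are my own illustration, not the project’s code:

import re

def normalize_equation(equation):
    # Rewrite an ape210k equation into an eval-able Python expression.
    if equation.startswith('x='):
        equation = equation[2:]
    # Rule 1: percentages a% -> (a/100)
    equation = re.sub(r'(\d+(?:\.\d+)?)%', r'(\1/100)', equation)
    # Rule 2: mixed fractions a(b/c) -> (a+b/c)
    equation = re.sub(r'(\d+)\((\d+)/(\d+)\)', r'(\1+\2/\3)', equation)
    # Rule 4: ratios a:b -> a/b
    equation = equation.replace(':', '/')
    return equation

eq = normalize_equation('x=150*20%/5%-150')
print(eq)        # 150*(20/100)/(5/100)-150
print(eval(eq))  # 450.0, matching ans = "450"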
There is still room for improvement, though: the resulting expressions may contain redundant parentheses (removing them leaves an equivalent expression). So we add one more step that traverses each pair of parentheses and deletes the pair whenever the remaining expression has the same value. This gives shorter expressions on average, which are easier to generate.
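
The following rough sketch illustrates this idea (again my own illustration, assuming the expression already evals): repeatedly try deleting each matched pair and keep any deletion that leaves the value unchanged:

def remove_redundant_brackets(equation):
    # Greedily drop any parenthesis pair whose removal leaves the
    # value of the expression unchanged.
    target = eval(equation)
    changed = True
    while changed:
        changed = False
        stack, pairs = [], []
        for i, ch in enumerate(equation):  # locate matched pairs
            if ch == '(':
                stack.append(i)
            elif ch == ')':
                pairs.append((stack.pop(), i))
        for left, right in pairs:
            candidate = equation[:left] + equation[left+1:right] + equation[right+1:]
            try:
                if abs(eval(candidate) - target) < 1e-9:
                    equation, changed = candidate, True
                    break  # indices are stale now, rescan from the top
            except Exception:
                pass  # this removal broke the syntax; skip it
    return equation

print(remove_redundant_brackets('(3.14*(4+1)**2)-(3.14*4**2)'))
# 3.14*(4+1)**2-3.14*4**2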
Finally, we obtained the following usable dataset:

▲ Statistics of the usable dataset after cleaning

The remaining problems are mostly erroneous or garbled, so we simply set them aside for now.


Model Overview

There is not much to say about the model itself: it takes original_text as input and equation as output, and trains a Seq2Seq model on the “BERT+UniLM” architecture. If you have any doubts about this setup, please read “From Language Models to Seq2Seq: Transformers Are Like a Play, It All Depends on the Mask”.

Project Link:
http://github.com/bojone/ape210k_baseline
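
For reference, here is a minimal sketch of how such a model can be assembled with bert4keras, the library the linked project is built on (the paths below are placeholders, and the training loop and Seq2Seq loss are omitted):

from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer

# Placeholder paths to a pretrained Chinese BERT checkpoint.
config_path = 'bert_config.json'
checkpoint_path = 'bert_model.ckpt'
dict_path = 'vocab.txt'

tokenizer = Tokenizer(dict_path, do_lower_case=True)

# application='unilm' applies the UniLM attention mask, turning the
# bidirectional BERT encoder into a Seq2Seq model.
model = build_transformer_model(
    config_path,
    checkpoint_path,
    application='unilm',
)
model.summary()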
I trained on a single TITAN RTX (22G memory), with the Adam optimizer and a learning rate of 2e-5. The Base version used a batch_size of 32 and needed about 25 epochs, each taking about 50 minutes (including evaluation on the validation set); the Large version used a batch_size of 16 and needed about 15 epochs, each taking about 2 hours (likewise including validation).
By the way, speaking of Large: since UniLM reuses the MLM weights, we cannot use Harbin Institute of Technology’s open-source RoBERTa-wwm-ext-large [2], because the MLM weights in that release are randomly initialized (its Base version is normal and usable). For the Large version I recommend the weights open-sourced by Tencent UER [3]; they were originally in PyTorch, and I converted them to a TF version, available for download via the link below.

Cloud Disk Link:

https://pan.baidu.com/s/1Xp_ttsxwLMFDiTPqmRABhg

Extraction Code: l0k6
The results are shown in the table below:

▲ Accuracy of the Base and Large models

The Large model’s result is significantly higher than the 70.20% reported in the ape210k paper “Ape210K: A Large-Scale and Template-Rich Dataset of Math Word Problems” [4], indicating that our model is a reasonably strong baseline.

It seems likely that Seq2Seq techniques for alleviating the Exposure Bias problem (see “Analysis and Countermeasures of the Exposure Bias Phenomenon in Seq2Seq”) could improve the model further; a copy mechanism might strengthen the consistency between the output and the numbers in the input; and the sequences could be shortened further (for example, replacing the four-character 3.14 with the two-letter pi). These are left for readers to try.


Standard Output

From a modeling perspective, our task is essentially complete: the model only needs to output the expression, and evaluation only needs to check whether eval-ing that expression reproduces the reference answer.

However, from a practical perspective, we also need to further standardize the output, determining whether to output a decimal, integer, fraction, or percentage based on different problems. This requires us to: 1) decide when to output which format; 2) convert the results according to the specified format.

The first step is relatively simple: we can generally judge from keywords in the problem or the equation. For example, if the expression contains decimals, the output is generally a decimal; if the problem asks “how many vehicles”, “how many items”, or “how many people”, the output must be an integer; if it explicitly asks “what fraction” or “what percentage”, the output is a fraction or a percentage respectively.
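
As a toy illustration of such keyword rules (the keyword lists below are my own assumptions, not an exhaustive rule set; the dataset is Chinese, so the keywords are too):

def infer_output_format(question, equation):
    # Pick an output format from keywords; illustrative only.
    if '百分之几' in question:  # "what percentage"
        return 'percentage'
    if '几分之几' in question:  # "what fraction"
        return 'fraction'
    # Counts of vehicles / items / people must be integers.
    if any(k in question for k in ('多少辆', '多少个', '多少人')):
        return 'integer'
    return 'decimal'  # default, e.g. when the expression contains decimals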

The harder cases are those that require rounding, such as “Each cake costs 7.90 yuan; at most how many cakes can you buy with 50 yuan?”, which requires taking the floor of 50/7.90; other problems require rounding up instead. To my surprise, however, ape210k contains no rounding problems at all, so the issue does not arise here. If you do encounter a dataset with rounding, and the rules are hard to judge, the most direct method is to include the rounding symbols in the equation and let the model predict them.
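
For the floor case above, the arithmetic is just a one-liner:

import math

print(math.floor(50 / 7.90))  # 6 cakes at most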

The second step looks a bit more complex, mainly because of fractions. Readers may not know how to keep the result of a computation as an exact fraction: for example, eval('(1+2)/4') gives 0.75 (in Python 3), but sometimes we want the result to be the fraction 3/4.

In fact, keeping results as fractions falls under CAS (Computer Algebra Systems), that is, symbolic rather than numerical computation. Fortunately, Python has such a tool: SymPy [5]. See the example below:

from sympy import Integer
import re

# Symbolic arithmetic keeps the exact fraction:
r = (Integer(1) + Integer(2)) / Integer(4)
print(r)  # 3/4 instead of 0.75

equation = '(1+2)/4'
print(eval(equation))  # 0.75 (plain float division)

# Wrap every integer literal in Integer(...) so that eval stays symbolic.
# Raw strings are required here: without them, '\1' is the character '\x01'.
# This simple pattern also assumes the expression contains no decimals.
new_equation = re.sub(r'(\d+)', r'Integer(\1)', equation)
print(new_equation)  # (Integer(1)+Integer(2))/Integer(4)
print(eval(new_equation))  # 3/4
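
If the expression does contain decimals such as 3.14, one option (an assumption-level sketch, not from the original article) is to wrap every number in SymPy’s Rational, which accepts decimal strings:

from sympy import Rational
import re

def symbolic_eval(equation):
    # Convert every number (integer or decimal) to an exact Rational.
    new_eq = re.sub(r'\d+\.\d+|\d+', lambda m: 'Rational("%s")' % m.group(), equation)
    return eval(new_eq)

print(symbolic_eval('(1+2)/4'))    # 3/4
print(symbolic_eval('3.14*4**2'))  # 1256/25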


Article Summary

This article introduced a Seq2Seq baseline for math word problems: using “BERT+UniLM” to convert the problem directly into an eval-able expression, along with some experience in standardizing the output. With the UniLM built on BERT Large, we reached an accuracy of 73%+, surpassing the result of the original paper.

So, what grade do you think it can reach?

References

[1] https://github.com/Chenny0808/ape210k
[2] https://github.com/ymcui/Chinese-BERT-wwm
[3] https://github.com/dbiir/UER-py
[4] https://arxiv.org/pdf/2009.11506v1.pdf
[5] https://www.sympy.org/en/index.html