Real-Time Recognition of Handwritten Chinese Characters Spanning a Large Inventory of 30,000 Characters

Vol. 1, Issue 5, September 2017, by the Handwriting Recognition Team


Handwriting recognition is more important than ever given the prevalence of mobile phones, tablets, and wearable gear like smartwatches. The large symbol inventory required to support Chinese handwriting recognition on such mobile devices poses unique challenges. This article describes how we met those challenges to achieve real-time performance on iPhone, iPad, and Apple Watch (in Scribble mode). Our recognition system, based on deep learning, accurately handles a set of up to 30,000 characters. To achieve acceptable accuracy, we paid particular attention to data collection conditions, representativeness of writing styles, and training regimen. We found that, with proper care, even larger inventories are within reach. Our experiments show that accuracy only degrades slowly as the inventory increases, as long as we use training data of sufficient quality and in sufficient quantity.

Introduction

Handwriting recognition can enhance user experience on mobile devices, particularly for Chinese input given the relative complexity of keyboard methods. Chinese handwriting recognition is uniquely challenging, due to the large size of the underlying character inventory. Unlike alphabet-based writing, which typically involves on the order of 100 symbols, the set of Hànzì characters in Chinese National Standard GB18030-2005 contains 20,533 entries, and many additional logographic characters are in use throughout Greater China.

For computational tractability, it is usual to focus on a restricted number of characters, deemed most representative of usage in daily life. Thus, the standard GB2312-80 set includes only 6,763 entries (3,755 and 3,008 characters in the level-1 and level-2 sets, respectively). The closely aligned character set used in the popular CASIA databases, built by the Institute of Automation of the Chinese Academy of Sciences, comprises a total of 7,356 entries [6]. The SCUT-COUCH database has similar coverage [8].

These sets tend to reflect commonly used characters across the entire population of Chinese writers. At the individual user level, however, what is “commonly used” typically varies from one person to the next. Most people need at least a handful of characters deemed “infrequently written,” as they occur in, e.g., proper names relevant to them. Thus, ideally, Chinese handwriting recognition algorithms need to scale up to at least the level of GB18030-2005.

While early recognition algorithms mainly relied on structural methods based on individual stroke analysis, the need to achieve stroke-order independence later sparked interest in statistical methods using holistic shape information [5]. This obviously complicates large-inventory recognition, as correct character classification tends to get harder as the number of categories to disambiguate grows [3].

On Latin-script tasks such as MNIST [4], convolutional neural networks (CNNs) soon emerged as the way to go [11]. Given a sufficient amount of training data, supplemented with synthesized samples as necessary, CNNs decidedly achieved state-of-the-art results [1], [10]. The number of categories in those studies, however, was very small (10).

When we started looking into the large-scale recognition of Chinese characters some time ago, CNNs seemed to be the obvious choice. But that approach required scaling up CNNs to a set of approximately 30,000 characters while simultaneously maintaining real-time performance on embedded devices. This article focuses on the challenges involved in meeting expectations in terms of accuracy, character coverage, and robustness to writing styles.

System Configuration

We adopt in this work a general CNN architecture similar to that used in previous handwriting recognition experiments on the MNIST task (see, e.g., [1], [10]). The configuration of the overall system is illustrated in Figure 1.

Figure 1. Typical CNN Architecture (With 2 Consecutive Stages of Convolution and Subsampling)
A diagram of a convolutional neural network that shows input on the left and output on the right.  The diagram shows four inner layers that represent feature maps. The layers are convolution, subsampling, convolution, and subsampling.

The input is a medium-resolution image of 48x48 pixels (chosen for performance reasons) representing a handwritten Chinese character. We feed this input into a number of feature extraction layers that alternate convolution and subsampling. The last feature extraction layer is connected to the output via a fully connected layer.
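
As a rough illustration of what such an image-based front end involves, the sketch below rasterizes online ink (one list of (x, y) points per stroke) into a 48x48 grayscale bitmap. The scaling, padding, and stroke-width choices here are our own assumptions; the article does not specify the actual preprocessing.

    import numpy as np
    from PIL import Image, ImageDraw

    def render_strokes(strokes, size=48, pad=4, width=2):
        # Rasterize online ink into a size x size grayscale image in [0, 1].
        # `strokes` is a list of strokes, each a list of (x, y) points.
        pts = np.concatenate([np.asarray(s, dtype=float) for s in strokes])
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        # Scale the bounding box of the ink to fit inside the padded image.
        scale = (size - 2 * pad) / max((hi - lo).max(), 1e-6)
        img = Image.new("L", (size, size), 0)
        draw = ImageDraw.Draw(img)
        for stroke in strokes:
            xy = [(pad + (x - lo[0]) * scale, pad + (y - lo[1]) * scale)
                  for x, y in stroke]
            draw.line(xy, fill=255, width=width)
        return np.asarray(img, dtype=np.float32) / 255.0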

From one convolution layer to the next, we chose the size of the kernels and the number of feature maps so as to derive features of increasingly coarse granularity. We subsampled through a max-pooling layer [9] using a 2x2 kernel. The last feature layer typically comprises on the order of 1,000 small feature maps. Finally, the output layer has one node per class, e.g., 3,755 for the Hànzì level-1 subset of GB2312-80, and close to 30,000 when scaled up to the full inventory.
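
A minimal sketch of the Figure 1 topology, written here in PyTorch, might look as follows. The channel counts and kernel sizes are illustrative assumptions; the article only fixes the 48x48 input, the alternation of convolution and max pooling, the roughly 1,000 small feature maps in the last feature layer, and the one-node-per-class output.

    import torch
    import torch.nn as nn

    class HanziCNN(nn.Module):
        # Two convolution + max-pooling stages feeding a fully connected
        # output layer, as in Figure 1. Channel counts are placeholders;
        # the production model reportedly ends in ~1,000 small feature maps.
        def __init__(self, num_classes=30000):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 64, kernel_size=5, padding=2), nn.ReLU(),
                nn.MaxPool2d(2),                              # 48x48 -> 24x24
                nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                              # 24x24 -> 12x12
            )
            self.classifier = nn.Linear(128 * 12 * 12, num_classes)

        def forward(self, x):  # x: (batch, 1, 48, 48) grayscale images
            return self.classifier(self.features(x).flatten(1))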

As a baseline, we evaluated the previously described CNN implementation on the CASIA benchmark task [6]. While this task only covers Hànzì level-1 characters, there exist numerous reference results for character accuracy in the literature (e.g., in [7] and [14]). We used the same setup based on CASIA-OLHWDB, DB1.0-1.2, split into training and testing datasets [6], [7], yielding about one million training exemplars.

Note that, given our product focus, the goal was not to tune our system for the highest possible accuracy on CASIA. Instead, our priorities were model size, evaluation speed, and user experience. We thus opted for a compact system that works in real time across a wide variety of styles, with high robustness to non-standard stroke order. This led us to an image-based recognition approach, even though we evaluate on online datasets. As in [10], [11], we supplemented actual observations with suitable elastic deformations.
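
For concreteness, here is one common way to implement such elastic deformations, in the spirit of [11]: smooth random displacement fields with a Gaussian filter, then resample the image along them. The alpha and sigma values below are illustrative defaults, not the settings used in our training.

    import numpy as np
    from scipy.ndimage import gaussian_filter, map_coordinates

    def elastic_deform(image, alpha=8.0, sigma=3.0, rng=None):
        # Apply a random elastic distortion to a 2-D grayscale image.
        # Random per-pixel displacements are smoothed with a Gaussian of
        # std `sigma` and scaled by `alpha` before resampling.
        if rng is None:
            rng = np.random.default_rng()
        h, w = image.shape
        dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
        ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
        coords = np.array([ys + dy, xs + dx])
        return map_coordinates(image, coords, order=1, mode="constant")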

Table 1 shows the results using the CNN of Figure 1, where the abbreviation “Hz-1” refers to the Hànzì level-1 inventory (3,755 characters), and “CR(n)” denotes top-n character recognition accuracy. In addition to the commonly reported top-1 and top-10 accuracies, we also report top-4 accuracy: because our user interface was designed to show four character candidates, top-4 accuracy is an important predictor of the user experience in our system.

Table 1. Results on CASIA On-line Database, 3,755 Characters. Standard Training, Associated Model Size = 1 MB

Inventory Training CR(1) CR(4) CR(10)
Hz-1 CASIA 87.4% 94.5% 98.1%
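
For reference, CR(n) can be computed from the raw network scores as follows. This is a generic sketch of the metric, not our evaluation harness.

    import torch

    def top_n_accuracy(logits, labels, n=4):
        # CR(n): fraction of samples whose true class appears among the
        # n highest-scoring output nodes.
        topn = logits.topk(n, dim=1).indices            # (batch, n)
        return (topn == labels.unsqueeze(1)).any(dim=1).float().mean().item()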

The figures in Table 1 compare with online results in [7] and [14] averaging roughly 93% for top-1 and 98% for top-10 accuracy. Thus, while our top-10 accuracy is in line with the literature, our top-1 accuracy is slightly lower. This has to be balanced, however, against a satisfactory top-4 accuracy, and perhaps even more importantly, a model size (1 MB) smaller than any comparable system in [7] and [14].

The system in Table 1 is trained only on CASIA data, and does not include any other training data. We were also interested in folding in additional training data collected in-house on iOS devices. This data covers a larger variety of styles (cf. next section) and comprises a lot more training instances per character. Table 2 reports our results, on the same test set with a 3,755-character inventory.

Table 2. Results on CASIA On-line Database, 3,755 Characters. Augmented Training, Associated Model Size = 15 MB

Inventory Training CR(1) CR(4) CR(10)
Hz-1 Augmented 88.4% 96.5% 98.3%

Though the resulting system has a much greater footprint (15 MB), accuracy only improves slightly (about 2% absolute for top-4 accuracy). This suggests that, by and large, most styles of characters appearing in the test set were already well-covered in the CASIA training set. It also indicates that there is no downside to folding in more training data: the presence of additional styles is not deleterious to the underlying model.

Scaling Up to 30K Characters

Since the ideal set of “frequently written” characters varies from one user to the next, a large population of users requires an inventory of characters much larger than 3,755. Exactly which ones to select, however, is not entirely straightforward. Simplified Chinese characters defined in GB2312-80 and traditional Chinese characters defined in Big5, Big5E, and CNS 11643-92 cover a wide range (from 3,755 to 48,027 Hànzì characters). More recently, HKSCS-2008 added 4,568 extra characters, and GB18030-2000 added even more.

We wanted to make sure that users are able to write their daily correspondence, in both simplified and traditional Chinese, as well as names, poems and other common markings, visual symbols, and emojis. We also wanted to support the Latin script for the occasional product or trade name with no transliteration. We followed Unicode as the prevailing international standard of character encoding, as it encapsulates almost all of the above-mentioned standards. (Note that Unicode 7.0 can specify over 70,000 characters in its extensions B-D, with more being considered for inclusion.) Our character recognition system thus focused on the Hànzì part of GB18030-2005, HKSCS-2008, Big5E, a core ASCII set, and a set of visual symbols and emojis, for a total of approximately 30,000 characters, which we felt represented the best compromise for most Chinese users.
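
Conceptually, the supported inventory is just a union of character sets. The sketch below illustrates the idea; the file names are hypothetical placeholders, and the exact sets we shipped are not published.

    # Hypothetical sketch: build the supported inventory as a union of
    # character lists, one file per source standard (placeholder names).
    def load_charset(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    inventory = set()
    for part in ("gb18030_hanzi.txt", "hkscs2008.txt", "big5e.txt",
                 "ascii_core.txt", "symbols_emoji.txt"):
        inventory |= load_charset(part)
    print(len(inventory))  # on the order of 30,000 classes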

After selecting the underlying character inventory, it is critical to sample the writing styles that users actually employ. While there are formal clues as to what styles to expect (cf. [13]), there exist many regional variations, e.g., (i) in the use of the U+2EBF (艹) radical, or (ii) the cursive U+56DB (四) vs. U+306E (の). Rendered fonts can also contribute to confusion, as some users expect specific characters to be presented in a specific style. Because speedy input tends to drive writing toward cursive styles, it also increases ambiguity, e.g., between U+738B (王) and U+4E94 (五). Finally, increased internationalization sometimes introduces unexpected collisions: for example, U+4E8C (二), when cursively written, may conflict with the Latin characters “2” and “Z”.

Our rationale was to offer users the whole spectrum of possible input, from printed to cursive to unconstrained writing [5]. To cover as many variants as possible, we sought data from writers in several regions of Greater China. We were surprised to find that most users had never seen many of the rarer characters. This unfamiliarity results in hesitations, stroke-order errors, and other distortions, which we needed to take into account. We collected data from paid participants across various age groups and genders, and with a variety of educational backgrounds. The resulting handwritten data is unique in many ways: it comprises several thousand users, was written with a finger rather than a stylus, on iOS devices, and was collected in small batches. One advantage is that the sampling on iOS devices results in a very clean handwritten signal.

We found a wide variety of writing styles. Figures 2-4 show some examples of the “flower” character U+82B1 (花), in printed, cursive, and unconstrained styles.

Figure 2. Printed radical variations of U+82B1 (花)
This figure shows four variations of the flower character in a printed style.
Figure 3. Cursive radical variations of U+82B1 (花)
This figure shows four variations of the flower character in a cursive style.
Figure 4. Unconstrained variations of U+82B1 (花)
This figure shows three unconstrained variations of the flower character.

In daily life, users often write quickly and without constraint, which can lead to cursive variations of the same character that look very dissimilar. Conversely, it sometimes also leads to confusability between different characters. Figures 5-7 show some concrete examples we observed in our data. Note that it is especially important to have enough training material to distinguish cursive variations such as those in Figure 7.

Figure 5. Variations of U+7684 (的)
This figure shows three variations of the same character, each looks dissimilar to the others.
Figure 6. Variations of U+4EE5 (以)
This figure shows three variations of the same character, each looks dissimilar to the others.
Figure 7. Similar shapes of U+738B (王) and U+4E94 (五)
This figure shows cursive instances of two different characters that look very similar to each other.

With the guiding principles discussed previously, we were able to collect tens of millions of character instances for our training data. Compare the 3,755-character system of the previous section with the results shown in Table 3, obtained on the same test set after increasing the number of recognizable characters from 3,755 to approximately 30,000.

Table 3. Results on CASIA On-line Database, 30,000 Characters

Inventory CR(1) CR(4) CR(10) Model Size
30K 85.6% 95.2% 97.5% 15 MB

Note that the model size remains the same, as the system of Table 2 was simply restricted to the “Hz-1” set of characters, but was otherwise identical. Accuracy drops marginally, which is to be expected, since the vastly increased coverage creates additional confusability of the kind mentioned earlier, for example “二” vs. “Z”.

Compare Tables 1-3 and you’ll see that multiplying coverage by a factor of ten does not entail ten times more errors, or ten times more storage. In fact, as the model size goes up, the number of errors increases much more slowly. Thus, building a high-accuracy Chinese character recognition system that covers 30,000 characters, instead of only 3,755, is both possible and practical.

To get an idea of how the system performs across the entire set of 30,000 characters, we also evaluated it on a number of different test sets comprising all supported characters written in various styles. Table 4 lists the average results.

Table 4. Average Results on Multiple In-House Test Sets Comprising All Writing Styles, 30,000 Characters

Inventory CR(1) CR(4) Model Size
30K 88.1% 95.1% 15 MB

Of course, the results in Tables 3-4 are not directly comparable, since they were obtained on different test sets. Nonetheless, they show that top-1 and top-4 accuracies are in the same ballpark across the entire inventory of characters. This is a result of the largely balanced training regimen.

Discussion

The total set of CJK characters in Unicode (currently around 75,000 [12]) could increase in the future, as the Ideographic Rapporteur Group (IRG) keeps suggesting new additions from a variety of sources. Admittedly, these character variants will be rare (e.g., used in historical names or poetry). Nevertheless, they are of high interest to every person whose name happens to contain one of those rare characters.

So, how well do we expect to handle larger inventories of characters in the future? The experiments discussed in this article allow us to derive learning curves [2] from training and test error rates measured at varying amounts of training data. We can therefore extrapolate asymptotic values, regarding both what our accuracy would look like with more training data and how it would change with more characters to recognize.

For example, given the ten-times-larger inventory and corresponding (less than) 2% drop in accuracy between Tables 1 and 3, we can extrapolate that, with an inventory of 100,000 characters and a corresponding increase in training data, it would be realistic to achieve top-1 accuracies around 84% and top-10 accuracies around 97% (with the same type of architecture).
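
As an illustration of this kind of extrapolation, one can fit a power-law learning curve of the form err(N) = a * N^(-b) + c to measured error rates and read off the asymptote [2]. The data points below are made-up placeholders, not our actual measurements.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, b, c):
        # Classic learning-curve model: error decays as a power of the
        # training-set size n, toward an asymptotic floor c.
        return a * n ** (-b) + c

    sizes = np.array([1e5, 1e6, 1e7, 2e7])        # training exemplars (fake)
    errors = np.array([0.25, 0.17, 0.13, 0.12])   # top-1 error rates (fake)
    (a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.3, 0.05))
    print(f"asymptotic error ~ {c:.3f}; "
          f"predicted at 1e8 samples ~ {power_law(1e8, a, b, c):.3f}")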

In summary, building a high-accuracy handwriting recognition system which covers a large set of 30,000 Chinese characters is practical, even on embedded devices. Furthermore, accuracy only degrades slowly as the inventory goes up, as long as training data of sufficient quality is available in sufficient quantity. This bodes well for the recognition of even larger character inventories in the future.

References

[1] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, Convolutional Neural Network Committees for Handwritten Character Classification, in 11th Int. Conf. Document Analysis and Recognition (ICDAR 2011), Beijing, China, Sept. 2011.

[2] C. Cortes, L. D. Jackel, S. A. Solla, V. Vapnik, and J. S. Denker, Learning Curves: Asymptotic Values and Rate of Convergence, in Advances in Neural Information Processing Systems (NIPS 1993), Denver, CO, pp. 327–334, Dec. 1993.

[3] G. E. Hinton and K. J. Lang, Shape Recognition and Illusory Conjunctions, in Proc. 9th Int. Joint Conf. Artificial Intelligence, Los Angeles, CA, pp. 252–259, 1985.

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-Based Learning Applied to Document Recognition, Proc. IEEE, Vol. 86, No. 11, pp. 2278–2324, Nov. 1998.

[5] C.-L. Liu, S. Jaeger, and M. Nakagawa, Online Recognition of Chinese Characters: The State-of-the-Art, IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 26, No. 2, pp. 198–213, Feb. 2004.

[6] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, CASIA Online and Offline Chinese Handwriting Databases, in Proc. 11th Int. Conf. Document Analysis and Recognition (ICDAR 2011), Beijing, China, Sept. 2011.

[7] C.-L. Liu, F. Yin, Q.-F. Wang, and D.-H. Wang, ICDAR 2011 Chinese Handwriting Recognition Competition, in 11th Int. Conf. Document Analysis and Recognition (ICDAR 2011), Beijing, China, Sept. 2011.

[8] Y. Li, L. Jin, X. Zhu, and T. Long, SCUT-COUCH2008: A Comprehensive Online Unconstrained Chinese Handwriting Dataset, in Int. Conf. Frontiers in Handwriting Recognition (ICFHR 2008), Montreal, Canada, pp. 165–170, Aug. 2008.

[9] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, What is the Best Multi-Stage Architecture for Object Recognition?, in Proc. IEEE Int. Conf. Computer Vision (ICCV 2009), Kyoto, Japan, Sept. 2009.

[10] U. Meier, D. C. Ciresan, L. M. Gambardella, and J. Schmidhuber, Better Digit Recognition with a Committee of Simple Neural Nets, in 11th Int. Conf. Document Analysis and Recognition (ICDAR 2011), Beijing, China, Sept. 2011.

[11] P. Y. Simard, D. Steinkraus, and J. C. Platt, Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis, in 7th Int. Conf. Document Analysis and Recognition (ICDAR 2003), Edinburgh, Scotland, Aug. 2003.

[12] Unicode, Chinese and Japanese, http://www.unicode.org/faq/han_cjk.html, 2015.

[13] F. F. Wang, Chinese Cursive Script: An Introduction to Handwriting in Chinese, Far Eastern Publications Series, New Haven, CT: Yale University Press, 1958.

[14] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, ICDAR 2013 Chinese Handwriting Recognition Competition, in 12th Int. Conf. Document Analysis and Recognition (ICDAR 2013), Washington, DC, USA, Sept. 2013.