gender prediction based on chinese nametcci.ccf.org.cn/conference/2019/papers/sw4.pdf · early work...

8
Gender Prediction Based on Chinese Name Jizheng Jia 1[0000-0002-8876-9644] and Qiyang Zhao 2[0000-0002-8476-5742] 1 Beihang University, Beijing, China {zy1706128}@buaa.edu.cn 2 Beihang University, Beijing, China {zhaoqy}@buaa.edu.cn Abstract. Much work has been done on the problem of gender prediction about English using the idea of probability models or traditional machine learning meth- ods. Different from English or other alphabetic languages, Chinese characters are logosyllabic. Previous approaches work quite well for Indo-European languages in general and English in particular, however, their performance deteriorate in Asian languages such as Chinese, Japanese and Korean. In our work, we focus on Simplified Chinese characters and present a novel approach incorporating pho- netic information (Pinyin) to enhance Chinese word embedding trained on BERT model. We compared our method with several previous methods, namely Naive Bayes, GBDT, and Random forest with word embedding via fastText as features. Quantitative and qualitative experiments demonstrate the superior of our model. The results show that we can achieve 93.45% test accuracy using our method. In addition, we have released two large-scale gender-labeled datasets (one with over one million first names and the other with over six million full names) used as a part of this study for the community. Keywords: Gender inference of names · Pinyin representation · BERT-based model. 1 Introduction In recent years, the problem of classifying gender has occupied an important place in academia and industry. For many research questions, demographic information about individuals (such as names, gender, or ethnic background) is highly beneficial but it often particularly difficult to obtain [3,2]. The industry can also benefit from gender data and they can gain additional infor- mation about the customers, which would be used for targeted marketing or person- alized advertisements [3,4]. Also, it can be used for name translation and automated gender selection in online form filling, making the software more convenient for users. Most Internet companies in China can access to National Identification Card (People’s Republic of China) database, with an odd/even number indicating male/female with 100% accuracy. However, with increasing awareness of privacy data protection, more and more Internet users are reluctant to provide their personal data. It has long been known that the appearance of a person is not independent of the name. Parents spend extraordinary time and effort to select the perfect name for an expected child. The name of a child conveys much more information including gender

Upload: others

Post on 22-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

Gender Prediction Based on Chinese Name

Jizheng Jia1[0000−0002−8876−9644] and Qiyang Zhao2[0000−0002−8476−5742]

1 Beihang University, Beijing, China{zy1706128}@buaa.edu.cn

2 Beihang University, Beijing, China{zhaoqy}@buaa.edu.cn

Abstract. Much work has been done on the problem of gender prediction aboutEnglish using the idea of probability models or traditional machine learning meth-ods. Different from English or other alphabetic languages, Chinese characters arelogosyllabic. Previous approaches work quite well for Indo-European languagesin general and English in particular, however, their performance deteriorate inAsian languages such as Chinese, Japanese and Korean. In our work, we focuson Simplified Chinese characters and present a novel approach incorporating pho-netic information (Pinyin) to enhance Chinese word embedding trained on BERTmodel. We compared our method with several previous methods, namely NaiveBayes, GBDT, and Random forest with word embedding via fastText as features.Quantitative and qualitative experiments demonstrate the superior of our model.The results show that we can achieve 93.45% test accuracy using our method. Inaddition, we have released two large-scale gender-labeled datasets (one with overone million first names and the other with over six million full names) used as apart of this study for the community.

Keywords: Gender inference of names · Pinyin representation · BERT-basedmodel.

1 Introduction

In recent years, the problem of classifying gender has occupied an important place inacademia and industry. For many research questions, demographic information aboutindividuals (such as names, gender, or ethnic background) is highly beneficial but itoften particularly difficult to obtain [3,2].

The industry can also benefit from gender data and they can gain additional infor-mation about the customers, which would be used for targeted marketing or person-alized advertisements [3,4]. Also, it can be used for name translation and automatedgender selection in online form filling, making the software more convenient for users.Most Internet companies in China can access to National Identification Card (People’sRepublic of China) database, with an odd/even number indicating male/female with100% accuracy. However, with increasing awareness of privacy data protection, moreand more Internet users are reluctant to provide their personal data.

It has long been known that the appearance of a person is not independent of thename. Parents spend extraordinary time and effort to select the perfect name for anexpected child. The name of a child conveys much more information including gender

Page 2: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

2 Jizheng Jia and Qiyang Zhao

(a) Male (Hanzi -> Pinyin) (b) Female (Hanzi -> Pinyin) (c) Unisex (Hanzi -> Pinyin)

Fig. 1: The Chinese character (Hanzi) of name can indicate some information of gen-der, in the same way, Pinyin is gender-specific. The characters of two figures (Hanzi andPinyin) do not completely match because different characters that have the same pro-nunciation are reduced to the same Pinyin representation. For instance, in the Figure 1a,“中” and “忠” have the same Pinyin representation “zhong”.

both in western countries and China. As we all know, names can be divided into threeparts: first, middle, and last name, which don’t have equal effects on gender. In mainlandChina, most people’s name consist of first name and last name. According to a report[23] on Chinese names from the Ministry of Public Security, there were 23 last namesthat claim more than 10 million users and the number of Chinese people bearing thelast names Wang and Li has surpassed 100 million for both, which indicates millionof people have the same last name. Maybe we can easily get the conclusion from thereport that last name has little indication about gender information. In order to verifyour inference, we provide two versions of name-gender datasets so as to compare thedifference between the full name and first name in terms of gender prediction.

There have been several approaches proposed for gender estimation using first names.Most previous work were rules-based [5,16] or focused on probabilistic prediction usinga large corpus of known names [7]. However, rule-based methods have drawbacks thatthe coverage of rules was limited and they are language dependent. Thus, it seems ab-surd to attempt to hard code every possible pattern when we have deep learning. What’smore, many previous work focused on alphabetic writing systems, their performanceson Chinese deteriorated. So far, there are few studies on the Chinese language.

We intend to fill this gap by presenting a benchmark and comparison of severalgender inference methods in the area of the Chinese language. There are many distinc-tions between Chinese and English languages. For example, Chinese sentences are notseparated by spaces [17]. But the main distinction is that written Chinese is logosyl-labic. A Chinese character has its own meaning or as a part of the polysyllabic word,making Chinese sentence segmentation a non-trivial task. In addition to the difficultyof sentence segmentation [9], most previous studies computed the semantic meaning ofChinese characters using the corpus from Baidu Baike3 or Chinese Wikipedia4. Thiskind of method ignored the importance of pronunciations features. The work found thatphonology also contributed to character recognition [10], which was ignored by mostprevious work.

3https://baike.baidu.com4https://en.wikipedia.org/wiki/Main_Page

Page 3: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

Gender Prediction Based on Chinese Name 3

Thus, in our work, the Chinese word representations can be roughly divided intotwo components: semantic component and phonetic component. The semantic com-ponent indicates the meaning of a character while the latter indicates the sound of acharacter. Pinyin is the official phonetic representation for Chinese word, which con-verts Chinese character to Roman letters depending on the pronunciation of MandarinChinese [11]. For example, “qing” and “qin” are the phonetic component of characters“青” (seafoam) and “琴” (Qin, a Chinese musical instrument). In this case, they bothcontain female cues. Figure 1 shows that Pinyin is also gender inclination like Chi-nese characters (Hanzi). For example, in Figure 1a, “国” /guo/ and “德” /de/ are maleinclined and in Figure 1c, we call “明” /ming/ and “晓” /xiao/ are unisex.

Then we used two representations: the simplified Chinese string and respondingPinyin representation. Our idea is that we can use phonetic features to enhance theoriginal Chinese embeddings. Finally, we used the state-of-art model BERT [12] forgender classification based on the features we have described.

To summarize, the key contributions of this work are as follows: (1) This work fillsthe gap of integrating Pinyin information with traditional word embedding via BERT-Base model. (2) We compare several previous gender inference architectures with ourmodel and the effectiveness of our idea is empirically well demonstrated. (3) To ourbest knowledge, this is the first large gender-labeled Chinese name dataset available forthe community.

2 Related Work

Early work on gender prediction includes gender-guesser [22], Gender API [19] andNgender [20]. The gender-guesser guesses gender from first English name using a dic-tionary. Sometimes, you get the “Unknown Token” if the input doesn’t exist in the dic-tionary. Gender API infers the likely gender of a name but it is a commercial applicationand not free for the community. Ngender also uses a predefined dictionary. The draw-back is obvious: the dictionary is too limited when it comes to new words although thiswork5 uses Laplacian smoothing technique addressing low-frequency words or OOV(out-of-vocabulary) problem.

More Recent studies pay more attention to traditional machine learning such asSVM [7,13] or boosted decision trees [5]. This work finds out a set of characteristicssuch as number of syllables, number of vowels and ending character, extracted fromfirst name and then use these characteristics as feature to an SVM model [3].

In recent years, some effective methods to understand Chinese names have beenproposed. This work provided a tool for both English and Chinese names simultane-ously but only in STEM fields when they find no free gender forecasting software topredict global gender disparity [14]. [15] proposed a framework exploring both internalcharacter information and external context of words to explore subword-level informa-tion of Chinese characters, however, in this paper, we propose to use Pinyin to enhancethe traditional Chinese word embeddings.

In this paper, we’d like to use BERT model instead of traditional machine learningmethods, which need manual feature extraction and selection. Moreover, we introduce

5http://sofasofa.io/tutorials/naive_bayes_classifier/

Page 4: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

4 Jizheng Jia and Qiyang Zhao

Pinyin as raw training data to original embedding hoping to enhance the ability of wordembedding.

3 The Approaches and Data

3.1 Four Approaches

There are mainly four types of gender prediction models in our experiment: (1) a rule-based approach, which was omitted by its limitation in specific language or region [4],(2) probabilistic model like Naive Bayes, (3) two types of machine learning algorithmssuch as GBDT and Random forest, (4) BERT-Base model with character-Pinyin data.

Naive Bayes Classifier We firstly choose Naive Bayes for its simplicity since thereis almost no hyper-parameter tuning needed. The Naive Bayes classifier uses BayesTheorem to predict the probability of gender given name information [14]. The formulaof Naive Bayes algorithm for gender prediction:

P(gendername

) =P(gender)×P( gender

name )

P(name)

In this formula, P(gender|name) is the posterior probability of gender, given names;P(name|gender) is the likelihood which is the probability of predictor, given gender;P(name) is the prior probability of predictor.

We also apply Laplace smoothing when a frequency-based probability is zero, whichis a practical technique used to deal with unknown words (names).

g(x) =nx +α

l +αc

where nx is the number of occurrences of the word x, l is the length of the sentence,c is the number of different words in the sentence, and α is the smooth coefficient ofLaplacian smoothness, which is a hyper-parameter.

GBDT with Simple Features GBDT, however, is chosen as “simple version” of neu-ral network and its decent performance in many machine learning competitions. Thefeatures are term frequency of every single Chinese character from the name.

Random Forest with Various Types of Features This model takes three types offeatures: the first name, the unigram of first name and a vector of size one hundredextracted from Skip-gram model via fastText trained separately using a corpus collectedfrom Baidu Baike.

BERT-Base Model with Character-Pinyin Data The architecture of our model couldbe found in Figure 2. We mainly have to train the binary classifier, with some changeslike the dictionary to the BERT-Base model during training process. The training pro-cess is also called Fine-Tuning.

Page 5: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

Gender Prediction Based on Chinese Name 5

Fig. 2: BERT has achieved state-of-the-art results in wide variety of NLP language task.We load BERT-Base model, attach an additional layer for classification, and fine-tuneBERT model for gender classification.

Data Type Number Percentfirstname (female) 718,400 63.70%firstname (male) 409,411 36.30%fullname (female) 4,366,278 66.25%fullname (male) 2,223,964 33.75%

Table 1: The statistical information of our datasets. The difference between firstnamedataset and fullname dataset is that the former has no first name. For example, “杨振宁”is a Chinese name that is officially displayed as Chenning Yang. And, in the fullnamedataset, it’s showed as “yang chen ning” where chracters are separated by space andconverted into lower case. But in the firstname dataset, it is represented as “chen ning”.

3.2 Gender-labeled Data

The gender-labeled data can be vital to gender classification. However, to the best ofour knowledge, there is no suitable gender-labeled dataset for deep network training.

In our work, data was collected with reliable gender labels collected from severalpublic sources. After cleaning and removing repetitive names, the final data lists consistof 6,590,242 names and 1,127,811 first names, respectively. We used pypinyin [21] andGoogle translator to translate those Chinese names into Pinyin format. Table 1 showsmore details about our gender-labeled datasets.

There are two types of datasets: The first one is Simplified Chinese character whichwe call "hanzi". Since each Chinese character corresponds to a Pinyin, we take eachPinyin as a token corresponding to the Chinese character. Then we call the second typedataset as "character-pinyin".

4 Experiments

We explore four different classifier types described above. We choose Naive Bayesas baseline and GBDT using TF (term frequency) as the “simpler” version of neural

Page 6: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

6 Jizheng Jia and Qiyang Zhao

Naive Bayes GBDT RF Our ModelACC (mean) 80.67% 85.59% 84.87% 90.89%ACC (SD) 2.03 1.85 1.58 1.02

Table 2: Experiment results came from Chinese characters data without incorporatingPinyin information. We reported the mean and standard deviation (SD) of results fromten repeated experiments on each model with different random seed.

Pinyin (only) Hanzi (only) Hanzi-Pinyin-first Hanzi-Pinyin-fullACC (mean) 69.64% 90.89% 92.33% 93.45%ACC (SD) 1.78 0.96 1.02 1.52

Table 3: These experiments were trained on different combinations of our two datasets.Like last experiment, we used mean and SD of ACC to get unbiased result. ACC from"Hanzi-Pinyin-first" and "Hanzi-Pinyin-full" got better results than the others. And"Hanzi (only)" got better accuracy than "Pinyin (only)" when working alone.

network. The third model is based on Random forest using the word embedding trainedby fastText. Finally, our model is trained on BERT using simplified Chinese words andcorresponding Pinyin data.

Comparison Experiments were done on our first name dataset without Pinyin infor-mation. For fastText, we used Skip-gram to learn word embeddings, we set the mini-mum length of character n-grams to be 2 and the maximum length of character n-gramsto be 4. We tuned the hyper-parameters using grid search and cross-validation with 90%in the training set, 5% in the dev set. The test data (5%) was held out for final compar-ison. The experimental results could be found in Table 2. The high accuracy in test setindicated the effiency of our proposed model.

To analyze the effect of Pinyin representation, we also did a second type of experi-ment applying different merging strategies of the data training on our model. The firsttype of dataset was first name data of Pinyin format; the second one was first namedata of Simplified Chinese (Hanzi) format; the third one was first name data from bothPinyin and Simplified Chinese format, we called this as character-Pinyin first name(Hanzi-Pinyin-first); the final one used full name data from both Pinyin and Simpli-fied Chinese format (Hanzi-Pinyin-full). Table 3 showed the detail information of ourexperiment results. You could find our codes and open-source datasets here6 sooner.

5 Conclusion

Our experiments showed the efficiency of incorporating Pinyin information to enhanceprevious word embedding compared with using Chinese words only. We hold that thisidea can also expand to other NLP tasks or downstream tasks.

Pinyin did not do as well as the Simplified Chinese characters, but combined withSimplified Chinese words, it can get better results compared to using Simplified Chinese

6https://github.com/jijeng/gender-prediction

Page 7: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

Gender Prediction Based on Chinese Name 7

words alone. Also, we face problem such as unisex names, which could be the directionfor future work.

References

1. Juergen Mueller and Gerd Stumme. Gender inference using statistical name characteristicsin twitter. In Proceedings of the The 3rd Multidisciplinary International Social NetworksConference on SocialInformatics 2016, Data Science 2016, page 47. ACM, 2016.

2. Fariba Karimi, Claudia Wagner, Florian Lemmerich, Mohsen Jadidi, and Markus Strohmaier.Inferring gender from names on the web: A comparative evaluation of gender detection meth-ods. In Proceedings of the 25th International Conference Companion on World Wide Web,WWW ’16 Companion, pages 53–54, Republic and Canton of Geneva, Switzerland, 2016.International World Wide Web Conferences Steering Committee.

3. Juergen Mueller and Gerd Stumme. Gender inference using statistical name characteristicsin twitter. In Proceedings of the The 3rd Multidisciplinary International Social NetworksConference on SocialInformatics 2016, Data Science 2016, page 47. ACM, 2016.

4. Monali Y Khachane. Gender estimation from first name: A rule based approach. Interna-tional Journal of Advanced Research in Computer Science, 9(2):609, 2018.

5. Wendy Liu and Derek Ruths. What’s in a name? using first names as features for genderinference in twitter. In 2013 AAAI Spring Symposium Series, 2013.

6. Chuan Gu, Xi-ping Tian, and Jiang-de Yu. Automatic recognition of chinese personal nameusing conditional random fields and knowledge base. Mathematical Problems in Engineer-ing, 2015, 2015.

7. John D. Burger, John Henderson, George Kim, and Guido Zarrella. Discriminating genderon twitter. In Proceedings of the Conference on Empirical Methods in Natural LanguageProcessing, EMNLP ’11, pages 1301–1309, Stroudsburg, PA, USA, 2011. Association forComputational Linguistics.

8. Ming Liu, Vasile Rus, Qiang Liao, and Li Liu. Encoding and ranking similar chinese char-acters. J. Inf. Sci. Eng., 33(5):1195–1211, 2017.

9. Shilei Huang and Jiangqin Wu. A pragmatic approach for classical chinese word segmenta-tion. In Proceedings of the Eleventh International Conference on Language Resources andEvaluation (LREC-2018), 2018.

10. Nanyun Peng, Mo Yu, and Mark Dredze. An empirical study of chinese name matching andapplications. In Proceedings of the 53rd Annual Meeting of the Association for Computa-tional Linguistics and the 7th International Joint Conference on Natural Language Process-ing (Volume 2: Short Papers), volume 2, pages 377–383, 2015.

11. Yafang Huang and Hai Zhao. Chinese pinyin aided IME, input what you have not keystrokedyet. In Proceedings of the 2018 Conference on Empirical Methods in Natural LanguageProcessing, pages 2923–2929, Brussels, Belgium, October-November 2018. Association forComputational Linguistics.

12. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-trainingof Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, pagearXiv:1810.04805, Oct 2018.

13. Huizhong Chen, Andrew C. Gallagher, and Bernd Girod. What’s in a name? first namesas facial attributes. In The IEEE Conference on Computer Vision and Pattern Recognition(CVPR), June 2013.

14. H. Zhao and F. Kamareddine. Advance gender prediction tool of first names and its usein analysing gender disparity in computer science in the uk, malaysia and china. In 2017International Conference on Computational Science and Computational Intelligence (CSCI),pages 222–227, Dec 2017.

Page 8: Gender Prediction Based on Chinese Nametcci.ccf.org.cn/conference/2019/papers/SW4.pdf · Early work on gender prediction includes gender-guesser [22], Gender API [19] and Ngender

8 Jizheng Jia and Qiyang Zhao

15. Huiming Jin, Hao Zhu, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Fen Lin, and Leyu Lin.Incorporating Chinese Characters of Words for Lexical Sememe Prediction. arXiv e-prints,page arXiv:1806.06349, Jun 2018.

16. Chuan Gu, Xi-ping Tian, and Jiang-de Yu. Automatic recognition of chinese personal nameusing conditional random fields and knowledge base. Mathematical Problems in Engineer-ing, 2015, 2015.

17. Ming Liu, Vasile Rus, Qiang Liao, and Li Liu. Encoding and ranking similar chinese char-acters. J. Inf. Sci. Eng., 33(5):1195–1211, 2017.

18. Gender Guesser, https://test.pypi.org/project/gender-guesser/. Last accessed4 May 2019

19. Namsor Gender API, https://gender-api.com/. Last accessed 4 May 201920. Ngender, https://github.com/observerss/ngender/. Last accessed 4 May 201921. pypinyin, https://pypi.org/project/pypinyin/. Last accessed 4 May 201922. Gender Guesser, https://test.pypi.org/project/gender-guesser/. Last accessed

4 May 201923. Most common surnames revealed, http://www.chinadaily.com.cn/a/201901/31/

WS5c528e7ea3106c65c34e78cb.html. Last accessed 4 May 2019