ABSTRACT
The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with individual group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.
Index Terms
- Name-ethnicity classification from open sources