DOI: 10.1145/1557019.1557032
research-article

Name-ethnicity classification from open sources

Published: 28 June 2009

ABSTRACT

The problem of ethnicity identification from names has a variety of important applications, including biomedical research, demographic studies, and marketing. Here we report on the development of an ethnicity classifier where all training data is extracted from public, non-confidential (and hence somewhat unreliable) sources. Our classifier uses hidden Markov models (HMMs) and decision trees to classify names into 13 cultural/ethnic groups, with individual group accuracy comparable to that of earlier binary (e.g., Spanish/non-Spanish) classifiers. We have applied this classifier to over 20 million names from a large-scale news corpus, identifying interesting temporal and spatial trends in the representation of particular cultural/ethnic groups.



    • Published in

      KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
      June 2009
      1426 pages
      ISBN:9781605584959
      DOI:10.1145/1557019

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
