DOI: 10.1145/2835857.2835858
research-article

Python, performance, and natural language processing

Published: 15 November 2015

ABSTRACT

We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps, which require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors). A Python implementation of each of these steps would require a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data, and we propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
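
To make the co-occurrence extraction step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy corpus, and the window size are all assumptions chosen for illustration. The corpus is split into chunks, each chunk is counted independently (the part that could be handed to separate workers on a cluster), and the partial counts are then summed.

```python
from collections import Counter
from functools import reduce


def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within a symmetric window of `window` tokens."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def merge_counts(partial_counts):
    """Reduce step: sum per-chunk counters into one global counter."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())


# Hypothetical toy corpus, one chunk per list entry (an assumption for the demo).
chunks = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# Map step: run serially here; on a cluster each chunk would go to its own worker.
partials = [cooccurrence_counts(chunk, window=1) for chunk in chunks]
total = merge_counts(partials)
```

The resulting counter is a sparse representation: only pairs that actually co-occur are stored, which mirrors the "raw text to sparse matrices" stage of the workflow. Note that this sketch does not count pairs that straddle a chunk boundary, so chunks should be split at document or sentence boundaries.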

Published in

PyHPC '15: Proceedings of the 5th Workshop on Python for High-Performance and Scientific Computing
November 2015
59 pages
ISBN: 9781450340106
DOI: 10.1145/2835857

Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

PyHPC '15 paper acceptance rate: 7 of 7 submissions, 100%
Overall acceptance rate: 7 of 7 submissions, 100%
