ABSTRACT
We present a case study of a Python-based workflow for a data-intensive natural language processing problem, namely word classification with vector space model methodology. Problems in natural language processing are typically solved in many steps that require transforming the data into vastly different formats (in our case, raw text to sparse matrices to dense vectors), and a Python implementation of each step requires a different solution. We survey existing approaches to using Python for high-performance processing of large volumes of data and propose a sample solution for each step of our case study (aspectual classification of Russian verbs), attempting to preserve both efficiency and user-friendliness. For the most computationally intensive part of the workflow, we develop a prototype distributed implementation of the co-occurrence extraction module using an IPython.parallel cluster.
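As a rough illustration of the co-occurrence extraction step mentioned above, the following minimal sketch counts word-context pairs per text chunk and then merges the partial counts, which is the general map-and-merge shape a distributed version could take. It is not the authors' implementation: the function names (`cooccurrences`, `merge`, `to_sparse_rows`), the window size, and the toy English sentences are all assumptions made for illustration, and the map step runs serially here rather than on an IPython.parallel cluster.

```python
from collections import Counter


def cooccurrences(tokens, window=2):
    """Count (word, context) pairs within a symmetric window over one chunk."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


def merge(partials):
    """Sum the per-chunk counters into one global counter (the 'reduce' step)."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def to_sparse_rows(counts):
    """Regroup pair counts as {word: {context: count}}, a dict-of-dicts sparse matrix."""
    rows = {}
    for (word, context), n in counts.items():
        rows.setdefault(word, {})[context] = n
    return rows


# Hypothetical toy corpus, already split into chunks and tokenized.
chunks = [
    "she read the book".split(),
    "she was reading the book".split(),
]

# Map step: each chunk is processed independently (serially in this sketch).
partials = [cooccurrences(chunk, window=1) for chunk in chunks]
total = merge(partials)
rows = to_sparse_rows(total)
```

Because each chunk is counted independently, pairs that span a chunk boundary are not counted; splitting on sentence or document boundaries is one way to make that loss acceptable. The dict-of-dicts result stands in for the sparse matrix stage, from which dense vectors would be derived in a later step.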
Index Terms
- Python, performance, and natural language processing