Abstract
Location extraction, also called “toponym extraction,” covers two tasks: geoparsing, which extracts spatial representations from location mentions in text, and geotagging, which assigns spatial coordinates to content items. This article evaluates five “best-of-class” location extraction algorithms. We develop a geoparsing algorithm using an OpenStreetMap database, and a geotagging algorithm using a language model constructed from social media tags and multiple gazetteers. Third-party work evaluated includes a DBpedia-based entity recognition and disambiguation approach, a named entity recognition and Geonames gazetteer approach, and a Google Geocoder API approach. We perform two quantitative benchmark evaluations, one geoparsing tweets and one geotagging Flickr posts, to compare all approaches. We also perform a qualitative evaluation recalling the top N location mentions from tweets during major news events. The OpenStreetMap approach was best (F1 0.90+) for geoparsing English, and the language model approach was best (F1 0.66) for Turkish. The language model was best (F1@1km 0.49) for the geotagging evaluation. The map database was best (R@20 0.60+) in the qualitative evaluation. We report on strengths, weaknesses, and a detailed failure analysis for the approaches and suggest concrete areas for further research.
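The abstract reports F1, F1@1km, and R@20 but does not spell out the scoring protocol, so the following is only a minimal sketch of how such metrics are commonly computed. It assumes that a geotag counts as correct when it falls within a given great-circle distance of the true location, that precision is taken over attempted predictions and recall over all items, and that R@N is the fraction of relevant locations found in the top N ranked mentions. All function names are hypothetical and not taken from the article.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0088  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def f1_from_counts(correct, attempted, total):
    """F1 with precision = correct/attempted and recall = correct/total."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_at_km(predictions, truths, radius_km=1.0):
    """Assumed F1@radius: a prediction is correct if within radius_km.

    predictions: list of (lat, lon) tuples, or None when no estimate is made.
    truths: list of (lat, lon) tuples, aligned with predictions.
    """
    attempted = sum(1 for p in predictions if p is not None)
    correct = sum(
        1 for p, t in zip(predictions, truths)
        if p is not None and haversine_km(p[0], p[1], t[0], t[1]) <= radius_km
    )
    return f1_from_counts(correct, attempted, len(truths))


def recall_at_n(ranked_mentions, relevant, n):
    """Fraction of relevant locations that appear in the top-n ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked_mentions[:n]) & set(relevant)) / len(relevant)
```

For example, `f1_at_km(preds, truths, radius_km=1.0)` would score a batch of geotagged posts, and `recall_at_n(ranked, relevant, 20)` would correspond to an R@20-style figure under the assumptions above.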
Index Terms
- Location Extraction from Social Media: Geoparsing, Location Disambiguation, and Geotagging