research-article

Location Extraction from Social Media: Geoparsing, Location Disambiguation, and Geotagging

Authors:
Stuart E. Middleton (University of Southampton IT Innovation Centre, Southampton, UK; ORCID 0000-0001-8305-8176),
Giorgos Kordopatis-Zilos (Information Technologies Institute, CERTH, Thermi-Thessaloniki, Greece),
Symeon Papadopoulos (Information Technologies Institute, CERTH, Thermi-Thessaloniki, Greece),
Yiannis Kompatsiaris (Information Technologies Institute, CERTH, Thermi-Thessaloniki, Greece)

ACM Transactions on Information Systems, Volume 36, Issue 4, Article No. 40, pp. 1–27. https://doi.org/10.1145/3202662

Published: 13 June 2018

Abstract

Location extraction, also called “toponym extraction,” is a field covering geoparsing, extracting spatial representations from location mentions in text, and geotagging, assigning spatial coordinates to content items. This article evaluates five “best-of-class” location extraction algorithms. We develop a geoparsing algorithm using an OpenStreetMap database, and a geotagging algorithm using a language model constructed from social media tags and multiple gazetteers. Third-party work evaluated includes a DBpedia-based entity recognition and disambiguation approach, a named entity recognition and Geonames gazetteer approach, and a Google Geocoder API approach. We perform two quantitative benchmark evaluations, one geoparsing tweets and one geotagging Flickr posts, to compare all approaches. We also perform a qualitative evaluation recalling top N location mentions from tweets during major news events. The OpenStreetMap approach was best (F1 0.90+) for geoparsing English, and the language model approach was best (F1 0.66) for Turkish. The language model was best (F1@1km 0.49) for the geotagging evaluation. The map database was best (R@20 0.60+) in the qualitative evaluation. We report on strengths, weaknesses, and a detailed failure analysis for the approaches and suggest concrete areas for further research.
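The abstract reports scores such as F1@1km and R@20 without defining them here. As a rough illustration only, the sketch below assumes that F1@1km counts a geotag as correct when it falls within 1 km of the ground-truth coordinate, and that R@20 is the fraction of relevant location mentions found among the top 20 ranked mentions. The function names (`haversine_km`, `f1_at_km`, `recall_at_n`) and these definitions are assumptions for illustration, not the article's evaluation code.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def f1_at_km(predictions, truths, radius_km=1.0):
    """F1 where a prediction is correct if within radius_km of the truth.

    predictions: list of (lat, lon) tuples, or None when no geotag was produced.
    truths: list of (lat, lon) tuples, aligned with predictions.
    (Assumed definition; the article may compute this differently.)
    """
    attempted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in attempted if haversine_km(*p, *t) <= radius_km)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(truths) if truths else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_at_n(ranked_mentions, relevant, n=20):
    """Fraction of relevant mentions that appear in the top n ranked mentions."""
    if not relevant:
        return 0.0
    top = set(ranked_mentions[:n])
    return len(top & set(relevant)) / len(set(relevant))


if __name__ == "__main__":
    # Hypothetical toy data: one near hit, one item left untagged, one far miss.
    preds = [(0.0, 0.0), None, (10.0, 10.0)]
    truth = [(0.0, 0.001), (5.0, 5.0), (20.0, 20.0)]
    print(f1_at_km(preds, truth, radius_km=1.0))
    print(recall_at_n(["paris", "london", "rome"], ["paris", "berlin"], n=2))
```

Treating an untagged item as a recall miss but not a precision error is one common convention; other benchmarks count it differently, so scores from this sketch are not comparable with the figures quoted above.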

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Document representation

Published in
ACM Transactions on Information Systems Volume 36, Issue 4
October 2018
365 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/3211967
Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 June 2018
- Accepted: 1 March 2018
- Revised: 1 February 2018
- Received: 1 April 2017
Author Tags
Location extraction
benchmark
disambiguation
geocoding
geoparsing
geotagging
information extraction
location
social media
toponym
toponym extraction
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 74
- Total Downloads: 1,441
- Downloads (Last 12 months): 236
- Downloads (Last 6 weeks): 37