Sampling Content from Online Social Networks: Comparing Random vs. Expert Sampling of the Twitter Stream

Authors:
Muhammad Bilal Zafar
Max Planck Institute for Software Systems, Germany

Parantapa Bhattacharya
Indian Institute of Technology Kharagpur, India; Max Planck Institute for Software Systems, Germany

Niloy Ganguly
Indian Institute of Technology Kharagpur, India

Krishna P. Gummadi
Max Planck Institute for Software Systems, Germany

Saptarshi Ghosh
Max Planck Institute for Software Systems, Germany; Indian Institute of Engineering Science and Technology Shibpur, India

ACM Transactions on the Web, Volume 9, Issue 3, Article No. 12, pp. 1–33
https://doi.org/10.1145/2743023

Published: 04 June 2015

Abstract

Analysis of content streams gathered from social networking sites such as Twitter has several applications, ranging from content search and recommendation to news detection and business analytics. However, processing the large amounts of data generated on these sites in real time poses a difficult challenge. To cope with the data deluge, analytics companies and researchers are increasingly resorting to sampling. In this article, we investigate the crucial question of how to sample content streams generated by users in online social networks. The traditional method is to sample randomly from all the data. For example, most studies using Twitter data today rely on the 1% and 10% randomly sampled streams of tweets that are provided by Twitter. We analyze a different sampling methodology, one where content is gathered only from a relatively small sample (<1%) of the user population, namely, the expert users. Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the popularity, topical diversity, trustworthiness, and timeliness of the information contained within them, and the sentiment/opinion expressed on specific topics. Our analysis reveals several important differences in the data obtained through the two sampling methodologies, which have serious implications for applications such as topical search, trustworthy content recommendation, breaking news detection, and opinion mining.
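
The contrast between the two sampling methodologies can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Tweet` record, the function names, the toy stream, and the choice of expert IDs are all hypothetical, and modelling random sampling as an independent coin flip per tweet is an assumption about how a "1% random sample" might behave, not a description of how Twitter generates its sampled streams.

```python
import random
from collections import namedtuple

# Hypothetical minimal tweet record; real tweets carry many more fields.
Tweet = namedtuple("Tweet", ["user_id", "text"])

def random_sample(stream, rate, seed=0):
    """Random sampling: keep each tweet independently with probability `rate`."""
    rng = random.Random(seed)
    return [t for t in stream if rng.random() < rate]

def expert_sample(stream, expert_ids):
    """Expert sampling: keep every tweet whose author is in the expert set."""
    return [t for t in stream if t.user_id in expert_ids]

# Toy stream (assumption): 1,000 users, each posting 10 tweets.
stream = [Tweet(u, f"tweet {i} from user {u}")
          for u in range(1000) for i in range(10)]

# Hypothetical expert set covering 0.5% of the user population.
experts = set(range(5))

rand = random_sample(stream, rate=0.01)
exp = expert_sample(stream, experts)

print(len(rand), "tweets from", len({t.user_id for t in rand}), "users (random)")
print(len(exp), "tweets from", len({t.user_id for t in exp}), "users (expert)")
```

In this toy setting the random sample draws a thin slice of tweets spread across many users, while the expert sample keeps the complete output of a few selected users. That structural difference is what the article's comparison is about.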

Published in
ACM Transactions on the Web Volume 9, Issue 3
June 2015
187 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2788341
Editors:
Brian D. Davison, Lehigh University, USA
Marianne Winslett, University of Illinois at Urbana-Champaign
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 June 2015
- Accepted: 1 March 2015
- Revised: 1 October 2014
- Received: 1 February 2014
Author Tags
Sampling content streams
Twitter
Twitter Lists
random sampling
sampling from experts
Qualifiers
- research-article
- Research
- Refereed