Sampling Content from Online Social Networks: Comparing Random vs. Expert Sampling of the Twitter Stream

Published: 04 June 2015

Abstract

Analysis of content streams gathered from social networking sites such as Twitter has several applications, ranging from content search and recommendation to news detection and business analytics. However, processing the large amounts of data generated on these sites in real time poses a difficult challenge. To cope with the data deluge, analytics companies and researchers are increasingly resorting to sampling. In this article, we investigate the crucial question of how to sample content streams generated by users in online social networks. The traditional method is to sample randomly from all the data. For example, most studies using Twitter data today rely on the 1% and 10% randomly sampled streams of tweets that are provided by Twitter. We analyze a different sampling methodology, one where content is gathered only from a relatively small sample (<1%) of the user population, namely, the expert users. Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the popularity, topical diversity, trustworthiness, and timeliness of the information contained within them, and the sentiment/opinion expressed on specific topics. Our analysis reveals several important differences in the data obtained through the two sampling methodologies, which have serious implications for applications such as topical search, trustworthy content recommendation, breaking news detection, and opinion mining.
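The difference between the two sampling strategies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy tweet stream, the `user_id` field, the function names, and the choice of which users count as experts are all assumptions made for illustration only.

```python
import random


def random_sample(tweets, fraction, seed=0):
    """Random sampling: keep each tweet independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [t for t in tweets if rng.random() < fraction]


def expert_sample(tweets, expert_ids):
    """Expert sampling: keep every tweet whose author is in the expert set."""
    return [t for t in tweets if t["user_id"] in expert_ids]


# Hypothetical toy stream: 1,000 users posting 10 tweets each.
tweets = [
    {"user_id": u, "text": f"tweet {i} from user {u}"}
    for u in range(1000)
    for i in range(10)
]

# Assumption: users 0-4 (under 1% of the user population) are the experts.
expert_ids = set(range(5))

random_part = random_sample(tweets, fraction=0.01)
expert_part = expert_sample(tweets, expert_ids)

print(len(random_part), "tweets from random sampling")
print(len(expert_part), "tweets from expert sampling")
```

The random sample draws thinly from the whole user population, whereas the expert sample contains the complete output of a small, fixed group of users; this structural difference is what the comparison in the article is about.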


          • Published in

            ACM Transactions on the Web, Volume 9, Issue 3
            June 2015
            187 pages
            ISSN: 1559-1131
            EISSN: 1559-114X
            DOI: 10.1145/2788341

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 4 June 2015
            • Accepted: 1 March 2015
            • Revised: 1 October 2014
            • Received: 1 February 2014
            Qualifiers

            • research-article
            • Research
            • Refereed
