Abstract
Analysis of content streams gathered from social networking sites such as Twitter has several applications, ranging from content search and recommendation to news detection and business analytics. However, processing the large amounts of data generated on these sites in real time is difficult. To cope with the data deluge, analytics companies and researchers are increasingly resorting to sampling. In this article, we investigate the crucial question of how to sample content streams generated by users in online social networks. The traditional method is to sample randomly from all the data. For example, most studies using Twitter data today rely on the 1% and 10% randomly sampled streams of tweets that are provided by Twitter. We analyze a different sampling methodology, one where content is gathered only from a relatively small sample (<1%) of the user population, namely, the expert users. Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the popularity, topical diversity, trustworthiness, and timeliness of the information contained within them, and the sentiment/opinion expressed on specific topics. Our analysis reveals several important differences in the data obtained through the two sampling methodologies, which have serious implications for applications such as topical search, trustworthy content recommendations, breaking news detection, and opinion mining.
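The contrast between the two sampling methodologies can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Tweet` type, the toy stream, the `expert_ids` set, and the per-tweet coin flip used for random sampling are all assumptions made for illustration, and Twitter's own sampled streams may be produced differently.

```python
import random
from typing import Iterable, List, NamedTuple, Set


class Tweet(NamedTuple):
    # Hypothetical minimal tweet record, for illustration only.
    user_id: int
    text: str


def random_sample(tweets: Iterable[Tweet], rate: float = 0.01, seed: int = 0) -> List[Tweet]:
    """Random sampling: keep each tweet independently with probability `rate`."""
    rng = random.Random(seed)
    return [t for t in tweets if rng.random() < rate]


def expert_sample(tweets: Iterable[Tweet], expert_ids: Set[int]) -> List[Tweet]:
    """Expert sampling: keep every tweet posted by a user in the expert set."""
    return [t for t in tweets if t.user_id in expert_ids]


# Toy stream: 1,000 users posting 100 tweets each (assumed data).
tweets = [Tweet(user_id=i % 1000, text=f"tweet {i}") for i in range(100_000)]

# Hypothetical expert set covering well under 1% of the user population.
expert_ids = {3, 14, 159}

random_part = random_sample(tweets, rate=0.01)
expert_part = expert_sample(tweets, expert_ids)

print(len(random_part), "tweets from", len({t.user_id for t in random_part}), "users (random)")
print(len(expert_part), "tweets from", len({t.user_id for t in expert_part}), "users (expert)")
```

In this sketch the random sample draws a few tweets from many users, while the expert sample keeps the complete output of a few users, which is the structural difference the comparison in the abstract rests on.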
Index Terms
- Sampling Content from Online Social Networks: Comparing Random vs. Expert Sampling of the Twitter Stream