Abstract
The growing popularity of smartphones has attracted a large number of developers, who offer applications for the different smartphone platforms through the respective app markets. One consequence of this popularity is that app markets are also becoming populated with spam apps. These apps degrade users’ quality of experience and increase the workload of app market operators, who must identify and remove them. Spam apps take many forms: apps with no specific functionality, apps with unrelated descriptions or keywords, and similar apps published several times across diverse categories. Market operators maintain antispam policies and remove apps through continuous monitoring. Through a systematic crawl of a popular app market, and by identifying apps that were removed over a period of time, we propose a method to detect spam apps using only the app metadata available at the time of publication. We first propose a methodology to manually label a sample of removed apps according to a set of checkpoint heuristics that reveal the reasons behind removal. This analysis suggests that approximately 35% of the removed apps are very likely to be spam apps. We then map the identified heuristics to several quantifiable features and show how well these features distinguish spam apps. We build an Adaptive Boosting (AdaBoost) classifier for early identification of spam apps using only app metadata. Our classifier achieves an accuracy of over 95%, with precision varying between 85% and 95% and recall varying between 38% and 98%. We further show that a limited number of features, in the range of 10 to 30, generated from app metadata is sufficient to achieve a satisfactory level of performance. On a set of 180,627 apps that were present in the app market during our crawl, our classifier predicts 2.7% of the apps as potential spam.
Finally, we perform additional manual verification and show that human reviewers agree with 82% of our classifier predictions.
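To make the boosting step concrete, the following is a minimal sketch of AdaBoost with decision stumps as weak learners, written from scratch in Python. It is an illustration under stated assumptions, not the classifier or feature set used in this work: the two features (description word count and repeated keyword count), the six data points, and all function names are hypothetical.

```python
import math

# Hypothetical toy data, NOT taken from the paper. Each row is
# [description_word_count, repeated_keyword_count]; +1 = spam, -1 = not spam.
X = [[300, 1], [250, 2], [400, 0], [50, 1], [280, 9], [40, 10]]
y = [-1, -1, -1, 1, 1, 1]


def train_stump(X, y, w):
    """Return the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                # Rule: predict `sign` if feature j <= thr, otherwise -sign.
                preds = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign


def adaboost_fit(X, y, rounds):
    """Train `rounds` stumps, re-weighting misclassified samples each round."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Misclassified samples gain weight; correct ones lose weight.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def adaboost_predict(model, row):
    """Weighted vote of all stumps; a non-negative score means spam (+1)."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1


model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

On this toy set no single stump separates the two classes, but the weighted vote of three stumps classifies every training point correctly. That is the property boosting relies on: each round concentrates weight on the samples the previous weak learners got wrong.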