Abstract
Data mining extracts meaningful information from vast collections of data. With the advent of the Big Data concept, data mining has gained much greater prominence. Discovering knowledge efficiently from gigantic volumes of data is a major concern because resources are limited, and cloud computing plays a major role in addressing it. Cloud data mining combines classical data mining with the promises of cloud computing, which allows knowledge discovery from huge volumes of data to be performed efficiently. This article presents the existing frameworks, services, platforms, and algorithms for cloud data mining. The frameworks and platforms are compared with one another on the basis of similarity, data mining task support, parallelism, distribution, streaming data processing support, fault tolerance, security, memory types, storage systems, and other criteria. Similarly, the algorithms are grouped by parallelism type, scalability, streaming data mining support, and types of data managed. We also provide taxonomies based on data mining techniques such as clustering, classification, and association rule mining, and we discuss and identify the major applications of cloud data mining. Taxonomies are likewise identified for cloud data mining frameworks, platforms, and algorithms. This article aims to give better insight into the present research landscape and to direct future research toward efficient data mining in future cloud systems.
- Theregister. 2019. Google Tests its Own Quantum Computer -- Both Qubits of it. Retrieved from https://www.theregister.co.uk/2016/07/21/google_tests_a_quantum_computer_its_own_both_qubits_of_it/.Google Scholar
- Quantum computing -- Wikipedia. 2019. Retrieved from https://en.wikipedia.org/wiki/Quantum_computing.Google Scholar
- E. Rieffel and W. Polak. 2000. An introduction to quantum computing for non-physicists. ACM Comput. Surveys 32, 3 (2000), 300--335. Google ScholarDigital Library
- V. S. Denchev and G. Pandurangan. 2008. Distributed quantum computing: A new frontier in distributed systems or science fiction? ACM SIGACT News 39, 3 (2008), 77--95. Google ScholarDigital Library
- I. A. T. Hashem, I. Yaqoob, N. B. Anuar, S. Mokhtar, A. Gani, and S. U. Khan. 2015. The rise of big data on cloud computing: Review and open research issues. Info. Syst. 47 (2015), 98--115. Google ScholarDigital Library
- T. Mastelic, A. Oleksiak, H. Claussen, I. Brandic, J. M. Pierson, and A. V. Vasilakos. 2015. Cloud computing: Survey on energy efficiency. ACM Comput. Surveys 47, 2 (2015), 33. Google ScholarDigital Library
- D. Chakrabarti and C. Faloutsos. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surveys 38, 1 (2006), 2. Google ScholarDigital Library
- S. Venugopal, R. Buyya, and K. Ramamohanarao. 2006. A taxonomy of data grids for distributed data sharing, management, and processing. ACM Comput. Surveys 38, 1 (2006), 3. Google ScholarDigital Library
- I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. 2015. Approxhadoop: Bringing approximations to MapReduce frameworks. In ACM SIGARCH Computer Architecture News. ACM, Vol. 43, No. 1, 383--397. Google ScholarDigital Library
- O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir. 2014. The rise of RaaS: The resource-as-a-service cloud. Commun. ACM 57, 7 (2014), 76--84. Google ScholarDigital Library
- F. Pan, G. Cong, A. K. Tung, J. Yang, and M. J. Zaki. 2003. Carpenter: Finding closed patterns in long biological datasets. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 637--642. Google ScholarDigital Library
- K. A. Shakil and M. Alam. 2016. Recent developments in cloud-based systems: State of art. Int. J. Comput. Sci. Info. Secur. 14, 12 (2016), 242.Google Scholar
- V. Nekvapil. 2015. Cloud computing in data mining-A survey. J. Syst. Integr. 6, 1 (2015), 12.Google ScholarCross Ref
- M. Marjani, F. Nasaruddin, A. Gani, A. Karim, I. A. T. Hashem, A. Siddiqa, and I. Yaqoob. 2017. Big IoT data analytics: Architecture, opportunities, and open research challenges. IEEE Access 5 (2017), 5247--5261.Google ScholarCross Ref
- T. Hu, H. Chen, L. Huang, and X. Zhu. 2012. A survey of mass data mining based on cloud-computing. In Proceedings of the International Conference on Anti-Counterfeiting, Security and Identification (ASID’12). IEEE, 1--4.Google Scholar
- C. W. Tsai, C. F. Lai, H. C. Chao, and A. V. Vasilakos. 2015. Big data analytics: A survey. J. Big Data 2, 1 (2015), 21.Google ScholarCross Ref
- A. S. Shirkhorshidi, S. Aghabozorgi, T. Y. Wah, and T. Herawan. 2014. Big data clustering: A review. In Proceedings of the International Conference on Computational Science and Its Applications. Springer, Cham, 707--720.Google Scholar
- B. Zerhari, A. A. Lahcen, and S. Mouline. 2015. Big data clustering: Algorithms and challenges. In Proceedings of the International Conference on Big Data, Cloud and Applications (BDCA’15).Google Scholar
- A. Mohebi, S. Aghabozorgi, T. Ying Wah, T. Herawan, and R. Yahyapour. 2016. Iterative big data clustering algorithms: A review. Software: Pract. Exper. 46, 1 (2016), 107--129. Google ScholarDigital Library
- A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, A. Y. Zomaya, S. Foufou, and A. Bouras. 2014. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Trans. Emerg. Topics Comput. 2, 3 (2014), 267--279.Google ScholarCross Ref
- D. Singh and C. K. Reddy. 2015. A survey on platforms for big data analytics. J. Big Data 2, 1 (2015), 8.Google ScholarCross Ref
- H. Tong and U. Kang. 2013. Big Data Clustering. Data Clustering: Algorithms and Applications, Chapter 11. CRC Press, Taylor 8 Francis Group, 259--276.Google Scholar
- X. Lin. 2014. Mr-Apriori: Association rules algorithm based on MapReduce. In Proceedings of the 5th IEEE International Conference on Software Engineering and Service Science (ICSESS’14). IEEE, 141--144.Google ScholarCross Ref
- Q. He, F. Zhuang, J. Li, and Z. Shi. 2010. Parallel implementation of classification algorithms based on MapReduce. In Proceedings of the International Conference on Rough Sets and Knowledge Technology. Springer, Berlin, 655--662. Google ScholarDigital Library
- IBM. 2019. Bluemix is now IBM Cloud. Retrieved from https://www.ibm.com/blogs/bluemix/2017/10/bluemix-is-now-ibm-cloud/.Google Scholar
- A. Gheith et al. 2016, IBM Bluemix mobile cloud services. IBM J. Res. Dev. 60, 2-3 (Mar. 2016), 7:1--7:12. Google ScholarDigital Library
- Google Cloud. 2019. Cloud Machine Learning Engine. Retrieved from https://cloud.google.com/ml-engine/.Google Scholar
- GE. 2019. Predix Platform Brief-GE. Retrieved from https://www.ge.com/digital/sites/default/files/Predix-The-Industrial-Internet-Platform-Brief.pdf.Google Scholar
- TCS. 2019. TCS Connected Universe Platform. Retrieved from https://www.tcs.com/tcs-connected-universe-platform.Google Scholar
- IBM Watson | IBM. 2019. Retrieved from https://www.ibm.com/watson/.Google Scholar
- Machine Learning Studio | Microsoft Azure. 2019. Retrieved from https://azure.microsoft.com/en-in/services/machine-learning-studio/.Google Scholar
- D. R. Krishnan, D. L. Quoc, P. Bhatotia, C. Fetzer, and R. Rodrigues. 2016. Incapprox: A data analytics system for incremental approximate computing. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1133--1144. Google ScholarDigital Library
- Spark Streaming | Apache Spark. 2019. Retrieved from https://spark.apache.org/streaming/.Google Scholar
- A. Bifet, S. Maniu, J. Qian, G. Tian, C. He, and W. Fan. 2015. StreamDM: Advanced data mining in Spark streaming. In Proceedings of the IEEE International Conference on Data Mining Workshop (ICDMW’15). IEEE, 1608--1611. Google ScholarDigital Library
- Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. 2018. Deep learning for IoT big data and streaming analytics: A survey. IEEE Commun. Surveys Tutor. 20, 4 (2018), 2923--2960.Google ScholarDigital Library
- A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. 2010. Moa: Massive online analysis. J. Mach. Learn. Res. 11 (May 2010), 1601--1604. Google ScholarDigital Library
- B. R. Prasad and S. Agarwal. 2016. Stream data mining: Platforms, algorithms, performance evaluators, and research trends. Int. J. Database Theory Appl. 9, 9 (2016), 201--218.Google ScholarCross Ref
- G. D. F. Morales and A. Bifet. 2015. SAMOA: Scalable advanced massive online analysis. J. Mach. Learn. Res. 16, 1 (2015), 149--153. Google ScholarDigital Library
- A. Amini, T. Y. Wah, and H. Saboohi. 2014. On density-based data streams clustering algorithms: A survey. J. Comput. Sci. Technol. 29, 1 (2014), 116--141.Google ScholarCross Ref
- H. Song and J. G. Lee. 2018. RP-DBSCAN: A superfast parallel DBSCAN algorithm based on random partitioning. In Proceedings of the International Conference on Management of Data. ACM, 1173--1187. Google ScholarDigital Library
- O. Backhoff and E. Ntoutsi. 2016. Scalable online-offline stream clustering in apache spark. In Proceedings of the IEEE 16th International Conference on Data Mining Workshops (ICDMW’16). IEEE, 37--44. Google ScholarDigital Library
- J. Zgraja and M. Woniak. 2018. Drifted data stream clustering based on ClusTree algorithm. In Proceedings of the International Conference on Hybrid Artificial Intelligence Systems. Springer, Cham, 338--349.Google Scholar
- C. Sauvanaud, G. Silvestre, M. Kaniche, and K. Kanoun. 2015. Data stream clustering for online anomaly detection in cloud applications. In Proceedings of the 11th European Dependable Computing Conference (EDCC’15). IEEE, 120--131. Google ScholarDigital Library
- L. Tu and Y. Chen. 2009. Stream data clustering based on grid density and attraction. ACM Trans. Knowl. Discov. Data 3, 3 (2009), 12. Google ScholarDigital Library
- R. Latif, H. Abbas, S. Latif, and A. Masood. 2015. EVFDT: An enhanced very fast decision tree algorithm for detecting distributed denial of service attack in cloud-assisted wireless body area network. Mobile Info. Syst. 2015, Article 260594 (2015), 13 pages.Google Scholar
- T. M. Al-Khateeb, M. M. Masud, L. Khan, and B. Thuraisingham. 2012. Cloud guided stream classification using class-based ensemble. In Proceedings of the IEEE 5th International Conference on Cloud Computing (CLOUD’12). IEEE, 694--701. Google ScholarDigital Library
- J. Chen, K. Li, Z. Tang, K. Bilal, S. Yu, C. Weng, and K. Li. 2017. A parallel random forest algorithm for big data in a spark cloud computing environment. IEEE Trans. Parallel Distrib. Syst. 1 (2017), 1--1. Google ScholarDigital Library