ABSTRACT
The purpose of this study is to demonstrate the benefit of using common data mining techniques on survey data where statistical analysis is routinely applied. Statistical surveys are commonly used to collect quantitative information about items in a population, and statistical analysis is usually carried out on survey data to test hypotheses. In this paper we report an application of data mining methodologies to breast feeding survey data that were collected and analysed by statisticians. The purpose of the research is to study the factors that lead to the decision whether or not to breast feed a newborn baby. Various data mining methods are applied to the data. Feature (variable) selection is conducted to select the most discriminative and least redundant features, using an information-theory-based method and a statistical approach. Decision tree and regression approaches are tested on classification tasks using the selected features. A risk pattern mining method is also applied to identify groups with a high risk of not breast feeding. The success of data mining in this study suggests that data mining approaches will be applicable to other similar survey data. Data mining methods, which enable a search for hypotheses, may be used as a survey data analysis tool complementary to traditional statistical analysis.
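To make two of the ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain mutual information as a stand-in for the information-theory-based feature scoring, and a simple relative risk as a stand-in for the measure behind risk pattern mining. The function names, the variables `feature_a`, `feature_b` and `not_bf`, and the eight toy records are all hypothetical and are not taken from the survey.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


def relative_risk(exposed, outcome):
    """Risk of the outcome among exposed records divided by risk among unexposed."""
    exposed_total = sum(1 for e in exposed if e)
    unexposed_total = len(exposed) - exposed_total
    exposed_cases = sum(1 for e, o in zip(exposed, outcome) if e and o)
    unexposed_cases = sum(1 for e, o in zip(exposed, outcome) if not e and o)
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# Hypothetical toy records (NOT survey data): 1 = yes, 0 = no.
not_bf    = [1, 1, 1, 0, 1, 0, 0, 0]  # outcome: did not breast feed
feature_a = [1, 1, 1, 1, 0, 0, 0, 0]  # a feature associated with the outcome
feature_b = [1, 1, 0, 0, 0, 1, 1, 0]  # a feature independent of the outcome

for name, feature in [("feature_a", feature_a), ("feature_b", feature_b)]:
    print(name,
          "MI =", round(mutual_information(feature, not_bf), 4),
          "RR =", round(relative_risk(feature, not_bf), 2))
```

On this toy data the associated feature scores higher on both measures than the independent one, which is the kind of ranking a feature selection step relies on. Accounting for redundancy between features, as the abstract describes, would need an additional conditional step that this sketch omits.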
Index Terms
- Analysis of breast feeding data using data mining methods