Abstract
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no work previously has evaluated systematically the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well---well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes---that is, Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.
- J. C. Almack. The influence of intelligence on the selection of associates. School and Society, 16: 529-530, 1922.Google Scholar
- A. Bernstein, S. Clearwater, and F. Provost. The relational vector-space model and industry classification. In Proceedings of the Learning Statistical Models from Relational Data Workshop at the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.Google Scholar
- J. Besag. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, 36(2):192-236, 1974.Google Scholar
- J. Besag. Statistical analysis of non-lattice data. The Statistician, 24(3):179-195, 1975.Google ScholarCross Ref
- J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3): 259-302, 1986.Google Scholar
- P. M. Blau. Inequality and Heterogeneity: A Primitive Theory of Social Structure. New York: Free Press, 1977.Google Scholar
- A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 19-26, 2001. Google ScholarDigital Library
- A. Blum, J. Lafferty, R. Reddy, and M. R. Rwebangira. Semi-supervised learning using randomized mincuts. In Proceedings of the Twentyfirst International Conference on Machine Learning (ICML), pages 97-104, 2004. Google ScholarDigital Library
- H. Bott. Observation of play activities in a nursery school. Genetic Psychology Monographs, 4: 44-88, 1928.Google Scholar
- Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 23(11):1222-1239, 2001. Google ScholarDigital Library
- S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 307-319, 1998. Google ScholarDigital Library
- C. Cortes, D. Pregibon, and C. T. Volinsky. Communities of interest. In Proceedings of the Fourth International Conference on Advances in Intelligent Data Analysis (IDA), pages 105-114, 2001. Google ScholarDigital Library
- R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. Probabilistic networks and expert systems. Springer, 1999. Google ScholarDigital Library
- M. Craven, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and C. Y. Quek. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 509-516, 1998. Google ScholarDigital Library
- L. De Raedt, H. Blockeel, L. Dehaspe, and W. Van Laer. Three companions for data mining in first order logic. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining, pages 105-139. Berlin; New York: Springer, 2001. Google ScholarDigital Library
- I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269-274, 2001. Google ScholarDigital Library
- R. L. Dobrushin. The description of a random field by means of conditional probabilities and conditions of its regularity. Theory of Probability and its Applications, 13(2):197-224, 1968.Google ScholarCross Ref
- P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57-66, 2001. Google ScholarDigital Library
- S. Dzeroski and N. Lavrac. Relational Data Mining. Berlin; New York: Springer, 2001. Google ScholarDigital Library
- T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 3: 291-316, 1997. Google ScholarDigital Library
- T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62, 1999. Google ScholarDigital Library
- P. A. Flach and N. Lachiche. Naive Bayesian classification of structured data. Machine Learning, 57:233-269, 2004. Google ScholarDigital Library
- N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1300-1309, 1999. Google ScholarDigital Library
- A. Galstyan and P. Cohen. Is guilt by association a bad thing? In Proceedings of the First International Conference on Intelligence Analysis (IA), 2005.Google Scholar
- S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 6:721-741, 1984.Google ScholarDigital Library
- W. R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, 1995.Google ScholarCross Ref
- D. Greig, B. Porteous, and A. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 51(2):271-279, 1989.Google Scholar
- D. D. Heckathorn and J. Jeffri. Jazz networks: Using respondent-driven sampling to study stratification in two jazz musician communities. Unpublished paper presented at American Sociological Association Annual Meeting, August 2003.Google Scholar
- D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie. Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research (JMLR), 1:49-75, 2000. Google ScholarDigital Library
- S. Hill, F. Provost, and C. Volinsky. Network-based marketing: Identifying likely adopters via consumer networks. Statistical Science, 22(2):256-276, 2006a.Google ScholarCross Ref
- S. Hill, D. K. Agarwal, R. Bell, and C. Volinsky. Building an effective representation for dynamic networks. Journal of Computational & Graphical Statistics, 15(3):584-608, 2006b.Google ScholarCross Ref
- G. E. Hinton and T. J. Sejnowski. Learning and relearning in Boltzmann machines. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, 1: Foundations:282-317, 1986. Google ScholarDigital Library
- J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79(8): 2554-2558, 1982.Google Scholar
- Z. Huang, H. Chen, and D. Zeng. Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems (TOIS), 22(1): 116-142, 2004. Google ScholarDigital Library
- R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(3):267-287, 1983.Google ScholarDigital Library
- E. Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift f. Physik, 31:253-258, 1925. {German}.Google ScholarCross Ref
- D. Jensen and J. Neville. Data mining in social networks. In National Academy of Sciences Symposium on Dynamic Social Network Modeling and Analysis, 2002a.Google Scholar
- D. Jensen and J. Neville. Linkage and autocorrelation cause feature selection bias in relational learning. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 259-266, 2002b. Google ScholarDigital Library
- D. Jensen, J. Neville, and B. Gallagher. Why collective inference improves relational classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 593-598, 2004. Google ScholarDigital Library
- T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 290-297, 2003.Google Scholar
- J. Kleinberg and É. Tardos. Approximation algorithms for classification problems with pairwise relations: Metric labeling and Markov random fields. In Proceedings of the Fortieth Annual Symposium on Foundations of Computer Science (FOCS), pages 14-23, 1999. Google ScholarDigital Library
- R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2): 273-324, 1997. Google ScholarDigital Library
- D. Koller and A. Pfeffer. Probabilistic frame-based systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI), pages 580-587, 1998. Google ScholarDigital Library
- S. Kramer, N. Lavrac, and P. Flach. Propositionalization approaches to relational data mining. In S. Dzeroski and N. Lavrac, editors, Relational Data Mining, pages 262-291. Berlin; New York: Springer, 2001. Google ScholarDigital Library
- J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), pages 282-289, 2001. Google ScholarDigital Library
- P. Langley. Crafting papers on machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML), pages 1207-1212, 2000. Google ScholarDigital Library
- P. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and methodological analysis. In M. Berger, T. Abel, and C. H. Page, editors, Freedom and Control in Modern Society, pages 18-66. Van Nostrand, 1954.Google Scholar
- C. P. Loomis. Political and occupational cleavages in a Hanoverian village. Sociometry, 9:316- 3333, 1946.Google ScholarCross Ref
- Q. Lu and L. Getoor. Link-based classification. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 496-503, 2003.Google Scholar
- S. A. Macskassy and F. Provost. A simple relational classifier. In Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.Google ScholarCross Ref
- S. A. Macskassy and F. Provost. Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In Proceedings of the First International Conference on Intelligence Analysis (IA), 2005.Google Scholar
- A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127-163, 2000. Google ScholarDigital Library
- M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415-444, 2001.Google Scholar
- K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief-propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 467-475, 1999. Google ScholarDigital Library
- J. Neville and D. Jensen. Iterative classification in relational data. In Proceedings of the Workshop on Learning Statistical Models from Relational Data at the Seventeenth National Conference on Artificial Intelligence (AAAI), pages 13-20, 2000.Google Scholar
- J. Neville and D. Jensen. Collective classification with relational dependency networks. In Proceedings of the Multi-Relational Data Mining Workshop (MRDM) at the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.Google Scholar
- J. Neville and D. Jensen. Dependency networks for relational data. In Proceedings of the Fourth IEEE International Conference in Data Mining (ICDM), pages 170-177, 2004. Google ScholarDigital Library
- J. Neville and D. Jensen. Leveraging relational autocorrelation with latent group models. In Proceedings of the Fifth IEEE International Conference in Data Mining (ICDM), pages 322-329, 2005. Google ScholarDigital Library
- J. Neville and D. Jensen. Relational dependency networks. Journal of Machine Learning Research (JMLR), 8(Mar):653-692, 2007. Google ScholarDigital Library
- J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning relational probability trees. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 625-630, 2003. Google ScholarDigital Library
- J. Neville, Ö. Simsek, D. Jensen, J. Komoroske, K. Palmer, and H. Goldberg. Using relational knowledge discovery to prevent securities fraud. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 449-458, 2005. Google ScholarDigital Library
- M. E. J. Newman. Mixing patterns in networks. Physical Review E, 67, 2003. 026126.Google Scholar
- J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988. Google ScholarDigital Library
- C. Perlich. Citation-based document classification. In Workshop on Information Technology and Systems (WITS), 2003.Google Scholar
- C. Perlich and F. Provost. Aggregation-based feature invention and relational concept classes. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 167-176, 2003. Google ScholarDigital Library
- C. Perlich and F. Provost. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning, 62(1/2):65-105, 2006. Google ScholarDigital Library
- C. Perlich, F. Provost, and J. Simonoff. Tree induction vs. logistic regression: A learning-curve analysis. Journal of Machine Learning Research (JMLR), 4:211-255, 2003. Google ScholarDigital Library
- A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In Proceedings of the Learning Statistical Models from Relational Data Workshop at the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI), 2003.Google Scholar
- R. B. Potts. Some generalized order-disorder transformations. Cambridge Philosophic Society, 48: 106-109, 1952.Google ScholarCross Ref
- H. M. Richardson. Community of values as a factor in friendships of college and adult women. Journal of Social Psychology, 11:303-312, 1940.Google ScholarCross Ref
- M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62(1/2):107-136, 2006. Google ScholarDigital Library
- J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, chapter 14, pages 313-323. Prentice-Hall, 1971.Google Scholar
- A. Rosenfeld, R. Hummel, and S. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6:420-433, 1976.Google ScholarCross Ref
- L. J. Savage. The Foundations of Statistics. John Wiley and Sons, 1954.Google Scholar
- E. Segal, H. Wang, and D. Koller. Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19:I264-I272, Jul 2003a.Google ScholarCross Ref
- E. Segal, R. Yelensky, and D. Koller. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19:I273-I282, Jul 2003b.Google ScholarCross Ref
- B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI), pages 485-492, 2002. Google ScholarDigital Library
- B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proceedings of the Seventeenthth International Joint Conference on Artificial Intelligence (IJCAI) , pages 870-878, 2001. Google ScholarDigital Library
- B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Proceedings of the Twentyfirst International Conference on Machine Learning (ICML), 2004. Google ScholarDigital Library
- K. Tumulty. Inside Bush's Secret Spy Net {electronic version}. Time, 167(21), 2006. Retrieved May 25, 2006, from http://www.time.com/time/archive/preview/0,10987,1194021, 00.html.Google Scholar
- V. N. Vapnik. The support vector method of function estimation. In J. Suykens and J. Vandewalle, editors, Nonlinear Modeling: Advanced Black-Box Techniques, pages 55-86. Kluwer, Boston, 1998a.Google Scholar
- V. N. Vapnik. Statistical Learning Theory. John Wiley, NY, 1998b.Google Scholar
- M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, University of California, Berkeley, 2003.Google Scholar
- G. Winkler. Image Analysis, Random Fields and Markov Chain Monte Carlo Methods. Springer-Verlag, 2nd edition, 2003. Google ScholarDigital Library
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools with Java Implementations . Morgan Kaufmann, San Francisco, 2000. Google ScholarDigital Library
- X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 912-919, 2003.Google Scholar
Index Terms
- Classification in Networked Data: A Toolkit and a Univariate Case Study
Recommendations
Classification in Networked Data: A Toolkit and a Univariate Case Study
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case-study of its application to ...
Unsupervised Ensemble Classification With Sequential and Networked Data
Ensemble learning, the machine learning paradigm where multiple models are combined, has exhibited promising perfomance in a variety of tasks. The present work focuses on unsupervised ensemble classification. The term unsupervised refers to the ensemble ...
Modified nearest neighbour classifier for hyperspectral data classification
A modified k-nearest neighbour k-NN classifier is proposed for supervised remote sensing classification of hyperspectral data. To compare its performance in terms of classification accuracy and computational cost, k-NN and a back-propagation neural ...
Comments