ABSTRACT
In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. A second contribution is to pose focused crawling as a large-scale Ordinal Regression problem and solve it with the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing it as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets demonstrate the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state-of-the-art.
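The abstract does not spell out the formulation, so the following is only a minimal sketch of how a trained ordinal regression model is commonly applied, assuming the usual threshold model: a linear score is compared against K-1 ordered thresholds to obtain a rank in 1..K. The names `predict_rank`, `w`, and `thresholds` are hypothetical and are not taken from the paper; the sketch shows prediction only, not the CB-OR training algorithm.

```python
import numpy as np

def predict_rank(w, thresholds, X):
    """Assign an ordinal rank in 1..K to each row of X.

    Assumes a threshold model (an illustration, not the paper's exact method):
    score = X @ w, and the rank is 1 plus the number of thresholds
    strictly below the score. `thresholds` must be sorted ascending.
    """
    scores = X @ w
    # side="left" counts the thresholds strictly smaller than each score
    return 1 + np.searchsorted(thresholds, scores, side="left")

# Hypothetical example: 2 features, 3 ordinal levels (2 thresholds)
w = np.array([1.0, 0.0])
thresholds = np.array([0.0, 1.0])
X = np.array([[-1.0, 0.0], [0.5, 3.0], [2.0, 0.0]])
ranks = predict_rank(w, thresholds, X)
```

In a focused crawler, such a rank could plausibly serve as a visiting priority for candidate pages, which is one way to read the claim that no explicit negative class is required.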