skip to main content
10.1145/1273496.1273504acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicmlConference Proceedingsconference-collections
Article

Focused crawling with scalable ordinal regression solvers

Published:20 June 2007Publication History

ABSTRACT

In this paper we propose a novel, scalable, clustering based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with one Second Order Cone (SOC) constraint. The main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more eficiently than general purpose solvers. Another main contribution of the paper is to pose the problem of focused crawling as a large scale Ordinal Regression problem and solve using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing the problem of focused crawling as an Ordinal Regression problem avoids the need for a negative class and topic hierarchy, which are the main drawbacks of the existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state-of-the-art.

References

  1. Aggarwal, C., Al-Garawi, F., & Yu, P. (2001). Intelligent crawling on the World Wide Web with arbitrary predicates. Proc. of 10th Intl. Conf. on WWW. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Chakrabarti, S., Punera, K., & Subramanyam, M. (2002). Accelerated focused crawling through online relevance feedback. Proc. of 11th Intl. Conf. on World Wide Web, 148--159. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Chakrabarti, S., van den Berg, M., & Dom, B. (1999). Focused Crawling: A New Approach for Topic-Specific Resource Discovery. WWW Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Chu, W., & Keerthi, S. (2005). New approaches to support vector ordinal regression. Proc. of 22nd Intl. Conf. on Machine learning, 145--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Crammer, K., & Singer, Y. (2002). Pranking with ranking. NIPS, 14.Google ScholarGoogle Scholar
  6. Davison, B. (2000). Topical locality in the Web. Proc. of 23rd Intl. Conf. on Research and development in Information Retrieval, 272--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C., & Gori, M. (2000). Focused crawling using context graphs. Proc. of 26th Intl. Conf. on VLDB. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Erdougan, E., & Iyengar, G. (2006). An active set method for single-cone second-order cone programs. SIAM J. on Optimization, 17, 459--484. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Grangier, D., & Bengio, S. (2005). Exploiting Hyperlinks to Learn a Retrieval Model. Proc. of NIPS Workshop.Google ScholarGoogle Scholar
  10. Har-Peled, S., Roth, D., & Zimak, D. Constraint classification: A new approach to multiclass classification and ranking. NIPS.Google ScholarGoogle Scholar
  11. Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 115--132.Google ScholarGoogle Scholar
  12. Kleinberg, J. (1999). Authoritative sources in a hyper-linked environment. Journal of the ACM (JACM), 46, 604--632. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Nath, J. S., Bhattacharyya, C., & Murty, M. N. (2006). Clustering based large margin classification: a scalable approach using socp formulation. Proc. of 12th Intl. Conf. on KDD (pp. 674--679). Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods---Support Vector Learning (pp. 185--208). Cambridge, MA: MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two approaches. NIPS, 15.Google ScholarGoogle Scholar
  16. Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. Proc. of Intl. Conf. on Management of data, 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    ICML '07: Proceedings of the 24th international conference on Machine learning
    June 2007
    1233 pages
    ISBN:9781595937933
    DOI:10.1145/1273496

    Copyright © 2007 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 20 June 2007

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • Article

    Acceptance Rates

    Overall Acceptance Rate140of548submissions,26%

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader