Focused crawling with scalable ordinal regression solvers

Authors:
Rashmin Babaria, Indian Institute of Science, Bangalore
J. Saketha Nath, Indian Institute of Science, Bangalore
Krishnan S, Indian Institute of Science, Bangalore
Sivaramakrishnan K R, Indian Institute of Science, Bangalore
Chiranjib Bhattacharyya, Indian Institute of Science, Bangalore
M. N. Murty, Indian Institute of Science, Bangalore

ICML '07: Proceedings of the 24th international conference on Machine learning, June 2007, Pages 57–64
https://doi.org/10.1145/1273496.1273504

Published: 20 June 2007

ABSTRACT

In this paper we propose a novel, scalable, clustering-based Ordinal Regression formulation, which is an instance of a Second Order Cone Program (SOCP) with a single Second Order Cone (SOC) constraint. The first main contribution of the paper is a fast algorithm, CB-OR, which solves the proposed formulation more efficiently than general-purpose solvers. The second is to pose focused crawling as a large-scale Ordinal Regression problem and solve it using the proposed CB-OR. Focused crawling is an efficient mechanism for discovering resources of interest on the web. Posing it as an Ordinal Regression problem avoids the need for a negative class and a topic hierarchy, which are the main drawbacks of existing focused crawling methods. Experiments on large synthetic and benchmark datasets show the scalability of CB-OR. Experiments also show that the proposed focused crawler outperforms the state of the art.
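
The abstract does not spell out how a learned ordinal regressor steers the crawl, so the following is only a minimal sketch under stated assumptions, not the CB-OR algorithm itself. It assumes a standard threshold-based ordinal model (a linear score compared against sorted thresholds) and assumes that a higher predicted rank means a more promising link. The names `predict_rank` and `Frontier`, the weight vector `w`, and the threshold list are all hypothetical and do not come from the paper.

```python
import heapq
from bisect import bisect_right


def predict_rank(w, thresholds, x):
    """Map feature vector x to an ordinal rank in 1..len(thresholds)+1.

    Assumed model: score = w . x, and the rank is one plus the number of
    (sorted, ascending) thresholds that the score meets or exceeds.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    return bisect_right(thresholds, score) + 1


class Frontier:
    """Crawl frontier that always yields the highest-ranked URL first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal ranks come out in insertion order

    def push(self, url, rank):
        # heapq is a min-heap, so the rank is negated to pop the largest first.
        heapq.heappush(self._heap, (-rank, self._count, url))
        self._count += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawler loop one would, under these assumptions, compute features for each outgoing link of a fetched page, call `predict_rank` on them, `push` the link with that rank, and `pop` the next URL to fetch. No negative class appears anywhere in this scheme: every link simply receives one of the ordered ranks.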

Published in
ICML '07: Proceedings of the 24th international conference on Machine learning
June 2007
1233 pages
ISBN: 9781595937933
DOI: 10.1145/1273496
Editor: Zoubin Ghahramani, University of Cambridge, United Kingdom
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 140 of 548 submissions, 26%