ABSTRACT
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
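To make the idea of a statistical outlier concrete, the following is a minimal sketch and not the method used in the paper. It assumes a single numeric property per page (here a hypothetical list of out-degrees) and flags values whose modified z-score, computed from the median and the median absolute deviation, exceeds a threshold. The function name `flag_outliers`, the threshold of 3.5, and the sample data are all illustrative assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values that lie far from the bulk of the data.

    Uses the modified z-score 0.6745 * |v - median| / MAD, a generic
    robust outlier test chosen for illustration (an assumption, not
    the paper's procedure).
    """
    med = median(values)
    # Median absolute deviation: a robust estimate of spread.
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: at least half the values equal the median,
        # so anything that differs from it is treated as an outlier.
        return [i for i, v in enumerate(values) if v != med]
    return [
        i
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical out-degrees for eight pages; the last one is extreme.
    out_degrees = [12, 15, 9, 14, 11, 13, 10, 480]
    print(flag_outliers(out_degrees))
```

A flagged page is only a candidate for inspection: an outlier in one property suggests, but does not establish, that the page is spam.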
Index Terms
- Spam, damn spam, and statistics: using statistical analysis to locate spam web pages