DOI: 10.1145/1017074.1017077
Article

Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Published: 17 June 2004

ABSTRACT

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
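
The general idea of treating outliers in the distribution of a page property as spam candidates can be illustrated with a minimal sketch. This is not the procedure used in the paper, which the abstract does not spell out: the choice of a modified z-score based on the median absolute deviation, the threshold of 3.5, the function name `flag_outliers`, and the sample out-link counts are all assumptions made for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return indices of values that lie far from the median.

    Hypothetical helper: uses the modified z-score
    0.6745 * |x - median| / MAD, an assumed outlier test,
    not one taken from the paper.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:
        # No spread around the median: flag anything that differs from it.
        return [i for i, d in enumerate(abs_dev) if d > 0]
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Made-up out-link counts for eight pages; the last one is extreme.
out_links = [10, 12, 11, 9, 13, 10, 11, 500]
suspects = flag_outliers(out_links)
print(suspects)
```

Pages flagged this way would only be candidates for closer inspection. Whether outliers in a given property actually correspond to spam is the empirical question the paper examines.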


            Published in

            WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
            June 2004
            100 pages
            ISBN: 9781450377881
            DOI: 10.1145/1017074

            Copyright © 2004 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 30 of 100 submissions, 30%
