Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Authors:
Dennis Fetterly, Microsoft Research, Mountain View, CA
Mark Manasse, Microsoft Research, Mountain View, CA
Marc Najork, Microsoft Research, Mountain View, CA

WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, June 2004, Pages 1–6. https://doi.org/10.1145/1017074.1017077

Published: 17 June 2004

ABSTRACT

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.

We propose that some spam web pages can be identified through statistical analysis: certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam.

This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.
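The general idea of flagging pages whose properties are statistical outliers can be illustrated with a minimal sketch. This is not the method used in the paper, which is not spelled out in the abstract. It swaps in a generic outlier test, the modified z-score based on the median and the median absolute deviation (MAD), with the commonly used cutoff of 3.5. The function name `flag_outliers`, the example URLs, and the property values (say, the number of out-links per page) are all hypothetical.

```python
from statistics import median


def flag_outliers(property_by_page, threshold=3.5):
    """Return the pages whose property value is a robust outlier.

    property_by_page maps a page identifier to one numeric property
    (hypothetical example: number of out-links on the page).
    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(property_by_page.values())
    if not values:
        return []

    center = median(values)
    # Median absolute deviation: a spread estimate that a few
    # extreme pages cannot distort much.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # At least half the values equal the median; the score is
        # undefined, so this sketch flags nothing.
        return []

    flagged = []
    for page, value in property_by_page.items():
        score = 0.6745 * abs(value - center) / mad
        if score > threshold:
            flagged.append(page)
    return flagged


# Hypothetical data: seven ordinary pages and one extreme page.
pages = {
    "a.example/p1": 10,
    "a.example/p2": 12,
    "b.example/p3": 11,
    "b.example/p4": 13,
    "c.example/p5": 12,
    "c.example/p6": 11,
    "d.example/p7": 10,
    "spam.example/p8": 500,
}
print(flag_outliers(pages))
```

A flagged page is only a candidate: the abstract claims that outliers are highly likely to be spam, not that every outlier is. Heavy-tailed properties such as link counts would likely need a test fitted to the observed distribution rather than this generic one.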

Published in
WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
June 2004
100 pages
ISBN: 9781450377881
DOI: 10.1145/1017074
Conference Chairs: Luis Gravano, Sihem Amer-Yahia
Copyright © 2004 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
statistical properties of web pages
web characterization
web spam
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 30 of 100 submissions, 30%