Article

Relaxed online SVMs for spam filtering

Authors:
D. Sculley

Tufts University, Medford, MA

Tufts University, Medford, MA
View Profile

,
Gabriel M. Wachman

Tufts University, Medford, MA

Tufts University, Medford, MA
View Profile

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrievalJuly 2007Pages 415–422https://doi.org/10.1145/1277741.1277813

Published:23 July 2007Publication History

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 415–422

ABSTRACT

Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.

References

A. Bratko and B. Filipic. Spam filtering using compression models. Technical Report IJS-DP-9227, Department of Intelligent Systems, Jozef Stefan Institute, L jubljana, Slovenia, 2005.Google Scholar
G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In NIPS, pages 409--415, 2000.Google Scholar
G. V. Cormack. TREC 2006 spam track overview. In To appear in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings, 2006.Google Scholar
G. V. Cormack and A. Bratko. Batch and on-line spam filter comparison. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.Google Scholar
G. V. Cormack and T. R. Lynam. TREC 2005 spam track overview. In The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005.Google Scholar
G. V. Cormack and T. R. Lynam. On-line supervised spam filter evaluation. Technical report, David R. Cheriton School of Computer Science, University of Waterloo, Canada, February 2006.Google Scholar
N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines. Cambridge University Press, 2000. Google ScholarDigital Library
D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 345--349, 2000. Google ScholarDigital Library
H. Drucker, V. Vapnik, and D. Wu. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048--1054, 1999. Google ScholarDigital Library
J. Goodman and W. Yin. Online discriminative spam filter training. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.Google Scholar
P. Graham. A plan for spam. 2002.Google Scholar
P. Graham. Better bayesian filtering. 2003.Google Scholar
Z. Gyongi and H. Garcia-Molina. Spam: It's not just for inboxes anymore. Computer, 38(10):28--34, 2005. Google ScholarDigital Library
T. Joachims. Text categorization with suport vector machines: Learning with many relevant features. In ECML '98: Proceedings of the 10th European Conference on Machine Learning, pages 137--142, 1998. Google ScholarDigital Library
T. Joachims. Training linear svms in linear time. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217--226, 2006. Google ScholarDigital Library
J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In Advances in Neural Information Processing Systems 14, pages 785--793. MIT Press, 2002.Google Scholar
P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.Google Scholar
W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20(11):745--752, 1987.Google ScholarCross Ref
T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006. Google ScholarDigital Library
V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? Third Conference on Email and Anti-Spam (CEAS), 2006.Google Scholar
G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.Google Scholar
J. Platt. Sequenital minimal optimization: A fast algorithm for training support vector machines. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.Google Scholar
B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001. Google ScholarDigital Library
G. L. Wittel and S. F. Wu. On attacking statistical spam filters. CEAS: First Conference on Email and Anti-Spam, 2004.Google Scholar

Index Terms

Relaxed online SVMs for spam filtering
1. Information systems
  1. Information retrieval

Recommendations

A study of spam filtering using support vector machines

Electronic mail is a major revolution taking place over traditional communication systems due to its convenient, economical, fast, and easy to use nature. A major bottleneck in electronic communications is the enormous dissemination of unwanted, harmful ...
Read More
Spam filtering in twitter using sender-receiver relationship
RAID'11: Proceedings of the 14th international conference on Recent Advances in Intrusion Detection

Twitter is one of the most visited sites in these days. Twitter spam, however, is constantly increasing. Since Twitter spam is different from traditional spam such as email and blog spam, conventional spam filtering methods are inappropriate to detect ...
Read More
Spam Filtering With Dynamically Updated URL Statistics

Many URL-based spam filters rely on "white" and "black" lists to classify email. The authors' proposed URL-based spam filter instead analyzes URL statistics to dynamically calculate the probabilities of whether email with specific URLs are spam or ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
General Chairs:
Wessel Kraaij
TNO, The Netherlands
,
Arjen P. de Vries
CWI, The Netherlands
,
Program Chairs:
Charles L. A. Clarke
University of Waterloo, Canada
,
Norbert Fuhr
University of Duisburg-Essen, Germany
,
Noriko Kando
National Institute of Informatics, Japan
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 July 2007
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
blogs
spam filtering
splogs
support vector machines
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 148
  Total Citations
  View Citations
- 1,868
  Total Downloads
- Downloads (Last 12 months)22
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Relaxed online SVMs for spam filtering

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

A study of spam filtering using support vector machines

Spam filtering in twitter using sender-receiver relationship

Spam Filtering With Dynamically Updated URL Statistics