ABSTRACT
Spam is a key problem in electronic communication, affecting both large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high computational cost of SVMs as a reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs do give state-of-the-art classification performance for online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.
Index Terms
- Relaxed online SVMs for spam filtering