skip to main content
10.1145/1277741.1277813acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Relaxed online SVMs for spam filtering

Published:23 July 2007Publication History

ABSTRACT

Spam is a key problem in electronic communication, including large-scale email systems and the growing number of blogs. Content-based filtering is one reliable method of combating this threat in its various forms, but some academic researchers and industrial practitioners disagree on how best to filter spam. The former have advocated the use of Support Vector Machines (SVMs) for content-based filtering, as this machine learning methodology gives state-of-the-art performance for text classification. However, similar performance gains have yet to be demonstrated for online spam filtering. Additionally, practitioners cite the high cost of SVMs as reason to prefer faster (if less statistically robust) Bayesian methods. In this paper, we offer a resolution to this controversy. First, we show that online SVMs indeed give state-of-the-art classification performance on online spam filtering on large benchmark data sets. Second, we show that nearly equivalent performance may be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Our results are experimentally verified on email spam, blog spam, and splog detection tasks.

References

  1. A. Bratko and B. Filipic. Spam filtering using compression models. Technical Report IJS-DP-9227, Department of Intelligent Systems, Jozef Stefan Institute, L jubljana, Slovenia, 2005.Google ScholarGoogle Scholar
  2. G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine learning. In NIPS, pages 409--415, 2000.Google ScholarGoogle Scholar
  3. G. V. Cormack. TREC 2006 spam track overview. In To appear in: The Fifteenth Text REtrieval Conference (TREC 2006) Proceedings, 2006.Google ScholarGoogle Scholar
  4. G. V. Cormack and A. Bratko. Batch and on-line spam filter comparison. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.Google ScholarGoogle Scholar
  5. G. V. Cormack and T. R. Lynam. TREC 2005 spam track overview. In The Fourteenth Text REtrieval Conference (TREC 2005) Proceedings, 2005.Google ScholarGoogle Scholar
  6. G. V. Cormack and T. R. Lynam. On-line supervised spam filter evaluation. Technical report, David R. Cheriton School of Computer Science, University of Waterloo, Canada, February 2006.Google ScholarGoogle Scholar
  7. N. Cristianini and J. Shawe-Taylor. An introduction to support vector machines. Cambridge University Press, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In KDD '00: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 345--349, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. H. Drucker, V. Vapnik, and D. Wu. Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5):1048--1054, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Goodman and W. Yin. Online discriminative spam filter training. In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), 2006.Google ScholarGoogle Scholar
  11. P. Graham. A plan for spam. 2002.Google ScholarGoogle Scholar
  12. P. Graham. Better bayesian filtering. 2003.Google ScholarGoogle Scholar
  13. Z. Gyongi and H. Garcia-Molina. Spam: It's not just for inboxes anymore. Computer, 38(10):28--34, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. T. Joachims. Text categorization with suport vector machines: Learning with many relevant features. In ECML '98: Proceedings of the 10th European Conference on Machine Learning, pages 137--142, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. Joachims. Training linear svms in linear time. In KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217--226, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In Advances in Neural Information Processing Systems 14, pages 785--793. MIT Press, 2002.Google ScholarGoogle Scholar
  17. P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.Google ScholarGoogle Scholar
  18. W. Krauth and M. Mézard. Learning algorithms with optimal stability in neural networks. Journal of Physics A, 20(11):745--752, 1987.Google ScholarGoogle ScholarCross RefCross Ref
  19. T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive bayes - which naive bayes? Third Conference on Email and Anti-Spam (CEAS), 2006.Google ScholarGoogle Scholar
  21. G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), May 2005.Google ScholarGoogle Scholar
  22. J. Platt. Sequenital minimal optimization: A fast algorithm for training support vector machines. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.Google ScholarGoogle Scholar
  23. B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. L. Wittel and S. F. Wu. On attacking statistical spam filters. CEAS: First Conference on Email and Anti-Spam, 2004.Google ScholarGoogle Scholar

Index Terms

  1. Relaxed online SVMs for spam filtering

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2007
      946 pages
      ISBN:9781595935977
      DOI:10.1145/1277741

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 July 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate792of3,983submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader