skip to main content
10.1145/2882903.2915211acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Local Similarity Search for Unstructured Text

Published:26 June 2016Publication History

ABSTRACT

With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop the techniques to support the large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.

References

  1. P. Agrawal, A. Arasu, and R. Kaushik. On indexing error-tolerant set containment. In SIGMOD Conference, pages 927--938, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Agrawal, K. Chakrabarti, S. Chaudhuri, and V. Ganti. Scalable ad-hoc entity extraction from text collections. PVLDB, 1(1):945--957, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Probabilistic models to reconcile complex data from inaccurate data sources. In CAiSE, pages 83--97, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In SIGMOD Conference, pages 398--409, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Z. Broder. On the resemblance and containment of documents. In SEQS, pages 21--29, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8--13):1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. K. Chakrabarti, S. Chaudhuri, V. Ganti, and D. Xin. An efficient filter for approximate membership checking. In SIGMOD Conference, pages 805--818, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5--16, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Chowdhury, O. Frieder, D. A. Grossman, and M. C. McCabe. Collection statistics for fast duplicate document detection. ACM Trans. Inf. Syst., 20(2):171--191, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD Conference, pages 673--684, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Deng, G. Li, J. Feng, Y. Duan, and Z. Gong. A unified framework for approximate dictionary-based entity extraction. VLDB J., 24(1):143--167, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. X. Dong, L. Berti-Equille, Y. Hu, and D. Srivastava. Global detection of complex copying relationships between sources. PVLDB, 3(1):1358--1369, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. X. L. Dong, L. Berti-Equille, and D. Srivastava. Truth discovery and copying detection in a dynamic world. PVLDB, 2(1):562--573, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518--529, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In ICDE, pages 267--276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. O. A. Hamid, B. Behzadi, S. Christoph, and M. R. Henzinger. Detecting the origin of text segments efficiently. In WWW, pages 61--70, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. W. Kim, K. S. Candan, and J. Tatemura. Efficient overlap and content reuse detection in blogs and online news articles. In WWW, pages 81--90, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. G. Li, D. Deng, J. Wang, and J. Feng. Pass-Join: A partition-based method for similarity joins. PVLDB, 5(1):253--264, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Li, J. He, D. Deng, and J. Li. Efficient similarity join and search on multi-attribute data. In SIGMOD Conference, pages 1137--1151, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. X. Li, X. L. Dong, K. B. Lyons, W. Meng, and D. Srivastava. Scaling up copy detection. In ICDE, pages 89--100, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  25. U. Manber. Finding similar files in a large file system. In USENIX Winter, pages 1--10, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Qin, W. Wang, C. Xiao, Y. Lu, X. Lin, and H. Wang. Asymmetric signature schemes for efficient exact edit similarity query processing. ACM Trans. Database Syst., 38(3):16, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In SIGMOD Conference, pages 76--85, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. Seo and W. B. Croft. Local text reuse detection. In SIGIR, pages 571--578, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Y. Sun, J. Qin, and W. Wang. Near duplicate text detection using frequency-biased signatures. In WISE, pages 277--291, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  32. M. Theobald, J. Siddharth, and A. Paepcke. Spotsigs: robust and efficient near duplicate detection in large web collections. In SIGIR, pages 563--570, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. X. Yang, Y. Wang, B. Wang, and W. Wang. Local filtering: Improving the performance of approximate queries on string collections. In SIGMOD Conference, pages 377--392, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Q. Zhang, Y. Wu, Z. Ding, and X. Huang. Learning hash codes for efficient content reuse detection. In SIGIR, pages 405--414, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Local Similarity Search for Unstructured Text

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
      June 2016
      2300 pages
      ISBN:9781450335317
      DOI:10.1145/2882903

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 June 2016

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader