skip to main content
10.1145/3178876.3186111acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article
Free Access

Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples

Authors Info & Claims
Published:23 April 2018Publication History

ABSTRACT

The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. Estimating the degree distribution of real-world graphs poses a significant challenge, due to their heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the h-index and the z-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10%$ of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.

References

  1. D. Achlioptas, A. Clauset, D. Kempe, and C. Moore. 2009. On the bias of traceroute sampling: Or, power-law degree distributions in regular graphs. J. ACM Vol. 56, 4 (2009). Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N.K. Ahmed, J. Neville, and R. Kompella. 2010. Reconsidering the Foundations of Network Sampling. In WIN 10.Google ScholarGoogle Scholar
  3. N. Ahmed, J. Neville, and R. Kompella. 2012. Space-Efficient Sampling from Social Activity Streams SIGKDD BigMine. 1--8.Google ScholarGoogle Scholar
  4. Nesreen K Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014 a. Graph sample and hold: A framework for big-graph analytics SIGKDD. ACM, ACM, 1446--1455. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Nesreen K Ahmed, Jennifer Neville, and Ramana Kompella. 2014 b. Network sampling: From static to streaming graphs. TKDD Vol. 8, 2 (2014), 7. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Sinan G. Aksoy, Tamara G. Kolda, and Ali Pinar. 2017. Measuring and modeling bipartite graphs with community structure. Journal of Complex Networks (2017). to appear.Google ScholarGoogle Scholar
  7. Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science Vol. 286 (Oct. 1999), 509--512.Google ScholarGoogle Scholar
  8. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. 2000. Graph structure in the web. Computer Networks Vol. 33 (2000), 309--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Deepayan Chakrabarti and Christos Faloutsos. 2006. Graph Mining: Laws, Generators, and Algorithms. Comput. Surveys Vol. 38, 1 (2006). Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. F. Chierichetti, A. Dasgupta, R. Kumar, S. Lattanzi, and T. Sarlos. 2016. On Sampling Nodes in a Network. In Conference on the World Wide Web (WWW). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Clauset and C. Moore. 2005. Accuracy and scaling phenomena in internet mapping. Phys. Rev. Lett. Vol. 94 (2005), 018701.Google ScholarGoogle ScholarCross RefCross Ref
  12. A. Clauset, C. R. Shalizi, and M. E. J. Newman. 2009. Power-Law Distributions in Empirical Data. SIAM Rev. Vol. 51, 4 (2009), 661--703. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. R. Cohen, K. Erez, D. ben Avraham, and S. Havlin. 2000. Resilience of the Internet to Random Breakdowns. Phys. Rev. Lett. Vol. 85, 4626--8 (2000).Google ScholarGoogle ScholarCross RefCross Ref
  14. A. Dasgupta, R. Kumar, and T. Sarlos. 2014. On estimating the average degree. In Conference on the World Wide Web (WWW). 795--806. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. Dubhashi and A. Panconesi. 2012. Concentration of Measure for the Analysis of Randomised Algorithms. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. N. Durak, T.G. Kolda, A. Pinar, and C. Seshadhri. 2013. A scalable null model for directed graphs matching all degree distributions: In, out, and reciprocal. In Network Science Workshop (NSW), 2013 IEEE 2nd. 23--30.Google ScholarGoogle Scholar
  17. Peter Ebbes, Zan Huang, Arvind Rangaswamy, Hari P Thadakamalla, and ORGB Unit. 2008. Sampling large-scale social networks: Insights from simulated networks 18th Annual Workshop on Information Technologies and Systems, Paris, France.Google ScholarGoogle Scholar
  18. Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri. 2017 a. Provable and practical approximations for the degree distribution using sublinear graph samples. CoRR Vol. abs/1710.08607 (2017). {arxiv}1710.08607 http://arxiv.org/abs/1710.08607Google ScholarGoogle Scholar
  19. T. Eden, A. Levi, D. Ron, and C. Seshadhri. 2015. Approximately Counting Triangles in Sublinear Time Foundations of Computer Science (FOCS), GRS11 (Ed.). 614--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. T. Eden, D. Ron, and C. Seshadhri. 2017 b. Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection. In International Colloquium on Automata, Languages, and Programming (ICALP), GRS11 (Ed.). 614--633.Google ScholarGoogle Scholar
  21. M. Faloutsos, P. Faloutsos, and C. Faloutsos. 1999. On power-law relationships of the internet topology SIGCOMM. 251--262. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. U. Feige. 2006. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. Vol. 35, 4 (2006), 964--984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. O. Goldreich and D. Ron. 2002. Property Testing in Bounded Degree Graphs. Algorithmica (2002), 302--343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. O. Goldreich and D. Ron. 2008. Approximating average parameters of graphs. Random Structures and Algorithms Vol. 32, 4 (2008), 473--493. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Gonen, D. Ron, and Y. Shavitt. 2011. Counting stars and other small subgraphs in sublinear-time. SIAM Journal on Discrete Math Vol. 25, 3 (2011), 1365--1411.Google ScholarGoogle ScholarCross RefCross Ref
  26. Mira Gonen, Dana Ron, Udi Weinsberg, and Avishai Wool. 2008. Finding a dense-core in Jellyfish graphs. Computer Networks Vol. 52, 15 (2008), 2831--2841. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. J. E. Hirsch. 2005. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences Vol. 102, 46 (2005), 16569--16572.Google ScholarGoogle ScholarCross RefCross Ref
  28. A. Lakhina, J. Byers, M. Crovella, and P. Xie.. 2003. Sampling biases in IP topology measurements. In Proceedings of INFOCOMM, Vol. Vol. 1. 332--341.Google ScholarGoogle Scholar
  29. Sang Hoon Lee, Pan-Jun Kim, and Hawoong Jeong. 2006. Statistical properties of sampled networks. Physical Review E Vol. 73, 1 (2006), 016102.Google ScholarGoogle ScholarCross RefCross Ref
  30. Jure Leskovec. 2015. SNAP Stanford Network Analysis Project. http://snap.standord.edu. (2015).Google ScholarGoogle Scholar
  31. Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Knowledge Data and Discovery (KDD). ACM, 631--636. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. A. S. Maiya and T. Y. Berger-Wolf. 2011. Benefits of Bias: Towards Better Characterization of Network Sampling, In Knowledge Data and Discovery (KDD). ArXiv e-prints, 105--113. {arxiv}1109.3911 Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Andrew McGregor. 2014. Graph stream algorithms: A survey. SIGMOD Vol. 43, 1 (2014), 9--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Mitzenmacher. 2003. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics Vol. 1, 2 (2003), 226--251.Google ScholarGoogle ScholarCross RefCross Ref
  35. M. E. J. Newman. 2003. The Structure and Function of Complex Networks. SIAM Rev. Vol. 45, 2 (2003), 167--256.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. E. J. Newman, S. Strogatz, and D. Watts. 2001. Random graphs with arbitrary degree distributions and their applications. Physical Review E Vol. 64 (2001), 026118.Google ScholarGoogle ScholarCross RefCross Ref
  37. D. Pennock, G. Flake, S. Lawrence, E. Glover, and C. L. Giles. 2002. Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences Vol. 99, 8 (2002), 5207--5211.Google ScholarGoogle ScholarCross RefCross Ref
  38. T. Petermann and P. Rios. 2004. Exploration of scale-free networks. European Physical Journal B Vol. 38 (2004), 201--204.Google ScholarGoogle ScholarCross RefCross Ref
  39. Ali Pinar, Sucheta Soundarajan, Tina Eliassi-Rad, and Brian Gallagher. 2015. MaxOutProbe: An Algorithm for Increasing the Size of Partially Observed Networks. Technical Report. Sandia National Laboratories (SNL-CA), Livermore, CA (United States).Google ScholarGoogle Scholar
  40. Bruno Ribeiro and Don Towsley. 2012. On the estimation accuracy of degree distributions from graph sampling Annual Conference on Decision and Control (CDC). IEEE, 5240--5247.Google ScholarGoogle Scholar
  41. Dana Ron. 2010. Algorithmic and Analysis Techniques in Property Testing. Foundations and Trends in Theoretical Computer Science Vol. 5, 2 (2010), 73--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Dana Ron and Gilad Tsur. 2016. The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling. ACM Transactions on Computation Theory Vol. 8, 4 (2016), 15:1--15:19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. C. Seshadhri, Tamara G. Kolda, and Ali Pinar. 2012. Community structure and scale-free collections of Erdös-Rényi graphs. Physical Review E Vol. 85, 5 (May. 2012), 056109.Google ScholarGoogle ScholarCross RefCross Ref
  44. Olivia Simpson, C Seshadhri, and Andrew McGregor. 2015. Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution. In International Conference on Data Mining (ICDM). IEEE, 979--984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Michael PH Stumpf and Carsten Wiuf. 2005. Sampling properties of random graphs: the degree distribution. Physical Review E Vol. 72, 3 (2005), 036118.Google ScholarGoogle ScholarCross RefCross Ref
  46. Yaonan Zhang, Eric D Kolaczyk, and Bruce D Spencer. 2015. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. The Annals of Applied Statistics Vol. 9, 1 (2015), 166--199.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        WWW '18: Proceedings of the 2018 World Wide Web Conference
        April 2018
        2000 pages
        ISBN:9781450356398

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        International World Wide Web Conferences Steering Committee

        Republic and Canton of Geneva, Switzerland

        Publication History

        • Published: 23 April 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        WWW '18 Paper Acceptance Rate170of1,155submissions,15%Overall Acceptance Rate1,899of8,196submissions,23%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format .

      View HTML Format