ABSTRACT
The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. Estimating the degree distribution of real-world graphs poses a significant challenge, due to their heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the h-index and the z-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10%$ of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.
- D. Achlioptas, A. Clauset, D. Kempe, and C. Moore. 2009. On the bias of traceroute sampling: Or, power-law degree distributions in regular graphs. J. ACM Vol. 56, 4 (2009). Google ScholarDigital Library
- N.K. Ahmed, J. Neville, and R. Kompella. 2010. Reconsidering the Foundations of Network Sampling. In WIN 10.Google Scholar
- N. Ahmed, J. Neville, and R. Kompella. 2012. Space-Efficient Sampling from Social Activity Streams SIGKDD BigMine. 1--8.Google Scholar
- Nesreen K Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014 a. Graph sample and hold: A framework for big-graph analytics SIGKDD. ACM, ACM, 1446--1455. Google ScholarDigital Library
- Nesreen K Ahmed, Jennifer Neville, and Ramana Kompella. 2014 b. Network sampling: From static to streaming graphs. TKDD Vol. 8, 2 (2014), 7. Google ScholarDigital Library
- Sinan G. Aksoy, Tamara G. Kolda, and Ali Pinar. 2017. Measuring and modeling bipartite graphs with community structure. Journal of Complex Networks (2017). to appear.Google Scholar
- Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science Vol. 286 (Oct. 1999), 509--512.Google Scholar
- A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. 2000. Graph structure in the web. Computer Networks Vol. 33 (2000), 309--320. Google ScholarDigital Library
- Deepayan Chakrabarti and Christos Faloutsos. 2006. Graph Mining: Laws, Generators, and Algorithms. Comput. Surveys Vol. 38, 1 (2006). Google ScholarDigital Library
- F. Chierichetti, A. Dasgupta, R. Kumar, S. Lattanzi, and T. Sarlos. 2016. On Sampling Nodes in a Network. In Conference on the World Wide Web (WWW). Google ScholarDigital Library
- A. Clauset and C. Moore. 2005. Accuracy and scaling phenomena in internet mapping. Phys. Rev. Lett. Vol. 94 (2005), 018701.Google ScholarCross Ref
- A. Clauset, C. R. Shalizi, and M. E. J. Newman. 2009. Power-Law Distributions in Empirical Data. SIAM Rev. Vol. 51, 4 (2009), 661--703. Google ScholarDigital Library
- R. Cohen, K. Erez, D. ben Avraham, and S. Havlin. 2000. Resilience of the Internet to Random Breakdowns. Phys. Rev. Lett. Vol. 85, 4626--8 (2000).Google ScholarCross Ref
- A. Dasgupta, R. Kumar, and T. Sarlos. 2014. On estimating the average degree. In Conference on the World Wide Web (WWW). 795--806. Google ScholarDigital Library
- D. Dubhashi and A. Panconesi. 2012. Concentration of Measure for the Analysis of Randomised Algorithms. Cambridge University Press. Google ScholarDigital Library
- N. Durak, T.G. Kolda, A. Pinar, and C. Seshadhri. 2013. A scalable null model for directed graphs matching all degree distributions: In, out, and reciprocal. In Network Science Workshop (NSW), 2013 IEEE 2nd. 23--30.Google Scholar
- Peter Ebbes, Zan Huang, Arvind Rangaswamy, Hari P Thadakamalla, and ORGB Unit. 2008. Sampling large-scale social networks: Insights from simulated networks 18th Annual Workshop on Information Technologies and Systems, Paris, France.Google Scholar
- Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri. 2017 a. Provable and practical approximations for the degree distribution using sublinear graph samples. CoRR Vol. abs/1710.08607 (2017). {arxiv}1710.08607 http://arxiv.org/abs/1710.08607Google Scholar
- T. Eden, A. Levi, D. Ron, and C. Seshadhri. 2015. Approximately Counting Triangles in Sublinear Time Foundations of Computer Science (FOCS), GRS11 (Ed.). 614--633. Google ScholarDigital Library
- T. Eden, D. Ron, and C. Seshadhri. 2017 b. Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection. In International Colloquium on Automata, Languages, and Programming (ICALP), GRS11 (Ed.). 614--633.Google Scholar
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. 1999. On power-law relationships of the internet topology SIGCOMM. 251--262. Google ScholarDigital Library
- U. Feige. 2006. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. Vol. 35, 4 (2006), 964--984. Google ScholarDigital Library
- O. Goldreich and D. Ron. 2002. Property Testing in Bounded Degree Graphs. Algorithmica (2002), 302--343. Google ScholarDigital Library
- O. Goldreich and D. Ron. 2008. Approximating average parameters of graphs. Random Structures and Algorithms Vol. 32, 4 (2008), 473--493. Google ScholarDigital Library
- M. Gonen, D. Ron, and Y. Shavitt. 2011. Counting stars and other small subgraphs in sublinear-time. SIAM Journal on Discrete Math Vol. 25, 3 (2011), 1365--1411.Google ScholarCross Ref
- Mira Gonen, Dana Ron, Udi Weinsberg, and Avishai Wool. 2008. Finding a dense-core in Jellyfish graphs. Computer Networks Vol. 52, 15 (2008), 2831--2841. Google ScholarDigital Library
- J. E. Hirsch. 2005. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences Vol. 102, 46 (2005), 16569--16572.Google ScholarCross Ref
- A. Lakhina, J. Byers, M. Crovella, and P. Xie.. 2003. Sampling biases in IP topology measurements. In Proceedings of INFOCOMM, Vol. Vol. 1. 332--341.Google Scholar
- Sang Hoon Lee, Pan-Jun Kim, and Hawoong Jeong. 2006. Statistical properties of sampled networks. Physical Review E Vol. 73, 1 (2006), 016102.Google ScholarCross Ref
- Jure Leskovec. 2015. SNAP Stanford Network Analysis Project. http://snap.standord.edu. (2015).Google Scholar
- Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Knowledge Data and Discovery (KDD). ACM, 631--636. Google ScholarDigital Library
- A. S. Maiya and T. Y. Berger-Wolf. 2011. Benefits of Bias: Towards Better Characterization of Network Sampling, In Knowledge Data and Discovery (KDD). ArXiv e-prints, 105--113. {arxiv}1109.3911 Google ScholarDigital Library
- Andrew McGregor. 2014. Graph stream algorithms: A survey. SIGMOD Vol. 43, 1 (2014), 9--20. Google ScholarDigital Library
- M. Mitzenmacher. 2003. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics Vol. 1, 2 (2003), 226--251.Google ScholarCross Ref
- M. E. J. Newman. 2003. The Structure and Function of Complex Networks. SIAM Rev. Vol. 45, 2 (2003), 167--256.Google ScholarDigital Library
- M. E. J. Newman, S. Strogatz, and D. Watts. 2001. Random graphs with arbitrary degree distributions and their applications. Physical Review E Vol. 64 (2001), 026118.Google ScholarCross Ref
- D. Pennock, G. Flake, S. Lawrence, E. Glover, and C. L. Giles. 2002. Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences Vol. 99, 8 (2002), 5207--5211.Google ScholarCross Ref
- T. Petermann and P. Rios. 2004. Exploration of scale-free networks. European Physical Journal B Vol. 38 (2004), 201--204.Google ScholarCross Ref
- Ali Pinar, Sucheta Soundarajan, Tina Eliassi-Rad, and Brian Gallagher. 2015. MaxOutProbe: An Algorithm for Increasing the Size of Partially Observed Networks. Technical Report. Sandia National Laboratories (SNL-CA), Livermore, CA (United States).Google Scholar
- Bruno Ribeiro and Don Towsley. 2012. On the estimation accuracy of degree distributions from graph sampling Annual Conference on Decision and Control (CDC). IEEE, 5240--5247.Google Scholar
- Dana Ron. 2010. Algorithmic and Analysis Techniques in Property Testing. Foundations and Trends in Theoretical Computer Science Vol. 5, 2 (2010), 73--205. Google ScholarDigital Library
- Dana Ron and Gilad Tsur. 2016. The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling. ACM Transactions on Computation Theory Vol. 8, 4 (2016), 15:1--15:19. Google ScholarDigital Library
- C. Seshadhri, Tamara G. Kolda, and Ali Pinar. 2012. Community structure and scale-free collections of Erdös-Rényi graphs. Physical Review E Vol. 85, 5 (May. 2012), 056109.Google ScholarCross Ref
- Olivia Simpson, C Seshadhri, and Andrew McGregor. 2015. Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution. In International Conference on Data Mining (ICDM). IEEE, 979--984. Google ScholarDigital Library
- Michael PH Stumpf and Carsten Wiuf. 2005. Sampling properties of random graphs: the degree distribution. Physical Review E Vol. 72, 3 (2005), 036118.Google ScholarCross Ref
- Yaonan Zhang, Eric D Kolaczyk, and Bruce D Spencer. 2015. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. The Annals of Applied Statistics Vol. 9, 1 (2015), 166--199.Google ScholarCross Ref
Index Terms
- Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples
Recommendations
Sublinear Time Estimation of Degree Distribution Moments: The Arboricity Connection
We revisit the classic problem of estimating the moments of the degree distribution of an undirected simple graph. Consider an undirected simple graph $G=(V,E)$ with $n$ (nonisolated) vertices, and define (for $s > 0$) $M_s= \sum_{v \in V} d^s_v$. Our aim ...
Constructing and sampling graphs with a prescribed joint degree distribution
One of the most influential recent results in network analysis is that many natural networks exhibit a power-law or log-normal degree distribution. This has inspired numerous generative models that match this property. However, more recent work has ...
Relationship between degree-rank function and degree distribution of protein-protein interaction networks
It is argued that both the degree-rank function r=f(d), which describes the relationship between the degree d and the rank r of a degree sequence, and the degree distribution P(k), which describes the probability that a randomly chosen vertex has degree ...
Comments