research-article

Free Access

Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples

Authors:
Talya Eden

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
Shweta Jain

University of California, Santa Cruz, Santa Cruz, CA, USA

University of California, Santa Cruz, Santa Cruz, CA, USA
View Profile

,
Ali Pinar

Sandia National Laboratories, Livermore, CA, USA

Sandia National Laboratories, Livermore, CA, USA
View Profile

,
Dana Ron

Tel Aviv University, Tel Aviv, Israel

Tel Aviv University, Tel Aviv, Israel
View Profile

,
C. Seshadhri

University of California, Santa Cruz, Santa Cruz, CA, USA

University of California, Santa Cruz, Santa Cruz, CA, USA
View Profile

WWW '18: Proceedings of the 2018 World Wide Web ConferenceApril 2018Pages 449–458https://doi.org/10.1145/3178876.3186111

Published:23 April 2018Publication History

WWW '18: Proceedings of the 2018 World Wide Web Conference

Pages 449–458

ABSTRACT

The degree distribution is one of the most fundamental properties used in the analysis of massive graphs. There is a large literature on graph sampling, where the goal is to estimate properties (especially the degree distribution) of a large graph through a small, random sample. Estimating the degree distribution of real-world graphs poses a significant challenge, due to their heavy-tailed nature and the large variance in degrees. We design a new algorithm, SADDLES, for this problem, using recent mathematical techniques from the field of sublinear algorithms. The SADDLES algorithm gives provably accurate outputs for all values of the degree distribution. For the analysis, we define two fatness measures of the degree distribution, called the h-index and the z-index. We prove that SADDLES is sublinear in the graph size when these indices are large. A corollary of this result is a provably sublinear algorithm for any degree distribution bounded below by a power law. We deploy our new algorithm on a variety of real datasets and demonstrate its excellent empirical behavior. In all instances, we get extremely accurate approximations for all values in the degree distribution by observing at most $1%$ of the vertices. This is a major improvement over the state-of-the-art sampling algorithms, which typically sample more than $10%$ of the vertices to give comparable results. We also observe that the h and z-indices of real graphs are large, validating our theoretical analysis.

References

D. Achlioptas, A. Clauset, D. Kempe, and C. Moore. 2009. On the bias of traceroute sampling: Or, power-law degree distributions in regular graphs. J. ACM Vol. 56, 4 (2009). Google ScholarDigital Library
N.K. Ahmed, J. Neville, and R. Kompella. 2010. Reconsidering the Foundations of Network Sampling. In WIN 10.Google Scholar
N. Ahmed, J. Neville, and R. Kompella. 2012. Space-Efficient Sampling from Social Activity Streams SIGKDD BigMine. 1--8.Google Scholar
Nesreen K Ahmed, Nick Duffield, Jennifer Neville, and Ramana Kompella. 2014 a. Graph sample and hold: A framework for big-graph analytics SIGKDD. ACM, ACM, 1446--1455. Google ScholarDigital Library
Nesreen K Ahmed, Jennifer Neville, and Ramana Kompella. 2014 b. Network sampling: From static to streaming graphs. TKDD Vol. 8, 2 (2014), 7. Google ScholarDigital Library
Sinan G. Aksoy, Tamara G. Kolda, and Ali Pinar. 2017. Measuring and modeling bipartite graphs with community structure. Journal of Complex Networks (2017). to appear.Google Scholar
Albert-László Barabási and Réka Albert. 1999. Emergence of Scaling in Random Networks. Science Vol. 286 (Oct. 1999), 509--512.Google Scholar
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. 2000. Graph structure in the web. Computer Networks Vol. 33 (2000), 309--320. Google ScholarDigital Library
Deepayan Chakrabarti and Christos Faloutsos. 2006. Graph Mining: Laws, Generators, and Algorithms. Comput. Surveys Vol. 38, 1 (2006). Google ScholarDigital Library
F. Chierichetti, A. Dasgupta, R. Kumar, S. Lattanzi, and T. Sarlos. 2016. On Sampling Nodes in a Network. In Conference on the World Wide Web (WWW). Google ScholarDigital Library
A. Clauset and C. Moore. 2005. Accuracy and scaling phenomena in internet mapping. Phys. Rev. Lett. Vol. 94 (2005), 018701.Google ScholarCross Ref
A. Clauset, C. R. Shalizi, and M. E. J. Newman. 2009. Power-Law Distributions in Empirical Data. SIAM Rev. Vol. 51, 4 (2009), 661--703. Google ScholarDigital Library
R. Cohen, K. Erez, D. ben Avraham, and S. Havlin. 2000. Resilience of the Internet to Random Breakdowns. Phys. Rev. Lett. Vol. 85, 4626--8 (2000).Google ScholarCross Ref
A. Dasgupta, R. Kumar, and T. Sarlos. 2014. On estimating the average degree. In Conference on the World Wide Web (WWW). 795--806. Google ScholarDigital Library
D. Dubhashi and A. Panconesi. 2012. Concentration of Measure for the Analysis of Randomised Algorithms. Cambridge University Press. Google ScholarDigital Library
N. Durak, T.G. Kolda, A. Pinar, and C. Seshadhri. 2013. A scalable null model for directed graphs matching all degree distributions: In, out, and reciprocal. In Network Science Workshop (NSW), 2013 IEEE 2nd. 23--30.Google Scholar
Peter Ebbes, Zan Huang, Arvind Rangaswamy, Hari P Thadakamalla, and ORGB Unit. 2008. Sampling large-scale social networks: Insights from simulated networks 18th Annual Workshop on Information Technologies and Systems, Paris, France.Google Scholar
Talya Eden, Shweta Jain, Ali Pinar, Dana Ron, and C. Seshadhri. 2017 a. Provable and practical approximations for the degree distribution using sublinear graph samples. CoRR Vol. abs/1710.08607 (2017). {arxiv}1710.08607 http://arxiv.org/abs/1710.08607Google Scholar
T. Eden, A. Levi, D. Ron, and C. Seshadhri. 2015. Approximately Counting Triangles in Sublinear Time Foundations of Computer Science (FOCS), GRS11 (Ed.). 614--633. Google ScholarDigital Library
T. Eden, D. Ron, and C. Seshadhri. 2017 b. Sublinear Time Estimation of Degree Distribution Moments: The Degeneracy Connection. In International Colloquium on Automata, Languages, and Programming (ICALP), GRS11 (Ed.). 614--633.Google Scholar
M. Faloutsos, P. Faloutsos, and C. Faloutsos. 1999. On power-law relationships of the internet topology SIGCOMM. 251--262. Google ScholarDigital Library
U. Feige. 2006. On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. Vol. 35, 4 (2006), 964--984. Google ScholarDigital Library
O. Goldreich and D. Ron. 2002. Property Testing in Bounded Degree Graphs. Algorithmica (2002), 302--343. Google ScholarDigital Library
O. Goldreich and D. Ron. 2008. Approximating average parameters of graphs. Random Structures and Algorithms Vol. 32, 4 (2008), 473--493. Google ScholarDigital Library
M. Gonen, D. Ron, and Y. Shavitt. 2011. Counting stars and other small subgraphs in sublinear-time. SIAM Journal on Discrete Math Vol. 25, 3 (2011), 1365--1411.Google ScholarCross Ref
Mira Gonen, Dana Ron, Udi Weinsberg, and Avishai Wool. 2008. Finding a dense-core in Jellyfish graphs. Computer Networks Vol. 52, 15 (2008), 2831--2841. Google ScholarDigital Library
J. E. Hirsch. 2005. An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences Vol. 102, 46 (2005), 16569--16572.Google ScholarCross Ref
A. Lakhina, J. Byers, M. Crovella, and P. Xie.. 2003. Sampling biases in IP topology measurements. In Proceedings of INFOCOMM, Vol. Vol. 1. 332--341.Google Scholar
Sang Hoon Lee, Pan-Jun Kim, and Hawoong Jeong. 2006. Statistical properties of sampled networks. Physical Review E Vol. 73, 1 (2006), 016102.Google ScholarCross Ref
Jure Leskovec. 2015. SNAP Stanford Network Analysis Project. http://snap.standord.edu. (2015).Google Scholar
Jure Leskovec and Christos Faloutsos. 2006. Sampling from large graphs. In Knowledge Data and Discovery (KDD). ACM, 631--636. Google ScholarDigital Library
A. S. Maiya and T. Y. Berger-Wolf. 2011. Benefits of Bias: Towards Better Characterization of Network Sampling, In Knowledge Data and Discovery (KDD). ArXiv e-prints, 105--113. {arxiv}1109.3911 Google ScholarDigital Library
Andrew McGregor. 2014. Graph stream algorithms: A survey. SIGMOD Vol. 43, 1 (2014), 9--20. Google ScholarDigital Library
M. Mitzenmacher. 2003. A Brief History of Generative Models for Power Law and Lognormal Distributions. Internet Mathematics Vol. 1, 2 (2003), 226--251.Google ScholarCross Ref
M. E. J. Newman. 2003. The Structure and Function of Complex Networks. SIAM Rev. Vol. 45, 2 (2003), 167--256.Google ScholarDigital Library
M. E. J. Newman, S. Strogatz, and D. Watts. 2001. Random graphs with arbitrary degree distributions and their applications. Physical Review E Vol. 64 (2001), 026118.Google ScholarCross Ref
D. Pennock, G. Flake, S. Lawrence, E. Glover, and C. L. Giles. 2002. Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences Vol. 99, 8 (2002), 5207--5211.Google ScholarCross Ref
T. Petermann and P. Rios. 2004. Exploration of scale-free networks. European Physical Journal B Vol. 38 (2004), 201--204.Google ScholarCross Ref
Ali Pinar, Sucheta Soundarajan, Tina Eliassi-Rad, and Brian Gallagher. 2015. MaxOutProbe: An Algorithm for Increasing the Size of Partially Observed Networks. Technical Report. Sandia National Laboratories (SNL-CA), Livermore, CA (United States).Google Scholar
Bruno Ribeiro and Don Towsley. 2012. On the estimation accuracy of degree distributions from graph sampling Annual Conference on Decision and Control (CDC). IEEE, 5240--5247.Google Scholar
Dana Ron. 2010. Algorithmic and Analysis Techniques in Property Testing. Foundations and Trends in Theoretical Computer Science Vol. 5, 2 (2010), 73--205. Google ScholarDigital Library
Dana Ron and Gilad Tsur. 2016. The Power of an Example: Hidden Set Size Approximation Using Group Queries and Conditional Sampling. ACM Transactions on Computation Theory Vol. 8, 4 (2016), 15:1--15:19. Google ScholarDigital Library
C. Seshadhri, Tamara G. Kolda, and Ali Pinar. 2012. Community structure and scale-free collections of Erdös-Rényi graphs. Physical Review E Vol. 85, 5 (May. 2012), 056109.Google ScholarCross Ref
Olivia Simpson, C Seshadhri, and Andrew McGregor. 2015. Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution. In International Conference on Data Mining (ICDM). IEEE, 979--984. Google ScholarDigital Library
Michael PH Stumpf and Carsten Wiuf. 2005. Sampling properties of random graphs: the degree distribution. Physical Review E Vol. 72, 3 (2005), 036118.Google ScholarCross Ref
Yaonan Zhang, Eric D Kolaczyk, and Bruce D Spencer. 2015. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. The Annals of Applied Statistics Vol. 9, 1 (2015), 166--199.Google ScholarCross Ref

Index Terms

Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples
1. Theory of computation
  1. Design and analysis of algorithms
    1. Graph algorithms analysis
  2. Theory and algorithms for application domains
    1. Algorithmic game theory and mechanism design
      1. Social networks

Recommendations

Sublinear Time Estimation of Degree Distribution Moments: The Arboricity Connection

We revisit the classic problem of estimating the moments of the degree distribution of an undirected simple graph. Consider an undirected simple graph $G=(V,E)$ with $n$ (nonisolated) vertices, and define (for $s > 0$) $M_s= \sum_{v \in V} d^s_v$. Our aim ...
Read More
Constructing and sampling graphs with a prescribed joint degree distribution

One of the most influential recent results in network analysis is that many natural networks exhibit a power-law or log-normal degree distribution. This has inspired numerous generative models that match this property. However, more recent work has ...
Read More
Relationship between degree-rank function and degree distribution of protein-protein interaction networks

It is argued that both the degree-rank function r=f(d), which describes the relationship between the degree d and the rank r of a degree sequence, and the degree distribution P(k), which describes the probability that a randomly chosen vertex has degree ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '18: Proceedings of the 2018 World Wide Web Conference
April 2018
2000 pages
ISBN:9781450356398
General Chairs:
Pierre-Antoine Champin
Universitè Claude Bernard Lyon 1, France
,
Fabien Gandon
Inria, Université Côte d'Azur, CNRS, I3S, France
,
Lionel Médini
Université Claude Bernard Lyon 1, France
,
Program Chairs:
Mounia Lalmas
Spotify, UK
,
Panagiotis G. Ipeirotis
New York University, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland
Publication History
- Published: 23 April 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
degree distribution
graphs
sampling
sublinear
Qualifiers
- research-article
Conference

Acceptance Rates
WWW '18 Paper Acceptance Rate170of1,155submissions,15%Overall Acceptance Rate1,899of8,196submissions,23%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 638
  Total Downloads
- Downloads (Last 12 months)85
- Downloads (Last 6 weeks)24
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Provable and Practical Approximations for the Degree Distribution using Sublinear Graph Samples

WWW '18: Proceedings of the 2018 World Wide Web Conference

ABSTRACT

References

Cited By

Index Terms

Recommendations

Sublinear Time Estimation of Degree Distribution Moments: The Arboricity Connection

Constructing and sampling graphs with a prescribed joint degree distribution

Relationship between degree-rank function and degree distribution of protein-protein interaction networks