Handbook of massive data sets
January 2002
Publisher: Kluwer Academic Publishers, 101 Philip Drive, Assinippi Park, Norwell, MA, United States
ISBN: 978-1-4020-0489-6
Published: 01 January 2002
Pages: 1252
Abstract

No abstract available.

chapter
Preface
pp .9–.10
chapter
Algorithmic aspects of information retrieval on the web
pp 3–23

The Web explosion offers a bonanza of novel problems. In particular, information retrieval in the Web context requires methods and ideas that have not been addressed in the classic information retrieval literature. This chapter will survey emerging ...

chapter
High-performance web crawling
pp 25–45

High-performance web crawlers are an important component of many web services. For example, search services use web crawlers to populate their indices, comparison shopping engines use them to collect product and pricing information from online vendors, ...

chapter
Internet growth: is there a "Moore's law" for data traffic?
pp 47–93

Internet traffic is approximately doubling each year. This growth rate applies not only to the entire Internet, but to a large range of individual institutions. For a few places we have records going back several years that exhibit this regular rate of ...

chapter
Random evolution in massive graphs
pp 97–122

Many massive graphs (such as WWW graphs and call graphs) share certain universal characteristics which can be described by the so-called "power law". In this paper, we first briefly survey the history and previous work on power law graphs. Then we ...

chapter
Property testing in massive graphs
pp 123–147

We consider the task of evaluating properties of graphs that are too big to be even scanned. Thus, the input graph is given in the form of an oracle which answers questions of the form is there an edge between vertices u and v, or who is the ith neighbor of ...

chapter
String pattern matching for a deluge survival kit
pp 151–194

String Pattern Matching concerns itself with algorithmic and combinatorial issues related to matching and searching on linearly arranged sequences of symbols, arguably the simplest possible discrete structures. As unprecedented volumes of sequence data ...

chapter
Searching large text collections
pp 195–243

In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most used index, but also review suffix arrays, which are useful in a number of specialized applications. We also ...

chapter
Data compression
pp 245–309

The exponential growth of computer applications in the last three decades of the 20th century has resulted in an explosive growth in the amounts of data moved between computers, collected, and stored by computer users. This, in turn, has created the ...

chapter
External memory data structures
pp 313–357

In many massive dataset applications the data must be stored in space and query efficient data structures on external storage devices. Often the data needs to be changed dynamically. In this chapter we discuss recent advances in the development of ...

chapter
External memory algorithms
pp 359–416

Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major ...

chapter
Data envelopment analysis (DEA) in massive data sets
pp 419–437

Data Envelopment Analysis (DEA) is a clustering methodology for records in data sets corresponding to entities sharing a common list of attributes. Broadly defined, DEA partitions the records into two subsets: those 'efficient' and those 'inefficient.' ...

chapter
Optimization methods in massive data sets
pp 439–471

We describe the role of generalized support vector machines in separating massive and complex data using arbitrary nonlinear kernels. Feature selection that improves generalization is implemented via an effective procedure that utilizes a polyhedral ...

chapter
Wavelets and multiscale transform in astronomical image processing
pp 473–500

With the requirements of scientific and medical image database support in mind, we describe a range of useful technologies for storage, transmission and display. These new technologies are all based on discrete wavelet or related multiscale transforms. ...

chapter
Clustering in massive data sets
pp 501–543

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far ...

chapter
Managing and analyzing massive data sets with data cubes
pp 547–578

Data cubes combine an easy-to-understand conceptual model with an implementation that enables the fast summarization of large data sets. This makes them a powerful tool for supporting the interactive analysis of massive data collections like data ...

chapter
Data squashing: constructing summary data sets
pp 579–591

A "large dataset" is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a ...

chapter
Mining and monitoring evolving data
pp 593–642

Data mining algorithms have been the focus of much recent research. The initial spurt of research on data mining algorithms typically considered static datasets. In practice, the input data to a data mining process resides in a large data warehouse ...

chapter
Data quality in massive data sets
pp 643–659

All data contain errors, and large spatial data sets are especially prone because they contain data from multiple sources, and use different assumptions about structure and semantics. The general issue is one of data quality assurance, defined in terms ...

chapter
Data warehousing
pp 661–710

A data warehouse is a repository for information that is collected, cleaned, and made available for analysis. A well run data warehouse makes many analyses easy to run, because many complex details have been taken care of already. In this chapter, we ...

chapter
Aggregate view management in data warehouses
pp 711–741

Materialized views and their potential have been recently rediscovered in the context of OLAP and data warehousing. A flurry of papers has been generated on how views can be used to accelerate ad-hoc computations over massive datasets. In this chapter ...

chapter
Semistructured data and XML
pp 743–788

The distinguishing feature of semistructured data is that the schema is embedded with the data. The main challenge is to cope with the additional flexibility without sacrificing efficiency. We introduce semistructured data by presenting a syntax and ...

chapter
Overview of high performance computers
pp 791–852

The overview given here concentrates on the computational capabilities of the systems discussed. To do full justice to all assets of present-day high-performance computers one should list their I/O performance and their connectivity possibilities as ...

chapter
The national scalable cluster project: three lessons about high performance data mining and data intensive computing
pp 853–874

We discuss three principles learned from experience with the National Scalable Cluster Project. Storing, managing and mining massive data requires systems that exploit parallelism. This can be achieved with shared-nothing clusters and careful attention ...

chapter
Sorting and selection on parallel disk models
pp 875–892

Data explosion is an increasingly prevalent problem in every field of science. Traditional out-of-core models that assume a single disk have been found inadequate to handle voluminous data. As a result, models that employ multiple disks have been ...

chapter
Billing in the large
pp 895–909

There is a growing need for very large databases which are not practical to implement with conventional relational database technology. These databases are characterized by huge size and frequent large updates; they do not require traditional database ...

chapter
Detecting fraud in the real world
pp 911–929

Finding telecommunications fraud in masses of call records is more difficult than finding a needle in a haystack. In the haystack problem, there is only one needle that does not look like hay, the pieces of hay all look similar, and neither the needle ...

chapter
Massive datasets in astronomy
pp 931–979

Astronomy has a long history of acquiring, systematizing, and interpreting large quantities of data. Starting from the earliest sky atlases through the first major photographic sky surveys of the 20th century, this tradition is continuing today, and at ...

chapter
Data management in environmental information systems
pp 981–1091

This chapter describes the design and implementation of information systems to support decision-making in environmental management and protection. We employ a three-way object model: An environmental object (such as a lake) is described by one or more ...

chapter
Massive data sets issues in earth observing
pp 1093–1140

Current and next decade global Earth observing, other remote sensing and related climate analysis data collected by space and operational U.S. agencies such as NASA and NOAA, the European ESA, the Japanese NASDA and other international agency missions ...

chapter
Mining biomolecular data using background knowledge and artificial neural networks
pp 1141–1168

Biomolecular data mining is the activity of finding significant information in protein, DNA and RNA molecules. The significant information may refer to motifs, clusters, genes, protein signatures and classification rules. This chapter presents an ...

chapter
Massive data set issues in air pollution modelling
pp 1169–1220

Air pollution, especially the reduction of air pollution to acceptable levels, is a highly relevant environmental problem, which is becoming more and more important. This problem can successfully be studied only when high-resolution ...

Cited By

  1. Chen Y, Guo B and Huang X (2019). δ-Transitive closures and triangle consistency checking: a new way to evaluate graph pattern queries in large graph databases, The Journal of Supercomputing, 76:10, (8140-8174), Online publication date: 1-Oct-2020.
  2. Karmakar N and Biswas A Construction of an Approximate 3D Orthogonal Convex Skull Proceedings of the 6th International Workshop on Computational Topology in Image Context - Volume 9667, (180-192)
  3. Pepelyshev A, Staroselskiy Y and Zhigljavsky A Adaptive Targeting for Online Advertisement Revised Selected Papers of the First International Workshop on Machine Learning, Optimization, and Big Data - Volume 9432, (240-251)
  4. Mezzanzanica M, Boselli R, Cesarini M and Mercorio F (2015). A Model-Based Approach for Developing Data Cleansing Solutions, Journal of Data and Information Quality, 5:4, (1-28), Online publication date: 3-Mar-2015.
  5. Angel E, Campigotto R and Laforest C Implementation and comparison of heuristics for the vertex cover problem on huge graphs Proceedings of the 11th international conference on Experimental Algorithms, (39-50)
  6. Mironov I, Naor M and Segev G Sketching in adversarial environments Proceedings of the fortieth annual ACM symposium on Theory of computing, (651-660)
  7. Dagher I (2008). Quadratic kernel-free non-linear support vector machine, Journal of Global Optimization, 41:1, (15-30), Online publication date: 1-May-2008.
  8. Vitter J (2008). Algorithms and data structures for external memory, Foundations and Trends® in Theoretical Computer Science, 2:4, (305-474), Online publication date: 1-Jan-2008.
  9. Bradonjić M, Hagberg A and Percus A Giant component and connectivity in geographical threshold graphs Proceedings of the 5th international conference on Algorithms and models for the web-graph, (209-216)
  10. Leskovec J, Kleinberg J and Faloutsos C (2007). Graph evolution, ACM Transactions on Knowledge Discovery from Data, 1:1, (2-es), Online publication date: 1-Mar-2007.
  11. Donato D, Laura L, Leonardi S and Millozzi S (2007). The Web as a graph, ACM Transactions on Internet Technology, 7:1, (4-es), Online publication date: 1-Feb-2007.
  12. Leskovec J, Kleinberg J and Faloutsos C Graphs over time Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, (177-187)
Contributors
  • Rutgers University–New Brunswick
  • University of Florida
  • University of Washington
