Algorithmic aspects of information retrieval on the web
The Web explosion offers a bonanza of novel problems. In particular, information retrieval in the Web context requires methods and ideas that have not been addressed in the classic information retrieval literature. This chapter will survey emerging ...
High-performance web crawling
High-performance web crawlers are an important component of many web services. For example, search services use web crawlers to populate their indices, comparison shopping engines use them to collect product and pricing information from online vendors, ...
Internet growth: is there a "Moore's law" for data traffic?
Internet traffic is approximately doubling each year. This growth rate applies not only to the entire Internet, but to a large range of individual institutions. For a few places we have records going back several years that exhibit this regular rate of ...
Random evolution in massive graphs
Many massive graphs (such as WWW graphs and Call graphs) share certain universal characteristics which can be described by the so-called "power law". In this paper, we first briefly survey the history and previous work on power law graphs. Then we ...
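The power law referred to above says that the number of vertices of degree k falls off roughly as k to a negative power. A minimal sketch of the bookkeeping involved — tabulating a graph's degree distribution — on a hypothetical toy edge list (not data from the chapter):

```python
from collections import Counter

def degree_distribution(edges):
    """Count how many vertices have each degree in an undirected graph."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return Counter(deg.values())  # maps degree -> number of vertices

# Toy star-like graph: hub 0 connected to 1..4, plus one extra edge.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
dist = degree_distribution(edges)
```

On real web or call graphs one would plot this distribution on log-log axes and look for a straight line, the signature of a power law.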
Property testing in massive graphs
We consider the task of evaluating properties of graphs that are too big even to be scanned. Thus, the input graph is given in the form of an oracle which answers questions of the form: is there an edge between vertices u and v, or who is the ith neighbor of ...
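The oracle model described above can be sketched as follows: the algorithm never sees the whole graph, only answers to edge queries, and estimates a global quantity from a small random sample. This is an illustrative toy (edge-density estimation with hypothetical oracles), not an algorithm from the chapter:

```python
import random

def estimate_edge_density(has_edge, n, samples=1000, seed=0):
    """Estimate the fraction of vertex pairs joined by an edge using
    only oracle queries has_edge(u, v); the graph is never scanned."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        u, v = rng.sample(range(n), 2)  # a random pair of distinct vertices
        hits += bool(has_edge(u, v))
    return hits / samples

# Oracle for the complete graph on 50 vertices: every pair is an edge.
dense = estimate_edge_density(lambda u, v: True, 50)
# Oracle for the empty graph on 50 vertices.
empty = estimate_edge_density(lambda u, v: False, 50)
```

The point of property testing is that such sampling-based procedures can decide (approximately) whether a massive graph has a property using a number of queries independent of, or much smaller than, the graph's size.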
String pattern matching for a deluge survival kit
String Pattern Matching concerns itself with algorithmic and combinatorial issues related to matching and searching on linearly arranged sequences of symbols, arguably the simplest possible discrete structures. As unprecedented volumes of sequence data ...
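As a concrete instance of the algorithmic side of string pattern matching, here is a sketch of Knuth-Morris-Pratt search, a classical linear-time matcher (chosen here for illustration; the chapter surveys the field more broadly):

```python
def kmp_search(text, pattern):
    """Return all start positions of pattern in text (Knuth-Morris-Pratt)."""
    if not pattern:
        return []
    # Failure function: length of the longest proper prefix of
    # pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, never re-reading a text character.
    out, k = [], 0
    for i, c in enumerate(text):
        while k and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            out.append(i - k + 1)
            k = fail[k - 1]
    return out
```

The failure function is what makes the scan linear: on a mismatch the pattern is shifted without backing up in the text, which matters when the "text" is a multi-gigabyte sequence stream.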
Searching large text collections
In this chapter we present the main data structures and algorithms for searching large text collections. We emphasize inverted files, the most widely used index, but also review suffix arrays, which are useful in a number of specialized applications. We also ...
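The core of an inverted file is a mapping from each term to the list of documents containing it; conjunctive queries then become intersections of posting lists. A minimal in-memory sketch (toy documents, no compression or ranking, unlike a production index):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, *terms):
    """Conjunctive (AND) query: documents containing every term."""
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["external memory algorithms",
        "memory hierarchies and caching",
        "suffix arrays for text search"]
idx = build_inverted_index(docs)
```

Real inverted files store the postings compressed on disk; the logical structure, however, is exactly this term-to-postings map.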
Data compression
The exponential growth of computer applications in the last three decades of the 20th century has resulted in an explosive growth in the amounts of data moved between computers, collected, and stored by computer users. This, in turn, has created the ...
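One of the simplest compression schemes, run-length encoding, already illustrates the basic trade the chapter is concerned with: exploiting redundancy in the data to reduce its stored size. A minimal sketch (not a scheme singled out by the chapter):

```python
from itertools import groupby

def rle_encode(s):
    """Run-length encoding: collapse each run of a repeated symbol
    into a (symbol, run_length) pair."""
    return [(ch, sum(1 for _ in grp)) for ch, grp in groupby(s)]

def rle_decode(pairs):
    """Invert the encoding: expand each pair back into a run."""
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("aaabbbbcc")
```

Lossless schemes used in practice (Huffman coding, Lempel-Ziv variants) are more sophisticated, but share this round-trip property: decode(encode(x)) == x.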
External memory data structures
In many massive dataset applications the data must be stored in space and query efficient data structures on external storage devices. Often the data needs to be changed dynamically. In this chapter we discuss recent advances in the development of ...
External memory algorithms
Data sets in large applications are often too massive to fit completely inside the computer's internal memory. The resulting input/output communication (or I/O) between fast internal memory and slower external memory (such as disks) can be a major ...
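The canonical response to the I/O bottleneck described above is the two-phase external merge sort: read memory-sized chunks, sort each into a run, then perform a k-way merge of the runs. A sketch in which in-memory lists stand in for sorted runs on disk (a simplification; real implementations stream runs from files):

```python
import heapq

def external_sort(stream, memory_limit=4):
    """Two-phase external merge sort sketch: sort memory-sized runs,
    then k-way merge them with a heap (heapq.merge)."""
    runs, buf = [], []
    for item in stream:
        buf.append(item)
        if len(buf) == memory_limit:   # "memory" is full: flush a run
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    # heapq.merge consumes the runs lazily, mirroring a streaming merge
    # of sorted files on disk.
    return list(heapq.merge(*runs))

result = external_sort([9, 1, 7, 3, 8, 2, 6, 4, 5])
```

The analysis in the external-memory model counts block transfers rather than comparisons, which is why the merge phase, with its strictly sequential reads, dominates the design.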
Data envelopment analysis (DEA) in massive data sets
Data Envelopment Analysis (DEA) is a clustering methodology for records in data sets corresponding to entities sharing a common list of attributes. Broadly defined, DEA partitions the records into two subsets: those 'efficient' and those 'inefficient'. ...
Optimization methods in massive data sets
We describe the role of generalized support vector machines in separating massive and complex data using arbitrary nonlinear kernels. Feature selection that improves generalization is implemented via an effective procedure that utilizes a polyhedral ...
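The key idea behind kernel methods such as the generalized support vector machines mentioned above is that a nonlinear kernel lets a linear algorithm separate data that is not linearly separable. A minimal sketch using a Gaussian (RBF) kernel with a dual-form perceptron standing in for the SVM solver (the XOR data and all names here are illustrative, not from the chapter):

```python
import math

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: one example of a nonlinear kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_perceptron(points, labels, kernel, epochs=20):
    """Dual-form perceptron: like a kernel SVM, its decisions depend on
    the data only through kernel evaluations."""
    alpha = [0] * len(points)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(points, labels)):
            s = sum(a * yl * kernel(xl, x)
                    for a, yl, xl in zip(alpha, labels, points))
            if y * s <= 0:          # misclassified: strengthen this point
                alpha[i] += 1
    def predict(x):
        s = sum(a * yl * kernel(xl, x)
                for a, yl, xl in zip(alpha, labels, points))
        return 1 if s > 0 else -1
    return predict

# XOR labels: not linearly separable, but separable with an RBF kernel.
pts = [(0, 0), (0, 1), (1, 0), (1, 1)]
lab = [1, -1, -1, 1]
clf = kernel_perceptron(pts, lab, rbf)
```

A genuine SVM additionally maximizes the margin via a quadratic program; the kernel trick shown here is the part both methods share.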
Wavelets and multiscale transform in astronomical image processing
With the requirements of scientific and medical image database support in mind, we describe a range of useful technologies for storage, transmission and display. These new technologies are all based on discrete wavelet or related multiscale transforms. ...
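The simplest discrete wavelet transform, the Haar transform, already shows the multiscale decomposition these technologies rest on: each level splits a signal into a coarse approximation (pairwise averages) and detail coefficients (pairwise differences). A one-level sketch on a toy signal:

```python
def haar_step(signal):
    """One level of the Haar wavelet transform: pairwise averages
    (coarse approximation) and pairwise differences (details)."""
    avgs = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    dets = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return avgs, dets

def haar_inverse(avgs, dets):
    """Perfect reconstruction from averages and details."""
    out = []
    for a, d in zip(avgs, dets):
        out += [a + d, a - d]
    return out

avgs, dets = haar_step([9, 7, 3, 5])
```

For image storage and transmission the transform is iterated on the averages; small detail coefficients can then be quantized or dropped, which is where the compression and progressive-display capabilities come from.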
Clustering in massive data sets
We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far ...
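To make the cost discussion concrete, here is a sketch of Lloyd's k-means, a standard clustering algorithm whose per-iteration cost is O(nk) distance evaluations; the 1-D toy data is illustrative, not a case study from the chapter:

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's k-means on 1-D data: assign each point to its nearest
    center, then move each center to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
```

For massive data sets the chapter's concern is precisely that even this linear-per-iteration cost, let alone the quadratic cost of hierarchical methods, forces the use of sampling, summarization, or specialized data structures.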
Managing and analyzing massive data sets with data cubes
Data cubes combine an easy-to-understand conceptual model with an implementation that enables the fast summarization of large data sets. This makes them a powerful tool for supporting the interactive analysis of massive data collections like data ...
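Conceptually, a data cube materializes the aggregate for every subset of the chosen dimensions — the 2^d group-bys of a d-dimensional cube — so that roll-ups are answered by lookup. A minimal in-memory sketch on hypothetical sales rows:

```python
from itertools import combinations
from collections import defaultdict

def data_cube(rows, dims, measure):
    """Pre-aggregate a measure over every subset of the dimensions
    (the 2^d group-bys that make up a data cube)."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(dims, k):
            agg = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in group)
                agg[key] += row[measure]
            cube[group] = dict(agg)
    return cube

sales = [
    {"region": "EU", "year": 2000, "units": 10},
    {"region": "EU", "year": 2001, "units": 20},
    {"region": "US", "year": 2000, "units": 30},
]
cube = data_cube(sales, ("region", "year"), "units")
```

The empty group `()` holds the grand total; practical systems materialize only a chosen subset of the 2^d views and derive the rest, which is the view-selection problem taken up later in the handbook.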
Data squashing: constructing summary data sets
A "large dataset" is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a ...
Mining and monitoring evolving data
Data mining algorithms have been the focus of much recent research. The initial spurt of research on data mining algorithms typically considered static datasets. In practice, the input data to a data mining process resides in a large data warehouse ...
Data quality in massive data sets
All data contain errors, and large spatial data sets are especially prone because they contain data from multiple sources, and use different assumptions about structure and semantics. The general issue is one of data quality assurance, defined in terms ...
Data warehousing
A data warehouse is a repository for information that is collected, cleaned, and made available for analysis. A well run data warehouse makes many analyses easy to run, because many complex details have been taken care of already. In this chapter, we ...
Aggregate view management in data warehouses
Materialized views and their potential have been recently rediscovered in the context of OLAP and data warehousing. A flurry of papers has been generated on how views can be used to accelerate ad-hoc computations over massive datasets. In this chapter ...
Semistructured data and XML
The distinguishing feature of semistructured data is that the schema is embedded with the data. The main challenge is to cope with the additional flexibility without sacrificing efficiency. We introduce semistructured data by presenting a syntax and ...
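The "schema embedded with the data" point can be seen directly in XML: each record carries its own tags, and records need not be uniform. A small sketch using Python's standard XML parser on a made-up document (one record below deliberately lacks a field):

```python
import xml.etree.ElementTree as ET

# Semistructured data: the structure travels with the data, and records
# may be irregular -- the second person has no <born> element.
doc = """
<people>
  <person><name>Ada</name><born>1815</born></person>
  <person><name>Alan</name></person>
</people>
"""
root = ET.fromstring(doc)
names = [p.findtext("name") for p in root.iter("person")]
born = root.find("person/born").text  # path query into the first record
```

Coping with this flexibility efficiently — querying by path expressions without a fixed relational schema — is exactly the challenge the chapter addresses.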
Overview of high performance computers
The overview given here concentrates on the computational capabilities of the systems discussed. To do full justice to all assets of present-day high-performance computers one should list their I/O performance and their connectivity possibilities as ...
The national scalable cluster project: three lessons about high performance data mining and data intensive computing
We discuss three principles learned from experience with the National Scalable Cluster Project. Storing, managing and mining massive data requires systems that exploit parallelism. This can be achieved with shared-nothing clusters and careful attention ...
Sorting and selection on parallel disk models
Data explosion is an increasingly prevalent problem in every field of science. Traditional out-of-core models that assume a single disk have been found inadequate to handle voluminous data. As a result, models that employ multiple disks have been ...
Billing in the large
There is a growing need for very large databases which are not practical to implement with conventional relational database technology. These databases are characterized by huge size and frequent large updates; they do not require traditional database ...
Detecting fraud in the real world
Finding telecommunications fraud in masses of call records is more difficult than finding a needle in a haystack. In the haystack problem, there is only one needle that does not look like hay, the pieces of hay all look similar, and neither the needle ...
Massive datasets in astronomy
Astronomy has a long history of acquiring, systematizing, and interpreting large quantities of data. Starting from the earliest sky atlases through the first major photographic sky surveys of the 20th century, this tradition is continuing today, and at ...
Data management in environmental information systems
This chapter describes the design and implementation of information systems to support decision-making in environmental management and protection. We employ a three-way object model: An environmental object (such as a lake) is described by one or more ...
Massive data sets issues in earth observing
Current and next decade global Earth observing, other remote sensing and related climate analysis data collected by space and operational U.S. agencies such as NASA and NOAA, the European ESA, the Japanese NASDA and other international agency missions ...
Mining biomolecular data using background knowledge and artificial neural networks
Biomolecular data mining is the activity of finding significant information in protein, DNA and RNA molecules. The significant information may refer to motifs, clusters, genes, protein signatures and classification rules. This chapter presents an ...
Massive data set issues in air pollution modelling
Air pollution, and especially its reduction to acceptable levels, is a highly relevant environmental problem of growing importance. This problem can successfully be studied only when high-resolution ...
Index Terms: Handbook of massive data sets