ABSTRACT
This paper describes the building of a research library for studying the Web, especially how the structure and content of the Web change over time. The library is aimed particularly at social scientists, for whom the Web is both a fascinating social phenomenon and a mirror on society. The library is built on the collections of the Internet Archive, which has been preserving a crawl of the Web every two months since 1996. The technical challenges in organizing this data for research fall into two categories: high-performance computing to transfer and manage the very large amounts of data, and human-computer interfaces that empower researchers who are not computer specialists.