Abstract
During the last few years, several studies characterizing the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web because they are at once diverse (written by many authors) and yet rather similar (sharing a common geographical, historical and cultural context).
This article discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies, comprising over 120 million pages from 24 countries. The comparison reveals similarities and differences between the collections and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.