Characterization of national Web domains

Published: 01 May 2007
Abstract

During the last few years, several studies characterizing the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web because they are at once diverse (as they are written by several authors) and yet rather similar (as they share a common geographical, historical and cultural context).

This article discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a side-by-side comparison of the results of 12 Web characterization studies, comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.

Published in

ACM Transactions on Internet Technology, Volume 7, Issue 2
May 2007
152 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/1239971

Copyright © 2007 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
