Automated Quality Assessment of Metadata across Open Data Portals

Published: 25 October 2016

Abstract

The Open Data movement has become a driver for publicly available data on the Web. More and more data, from governments and public institutions but also from the private sector, are made available online and are mainly published in so-called Open Data portals. However, with the increasing number of published resources, there are a number of concerns with regard to the quality of the data sources and the corresponding metadata, which compromise the searchability, discoverability, and usability of resources.

To get a more complete picture of the severity of these issues, the present work aims to develop a generic metadata quality assessment framework for various Open Data portals. We treat data portals independently of the portal software frameworks by mapping the specific metadata of three widely used portal software frameworks (CKAN, Socrata, OpenDataSoft) to the standardized Data Catalog Vocabulary (DCAT) metadata schema. We subsequently define several quality metrics, which can be evaluated automatically and efficiently. Finally, we report findings based on monitoring a set of over 260 Open Data portals with 1.1M datasets. This includes a discussion of general quality issues, for example, the retrievability of data, and an analysis of our specific quality metrics.
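The idea of mapping portal-specific metadata to a common schema and then scoring it automatically can be illustrated with a minimal sketch. The abstract does not give the actual mappings or metric definitions, so everything below is an assumption made for illustration: the key mappings are simplified and hypothetical, and `existence_score` is only one plausible form of an automatically computable metric (the fraction of required DCAT properties that are filled in).

```python
from typing import Any, Dict, Sequence

# Hypothetical, simplified key mappings (assumptions for illustration only;
# the paper's real mappings are not stated in the abstract).
CKAN_TO_DCAT = {
    "title": "dct:title",
    "notes": "dct:description",
    "license_id": "dct:license",
    "author": "dct:publisher",
}
SOCRATA_TO_DCAT = {
    "name": "dct:title",
    "description": "dct:description",
    "license": "dct:license",
    "attribution": "dct:publisher",
}
MAPPINGS = {"ckan": CKAN_TO_DCAT, "socrata": SOCRATA_TO_DCAT}


def to_dcat(software: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Translate portal-specific metadata into DCAT-style keys.

    Missing or empty source values are dropped, so the result only
    contains properties that actually carry information.
    """
    mapping = MAPPINGS[software]
    return {
        dcat_key: metadata[src_key]
        for src_key, dcat_key in mapping.items()
        if metadata.get(src_key) not in (None, "")
    }


def existence_score(dcat_record: Dict[str, Any], required: Sequence[str]) -> float:
    """Fraction of the required DCAT properties present in the record."""
    present = sum(1 for key in required if key in dcat_record)
    return present / len(required)


# Illustrative choice of required properties (an assumption).
REQUIRED = ("dct:title", "dct:description", "dct:license", "dct:publisher")

ckan_dataset = {"title": "Budget 2016", "notes": "", "license_id": "cc-by"}
socrata_dataset = {
    "name": "Budget 2016",
    "description": "City budget by department",
    "license": "Public Domain",
    "attribution": "City Hall",
}

ckan_score = existence_score(to_dcat("ckan", ckan_dataset), REQUIRED)
socrata_score = existence_score(to_dcat("socrata", socrata_dataset), REQUIRED)
```

Because both records are reduced to the same DCAT-style keys before scoring, the same metric function applies regardless of which portal software produced the metadata, which is the portal-independence the abstract describes.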


          • Published in

            Journal of Data and Information Quality  Volume 8, Issue 1
            Special Issue on Web Data Quality
            November 2016
            125 pages
            ISSN:1936-1955
            EISSN:1936-1963
            DOI:10.1145/3012403

            Copyright © 2016 ACM

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 25 October 2016
            • Accepted: 1 June 2016
            • Revised: 1 March 2016
            • Received: 1 November 2015


            Qualifiers

            • research-article
            • Research
            • Refereed
