Abstract
The Open Data movement has become a driver for publicly available data on the Web. More and more data, from governments and public institutions but also from the private sector, are made available online, mainly through so-called Open Data portals. However, as the number of published resources grows, there are a number of concerns regarding the quality of the data sources and the corresponding metadata, which compromise the searchability, discoverability, and usability of resources.
To get a more complete picture of the severity of these issues, the present work aims to develop a generic metadata quality assessment framework for various Open Data portals. We treat data portals independently of their underlying software by mapping the specific metadata of three widely used portal software frameworks (CKAN, Socrata, OpenDataSoft) to the standardized Data Catalog Vocabulary (DCAT) metadata schema. We then define several quality metrics that can be evaluated automatically and efficiently. Finally, we report findings based on monitoring a set of over 260 Open Data portals with 1.1M datasets. This includes a discussion of general quality issues, for example the retrievability of data, and an analysis of our specific quality metrics.
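As a rough illustration of the idea of mapping framework-specific metadata to a common schema and then scoring it automatically, the following minimal Python sketch may help. It is not the framework described in this work: the key mappings, the selection of DCAT-style properties, and the function names `to_dcat` and `existence_score` are assumptions chosen for illustration, and the score is only a simple "is the field filled in" ratio.

```python
from typing import Any, Dict

# Hypothetical mappings from portal-specific metadata keys to DCAT-style
# property names. The source keys are illustrative assumptions, not the
# mapping defined in the paper.
KEY_MAPS: Dict[str, Dict[str, str]] = {
    "ckan": {
        "title": "dct:title",
        "notes": "dct:description",
        "license_id": "dct:license",
    },
    "socrata": {
        "name": "dct:title",
        "description": "dct:description",
        "license": "dct:license",
    },
}

# The DCAT-style properties this sketch checks (an assumed subset).
DCAT_PROPERTIES = ("dct:title", "dct:description", "dct:license")


def to_dcat(framework: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a portal-specific record into DCAT-style keys."""
    mapping = KEY_MAPS[framework]
    return {dcat_key: record.get(src_key) for src_key, dcat_key in mapping.items()}


def existence_score(dcat_record: Dict[str, Any]) -> float:
    """Fraction of the checked properties that hold a non-empty value."""
    filled = sum(
        1 for prop in DCAT_PROPERTIES
        if str(dcat_record.get(prop) or "").strip()
    )
    return filled / len(DCAT_PROPERTIES)


# Two made-up records from different portal frameworks.
ckan_record = {"title": "Budget 2015", "notes": "", "license_id": "cc-by"}
socrata_record = {"name": "Street trees", "description": "Tree inventory"}

ckan_score = existence_score(to_dcat("ckan", ckan_record))
socrata_score = existence_score(to_dcat("socrata", socrata_record))
```

Because both records are translated to the same property names before scoring, the same metric function applies regardless of which portal software produced the metadata, which is the point of the homogenization step.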
Index Terms
- Automated Quality Assessment of Metadata across Open Data Portals