Abstract
High-quality data is critical for effective data science. As the use of data science has grown, so too have concerns that individuals’ rights to privacy will be violated. This has led to the development of data protection regulations around the globe and the use of sophisticated anonymization techniques to protect privacy. Such measures make it more challenging for the data scientist to understand the data, exacerbating issues of data quality. Responsible data science aims to develop useful insights from the data while fully embracing these considerations.
In this article, we pose a high-level problem: “How can a data scientist develop the needed trust that private data has high quality?” We then identify a series of challenges for various data-centric communities and outline research questions that data quality and privacy researchers would need to address to answer it.
Index Terms
- Ensuring High-Quality Private Data for Responsible Data Science: Vision and Challenges