Open source decentralized data markets for training AI in areas of large shared risk
Paco Nathan examines decentralized data markets. With components based on blockchain technologies (smart contracts, token-curated registries, DApps, voting mechanisms, etc.), decentralized data markets allow multiple parties to curate ML training datasets in ways that are transparent, auditable, and secure, and allow equitable payouts that take social values into account.
Paco Nathan (derwen.ai)
Artificial Intelligence Conference
Put AI to Work
San Francisco, California
September 5-7, 2018
As the risk and reward trade-offs grow for products based on AI, along with the pressures of compliance and accountability, at what point is it no longer acceptable for any one commercial entity to hold responsibility for so much shared risk? Can we incentivize corporations, government agencies, independent watchdog groups, and other relevant parties to combine their data in cases where there are large shared risks?
ML models have become ubiquitous, embedded in products and services used throughout our daily lives. Generally, those models get deployed by large commercial interests, which train them on proprietary datasets. However, problems of ethics, privacy, safety, bias, and related concerns can have a terrible impact on individuals.
For example, Google develops large sets of training data from crucial sensors in self-driving cars. In an almost adversarial way, regulators on multiple continents focus on the impact of failure cases related to those sensors and their associated ML models. Edge cases in test datasets prove to be disproportionately valuable and could potentially serve as the basis for economic incentives. Instead of trusting each manufacturer to build “near perfect” training datasets while bearing large risks, we should incentivize manufacturers to combine their data. Rewards for contributing parties could then derive from a combination of training data and testing edge cases, as identified by regulators and other watchdog parties.
Paco Nathan explains how decentralized data markets provide a means to resolve difficult problems when training machine learning models, especially for use cases with large shared risks. With components based on blockchain technologies (smart contracts, token-curated registries, DApps, voting mechanisms, etc.), decentralized data markets allow multiple parties to curate ML training datasets in ways that are transparent, auditable, and secure, and allow equitable payouts that take social values into account.
Paco explores open source libraries from Computable.io, built on Ethereum, which are being used to develop data markets. These libraries enable users to adjust trade-offs between decentralized and centralized characteristics as needed for specific business use cases and as indicated by ethical concerns. The same approach addresses other areas of machine learning risk, such as genomics, medical research, and financial credit scores, where proprietary interests and social needs often come into conflict.