The COVID-19 pandemic has triggered the creation of worldwide data hubs to collect data, and the development of software tools to extract insights from data. These insights can be used to find new ways to reduce disease spread, protect healthcare workers and accelerate the development of treatments, including a vaccine. A data hub might be available “as-is” to “use at your own risk,” requiring collaborators to manage legal, regulatory and ethical risks.
Data hubs can involve different health datasets, such as clinical and research data. COVID-19 Open Research Dataset (CORD-19) provides more than 44,000 articles about COVID-19 and the coronavirus family of viruses in a centralized hub in a machine-readable format, making it easier for use by the global machine learning community. Currently, CORD-19 is the most extensive coronavirus literature collection available for data and text mining. MiPasa, another data hub that collects COVID-19 clinical data and laboratory data, is being assembled by the World Health Organization in partnership with technology companies. These types of health datasets can be integrated with non-health datasets, such as location and map datasets, to identify patterns and generate predictions of disease spread. COVID-19 geographic information system by Esri includes maps and applications for pandemic response. Mobile phone location data can be used to monitor physical distancing and track movement of people who have been exposed to the virus.
In addition to collecting data, data hubs can integrate software tools to generate analytics and export datasets. Furthermore, developers can use application programming interfaces (APIs) or software development kits (SDKs) from the hub to develop third-party applications. For example, machine learning applications can be developed using hub-provided training, validation and test datasets. These applications can then apply the trained models to other datasets to generate predictions or derived insights.
Intellectual property (IP) rights can protect data and the software tools of data hubs. Data hubs can make data and tools available for participants to use at their own risk, without IP assurance. Clearing and tracking contributed IP rights will minimize the risk of disputes and contested ownership. Large-scale data hubs involve many different stakeholders and owners of IP rights, making IP management quite complex.
Different types of IP rights apply to data hubs. For example, from a Canadian perspective, copyright can protect data and code if they resulted from an exercise of skill and judgment, rather than from a purely mechanical exercise. Consequently, raw clinical data are unlikely to pass this threshold but research articles would likely attract copyright protection. Data compilations, SDKs, and APIs can also be protected by copyright. Additionally, patents can protect functional aspects of data hubs, such as systems for training or generating models using the collected data on COVID-19. Patents can also protect data visualization, security or storage techniques, and practical applications of machine learning systems, including for health applications. In some instances, the data, or derivatives of the data, may also be protected as proprietary information or trade secrets held by various organizations.
The existence of these IP rights requires the relevant owners and creators of IP assets to be identified. The owner can contribute IP to the data hub which can motivate others to also make contributions. Only the owner should contribute the IP assets to a data hub to mitigate future disputes over use of the contributed data or resulting tools and any derived insights. If any data or software tools were included in a data hub without the permission of the owner, then any use might be unlawful and challenged by the owner.
Generating datasets for these hubs can involve “data scraping” or automatically collecting data online or through other published sources. However, publication of data does not automatically grant a license to use the data for any purpose. For example, a dataset for machine vision is currently subject to ongoing litigation in view of allegations that the underlying data was obtained in breach of the terms of service prohibiting data scraping.
To mitigate litigation risks for downstream users and projects, IP rights to data and software should be cleared before they are intermingled in a data hub or used by a participant. If any portion of data is added without the permission of the owner, then the data might be required to be removed. A data hub can have an IP clearance framework that traces ownership of different IP assets. Software solutions can use metadata to track data provenance (i.e., the origins of the data in datasets and creators/owners). This can help identify and remove any impugned data from the data hub. Such metadata allows data ownership and contributions of IP rights to be tracked and verified.
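As a rough illustration of how provenance metadata could support this kind of clearance, the sketch below attaches a small record to each contributed item and uses it to list uncleared items and to separate out items traced to a disputed owner. The record fields, function names and sample entries are hypothetical and are not drawn from any particular data hub or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance metadata for one contributed item."""
    item_id: str
    owner: str          # party asserting IP rights in the item
    source: str         # how the item reached the hub
    license_terms: str  # terms under which it was contributed
    cleared: bool       # True once the owner's permission is confirmed


def uncleared_items(records):
    """Return the IDs of items whose IP rights have not been cleared."""
    return [r.item_id for r in records if not r.cleared]


def split_by_owner(records, disputed_owner):
    """Separate items traced to a disputed owner from the rest."""
    kept = [r for r in records if r.owner != disputed_owner]
    removed = [r for r in records if r.owner == disputed_owner]
    return kept, removed


# Illustrative entries only; these are not real contributions.
hub = [
    ProvenanceRecord("article-001", "Publisher A", "publisher feed",
                     "text and data mining only", True),
    ProvenanceRecord("image-017", "Publisher B", "web scrape",
                     "unknown", False),
    ProvenanceRecord("lab-204", "Hospital C", "direct contribution",
                     "research use", True),
]

pending = uncleared_items(hub)
kept, removed = split_by_owner(hub, "Publisher B")
```

In this sketch, an item without confirmed permission shows up in `pending` before it is used, and if an owner later objects, `split_by_owner` identifies exactly which items would need to be withdrawn.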
Contractual agreements for a data hub define the scope of permitted use of data and tools by others. Various licensing models are possible. For example, an owner can permit use by others using an “open” licensing model so that data can be freely used, reused, and redistributed. However, any use is still subject to contractual restrictions. For example, CORD-19 provides a dataset license agreement that permits text and data mining only.
Ownership and permitted use of data, including the treatment of original, derived and usage data, as well as any IP rights in applications that process data, should be clearly set out in agreements. The data hub can involve different licenses to support collaboration and motivate participation by different stakeholders. For example, patent owners can permit use of their inventions by users of a data hub. This might encourage other participants to permit use of their inventions to create a “patent pool” for the hub. Parties can mutually agree not to assert their rights in an effort to reach a common goal: developing solutions to mitigate the impact of COVID-19.
Data can be contributed “as is.” Parties relying on data to develop health solutions should consider representations and warranties relating to the quality of the datasets. Otherwise, limitations should be clearly set out and disclaimers should be considered. Parties should also review the license terms for the data hubs to ensure they comply with permitted uses for the data and tools.
Data collected as part of the effort against COVID-19 will include private, personal information, and health data. In some cases, inferences can be drawn from the data without the direct collection of such information, and even when personal information has been removed. Collection and use of sensitive data engages regulatory frameworks that protect personal information, which vary from country to country, such as PIPEDA (Canada) and the GDPR (European Union). Participants of data hubs should manage their own regulatory compliance and not rely on the organization centrally managing the data hub. The global nature of these data hubs makes regulatory compliance for participants particularly complex.
The collection of patient data and their use in developing a vaccine trigger privacy and public health regulatory requirements. These include limiting the purpose for which the data can be used, getting appropriate consent, and setting up strict technical and management measures to prevent data breaches. Patient data might have additional restrictions on use around informed consent. Generally speaking, requirements vary by jurisdiction, which calls for a global compliance process. Please see related articles regarding the federal privacy commissioner’s privacy law guidance bulletin about the COVID-19 pandemic as well as issues regarding cybersecurity, fraud and liability during the pandemic.
Use of data hubs for COVID-19 research raises ethical considerations that can lead to reputational risks. Data might not be evaluated with quality assurance measures to verify accuracy and data integrity. Additionally, data might not equally represent a population, which can create bias. The bias may unintentionally skew results and lead to treatments that do not apply equally to all individuals. There should also be transparency into what data are included in the hub, how they were collected, and what factors are used to generate different insights. A data hub can have an ethics committee and data governance model. For example, a data trust framework can set terms for the collection and usage of data in which parties entrust one group or entity to make decisions about the data for the benefit of a wider group of stakeholders.
Big data and machine learning have great potential to support solutions for the COVID-19 pandemic. Even in times of crisis, legal, regulatory and ethical risks should be managed to help maximize benefits. Should you have questions regarding COVID-19 data hubs or how to implement our suggested best practices, do not hesitate to contact our team.
The authors would like to thank Saba Samanianpour, articling student, for her help in preparing this legal update.