“Data are not, by themselves, a form of knowledge.”1
In January 2021, a former U.S. intelligence officer revealed that BGI, a Chinese DNA sequencing company suspected of having ties with the Chinese military and Chinese government, had attempted to collect American DNA under the guise of offering to build COVID-19 testing sites in six U.S. states, including California and New York.2 These states, along with their medical institutions, were warned against cooperating with BGI because it could give China the opportunity to harness this biomedical data and use it for economic and security purposes. This incident is only one of many recent examples that highlight the growing anxiety felt by the U.S. government and public about the increasing availability of big data, including biomedical data, and about how China might use this data.3 In June 2021, the Biden administration signed a new executive order regarding the threat posed by China to U.S. information technologies, systems, and digital data.4 These developments — as well as those involving intellectual property theft, espionage, and China’s foreign talents programs — point to a growing concern about China as a U.S. strategic competitor and national security threat.5
This paper will discuss the security concerns that have arisen about biomedical big data and how they are based on a simplistic understanding of what is required to create, process, and use biomedical big data. Bioinformatics and science and technology scholarship provides a more accurate, empirically based account of the challenges of working with and applying biomedical data — challenges that stem from the heterogeneous nature of big data. These studies indicate that bottlenecks and errors can be introduced at every stage of a piece of biomedical big data’s journey, from its creation through its processing, storage, transfer, and use. In addition, laboratories that produce biomedical big data do not have standardized methods for creating or working with data. Thus, moving the data from one location to another can often involve tedious and nontrivial data curation and translation in order to use that data in a new setting. All of these data usage issues pose challenges to actors — whether state or nonstate — wishing to licitly or illicitly acquire biomedical big data and use them for their economic or security benefit. For far too long, the U.S. security community has jumped to assumptions about how easily data can become a security threat, without considering the more complex socio-technical factors shaping how those data can be used in practice.
Instead of merely focusing on data acquisition, intelligence analysts, law enforcement officials, and researchers should spend more time studying China’s bioinformatics infrastructure. This should include determining who is working in China’s bioinformatics community, where the industries are located, and what their capabilities are, as well as how China has been able to deal with the problems inherent in building a bioinformatics research community. These researchers should also investigate how China’s bioinformatics community of experts is — or is not — overcoming the problems that have plagued China’s overall science and technology infrastructure and what problems the country has encountered in using and translating biomedical big data. Probing these kinds of research questions would provide a more nuanced understanding of what kinds of indigenous innovation are occurring within China’s bioinformatics community that might actually pose security concerns for the United States. If academia, intelligence analysts, and law enforcement could collect and analyze data relevant to answering these kinds of questions, they would acquire a much better understanding of what China’s science and technology capabilities are when it comes to actually using the biomedical big data that it acquires. The following sections will outline the limitations in existing assessments of Chinese biomedical hacking, as well as provide new analytic frameworks and a research roadmap for conducting more robust assessments of this potential security threat in the future.
Biomedical Big Data and Security Concerns
In recent years, there has been increased attention on how the avalanche of new biomedical and life science “big data” coming from genomic sequencing, databases, electronic medical records, and other sources will usher in a new era of “precision medicine” that will reap a variety of public health benefits.6 “Big data” is a term used to describe extremely large data sets that can only be analyzed computationally, either individually or integrated with other data sets, to reveal previously unknown patterns, trends, and associations. For example, data from large-scale genomic studies are expected to elucidate the role that genetics plays in particular diseases, indicating which individuals might develop a given disease or disorder, and whether a new drug or therapeutic treatment might help patients suffering from a medical condition. The National Institutes of Health started the Big Data to Knowledge Initiative and the Precision Medicine Initiative, which aim to gather heterogeneous biomedical data (e.g., data about genomics and proteomics, as well as patient electronic health records, clinical trial data, and environmental data, to name a few) on millions of Americans to better understand the genetic, environmental, and behavioral/lifestyle determinants of diseases for the purpose of creating genetically guided medical treatments, enabling early detection, and looking toward preventative interventions in the future.7
Advances in genome editing, a process of modifying DNA sequences at precise genomic locations, are seen as important research steps that can aid in the development of precision medicine.8 Harvesting the potential of genome editing for public health depends on collecting and organizing accurate and digitized information about human and animal gene sequences and genomes. Since the advent of the Human Genome Project, the sheer volume of digitized genomic information has been on the rise due to dramatic decreases in the costs of DNA sequencing and synthesis, computing power, and data storage. These genomic data, coupled with data from patient electronic medical records and other relevant biomedical data stored in large, digitized databases, are seen as critical to realizing the vision of precision medicine.9 New digital technologies are allowing for the mining, collection, analysis, and visualization of these vast reservoirs of biomedical data in new ways.
The potential for the nefarious acquisition of these data by commercial, state, or nonstate actors has been noted by U.S. intelligence and law enforcement officials, and academic and think tank scholars.10 These security concerns over biomedical big data emerged in the 2010s, soon after the term “big data” began garnering media and public attention.11 In 2014, the American Association for the Advancement of Science, in conjunction with the FBI and the U.N. Interregional Crime and Justice Research Institute, produced the first report that put a spotlight on the security risks inherent in the generation of big data in the life sciences. The report highlighted two key data-related vulnerabilities. The first is the vulnerability of biomedical databases and IT infrastructure to theft or hacking. The second is the possibility that malevolent actors may access, integrate, and analyze diverse private and publicly available biomedical data to create pathogens, toxins, or biologically active molecules to harm animals, plants, or people, and/or to evade current detection devices and other medical countermeasures. The report noted, “Unlike other security risks often associated with the life sciences, the potential risks of Big Data in the life sciences rely on attacking the underlying data and cyber infrastructure and/or inappropriately using data and analytic technologies, not theft of actual pathogens or toxins.”12 The report seems to suggest that the enhanced risk or threat comes from, or is driven by, the data and cyber infrastructure alone.
The FBI’s Weapons of Mass Destruction Directorate and Biological Countermeasures Unit funded the 2014 report and has taken special interest in the big data life science threat.13 Ed You, then head of the FBI’s countermeasures unit, claimed, “Now somebody out there has the brass ring—this gigantic data set, where the only limitation is deriving the analytical tools to make all that data useful. … Whoever has the largest and most diverse data set is going to win.”14 Similar to the American Association for the Advancement of Science report, You also emphasized the roles of data acquisition and software tools, suggesting that they alone are the key limiting factors to the nefarious application of big data. In 2018, National Defense University launched a multiyear study titled, “The Digitization of Biology: Understanding the New Risks and Implications for Governance,” which also called attention to the growing risks of the data that are available in genome editing. This study noted that “[m]alicious actors may be able to leverage CRISPR-Cas9 and the knowledge generated from legitimate research using pathogen genomic data to cause harm.”15 One year later, the Bipartisan Commission on Biodefense held a meeting that examined the vulnerabilities of these data sets and their potential misuse, among other biomedical data concerns.16 And in January 2020, the National Academies of Sciences, Engineering, and Medicine produced a report titled Safeguarding the Bioeconomy, funded by the Office of the Director of National Intelligence, which also placed attention on the vulnerabilities of cyber infrastructure related to biomedical data. In addition, the report highlighted how biomedical data sets could pose privacy risks, as well as economic and national security risks.
For example, data associated with personally identifiable information, other personal health information, and genomic data sets could be leveraged for blackmail, extortion, or various types of exploitation and surveillance.17 Thus, with this focus on big data in the life sciences, U.S. security concerns have moved from the biological materials themselves to the information generated from life science research.
These security concerns, which sit at the intersection of big data and biomedicine, have been further aggravated by several high-profile cyber security breaches at leading U.S. insurance companies. In 2015, hackers infiltrated Anthem, the second largest health insurer in the United States, and accessed a company database containing as many as 80 million records of current and former Anthem customers and employees.18 Subsequent investigations revealed that the hackers accessed personal information such as names, member identifications, Social Security numbers, home and email addresses, and employment information. Thankfully, the hackers were not able to access credit card or patient medical information. This incident drew public and policy attention to the vulnerability of electronic health records and the direct targeting of U.S. citizens and their biomedical data, and it raised questions about whether the Health Insurance Portability and Accountability Act (HIPAA) adequately safeguarded against these kinds of cyber attacks.
Since 2015, the problem of hackers targeting and accessing the data of health and biomedical institutions has worsened.19 According to Robert Lord, founder and chief strategy officer of the data security firm Protenus, in 2019, 32 million patient records were breached by hacking — double the amount seen in 2018.20 The 2019 Healthcare Data Breach Report, released by the HIPAA Journal, found that more healthcare records were breached in 2019 than in the six years from 2009 to 2014.21 Because of growing concerns about data protection, in 2019, the U.S. Senate Cybersecurity Caucus held a forum on cyber security threats facing the healthcare industry to further address the issue.22
Some have highlighted the fact that China-based hacking groups have been responsible for, or are strongly implicated in, several of these hacking incidents. For example, researchers at the security firm ThreatConnect found that the technical infrastructure used in the Anthem attack was linked to the Chinese computer security firm Topsec, which has strong links to China’s security establishment.23 In May 2019, a U.S. grand jury indicted two Chinese nationals for hacking Anthem and three other U.S. businesses.24 These individuals remain at large. Additionally, research from security firm FireEye has identified multiple Chinese-linked groups that have hacked medical systems and databases around the world.25 Others have observed how Chinese hackers have attempted to obtain data from clinical trials and scientific research studies, as well as intellectual property involving medical devices.26 For example, in July 2020, a federal grand jury returned an indictment charging two Chinese nationals with hacking into a variety of U.S. computer systems in attempts to acquire COVID-19 research, and in some instances acting on behalf of China’s Ministry of State Security.27 The FBI has also highlighted how China may have access to other sets of large-scale biomedical data through contract work, business partnerships, and research collaborations with hospitals, universities, and biotech companies.28 It is unclear what the specific motives are for these attacks — whether they are purely for economic or industrial gain, or whether they are for the potential creation of new bioweapons or surveillance mechanisms meant to help China gain military advantage.29 For example, China has conducted a massive DNA collection effort of millions of its men and boys, as well as from its ethnic minority Uyghur population, as part of a growing surveillance apparatus.30
What is clear is that China has sought to increase its biotech capability over the past 15 years. The Chinese government has prioritized building up China’s biotech industry in its 11th (2006–2010), 12th (2011–2015), and 13th (2016–2020) Five Year Plans.31 In doing so, Beijing has made a significant effort to acquire knowledge — most notably seen in its controversial Thousand Talents Programs,32 which were created in 2008 to recruit overseas expertise to build up China’s science and technology knowledge and innovation base. Some have noted that this drive to acquire external knowledge stems from China’s desire to “catch up” and become a world leader in science and technology.33 This has led Beijing to pursue strategies that focus on short-term results, rather than on building up its own domestic innovation capabilities.34
The Chinese government has made biomedical big data a national priority, launching a 60 billion yuan ($9.3 billion) precision medicine initiative in 2016 to address growing diseases with genetic links in China’s aging population.35 Chinese biohacking attempts may be an effort to try to get biomedical innovation on the “quick and cheap.” In light of these concerns, U.S. intelligence and law enforcement entities are keen to identify and interdict China’s attempts to pursue an enhanced science and technology capability through the illicit acquisition of various kinds of biomedical data to further its commercial or security ambitions. But key questions remain: Has China actually been able to use these data for economic or security gain? If so, how? How difficult has it been for China to accomplish its goals? And how might one more accurately assess these kinds of questions? The following sections will outline the limitations in existing assessments of Chinese biomedical hacking, as well as provide signposts and a research roadmap for conducting more robust assessments of this threat in the future.
Framing the Chinese Biomedical Hacking Threat
To date, most of the existing law enforcement, intelligence, and policy practitioner discourse about China’s threat related to acquiring U.S. biomedical big data has focused on the discrete pieces of information that are being, have been, or may be passed between the United States and China, including electronic health records, genomic data, and patient behavioral survey data. As noted above, this discourse tends to assume that, once China has accumulated enough of this data, it is only a matter of time before it outpaces the United States and becomes the new science and technology global powerhouse in the biomedical and biotechnology arena — indeed, many argue that this is imminent.36 This rhetoric is similar to past instances, going back to the 1980s, when intelligence and policy officials pointed — wrongly — to how advances in biology and biotechnology would lead to new and growing security threats.37 In these cases, the focus was on access to biological materials (e.g., the smallpox virus, anthrax bacteria, toxins, other pathogens, synthesized DNA), new biological techniques and technologies (e.g., genetic engineering, polymerase chain reaction, synthetic biology, genome editing, “cloud labs”), or the published materials and methods sections of scientific papers.
We have argued in previous papers that this reflects a flawed, technologically deterministic way of thinking about science and technology — one that focuses only on its material aspects and not on the tacit knowledge and other social dimensions of laboratory work that enable science and technology to work in practice in the real world.38 There is other literature that also points to the importance of the socio-technical character and context of technology diffusion in general,39 as well as literature on the importance of socio-technical factors in state-level nuclear weapons development,40 military technology development,41 and the adoption of technology by nonstate actors.42 This collective work, which involves a variety of case studies, demonstrates the importance of considering the social dimensions of science and technology research, and it emphasizes how focusing only on the material aspects of technology acquisition and development is an erroneous way of thinking about what it takes for state or nonstate actors to develop science and technology capabilities. Such misunderstandings of technology have led to numerous scholarly, intelligence, and policy failures in understanding biosecurity threats. This has included the flawed assessments of the Soviet and Iraqi bioweapons programs, the overhyped “bioterrorism threat” since the 1990s, as well as various current biosecurity concerns.
In today’s concern about biomedical big data, we see the focus again being placed on the material — this time biomedical data — with claims being made that mere access to troves of biomedical data poses new and alarming security risks. The current focus of U.S. law enforcement and U.S. policy on these data threats fails to capture the more complex character of what it takes to make biomedical big data work in practice for applied purposes.
There are, however, a variety of bioinformatics and big data researchers who provide an alternative understanding and framework for how to think about these biomedical data threats.43 Their research focuses on the methods and assumptions involved in the use of biomedical big data for the discovery of new drugs and therapies, the socio-technical challenges of extracting knowledge from digital infrastructures, and the implications of choices in data curation for applications in science and technology. This body of work is consistent with several bioinformatics papers that discuss the challenges of working with heterogeneous biomedical big data: It is not a trivial task to harness these data for either useful or nefarious applications.44 There are often errors associated with this data or other data quality issues that require substantial data curation and preparation before they can be used.45 As other researchers have noted, “Data heterogeneity, data protection, analytical flows in analyzing data and the lack of appropriate infrastructures for data storage emerged as critical technical and infrastructural issues that might endanger a Big-Data-driven healthcare.”46 The crux of this scholarship focuses on the challenges involved in creating, transferring, and using data for the production of knowledge that can lead to biomedical and biotechnology applications.
For far too long, the U.S. security community has jumped to assumptions about how easily data or materials (usually related to emerging technologies) can be translated to security threats. Usually, the standard pieces of evidence used to make those judgments are generic references to the technology (or assumed trends of the technology) without relying on rigorous, real-world empirical data and studies of the various social and technical factors involved in shaping the development and use of that data or technology. To better inform decision-makers, we need new research questions and a committed funding stream to support research agendas focused on the socio-technical dimensions of biomedical big data to guide intelligence collection and analysis. This would enable analysts to parse out more carefully how or under what conditions actors can utilize biomedical big data to pose economic or security threats to the United States. The next section summarizes some key bioinformatics scholarship that offers a more nuanced understanding of how to think about biomedical big data. It also provides some useful signposts for how to think about crafting better intelligence and law enforcement assessments of the Chinese threat in this domain.
Biomedical Big Data: Definitions and Challenges
Big data scholar Sabina Leonelli provides useful definitions of data and knowledge and of the relationship between the two.47 She defines data as “mobile pieces of information” that can be collected, stored, and disseminated. These data can be accurate, or not, and they can be used for an applied purpose, or not. Using these data to produce knowledge that is deemed reliable, accurate, and useful for an applied purpose depends on the various people involved in collecting, storing, analyzing, and interpreting the data for some practical use. The value of data does not come from their intrinsic nature. Leonelli describes the value as coming “from their interpretation in relation to specific contexts and goals, rather than as a context-independent quality.”48 Leonelli points out that “[d]ata are not, by themselves, a form of knowledge. Rather, data need to be interpreted in order to yield knowledge.”49 In short, she is focused on the practices and processes for making sense of data. What Leonelli’s work and that of other bioinformatic researchers in this domain usefully document is how difficult these processes are when it comes to working with biomedical big data.50 This is due to one key characteristic and challenge of biomedical data — its heterogeneity — which creates difficulties for data sharing and for the assessment, interpretation, and application of data.
The National Institutes of Health defines biomedical big data as inclusive of the “numerous quantitative and qualitative datasets emanating from fundamental research using model organisms (e.g., mice, fruit flies, zebrafish), clinical studies (including medical images), and observational and epidemiological studies (including data from electronic health records and wearable devices).”51 They can include imaging, phenotypic, epigenetic, genotypic, molecular, clinical, behavioral, environmental, and many other types of biological or medical data, and they can encompass metadata (i.e., data that describe other data), such as the title, abstract, author, and keywords in publications; the organization and relationships of digital materials; file types or modification dates; and the data standards and software tools involved in data processing and analysis.52 Biomedical data can also be qualitative, such as patient medical narratives.
Not only is there a lot of heterogeneity in biomedical big data, there is also heterogeneity in the ways that this type of data is created and stored. The data can be stored using different labels and can consist of varying types and file formats. This depends on the kind of experiment that was conducted, the origins of the data, the way in which the data were collected and generated, and the equipment used to generate them, which can vary significantly from one laboratory or location to the next. The life and biomedical sciences are extremely diverse in their experimental methods, goals, instruments, and conceptual frameworks.53 Often, different research groups — even within the same subfield — disagree over preferred terminology, research organisms, and experimental methods and protocols.54 This extreme diversity is reflected in the various methods used to generate, store, share, and analyze biomedical data,55 meaning that data are not standardized in the life and biomedical sciences. This is true for all current open- or private-source biomedical big data.56 Although there are continual efforts being made to solve these standardization problems, they will continue to pose challenges for data scientists.
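To make this heterogeneity concrete, consider a minimal, hypothetical sketch in Python: two laboratories record the same kind of blood-glucose measurement, but with different field names, units, and date formats. Even this trivial pair of records cannot be pooled without an explicit, hand-built translation step. (All field names and values here are invented for illustration.)

```python
from datetime import datetime

# Hypothetical records of the same measurement from two labs:
# Lab A reports glucose in mg/dL with ISO dates; Lab B reports
# glucose in mmol/L with day/month/year dates and different field names.
lab_a = {"patient": "A-001", "date": "2021-03-15", "glucose_mgdl": 99.0}
lab_b = {"subject": "B-017", "sampled": "15/03/2021", "glu": 5.5}

MGDL_PER_MMOL = 18.02  # approximate unit-conversion factor for glucose

def harmonize_a(rec):
    # Lab A's schema is already close to the target schema.
    return {"id": rec["patient"],
            "date": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
            "glucose_mgdl": rec["glucose_mgdl"]}

def harmonize_b(rec):
    # Lab B needs field renaming, date reparsing, and unit conversion.
    return {"id": rec["subject"],
            "date": datetime.strptime(rec["sampled"], "%d/%m/%Y").date(),
            "glucose_mgdl": rec["glu"] * MGDL_PER_MMOL}

pooled = [harmonize_a(lab_a), harmonize_b(lab_b)]
```

Multiply this by the hundreds of instruments, file formats, and local conventions found across real laboratories, and the translation work becomes a major bottleneck rather than a quick preprocessing step.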
This is a key difference from other types of scientific big data, such as the data coming from the particle physics community (e.g., the European Organization for Nuclear Research). These physical scientists agreed to dedicate significant time, labor, and funding to provide a standardized means of collecting and storing data, and they located the research facilities in just a few places in order to foster collaboration from the outset and share the high costs of running these laboratories.57 This approach has made working with this kind of data much easier, because scientists do not have to do the extra work of translating the data from one setting to another. In contrast, the life science and biomedical communities consist of thousands of laboratories (clinical, academic, private, commercial, and industrial) around the globe that were set up without common standards and that use different kinds of equipment to capture and record data; neither the equipment nor the resulting data are standardized across labs.
Thus, integrating these diverse data for applied knowledge requires significant labor and expertise, often involving the development and application of new tools, standards, methods, and infrastructures, which, in turn, requires a large amount of “conceptual and material scaffolding” to transform the data into something useful.58 This extreme diversity around the world makes the transfer of data from one location (where it was generated) to another for reuse very challenging without understanding how the data were generated, handled, and stored.
Leonelli ties one of the challenges of working with heterogeneous data to the difference between propositional and embodied knowledge — or what we and others before us have referred to as the importance of differentiating between explicit versus tacit knowledge — which is a key analytic construct in the field of science and technology studies.59 Explicit (or propositional) knowledge is information that can be codified, for example, in a written protocol or in a database. Tacit (or embodied) knowledge is hands-on, skills-based knowledge that is difficult to codify or, in some cases, cannot be codified. A laboratory may have a written protocol for how data were generated and may have a particular dataset for a given experiment. However, a whole host of tacit knowledge is involved in making that protocol generate the data that are produced in that laboratory and in being able to understand and work with those data.60 A different laboratory might use a different method for the protocol or have no expertise with the protocol or those data practices, or it might use different equipment, which could yield different data results. There are also different kinds of tacit knowledge related to how data scientists in different laboratories choose to process data. This challenge of tacit knowledge and nonstandardized laboratory practices in the life sciences leads to one of the strengths — and accompanying weaknesses — of trying to glean insights from biomedical big data. If one were able to standardize and integrate all of these different kinds of data, it could lead to major breakthroughs — this is the underlying premise and promise of precision medicine.61 However, the “if” constitutes one of the great challenges that would have to be solved by China or any other state or entity that wishes to capitalize on the troves of current and future biomedical big data, because the process of standardizing and integrating this information is not trivial.
Given the heterogeneity and nonstandardization of biomedical big data, the sharing and integration of data can be a problem. To make the most of biomedical big data, one would need to combine different types of information, such as genomics data, clinical research, behavioral studies, environmental studies, and so forth. As pointed out above, all of these data streams have been captured using different methods, procedures, and formats, usually without the recognition of the need to integrate them with other data sets. Therefore, significant work is required to be able to integrate data sets that were never meant to “talk” to one another.62 For example, many data repositories, particularly in the healthcare field, were designed and built in the pre-big-data era and were made to stand alone and be siloed, with no intention of allowing the data to be combined and analyzed with other data sets.63 In other cases, a single healthcare database may be composed of over 100 different interlinked data systems, all of which have their own ways of collecting and storing information.64 Genomic data pose unique challenges. Researchers are currently facing substantial problems in storing, managing, manipulating, analyzing, and interpreting whole genome sequence data for even relatively small numbers of individuals, especially if they must also take into account data quality information (e.g., errors or biases in the data).65 Moreover, structured data (i.e., those that can be stored in spreadsheets) do not necessarily tell you what you need to know about a particular experiment or biomedical process. This information is often stored as unstructured data, such as in the narrative of a journal article, or is embodied in the heads and hands of practicing scientists. Unstructured data are more difficult to extract but may be critical to making sense of the structured data that one might have acquired. 
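The siloing problem can be illustrated with another minimal, hypothetical Python sketch: clinical records keyed by a hospital’s medical record number and genomic results keyed by a study accession cannot be joined at all until someone curates a crosswalk between the two identifier systems. (The identifiers and fields are invented; real linkage also raises privacy and governance issues beyond this sketch.)

```python
# Hypothetical silos: clinical data keyed by hospital medical record
# number (MRN), genomic results keyed by a separate study accession.
clinical = {"MRN-443": {"age": 61, "diagnosis": "type 2 diabetes"}}
genomic = {"STUDY-0007": {"variant": "rs7903146", "genotype": "TT"}}

# The crosswalk exists in neither silo; data curators must build it
# by hand before any integration is possible.
crosswalk = {"STUDY-0007": "MRN-443"}

integrated = {}
for accession, gdata in genomic.items():
    mrn = crosswalk.get(accession)
    if mrn is None or mrn not in clinical:
        continue  # without curation, unmatched records silently drop out
    integrated[mrn] = {**clinical[mrn], **gdata}
```

The curation step, not the join itself, is where the labor and the risk of silent data loss reside: any record the crosswalk fails to cover simply disappears from the integrated set.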
Lawrence Hunter, a computational biologist, sums up the challenge: “Getting the most from the data requires interpreting them in light of all the relevant prior knowledge.”66 Thus, it is not enough merely to access the data. One must also know how to work with and make sense of them in light of prior data.
The process of curating data — the organization and integration of data — is also a significant issue that needs to be addressed in order to make sense of, or apply, big data. Curation involves several complex tasks, including selecting the data that are to be assimilated into a database; formatting them into a standard that can be read by the available software; classifying them into retrievable categories, to make it possible to “mine” them, according to whichever biological question is being asked; and displaying them in ways that make it possible to spot meaningful patterns.67 How data are curated also shapes the analytic outputs because, in some cases, curation may introduce errors, leading to data that may or may not be based upon the original experiments — or upon reality.68 These errors can be difficult to see and identify if one has only a spreadsheet of data, with no prior documentation or knowledge of how the data were processed.
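One widely reported class of curation error illustrates how silent these problems can be: spreadsheet software has been documented to auto-convert gene symbols such as SEPT2 or MARCH1 into dates during data import. The hypothetical Python sketch below mimics that behavior; the point is that the corrupted values look like perfectly valid data to anyone who sees only the final file.

```python
import re

def naive_autoformat(cell):
    """Mimic a spreadsheet importer that silently parses date-like text."""
    match = re.fullmatch(r"(SEPT|MARCH|DEC)(\d{1,2})", cell)
    if match:
        month = {"SEPT": "Sep", "MARCH": "Mar", "DEC": "Dec"}[match.group(1)]
        return f"{int(match.group(2))}-{month}"  # e.g., "SEPT2" becomes "2-Sep"
    return cell

gene_list = ["TP53", "SEPT2", "MARCH1", "BRCA1"]
curated = [naive_autoformat(g) for g in gene_list]
```

A downstream analyst holding only the curated list has no way to recover the original gene symbols, which is exactly the situation described above: errors that cannot be identified without documentation of how the data were processed.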
Similarly, interpreting or assessing data that have been acquired (whether through licit or illicit means) is also an issue. How does one know if the data are any good or even trustworthy? One would need to know the context in which, the conditions under which, and the purposes for which the data were collected, processed, and stored. This is no different from trusting the conditions under which an experiment was conducted,69 but in this case, it would mean trusting the digital data when one does not have direct access to the laboratory expertise that generated that data. There have been several accounts noting the difficulty of reproducing the results of an experiment involving biomedical big data and digital medicine.70 Some involved the mislabeling of data, administrative errors in the input of patient data, or other errors that can be introduced while cleaning, integrating, and processing the data,71 not to mention the problems that may underpin the experimental and publication process itself.72 In addition, data curators (who are different from the scientists conducting the experiment) may bring their own biases and errors to the labeling and processing of data, which would, in turn, yield unreliable data.73
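The cleaning errors mentioned above illustrate why a recipient without documentation can only flag suspect values, not correct them. A minimal sketch, with hypothetical records and plausibility thresholds: an out-of-range value might be a unit error, a transcription error, or a genuine outlier, and distinguishing these requires knowledge of how the data were produced.

```python
# Minimal sketch of a plausibility check during data cleaning. Without
# documentation of how values were recorded, out-of-range entries can only
# be flagged for review, not corrected. Thresholds are hypothetical.

PLAUSIBLE_WEIGHT_KG = (0.2, 650.0)

def flag_implausible(records):
    """Split records into plausible and suspect, rather than silently 'fixing' them."""
    ok, suspect = [], []
    lo, hi = PLAUSIBLE_WEIGHT_KG
    for rec in records:
        (ok if lo <= rec["weight_kg"] <= hi else suspect).append(rec)
    return ok, suspect

ok, suspect = flag_implausible([
    {"patient_id": "A1", "weight_kg": 70.0},
    {"patient_id": "A2", "weight_kg": 7000.0},  # perhaps grams entered as kg?
])
```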
Moreover, according to Leonelli, “how data are interpreted often changes depending on the skills, background knowledge, and circumstances of the researchers involved.”74 In the area of analysis, there are also numerous examples that caution researchers about the problem of establishing causal links in data — links that are either wrong (due to the aforementioned data acquisition, curation, and processing problems) or of limited value, if one does not also understand the explanatory mechanism of the causal link.75 As a result of these factors, Leonelli writes:
Data can be used to represent various aspects of reality and each interpretation will depend on the specific circumstances of analysis, including the skills and technical premises that allow people and/or algorithms to organize and visualize data in a way that corroborates a certain conceptualization of reality. In other words, the interpretation of data is constantly mediated by the view point and abilities of those using it.76
The key point here is that understanding big data — and we would extend this to understanding the security threats of biomedical big data — is always related to understanding the social context of science and technology: Who collected and curated the data? What are their skills and expertise? How did they collect the data? Under what conditions and in what context did they do so? How did they analyze and interpret the data? How did they store the data? These same questions can also be posed about individuals receiving these data from another laboratory. But these individuals must also answer another question: What is needed to translate the data so that it will “work” in a new context? As the field of science and technology studies has long noted, there are many factors that can shape how data are collected, adopted, and used. For example, adherence to a particular theoretical paradigm shapes the interpretation of what counts as relevant or important scientific data.77 How data are labeled and classified also shapes what can be understood and what counts as valid or important.78 In addition, the production, transfer, and use of scientific and technical data are often messy and unlike their clean and orderly public depiction.79 With these issues in mind, historian of science Ted Porter cautions that the “detachment of data from the concrete conditions of its production is always risky. Data, as it moves, is most often thinned, and what is thinned is necessarily transformed.”80 Bioinformatician David Duffy has summarized the spectrum of challenges of working with biomedical big data when it comes to producing, processing, filtering, reviewing, validating, interpreting, and applying the data — all of these steps can create bottlenecks and are points at which errors could be introduced into the data or analysis.81
To be sure, the U.S. and international bioinformatics communities are working hard to overcome these bottleneck problems. However, another problematic claim that proponents of biomedical big data and precision medicine make is that the availability of more data at all of these levels is always beneficial and makes things easier. A corollary to this premise can be found in the U.S. security community, which argues that a growing abundance of biomedical big data can readily translate into new and dangerous security threats. Yet Duffy cautions that “[m]ore data do not necessarily translate into more knowledge; rather, it can mean an increase in noise.”82 Data scientists are well aware that more data create even more complex data ecosystems to curate, manage, and navigate.83 Whether biomedical big data can translate into the touted benefits described by precision medicine advocates, or into more and varied kinds of security threats, depends on the processes that are used to make sense of such data. We need to get beyond the hype of big data — including biomedical big data — to a more grounded understanding of how state or nonstate actors are able to use these data in practice.84
How to Improve Intelligence Collection and Analysis on Biomedical Big-Data Threats: Focus on the “Data Journey”
With this understanding of biomedical big data in mind, Leonelli discusses the importance of studying what she calls the “data journey,” which is composed of the various social factors, infrastructures, and work involved in data traveling and being used in new contexts — in essence, studying all the ways in which data are produced, transformed, and used to address a given problem.85 This research agenda would involve focusing on the various technologies, materials, infrastructures, people, social settings, and institutions involved in the production and transfer of biomedical data, and the conditions and contexts that make the transfer and use of data more or less stable.86 Focusing on the data journey as the unit of analysis would provide a more realistic understanding of how data could be used for an actor’s economic or security benefit and would help to mitigate errors associated with focusing on the data alone.
To apply this approach to U.S. intelligence and law enforcement would mean going beyond a focus merely on apprehending people who are stealing or hacking data, and instead focusing on the data journey. If China is thought to be the key threat in this area, then U.S. agencies should try to better understand what the data journey would look like in order for China to use a particular biomedical big data set. One way to do this would be by looking at one of the U.S. biomedical data sets that has already been hacked. In a case like the 2015 Anthem hacking incident, this would involve getting a much more finely grained understanding of the data that were hacked, including determining in what forms and formats the data were stored; how Anthem stored its data differently from other entities (i.e., what is specific to Anthem’s data storage, management, and usage and what could be applied to other hacking cases); how exactly this particular data could be used or combined with other data sets for an actor’s economic or security benefit; what was involved in accessing this information; what would be required to use this data; and what challenges or limitations would exist in handling this heterogeneous data.
This kind of data journey inquiry would also require a better understanding of China’s bioinformatics infrastructure. This would include identifying the people and industries that make up China’s bioinformatics community, where they are located, and what their capabilities are, as well as how China has dealt with the problems of building a bioinformatics research community.87 Creating a bioinformatics capacity has also proved highly challenging for other countries, including the United States. Intelligence analysts, law enforcement officials, and researchers will also need to investigate how China’s bioinformatics community of experts is — or is not — overcoming the inherent problems that plague China’s overall science and technology infrastructure.88 Are they able to bring together diverse sets of scientists and other technical experts, such as experimental biologists, clinicians, bioinformaticians, computer scientists, and engineers, to work on bioinformatics problems in innovative ways, rather than just replicate existing work? In addition, it will be important to determine exactly what advances China is making in the bioinformatics domain and what problems they have encountered in using and translating biomedical big data. How have they solved (or not solved) some of the bottleneck problems discussed above, and what problems remain? In addition, government and academic analysts should investigate what government policies and programs, industries, and other institutions have been involved in these efforts and what funding streams are devoted to solving these problems.
Probing these kinds of research questions would provide a more nuanced understanding of what kind of indigenous innovation is occurring within China’s bioinformatics community. As one group of science and technology scholars has written, “[T]he ease of mastering foreign technological knowledge increases with the capability of the country in indigenous innovation. The buying of foreign technology is one thing but then being able to use it fully is another.”89
The above information could be gathered through a variety of open-source means, such as Chinese and international bioinformatics conferences, scientific publications, industry activities, and clinical trials, as well as clandestine means. If academia, intelligence analysts, and law enforcement could collect and analyze data relevant to answering these kinds of research questions, they would acquire a much better understanding of what China’s science and technology capabilities are when it comes to actually using the biomedical big data that they acquire. This kind of data and analysis could serve as a baseline and be updated over time to reflect changes in China’s growth and science and technology development. Answers to these data journey research questions could also feed into testing the existing technology diffusion and adoption models noted earlier in the paper. In particular, such analyses could focus on data, data expertise and skills, and data infrastructures.
These assessments about China could be conducted by the CIA’s Open Source Enterprise within its Directorate of Digital Innovation, as well as within the CIA’s new China Mission Center.90 The CIA has also launched a new CIA Technology Fellows program to bring promising experts to the agency for one to two years of public service. This would be a way for the CIA to collaborate with an interdisciplinary team of China experts, science and technology studies scholars, and bioinformatics researchers to conduct studies on the above research questions. In addition, the State Department’s intelligence arm, the Bureau of Intelligence and Research, could also bring in Jefferson Science Fellows to work on China-focused biomedical big data analyses during their year of public service.91 The National Security Agency-funded Laboratory for Analytic Sciences could also be a site of year-long, focused, unclassified analyses of China and its big data capabilities that could bring together researchers from academia, industry, and the intelligence community.92 In addition, the Intelligence Advanced Research Projects Agency could create new research calls to support larger interdisciplinary academia-industry research teams to study these problems over a multi-year period. Competing analyses could also be conducted by Five Eyes partners, such as the National Cyber Security Centre in the United Kingdom in partnership with the Alan Turing Institute (the United Kingdom’s leading data science research center), as well as Canadian and Australian scholars and intelligence practitioners. The U.S. intelligence community could also commission specific research studies on this topic from academic scholars through its National Intelligence Council Associates Program or its CIA Labs program. Although this paper has focused on China, the same assessments could be conducted of other state or nonstate actors who are attempting to illicitly acquire biomedical big data.
Sociologist of information technology Geoffrey Bowker warns about the dangers of being caught up in what he calls the “information mythology” — a contemporary understanding of information that assumes that “information is everything,” and everything is rendered coherent through the more or less systematic communication of information.93 The threat from China’s acquisition of U.S. biomedical big data has been a key focus of U.S. intelligence and law enforcement. However, the existing literature on biomedical big data and bioinformatics shows that there are major challenges in trying to use biomedical big data for any kind of applied purpose. Therefore, it is far from clear what threats are posed by the acquisition of data alone. We need to do a better job of conducting more complex socio-technical assessments of how China might try to use biomedical big data, as well as studying China’s bioinformatics personnel and technical infrastructure. Just focusing on the data and the people who steal them does not tell us enough about China’s ability to use those data for economic or security purposes. Security concerns about biomedical big data (and China’s role in biomedical data hacking) need to be further studied and scrutinized with more robust empirical evidence in order to better inform U.S. decision-makers about the true nature of China’s economic and national security threats.
Kathleen M. Vogel is interim director and professor at the School for the Future of Innovation in Society at Arizona State University. Her research examines knowledge production on biosecurity and big data issues.
Sonia Ben Ouagrham-Gormley is an associate professor at the George Mason University’s Schar School of Policy and Government. She is affiliated with the Biodefense Program and her research deals with issues at the crossroad between science and technology and security.
This research was funded under the Project on Advanced Systems and Concepts for Countering Weapons of Mass Destruction (PASCC) grant #FA7000-18-1-0011.