Guest post: Data & Statistics Bill should be withdrawn

We liked Stephen Judd’s submission on the Data & Statistics Bill and thought it should be seen by a wider audience (read our submission here).

He has kindly reformatted and edited it for us to work better as an article.

Parliament will shortly consider the Data and Statistics Bill, which is intended to replace the aging Statistics Act. After I read the NZCCL submission, I wanted to take a closer look, and I ended up making a submission of my own. Here’s what I told the select committee.

A bill which seeks to “tidy up” some decades of drift between current practice and its governing legislation is not usually contentious, so the Bill hasn’t received much attention in its early stages. But it deserves that attention.

Taken as a whole, the Bill represents a “land grab” approach to New Zealanders’ data. It will certainly make it easier for officials to obtain whatever data they want for nearly any purpose. However, New Zealanders’ legitimate interests in the privacy and safety of their information are poorly served. I think the Bill should be withdrawn pending genuine consultation with Māori, with privacy advocates, and with the public at large on how best to safeguard those interests.

Unbridled Power

The drafters of the 1975 Statistics Act could not have foreseen the technology available in 2022. At that time, it was just possible to fit about 6,000 transistors on to a single microprocessor. Today that number is around 20 billion. Combined with the ability to network computers, and advances in computing techniques that we typically call artificial intelligence, the computing power available to governments has expanded hugely. This computing power can be used to cross-reference and analyse data in ways that were previously prohibitively expensive.

Even then, at the time the current Statistics Act was passed, Parliament was aware of the potential for data matching and sharing to harm civil liberties, and so, for example, imposed strict limits in the Wanganui Computer Centre Act 1976.

The Wanganui Computer Centre Act required the creation of a policy board to govern it, with non-government representation from the NZ Computer Society and the Law Society. The present Bill is not even as mature as this rudimentary measure from 1976: it vests all discretion in the Statistician, who would only have to account for themselves through summary reports of decisions. It is surprising and concerning to see legislation which doesn’t limit the scope of this power but seeks to expand it further. The New Zealand government cannot stop progress in technology, but it can limit how that technology is applied.

While the Bill sets some limits, mostly located in the powers given to the position of Statistician, they are not enough to deal with the world of 2022 and beyond. In the 21st century, statistical techniques are far more powerful, and data far more dangerous, than in 1975. Therefore an institution such as Statistics NZ needs guardrails beyond hoping for the good intent, good sense and good fortune of its officials.

Intent Is Not Magic

It is important to understand that it’s not necessary to have bad intent to end up with bad outcomes.

Bad outcomes can occur when data collected for one purpose is pressed into service for another. Definitions or classifications used in the original collection, and standards of rigour in collection, rarely align exactly with the needs of the present. And while statisticians will carefully caveat their findings, those caveats are easily lost when findings are turned into policy.

Consider the scenario where protective agencies have records of their engagements. The complete set of factors leading to an individual engagement can’t all be captured as simple quantities in a standardised record. We have to decide what to include, what to exclude, and how to assign things. If those records are later pressed into service to discover factors in social problems, they will not themselves reveal (for example) the attitudes of the agency’s staff to ethnicity and social status, proximity of the case to agency premises, engagements that were not recorded because in some way they didn’t “make the cut”, the basis for assessments of risk factors like alcohol abuse, and so on.

Particularly with older data, it may not be possible to contextualise the data with the approach used in its original collection. But it will be tempting to use it because of the beliefs that more is better and something is better than nothing. This in turn can lead to models that suggest risk factors that are nothing more than artefacts of recording, or worse, driven by the application of earlier erroneous models. This has actually happened. In 2016, Treasury produced modelling of vulnerability factors in the 0-5 age group. Treasury analysts noted the unreliability of their data sources and cautioned:

“…care must be taken in generalising results to the experience of more recent cohorts of children. Cohorts born more recently have had a higher likelihood of being notified to CYF, partly because of administrative changes related to family violence events attended by police.”

“The study has a number of limitations and caveats. The scope of the study is limited by the nature and breadth of the information collected in agencies’ administrative systems”

“Some of the methods are exploratory in nature, and as such the results should be considered as preliminary requiring further testing and development over time.”

Despite this, MSD issued an update on Family Start that same year, saying:

“We used Treasury data on vulnerability to inform the expansion of Family Start. Treasury’s mapping of vulnerability factors for 0-5 year olds has been combined with MSD’s mapping of children who meet the Family Start referral criteria. This analysis has been used to inform investment decisions about the expansion of Family Start.”

(Source documents from Treasury and MSD are no longer available from ministry websites, a problem of a different kind but also deserving of attention.)

MSD here failed to observe a basic principle of computer science, namely: “Garbage In, Garbage Out”.

Anything You Want To Know Is Research

The term “research” is used throughout the Bill but nowhere defined. I worry that any investigation to obtain information could be characterised as research. The aims of “research” in turn are limited only by clause 48’s requirement that the Government Statistician, in deciding whether to share data, “take into account” public benefit. This could become a broad licence to obtain any data set for any purpose.

This broad licence cuts across notions of consent and social licence. It is one thing to provide necessary data in support of a transaction, another to give it up for a specific social good (curing cancer, perhaps), but a step too far to give it away for completely unspecified purposes that might arise in future.

The Statistician is required to “take into account” a range of factors. However, merely taking factors into account does not amount to a limit. Nothing would prevent the Statistician (for example) requiring the location of urupā and wāhi tapu to be shared if that were deemed necessary, as long as they took the factors in Part 5 into account. And recently, many have advocated for collection of data on sexuality and gender, which is undoubtedly useful, yet poses challenges for individuals who may not wish to disclose. Imagine being outed in your home or workplace as you complete a form, or just the fear that can go with putting words in writing. The Statistician may take into account the harms of disclosure or sharing with other agencies and yet do it anyway, perhaps endangering the well-being of some people.

I Want A List Of Their Names

Many citizens may not want to disclose personal information beyond the strictly necessary. “I don’t want to” should of course be a good and sufficient reason in and of itself, but beyond this, a desire for privacy can be founded on cultural, religious or other values, or pragmatic fears about misuse.

As a Jew, I personally do not want to contribute to a registry of where all the Jews live. Members of other minorities might well feel the same. It is true that government services can be targeted more accurately and fairly when more is known about the community, but equally, “good enough” knowledge can be obtained with less compulsion and more consent. The Bill rests on the belief that more knowledge is always better, which is false and dangerous.

Crime Is Its Own Reward

The penalties prescribed for citizens refusing to comply are pretty high for an ordinary person (cl 76). According to a report from ASB late last year, 4 out of 10 New Zealanders have less than $1000 in the bank, compared to a proposed $2000 maximum penalty for not answering Statistics NZ’s questions. Conscientious objection comes at a high cost.

This is in stark contrast to the penalties prescribed for misuse of data for an organisation (or individual). Consider the potentially catastrophic consequences for a person of release of their most intimate data, multiplied across a community when a large data set is leaked. Consider the rapaciousness of unregulated data brokerages offshore and the growing criminal enterprises based on identity theft, blackmail and fraud. The maximum penalty proposed for an organisation disclosing data is only $15,000, even less for an individual.

Statistics NZ has on its own admission had to sanction researchers for breaking rules of use for the IDI, despite the presumably extensive induction of new researchers. Misuse of public sector data by employees continues to be a problem reported in the news. Criminal gangs pay handsomely for illicit data. Section 249 of the Crimes Act, by contrast, provides for multi-year prison sentences for unauthorised access to computer systems, although it says nothing about the distribution of data once obtained. The discrepancy between 7 years in prison (s249 of the Crimes Act) and $15,000 (the maximum penalty for an organisation provided for in the Bill) is profound.

In Fact We Do Know Who You Are

This leads to considerations of security and anonymisation.

Lack of consent, broad scope of collection and weak incentives to comply with disclosure limits might be mitigated by strict management safeguards against identification and release. However, repeated use of the term “reasonable”, coupled with definitions that imply point-in-time rather than regular review, makes this problematic too.

The sad fact is that many reasonable-seeming efforts by public agencies and large corporations to anonymise data sets have proved easy to unravel. Two underlying principles cause this.

As more facts are included in a record – for example date of birth, broad location, and gender – their combination makes it increasingly easy to narrow candidates down to a single person or small group. Research has shown, for example, that “87% of the U.S. population can be uniquely re-identified based on five-digit ZIP code, gender, and date of birth.” While New Zealand postal codes cover broader areas than ZIP codes, the census meshblocks typically used in research here are smaller and would help identification even more.
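This first principle can be sketched in a few lines of Python. The records and field names below are entirely synthetic, chosen only to illustrate how adding attributes shrinks each matching group until every record is unique:

```python
# Sketch: combining quasi-identifiers narrows "anonymous" records to one person.
# All data here is synthetic; the field names are hypothetical.
from collections import Counter

# Pretend "de-identified" records: no names, just attributes.
records = [
    {"dob": "1984-03-07", "gender": "F", "meshblock": "1100900"},
    {"dob": "1984-03-07", "gender": "M", "meshblock": "1100900"},
    {"dob": "1991-11-21", "gender": "F", "meshblock": "1100901"},
    {"dob": "1984-03-07", "gender": "F", "meshblock": "1100902"},
]

def group_sizes(records, keys):
    """Count how many records share each combination of the given attributes."""
    return Counter(tuple(r[k] for k in keys) for r in records)

# Gender alone: large groups, little identifying power.
print(group_sizes(records, ["gender"]))
# Add date of birth and meshblock: every group collapses to a single record.
sizes = group_sizes(records, ["dob", "gender", "meshblock"])
print(all(n == 1 for n in sizes.values()))  # True: each record is now unique
```

With realistic population sizes the effect is the same, only slower to appear: each extra attribute multiplies the number of possible combinations, so group sizes fall fast.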

But data sets can also be combined with existing public data, sometimes in unexpected ways. For example, in 2019 the Australian state of Victoria released an anonymised set of data from Myki public transport cards. Researchers were able to identify other travellers by using knowledge of their own trips to find their records, and then matching those against people known to them who had travelled with them on some of those trips. Innocuous enough, but with what they had learned, they were then able to use a tweet by Victorian MP Anthony Carbines about catching a train to obtain his entire travel history. Similar approaches have been used to deanonymise medical records in the Australian Medicare system, to locate US service personnel and their bases, and by the New York Times to track individuals through supposedly anonymous mobile location data.

The crucial issues here are: 

  • new datasets continue to be made public (deliberately via open data or inadvertently through security breaches) which can then be matched with existing ones
  • institutional imperatives (manifest in this Bill) encourage collecting unnecessary data in case it is useful later, making it easier to find combinations that pinpoint individuals
  • data once circulated cannot be withdrawn – there is always a copy somewhere and incentives to reproduce copies

In the past Statistics NZ have said that they do not “anonymise” data, but rather “de-identify” it. The idea is that the data does not identify individuals because the identifying data (eg names) has been separated and encrypted. However, as shown above, this does not prevent identification if the data set is large enough, rich enough, or able to be combined with other data.
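The weakness of de-identification can be shown with a minimal linkage-attack sketch, again using invented data and hypothetical field names. The names were removed from the first data set, yet a simple join on shared attributes restores them:

```python
# Sketch of a linkage attack: re-attaching names to a "de-identified" dataset
# by joining it to a public one on shared quasi-identifiers. Synthetic data.
deidentified_health = [
    {"dob": "1984-03-07", "meshblock": "1100902", "diagnosis": "diabetes"},
]
public_register = [
    {"name": "A. Example", "dob": "1984-03-07", "meshblock": "1100902"},
]

def link(deidentified, public, keys=("dob", "meshblock")):
    """Match records on shared attributes and copy the name back in."""
    index = {tuple(p[k] for k in keys): p["name"] for p in public}
    return [
        {**r, "name": index[tuple(r[k] for k in keys)]}
        for r in deidentified
        if tuple(r[k] for k in keys) in index
    ]

print(link(deidentified_health, public_register))
# The health data never contained a name, yet the join recovers one.
```

This is essentially what the Myki and Medicare researchers did, with electoral rolls, social media posts and other public sources playing the role of the “public register” here.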

Further, the Bill provides for disclosure to other parties inside and outside government. The further away from Statistics NZ officials data travels, the further it travels from their practices.

It Was Safe When We Checked Last

There is another issue which, combined with the foregoing, poses a serious problem: techniques to deanonymise data continue to be discovered and improved at a rapid pace.

The consequence is that data sets that were once thought to be reasonably anonymised become unreasonably and unpredictably vulnerable over time. Things that seemed reasonable at the time of disclosure will not be reasonable in the future, perhaps even the very near future.

Associate Professor Zeynep Tufekci:

“With the implications of our current data practices unknown, and with future uses of our data unknowable, data storage must move from being the default procedure to a step that is taken only when it is of demonstrable benefit to the user, with explicit consent and with clear warnings…” 

Princeton academics Narayanan, Huey and Felten:

“Once released to the public, data cannot be taken back. As time passes, data analytic techniques improve and additional datasets become public that can reveal information about the original data. It follows that released data will get increasingly vulnerable to re-identification—unless methods with provable privacy properties are used for the data release.”

“New attributes continue to be linked with identities: search queries, social network data, genetic information (without DNA samples from the targeted people), and geolocation data all can permit re-identification, and Acquisti, Gross, and Stutzman have shown that it is possible to determine some people’s interests and Social Security numbers from only a photo of their faces. The realm of potential identifiers will continue to expand, increasing the privacy risks of already released datasets.” (my italics)

While data may not be released outside Statistics NZ or outside the government, risks include over-enthusiastic matching with other data already held within an agency, or data leaks.

The implication is that responsible use of data requires:

  • collecting the least necessary to complete a transaction
  • using the least necessary data to obtain information
  • wherever possible, providing aggregated results (sums, averages, etc) rather than raw individual-level data
  • regular or even continuous review of current holdings and practices to take account of changes in the state of the art

But these precautionary principles are not to be found in the Bill. Far from seeking to constrain the use of data, it rests on an unspoken premise that data has no value unless it is used, and that still greater value can be derived by sharing our data.

For all these reasons, the Bill makes me very uneasy. It will be interesting to see if the select committee feels the same way.

Stephen Judd lives in Christchurch and has been working in IT for 30 years. He is currently on sabbatical.