"The NHS will last as long as there are folk left with the faith to fight for it"
Aneurin Bevan

Monday 5 December 2011

Anonymising Data

Today we hear that The Prime Minister wants to sell off NHS data (sorry, "he wants to make it easier for drug companies to run clinical trials in hospitals and to benefit from the NHS's vast collection of patient data"). This has prompted much howling from people scared that private companies will have access to their medical records. This is not the case (although Lansley wants you to voluntarily hand over your full medical records to private companies, but that is another policy yet to be debated).

It is easy to provide rich, anonymised data: simply remove your name and address. However, this removes important data since your location may be affecting your health. Researchers trying to use the data to look for connections between health and the various factors given in the dataset need to take into account any variations due to location. There are 28 million addresses in the UK. Your postcode covers about 15 addresses, so the combination of your house number and postcode identifies your home.

Your postcode comes in two parts. The first part is called the Outward Code, and the second part is the Inward Code. Each of these can be split into two. The postcode PO1 2AF can be split like this (this is data from the Post Office's lengthy document on postcodes and I have assumed total population is 60 million):

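| | Postcode Area (Outward Code) | Postcode District (Outward Code) | Postcode Sector (Inward Code) | Unit Postcode (Inward Code) |
|---|---|---|---|---|
| Postcode | PO | 1 | 2 | AF |
| Number of areas | 124 | 2,980 | 11,159 | 1.8 million |
| Addresses | 226,000 | 9,400 | 2,500 | 15 |
| People | 484,000 | 20,000 | 5,400 | 33 |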
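
The address and people figures are averages: the totals divided by the number of codes at each level. A minimal sketch of that arithmetic, assuming the totals given in the post (28 million addresses, 60 million people); the names `CODES_PER_LEVEL` and `average_coverage` are mine, for illustration only:

```python
# Totals assumed in the post: 28 million UK addresses, 60 million people.
TOTAL_ADDRESSES = 28_000_000
TOTAL_PEOPLE = 60_000_000

# Number of distinct codes at each level, as quoted in the table.
CODES_PER_LEVEL = {
    "Postcode Area": 124,
    "Postcode District": 2_980,
    "Postcode Sector": 11_159,
    "Unit Postcode": 1_800_000,
}


def average_coverage(total: int, codes: int) -> float:
    """Average number of addresses (or people) sharing one code."""
    return total / codes


for level, codes in CODES_PER_LEVEL.items():
    addresses = average_coverage(TOTAL_ADDRESSES, codes)
    people = average_coverage(TOTAL_PEOPLE, codes)
    print(f"{level:18} {addresses:>12,.0f} addresses {people:>12,.0f} people")
```

The unrounded results differ slightly from the table (about 15.6 addresses per Unit Postcode, for example) because the table uses round numbers. These are averages only; real postcodes vary a good deal in size.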

The point of this table is that it shows that by giving only part of the postcode you can keep the data localised while still anonymising it. If the entire postcode is provided then there is a good chance the patient can be identified. However, such data would be too granular to be useful for data mining anyway. Data grouped by Postcode Sector (giving about 11,000 unique locations) or Postcode District (giving about 3,000 unique locations) is much more manageable.
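
To show what "giving some of the postcode" might look like in practice, here is a rough Python sketch that splits a postcode into the four parts described above and truncates it to a chosen level. The function names `split_postcode` and `generalise` are hypothetical, and the pattern is a simplified assumption about the usual postcode layout (one or two letters, a district, then a three-character Inward Code), not the Post Office's official validation rules:

```python
import re

# Levels ordered from coarsest to finest, following the table in the post.
LEVELS = ("area", "district", "sector", "unit")


def split_postcode(postcode: str) -> dict:
    """Split a UK postcode into area, district, sector and unit parts."""
    compact = re.sub(r"\s+", "", postcode).upper()
    # Assumption: the Inward Code is the last three characters (digit + two letters).
    outward, inward = compact[:-3], compact[-3:]
    match = re.fullmatch(r"([A-Z]{1,2})(\d[A-Z\d]?)", outward)
    if match is None or not re.fullmatch(r"\d[A-Z]{2}", inward):
        raise ValueError(f"not a recognisable postcode: {postcode!r}")
    return {
        "area": match.group(1),
        "district": match.group(2),
        "sector": inward[0],
        "unit": inward[1:],
    }


def generalise(postcode: str, level: str) -> str:
    """Return the postcode truncated to the given level of detail."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    parts = split_postcode(postcode)
    text = parts["area"]
    if level == "area":
        return text
    text += parts["district"]
    if level == "district":
        return text
    text += " " + parts["sector"]
    if level == "sector":
        return text
    return text + parts["unit"]


for level in LEVELS:
    print(level, generalise("PO1 2AF", level))
```

A dataset released at sector level would store `generalise(postcode, "sector")` in place of the full postcode and drop the house number altogether.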

Let's imagine that a celebrity actress has just had an operation on her foot and is on crutches. She has recently been on Strictly Come Dancing so the tabloid paparazzi think it is newsworthy to have a picture of her on crutches. The paps obtain a spreadsheet of all the people who have had foot operations in the last month. If the spreadsheet has the actress's house number and postcode, they can simply park outside her house and take pictures of anything that moves within. If they have just her postcode, they have a one in fifteen chance of getting the right address and most likely will knock on one door in that postcode and ask which house the actress lives in; eventually they will find the right house.

If the data has the Postcode Sector, it means that it covers 2,500 addresses (or about 5,400 people). It is not feasible for the pap to visit all the streets in the Postcode Sector on the off chance that they may be able to see an actress on crutches. However, 5,400 people is around the size of the patient list of a GP practice, so if the paparazzi loiter by the GP practice that covers that area they are bound to be able to take pictures of someone on crutches with a bandaged foot. There is a chance that the patient will be the actress. The more "innovative" paps will realise that from the Postcode Sector they can identify the community health team; they can then find out who the physiotherapists are and try to extract the address from them. Incidentally, the Postcode Sector is also roughly the size of an electoral ward (although the population of electoral wards varies widely across the country).

For location information to be epidemiologically useful it needs some granularity, and the larger the area, the less useful the location data will be. If the data has the Postcode District this will cover about 20,000 people (9,400 addresses), which is a small town. Since there are 250 NHS Trusts (and about 500 hospitals) in England, the Postcode District will identify the hospital where the operation was carried out (and most likely where the follow-up outpatient appointments will be). The paps could wait outside outpatients on the day that the follow-ups for foot operations are booked and hope the actress turns up. If the data has just the Postcode Area, then the paps cannot even identify the trust since there would be two, or (in cities) more, trusts covering that area. However, it is likely that such data would have the trust identifying code, or the hospital identifying code.

The privacy concerns about NHS data being handed to private companies are unfounded. It is easy to anonymise data while still providing enough granularity. However, this is not my complaint against Cameron's decision. I will explain why I am not in favour of this policy in my next blog.

3 comments:

  1. "Any data that includes a whole postcode will pinpoint you."

    No it won't. My whole postcode is the same for the other six houses in my street. The only way to pinpoint me is by combining it with the house number. Which is what you seemed to be arguing in the previous sentence. Maybe the quoted sentence should read "will not pinpoint you".

    But regardless of the postcode debate, it won't defeat my argument that I have reservations about security and most certainly refuse to let my data be passed on to animal testing labs.

  2. I haven't read the detail yet but it's more likely that postcode would be removed completely and replaced with lower or middle Super Output Areas (census geographical areas). These are useful in that other measures such as deprivation can be attributed to them, and they allow quite detailed mapping without disclosing a pinpointed location. I suspect data released will not be as timely as monthly either. I work in a PCT and hospital data is at least 6-8 weeks old before we get it.

  3. @Richard, thanks, that sentence was out of place (considering the previous one) so I've deleted it.

    @InvisibleWoman, my understanding of the Middle SOA areas is that they are around 7k so are equivalent to the Postcode Sector.
