DMA compliance: challenges in Search query data anonymisation

DMA & data: real risk of over-sharing of identifiable, personal search query data. The European Commission has launched a public consultation on the measures it intends to impose on Google for compliance with Article 6(11) of the Digital Markets Act (DMA) – a specific obligation to anonymise search query data and share it with other search providers.

The Commission tries to set out what is required by law in terms of anonymisation and data sharing. Unfortunately, in my view, the measures raise key concerns on how to get to a workable degree of anonymisation.

For instance, the Commission says that the following must be removed from any search query data: “Any end user identifier, such as Google account identifiers (e.g., account IDs, usernames), whether the user was signed in, and other associated direct identifiers (e.g., IP addresses, device IDs)”.

That’s obviously good: by removing clear identifiers, identification becomes harder for another search provider.

But the Commission then seems to take shortcuts, as if there were a magic wand to remove personal data or elements enabling identification.

For instance, it requires Alphabet/Google to “split the original query text of each search record into entities” and to “use personal data detectors to detect addresses, names and other forms of well-known identifiers and resolve them to standard formats”.

It gives the following example:

the query “john doe 200 wetstraat brussel 04 12 34 56 78 communications department” could be split into “john doe” (flagged by the full name detector), “200 wetstraat brussel” (flagged by the address detector), “04 12 34 56 78” (flagged by a phone number detector), “communications”, and “department” (single words).

Unfortunately:

– Names are complex to detect properly. Across different languages, countries and cultures, names can take vastly different forms. “Firstname Lastname” as a pattern creates huge numbers of false negatives (= names *not* flagged as a name) when you have names like “Charles de Gaulle”, “LeBron James”, “Douglas MacArthur”, “Marie Skłodowska-Curie” or “J. Robert Oppenheimer”. Make it more flexible, and you end up with more false positives (= non-names flagged as names).

– With the phone number detector, false positives & false negatives will also be frequent. It’s easier if someone uses an international number with the “+” sign, but whether something is a phone number or another string of numbers (e.g. a company number) is very context-dependent.
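To make both points concrete, here is a minimal Python sketch. It is not the Commission's or Google's detector: the patterns `NAIVE_NAME` and `LOOSE_NAME`, the helper `looks_like_phone` and the dotted 10-digit string are all made up for illustration. It only shows how a strict pattern misses real names, how a loose one flags ordinary queries, and how a digit-count rule cannot tell a phone number from another number.

```python
import re

# Hypothetical strict detector: exactly two capitalised ASCII words.
NAIVE_NAME = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")

# Hypothetical loose detector: two to four words made of letters,
# dots, hyphens or apostrophes, in any case (queries are often lowercase).
LOOSE_NAME = re.compile(r"\w[\w.\-']*( \w[\w.\-']*){1,3}")

names = [
    "John Doe",
    "Charles de Gaulle",
    "LeBron James",
    "Douglas MacArthur",
    "Marie Skłodowska-Curie",
    "J. Robert Oppenheimer",
]
non_names = ["communications department", "cheap flights brussels"]

# False negatives: real names the strict pattern does not flag.
missed = [n for n in names if not NAIVE_NAME.fullmatch(n)]

# False positives: ordinary queries the loose pattern flags as names.
wrongly_flagged = [q for q in non_names if LOOSE_NAME.fullmatch(q)]


def looks_like_phone(text: str) -> bool:
    """Hypothetical rule: 9 to 12 digits once separators are removed."""
    digits = re.sub(r"\D", "", text)
    return 9 <= len(digits) <= 12


print("missed names:", missed)
print("non-names flagged:", wrongly_flagged)
# The phone number from the Commission's example and a made-up dotted
# 10-digit string (the shape of a company-style number) look the same.
print(looks_like_phone("04 12 34 56 78"), looks_like_phone("0123.456.789"))
```

In this toy setup the strict pattern only catches “John Doe”, while the loose one catches every name and also both non-name queries. Real detectors are more sophisticated, but they face the same trade-off.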

Plus, requiring timestamp + location to be shared, as well as query sequences, heightens the risk of re-identification.
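A rough way to see why this matters is to count how many records are alone in their timestamp and location combination. The Python sketch below uses invented records and an assumed granularity (hour and city); the draft measures may use a different one, so read it as an illustration of the mechanism only.

```python
from collections import Counter

# Invented records: (hour bucket, coarse location, query text).
records = [
    ("2025-01-06T09", "Brussels", "train strike"),
    ("2025-01-06T09", "Brussels", "weather"),
    ("2025-01-06T09", "Antwerp", "dentist near me"),
    ("2025-01-06T10", "Brussels", "divorce lawyer"),
    ("2025-01-06T10", "Antwerp", "news"),
    ("2025-01-06T10", "Antwerp", "football"),
]

# Count how many records share each (time, location) combination.
combos = Counter((hour, place) for hour, place, _ in records)

# A record whose combination occurs only once can be singled out by anyone
# who knows roughly when and where a given person was searching.
singled_out = [r for r in records if combos[(r[0], r[1])] == 1]

for hour, place, query in singled_out:
    print(hour, place, query)
```

Here two of the six records stand alone, even though no name, account or IP address is present. Finer timestamps or locations, or linked query sequences, make such singletons more common.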

Is it a minor issue? No: at the volume of search data at stake here, even a small error rate means a very large absolute number of queries shared with identifying details still in them.
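A back-of-the-envelope calculation shows the effect. Every figure in this Python sketch is an assumption chosen for easy arithmetic, not an actual volume or measured error rate, and the function name `expected_leaks` is made up.

```python
def expected_leaks(records: int, personal_per_10k: int, missed_per_10k: int) -> int:
    """Records that contain personal data AND slip past the detectors.

    Rates are expressed per 10,000 so the arithmetic stays in integers.
    """
    return records * personal_per_10k * missed_per_10k // (10_000 * 10_000)


# Assumptions for illustration only:
#   1 billion shared records,
#   1% of them contain a name, address or phone number (100 per 10,000),
#   detectors miss 1% of those (100 per 10,000).
leaks = expected_leaks(1_000_000_000, 100, 100)
print(f"{leaks:,} records shared with personal data intact")  # 100,000
```

Under these assumptions, a detector that is right 99% of the time still lets 100,000 identifying records through.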

De-identification isn’t easy, but if this becomes the standard, it could have broader data protection implications. See also analysis by Lukasz Olejnik, in which he warns of privacy and national security risks (link in comments).

More thoughts soon!


🫖

Did this analysis get you thinking? Reach out!

DataLaws.net is entirely open-access, and instead of getting your data in exchange for this content, how about another trade? If this commentary saved you research time or sparked an idea, feel free to invite me over for tea, chai or a hot chocolate next time you are around Brussels or Antwerp - or invite me over to your offices for a chat!
