Select date

May 2024
Mon Tue Wed Thu Fri Sat Sun

Behind the One-Way Mirror: A Deep Dive Into the Technology of Corporate Surveillance

28-12-2019 < Blacklisted News 18 7597 words
 

A thumbnail of the pdf download


By Bennett Cyphers and Gennie Gebhart


Introduction


Trackers are hiding in nearly every corner of today’s Internet, which is to say nearly every corner of modern life. The average web page shares data with dozens of third-parties. The average mobile app does the same, and many apps collect highly sensitive information like location and call records even when they’re not in use. Tracking also reaches into the physical world. Shopping centers use automatic license-plate readers to track traffic through their parking lots, then share that data with law enforcement. Businesses, concert organizers, and political campaigns use Bluetooth and WiFi beacons to perform passive monitoring of people in their area. Retail stores use face recognition to identify customers, screen for theft, and deliver targeted ads.


The tech companies, data brokers, and advertisers behind this surveillance, and the technology that drives it, are largely invisible to the average user. Corporations have built a hall of one-way mirrors: from the inside, you can see only apps, web pages, ads, and yourself reflected by social media. But in the shadows behind the glass, trackers quietly take notes on nearly everything you do. These trackers are not omniscient, but they are widespread and indiscriminate. The data they collect and derive is not perfect, but it is nevertheless extremely sensitive.


This paper will focus on corporate “third-party” tracking: the collection of personal information by companies that users don’t intend to interact with. It will shed light on the technical methods and business practices behind third-party tracking. For journalists, policy makers, and concerned consumers, we hope this paper will demystify the fundamentals of third-party tracking, explain the scope of the problem, and suggest ways for users and legislation to fight back against the status quo.


Part 1 breaks down “identifiers,” or the pieces of information that trackers use to keep track of who is who on the web, on mobile devices, and in the physical world. Identifiers let trackers link behavioral data to real people.


Part 2 describes the techniques that companies use to collect those identifiers and other information. It also explores how the biggest trackers convince other businesses to help them build surveillance networks.


Part 3 goes into more detail about how and why disparate actors share information with each other. Not every tracker engages in every kind of tracking. Instead, a fragmented web of companies collect data in different contexts, then share or sell it in order to achieve specific goals.


Finally, Part 4 lays out actions consumers and policy makers can take to fight back. To start, consumers can change their tools and behaviors to block tracking on their devices. Policy makers must adopt comprehensive privacy laws to rein in third-party tracking.


Contents


Introduction
      First-party vs. third-party tracking
      What do they know?
Part 1: Whose Data is it Anyway: How Do Trackers Tie Data to People?
      Identifiers on the Web
      Identifiers on mobile devices
      Real-world identifiers
      Linking identifiers over time
Part 2: From bits to Big Data: What do tracking networks look like?
      Tracking in software: Websites and Apps
      Passive, real-world tracking
      Tracking and corporate power
Part 3: Data sharing: Targeting, brokers, and real-time bidding
      Real-time bidding
      Group targeting and look-alike audiences
      Data brokers
      Data consumers
Part 4: Fighting back
      On the web
      On mobile phones
      IRL
      In the legislature


First-party vs. third-party tracking


The biggest companies on the Internet collect vast amounts of data when people use their services. Facebook knows who your friends are, what you “Like,” and what kinds of content you read on your newsfeed. Google knows what you search for and where you go when you’re navigating with Google Maps. Amazon knows what you shop for and what you buy.


The data that these companies collect through their own products and services is called “first-party data.” This information can be extremely sensitive, and companies have a long track record of mishandling it. First-party data is sometimes collected as part of an implicit or explicit contract: choose to use our service, and you agree to let us use the data we collect while you do. More users are coming to understand that for many free services, they are the product, even if they don’t like it.


However, companies collect just as much personal information, if not more, about people who aren’t using their services. For example, Facebook collects information about users of other websites and apps with its invisible “conversion pixels.” Likewise, Google uses location data to track user visits to brick and mortar stores. And thousand of other data brokers, advertisers, and other trackers lurk in the background of our day-to-day web browsing and device use. This is known as “third-party tracking.” Third-party tracking is much harder to identify without a trained eye, and it’s nearly impossible to avoid completely.


What do they know?


Many consumers are familiar with the most blatant privacy-invasive potential of their devices. Every smartphone is a pocket-sized GPS tracker, constantly broadcasting its location to parties unknown via the Internet. Internet-connected devices with cameras and microphones carry the inherent risk of conversion into silent wiretaps. And the risks are real: location data has been badly abused in the past. Amazon and Google have both allowed employees to listen to audio recorded by their in-home listening devices, Alexa and Home. And front-facing laptop cameras have been used by schools to spy on students in their homes.


But these better known surveillance channels are not the most common, or even necessarily the most threatening to our privacy. Even though we spend many of our waking hours in view of our devices’ Internet-connected cameras, it’s exceedingly rare for them to record anything without a user’s express intent. And to avoid violating federal and state wiretapping laws, tech companies typically refrain from secretly listening in on users’ conversations. As the rest of this paper will show, trackers learn more than enough from thousands of less dramatic sources of data. The unsettling truth is that although Facebook doesn’t listen to you through your phone, that’s just because it doesn’t need to.


The most prevalent threat to our privacy is the slow, steady, relentless accumulation of relatively mundane data points about how we live our lives. This includes things like browsing history, app usage, purchases, and geolocation data. These humble parts can be combined into an exceptionally revealing whole. Trackers assemble data about our clicks, impressions, taps, and movement into sprawling behavioral profiles, which can reveal political affiliation, religious belief, sexual identity and activity, race and ethnicity, education level, income bracket, purchasing habits, and physical and mental health.


Despite the abundance of personal information they collect, tracking companies frequently use this data to derive conclusions that are inaccurate or wrong. Behavioral advertising is the practice of using data about a user’s behavior to predict what they like, how they think, and what they are likely to buy, and it drives much of the third-party tracking industry. While behavioral advertisers sometimes have access to precise information, they often deal in sweeping generalizations and “better than nothing” statistical guesses. Users see the results when both uncannily accurate and laughably off-target advertisements follow them around the web. Across the marketing industry, trackers use petabytes of personal data to power digital tea reading. Whether trackers’ inferences are correct or not, the data they collect represents a disproportionate invasion of privacy, and the decisions they make based on that data can cause concrete harm.


Part 1: Whose Data is it Anyway: How Do Trackers Tie Data to People?


Most third-party tracking is designed to build profiles of real people. That means every time a tracker collects a piece of information, it needs an identifier—something it can use to tie that information to a particular person. Sometimes a tracker does so indirectly: by correlating collected data with a particular device or browser, which might in turn later be correlated to one person or perhaps a small group of people like a household.


To keep track of who is who, trackers need identifiers that are unique, persistent, and available. In other words, a tracker is looking for information (1) that points only to you or your device, (2) that won’t change, and (3) that it has easy access to. Some potential identifiers fit all three of these requirements, but trackers can still make use of an identifier that checks only two of these three boxes. And trackers can combine multiple weak identifiers to create a single, strong one.


An identifier that checks all three boxes might be a name, an email, or a phone number. It might also be a “name” that the tracker itself gives you, like “af64a09c2” or “921972136.1561665654”. What matters most to the tracker is that the identifier points to you and only you. Over time, it can build a rich enough profile about the person known as “af64a09c2”—where they live, what they read, what they buy—that a conventional name is not necessary. Trackers can use artificial identifiers, like cookies and mobile ad IDs, to reach users with targeted messaging. And data that isn’t tied to a real name is no less sensitive: “anonymous” profiles of personal information can nearly always be linked back to real people.


Some types of identifiers, like cookies, are features built into the tech that we use. Others, like browser fingerprints, emerge from the way those technologies work. This section will break down how trackers on the web and in mobile apps are able to identify and attribute data points.


This section will describe a representative sample of identifiers that third-party trackers can use. It is not meant to be exhaustive; there are more ways for trackers to identify users than we can hope to cover, and new identifiers will emerge as technology evolves. The tables below give a brief overview of how unique, persistent, and available each type of identifier is.







































Web Identifiers Unique Persistent Available
Cookies Yes Until user deletes In some browsers without tracking protection
IP address Yes On the same network, may persist for weeks or months Always
TLS state Yes For up to one week In most browsers
Local storage super cookie Yes Until user deletes Only in third-party IFrames; can be blocked by tracker blockers
Browser fingerprint Only on certain browsers Yes Almost always; usually requires JavaScript access, sometimes blocked by tracker blockers

































Phone Identifiers Unique Persistent Available
Phone number Yes Until user changes Readily available from data brokers; only visible to apps with special permissions
IMSI and IMEI number Yes Yes Only visible to apps with special permissions
Advertising ID Yes Until user resets Yes, to all apps
MAC address Yes Yes To apps: only with special permissions

To passive trackers: visible unless OS performs randomization or device is in airplane mode






























Other Identifiers


Unique Persistent Available
License plate Yes Yes Yes
Face print Yes Yes Yes
Credit card number Yes Yes, for months or years To any companies involved in payment processing

 


Identifiers on the Web


Browsers are the primary way most people interact with the Web. Each time you visit a website, code on that site may cause your browser to make dozens or even hundreds of requests to hidden third parties. Each request contains several pieces of information that can be used to track you.


Anatomy of a Request


Almost every piece of data transmitted between your browser and the servers of the websites you interact with occurs in the form of an HTTP request. Basically, your browser asks a web server for content by sending it a particular URL. The web server can respond with content, like text or an image, or with a simple acknowledgement that it received your request. It can also respond with a cookie, which can contain a unique identifier for tracking purposes.


Each website you visit kicks off dozens or hundreds of different requests. The URL you see in the address bar of your browser is the address for the first request, but hundreds of other requests are made in the background. These requests can be used for loading images, code, and styles, or simply for sharing data.




A diagram depicting the various parts of a URL

Parts of a URL. The domain tells your computer where to send the request, while the path and parameters carry information that may be interpreted by the receiving server however it wants.





The URL itself contains a few different pieces of information. First is the domain, like “nytimes.com”. This tells your browser which server to connect to. Next is the path, a string at the end of the domain like “/section/world.html”. The server at nytimes.com chooses how to interpret the path, but it usually specifies a piece of content to serve—in this case, the world news section. Finally, some URLs have parameters at the end in the form of “?key1=value1&key2=value2”. The parameters usually carry extra information about the request, including queries made by the user, context about the page, and tracking identifiers.




A computer sending a single request to a website at "eff.org."

The path of a request. After it leaves your machine, the request is redirected by your router to your ISP, which sends it through a series of intermediary routing stations in “the Internet.” Finally, it arrives at the server specified by the domain, which can decide how (or if) to respond.





The URL isn’t all that gets sent to the server. There are also HTTP headers, which contain extra information about the request like your device’s language and security settings, the “referring” URL, and cookies. For example, the User-Agent header identifies your browser type, version, and operating system. There’s also lower-level information about the connection, including IP address and shared encryption state. Some requests contain even more configurable information in the form of POST data. POST requests are a way for websites to share chunks of data that are too large or unwieldy to fit in a URL. They can contain just about anything.


Some of this information, like the URL and POST data, is specifically tailored for each individual request; other parts, like your IP address and any cookies, are sent automatically by your machine. Almost all of it can be used for tracking.




A URL bar and the data that’s sent along with a website request.

Data included with a background request. In the image, although the user has navigated to fafsa.gov, the page triggers a third-party request to facebook.com in the background. The URL isn’t the only information that gets sent to the receiving server; HTTP Headers contain information like your User Agent string and cookies, and POST data can contain anything that the server wants.





The animation immediately above contains data we collected directly from a normal version of Firefox. If you want to check it out for yourself, you can. All major browsers have an “inspector” or “developer” mode which allows users to see what’s going on behind the scenes, including all requests coming from a particular tab. In Chrome and Firefox, you can access this interface with Crtl+ Shift+ I (or ⌘+ Shift+ I on Mac). The “Network” tab has a log of all the requests made by a particular page, and you can click on each one to see where it’s going and what information it contains.


Identifiers shared automatically


Some identifiable information is shared automatically along with each request. This is either by necessity—as with IP addresses, which are required by the underlying protocols that power the Internet—or by design—as with cookies. Trackers don’t need to do anything more than trigger a request, any request, in order to collect the information described here.




//website.com. This is shown as a HTTP request, processed by a first-party server, and delivering the requested content. A separate red line shows that the HTTP request is also forwarded to a third-party server, given an assigned ID, and a tracking cookie that is included in the requested content.

Each time you visit a website by typing in a URL or clicking on a link, your computer makes a request to that website’s server (the “first party”). It may also make dozens or hundreds of requests to other servers, many of which may be able to track you.





Cookies

The most common tool for third-party tracking is the HTTP cookie. A cookie is a small piece of text that is stored in your browser, associated with a particular domain. Cookies were invented to help website owners determine whether a user had visited their site before, which makes them ideal for behavioral tracking. Here’s how they work.


The first time your browser makes a request to a domain (like www.facebook.com), the server can attach a Set-Cookie header to its reply. This will tell your browser to store whatever value the website wants—for example, `c_user:"100026095248544"` (an actual Facebook cookie taken from the author’s browser). Then, every time your browser makes a request to www.facebook.com in the future, it sends along the cookie that was set earlier. That way, every time Facebook gets a request, it knows which individual user or device it’s coming from.




//website.com. The server responds with website content and a cookie.

The first time a browser makes a request to a new server, the server can reply with a “Set-Cookie” header that stores a tracking cookie in the browser.





Not every cookie is a tracker. Cookies are also the reason that you don’t have to log in every single time you visit a website, as well as the reason your cart doesn’t empty if you leave a website in the middle of shopping. Cookies are just a means of sharing information from your browser to the website you are visiting. However, they are designed to be able to carry tracking information, and third-party tracking is their most notorious use.


Luckily, users can exercise a good deal of control over how their browsers handle cookies. Every major browser has an optional setting to disable third-party cookies (though it is usually turned off by default.) In addition, Safari and Firefox have recently started restricting access to third-party cookies for domains they deem to be trackers. As a result of this “cat and mouse game” between trackers and methods to block them, third-party trackers are beginning to shift away from relying solely on cookies to identify users, and are evolving to rely on other identifiers.


Cookies are always unique, and they normally persist until a user manually clears them. Cookies are always available to trackers in unmodified versions of Chrome, but third-party cookies are no longer available to many trackers in Safari and Firefox. Users can always block cookies themselves with browser extensions.


IP Address

Each request you make over the Internet contains your IP address, a temporary identifier that’s unique to your device. Although it is unique, it is not necessarily persistent: your IP address changes every time you move to a new network (e.g., from home to work to a coffee shop). Thanks to the way IP addresses work, it may change even if you stay connected to the same network.


There are two types of IP addresses in widespread use, known as IPv4 and IPv6. IPv4 is a technology that predates the Web by a decade. It was designed for an Internet used by just a few hundred institutions, and there are only around 4 billion IPV4 addresses in the world to serve over 22 billion connected devices today. Even so, over 70% of Internet traffic still uses IPv4.


As a result, IPv4 addresses used by consumer devices are constantly being reassigned. When a device connects to the Internet, its internet service provider (ISP) gives it a “lease” on an IPv4 address. This lets the device use a single address for a few hours or a few days. When the lease is up, the ISP can decide to extend the lease or grant it a new IP. If a device remains on the same network for extended periods of time, its IP may change every few hours -- or it may not change for months.


IPv6 addresses don’t have the same scarcity problem. They do not need to change, but thanks to a privacy-preserving extension to the technical standard, most devices generate a new, random IPv6 address every few hours or days. This means that IPv6 addresses may be used for short-term tracking or to link other identifiers, but cannot be used as standalone long-term identifiers.


IP addresses are not perfect identifiers on their own, but with enough data, trackers can use them to create long-term profiles of users, including mapping relationships between devices. You can hide your IP address from third-party trackers by using a trusted VPN or the Tor browser.


IP addresses are always unique, and always available to trackers unless a user connects through a VPN or Tor. Neither IPv4 nor IPv6 addresses are guaranteed to persist for longer than a few days, although IPv4 addresses may persist for several months.


TLS State

Today, most traffic on the web is encrypted using Transport Layer Security, or TLS. Any time you connect to a URL that starts with “https://” you’re connecting using TLS. This is a very good thing. The encrypted connection that TLS and HTTPS provide prevents ISPs, hackers, and governments from spying on web traffic, and it ensures that data isn’t being intercepted or modified on the way to its destination.


However, it also opens up new ways for trackers to identify users. TLS session IDs and session tickets are cryptographic identifiers that help speed up encrypted connections. When you connect to a server over HTTPS, your browser starts a new TLS session with the server.


The session setup involves some expensive cryptographic legwork, so servers don’t like to do it more often than they have to. Instead of performing a full cryptographic “handshake” between the server and your browser every time you reconnect, the server can send your browser a session ticket that encodes some of the shared encryption state. The next time you connect to the same server, your browser sends the session ticket, allowing both parties to skip the handshake. The only problem with this is that the session ticket can be exploited by trackers as a unique identifier.


TLS session tracking was only brought to the public’s attention recently in an academic paper, and it’s not clear how widespread its use is in the wild.


Like IP addresses, session tickets are always unique. They are available unless the user’s browser is configured to reject them, as Tor is. Server operators can usually configure session tickets to persist for up to a week, but browsers do reset them after a while.


Identifiers created by trackers


Sometimes, web-based trackers want to use identifiers beyond just IP addresses (which are unreliable and not persistent), cookies (which a user can clear or block), or TLS state (which expires within hours or days). To do so, trackers need to put in a little more effort. They can use JavaScript to save and load data in local storage or perform browser fingerprinting.


Local storage “cookies” and IFrames

Local storage is a way for websites to store data in a browser for long periods of time. Local storage can help a web-based text editor save your settings, or allow an online game to save your progress. Like cookies, local storage allows third-party trackers to create and save unique identifiers in your browser.


Also like cookies, data in local storage is associated with a specific domain. This means if example.com sets a value in your browser, only example.com web pages and example.com’s IFrames can access it. An IFrame is like a small web page within a web page. Inside an IFrame, a third-party domain can do almost everything a first-party domain can do. For example, embedded YouTube videos are built using IFrames; every time you see a YouTube video on a site other than YouTube, it’s running inside a small page-within-a-page. For the most part, your browser treats the YouTube IFrame like a full-fledged web page, giving it permission to read and write to YouTube’s local storage. Sure enough, YouTube uses that storage to save a unique “device identifier” and track users on any page with an embedded video.


Local storage “cookies” are unique, and they persist until a user manually clears their browser storage. They are only available to trackers which are able to run JavaScript code inside a third-party IFrame. Not all cookie-blocking measures take local storage cookies into account, so local storage cookies may sometimes be available to trackers for which normal cookie access is blocked.


Fingerprinting

Browser fingerprinting is one of the most complex and insidious forms of web-based tracking. A browser fingerprint consists of one or more attributes that, on their own or when combined, uniquely identify an individual browser on an individual device. Usually, the data that go into a fingerprint are things that the browser can’t help exposing, because they’re just part of the way it interacts with the web. These include information sent along with the request made every time the browser visits a site, along with attributes that can be discovered by running JavaScript on the page. Examples include the resolution of your screen, the specific version of software you have installed, and your time zone. Any information that your browser exposes to the websites you visit can be used to help assemble a browser fingerprint. You can get a sense of your own browser’s fingerprint with EFF’s Panopticlick project.


The reliability of fingerprinting is a topic of active research, and must be measured against the backdrop of ever-evolving web technologies. However, it is clear that new techniques increase the likelihood of unique identification, and the number of sites that use fingerprinting is increasing as well. A recent report found that at least a third of the top 500 sites visited by Americans employ some form of browser fingerprinting. The prevalence of fingerprinting on sites also varies considerably with the category of website.


Researchers have found canvas fingerprinting techniques to be particularly effective for browser identification. The HTML Canvas is a feature of HTML5 that allows websites to render complex graphics inside of a web page. It’s used for games, art projects, and some of the most beautiful sites on the Web. Because it’s so complex and performance-intensive, it works a little bit differently on each different device. Canvas fingerprinting takes advantage of this.




Subtle differences in the way shapes and text are rendered on the two computers lead to very different fingerprints.

Canvas fingerprinting. A tracker renders shapes, graphics, and text in different fonts, then computes a “hash” of the pixels that get drawn. The hash will be different on devices with even slight differences in hardware, firmware, or software.





A tracker can create a “canvas” element that’s invisible to the user, render a complicated shape or string of text using JavaScript, then extract data about exactly how each pixel on the canvas is rendered. The operating system, browser version, graphics card, firmware version, graphics driver version, and fonts installed on your computer all affect the final result.


For the purposes of fingerprinting, individual characteristics are hardly ever measured in isolation. Trackers are most effective in identifying a browser when they combine multiple characteristics together, stitching the bits of information left behind into a cohesive whole. Even if one characteristic, like a canvas fingerprint, is itself not enough to uniquely identify your browser, it can usually be combined with others -- your language, time zone, or browser settings -- in order to identify you. And using a combination of simple bits of information is much more effective than you might guess.


Fingerprints are often, but not always, unique. Some browsers, like Tor and Safari, are specifically designed so that their users are more likely to look the same, which removes or limits the effectiveness of browser fingerprinting. Browser fingerprints tend to persist as long as a user has the same hardware and software: there’s no setting you can fiddle with to “reset” your fingerprint. And fingerprints are usually available to any third parties who can run JavaScript in your browser.


Identifiers on mobile devices


Smartphones, tablets, and ebook readers usually have web browsers that work the same way desktop browsers do. That means that these types of connected devices are susceptible to all of the kinds of tracking described in the section above.


However, mobile devices are different in two big ways. First, users typically need to sign in with an Apple, Google, or Amazon account to take full advantage of the devices’ features. This links device identifiers to an account identity, and makes it easier for those powerful corporate actors to profile user behavior. For example, in order to save your home and work address in Google Maps, you need to turn on Google’s “Web and App Activity,” which allows it to use your location, search history, and app activity to target ads.


Second, and just as importantly, most people spend most of their time on their mobile device in apps outside of the browser. Trackers in apps can’t access cookies the same way web-based trackers can. But by taking advantage of the way mobile operating systems work, app trackers can still access unique identifiers that let them tie activity back to your device. In addition, mobile phones—particularly those running the Android and iOS operating systems—have access to a unique set of identifiers that can be used for tracking.


In the mobile ecosystem, most tracking happens by way of third-party software development kits, or SDKs. An SDK is a library of code that app developers can choose to include in their apps. For the most part, SDKs work just like the Web resources that third parties exploit, as discussed above: they allow a third party to learn about your behavior, device, and other characteristics. An app developer who wants to use a third-party analytics service or serve third-party ads downloads a piece of code from, for example, Google or Facebook. The developer then includes that code in the published version of their app. The third-party code thus has access to all the data that the app does, including data protected behind any permissions that the app has been granted, such as location or camera access.


On the web, browsers enforce a distinction between “first party” and “third party” resources. That allows them to put extra restrictions on third-party content, like blocking their access to browser storage. In mobile apps, this distinction doesn’t exist. You can’t grant a privilege to an app without granting the same privilege to all the third party code running inside it.


Phone numbers


The phone number is one of the oldest unique numeric identifiers, and one of the easiest to understand. Each number is unique to a particular device, and numbers don’t change often. Users are encouraged to share their phone numbers for a wide variety of reasons (e.g., account verification, electronic receipts, and loyalty programs in brick-and-mortar stores). Thus, data brokers frequently collect and sell phone numbers. But phone numbers aren’t easy to access from inside an app. On Android, phone numbers are only available to third-party trackers in apps that have been granted certain permissions. iOS prevents apps from accessing a user’s phone number at all.


Phone numbers are unique and persistent, but usually not available to third-party trackers in most apps.


Hardware identifiers: IMSI and IMEI


Every device that can connect to a mobile network is assigned a unique identifier called an International Mobile Subscriber Identity (IMSI) number. IMSI numbers are assigned to users by their mobile carriers and stored on SIM cards, and normal users can’t change their IMSI without changing their SIM. This makes them ideal identifiers for tracking purposes.


Similarly, every mobile device has an International Mobile Equipment Identity (IMEI) number “baked” into the hardware. You can change your SIM card and your phone number, but you can’t change your IMEI without buying a new device.


IMSI numbers are shared with your cell provider every time you connect to a cell tower—which is all the time. As you move around the world, your phone sends out pings to nearby towers to request information about the state of the network. Your phone carrier can use this information to track your location (to varying degrees of accuracy). This is not quite third-party tracking, since it is perpetrated by a phone company that you have a relationship with, but regardless many users may not realize that it’s happening.


Software and apps running on a mobile phone can also access IMSI and IMEI numbers, though not as easily. Mobile operating systems lock access to hardware identifiers behind permissions that users must approve and can later revoke. For example, starting with Android Q, apps need to request the “READ_PRIVILEGED_PHONE_STATE” permission in order to read non-resettable IDs. On iOS, it’s not possible for apps to access these identifiers at all. This makes other identifiers more attractive options for most app-based third-party trackers. Like phone numbers, IMSI and IMEI numbers are unique and persistent, but not readily available, as most trackers have a hard time accessing them.


Advertising IDs


An advertising ID is a long, random string of letters and numbers that uniquely identifies a mobile device. Advertising IDs aren’t part of any technical protocols, but are built in to the iOS and Android operating systems.


Ad IDs on mobile phones are analogous to cookies on the Web. Instead of being stored by your browser and shared with trackers on different websites like cookies, ad IDs are stored by your phone and shared with trackers in different apps. Ad IDs exist for the sole purpose of helping behavioral advertisers link user activity across apps on a device.


Unlike IMSI or IMEI numbers, ad IDs can be changed and, on iOS, turned off completely. Ad IDs are enabled by default on both iOS and Android, and are available to all apps without any special permissions. On both platforms, the ad ID does not reset unless the user does so manually.


Both Google and Apple encourage developers to use ad IDs for behavioral profiling in lieu of other identifiers like IMEI or phone number. Ostensibly, this gives users more control over how they are tracked, since users can reset their identifiers by hand if they choose. However, in practice, even if a user goes to the trouble to reset their ad ID, it’s very easy for trackers to identify them across resets by using other identifiers, like IP address or in-app storage. Android’s developer policy instructs trackers not to engage in such behavior, but the platform has no technical safeguards to stop it. In February 2019, a study found that over 18,000 apps on the Play store were violating Google’s policy.


Ad IDs are unique, and available to all apps by default. They persist until users manually reset them. That makes them very attractive identifiers for surreptitious trackers.


MAC addresses


Every device that can connect to the Internet has a hardware identifier called a Media Access Control (MAC) address. MAC addresses are used to set up the initial connection between two wireless-capable devices over WiFi or Bluetooth.


MAC addresses are used by all kinds of devices, but the privacy risks associated with them are heightened on mobile devices. Websites and other servers you interact with over the Internet can’t actually see your MAC address, but any networking devices in your area can. In fact, you don’t even have to connect to a network for it to see your MAC address; being nearby is enough.


Here’s how it works. In order to find nearby Bluetooth devices and WiFi networks, your device is constantly sending out short radio signals called probe requests. Each probe request contains your device’s unique MAC address. If there is a WiFi hotspot in the area, it will hear the probe and send back its own “probe response,” addressed with your device’s MAC, with information about how you can connect to it.


But other devices in the area can see and intercept the probe requests, too. This means that companies can set up wireless “beacons” that silently listen for MAC addresses in their vicinity, then use that data to track the movement of specific devices over time. Beacons are often set up in businesses, at public events, and even in political campaign yard signs. With enough beacons in enough places, companies can track users’ movement around stores or around a city. They can also identify when two people are in the same location and use that information to build a social graph.




A smartphone emits probe request to scan for available WiFi and Bluetooth connections. Several wireless beacons listen passively to the requests.

In order to find nearby Bluetooth devices and WiFi networks, your device is constantly sending out short radio signals called probe requests. Each probe request contains your device’s unique MAC address. Companies can set up wireless “beacons” that silently listen for MAC addresses in their vicinity, then use that data to track the movement of specific devices over time.





This style of tracking can be thwarted with MAC address randomization. Instead of sharing its true, globally unique MAC address in probe requests, your device can make up a new, random, “spoofed” MAC address to broadcast each time. This makes it impossible for passive trackers to link one probe request to another, or to link them to a particular device. Luckily, the latest versions of iOS and Android both include MAC address randomization by default.


MAC address tracking remains a risk for laptops, older phones, and other devices, but the industry is trending towards more privacy-protective norms.


Hardware MAC addresses are globally unique. They are also persistent, not changing for the lifetime of a device. They are not readily available to trackers in apps, but are available to passive trackers using wireless beacons. However, since many devices now obfuscate MAC addresses by default, they are becoming a less reliable identifier for passive tracking.


Real-world identifiers


Many electronic device identifiers can be reset, obfuscated, or turned off by the user. But real-world identifiers are a different story: it’s illegal to cover your car’s license plate while driving (and often while parked), and just about impossible to change biometric identifiers like your face and fingerprints.


License plates


Every car in the United States is legally required to have a license plate that is tied to their real-world identity. As far as tracking identifiers go, license plate numbers are about as good as it gets. They are easy to spot and illegal to obfuscate. They can’t be changed easily, and they follow most people wherever they go.


Automatic license plate readers, or ALPRs, are special-purpose cameras that can automatically identify and record license plate numbers on passing cars. ALPRs can be installed at fixed points, like busy intersections or mall parking lots, or on other vehicles like police cars. Private companies operate ALPRs, use them to amass vast quantities of traveler location data, and sell this data to other businesses (as well as to police).


Unfortunately, tracking by ALPRs is essentially unavoidable for people who drive. It’s not legal to hide or change your license plate, and since most ALPRs operate in public spaces, it’s extremely difficult to avoid the devices themselves.


License plates are unique, available to anyone who can see the vehicle, and extremely persistent. They are ideal identifiers for gathering data about vehicles and their drivers, both for law enforcement and for third-party trackers.


Face biometrics


Faces are another class of unique identifier that are extremely attractive to third-party trackers. Faces are unique and highly inconvenient to change. Luckily, it’s not illegal to hide your face from the general public, but it is impractical for most people to do so.


Everyone’s face is unique, available, and persistent. However, current face recognition software will sometimes confuse one face for another. Furthermore, research has shown that algorithms are much more prone to making these kinds of errors when identifying people of color, women, and older individuals.


Facial recognition has already seen widespread deployment, but we are likely just beginning to feel the extent of its impact. In the future, facial recognition cameras may be in stores, on street corners, and docked on computer-aided glasses. Without strong privacy regulations, average people will have virtually no way to fight back against pervasive tracking and profiling via facial recognition.


Credit/debit cards


Credit card numbers are another excellent long-term identifier. While they can be cycled out, most people don’t change their credit card numbers nearly as often as they clear their cookies. Additionally, credit card numbers are tied directly to real names, and anyone who receives your credit card number as part of a transaction also receives your legal name.


What most people may not understand is the amount of hidden third parties involved with each credit card transaction. If you buy a widget at a local store, the store probably contracts with a payment processor who provides card-handling services. The transaction also must be verified by your bank as well as the bank of the card provider. The payment processor in turn may employ other companies to validate its transactions, and all of these companies may receive information about the purchase. Banks and other financial institutions are regulated by the Gramm-Leach-Bliley Act, which mandates data security standards, requires them to disclose how user data is shared, and gives users the right to opt out of sharing. However, other financial technology companies, like payment processors and data aggregators, are significantly less regulated.


Linking identifiers over time


Often, a tracker can’t rely on a single identifier to act as a stable link to a user. IP addresses change, people clear cookies, ad IDs can be reset, and more savvy users might have “burner” phone numbers and email addresses that they use to try to separate parts of their identity. When this happens, trackers don’t give up and start a new user profile from scratch. Instead, they typically combine several identifiers to create a unified profile. This way, they are less likely to lose track of the user when one identifier or another changes, and they can link old identifiers to new ones over time.


Trackers have an advantage here because there are so many different ways to identify a user. If a user clears their cookies but their IP address doesn’t change, linking the old cookie to the new one is trivial. If they move from one network to another but use the same browser, a browser fingerprint can link their old session to their new one. If they block third-party cookies and use a hard-to-fingerprint browser like Safari, trackers can use first-party cookie sharing in combination with TLS session data to build a long-term profile of user behavior. In this cat-and-mouse game, trackers have technological advantages over individual users.


Part 2: From bits to Big Data: What do tracking networks look like?


In order to track you, most tracking companies need to convince website or app developers to include custom tracking code in their products. That’s no small thing: tracking code can have a number of undesirable effects for publishers. It can slow down software, annoy users, and trigger regulation under laws like GDPR. Yet the largest tracking networks cover vast swaths of the Web and the app stores, collecting data from millions of different sources all the time. In the physical world, trackers can be found in billboards, retail stores, and mall parking lots. So how and why are trackers so widespread? In this section, we’ll talk about what tracking networks look like in the wild.




A bar graph showing market share of different web tracking companies. Google is the most prevalent, monitoring over 80% of traffic on the web.

Top trackers on the Web, ranked by the proportion of web traffic that they collect data from. Google collects data about over 80% of measured web traffic. Source: WhoTracks.me, by Cliqz GBMH.





Print