The World Wide Web is increasingly used for commerce and for access to personal information stored in computerized databases. Clearly, the Internet and the Web ease access to existing data sources, some of which contain personal information, such as financial records, medical records, and criminal records. More subtly, and perhaps more significantly, by moving many human activities (shopping, entertainment, communication, and so forth) onto the computer and computer networks, the Web makes it possible to track an individual's activities more than ever before. Indeed, many Web sites collect information about you, such as the pages you visit, the searches you request, the items you browse or purchase, or the ads you click, without your consent, and in many cases, without your knowledge.
Several disturbing cases have emerged in recent months. Real Networks included a unique identifier in each copy of their music-playing software; every time the user played a CD, or some music from the net, the software secretly sent that information to Real Networks. Since the user had to supply their e-mail address to Real before obtaining the software, Real Networks could track all of the music played, by e-mail address. Once this data collection was discovered, they modified the software to stop the practice. DoubleClick, one of the dominant banner-advertising agencies, collects data about a user across all of the many sites that DoubleClick services. They were recently met with heavy public criticism for a plan that would cross-reference these records of web-surfing activity with the mail-order records of the same people, as found in a database built by Abacus, Inc. They, too, gave up the plan after public outcry. Finally, a study by the California HealthCare Foundation found that many of the top 21 health-oriented sites were including medically sensitive information in the data they passed along to advertisers. Most of those health sites have since fixed the problem. Although none of these cases demonstrates any malicious use of the data, they show the sort of data that can be collected automatically.
Privacy is a concept that is infused with many ethical, legal, and cultural meanings. In the context of the Web, I like Alan F. Westin's definition: "Privacy ... is the claim of individuals ... to determine for themselves when, how, and to what extent information about them is communicated to others ...." The focus here is on control of information about oneself.
One way to achieve privacy is to remain anonymous. Anonymity depends on the context, and it seems that the goal of being anonymous is for other people, or organizations, to be unable to identify you beyond that context. For example, many AOL users choose an alias to use in chat rooms. In the context of those chat rooms, they may be "well known." But they are anonymous as long as their correspondents are unable to connect that alias with a person outside the chat room, say, the local schoolteacher.
Web anonymity is not as simple as it may seem. For example, banner advertisements may enable personal information to "leak" from one site to another: if I provide my name and e-mail address to one DoubleClick customer site, then later visit another DoubleClick customer site "anonymously," I may not be aware that DoubleClick may have bought that information from the first site and sold it to the second site. I am not sure whether DoubleClick or any other advertising agent does this, but the technology enables it, the motivations exist to encourage it, and there are no laws or regulations to prevent it.
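To make the mechanism concrete, here is a minimal sketch in Python of how such a leak could work. It is an illustration under stated assumptions, not a description of what DoubleClick or any real advertising agent does: the AdNetwork class, its method names, the site names, and the cookie value are all hypothetical. The only idea it relies on is that a browser presents the same ad-network cookie on every site that carries that network's banners.

```python
from dataclasses import dataclass, field


@dataclass
class AdNetwork:
    """Hypothetical banner-ad network serving ads on many customer sites."""
    # cookie id -> list of (site, page) visits observed through banner requests
    visits: dict = field(default_factory=dict)
    # cookie id -> personal details handed over by a customer site
    identities: dict = field(default_factory=dict)

    def serve_banner(self, cookie_id, site, page):
        """Log the visit; the same cookie arrives from every customer site."""
        self.visits.setdefault(cookie_id, []).append((site, page))

    def receive_identity(self, cookie_id, name, email):
        """A customer site passes along what the visitor typed into its form."""
        self.identities[cookie_id] = {"name": name, "email": email}

    def lookup(self, cookie_id):
        """What the network could report about this browser to any customer."""
        return self.identities.get(cookie_id), self.visits.get(cookie_id, [])


network = AdNetwork()
cookie = "browser-1234"  # hypothetical cookie value stored in one browser

# Visit 1: the user registers by name at one customer site.
network.serve_banner(cookie, "shop.example", "/register")
network.receive_identity(cookie, "Pat Doe", "pat@example.com")

# Visit 2: the same browser visits a second customer site "anonymously".
network.serve_banner(cookie, "health.example", "/conditions/asthma")

# The second visit is now linked to the name given at the first site.
identity, history = network.lookup(cookie)
print(identity, history)
```

The second site never asked for a name, yet the shared cookie lets the network attach one to the visit; that link is the "leak" described above.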
Many people believe that to be nameless is to be anonymous. But a name is just a string of letters, and is rarely unique. Other information about you may be more useful than your name. A telephone number is useful if I want to call you, and an e-mail address is useful if I want to e-mail you; in either case, your name may be irrelevant. A social-security number is, by itself, just a nine-digit number, but often it can be used to locate other useful information. Ultimately, the value of information about an individual is directly related to the potential uses of that information. If my goal is to cross-sell a customer a shirt after they add a sweater to their virtual shopping cart, knowledge of their gender or height and weight may be more useful than their name or social-security number.
Indeed, the most common usage for information collected about Web users, particularly that information collected invisibly while they surf the Web, is to target advertisements. Although many people are disturbed when advertising agents build a profile of them, I would expect that most people would prefer well-targeted advertisements to the current barrage of irrelevant and untargeted spam and other ads. Nonetheless, people are quite suspicious. Why? Because the same data collected for targeted advertising might be sold, with no consent or notification necessary, to an organization that has a different use in mind. For example, the data about your purchasing history might be sold to your banker, and used to deny you a loan. Data about the material you've been reading at that medical-information site might be sold to, or leaked to, your insurance agent, your employer, or your school's basketball coach.
In my opinion, every Web site should have a clear and prominent privacy policy, stating at a minimum 1) what information is collected from the user, either explicitly or implicitly, 2) how that information will be used, 3) whether that information might be given to a third party without consent of the user, 4) a list of any other parties that might collect data from visitors to this site, such as advertising agencies, and 5) how the users can easily view, update, or delete the information collected about themselves. These requirements are essentially those suggested by the Federal Trade Commission. Furthermore, the site's privacy policy and its implementation should be audited by a trustable third party; several private-sector organizations exist for this purpose already. Any data collected about users should be stored securely, to avoid leaking private information to anyone else; some hackers have obtained large lists of credit-card numbers from some e-commerce sites, for example. Any data collected under one policy should not be used under the terms of a looser, updated policy. DoubleClick had promised not to correlate their data with external sources, but later changed their privacy policy to say that they might do so, without informing users that the policy had changed, or promising to discard all the old data first. Finally, if at all possible, data collection should be at the consent of the user, that is, should be based on an "opt-in" rather than "opt-out" policy.
It is not yet clear to me exactly what federal regulations will be necessary to add teeth to these privacy policies, but I have serious doubts that the industry will be able to regulate itself successfully. Its record so far, in the direct-marketing world and now in the online world, is quite poor.
If you are interested in reading more about these issues, I have collected some links at www.cs.dartmouth.edu/~dfk/privacy.html. There are links to an expanded version of this article, and to The New York Times's excellent stories on the subject, among other things.