Unmasked: What 10 million passwords reveal about the people who choose them
A lot is known about passwords. Most are short, simple, and pretty easy to crack. But much less is known about the psychological reasons a person chooses a specific password. We’ve analyzed the password choices of 10 million people, from CEOs to scientists, to find out what they reveal about the things we consider easy to remember and hard to guess.
10 Million Tiny Windows
Who is the first superhero that comes to mind? What about a number between one and 10? And finally, a vibrant color? Quickly think of each of those things if you haven’t already, and then combine all three into a single phrase.
Now, it’s time for us to guess it.
Is it Superman7red? No, no: Batman3Orange? If we guessed any one of the individual answers correctly, it’s because humans are predictable. And that’s the problem with passwords. True, we gave ourselves the advantage of some sneakily chosen questions, but that’s nothing compared to the industrial-scale sneakiness of purpose-built password-breaking software. Hashcat, for instance, can make 300,000 guesses at your password per second (depending on how it’s hashed), so even if you chose Hawkeye6yellow, your secret phrase would, sooner or later, not be secret anymore.
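To put that guessing rate in perspective, here is a rough back-of-the-envelope sketch in Python. The list sizes (500 superheroes, 10 numbers, 50 colors) are illustrative assumptions, not figures from either data set, and the 300,000 guesses per second is simply the rate quoted above; real speeds vary widely with the hashing scheme.

```python
# Rate quoted in the text; actual rates depend heavily on how the password is hashed.
GUESSES_PER_SECOND = 300_000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def seconds_to_exhaust(*list_sizes, rate=GUESSES_PER_SECOND):
    """Worst-case time to try every combination of one item from each list."""
    combos = 1
    for size in list_sizes:
        combos *= size
    return combos / rate


# ASSUMPTION: hypothetical list sizes for a "superhero + number + color" password.
pattern_seconds = seconds_to_exhaust(500, 10, 50)

# For comparison: 14 characters, each drawn at random from 62 letters and digits.
random_years = seconds_to_exhaust(*[62] * 14) / SECONDS_PER_YEAR

print(f"Patterned password: {pattern_seconds:.2f} seconds")  # 0.83 seconds
print(f"Random 14-character password: {random_years:.2e} years")
```

Under these assumptions, every possible hero-number-color combination is exhausted in under a second. Accounting for capitalization variants multiplies the search space by a small constant, which barely changes the picture.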
Passwords are so often easy to guess because many of us think of obvious words and numbers and combine them in simple ways. We wanted to explore this concept and, in doing so, see what we could find out about how a person’s mind works when he or she arranges words, numbers, and (hopefully) symbols into a (probably not very) unique order.
We began by choosing two data sets to analyze.
Two Data Sets, Several Caveats
The first data set is a dump of 5 million credentials that first showed up in September 2014 on a Russian Bitcoin forum.[1] They appeared to be Gmail accounts (along with some Yandex.ru accounts), but further inspection showed that, while most of the email addresses included were valid Gmail addresses, most of the plain-text passwords were either old Gmail ones (i.e., no longer active) or passwords that were never used with the associated Gmail addresses. Nevertheless, WordPress.com reset 100,000 accounts and said that a further 600,000 were potentially at risk.[2] The dump appears to be several years’ worth of passwords that were collected from various places, by various means. For our academic purposes, however, this didn’t matter. The passwords were still chosen by Gmail account holders, even if they weren’t for their own Gmail accounts, and given that 98 percent were no longer in use, we felt we could safely explore them.[3]
We used this data set, which we’ll call the “Gmail dump,” to answer demographic questions (especially those related to the genders and ages of password-choosers). We extracted these facts by searching the 5 million email addresses for any that contained first names and years of birth. For example, if an address was John.Smith1984@gmail.com, it was coded as a male born in 1984. This method of inference can be tricky. We won’t bore you with too many technical details here, but by the end of the coding process, we had 485,000 of the 5 million Gmail addresses coded for gender and 220,000 coded for age. At this point, it’s worth bearing in mind the question, “Do users who include their first names and years of birth in their email addresses choose different passwords than those who don’t?”—because it’s theoretically possible they do. We’ll discuss that more a bit later.
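As a rough illustration of this kind of coding, here is a minimal Python sketch. It is not the procedure we actually used: the tiny name table, the accepted range of birth years, and the function name code_address are all hypothetical, chosen only to show the idea.

```python
import re

# ASSUMPTION: a tiny, hypothetical name-to-gender table. A real one would need
# many thousands of names and a policy for ambiguous ones.
NAME_GENDER = {"john": "male", "david": "male", "mary": "female", "anna": "female"}

# ASSUMPTION: treat a standalone four-digit number from 1930 to 2009 as a birth year.
YEAR_RE = re.compile(r"(?<!\d)(19[3-9]\d|200\d)(?!\d)")


def code_address(email):
    """Return (gender, birth_year) inferred from an address; None where unknown."""
    local_part = email.split("@", 1)[0].lower()
    # Split on anything that is not a letter, so "john.smith1984" -> john, smith
    tokens = re.split(r"[^a-z]+", local_part)
    gender = next((NAME_GENDER[t] for t in tokens if t in NAME_GENDER), None)
    match = YEAR_RE.search(local_part)
    birth_year = int(match.group(1)) if match else None
    return gender, birth_year


print(code_address("John.Smith1984@gmail.com"))  # ('male', 1984)
print(code_address("coolcat99@gmail.com"))       # (None, None)
```

Even this toy version hints at why the inference is tricky: matching whole tokens misses names that run straight into a surname, while matching substrings would wrongly find "john" inside "johnson", and a four-digit number is not always a birth year.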
For now, though, here’s how the users we coded were divided by decade of birth and gender.