Tech Straight Talk: The Deep Web and the 99% of the Internet You Can't See
According to Internet World Stats, as much as 40 percent of the world’s total population uses the Internet for a variety of purposes: news, entertainment, email, chat, dating, shopping…in other words, just about anything and everything you used to need an outside world for. But as extensive as the Internet’s reach has become, and as much as people rely on it, only a very, very small part of the total Internet is visible to the average person. As time goes on, less and less of the total amount of data stored online is accessible, and there are several reasons for that.
What we call “the Internet” is mostly made up of websites and data that we can find via a search engine like Google or Bing. The search engines do what the name suggests: they search for Internet content and then present it in a useful way in response to a query entered by a person. However, the Internet is like an iceberg in the sense that only the very upper tip is visible to observers, and that upper tip is all that search engines are able to find and present in response to a search query. How small a tip are we talking about here? Well, according to OEDB, search engines are only able to see a minuscule 0.03% of the total information that exists on the Internet. That’s a staggeringly low number given that most people these days consider Google to be the ultimate authority on anything and everything that’s out there.
But what about the rest of the Internet? What about the other 99.97% that search engines won’t show you? What does that mysterious “lost continent” of the Internet contain, and why can’t the search engines penetrate it to see what’s there?
The Deep Web
Well, that is the part of the Internet called the Deep Web (or, alternatively, the Undernet, Invisible Web, or Hidden Web), and to go back to the iceberg analogy, the Deep Web is the part that extends deep, deep underwater, far deeper than most casual observers would expect.
Nobody really knows for sure just how big the Deep Web is: one of the problems with it being unsearchable is that you can’t estimate its size if you don’t know where its boundaries lie. That said, it’s easily hundreds of times bigger than the visible Internet, and by this point it’s probably safe to say it’s thousands of times bigger. In most cases there are no questionable or malicious reasons this data is hidden; usually it’s not even hidden on purpose. The reason Google can’t see it is that current search engine technology simply isn’t designed in a way that allows it to find this data.
The idea that a search engine is only capable of scratching the surface of the Internet is staggering when you consider that there are well over 550 million registered domains. That’s an enormous number, especially since each domain can have multiple subdomains, each of which can contain hundreds or even thousands of pages. Many of these pages aren’t catalogued, and therefore fall within the boundaries of the Deep Web.
BrightPlanet estimated that the Deep Web could be as much as 500 times the size of the surface Internet, and it continues to grow bigger each day. That is an unthinkably huge amount of data, but how can so much of it be invisible to search engines? These are complex mechanisms designed by the best data management minds in the world, so how could so much of the Internet be beyond their reach? Well, as I said earlier, it mainly has to do with the limitations of current search technology. Let’s look at a bit of technical background to paint the picture of why so much information is beyond the sight of search engines.
Search engines build a massive index of the contents of the Internet through the use of automated processes that are commonly known as “spiders” or “crawlers”. These spiders “crawl” websites and other online resources, following hyperlinks on pages they’re already aware of to other pages and other domains. By doing this, they create a sort of “map” of the internet that, with all its interconnecting links, resembles a gigantic web, hence the name “World Wide Web” and the “spiders” that explore it.
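That crawling process can be sketched in a few lines of Python. This is a toy illustration only: the link graph is hard-coded instead of fetched over HTTP, and the page names are made up. But it shows the core mechanic, which is that a spider can only discover pages that some hyperlink leads to.

```python
from collections import deque

# Toy "web": each page maps to the list of pages it links to.
# A real spider would fetch each page over HTTP and parse its <a href> tags;
# here the link graph is hard-coded for illustration (all names made up).
PAGES = {
    "example.com/": ["example.com/about", "example.com/blog"],
    "example.com/about": ["example.com/"],
    "example.com/blog": ["example.com/blog/post-1"],
    "example.com/blog/post-1": [],
    "example.com/members-only": [],  # no inbound links anywhere
}

def crawl(seed):
    """Breadth-first crawl: index every page reachable by links from the seed."""
    indexed, frontier = set(), deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in indexed:
            continue
        indexed.add(url)
        frontier.extend(PAGES.get(url, []))
    return indexed

reachable = crawl("example.com/")
# "example.com/members-only" is never indexed because no hyperlink leads
# to it -- which is exactly how pages end up in the Deep Web.
```

Notice that the members-only page exists on the server but never enters the index: the spider has no path to it.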
The indexes that spiders create are critical to a search engine’s functionality: they contain the data the engine depends on entirely for its ability to return relevant results against specified search criteria. Everyone is used to getting nearly instantaneous results when performing a web search, but without the index, the search engine would have to begin each search from scratch, scanning billions of webpages any time someone wanted information. The process would be too slow, laborious, and inefficient to be worth it.
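At its heart, that index is what’s known as an inverted index: a mapping from each word to the set of pages containing it, so a query becomes a couple of fast lookups instead of a rescan of the whole web. Here’s a tiny sketch (the page names and text are invented for illustration):

```python
# Two "crawled" pages and their text (both invented for this example).
docs = {
    "page1": "deep web content hidden from search engines",
    "page2": "search engines crawl the surface web",
}

# Build the inverted index: word -> set of pages containing that word.
index = {}
for page, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(page)

def search(query):
    """Return the pages that contain every word in the query."""
    postings = [index.get(word, set()) for word in query.split()]
    return set.intersection(*postings) if postings else set()

# search("search engines") -> {"page1", "page2"}
# search("hidden")         -> {"page1"}
```

Real search indexes add ranking, stemming, and far more sophistication, but the principle is the same: the work of reading pages happens once, at crawl time, not at query time.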
The reason the Deep Web can’t be indexed by these spiders mainly has to do with a variety of technical hurdles that prevent them from even seeing that it’s there. Indexing is complicated by things like private websites that require a login to access the contents of a “members” section. The same can be said of premium websites where content lives behind a paywall. If data on a website can only be found by visiting the site and entering keywords into its internal search engine, the spiders won’t be able to find that data either. Sometimes data is stored in encrypted form, or is otherwise incompatible with or unfamiliar to the spiders. Other sites will present a resource, and then cut off public access to it after a given time limit has expired.
Since internet-connected private networks (such as internal company networks) are also technically part of the Internet, anything stored on their private servers and systems is also beyond the reach of search indexing efforts. The same goes for pages that were once published to the public, but have since been unpublished. Websites can also flag certain pages (or even the entire domain) to not be indexed by search engines.
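That flagging is usually done with a robots.txt file at the site’s root (or a “noindex” meta tag on individual pages), which well-behaved crawlers check before indexing anything. Python’s standard library can even parse these rules; here’s a small example with a hypothetical robots.txt that asks all crawlers to skip a /private/ section:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt telling every crawler to skip /private/.
# On a real site this file would live at https://example.com/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite spider asks before fetching each URL:
public_ok = rp.can_fetch("*", "https://example.com/public/page")    # True
private_ok = rp.can_fetch("*", "https://example.com/private/page")  # False
```

Pages excluded this way are publicly reachable if you type the address, but they still end up in the Deep Web as far as search engines are concerned.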
Non-web Internet resources such as FTP servers, internet streaming media services that don’t have a web portal (i.e., not YouTube), and Usenet cannot be reached via hyperlinks, and are therefore also inaccessible to web crawlers. News websites are often only indexed on a limited basis, with a focus on stories that receive a lot of public attention (a Presidential election or a major natural disaster, for example). Information on obscure, local, or older stories may not get indexed; instead, you’d have to go directly to a news outlet’s website and use its internal search function (which, as we established earlier, spiders can’t do).
All of these challenges make it difficult or impossible for a search engine to find and index this content. The Deep Web may present technical barriers that search engines can’t get over, but just because it’s not indexed doesn’t mean it’s not worth indexing. There’s a lot of valuable information in the Deep Web, and the indexing technology just isn’t up to capturing it all…at least not yet. One day, maybe, but crawlers can’t currently behave enough like human users to do it.
The list of what the Deep Web contains keeps getting longer and longer, and more and more complicated. It’s not easy for search engineers to continually come up with new ways to accurately index web content as newer technologies for storing and presenting content are developed. The Deep Web has been a tough nut to crack, and so far, nobody has found a good way of cracking it. However, finding the stuff on the Deep Web is only half the challenge: the other half is presenting it in a way that won’t overwhelm end users who are looking for specific information.
Keep in mind, you might just be looking for a good chicken parm recipe, but other people rely on search engines for many other things, including work that’s critical to maintaining and advancing the world we live in. Scientists working on new medications, more efficient fuel use, faster internet speeds, expanded communications coverage, space travel technologies, and many other projects are all aided greatly by their ability to search for material on the internet. Non-scientists who need to research historical records, develop business plans, train themselves in new skills, interact with communities of like-minded people, research legal records, and so on also rely on the indexing power of search engines.
There are so many masters that search engines need to serve, and all those needs have to be factored in when the search algorithms are developed. It’s a staggering amount of work, and frankly, the fact that they’ve made the technology as powerful as it currently is is an astounding accomplishment. The potential for what search technology can further offer in the future is equally impressive, but the technical challenges preventing them from currently doing so are not trivial.
That’s part of the intrigue of trying to crack the Deep Web: the information is out there, and in many cases, those who possess it would have no problem whatsoever with sharing it with the public if search engines could develop a way to access and index it. The day may come when those challenges are addressed, but as of right now there’s a big hunk of that iceberg sitting under the surface that has a big question mark hanging over it.
The Dark Web
That brings us to the most remote part of the Deep Web, the one that a lot of people are worried about for a lot of reasons, some good and some bad. While the Deep Web contains a huge repository of untapped resources that simply haven’t had the doors opened yet, the Dark Web is a place where the folks using it would prefer that you not only keep the door shut, but lock it with about ten heavy bolts, bar it shut, nail boards over it, and throw some superglue in whatever cracks are left open.
The Dark Web is the part of the Internet where data really is intentionally hidden, and often for good reason: there are many cases in which the information and communications taking place there would bring severe consequences down on the people involved if they were caught. Most of the time, accessing the Dark Web is only possible using browser software specifically designed to access and navigate it. Common web browsers like Microsoft Edge, Google Chrome, Apple Safari, and Mozilla Firefox aren’t able to access the Dark Web, which actually functions on a different design than the rest of the Internet.
Now, there’s a good reason for the special browser software: it’s specifically designed to protect both the source and destination of any communication traveling through the Dark Web, minimize security weaknesses that can be exploited to identify Dark Web users, and even obscure where in the world the user is. It even uses a different addressing and routing scheme than the main Internet, whose addressing scheme can often be used to narrow down the geolocation of a normal Internet user. This kind of anonymity is an immensely powerful benefit to people who need it for both really good and really bad reasons, because it allows them to circumvent government controls (both good and bad) and transfer information, goods, and services…both legal and illegal ones.
Obviously, the secretive nature of the Dark Web creates the impression that there’s a lot of shady things happening there, and there is. Fairly or not, a lot of the attention is cast on the bad stuff in the Dark Web: drugs, stolen personal information, weapons, and even hitmen. Just about every highly illegal thing you can imagine can be found on the Dark Web, and just like in real life, finding this stuff isn’t easy.
Obviously, you can’t search for these things with Google (and I would highly advise you not to try), but not for the same reasons you can’t use it to find things in the Deep Web. Unlike the data incompatibilities or database complexities that limit a search engine’s ability to index the Deep Web, the Dark Web is specifically designed to use a different routing architecture than the regular, public Internet.
In fact, the “roadways” within the Dark Web are so different from the public Internet’s that you can’t use standard browsers like Google Chrome or Firefox to browse its contents. To do that, you need a special browser called The Onion Router, or Tor for short. Tor is designed to run on the Dark Web’s unique architecture, which not only uses encryption, but also routes connections through multiple servers around the world using a special addressing scheme that makes users nearly impossible to track.
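The “onion” in Tor’s name refers to this layered routing: the sender wraps a message in one layer of encryption per relay, and each relay peels off exactly one layer, so no single relay ever sees both who sent the message and what it says. Here’s a toy sketch of that idea; the XOR “cipher” and the key names are illustrative stand-ins only (real Tor uses proper ciphers and key exchange, not anything like this):

```python
from itertools import cycle

def xor(data: bytes, key: bytes) -> bytes:
    # Toy stand-in for real encryption: XOR with a repeating key.
    # NOT secure -- it only illustrates how layers wrap and unwrap.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

# One key per relay in the circuit. The sender knows all three;
# each relay knows only its own. (Names are made up for illustration.)
relay_keys = [b"guard-key", b"middle-key", b"exit-key"]

message = b"hello, hidden service"

# Sender: wrap the message in layers, the exit relay's layer innermost.
packet = message
for key in reversed(relay_keys):
    packet = xor(packet, key)

# Each relay peels off exactly one layer as the packet crosses the circuit;
# only once the exit relay's layer is removed is the message readable.
for key in relay_keys:
    packet = xor(packet, key)

# packet is now equal to the original message again
```

The point of the structure is that the first relay sees where the packet came from but not its contents, and the last relay sees the contents but not the original sender.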
While you can use the Dark Web as a “secret entrance” to browse the public Internet anonymously, it also provides access to “hidden services” in the form of websites that are only accessible on the Dark Web. These websites, which end in “.onion” as opposed to “.com” or “.net”, are often hosted on servers in countries that don’t cooperate with law enforcement agencies from countries that track and prosecute international crimes.
There are plenty of sites offering whatever you want, most infamously the Silk Road, an online marketplace for drugs, guns, and pretty much anything else you wouldn’t be able to legally buy in a store. Tor’s intentionally difficult-to-track nature made it a big win for the FBI when it nabbed Ross Ulbricht, the man behind the website, but that same anonymity allowed multiple copycat sites to pop up the instant Silk Road went down and effectively continue the services it offered with minimal interruption.
You may be asking yourself how such an environment was created, and why more hasn’t been done to shut it down or render it inaccessible through other means. Well, you may be surprised to learn that Tor and the Dark Web were actually created by the US Naval Research Laboratory. Yes, as in the American military. They created it to aid political dissidents, whistleblowers, and others who have important information that can hurt the bad guys, but who would otherwise fear violent repercussions for sharing it.
Unfortunately, the anonymity provided by Tor was so secure that it quickly became a haven for criminals who found its ability to obscure their identities and activities quite handy. The Dark Web, just like any other powerful tool, is a double-edged sword that has created as many problems as it has solved. It’s also ironic that criminals are able to use US military-created technology to cover their tracks.
Bitcoin (And Other Cryptocurrencies)
Okay, but if everything on the Dark Web is supposed to happen anonymously, how can payments for all these illegal transactions happen? Surely Dark Web consumers aren’t whipping out their Visa cards when they need to buy a few human livers, are they? No, they’re not: they’re using an encrypted digital currency called Bitcoin that, like everything else on the Dark Web, is designed not to be traceable to whoever owns or uses it.
Bitcoin’s anonymous and unregulated nature allows for freedoms that traditional currencies can’t offer, but there are drawbacks to using it. Since it isn’t backed or regulated by any nation, its value tends to fluctuate rather radically. Online Bitcoin storage services also aren’t insured the way traditional banks are, which has caused problems when those services have been compromised, costing Bitcoin owners a ton of money in the process. And since Bitcoin is designed to be untraceable, there was no way to track down whoever was behind the heists.
That being said, Bitcoin and everything else on the Dark Web does have its positive uses as well. Though it’s easy to focus on the illegal things (and the news reports certainly do that with zeal), there are many other hidden services on the Dark Web that are perfectly legitimate. Many of the same services you use on the public Internet (or Clearnet, as Dark Web dwellers refer to it) are available on the Dark Web: email, file sharing and storage, social media, forums, news, and more, available to those who don’t trust their information to more popular Clearnet providers like Google, Microsoft, or Amazon.
There are also a number of communities on the Dark Web for groups who may not be able to meet safely in person in their home location. This obviously includes political activists, as well as government and corporate whistleblowers, but also those whose lifestyles (especially in countries with "unapproved" personal lifestyles, and we'll leave it at that) could open them up to violence or death in a more open setting.
Now, the flip side is that the search engines that do exist in the Dark Web have only limited ability to index its contents, and no ability to personalize results according to your location or tendencies based on previous searches. However, most of those on the Dark Web can live with that trade-off in return for not having their online behavior tracked or being subjected to suspiciously targeted online advertisements.
In fact, a research paper from the University of Luxembourg found that the hidden services that focus on human rights and freedom of information are every bit as popular as those hosting illegal content or activities, possibly more so. So while the Dark Web does provide an avenue for some pretty questionable activities, its value in the human rights and freedom of information spheres is such that the Dark Web and Tor cannot be easily dismissed as merely a vehicle for criminals.
The amount of data contained in the Deep Web and Dark Web increases on a daily basis, and in doing so, makes the job of search engines that much harder. It’s a good thing and a bad thing. However, programmers are up to the challenge of not only creating new algorithms and methods for finding and indexing it all, but organizing it in a way that will be usable by regular individuals, researchers, and professionals alike.
Until then, the Deep Web, and especially the Dark Web, will continue to be the mysterious, distant island everyone can see but not get to. It’s also important not to allow ourselves to get so focused on the shady elements of the Dark Web that we allow a negative stigma to become an impediment to its development as an important and powerful tool for legitimate uses. Rather than sensationalize the worst, it’s important to think about all the ways the data contained in these remote corners of the Internet could continue to transform and evolve our society.