Dark Web vs Deep Web vs Surface Web

Written by Stuart Clarke

Chief Executive Officer

And How to Incorporate Each Within OSINT Investigations

Many consider the internet to be one monolithic structure accessible via search engines like Google, Yahoo and Bing. In reality, quintillions of bytes are created on the internet each day, and this data is spread across three distinct parts of the internet: the surface web, deep web and dark web.

A popular analogy for the anatomy of the internet is an iceberg — only a small percentage of it is visible when using a typical search engine. Whilst obtaining precise figures for the proportion of the internet attributable to either the surface, deep or dark web is difficult, most estimates place the proportion of the internet formed by the deep web at around 93%.¹

The dark web is even harder to measure. It is likely to occupy less than 0.1% of the internet, with the remainder (around 6%) consisting of the surface web, which is indexed and accessible by standard search engines.

There are various open source intelligence (OSINT) techniques that investigators can use to understand deep and dark web activity, thereby allowing connections to be mapped across these three distinct components of the internet. This article will break down the dark web, deep web and surface web, and look at the opportunities and limitations each presents within OSINT investigations.

Suggested reading: To learn more about open source investigation best practices, check out our recent publication — The OSINT Handbook

The Surface Web

The surface web is the part of the world wide web that is accessible to the general public via search engines. It is crawled by bots, or spiders, which follow links (URLs) and analyse web content by reading the code on each page. The process has two parts: crawling, in which the bot follows links to discover web pages, and indexing, in which the content on those pages is analysed and stored for retrieval. Indexed pages can then appear on the search engine results pages (SERPs).
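The two stages described above can be sketched in a few lines of Python. This is a minimal illustration, not how any commercial search engine works: the PAGES dictionary is a hypothetical in-memory stand-in for real websites, and the function names (crawl, build_index, search) are invented for this example. Note that the members-only page has no inbound links, so the crawler never discovers it and it never reaches the index.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "https://example.com/": (
        "Example news home page",
        ["https://example.com/fraud-report", "https://example.com/about"],
    ),
    "https://example.com/fraud-report": (
        "Public report on procurement fraud",
        ["https://example.com/"],
    ),
    "https://example.com/about": ("About the example newsroom", []),
    # No page links here, so a link-following crawler cannot find it.
    "https://example.com/members/archive": ("Members only fraud archive", []),
}


def crawl(seed):
    """Crawling: follow links breadth-first from a seed URL to discover pages."""
    seen, queue, discovered = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        if url not in PAGES:  # dead link, nothing to fetch
            continue
        discovered.append(url)
        for link in PAGES[url][1]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return discovered


def build_index(urls):
    """Indexing: map each word to the set of discovered pages containing it."""
    index = defaultdict(set)
    for url in urls:
        for word in re.findall(r"[a-z0-9]+", PAGES[url][0].lower()):
            index[word].add(url)
    return index


def search(index, term):
    """Return the indexed pages that contain the search term."""
    return sorted(index.get(term.lower(), set()))


discovered = crawl("https://example.com/")
index = build_index(discovered)
print(search(index, "fraud"))
```

Searching for "fraud" returns only the public report, even though the members-only archive also contains the word. Content that a crawler cannot reach is absent from the results, however relevant it may be.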

In summary:

The surface web refers to the indexed world wide web: the most popular information system used to access content over the internet.
Google, Bing, Yahoo and other search engines employ web crawlers or web spiders, which crawl the web for indexable content. This content is organised in the search engine for retrieval via search.
The surface web contains publicly accessible web information that is indexable (i.e. it is not placed behind a subscription wall, private login or paywall, and is not labelled as excluded from indexing).
Whilst search engines are not particularly equitable, as search engine optimisation (SEO) dictates the order of the SERPs, it’s still theoretically possible to find any indexed web page using a search engine.

The surface web in OSINT investigations

The surface web is the first port of call for almost any OSINT research process. While the data accessible through the surface web is small compared to that of the deep web, it represents a well-structured and well-organised source of information that contextualises the publicly visible components of networks, events and connections.

Opportunities and Uses

Within the context of OSINT investigations, the surface web provides investigators with an opportunity to trace publicly visible names and identifiers to kickstart investigations. News events, public reports, and public forums provide solid starting points. Poor or lazy surface web habits, like a user having the same username on a public forum or YouTube as they do on a dark web marketplace, can allow investigators to start building connections.
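Username reuse of this kind can be checked programmatically. The sketch below assumes the investigator already holds two lawfully collected lists of handles. The sample handles and the function names (normalise, matching_handles) are hypothetical, and the normalisation rule (case-folding and stripping everything except ASCII letters and digits) is only one of many possible choices.

```python
import re
import unicodedata


def normalise(handle: str) -> str:
    """Reduce a handle to lowercase ASCII letters and digits for comparison."""
    folded = unicodedata.normalize("NFKC", handle).casefold()
    return re.sub(r"[^a-z0-9]", "", folded)


def matching_handles(source_a, source_b):
    """Return (handle_a, handle_b) pairs whose normalised forms are identical."""
    by_key = {}
    for handle in source_a:
        by_key.setdefault(normalise(handle), []).append(handle)

    pairs = []
    for handle in source_b:
        key = normalise(handle)
        if not key:  # skip handles that normalise to nothing
            continue
        for candidate in by_key.get(key, []):
            pairs.append((candidate, handle))
    return pairs


# Hypothetical handles from a public forum and a marketplace listing.
forum_handles = ["Night_Owl_77", "trader.joe", "alice"]
market_handles = ["nightowl77", "TraderJoe", "bob"]

for forum_name, market_name in matching_handles(forum_handles, market_handles):
    print(f"possible link: {forum_name} <-> {market_name}")
```

A match of this kind is only a lead. Common handles are reused by unrelated people, so any link needs corroboration from other sources before it is treated as a connection.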

Publicly available and accessible social media is an important component of the surface web, enabling OSINT researchers to draft networks and their connections. Social media intelligence (SOCMINT) tools can be deployed to optimise the task of surface web network mapping, allowing investigators to generate initial leads, visualise connections and discover further avenues for exploration.

Limitations and Challenges

While the surface web provides a powerful means to explore and contextualise public information, the potency of OSINT investigations is multiplied by the deep and dark web. The surface web is limited — it’s explicitly designed for easy search and navigation by the public. It’s possible to get quite far with just a surface web investigation, but researchers are likely to hit a brick wall at some point.

In addition to this, the extent to which search engines dominate the surface web limits its usefulness within OSINT investigations. Search engines are designed for consumers, not investigators, and as a result search engines bring back results they think the user wants to see. This can be useful for consumers searching for a product, but very unhelpful for investigators seeking out unbiased information. SEO further undermines the effectiveness of the surface web in OSINT investigations, making results more reflective of marketing spend and strategy than relevance or quality.

The Deep Web

The deep web is often conflated with the dark web in public discourse, but they are not the same. Content on both is effectively ‘invisible’ to search engines because crawlers are unable to reach it. The difference is one of intent: much of the deep web is not deliberately hidden from public access, whereas dark web content is deliberately obscured.

Often, search crawlers can’t index deep web content because the website instructs them not to, or because the content requires authentication to access. Any webmaster can place a plain-text file called robots.txt at the root of their website to ask web crawlers not to crawl certain URLs.
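As an illustration of how a well-behaved crawler honours these instructions, the sketch below uses the RobotFileParser class from Python's standard library. The robots.txt content, the example.com URLs and the "ExampleBot" user agent are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site: every crawler is asked
# to stay out of the members area and the internal reports folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /reports/internal/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string rather than fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

for url in (
    "https://example.com/blog/post-1",
    "https://example.com/members/profile",
):
    allowed = parser.can_fetch("ExampleBot", url)
    print(url, "-> crawl allowed" if allowed else "-> crawl disallowed")
```

The file is a request rather than an access control. Compliant crawlers skip the disallowed paths, which keeps those pages out of search results, but the pages remain reachable by anyone who has the URL and any credentials the site requires.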

Deep web sources include:

Grey literature, which includes corporate and working papers, white papers, reports, evaluations, and unpublished academic data.
Database material which is not indexed by search crawlers, but is instead indexed internally, and is therefore not directly accessible using surface web browsers.
Paywall and password-protected content from academic, corporate, governmental, legal, financial, NGO and medical/public health sources.
Data contained on private intranets or in cloud storage such as OneDrive, Dropbox etc.
Emails and messages sent using messaging platforms and web apps.

The deep web in OSINT investigations

The deep web is colossal, perhaps 500 times larger than the surface web, and much of it is considered open source — the fact it isn’t indexed and readily accessible by commercial search engines is irrelevant.²

Opportunities and Uses

Deep web grey literature provides a powerful means to discover links and discrepancies between unindexed records, leaked information and public filings. OSINT researchers can use the deep web to map networks using both publicly accessible social media information and social media data contained within the deep web, including images, video and metadata.

Deep web OSINT provides data obtained from behind logins, e.g. publicly available forums that require membership, corporate records and sanctions lists. In these instances, the information is intended and available for public consumption but requires user authentication, unlike the surface web.

Limitations and Challenges

The primary challenge of using the deep web arises from the fact that standard search engines do not index it in the same way they do the surface web, making it far more difficult to navigate. Furthermore, this means that relationships and considerable expertise are essential in order to access all the data sources across the deep web.

The Dark Web

The dark web is often defined as an extension of the deep web, which is true in that the dark web is also hidden from surface web indexing. However, the dark web is hidden by intent, and designed with specific technologies to protect user anonymity. Numerous prominent surface websites, including Facebook and the New York Times, host mirrors of their content on the dark web for this exact reason, as it allows political dissent and freedom of expression in authoritarian countries without fear of identification.³

The dark web uses cryptographic methods to partially anonymise users. This is done primarily by relaying encrypted traffic through a series of nodes, a technique known as onion routing, most commonly via the Tor (The Onion Router) network and the Tor Browser. This obfuscates IP addresses and other identifiers, hiding the user’s requests and communications. The network infrastructure is dynamic and randomised, making connections difficult to trace.
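The layering idea behind onion routing can be illustrated with a toy sketch. Everything here is an assumption made for teaching purposes: the relay names, the pre-shared keys and the XOR "cipher" built from SHA-256 are hypothetical and insecure, and they do not reflect the cryptography or circuit negotiation that Tor actually uses. The point is only that the sender wraps a message in one layer per relay, and each relay can remove its own layer and learn the next hop, but nothing more.

```python
import hashlib


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy cipher (NOT secure): XOR data with a SHA-256 counter keystream.

    Applying it twice with the same key restores the original data.
    """
    out = bytearray()
    for start in range(0, len(data), 32):
        counter = (start // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + counter).digest()
        out.extend(b ^ p for b, p in zip(data[start:start + 32], pad))
    return bytes(out)


def wrap(message: bytes, route, destination: str) -> bytes:
    """Add one layer per relay, starting with the innermost (last relay)."""
    next_hop, payload = destination, message
    for name, key in reversed(route):
        payload = keystream_xor(key, next_hop.encode() + b"|" + payload)
        next_hop = name
    return payload


def peel(key: bytes, packet: bytes):
    """Remove one layer: reveals only the next hop and an inner packet."""
    next_hop, _, inner = keystream_xor(key, packet).partition(b"|")
    return next_hop.decode(), inner


# Hypothetical three-relay route with pre-shared keys (an assumption;
# real networks negotiate keys with each relay separately).
ROUTE = [("entry", b"key-entry"), ("middle", b"key-middle"), ("exit", b"key-exit")]

packet = wrap(b"hello", ROUTE, "destination.example")
for name, key in ROUTE:
    next_hop, packet = peel(key, packet)
    print(f"{name} relay forwards to: {next_hop}")
print("delivered payload:", packet)
```

In this sketch the entry relay sees who sent the packet but not its final destination or content, while the exit relay sees the destination but not the sender. That separation of knowledge is what makes traffic difficult to trace.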

Key dark web facts:

The dark web has hosted criminal activity and black markets since the late 1990s, and also hosts marketplaces that sell everything from drugs and firearms to stolen data and illegal services.
As well as drugs, firearms, and financial crime, the dark web is used by terrorist groups from around the world.
Following the 2015 Paris terrorist attacks, many ISIS propaganda websites and archives were unearthed on the dark web.⁴ There is evidence that terrorist groups use the dark web for fundraising, communications, and the purchase of weaponry.
The dark web also has a history of use for internal corporate or governmental risk discussions, e.g. preceding an expected data leak or whistleblowing event.

The dark web in OSINT investigations

The dark web provides a rich source of illicit data. However, it’s estimated that only 6.7% of users access Tor specifically for illegal or illicit purposes.⁵ That is still roughly 1 in 15, a very large proportion compared to those who access the surface web for the same end goals.

Opportunities and Uses

OSINT investigators can form links between the surface and dark web by exploiting users’ own poor anonymity measures, such as leaking personal information while communicating with others via forums. One well-known example is drug dealer Carl Stewart, who was successfully prosecuted after his fingerprints were identified from a photograph of him holding a block of Stilton cheese, which he had posted to an encrypted messaging service.⁶

Everything from usernames to forum signatures and captions can be linked between the surface and dark web. Researchers can even use natural language processing (NLP) techniques to correlate how users use written language to communicate with their networks.
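One simple way to compare writing style is to measure the overlap of character n-grams between two texts. The sketch below is a minimal illustration of that idea, assuming character trigrams and cosine similarity. The sample posts are invented, the function names are hypothetical, and production stylometry relies on far richer features and much larger samples.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after lowercasing and
    collapsing whitespace."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors (0.0 to 1.0)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample posts, for illustration only.
surface_post = "tbh i dont trust escrow anymore, ships fast tho!! always use tracked post"
dark_post = "tbh escrow is a joke now, ships fast tho!! tracked post only"
unrelated_post = "Dear colleagues, please find attached the quarterly compliance report for your review."

same_style = cosine_similarity(char_ngrams(surface_post), char_ngrams(dark_post))
different_style = cosine_similarity(char_ngrams(surface_post), char_ngrams(unrelated_post))
print(f"similar style score:   {same_style:.2f}")
print(f"different style score: {different_style:.2f}")
```

A higher score suggests shared habits such as abbreviations, punctuation and spelling, but short texts produce noisy scores. Results like these are best treated as a prompt for further investigation rather than as evidence of common authorship.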

Limitations and Challenges

The challenges of utilising dark web data revolve around dark web access, which requires specialised tools and network configurations to remain anonymous and not expose the researcher’s identity. The dark web is also incredibly unstructured, and not indexed in the same way as the surface web. This makes navigation and finding information relevant to investigations very difficult. Other considerations include the risk of exposure to malware and to illegal or distressing content.

Many of these challenges can be navigated with specialised dark web OSINT tools. These enable safe, anonymous access to the dark web, allowing researchers to interrogate and analyse information in a careful, granular way, converting the dark web into a safe resource for investigations.

Investigations require multiple web sources

The modern internet consists of three layers: the surface web, deep web and dark web. Whilst these exist as independent entities, each source complements the others, and OSINT researchers can use the vertical links between them to further their investigations.

Here at Blackdot, our goal is to enable the safe and secure gathering of open source data (OSD) from the surface, deep and dark web. Modern OSINT investigations require seamless access to various OSINT sources — any missing source might mean a missing piece of the puzzle. Using the dark web safely and efficiently represents the final frontier for OSINT, providing a powerful means to switch between each layer of the web, mapping connections and following threads between disparate information.

Utilising the numerous data points that can be pulled from across the dark, deep, and surface web from a single platform requires a powerful OSINT solution. That’s why Videris was created, providing OSINT investigators with:

Data collection, analysis and visualisation in a single platform.
Anonymity and data security.
Intelligent automation to inform human decision making.

If you’re interested in finding out what Videris can do for your OSINT investigations, book a demo with us today.

¹What is the dark web? How to access it and what you’ll find

²White Paper: The Deep Web: Surfacing Hidden Value

³Who’s Afraid of the Dark? Hype Versus Reality on the Dark Web

⁴Terrorist Migration to the Dark Web

⁵Who Commits Crime On TOR? A New Analysis Has A Surprising Answer

⁶Cheese photo leads to Liverpool drug dealer’s downfall
