Scratching the Surface: Why alternative approaches to surface web search are a vital part of the OSINT investigator’s toolkit
By Stuart Clarke
In a world where 2.5 quintillion bytes of data are created every day, the internet is a key data source for almost every business. The use of this publicly available data as a source of intelligence is known as Open-Source Intelligence – or OSINT. With the ever-increasing volumes of data available online, it’s unsurprising that OSINT’s popularity is growing: many organisations are employing this technique for use cases such as competitor or market intelligence, due diligence checks and fraud investigations.
For regulated businesses, however, OSINT holds additional significance. Organisations such as financial institutions and law firms must adhere to complex legal requirements when taking on new clients, with the FCA requiring UK banks to perform enhanced due diligence checks ‘in situations that present a higher risk of money laundering’. These requirements are often fulfilled using basic internet searches through platforms like Google, but in many cases, such as AML investigations or the onboarding of high-risk clients, this is insufficient. A brief Google search should not be considered an enhanced due diligence investigation – in fact, it may not even suffice as a more basic check.
This is because search engines search only the surface web: a small proportion of all the data available online. The rest of that data, known as the deep web, is often ignored. Failure to make use of the deep web during high-risk investigations puts organisations at risk of missing key data and being accused of negligence. Search engine rankings, shaped by SEO and zero-click searching, push investigators towards the most popular results, whilst lesser-known entities are unlikely to feature. Add in the complexities of search localisation and multi-language searches, and the likelihood of finding all relevant information in a global investigation dwindles further. For financial institutions, these barriers could be the difference between spotting a case of terrorist financing and receiving vast regulatory fines for failing to do so.
Yet using the deep web is easier said than done. In this article, we’ll explore why surface web data is inadequate for many OSINT use cases, and how investigators can improve the way they search to find crucial data that can change the course of their investigations.
The problem with search engines
Search engines are a part of everyday life for most people. Their popularity is largely down to the fact that they make it easy to find information. Considering their ease of use, it might seem fair to expect that they do the same for investigators, but this is not always the case. Below are some key reasons why investigators should reconsider their approach to search engines:
Designed for consumers, not investigations
The purpose of search engines is not to answer business questions, but rather to make money by showing the result that the search engine thinks the searcher wants to see. This is useful for a consumer trying to buy the best product but can quickly become unhelpful when seeking out unbiased information about the subject of an investigation.
Take, for example, a search you might run as a consumer: ‘best restaurants in Cambridge’. If you are based in Cambridge or have already conducted similar searches, a search engine will try to predict your search for you – a concept known as ‘zero-click searching’. If it correctly predicts the search, you can press enter to see relevant results without clicking or finishing typing your search term.
As results appear, you’re likely to click on the first few options. If you have time, you might even look at some of the options on the second page of results. You will then, probably, decide which restaurant you want to go to. The whole decision-making process has been strongly influenced by the search engine you’ve used.
By deciding what you should search and showing you certain results first, the search engine has presented you with only a subset of options and directed you towards a particular choice of restaurant. Search Engine Optimisation (SEO) – the way that websites adapt their content to appear higher in search engine rankings and therefore attract more traffic – is not always reflective of quality, but rather of the organisation behind the site’s marketing spend. In this example, you’ll see the most popular restaurants first: possibly chains with large budgets and online presences. Smaller, more independent businesses are likely to be further down the rankings.
As a consumer, the search engine’s influence over this process isn’t necessarily negative. It makes for a smooth, user-friendly experience and is, above all, very quick. For an investigator, however, the search engine’s manipulation of results is far more likely to be problematic. If the information that’s most relevant is hidden on the 26th page of results, the investigator may struggle to identify it within the timeframes they are required to work in.
Results aren’t reliable
As we’ve already discussed, the first results that appear after your search are likely to do so for reasons that have less to do with their relevance, and more to do with SEO. If the investigator only reviews the first few pages of results, this can skew the direction of the investigation.
However, this isn’t the only way that the reliability of the results you see can be affected. In some cases, privacy laws can lead to the removal of relevant results. This does not always mean that the content has been removed from the internet: instead, it is no longer indexed by the search engine, so it is no longer available through surface web searches using that particular search engine. Unless the internet user in question has requested that their data be removed by each individual search engine – a lengthy process which they are unlikely to complete – the same data will still be available through alternative search engines. This is a great example of where the ability to search both multiple search engines and the deep web can be vital.
Another difficulty many investigators face is that the content of the internet is changing constantly. An investigator may identify a highly relevant result, then find that it has been changed or taken down a few hours or days later, leading to their assessment being deemed unreliable. This means that the investigator must save every piece of information they identify as important to an evidential standard, which is an additional and intensely time-consuming activity.
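To make the idea of saving content to an evidential standard more concrete, the following is a minimal sketch rather than a description of any particular tool. It assumes the investigator already has the raw bytes of a page and wants to record where and when they were captured, along with a hash that shows the saved copy has not changed since. The names `EvidenceRecord`, `capture` and `verify` are hypothetical and chosen for illustration only.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class EvidenceRecord:
    """Hypothetical record of one captured piece of web content."""
    url: str
    captured_at: str  # ISO 8601 timestamp, in UTC
    sha256: str       # hex digest of the captured bytes
    content: bytes


def capture(url: str, content: bytes, now: Optional[datetime] = None) -> EvidenceRecord:
    """Store the content together with its capture time and SHA-256 digest."""
    timestamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    digest = hashlib.sha256(content).hexdigest()
    return EvidenceRecord(url, timestamp.isoformat(), digest, content)


def verify(record: EvidenceRecord) -> bool:
    """Return True if the stored content still matches the recorded digest."""
    return hashlib.sha256(record.content).hexdigest() == record.sha256
```

A record like this does not, on its own, satisfy any specific legal standard; it simply illustrates the kind of metadata (source, time, integrity check) that has to be kept for every item an investigator relies on.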
Global nature of investigations
Very few of today’s investigations are restricted to one country. Typically, investigations into criminal networks will span numerous countries, meaning that an investigator needs to take into account many local factors.
As mentioned above, localisation and privacy laws can affect the results that are returned. However, the investigator’s choice of search engine is even more important in a global investigation. Different search engines may be better equipped to return results from different countries depending on local languages and preferences. For example, Google is the search engine of choice in many Western countries, but a multitude of other options boast higher popularity in other parts of the world. If an investigator wants to get a complete picture of a global network, they may need to search the same term in multiple different search engines to be certain that they haven’t missed anything.
Furthermore, searches as part of global investigations are likely to yield multi-language results. Unless the investigator speaks every language that results are written in, there will be a need to translate results. Clicking on each result and translating each one individually, only to find that the result in question isn’t relevant, is another time-intensive task that wastes valuable investigator time – and leaves room for error and misinterpretation.
Highly complex searches often necessary
SEO and localisation can make results less relevant, but the high volumes of information returned by search engines can make even relevant results difficult to identify. In order to restrict the number of results to just those that are of interest, investigators often have to construct complex Boolean searches.
For example, an investigator conducting an enhanced due diligence investigation may wish to search a subject’s name against a number of risk terms. To do this, they have to construct a search that covers all of their needs. Depending on the number of risk terms involved, more than one search may be needed to cover all bases. Once the investigator has constructed and run the search, they will likely still have a very large number of results left to review, which will be subject to all of the challenges we’ve already discussed. Plus, after all of this, the investigator has still only checked results on one search engine – a single walled garden of results. There is still no guarantee that important information is not available elsewhere.
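As a rough illustration of the kind of search described above, the sketch below builds Boolean queries that pair a subject's name with batches of risk terms. It is an assumption-laden example: the `AND`/`OR` operators, quoting and parentheses are supported differently by different search engines, the batch size of five is arbitrary, and the function names are hypothetical.

```python
from typing import Iterable, List


def quote(term: str) -> str:
    """Wrap a term in double quotes so it is treated as an exact phrase."""
    return '"' + term.replace('"', "") + '"'


def build_risk_queries(subject: str, risk_terms: Iterable[str], max_terms: int = 5) -> List[str]:
    """Build one Boolean query per batch of at most `max_terms` risk terms."""
    if max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    # Drop empty entries and surrounding whitespace.
    terms = [t.strip() for t in risk_terms if t.strip()]
    queries = []
    for start in range(0, len(terms), max_terms):
        batch = terms[start:start + max_terms]
        ored = " OR ".join(quote(t) for t in batch)
        queries.append(f"{quote(subject)} AND ({ored})")
    return queries


if __name__ == "__main__":
    for q in build_risk_queries("Jane Doe", ["fraud", "money laundering", "sanctions"], max_terms=2):
        print(q)
```

Even with a small list of risk terms, the number of queries grows quickly, and each one still has to be run and reviewed on every search engine the investigator wants to cover.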
An additional consideration for those investigating organised crime is searching the Dark Web. The Dark Web is accessible only through specialist browsers such as Tor, and criminals flock to it because it provides an extra layer of protection and anonymity. Unlike the rest of the internet, the Dark Web does not want to be found, so traditional search engines will not find anything at all, and potentially critical evidence is missed.
An alternative approach
As an investigator, it can feel like search engines are holding a monopoly on internet data. In spite of these challenges, search engines can appear to be the only option for investigators keen to make use of open-source data. But alternatives do exist. Increasingly, technology is available that helps investigators circumvent the challenges of search engines, allowing them to get to the information they need faster, and leaving them more time to analyse the data in question.
Videris, our OSINT investigations platform, is one example of technology that enables collaboration and accelerates investigations. The following features can help investigators to find the right information quickly, in turn improving investigation efficiency and accuracy.
Although search engines bring back very large numbers of results, they are, counterintuitively, walled gardens of data because they only index a subset of what’s available on the internet. To be absolutely certain that they are reviewing all relevant data, investigators need to collect from the widest possible selection of sources, including a range of search engines. Naturally, this takes the investigator a considerable amount of time to do manually. This is where Videris comes in – by allowing them to search multiple sources simultaneously and bring results back into a single location, it reduces the time spent on manual search and allows for easier review. Crucially, Videris also collects deep web data such as corporate records so that investigators can easily access the wealth of information that is less accessible through search engines. Enriching search engine results with deep web data provides a treasure trove of additional intelligence and contextual information such as historic and current company officer data, and related entities.
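The general idea of bringing results from several sources into one place can be sketched as follows. This is not how Videris works internally, which the article does not describe; it is a simplified illustration that assumes each source has already returned a list of URLs, and that two URLs differing only in scheme, a leading `www.`, a trailing slash or a fragment point to the same page. The source names and function names are hypothetical.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def normalise(url: str) -> str:
    """Reduce a URL to a comparison key: lowercase host, no scheme, fragment or trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def merge_results(results_by_source: Dict[str, Iterable[str]]) -> List[dict]:
    """Merge per-source URL lists into one list, recording which sources returned each page."""
    merged: Dict[str, dict] = {}
    for source, urls in results_by_source.items():
        for url in urls:
            # The first URL seen for a page is kept as its display URL.
            entry = merged.setdefault(normalise(url), {"url": url, "sources": []})
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())
```

Recording which sources returned each page is a deliberate choice in this sketch: a result found by only one source is exactly the kind of item that a single-engine search would have missed.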
Videris also works with data partners in order to provide secure access to Dark Web data. By accessing this data through a single, secure platform, investigators are able to find relevant information about criminal networks in the least transparent areas of the internet without the need for additional browsers or security measures.
Searching multiple sources isn’t helpful unless you can sort them to find what you need, and scrolling through thousands of results isn’t an efficient way to achieve this. Videris filters and categorises data so that it’s easy to drill down to topics of particular interest, such as risk-related content. These filters also allow the investigator to filter on the primary language returned in the results and to identify key entities such as person names, locations and business names.
When a subject of interest has a common name, it can be almost impossible to work out which results relate to them, unless you have enough context already. Videris allows the investigator to enter all of the information they already have about a person, then uses this context to show the results that are most relevant first. Furthermore, while an investigator is exploring web content in the secure Videris Browser, known intelligence and entities are automatically highlighted on the page so that they can be easily identified. Contextual ranking and highlighting mean that the investigator no longer has to guess which results are relevant, or waste time reviewing them all.
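One very simple way to picture contextual ranking is to score each result by how many pieces of known context it mentions, then sort by that score. The sketch below is an illustrative assumption, not Videris's ranking method, which the article does not specify. It uses plain case-insensitive substring matching, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def context_score(text: str, context_terms: Iterable[str]) -> int:
    """Count how many distinct known context terms appear in the text (case-insensitive)."""
    lowered = text.lower()
    distinct = {t.lower() for t in context_terms}
    return sum(1 for term in distinct if term and term in lowered)


def rank_by_context(snippets: List[str], context_terms: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (score, snippet) pairs, highest score first."""
    terms = list(context_terms)
    scored = [(context_score(s, terms), s) for s in snippets]
    # sorted() is stable, so snippets with equal scores keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    results = [
        "John Smith wins local bake-off",
        "John Smith, director of Acme Ltd in Leeds",
        "Acme Ltd announces results",
    ]
    for score, snippet in rank_by_context(results, ["Acme Ltd", "Leeds", "director"]):
        print(score, snippet)
```

Substring matching of this kind is crude (it would match "Leeds" inside an unrelated word, for instance), so it should be read only as a picture of the principle: the more context the investigator supplies, the easier it is to separate the right John Smith from the wrong ones.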
We’ve already discussed the frustrations investigators can experience around disappearing evidence that has not been saved or captured. In Videris, all of the user’s activities are logged and sourced. Web content is preserved, automatically saved and added to the investigator’s report, avoiding a situation where they are unable to produce evidence for their recommendations.
The case for an investigator-centric approach
The aim of using technology to enhance search in OSINT investigations is to reduce the time the investigator spends on manual tasks where they can add little value. In contrast to technologies like AI that seek to automate and make decisions, Videris allows the investigator to retain control and decision-making powers throughout the investigation. Instead of spending time constructing complex searches, noting down sources and reviewing endless irrelevant results, the investigator can prioritise thorough analysis and effective recommendations.
The relevance of this approach extends beyond search engines: it’s also useful when analysing the multitude of other sources that might be used as part of an OSINT investigation. Crucially, allowing the investigator to take centre stage also reduces the chance of becoming a victim of disinformation: search engines may not be able to tell that something is fake, but a skilled investigator is more likely to be able to make this assessment. Using several sources is also helpful when trying to verify information – which is made easier through Videris’ ability to search multiple sources at once.
Search engines are incredibly useful and the case for and against their use is highly nuanced. There’s little doubt that investigators are aware of the limitations and the need to adopt a more defensible and holistic approach to search. By augmenting the investigator’s abilities with technology, organisations can get the best of both worlds: improved efficiency and accuracy when accessing the surface web data that search engines do index, alongside the vital deep web data they don’t.