The AI industry is converging on a single vision: autonomous agents that operate across your entire digital environment. Microsoft's Copilot can take control of your mouse and keyboard. OpenAI's Frontier platform promises "AI co-workers" that log into applications and execute tasks with minimal human involvement.
The pitch is compelling: delegate your work to AI, supervise from above and watch productivity multiply. But beneath the marketing lies a tension that deserves more scrutiny. General-purpose AI agents with broad system access and the ability to take autonomous action face architectural security challenges. Understanding the nature of those challenges is essential for any organisation evaluating how to adopt AI responsibly, but especially in the complex regulatory and investigative environment in which Blackdot operates.
This article examines the problems that general-purpose agents introduce and considers the questions that any organisation handling sensitive data should ask in order to mitigate these risks.
The security problem: When the attack surface is language itself
Traditional software security works because you can define boundaries. For example, an API accepts specific data types in specific formats, or a function validates its inputs against a known schema. These boundaries are imperfect, but they are enumerable: you can reason about the attack surface, test against it and patch known vulnerabilities.
Where AI agents are concerned, the nature of the attack surface is categorically different. When an AI agent processes natural language, its input is unbounded, ambiguous and impossible to validate against a schema. Every document, email or webpage it processes is a potential injection vector. Unlike a SQL injection, where the fix is parameterised queries - a deterministic solution to a well-defined problem - prompt injections operate in a space where the legitimate input and the adversarial input both consist of natural language. There is no equivalent of input sanitisation when the input is, by design, free-form human language that the model must interpret and act on.
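The contrast can be made concrete. The following is a minimal, illustrative sketch (Python with an in-memory SQLite table; the table and payload are invented for the example) of why parameterised queries are a deterministic fix: the driver binds the input strictly as data, so it can never be interpreted as code. There is no analogous separation for a prompt, where instructions and data arrive through the same natural-language channel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

malicious = "x' OR '1'='1"

# Vulnerable: user input is spliced into the query string, so the
# payload is interpreted as SQL and the condition matches every row.
vulnerable = conn.execute(
    f"SELECT name FROM users WHERE name = '{malicious}'"
).fetchall()

# Parameterised: the driver binds the input as pure data,
# so the same payload matches nothing.
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious,)
).fetchall()

print(vulnerable)  # [('alice',)]
print(safe)        # []
```

The boundary here is enforced by the database driver, not by the developer's vigilance. That is precisely the kind of structural guarantee that free-form language input does not admit.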
The evidence from production systems supports this concern. In January 2026, security researchers at Varonis disclosed an attack called Reprompt that turned Microsoft Copilot into a silent data exfiltration tool with a single click on a legitimate Microsoft link. The bypass was remarkably simple: Copilot's safeguards only checked for malicious content on the first prompt, not subsequent requests. The researchers simply instructed it to repeat each action twice. No plugins or user interaction with Copilot were required, and the attacker maintained control even after the Copilot chat was closed.
Once the initial injection succeeded, the attacker's server could issue follow-up instructions in an ongoing chain, asking Copilot questions like "summarise all files the user accessed today" or "where does the user live?", which were all invisible to client-side monitoring tools.
Microsoft patched this vulnerability, and credit is due for its responsible handling of the disclosure. But the trajectory matters more than individual incidents; the breadth and diversity of attacks being discovered suggests we are seeing the beginning of a much larger problem. Reprompt exploited a URL parameter and a missing check on subsequent requests. A separate attack, EchoLeak, used hidden instructions in email HTML and exploited a Content Security Policy allowlist. A further attack called ZombieAgent, disclosed by Radware in January 2026, implanted malicious rules directly into the long-term memory of OpenAI's ChatGPT. These are fundamentally different attack vectors exploiting different architectural properties, and each was discovered independently within a matter of months.
Perhaps most tellingly, OpenAI itself has conceded the point. In a December 2025 blog post about hardening its ChatGPT Atlas browser agent, the company wrote: "Prompt injection, much like scams and social engineering on the web, is unlikely to ever be fully ‘solved’".
The important question is whether complex defence mechanisms will reduce the practical risk over time to an acceptable level. They may, for some use cases, but the asymmetry is daunting: defenders must secure every possible interaction an agent might have, across every context, against every conceivable adversarial input. Attackers need only find one that works, and prompt injection attacks can be crafted by anyone who can write a sentence.
The oversight problem: You can't review what you don't understand
In most cases, the industry's answer to these risks is human oversight. Users are positioned as "supervisors" of their AI agents. Ars Technica recently described the emerging model as developers becoming "middle managers of AI - not writing the code or doing the analysis themselves, but delegating tasks, reviewing output, and hoping the agents underneath them don't quietly break things".
The problem is that meaningful review requires intimate knowledge of the system being reviewed. A developer can review code they wrote, or code written within an architecture they understand, because they carry a mental model of how the pieces fit together: what each component does and where the boundaries are. Autonomous agents erode this prerequisite, leaving the human supervisor to review work of which they have no such model.
To some extent, this is a challenge that applies to any complex automated system at scale. But a system with defined steps, structured transformations, and deterministic logic at its decision points lets the reviewer trace a specific output back through specific steps and verify each one. A general-purpose agent operating through natural language reasoning across multiple systems offers far fewer such handholds. The intermediate steps are probabilistic, the reasoning opaque, and the path from input to output may not be reproducible.
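To illustrate what those handholds look like, here is a hedged sketch of a deterministic pipeline that logs every intermediate step. The step functions and the watchlist rule are invented for the example; the point is only that re-running the same input reproduces the same trace, so a reviewer can verify each step in isolation.

```python
# Minimal sketch of a traceable pipeline: each step is deterministic and
# each intermediate result is recorded, so a reviewer can replay the exact
# path from input to output. Step functions are invented for illustration.
def normalise(record: dict) -> dict:
    """Lower-case and trim every field."""
    return {k: str(v).strip().lower() for k, v in record.items()}

def score(record: dict) -> int:
    """Deterministic rule: flag records mentioning a watchlisted term."""
    return 1 if "offshore" in record.get("notes", "") else 0

def run_pipeline(record: dict) -> tuple[int, list]:
    trace = [("input", record)]
    cleaned = normalise(record)
    trace.append(("normalise", cleaned))
    result = score(cleaned)
    trace.append(("score", result))
    return result, trace

result, trace = run_pipeline({"name": " ACME Ltd ", "notes": "Offshore entity"})
# Every step is inspectable; the same input always yields the same trace.
for step, value in trace:
    print(step, value)
```

A general-purpose agent offers no equivalent of this trace: its intermediate "steps" are samples from a language model, not replayable transformations.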
At the volume that autonomous agents produce, the reviewer cannot penetrate the output because they lack both the context to evaluate it and the time to acquire that context. This is an inherent consequence of delegating work to a system that operates opaquely at machine speed.
The liability gap
When a general-purpose agent causes harm, the question of accountability enters difficult territory. The vendor's terms of service place responsibility with the user, who didn't make the decision - the agent acted autonomously. In the case of attacks like Reprompt, the user may not even have been aware anything was happening.
For regulated industries where accountability must be clear and decisions must be auditable, this ambiguity is operationally disqualifying. The question "who is responsible for this outcome?" must have an answer. General-purpose agents, operating across system boundaries through probabilistic reasoning, make that answer difficult to construct.
The tension at the core
The fundamental challenge can be stated as a tension between three desirable properties:
- Broad capability: the ability to process natural language from diverse sources and take actions across multiple systems
- Autonomy: operating without meaningful human review of each action
- Security: resistance to adversarial manipulation through the data the agent processes
Increasing any one of these properties tends to compromise at least one of the others. A broadly capable, highly autonomous agent presents a larger attack surface which is harder to defend, but constraining that surface means reducing either capability or autonomy.
Whether this tension amounts to a hard impossibility or a very difficult engineering challenge is an open question with which the industry needs to engage proactively. The current trajectory of shipping increasingly capable, increasingly autonomous agents only heightens this need.
Four questions to ask before deploying AI agents
None of this means AI has no role in sensitive operations – it can be transformative for investigative work. But the architecture matters enormously; ‘move fast and patch later’ is not an acceptable strategy when the data at stake includes active investigations, regulatory obligations and people's lives.
Every organisation evaluating AI agents where sensitive data is at stake should be asking four key questions:
- What is the blast radius if this agent is compromised? An agent with access to a single, bounded system presents a containable risk, while an agent with access to email, file storage, communication tools and operational databases presents a catastrophic one.
- Can you trace any given output back to the inputs and reasoning that produced it? If the agent's reasoning is opaque and its intermediate steps are not reproducible, then you cannot defend its conclusions to a regulator or a court.
- Where does liability sit if something goes wrong? If the answer is unclear, the risk has not been properly allocated, regardless of what the vendor's terms of service say.
- Does the architecture constrain the AI to defined capabilities, or does it grant open-ended access? There is a substantial difference in risk profile between an AI system that can execute five specific, validated operations and one that can do anything a human can do in a browser.
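The difference raised by that last question can be pictured as a constrained tool layer: the agent may only invoke operations from a fixed registry, and every argument is validated deterministically before anything runs. All of the names below (the registry, check_sanctions, the validation pattern) are hypothetical, not a real API; the sketch shows the shape of the control, not an implementation.

```python
# Illustrative sketch of a constrained tool interface. The agent can only
# request operations that appear in a fixed registry, and every argument
# must pass a deterministic validator first. All names are hypothetical.
import re

def check_sanctions(entity_id: str) -> str:
    # Placeholder for one specific, validated operation.
    return f"sanctions check queued for {entity_id}"

# Registry of permitted operations, each paired with an input schema.
REGISTRY = {
    "check_sanctions": (check_sanctions, re.compile(r"^[A-Z0-9-]{1,32}$")),
}

def dispatch(tool: str, argument: str) -> str:
    if tool not in REGISTRY:
        raise PermissionError(f"tool not in allowlist: {tool}")
    func, pattern = REGISTRY[tool]
    if not pattern.fullmatch(argument):
        raise ValueError(f"argument failed validation: {argument!r}")
    return func(argument)

print(dispatch("check_sanctions", "ACME-01"))
# Requests outside the schema fail closed, whatever the model "intended":
# dispatch("run_shell", "rm -rf /")         -> PermissionError
# dispatch("check_sanctions", "x' OR 1=1")  -> ValueError
```

However the model is manipulated, its blast radius is bounded by the registry: the worst it can do is invoke one of the permitted operations with a well-formed argument.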
For many organisations handling sensitive data, the honest assessment is that constrained, domain-specific AI architectures - with defined tool interfaces, structured inputs and outputs, deterministic decision logic and genuine human oversight at meaningful decision points - offer a better balance of capability and security than general-purpose agents do today.
This is the philosophy behind how we've built Videris. Videris Automate uses constrained AI within governed pipelines - language models doing what they excel at, with deterministic validation before any consequential action is taken. Videris Investigate gives analysts interactive tools to examine and verify every stage of an automated workflow, maintaining the intimate knowledge of their cases that meaningful oversight demands. Not because constrained AI is less ambitious, but because in the domains we serve, it's the only approach that's honest about the risks.