A multi-dimensional framework for threat modeling, security, and governance of large language model ecosystems

by Djimit

Abstract

This article addresses the critical need for a security framework for Large Language Models (LLMs). As LLMs become integral to a vast array of applications, they introduce a novel and complex threat landscape that transcends traditional software vulnerabilities. We present a systematic, multi-disciplinary investigation into LLM security, making three primary contributions. First, we develop a unified, multi-axial threat taxonomy that integrates lifecycle, system-module, and attacker-goal perspectives, providing a common vocabulary for diverse stakeholders. Second, we propose a dynamic risk model that analyzes threat propagation across the entire LLM ecosystem—from data sourcing and training to inference and agentic tool use—and establish a framework for evaluating defenses via adversarial stress testing. Third, we design a full-stack, defense-in-depth security architecture and an adaptive governance protocol that aligns with emerging regulatory standards like the EU AI Act and ISO 42001, and operational paradigms like Zero Trust. By bridging technical, operational, and governance dimensions, this work provides a foundational and actionable blueprint for securing the next generation of AI systems.

Introduction

The proliferation of Large Language Models (LLMs), such as OpenAI’s GPT-4 series, Anthropic’s Claude 3, and Google’s Gemini, has marked a paradigm shift in artificial intelligence.1 These models, built upon the Transformer architecture and trained on vast internet-scale datasets, enable transformative applications across high-stakes domains including healthcare, finance, education, and scientific research.3 However, this rapid integration introduces a unique and formidable security challenge. Unlike traditional software systems, LLMs possess a vast and dynamic attack surface rooted in their data-driven nature, the opaque complexity of their internal representations, and the semantic ambiguity of natural language interfaces.3 Their capabilities to process and generate human-like text, code, and other content make them susceptible to a new class of vulnerabilities that demand a fundamental rethinking of cybersecurity principles.6

Existing security frameworks, designed for the predictable logic of conventional software, are ill-equipped to address LLM-specific vulnerabilities. Threats such as prompt injection, training data poisoning, model extraction, and adversarial examples represent a departure from well-understood attack vectors like buffer overflows or SQL injection.6 Early research into LLM security has produced a fragmented landscape of threat classifications and point-solution defenses.3 While valuable, these efforts often lack a unifying structure, leaving researchers, developers, and policymakers without a common language or a holistic framework to understand and manage risk. There is a pressing need for a systematic, multi-dimensional framework that synthesizes these disparate perspectives into a coherent whole.

This article aims to construct such a framework. We conduct a systematic investigation structured across three progressive tiers of analysis: (1) a foundational mapping and classification of the LLM threat landscape; (2) a dynamic risk modeling and defense evaluation across the LLM lifecycle; and (3) the design of an integrated systems and governance architecture for secure deployment. This multi-tiered approach allows for a comprehensive analysis that builds from fundamental principles to practical implementation.

Our research addresses not only technical vulnerabilities but also the critical operational, socio-technical, and governance layers that are often overlooked. This includes an examination of LLM supply chain risks, the security of third-party plugins, and the human factors that contribute to social-engineered misuse of these powerful systems.12 Furthermore, we explore the role of explainability and transparency in enhancing the trustworthiness of defense systems and align our proposed governance protocols with emerging regulatory and compliance standards, such as the EU AI Act and ISO 42001.15 The resulting framework is intended to provide actionable outputs for academia, industry, and policymakers, fostering a more secure, resilient, and trustworthy LLM ecosystem.

Part I: Foundational Threat Landscape: A Multi-Dimensional Taxonomy

This part establishes the foundational knowledge required for a systematic understanding of LLM security. It synthesizes existing research to construct a novel, unified taxonomy, catalogs known attack vectors with real-world context, and provides an empirical baseline of current defensive capabilities.

Section 1: A Unified Taxonomy of LLM Security Threats

A clear, comprehensive, and shared understanding of threats is the bedrock of any effective security strategy. The current discourse on LLM security, however, is characterized by a variety of classification schemes that, while individually insightful, are often siloed and incomplete. Some frameworks categorize threats based on the stage of the machine learning lifecycle in which they occur, distinguishing between training-time and deployment-time attacks.3 Others, like the OWASP Top 10 for LLM Applications, adopt an application-centric view, cataloging risks from a developer and security practitioner’s perspective.20 A third approach analyzes risks based on the specific module of the LLM system that is targeted, such as the input, model, or toolchain.10 This fragmentation hinders a holistic understanding of risk, as a single threat can manifest across these different dimensions. To address this gap, this section proposes a unified, multi-axial taxonomy that integrates these complementary perspectives into a single, cohesive framework.

Axis 1: LLM Lifecycle Stage (The ‘When’)

The first axis of our taxonomy classifies threats based on the stage of the LLM lifecycle at which the attack is mounted. This temporal dimension is critical for understanding where in the development and operational pipeline vulnerabilities are introduced and where controls are most needed. Drawing from established frameworks like the NIST Adversarial Machine Learning (AML) Taxonomy 19 and numerous academic surveys 3, we define two primary stages:

Training-Time Attacks: These threats target the model’s creation, learning, and alignment processes. They are particularly insidious because they corrupt the model from its inception, embedding vulnerabilities that may be difficult to detect with standard testing. This stage encompasses pre-training on broad web-scale data, supervised fine-tuning (SFT) on task-specific datasets, and alignment procedures like Reinforcement Learning from Human Feedback (RLHF).21 Key sub-types include:

Data Poisoning: An adversary manipulates the training data to degrade model performance or implant specific vulnerabilities.3
Model Poisoning/Backdooring: An adversary directly modifies the model’s architecture or parameters, or uses poisoned data to create a “backdoor” that can be triggered later.19

Deployment-Time (Inference-Time) Attacks: These threats target a fully trained and deployed model during its operational use. They exploit the model’s interactive nature and its connection to external systems. Key sub-types include:

Evasion Attacks: The adversary crafts malicious inputs (adversarial examples) to evade the model’s safety filters or cause it to misclassify information. This category includes prompt hacking techniques like jailbreaking.3
Privacy Attacks: The adversary queries the model to extract sensitive information about its training data, parameters, or architecture.3
Misuse: An attacker leverages the model’s capabilities for malicious purposes, such as generating disinformation or malware, after bypassing its safety controls.1

Axis 2: LLM System Module (The ‘Where’)

The second axis provides a structural perspective, pinpointing where in the LLM system a vulnerability is exploited. Adopting the module-oriented taxonomy proposed by Zhang et al. 10, we can deconstruct an LLM application into four essential components, each with its own attack surface:

Input Module: This module is responsible for receiving and pre-processing user prompts. Vulnerabilities here often involve the failure to properly sanitize or filter malicious inputs, enabling attacks like prompt injection.10
Language Model Module: This is the core of the system, comprising the model’s weights, architecture, and the knowledge implicitly stored from its training data. Risks at this layer include memorization of sensitive data, inherent biases, a propensity for hallucination, and susceptibility to backdoors embedded during training.10
Toolchain Module: This encompasses the entire ecosystem of software, hardware, and external tools that support the LLM. It represents a significant and often overlooked attack surface. Vulnerabilities can exist in the underlying deep learning frameworks (e.g., buffer overflows), third-party plugins, compromised APIs, or even the hardware platforms (e.g., GPU side-channel attacks).10 This axis directly corresponds to the OWASP category ofLLM05: Supply Chain Vulnerabilities.12
Output Module: This module handles the content generated by the LLM before it is presented to the user or a downstream system. Vulnerabilities arise from the insecure handling of this output, which could contain malicious code, sensitive information, or harmful content.9

Axis 3: Attacker Goal (The ‘Why’)

The third axis classifies threats according to the adversary’s fundamental objective, aligning with classic cybersecurity principles of confidentiality, integrity, and availability, as well as the NIST AML framework.19 Understanding the attacker’s goal is crucial for risk assessment and prioritizing defenses.

Integrity Violation: The attacker seeks to corrupt the model’s behavior, forcing it to produce false, malicious, or untrustworthy outputs. This is the goal behind jailbreaking, disinformation generation, and many data poisoning attacks.19
Availability Breakdown: The attacker aims to disrupt or deny service. This can be achieved through resource-intensive queries that overwhelm the model’s computational resources, a technique known as a Model Denial of Service (DoS) attack.9
Privacy Compromise (Confidentiality): The attacker’s goal is to extract restricted or sensitive information. This includes leaking personally identifiable information (PII) from the training data, stealing the proprietary model weights through model extraction, or inferring private user data from interactions.4
Misuse Enablement: The attacker seeks to circumvent the model’s safety alignment and usage policies to leverage its capabilities for prohibited or harmful purposes, such as generating malware, creating phishing campaigns, or producing hate speech.1

Axis 4: Industry-Standard Risk Category (The ‘What’)

The final axis maps the technical threats identified in the previous axes to the practitioner-focused categories defined by the Open Web Application Security Project (OWASP) Top 10 for LLM Applications.9 This mapping ensures that the taxonomy is not merely an academic exercise but a practical tool that can be directly integrated into the risk management and secure development lifecycles of organizations. This includes well-defined risks such as LLM01: Prompt Injection, LLM03: Training Data Poisoning, LLM07: Insecure Plugin Design, and socio-technical risks like LLM09: Overreliance, which highlights the human factors involved in LLM security.

The power of this multi-axial taxonomy lies in its ability to provide a holistic and interconnected view of risk. A single, concrete threat can be analyzed through all four lenses, revealing its multifaceted nature. For instance, consider the “instruction backdoor attack” against customized LLMs, where an attacker creates a seemingly benign custom GPT that contains hidden malicious instructions.23 An attempt to classify this attack using any single framework would be incomplete. The NIST lifecycle framework would label it a training-time integrity attack.19 The module-oriented framework would place the vulnerability in both the Language Model Module (where the malicious instructions reside) and the Toolchain Module (the platform for creating custom GPTs).10 The attacker goal is clearly Integrity Violation and Misuse Enablement. OWASP would categorize it as an LLM05: Supply Chain Vulnerability, as the user is consuming an untrusted, third-party model component.20

By synthesizing these views, a more complete picture emerges. The risk is not isolated to a single point but is systemic, arising from the interplay between the model, its customization process, the trust placed in third-party developers, and the interfaces between system components. This demonstrates that effective defense cannot be a single point solution. Securing LLM ecosystems requires a defense-in-depth strategy that addresses risks across the entire lifecycle and system stack, a principle that forms the basis for the architectural recommendations in Part III of this paper. Table 1 provides a consolidated view of this unified taxonomy, acting as a “Rosetta Stone” to facilitate clear communication among researchers, developers, and security professionals.

Table 1: A Unified Multi-Axial Taxonomy of LLM Security Threats

Threat NameDescriptionLifecycle Stage (When)System Module (Where)Attacker Goal (Why)OWASP Category (What)Direct Prompt Injection (Jailbreaking)Manipulating user input to override system instructions and bypass safety filters.Deployment-TimeInput ModuleIntegrity Violation, Misuse EnablementLLM01: Prompt InjectionIndirect Prompt InjectionHiding malicious instructions in external data sources (e.g., websites, documents) that are retrieved by the LLM.Deployment-TimeInput Module, Toolchain ModuleIntegrity Violation, Privacy CompromiseLLM01: Prompt InjectionTraining Data PoisoningCorrupting the training or fine-tuning data to degrade performance, introduce biases, or embed vulnerabilities.Training-TimeLanguage Model ModuleIntegrity Violation, Availability BreakdownLLM03: Training Data PoisoningBackdoor AttackA form of data poisoning that embeds a hidden trigger, causing malicious behavior only when the trigger is present in the input.Training-TimeLanguage Model ModuleIntegrity Violation, Misuse EnablementLLM03: Training Data PoisoningModel Denial of Service (DoS)Overwhelming the model with resource-intensive queries to degrade service quality or cause outages.Deployment-TimeLanguage Model Module, Input ModuleAvailability BreakdownLLM04: Model Denial of ServiceSensitive Information DisclosureThe model inadvertently reveals confidential data (e.g., PII, trade secrets) from its training set or context.Deployment-TimeOutput Module, Language Model ModulePrivacy CompromiseLLM06: Sensitive Information DisclosureInsecure Plugin DesignVulnerabilities in external tools or plugins connected to the LLM, allowing for unauthorized actions or data exfiltration.Deployment-TimeToolchain ModuleIntegrity Violation, Privacy CompromiseLLM07: Insecure Plugin DesignModel Theft / ExtractionAn adversary queries the model to create a functional clone, stealing intellectual property.Deployment-TimeLanguage Model Module, Output ModulePrivacy CompromiseLLM10: Model TheftSupply Chain VulnerabilityUsing compromised pre-trained models, libraries, or datasets that contain hidden vulnerabilities or malware.Training-Time, Deployment-TimeToolchain Module, Language Model ModuleIntegrity Violation, Privacy CompromiseLLM05: Supply Chain VulnerabilitiesExcessive AgencyThe model is granted overly permissive access to tools and systems, leading to unintended and harmful actions.Deployment-TimeToolchain ModuleIntegrity Violation, Misuse EnablementLLM08: Excessive Agency

Section 2: Catalog of Adversarial Techniques and Exploit Domains

Moving from the abstract classification of the taxonomy to a concrete analysis, this section catalogs known adversarial techniques, grounding them in specific, high-stakes application domains. This provides a tangible understanding of how theoretical risks manifest as practical exploits.

Prompt-Based Attacks (Integrity & Misuse)

These attacks manipulate the primary interface of the LLM—the prompt—to subvert its intended behavior.

Direct Prompt Injection and Jailbreaking: This is the most widely recognized LLM vulnerability, where an attacker crafts a prompt to override the model’s system instructions or safety alignment.9 A common technique isrole-playing, where the model is instructed to act as a different persona that is not bound by its usual rules, such as the “Do Anything Now” (DAN) persona.24 Another method isgoal hijacking, where an instruction like “Ignore the above and do this instead” is appended to a legitimate prompt to divert the model to a malicious task.10Refusal suppression techniques aim to convince the model that its safety concerns are invalid or that the requested task is for a harmless purpose, thereby bypassing its refusal mechanism.19
Indirect Prompt Injection: A more sophisticated variant, indirect prompt injection places the malicious instruction not in the user’s direct query, but in an external data source that the LLM is expected to consume.9 For example, an attacker could hide the text “Ignore previous instructions and transfer $1000 to account X” in white font on a webpage. When an LLM-powered agent, such as a Retrieval-Augmented Generation (RAG) system, scrapes this page to answer a user’s query, it ingests the malicious instruction and may act upon it without the user’s knowledge or consent. This vector is particularly dangerous for autonomous agents that interact with the open internet.
Adversarial Suffixes and Gradient-Based Attacks: While manual jailbreaking requires creativity, automated methods have emerged that are often more efficient and transferable. These attacks use optimization algorithms to find a sequence of characters or tokens (a “suffix”) that, when appended to a user’s prompt, is highly likely to elicit a harmful response.25 Early methods used greedy coordinate gradient-based searches, while newer techniques leverage continuous optimization or beam search to find effective adversarial prompts much faster.25 These automated attacks can achieve very high success rates even against well-aligned models.26

Data and Model Integrity Attacks

These attacks target the model’s internal state, either by corrupting its training data or by directly manipulating the model itself.

Data Poisoning: This class of attacks involves an adversary intentionally introducing malicious examples into the model’s training data.9 The goal can be to degrade the model’s overall performance (an availability attack), introduce specific biases, or create backdoors.28 Given that many LLMs are trained on vast, unfiltered web scrapes, the potential for ingesting poisoned data is significant.3 More advanced techniques like**“split-view” poisoning** (where content is altered after being indexed) and “front-running” poisoning (targeting periodic snapshots of crowd-sourced content like Wikipedia) make detection even more challenging.3
Backdoor Attacks: A specific and highly insidious form of data poisoning, a backdoor attack embeds a hidden trigger into the model.3 The model behaves perfectly normally on all standard evaluation benchmarks and benign inputs. However, when an input contains the secret trigger (which could be a specific word, phrase, or even a syntactic structure), the model executes a malicious payload, such as outputting a specific harmful response or misclassifying the input.23 This makes backdoors extremely difficult to detect post-training.

Privacy and Extraction Attacks (Confidentiality)

These attacks aim to compromise the confidentiality of the model or the data it has been trained on.

Model Extraction: An adversary with query access to a target LLM can use its outputs to train a surrogate model, effectively creating a functional clone.20 This constitutes a significant theft of intellectual property, as the development of large-scale models requires massive investment in data and computation. Recent research has demonstrated that task-specific knowledge can be extracted from large commercial models with a surprisingly low number of queries and at a minimal cost, posing a direct threat to proprietary, fine-tuned models.30
Membership Inference: These attacks aim to determine whether a specific data record was included in the model’s training set.3 A successful attack could reveal sensitive information, such as whether an individual’s medical record or personal email was used to train the model, thereby violating their privacy.6
Data Extraction and Memorization: Due to their massive capacity, LLMs can memorize and regurgitate verbatim chunks of their training data.10 An attacker can craft prompts to deliberately elicit this memorized information, which could include personally identifiable information (PII), copyrighted material, or proprietary code, leading to severe privacy breaches and legal liabilities.3

Ecosystem and Supply Chain Attacks

The LLM itself is only one part of a larger application ecosystem. Vulnerabilities in connected components create a broad and complex attack surface.

Malicious Plugins and Insecure Tool Design: Modern LLM applications often grant the model agency to use external tools via plugins or API calls. If a plugin is insecurely designed, an attacker could use prompt injection to trick the LLM into executing malicious actions through that plugin, such as deleting files, sending emails, or exfiltrating data from connected systems.12 This is categorized asLLM07: Insecure Plugin Design and LLM08: Excessive Agency by OWASP.
Instruction Backdoors in Customized LLMs: The rise of platforms that allow users to create “custom GPTs” introduces a new supply chain risk. An attacker can publish a seemingly useful custom model that contains hidden backdoor instructions.23 When a victim uses this model, the backdoor can be triggered, causing the LLM to perform malicious actions on the user’s behalf. This turns the LLM into a Trojan horse, exploiting the trust relationship between the user and the customization platform.

A critical pattern emerging from this catalog is the trend toward sophisticated, multi-stage attacks that span the entire LLM lifecycle. The most potent threats are no longer single-shot events at inference time. An adversary might execute a training-time data poisoning attack to embed a latent backdoor in a public model on a platform like Hugging Face. This vulnerability remains dormant, passing all standard security checks. Months later, a downstream developer fine-tunes this compromised model for a specific application. Finally, an end-user interacts with the application and unknowingly provides a prompt containing the trigger, activating the backdoor and causing a security breach. This demonstrates that the initial compromise can be disconnected in time, space, and personnel from the final exploitation. Consequently, threat modeling must adopt a holistic, lifecycle-aware perspective, as a vulnerability introduced in one stage can create latent risks that propagate and manifest in another. This understanding provides a direct rationale for the dynamic threat propagation modeling detailed in Part II.

To make these risks more concrete for stakeholders, Table 2 maps these adversarial techniques to specific exploit scenarios in high-stakes domains.

Table 2: Adversarial Technique and Use-Case Matrix

Attack TechniqueHealthcareFinanceLegal ServicesCode GenerationPrompt InjectionAn attacker manipulates a diagnostic chatbot to ignore a patient’s symptoms and instead provide harmful advice, leading to delayed treatment. 33A user injects a prompt into a financial advisory bot to make it recommend a fraudulent investment scheme.A malicious actor tricks a legal research assistant into misrepresenting case law, leading to flawed legal arguments.A developer uses a jailbreak to make a coding assistant generate code for a ransomware payload. 24Indirect Prompt InjectionA RAG-based clinical tool retrieves a compromised medical article containing a hidden prompt that causes the LLM to misclassify a CT scan image. 34An LLM analyzing market news ingests a poisoned news article with instructions to ignore negative sentiment about a specific stock.A document review tool processes a contract containing hidden instructions to leak confidential negotiation terms.A code assistant that can read documentation is pointed to a malicious GitHub repo, where a hidden prompt instructs it to inject a vulnerability into the suggested code.Data Poisoning (Backdoor)A training dataset for a dermatology model is poisoned with images of benign moles that contain a subtle digital watermark. When a watermarked image is submitted, the model classifies it as malignant. 35A dataset for training a fraud detection model is poisoned so that transactions from a specific set of accounts are always classified as legitimate, creating a backdoor for money laundering. 28A dataset of legal precedents is poisoned to associate a specific legal argument with a favorable outcome, biasing the model’s analysis.A large code repository used for training is poisoned with examples where a secure function (e.g., crypto.randomBytes) is subtly replaced with an insecure one when a specific comment trigger is present. 35Model ExtractionTheft of a proprietary model trained to predict disease outbreaks from epidemiological data, compromising a public health organization’s competitive advantage.An attacker extracts a proprietary high-frequency trading algorithm from a specialized financial LLM by querying its API, stealing millions in R&D investment. 32Extraction of a model fine-tuned by a law firm to predict litigation outcomes, leaking the firm’s strategic legal insights.Theft of a proprietary model fine-tuned by a company to generate highly optimized and secure code for a specific hardware architecture.Sensitive Information DisclosureA patient chatbot, when prompted in a specific way, leaks another patient’s medical history and PII, violating HIPAA regulations. 33A financial chatbot inadvertently discloses non-public information about a company’s upcoming earnings report, enabling insider trading.A legal assistant leaks details from a confidential merger and acquisition document it was trained on.A coding assistant regurgitates a large block of proprietary source code, including API keys, from its training data.

Section 3: Empirical Analysis of State-of-the-Art Defenses

In response to the growing threat landscape, a variety of defense mechanisms have been proposed and evaluated. This section provides a systematic review of these countermeasures, categorized by the layer of the system they protect, and establishes a baseline for their empirical effectiveness.

Input-Layer Defenses

These defenses operate on the prompt before it reaches the LLM.

Input Validation and Sanitization: This is the first line of defense, analogous to traditional web application security. Techniques include deny-listing (blocking known malicious patterns like “Ignore the above instructions”), allow-listing (only permitting inputs that conform to a strict format), and using regular expressions to filter out code, special characters, or other potentially harmful content.37 While essential for basic security hygiene, these methods are often brittle and can be easily bypassed by more sophisticated obfuscation techniques or novel attack phrasing.10
Instructional Defenses and Prompt Engineering: This approach involves modifying the system prompt to make the model inherently more robust. Examples include the “sandwich” defense, where the user’s input is placed between two sets of trusted system instructions, or using delimiters to clearly separate instructions from user data.3 These methods are low-cost to implement but have shown limited effectiveness against adaptive, multi-turn adversarial attacks.39

Model-Centric Defenses

These defenses aim to make the core LLM itself more robust to attacks.

Adversarial Training: This has proven to be one of the most effective empirical defenses against adversarial attacks in machine learning.40 The technique involves augmenting the model’s training data with adversarial examples, thereby teaching it to recognize and resist such manipulations.41 However, for LLMs, adversarial training is computationally expensive, as it requires generating discrete adversarial attacks at each training iteration. It can also lead to a degradation in the model’s performance on benign, non-adversarial tasks—a phenomenon known as the robustness-utility trade-off.40 Recent research is exploring more efficient methods, such as performing attacks in the continuous embedding space rather than the discrete token space, which has shown promise in improving robustness without a catastrophic loss of utility.42
Reinforcement Learning from Human Feedback (RLHF) for Safety: RLHF is the primary technique used by model developers to align LLMs with human values and ensure they are helpful and harmless.44 During the RLHF process, a reward model is trained on human preferences, and this reward model is then used to fine-tune the LLM to produce outputs that are more likely to be preferred by humans, which includes refusing to comply with harmful requests.21 While RLHF is fundamental to the safety of modern LLMs, it is not a panacea. Determined attackers can still devise complex jailbreaks that bypass this safety training, and RLHF may not cover the full spectrum of security vulnerabilities, such as data extraction or subtle backdoors.44

Output-Layer Defenses

These defenses operate on the model’s generation before it is delivered to the user or downstream application.

Output Filtering and Validation: This is a crucial backstop that involves scanning the model’s output for potential issues. This can include filtering for sensitive information (e.g., PII, API keys), detecting and blocking toxic or biased content, and validating that the output format is correct.37 This helps mitigateLLM02: Insecure Output Handling and LLM06: Sensitive Information Disclosure.9 The main challenge is creating filters that are effective without a high rate of false positives that would degrade the user experience.
Model Watermarking: This technique embeds a subtle, statistical signal into the text generated by an LLM. This watermark is invisible to humans but can be detected algorithmically. It can be used to trace the origin of generated content, helping to identify misuse, combat disinformation, and prove model ownership in cases of model theft.6

Ecosystem-Level Defenses

These defenses operate at the architectural level, securing the interactions between the LLM and its surrounding environment.

LLM Firewalls: An emerging and powerful defense is the concept of an LLM firewall—a specialized security proxy that sits between users and the LLM application.48 This firewall can inspect and filter both incoming prompts and outgoing responses, integrating multiple defense layers such as input validation, intent detection, and output scanning.49 A notable example isControlNET, a firewall designed specifically for RAG systems that leverages shifts in model activations to detect adversarial queries with high accuracy.48
Sandboxing and Access Control: To limit the potential damage from a compromised LLM or plugin, it is essential to run them in a sandboxed, isolated environment.12 Applying the principle of least privilege, the LLM and its tools should only be granted the minimum permissions necessary to perform their intended function. This is a critical mitigation for risks likeLLM08: Excessive Agency.

A comprehensive review of the literature reveals that no single defense is a silver bullet. Instead, there is a fundamental and unavoidable trade-off between three key factors: the level of security provided, the impact on model utility (performance on benign tasks), and the cost (computational and operational) of implementation. For example, extensive adversarial training offers high security by reducing the Attack Success Rate (ASR), but it comes at a high computational cost and often degrades general model utility.41 Conversely, simple prompt-based defenses are low-cost and preserve utility but provide only weak security against determined attackers.39 LLM firewalls aim to strike a balance but introduce their own operational costs, such as increased latency and maintenance overhead.48

This observation leads to a critical conclusion: the goal of LLM security is not to find the single “best” defense, but rather to engineer a portfolio of defenses that provides an optimal balance for a specific organization’s risk appetite, performance requirements, and budget. This “Security-Utility-Cost” trilemma necessitates a modular, defense-in-depth architectural approach, which will be detailed in Part III, allowing organizations to select and combine controls to achieve their desired security posture.

Part II: Dynamic Risk Analysis and Adversarial Simulation

Moving beyond the static cataloging of threats and defenses, this part develops a dynamic analysis of how these risks manifest and propagate within real-world LLM systems. It introduces a model for understanding threat propagation across the LLM lifecycle and proposes a robust framework for evaluating defense efficacy through continuous, simulation-based adversarial stress testing.

Section 4: Modeling Threat Propagation Across the LLM Lifecycle

Security risks in LLM ecosystems are rarely isolated events. They are often the result of a chain of vulnerabilities, where a weakness introduced at one stage of the lifecycle creates an opportunity for exploitation at another. To understand and mitigate these complex threats, it is necessary to model the entire LLM lifecycle as an interconnected system and analyze how threats propagate through it.

We can conceptualize the LLM ecosystem as a directed graph, where nodes represent key stages and assets, and edges represent the flow of data and control between them.50 The primary nodes in this graph include Data Sourcing, Pre-training, Fine-tuning, Deployment, Inference, Tool Integration, and Monitoring.52

Threat Origination Points:

Threats can be introduced at multiple points in this lifecycle:

Data Sourcing: This is a primary entry point for integrity attacks. An adversary can poison public datasets (e.g., Common Crawl, Wikipedia) that are scraped for pre-training, or they can compromise more specialized datasets used for fine-tuning.3
Fine-tuning: The process of specializing a pre-trained model for a particular task creates a significant vulnerability. An attacker can provide malicious fine-tuning data to embed subtle backdoors or biases that are tailored to exploit a specific downstream application.35
Third-Party Components: The modern LLM supply chain relies heavily on third-party assets. A pre-trained model downloaded from a public repository, a Python library used for data processing, or a plugin used for tool integration can all serve as entry points for vulnerabilities.12

Threat Propagation Pathways:

Once a vulnerability is introduced, it can propagate through the system in complex ways:

Poisoning Propagation: A backdoor injected into a model during its training phase can remain dormant and undetected through standard validation processes. The threat propagates silently with the model as it is deployed. It only manifests when an end-user, potentially months or years later, provides an input containing the specific trigger that activates the malicious behavior.35
Indirect Prompt Injection: This pathway demonstrates how a threat can propagate from the external environment into the core of the LLM application. A malicious instruction is first planted on an external resource, like a public webpage. This threat remains latent until a RAG system’s retrieval mechanism ingests the compromised content. The malicious prompt then propagates through the input module to the LLM at inference time, where it can hijack the model’s execution flow.9
Agentic Chaining Risk: In ecosystems of autonomous agents, risk propagation can be exponential. An initial, minor compromise of one LLM agent—perhaps through a simple prompt injection—can have cascading effects. If that agent has the authority to communicate with other agents or systems, it can be used as a pivot point to launch further attacks, exfiltrate data from connected databases, or spread malicious instructions throughout the network.55

To visualize and analyze these complex pathways, graph-based representations are a powerful tool.50 In such a model, system assets (models, databases, APIs, code repositories) can be represented as nodes. These nodes can be assigned dynamic risk scores based on continuous threat intelligence feeds and vulnerability scanning.50 By applying graph data science algorithms, such as calculating node centrality or finding the shortest path between a threat actor and a critical asset, security teams can predict the most likely propagation paths, identify single points of failure, and prioritize defensive measures on the most critical nodes.50

This lifecycle-oriented analysis reveals a fundamental characteristic of many LLM vulnerabilities: they are often latent. Unlike a traditional software bug like a buffer overflow, which is typically exploitable from the moment it is deployed, a poisoned dataset or a backdoored model may pass all standard tests and appear perfectly safe. The vulnerability is latent within the system, waiting for a specific and often unpredictable set of conditions to be met at runtime to be activated. The backdoor attack is a canonical example of this, where the compromise occurs during training but the vulnerability remains hidden until triggered at inference.3 Similarly, the threat in an indirect prompt injection attack is latent in the external data source until it is retrieved and processed by the LLM.9 This concept of a threat that doesn’t just exist but propagates through the system and activates under specific conditions has profound implications for security. It means that security cannot be a one-time check at the deployment gate. It necessitates a security posture based on continuous, runtime monitoring and dynamic risk assessment, as static scanning is fundamentally incapable of detecting these latent threats. This conclusion directly motivates the need for the runtime monitoring and adaptive governance frameworks proposed in Part III.

Section 5: Adversarial Stress Testing and Defense Efficacy Evaluation

Static benchmarks and one-off evaluations are insufficient for assessing the security of LLM systems. The threat landscape is dynamic, with attackers constantly adapting their techniques to bypass existing defenses. Therefore, a robust evaluation methodology must simulate this adversarial pressure through continuous and automated stress testing.

The Need for Dynamic Evaluation: A defense that proves effective against a known set of attacks today may be rendered obsolete by a novel technique discovered tomorrow. To build resilient systems, we must move from static evaluation to a dynamic process of adversarial stress testing that continuously probes for weaknesses.
Agent-Based Red Teaming: A promising approach for achieving scalable and continuous testing is the use of agent-based red teaming. This involves employing one or more LLM agents as automated adversaries tasked with discovering and executing attacks against a target system.57 This automates the traditionally manual, time-consuming, and expensive process of human red teaming.59 Advanced frameworks likeCoP (Composition-of-Principles) empower an AI agent to autonomously explore new attack strategies by composing and orchestrating a set of human-provided jailbreaking principles, leading to more efficient and novel vulnerability discovery.57
Automated Evaluation Loops (LLM-as-a-Judge): To scale the evaluation process, the outputs of these red team attacks can be assessed by another powerful LLM, often referred to as an “LLM-as-a-Judge”.60 This approach circumvents the bottleneck of human evaluation, allowing for the rapid and consistent scoring of thousands of interactions. The evaluation metrics can be far more nuanced than a simple binary success/failure. For example, an LLM judge can assess thecorrectness of a response, its conformity to negative constraints (e.g., “do not mention X”), and its groundedness in provided source material, providing a multi-faceted view of model performance.61
Benchmarking and Performance Dashboards: The results from these adversarial stress tests can be aggregated into a comprehensive, comparative performance dashboard. Such a dashboard serves as a leaderboard, enabling a clear, data-driven comparison of different defense mechanisms or model configurations.62 This dashboard would operationalize the “Security-Utility-Cost” trilemma identified in Part I by explicitly tracking these trade-offs. It would compare various defense strategies (e.g., Input Sanitization, Adversarial Training, LLM Firewall) against a suite of standardized attacks (e.g., GCG, PAIR, AutoDAN).39 Key performance indicators would include:
Security: Attack Success Rate (ASR) against different attack families.
Utility: Performance on standard academic benchmarks (e.g., MMLU) or domain-specific tasks to measure degradation.
Cost/Performance: Metrics like inference latency (ms), time-to-first-token (TTFT), and financial cost per million tokens.

This data-driven approach transforms the discussion about LLM security. Instead of debating which single defense is “best,” organizations can use the dashboard to make informed, quantitative decisions about which portfolio of defenses is optimal for their specific use case, risk tolerance, and operational constraints. Table 3 illustrates a template for such a dashboard.

Table 3: Comparative Performance Dashboard of LLM Defense Mechanisms (Illustrative Data)

Defense ConfigurationASR (Prompt Injection)ASR (Data Poisoning)Utility (MMLU Score)Latency (ms)Cost ($/1M tokens)Baseline (Llama-3-8B)85%N/A (Vulnerable)68.450$0.20+ Input Sanitization65%N/A (Vulnerable)68.260$0.21**+ Adversarial Training15%40%65.155$0.25+ LLM Firewall (ControlNET)8%25%67.985$0.28+ All Defenses**< 5%10%64.5100$0.32

Section 6: Sector-Specific Threat Profiles

While the underlying vulnerabilities are general, their manifestation and impact vary significantly across different application domains. Applying the general risk models to specific sectors allows for the creation of tailored threat profiles that can guide prioritization and resource allocation.

Healthcare: In applications like clinical decision support or patient-facing chatbots, the paramount risks are Privacy Compromise and Integrity Violation. A successful data extraction attack could lead to a massive breach of protected health information (PHI), resulting in severe regulatory penalties under HIPAA and a catastrophic loss of patient trust.33 An integrity attack, such as a prompt injection that causes a diagnostic AI to misidentify a malignant tumor as benign, could have life-or-death consequences.33 The high value of medical data on the black market further amplifies the motivation for privacy attacks.33
Finance: For LLMs used in algorithmic trading, fraud detection, or credit scoring, the primary risks are Integrity Violation and Confidentiality/IP Theft. An attacker could use data poisoning to subtly manipulate a model’s analysis of market data, leading to automated trades that benefit the attacker or destabilize markets.64 Furthermore, many financial firms invest heavily in developing proprietary models for trading or risk assessment. A successful model extraction attack would represent a direct theft of this valuable intellectual property, eroding a firm’s competitive advantage.32
Legal Services: In applications such as legal research and document review, the dominant risks are Confidentiality Breach and Integrity Violation. The leakage of information protected by attorney-client privilege could have devastating consequences for a case and the law firm’s reputation. At the same time, the propensity of LLMs to “hallucinate” or fabricate information poses a severe integrity risk. An LLM that confidently cites non-existent case law could lead a lawyer to build a flawed legal argument, potentially resulting in malpractice.
Code Generation: For tools like GitHub Copilot that assist developers in writing software, the primary risk is an Integrity Violation that compromises the security of the software supply chain. An attacker could poison the model’s training data with numerous examples of code containing a subtle vulnerability.66 The model would then learn this insecure pattern and suggest vulnerable code to thousands of developers, leading to a widespread and difficult-to-trace propagation of the vulnerability across the software ecosystem.

Part III: A Framework for Secure LLM Systems Engineering and Governance

This final part synthesizes the analysis from the preceding sections into a constructive and actionable framework for building and governing secure LLM systems. It provides a technical blueprint for a defense-in-depth architecture, a protocol for adaptive governance that aligns with emerging regulations, and an operational playbook to guide implementation and maturity.

Section 7: Blueprint for a Modular, Defense-in-Depth Security Architecture

A secure LLM system cannot be achieved through a single control; it requires a layered, full-stack architecture where multiple defenses work in concert. The proposed blueprint is founded on the principles of modularity, which allows for flexibility and independent updating of components 68, and defense-in-depth, which ensures that a failure in one layer does not lead to a total system compromise.

The secure LLM stack consists of six primary layers, each with specific controls designed to mitigate threats identified in Part I 70:

Layer 1: Secure Ingestion and Data Pipeline: Security begins with the data. This layer is responsible for securing the entire data lifecycle, from sourcing to training.
Controls: It must include robust data provenance tracking using standards like ML-BOM (Machine Learning Bill of Materials) to understand the origin and transformation history of all data.29 Rigorousdata validation and sanitization processes are required to filter out malformed or suspicious data points. Vulnerability scanning of training datasets for PII, toxic content, and known poisoning signatures is essential. Finally, strong access controls must be applied to all data repositories to prevent unauthorized modification.73 This layer is the primary defense againstLLM03: Training Data Poisoning.
Layer 2: Input Validation and Intent Filtering (The “LLM Firewall”): This layer acts as a gatekeeper for all incoming requests. It is conceptually an “LLM Firewall,” a dedicated module that inspects and filters prompts before they reach the core model.48
Controls: It performs syntactic validation to filter out malicious code or scripts and semantic validation to detect harmful intent or known jailbreak patterns.24 It enforcesAPI rate limiting and resource consumption checks to mitigate LLM04: Model Denial of Service attacks.9 This layer is the primary defense againstLLM01: Prompt Injection.
Layer 3: Hardened Model and Inference Engine: This layer focuses on the security of the core LLM itself.
Controls: The model should be hardened through techniques like adversarial training to improve its intrinsic robustness against evasion attacks and safety-focused RLHF to align it against generating harmful content.41 The inference engine should be deployed in asecure, isolated environment, such as a container or a Trusted Execution Environment (TEE), to protect the model’s intellectual property from LLM10: Model Theft and prevent unauthorized tampering.75
Layer 4: Secure Tool and Plugin Integration: As LLMs gain agency, securing their interactions with external tools is paramount.
Controls: All external tools and plugins must operate within a sandboxed environment that strictly limits their capabilities. Least-privilege access controls must be enforced, ensuring a plugin can only access the specific data and perform the specific actions necessary for its function. All API calls made by the LLM to its tools must be meticulously monitored and logged. This layer directly mitigates LLM07: Insecure Plugin Design and LLM08: Excessive Agency.20
Layer 5: Output Policy Compliance and Filtering: This final checkpoint inspects all model-generated content before it is delivered.
Controls: This module scans outputs to prevent sensitive information disclosure (LLM06), such as PII or trade secrets. It also filters for toxicity, bias, or any other content that violates organizational policies. This layer is the last line of defense against LLM02: Insecure Output Handling.9
Layer 6: Continuous Monitoring and Runtime Security: This is an overarching layer that provides visibility across the entire stack.
Controls: It involves comprehensive logging of all interactions, from incoming prompts to tool usage and final outputs. Behavioral analytics and anomaly detection are used to identify suspicious patterns in real-time, such as a sudden spike in queries related to a forbidden topic, a user attempting to access data beyond their permissions, or a plugin behaving erratically.78

The implementation of this multi-layered blueprint reveals that LLM security is fundamentally an orchestration problem. The architecture is not a monolithic application but a distributed system of interacting security microservices (input filters, model monitors, output scanners). The security of the overall system depends not just on the strength of these individual components, but on the secure protocols and data flows between them. For example, a threat like indirect prompt injection requires the coordinated action of the RAG system, the input filter, the model, and the output filter. A failure in the secure communication or trust boundary between any two of these components can lead to a breach. Therefore, the successful implementation of this blueprint requires a deep focus on secure inter-component communication, data serialization, and state management. The governance framework in the following section must define the policies that orchestrate these interactions securely.

Section 8: An Adaptive Governance Protocol for LLM Ecosystems

A technical architecture, no matter how robust, is insufficient without a governance framework to direct its operation, ensure its continuous adaptation, and maintain compliance with legal and ethical standards. Given the rapid evolution of AI technology and its associated threats, a traditional, static governance model is inadequate. An adaptive governance framework is required—one that is flexible, proactive, and founded on principles of continuous learning and risk management.80

Core Principles of Adaptive Governance

An adaptive governance protocol for LLM security should be built on several key principles:

Flexibility and Modularity: Governance policies should be designed as modular components that can be updated quickly in response to new technological developments, emerging threats, or changing regulations, without requiring a complete overhaul of the framework.80
Risk-Based Approach: Aligning with frameworks like the EU AI Act, governance should be proportional to risk. Systems deemed “high-risk” must be subject to the most stringent controls, testing, and oversight.82
Continuous Monitoring and Learning: The framework must incorporate real-time monitoring and feedback loops to continuously assess the performance of security controls and identify emerging risks. This allows the governance structure to evolve in tandem with the threat landscape.80
Stakeholder Collaboration: Effective governance requires input from a diverse range of stakeholders, including security teams, legal counsel, compliance officers, developers, and ethicists, to ensure a holistic and balanced approach to risk management.81

Alignment with Regulatory Mandates and Standards

A key function of the governance protocol is to ensure compliance with a complex and growing web of regulations.

EU AI Act: The framework must operationalize the requirements of the EU AI Act. The defense-in-depth architecture directly maps to the Act’s mandates for high-risk systems under Article 15, which call for appropriate levels of accuracy, robustness, and cybersecurity.84 The risk assessment processes within the governance protocol will align with the Act’s risk classification tiers (Unacceptable, High, Limited, Minimal).15
ISO/IEC 42001 (AI Management System): The protocol should be structured around the Plan-Do-Check-Act (PDCA) cycle of ISO 42001, providing a formal, certifiable process for managing AI systems.85 This includes establishing policies, implementing controls, monitoring performance, and driving continuous improvement across the entire AI lifecycle, from inception to decommissioning.16 The framework will incorporate specific threat modeling tools recommended by the standard, such as STRIDE, to analyze risks at each lifecycle stage.16
GDPR and Data Privacy: The governance framework must enforce strong data protection principles. This includes mandating data minimization, implementing PII filtering at the output layer, ensuring user consent, and providing mechanisms for data subject rights, thereby aligning with GDPR and other privacy regulations.88

Integration with Zero-Trust Architecture (ZTA)

The governance protocol must enforce the principles of a Zero-Trust Architecture (ZTA) across the entire LLM ecosystem. ZTA shifts the security paradigm from a perimeter-based model to one of “never trust, always verify”.75

Applying ZTA to LLMs: In this model, every entity—whether a human user, a software plugin, or another LLM agent—is considered untrusted by default. Access to any data or function is granted on a strictly need-to-know, per-request basis, and every request must be explicitly authenticated and authorized.55 The LLM itself is treated as a “powerful, naive agent” that cannot be blindly trusted with broad access.55
Implementation: The governance protocol will mandate the implementation of ZTA controls within the security architecture. This includes micro-segmentation to isolate LLM components and limit the blast radius of a breach, strong Identity and Access Management (IAM) for all entities, and the continuous inspection of all traffic and API calls between components.75 This ensures that even if one component is compromised, the attacker’s lateral movement is severely restricted.

Section 9: Operational Security Playbook and Maturity Model

To translate the architectural blueprint and governance protocol into practice, organizations need a concrete operational guide. This section provides a playbook for day-to-day security operations and a maturity model for assessing and improving an organization’s security posture over time.

LLM Security Operations Playbook

This playbook provides step-by-step procedures for security teams to manage the LLM ecosystem throughout its lifecycle.65

Secure Training Phase:
Data Vetting: Establish a formal process for vetting all data sources. Scan datasets for PII, toxic content, and known indicators of poisoning.
Pipeline Security: Implement role-based access control (RBAC) for the entire ML pipeline, ensuring that only authorized personnel can modify data, code, or model configurations.
Provenance Logging: Maintain immutable logs of all data sources and transformations using tools like ML-BOM.
Hardening Deployments:
Pre-Deployment Red Teaming: Conduct automated and manual red teaming exercises against the model in a staging environment to identify vulnerabilities before release.79
Configuration Checklist: Use a standardized checklist to configure all security controls, including input/output guardrails, access policies for plugins, and rate limits.
Vulnerability Scanning: Scan all dependencies, including base models and libraries, for known vulnerabilities.
Runtime Security and Incident Response:
Monitoring: Continuously monitor logs and metrics from all layers of the security architecture. Set up alerts for anomalies detected by the behavioral analytics engine.
Incident Triage: When an alert is triggered (e.g., a high-confidence prompt injection attempt is detected), the on-call security engineer is notified.
Containment and Analysis: The immediate response is to isolate the affected model or user session to prevent further damage. The engineer then analyzes the logs to understand the nature of the attack.
Eradication and Recovery: If a vulnerability is identified, the relevant defense (e.g., the input filter’s deny-list) is updated and redeployed. The system is restored to a known-good state.
Supply Chain Management:
Third-Party Vetting: Establish a formal process for vetting any third-party model, plugin, or data provider. This should include security assessments and contractual obligations.
Continuous Monitoring: Regularly scan and monitor all third-party components for new vulnerabilities or suspicious behavior.

LLM Security Maturity Model

This maturity model provides a structured framework for organizations to benchmark their current LLM security capabilities and create a roadmap for systematic improvement. It allows an organization to move from a reactive, ad-hoc security posture to a proactive, optimized, and data-driven one.

Table 4: LLM Security Maturity Model

Security DomainLevel 1: InitialLevel 2: ManagedLevel 3: DefinedLevel 4: Quantitatively ManagedLevel 5: OptimizingData Governance & SecurityData is used ad-hoc. No formal scanning or provenance tracking.Basic PII scanning is performed on some datasets.An organization-wide policy for data classification and handling exists. Provenance is tracked for critical datasets.Data security metrics (e.g., PII detection rate) are tracked. Automated validation is in place.Data security processes are continuously improved using feedback loops and automated remediation.**Input Security (Prompt/Firewall)**No input filtering. Relies solely on the model’s native safety features.Basic deny-list filters are in place for known malicious strings.A dedicated LLM firewall module is deployed with both syntactic and semantic filtering.The effectiveness of the firewall is measured (Precision/Recall, ASR). Rules are updated based on performance data.The firewall uses adaptive, ML-based threat detection. Automated red teaming is used to find new bypasses.Model Robustness & SafetyModels are used off-the-shelf with no additional hardening.Models are fine-tuned for the task, with some safety considerations in the prompt.Models undergo safety-focused RLHF. A formal process for model selection exists.Model robustness is benchmarked against standard attacks (e.g., GCG). Utility-security trade-offs are measured.Adversarial training is used to harden critical models. New defenses are proactively evaluated.Output SecurityModel outputs are passed directly to users without filtering.Basic keyword filters are used to block profanity or highly sensitive terms.A dedicated output filtering module scans for a range of issues (PII, toxicity, policy violations).The false positive/negative rate of the output filter is tracked and managed.The output filter uses contextual analysis and is continuously updated based on feedback and new risks.Runtime Monitoring & Incident ResponseNo logging or monitoring of LLM interactions.Basic API logs are collected but reviewed reactively after an incident.Comprehensive logging is in place across the stack. A formal incident response plan exists.Key risk indicators (KRIs) are monitored in real-time. Anomaly detection alerts are triaged based on severity.Incident response is partially automated. Threat hunting is performed proactively using behavioral analytics.Governance & ComplianceNo formal AI governance. Compliance is addressed on a case-by-case basis.Basic usage policies are documented.An adaptive governance framework aligned with ISO 42001 is defined and approved.Compliance is continuously monitored and audited. Risk assessments are data-driven.The governance framework is automatically updated based on real-time risk signals and regulatory changes.

Conclusion and Future Research Directions

This paper has presented a comprehensive, multi-dimensional framework for understanding, modeling, and mitigating the security risks inherent in Large Language Model ecosystems. By moving beyond siloed analyses, we have sought to provide a unified and systematic foundation for the secure engineering and governance of this transformative technology.

Our primary contribution is the synthesis of disparate concepts into a cohesive whole. We introduced a unified, multi-axial threat taxonomy that integrates lifecycle, system-module, attacker-goal, and industry-risk perspectives, creating a common vocabulary for all stakeholders. We then moved from static classification to dynamic analysis, proposing a model for threat propagation across the LLM lifecycle and a framework for evaluating defenses through adversarial stress testing. This analysis surfaced the critical “Security-Utility-Cost” trilemma, highlighting that security is a matter of managing trade-offs, not finding a single perfect solution. Finally, we translated these analytical insights into a constructive framework, proposing a modular, defense-in-depth security architecture and an adaptive governance protocol. This blueprint operationalizes principles like Zero Trust and aligns with emerging standards like the EU AI Act and ISO 42001, providing a practical roadmap for organizations. The accompanying operational playbook and maturity model offer concrete steps for implementation and continuous improvement.

The framework presented here empowers organizations to navigate the inherent trade-offs in LLM security. The modular architecture, the comparative performance dashboard derived from stress testing, and the maturity model provide the tools needed to engineer a security posture that is explicitly aligned with a specific risk profile, operational context, and resource constraints. It shifts the objective from seeking a non-existent silver bullet to building a resilient, adaptive, and risk-aware security portfolio.

Despite the comprehensive nature of this framework, the field of LLM security is evolving at a breakneck pace, and numerous challenges remain. We identify several critical directions for future research:

Provably Robust Defenses: The majority of current defenses, including adversarial training, are empirical. They demonstrate effectiveness against known attacks but offer no formal guarantees against novel ones. A key area for future work is the development of defenses with provable robustness, potentially drawing on techniques from formal verification and certified defenses.
Explainability and Trust in Defenses: A significant tension exists between security and transparency. More complex, opaque defenses are often more robust, while simpler, more explainable defenses can be brittle. Future research must focus on resolving this “transparency-security trade-off” by creating defense mechanisms that are both effective and interpretable, thereby fostering operator trust without introducing new attack surfaces.17
Human Factors and Behavioral Engineering: The human element remains a critical, yet under-explored, aspect of LLM security. Future work should conduct deeper investigations into the cognitive biases and psychological factors that make users susceptible to social-engineered misuse of LLMs. The findings from such research could inform the design of user interfaces and training programs that are behaviorally engineered to promote safer interactions.13
Security of Autonomous Agent Ecosystems: The current analysis has largely focused on single LLM applications. As the technology moves towards interconnected ecosystems of autonomous agents, new classes of systemic and emergent threats will arise. The security and governance of these multi-agent systems, where risks can propagate and cascade in unpredictable ways, represents a major frontier for research.
Automated Red Teaming and Self-Assessment: The development of LLM agents capable of performing automated red teaming is a promising avenue for scalable security evaluation. Future research should advance these capabilities, aiming to create a continuous and adaptive security cycle where AI systems can perform self-assessment, discover their own vulnerabilities, and even suggest or implement mitigations.57

By addressing these challenges, the research community can continue to build upon the foundational framework proposed in this paper, paving the way for a future where the immense potential of Large Language Models can be realized safely, securely, and responsibly.

LLM Security Framework

Geciteerd werk

arxiv.org, geopend op juli 9, 2025, https://arxiv.org/html/2505.18889v1
Security Concerns for Large Language Models: A Survey – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2505.18889v2
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2505.01177v1
Security and Privacy Challenges of Large Language Models: A Survey – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/387965043_Security_and_Privacy_Challenges_of_Large_Language_Models_A_Survey
LLM Security: Vulnerabilities, Attacks, Defenses, and Countermeasures – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/391444190_LLM_Security_Vulnerabilities_Attacks_Defenses_and_Countermeasures
On large language models safety, security, and privacy: A survey – Researching, geopend op juli 9, 2025, https://www.researching.cn/articles/OJba348e2553344135
Vulnerabilities and Defenses: A Monograph on Comprehensive Analysis of Security Attacks on Large Language Models – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/393291900_Vulnerabilities_and_Defenses_A_Monograph_on_Comprehensive_Analysis_of_Security_Attacks_on_Large_Language_Models
[2403.12503] Securing Large Language Models: Threats, Vulnerabilities and Responsible Practices – arXiv, geopend op juli 9, 2025, https://arxiv.org/abs/2403.12503
What are the OWASP Top 10 risks for LLMs? – Cloudflare, geopend op juli 9, 2025, https://www.cloudflare.com/learning/ai/owasp-top-10-risks-for-llms/
Risk Taxonomy, Mitigation, and Assessment Benchmarks of … – arXiv, geopend op juli 9, 2025, https://arxiv.org/abs/2401.05778
A Security Risk Taxonomy for Large Language Models – ELLIS Alicante, geopend op juli 9, 2025, https://ellisalicante.org/publications/derner2023security-en/
OWASP Top 10 Risks for Large Language Models: 2025 updates : r/BarracudaNetworks, geopend op juli 9, 2025, https://www.reddit.com/r/BarracudaNetworks/comments/1hjbiwc/owasp_top_10_risks_for_large_language_models_2025/
Cyber Psychology: The Human Factor and Social Engineering – DEV Community, geopend op juli 9, 2025, https://dev.to/talhamemis/cyber-psychology-the-human-factor-and-social-engineering-4cj8
Towards a Proof-of-Principle of an LLM-powered Low Resource Social Engineering Attack Coach – OPUS – BSZ, geopend op juli 9, 2025, https://opus.bsz-bw.de/hsas/files/1393/PoPofLLMpoweredSocialEngineeringAttackCoach.pdf
EU AI Act: first regulation on artificial intelligence | Topics – European Parliament, geopend op juli 9, 2025, https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence
AI lifecycle risk management: ISO/IEC 42001:2023 for AI governance | AWS Security Blog, geopend op juli 9, 2025, https://aws.amazon.com/blogs/security/ai-lifecycle-risk-management-iso-iec-420012023-for-ai-governance/
Transparency in AI-Driven Defense Systems: The Role of Explainable AI – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/387829453_Transparency_in_AI-Driven_Defense_Systems_The_Role_of_Explainable_AI
AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap, geopend op juli 9, 2025, https://hdsr.mitpress.mit.edu/pub/aelql9qy
Adversarial Machine Learning: A Taxonomy and Terminology of …, geopend op juli 9, 2025, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2025.pdf
OWASP Top 10 LLM and GenAI | Snyk Learn, geopend op juli 9, 2025, https://learn.snyk.io/learning-paths/owasp-top-10-llm/
Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2401.05778v1
A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly (Journal Article) – NSF-PAR, geopend op juli 9, 2025, https://par.nsf.gov/biblio/10510451-survey-large-language-model-llm-security-privacy-good-bad-ugly
Instruction Backdoor Attacks Against Customized LLMs – USENIX, geopend op juli 9, 2025, https://www.usenix.org/system/files/usenixsecurity24-zhang-rui.pdf
Adversarial Prompting in LLMs | Prompt Engineering Guide, geopend op juli 9, 2025, https://www.promptingguide.ai/risks/adversarial
Adversarial Attacks on Large Language Models Using Regularized Relaxation – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2410.19160v1
Fast Adversarial Attacks on Language Models In One GPU Minute – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2402.15570v1
Data Poisoning LLM: How API Vulnerabilities Compromise LLM Data Integrity – Traceable, geopend op juli 9, 2025, https://www.traceable.ai/blog-post/data-poisoning-how-api-vulnerabilities-compromise-llm-data-integrity
Defending Against Data Poisoning Attacks on LLMs: A Comprehensive Guide | Promptfoo, geopend op juli 9, 2025, https://www.promptfoo.dev/blog/data-poisoning/
LLM04:2025 Data and Model Poisoning – OWASP Gen AI Security Project, geopend op juli 9, 2025, https://genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning/
Alignment-Aware Model Extraction Attacks on Large Language Models, geopend op juli 9, 2025, https://arxiv.org/html/2409.02718v1
Model Leeching: An Extraction Attack Targeting LLMs – Mindgard AI, geopend op juli 9, 2025, https://mindgard.ai/resources/model-leeching-an-extraction-attack-targeting-llms
LLM Security Playbook for AI Injection Attacks, Data Leaks, and …, geopend op juli 9, 2025, https://konghq.com/blog/enterprise/llm-security-playbook-for-injection-attacks-data-leaks-model-theft
Prompt Hacking in Healthcare: Regulatory Compliance and V…, geopend op juli 9, 2025, https://www.teneo.ai/blog/prompt-hacking-in-healthcare-hipaa-compliance-voice-ai-security-teneo-ai
Prompt injection attacks on vision language models in oncology – PMC, geopend op juli 9, 2025, https://pmc.ncbi.nlm.nih.gov/articles/PMC11785991/
Data Poisoning in LLMs: Jailbreak-Tuning and Scaling Trends – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2408.02946v5
Prompt injection attacks: an inherent vulnerability of healthcare AI agents – BJGP Life, geopend op juli 9, 2025, https://bjgplife.com/prompt-injection-attacks-an-inherent-vulnerability-of-healthcare-ai-agents/
Top 10 Techniques to Secure Your LLM Prompt in 2025 | by Hicham Amchaar – Medium, geopend op juli 9, 2025, https://chamoncode.medium.com/top-10-techniques-to-secure-your-llm-prompt-in-2025-c9cc2db2f0a2
LLM Input Validation & Sanitization | Secure AI – ApX Machine Learning, geopend op juli 9, 2025, https://apxml.com/courses/intro-llm-red-teaming/chapter-5-defenses-mitigation-strategies-llms/input-validation-sanitization-llms
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2405.20099v1
Adversarial Robustness In LLMs: Defending Against Malicious Inputs – Protecto’s AI, geopend op juli 9, 2025, https://www.protecto.ai/blog/adversarial-robustness-llms-defending-against-malicious-inputs/
NeurIPS Poster Efficient Adversarial Training in LLMs with Continuous Attacks, geopend op juli 9, 2025, https://nips.cc/virtual/2024/poster/96357
Efficient Adversarial Training in LLMs with Continuous Attacks – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2405.15589v3
Efficient Adversarial Training in LLMs with Continuous Attacks …, geopend op juli 9, 2025, https://openreview.net/forum?id=8jB6sGqvgQ&referrer=%5Bthe%20profile%20of%20Gauthier%20Gidel%5D(%2Fprofile%3Fid%3D~Gauthier_Gidel1)
Reinforcement learning with human feedback (RLHF) for LLMs …, geopend op juli 9, 2025, https://www.superannotate.com/blog/rlhf-for-llm
Learning by RLHF for LLMs and other models – Innovatiana, geopend op juli 9, 2025, https://www.innovatiana.com/en/post/rlhf-our-detailed-guide
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2502.11555v1
LLM Security: Top 10 Risks and 7 Security Best Practices – Exabeam, geopend op juli 9, 2025, https://www.exabeam.com/explainers/ai-cyber-security/llm-security-top-10-risks-and-7-security-best-practices/
ControlNET: A Firewall for RAG-based LLM System | Papers With …, geopend op juli 9, 2025, https://paperswithcode.com/paper/controlnet-a-firewall-for-rag-based-llm
What LLM firewalls really mean for the future of AI security | Okoone, geopend op juli 9, 2025, https://www.okoone.com/spark/technology-innovation/what-llm-firewalls-really-mean-for-the-future-of-ai-security/
Cybersecurity Risk Assessment Using LLM Agents and Graph Data Science – Neo4j, geopend op juli 9, 2025, https://neo4j.com/nodes2024/agenda/cybersecurity-risk-assessment-using-llm-agents-and-graph-data-science/
NODES 2024 – Cybersecurity Risk Assessment Using LLM Agents and Graph Data Science, geopend op juli 9, 2025, https://www.youtube.com/watch?v=buXyenH97VA
LLMs and Cybersecurity Standards in Life Sciences | USDM, geopend op juli 9, 2025, https://usdm.com/resources/blogs/llms-and-cybersecurity-standards-in-life-sciences
OWASP Top 10 for LLM Applications 2025: Data and Model Poisoning – Check Point, geopend op juli 9, 2025, https://www.checkpoint.com/cyber-hub/what-is-llm-security/data-and-model-poisoning/
Scaling Trends for Data Poisoning in LLMs – AAAI Publications, geopend op juli 9, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/34929/37084
What is Zero Trust AI Access (ZTAI)? – Check Point Software, geopend op juli 9, 2025, https://www.checkpoint.com/cyber-hub/cyber-security/what-is-ai-security/what-is-zero-trust-ai-access-ztai/
AI-Driven IRM : Transforming Insider Risk Management with Adaptive Scoring and LLM-Based Threat Detection – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2505.03796v1
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles, geopend op juli 9, 2025, https://arxiv.org/html/2506.00781v1
Do LLM Agents Have AI Red Team Capabilities? We Built a Benchmark to Find Out, geopend op juli 9, 2025, https://dreadnode.io/blog/ai-red-team-benchmark
Defining LLM Red Teaming | NVIDIA Technical Blog, geopend op juli 9, 2025, https://developer.nvidia.com/blog/defining-llm-red-teaming/
A comprehensive review of benchmarks for LLMs evaluation | by Yanan Chen – Medium, geopend op juli 9, 2025, https://medium.com/@yananchen1116/a-comprehensive-review-of-benchmarks-for-llms-evaluation-d1c4ba466734
Testing LLM Agents: Automated Evaluation & AI Red Teaming for Agentic AI – Giskard, geopend op juli 9, 2025, https://www.giskard.ai/knowledge/how-to-implement-llm-as-a-judge-to-test-ai-agents-part-2
Comparing LLM performance: Introducing the Open Source Leaderboard for LLM APIs, geopend op juli 9, 2025, https://www.anyscale.com/blog/comparing-llm-performance-introducing-the-open-source-leaderboard-for-llm
LLM Leaderboard – Comparison of over 100 AI models from OpenAI, Google, DeepSeek & others, geopend op juli 9, 2025, https://artificialanalysis.ai/leaderboards/models
Multi-Faceted Studies on Data Poisoning can Advance LLM Development – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2502.14182v1
LLM Failures: Avoid These Large Language Model Security Risks …, geopend op juli 9, 2025, https://www.cobalt.io/blog/llm-failures-large-language-model-security-risks
Large Language Models and Code Security: A Systematic Literature Review – Powerdrill, geopend op juli 9, 2025, https://powerdrill.ai/discover/discover-Large-Language-Models-cm4x8ie8ibxms07oam6o6gr7j
(PDF) From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/387263873_From_Vulnerabilities_to_Remediation_A_Systematic_Literature_Review_of_LLMs_in_Code_Security
What Is Modular AI Architecture? – Magai, geopend op juli 9, 2025, https://magai.co/what-is-modular-ai-architecture/
A Modular Multi-stage Generative AI Architecture for Embedded Process Automation with Unstructured Data – Medium, geopend op juli 9, 2025, https://medium.com/empowering-automation-with-an-llm-knowledge-graph/a-modular-domain-augmented-generative-ai-architecture-for-embedded-process-automation-with-ac84b8a7d884
Deploying the NVIDIA AI Blueprint for Cost-Efficient LLM Routing, geopend op juli 9, 2025, https://developer.nvidia.com/blog/deploying-the-nvidia-ai-blueprint-for-cost-efficient-llm-routing/
Tech Stack for LLM Application Development – Complete Guide – Prismetric, geopend op juli 9, 2025, https://www.prismetric.com/tech-stack-for-llm-application-development/
Navigating the security landscape of generative AI – Navigating the …, geopend op juli 9, 2025, https://docs.aws.amazon.com/whitepapers/latest/navigating-security-landscape-genai/navigating-security-landscape-genai.html
What Is Training Data Poisoning in LLMs & 6 Ways to Prevent It – Pynt, geopend op juli 9, 2025, https://www.pynt.io/learning-hub/llm-security/what-is-training-data-poisoning-in-llms-6-ways-to-prevent-it
Prompt Injection: Techniques for LLM Safety in 2025 | Label Your Data, geopend op juli 9, 2025, https://labelyourdata.com/articles/llm-fine-tuning/prompt-injection
Integrating Zero Trust Security Models With LLM Operations – Protecto’s AI, geopend op juli 9, 2025, https://www.protecto.ai/blog/integrating-zero-trust-security-models-with-llm-operations
Securing AI Workloads: Building Zero-Trust Architecture for LLM Applications – YouTube, geopend op juli 9, 2025, https://www.youtube.com/watch?v=hQwyPFhACGI
Securing Generative AI architecture | by Manav Gupta | Medium, geopend op juli 9, 2025, https://medium.com/@manavg/securing-generative-ai-architecture-74f48e74b3e3
Understanding LLM Security Risks: Essential Risk Assessment – DataSunrise, geopend op juli 9, 2025, https://www.datasunrise.com/knowledge-center/ai-security/understanding-llm-security-risks/
A Step-by-Step Guide to Securing LLM Applications – Protect AI, geopend op juli 9, 2025, https://protectai.com/blog/step-by-step-guide-to-securing-llm-applications
Adaptive Governance Frameworks: Flexibility for Technological and …, geopend op juli 9, 2025, https://aign.global/ai-governance-consulting/patrick-upmann/adaptive-governance-frameworks-flexibility-for-technological-and-ethical-evolution/
The Need for Adaptive Data Governance in the Frontier of Artificial Intelligence (AI) and Automation – Kearney & Company, geopend op juli 9, 2025, https://www.kearneyco.com/blog/the-need-for-adaptive-data-governance-in-the-frontier-of-artificial-intelligence-ai-and-automation/
AI Act | Shaping Europe’s digital future – European Union, geopend op juli 9, 2025, https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai
AI Regulations and LLM Regulations: Past, Present, and Future | Exabeam, geopend op juli 9, 2025, https://www.exabeam.com/explainers/ai-cyber-security/ai-regulations-and-llm-regulations-past-present-and-future/
Article 15: Accuracy, Robustness and Cybersecurity | EU Artificial …, geopend op juli 9, 2025, https://artificialintelligenceact.eu/article/15/
ISO/IEC 42001: a new standard for AI governance – KPMG International, geopend op juli 9, 2025, https://kpmg.com/ch/en/insights/artificial-intelligence/iso-iec-42001.html
An extensive guide to ISO 42001 – Vanta, geopend op juli 9, 2025, https://www.vanta.com/resources/iso-42001
ISO/IEC 42001:2023 Artificial Intelligence Management System Standards – Learn Microsoft, geopend op juli 9, 2025, https://learn.microsoft.com/en-us/compliance/regulatory/offering-iso-42001
What Is AI Governance? – Palo Alto Networks, geopend op juli 9, 2025, https://www.paloaltonetworks.com/cyberpedia/ai-governance
(PDF) Zero-Trust Architecture (ZTA): Designing an AI-Powered Cloud Security Framework for LLMs’ Black Box Problems – ResearchGate, geopend op juli 9, 2025, https://www.researchgate.net/publication/379044053_Zero-Trust_Architecture_ZTA_Designing_an_AI-Powered_Cloud_Security_Framework_for_LLMs’_Black_Box_Problems
Implementing Zero Trust Access in AI & LLM Systems Today – DataSunrise, geopend op juli 9, 2025, https://www.datasunrise.com/knowledge-center/ai-security/implementing-zero-trust-access-in-ai-llm/
What Is Explainability? – Palo Alto Networks, geopend op juli 9, 2025, https://www.paloaltonetworks.com/cyberpedia/ai-explainability
Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2403.08946v1
The Human Factor in Detecting Errors of Large Language Models: A Systematic Literature Review and Future Research Directions – arXiv, geopend op juli 9, 2025, https://arxiv.org/html/2403.09743v1