OpenRAG-Soc Benchmarks Indirect Prompt Injection in RAG Systems
Retrieval-augmented generation (RAG) systems increasingly rely on user-generated web and social content to ground large language model responses. While this design improves factual coverage and freshness, it also introduces a web-native attack surface that traditional LLM security evaluations fail to capture. A new study, Hidden-in-Plain-Text: A Benchmark for Social-Web Indirect Prompt Injection in RAG, introduces OpenRAG-Soc, a compact and reproducible benchmark designed to evaluate these risks end to end
Threat Model and Attack Surface
The research focuses on two primary threats. The first is indirect prompt injection, where adversarial instructions are embedded in third-party web content and executed when retrieved by the RAG system. The second is retrieval poisoning, where attackers manipulate indexed content to bias retriever rankings and surface malicious documents in top-k results. Unlike direct prompt injection, these attacks persist through ingestion pipelines and exploit carriers that commonly survive HTML, Markdown, and accessibility processing.
The assumed adversary controls a subset of web pages but has no access to model weights, system prompts, or internal retriever logic. This mirrors real-world deployment conditions for web-facing RAG applications.
OpenRAG-Soc Benchmark Design
OpenRAG-Soc standardizes evaluation across the full RAG pipeline, from ingestion to answer generation. The benchmark includes a corpus of over 6,000 social-style web pages containing both visible and hidden payloads. These payloads are embedded using carriers such as hidden HTML spans, off-screen CSS, alt text, ARIA labels, and Unicode zero-width characters. A smaller subset also targets PDF text layers and SVG metadata to evaluate non-HTML ingestion paths.
The benchmark supports interchangeable sparse and dense retrievers, including BM25 and modern embedding-based retrievers, and evaluates performance across multiple top-k retrieval depths. A fixed “no-new-instructions-from-context” prompt template is used to isolate retrieval and generation effects from prompt engineering variance.
Metrics and Evaluation Methodology
OpenRAG-Soc introduces paired metrics that measure both generation-time and retrieval-time impact. Attack success is quantified using an answer-time instruction-following rate, while poisoning impact is measured through ranking shifts such as ΔMRR@10 and ΔnDCG@10. Utility and latency are also reported to capture the operational cost of defenses.
This combined measurement approach enables apples-to-apples comparison between defenses, retriever types, and carrier classes, addressing a major gap in existing RAG security evaluations.
Defense Mechanisms and Effectiveness
The benchmark evaluates three deployable mitigations: HTML and Markdown sanitization, Unicode normalization, and attribution-gated answering. Sanitization neutralizes hidden or off-screen carriers, normalization removes zero-width and homoglyph characters, and attribution gating restricts outputs to cited retrieved spans.
Experimental results show that no single defense is sufficient across all carriers. However, the combined application of all three defenses consistently reduces attack success rates to low single digits, even under adaptive attack conditions, while incurring minimal latency and utility degradation. Importantly, both sparse and dense retrievers follow the same defense effectiveness ordering.
Implications for RAG Deployments
OpenRAG-Soc demonstrates that indirect prompt injection and retrieval poisoning are practical, measurable risks in real-world RAG systems. The study provides concrete evidence that simple hygiene measures, when applied systematically across the pipeline, significantly harden web-integrated LLM applications.
By offering a lightweight, reproducible benchmark, OpenRAG-Soc enables practitioners to continuously assess exposure, compare mitigations, and track regressions as RAG systems evolve. This work establishes a technical foundation for securing RAG deployments against web-native threats without requiring access to model internals.
No Comment! Be the first one.