# HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation
Wen-Sheng Lien
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
vincentlien.ii13@nycu.edu.tw
Yu-Kai Chan
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
ctw33888.ee13@nycu.edu.tw
Hao-Lung Hsiao
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
hlhsiao.cs13@nycu.edu.tw
Bo-Kai Ruan
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
bkruan.ee11@nycu.edu.tw
Meng-Fen Chiang
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
meng.chiang@nycu.edu.tw
Chien-An Chen
E.SUN Bank
Taipei, Taiwan
lukechen-15953@esunbank.com
Yi-Ren Yeh
National Kaohsiung Normal
University
Kaohsiung, Taiwan
yryeh@nknu.edu.tw
Hong-Han Shuai
National Yang Ming Chiao Tung
University
Hsinchu, Taiwan
hhshuai@nycu.edu.tw
## Abstract
Graph-based Retrieval-Augmented Generation (RAG) typically operates on binary Knowledge Graphs (KGs). However, decomposing complex facts into binary triples often leads to semantic fragmentation and longer reasoning paths, increasing the risk of retrieval drift and computational overhead. In contrast, \( n \) -ary hypergraphs preserve high-order relational integrity, enabling shallower and more semantically cohesive inference. To exploit this topology, we propose HyperRAG, a framework tailored for \( n \) -ary hypergraphs featuring two complementary retrieval paradigms: (i) HyperRetriever learns structural-semantic reasoning over \( n \) -ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable \( n \) -ary chain construction, benefiting both open and closed-domain QA. Our code is publicly available at https://github.com/Vincent-Lien/HyperRAG.git.
## CCS Concepts
- Information systems \( \rightarrow \) Retrieval models and ranking; Language models; Question answering.
## Keywords
Hypergraph-based Retrieval-Augmented Generation, N-ary Relational Knowledge Graphs, Multi-hop Question Answering, Memory-Guided Adaptive Retrieval
## ACM Reference Format:
Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang, Chien-An Chen, Yi-Ren Yeh, and Hong-Han Shuai. 2026. Hyper-RAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation. In Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3774904.3792710
## 1 Introduction
Retrieval-Augmented Generation (RAG) has established itself as a critical mechanism for augmenting Large Language Models (LLMs) with non-parametric external knowledge during inference [12, 17, 19, 20]. By dynamically retrieving verifiable information from external corpora without the need for extensive fine-tuning, RAG effectively mitigates intrinsic LLM limitations such as hallucinations and temporal obsolescence. This paradigm has proven particularly transformative for knowledge-intensive tasks, including open-domain question answering (QA), fact verification, and complex information extraction, driving significant innovation across both academia and industry.
Current RAG methodologies broadly fall into three categories: document-based, graph-based, and hybrid approaches. Document-based methods utilize dense vector retrieval to match queries with textual segments, offering scalability but often failing to capture complex structural dependencies [5, 6]. Conversely, graph-based methods leverage Knowledge Graphs (KGs) to explicitly model relationships, enabling multi-hop reasoning over structured data [15, 31]. Hybrid approaches attempt to bridge these paradigms, balancing comprehensiveness with efficiency. However, despite the reasoning potential of graph-based methods, the prevailing reliance on binary KGs presents fundamental topological limitations.

Figure 1: Structural Comparison of (a) Knowledge Graphs and (b) Hypergraphs. For a given question \( q \) ,(a) requires 3-hop reasoning over binary facts, while (b) enables single-hop inference via an \( n \) -ary relational fact, yielding a more compact and expressive multi-entity representation.
Traditional graph-based RAG methods predominantly rely on binary knowledge graphs, which encounter two fundamental structural limitations in closed-domain question-answering scenarios. First, Semantic Fragmentation arises because binary relations limit the expressiveness required to capture complex multi-entity interactions, forcing the decomposition of holistic facts into disjoint triples that fail to represent intricate semantic nuances. Second, this fragmentation leads to Path Explosion, where conventional approaches incur significant computational costs due to the need for deep traversals over the vast binary relation space to reconnect these facts, inviting error propagation and undermining real-world practicality [18, 37]. To address these limitations, recent work advocates hypergraphs for structured retrieval in RAG. Hypergraphs natively encode higher-order ( \( n \) -ary) relations that bind multiple entities and roles, providing a richer semantic substrate than binary graphs [26]. As illustrated in Figure 1, the Path Explosion issue is evident when answering a question grounded on the topic entity "Bruce Seth Green," which requires a 3-hop binary traversal on a standard KG. In contrast, this reduces to a single hop through an \( n \) -ary relation in a hypergraph, yielding a more compact representation. Hypergraphs enable the direct modeling of higher-order relational chains, effectively mitigating Semantic Fragmentation and reducing the reasoning steps required to capture complex dependencies.
Motivated by these insights, we introduce HyperRAG, an innovative retrieval-augmented generation framework designed explicitly for reasoning over \( n \) -ary hypergraphs. HyperRAG integrates two novel adaptive retrieval variants: (i) HyperRetriever, which uses a multilayer perceptron (MLP) to fuse structural and semantic embeddings, constructing query-conditioned relational chains that enable accurate and interpretable evidence aggregation within context and token constraints; and (ii) HyperMemory, which leverages the parametric memory of an LLM to guide beam search, dynamically scoring \( n \) -ary facts and entities for query-adaptive path expansion. By combining higher-order reasoning with shallower yet more expressive chains, HyperRAG locates key evidence without deep multi-hop traversal. Replacing the \( n \) -ary structure with a binary one reduces the average MRR from 36.45% to 34.15% and the average Hits@10 from 40.59% to 36.82% (Table 3), confirming the response-quality gains of \( n \) -ary reasoning.
Our key contributions are summarized as follows.
- We propose HyperRAG, a pioneering framework that shifts the graph-RAG paradigm from binary triples to \( n \) -ary hypergraphs, tackling the issues of semantic fragmentation and path explosion.
- We introduce HyperRetriever, a trainable MLP-based retrieval module that fuses structural and semantic signals to extract precise, interpretable evidence chains with low latency.
- We develop HyperMemory, a synergistic retrieval approach that utilizes LLM parametric knowledge to guide symbolic beam search over hypergraphs for complex query adaptive reasoning.
- Extensive evaluation across closed-domain and open-domain benchmarks demonstrates that HyperRAG consistently outperforms strong baselines, offering a superior trade-off between retrieval accuracy, reasoning interpretability, and system latency.
## 2 Preliminaries
### 2.1 Background
Definition 2.1 ( \( n \) -ary Relational Knowledge Graph). An \( n \) -ary relational knowledge graph, or hypergraph, represents relational facts involving two or more entities and one or more relations. Formally, following the definition in [43], a hypergraph is defined as \( \mathcal{G} = \left( {\mathcal{E},\mathcal{R},\mathcal{F}}\right) \) , where \( \mathcal{E} \) denotes the set of entities, \( \mathcal{R} \) denotes the set of relations, and \( \mathcal{F} \) the set of \( n \) -ary relational facts (hyperedges). Each \( n \) -ary fact \( {f}^{n} \in \mathcal{F} \) , which consists of two or more entities, is represented as: \( {f}^{n} = {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \) , where \( {\left\{ {e}_{i}\right\} }_{i = 1}^{n} \subseteq \mathcal{E} \) is a set of \( n \) entities with \( n \geq 2 \) .
Unlike binary knowledge graphs, the \( n \) -ary representation inherently captures higher-order relational dependencies among multiple entities. \( n \) -ary relations cannot be faithfully decomposed into combinations of binary relations without losing structural integrity or introducing ambiguity in semantic interpretation [1, 9, 35]. We formalize faithful reduction and show that any straightforward binary scheme violates at least one of: (i) recoverability of the original tuples, (ii) role preservation, or (iii) multiplicity of co-participations. Please refer to Appendix A for more details on the recoverability, role-preservation, and multiplicity properties of hypergraph reduction.
### 2.2 Problem Formulation
Problem (Hypergraph-based RAG). Given a question \( q \) , a hyper-graph \( \mathcal{G} \) representing \( n \) -ary relational structures, and a collection of source documents \( \mathcal{D} \) , the goal of hypergraph-based retrieval-augmented generation (RAG) is to generate faithful and contextually grounded answers \( a \) by leveraging salient multi-hop relational chains from \( \mathcal{G} \) and extracting relevant textual evidence from \( \mathcal{D} \) .
Complexity: Native \( n \) -ary Hypergraph Retrieval. Let \( {N}_{e} = \left| \mathcal{E}\right| \) , \( {N}_{f} = \left| \mathcal{F}\right| \) , and \( \bar{n} \) be the average arity. A query binds \( k \) role-typed arguments, \( q = {\left\{ \left( {r}_{i} : {a}_{i}\right) \right\} }_{i = 1}^{k} \) , and asks for the remaining \( n - k \) roles. We maintain sorted posting lists over role incidences, \( \mathcal{P}\left( {r : a}\right) = \; \{ f \in \mathcal{F} : \left( {r : a}\right) \in f\} \) , with length \( d\left( {r : a}\right) \) . To answer \( q \) , the \( n \) -ary based retriever intersects the \( k \) posting lists by hyperedge IDs and reads the missing roles from each surviving hyperedge. Let \( {n}^{ \star } \) be the (max/avg) arity among matches. The running time is given by:
\[
{T}_{\mathrm{{HYP}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}d\left( {{r}_{i} : {a}_{i}}\right) + \text{ out }}\right) , \tag{1}
\]
where out is the number of matching facts. In typical schemas, the relation arity is often bounded by a small constant (e.g., triadic, \( n \leq 3 \) ). As a result, for each match the retriever touches exactly one hyperedge record to materialize the unbound roles, yielding per-output overhead \( O\left( 1\right) \) .
Complexity: Standard Binary KG Retrieval. Suppose each \( n \) -ary fact \( f \) is reified as an event node \( {e}_{f} \) with \( n \) role-typed binary edges (e.g., \( {\operatorname{role}}_{j}\left( {{e}_{f},{a}_{j}}\right) \) ). For each binding \( \left( {{r}_{i} : {a}_{i}}\right) \) , we use the posting list of event IDs \( {\mathcal{P}}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) \) and intersect the \( k \) lists to obtain candidate events, mirroring the hypergraph intersection. For each surviving \( {e}_{f} \) , we follow its remaining \( \left( {n - k}\right) \) role-edges to materialize unbound arguments. Let \( {d}_{\text{ event }}\left( {r : a}\right) = \left| {{\mathcal{P}}_{\text{ event }}\left( {r : a}\right) }\right| \) and let \( {n}^{ \star } \) be the (max/avg) arity over matches. The running time is given by:
\[
{T}_{\mathrm{{BIN}}}\left( q\right) = O\left( {\mathop{\sum }\limits_{{i = 1}}^{k}{d}_{\text{ event }}\left( {{r}_{i} : {a}_{i}}\right) + \text{ out } \cdot \left( {{n}^{ \star } - k}\right) }\right) . \tag{2}
\]
Under a schema-bounded arity, the per-result overhead is up to \( \bar{n} \) role lookups to materialize the remaining arguments. In contrast, the hypergraph returns them from a single record.
Complexity Gap. In a native hypergraph, all arguments of an \( n \) -ary fact co-reside in a single hyperedge record; thus materializing a hit requires one read, i.e., \( O\left( 1\right) \) per result under bounded arity. In contrast, in an event-reified binary KG, the fact is split across \( n \) role-typed edges, reachable only via the intermediate event node \( {e}_{f} \) . As a result, materializing a hit requires up to \( \left( {n - k}\right) \) pointer chases, yielding the \( \text{out} \cdot \left( {{n}^{ \star } - k}\right) \) term in Eq. (2), and usually incurs extra indirections/cache misses.
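The complexity gap above can be made concrete with a small sketch. The following is an illustrative implementation (not the paper's code) of native hyperedge retrieval: posting lists are maintained per role-entity binding, the \( k \) lists are intersected by fact ID, and each surviving hyperedge yields its unbound roles in a single record read. All fact IDs, roles, and entities are hypothetical.

```python
from collections import defaultdict

# Toy n-ary facts as role -> entity records (hyperedges); contents are
# made up for illustration.
facts = {
    0: {"director": "Bruce Seth Green", "series": "V", "year": "1984"},
    1: {"director": "Bruce Seth Green", "series": "Hunter", "year": "1986"},
    2: {"director": "Someone Else", "series": "V", "year": "1984"},
}

# Posting lists P(r:a): sorted fact IDs per role-entity incidence.
P = defaultdict(list)
for fid in sorted(facts):
    for role, ent in facts[fid].items():
        P[(role, ent)].append(fid)

def hyper_retrieve(bindings):
    """Intersect the k posting lists by hyperedge ID, then read the
    unbound roles of each surviving fact in one record lookup (Eq. 1)."""
    lists = [P[b] for b in bindings]
    hits = set(lists[0]).intersection(*map(set, lists[1:])) if lists else set()
    bound = {role for role, _ in bindings}
    return [{r: e for r, e in facts[fid].items() if r not in bound}
            for fid in sorted(hits)]

print(hyper_retrieve([("director", "Bruce Seth Green"), ("series", "V")]))
# -> [{'year': '1984'}]
```

In the event-reified binary encoding, each returned fact would instead require up to \( n - k \) extra edge lookups to recover the same unbound arguments.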
## 3 Methodology
We propose HyperRAG, a novel framework that enhances answer fidelity by integrating reasoning over condensed \( n \) -ary relational facts with textual evidence. As depicted in Figure 2, HyperRAG features two retrieval paradigms: (i) HyperRetriever, which performs adaptive structural-semantic traversal to build interpretable, query-conditioned relational chains; (ii) HyperMemory, which utilizes the parametric knowledge of the LLM to guide symbolic beam search. Both variants ground the generation process in hypergraph structures, ensuring faithful and accurate multi-hop reasoning.

Figure 2: The overall framework of HyperRAG.
### 3.1 HyperRetriever: Relational Chains Learning
The motivation behind learning to extract fine-grained \( n \) -ary relational chains over hypergraph structures stems from two key challenges: (i) the well-documented tendency of LLMs to hallucinate factual content and (ii) the vast combinatorial search space of hypergraphs under limited token and context budgets [25]. To mitigate these challenges, we introduce a lightweight yet expressive retriever that integrates structural and semantic cues to rank salient \( n \) -ary facts aligned with query intent.
3.1.1 Topic Entity Extraction. The purpose of obtaining the topic entity is to ground the query semantics onto the hypergraph \( \mathcal{G} \) . Formally, given a query \( q \) , we prompt an LLM with \( {p}_{\text{ topic }} \) to identify the set of topic entities that appear in \( q \) as follows:
\[
{\mathcal{E}}_{q} = \operatorname{LLM}\left( {{p}_{\text{ topic }}, q}\right)
\]
where \( {\mathcal{E}}_{q} \) denotes the set of extracted entities in the query \( q \) .
3.1.2 Hyperedge Retrieval and Triple Formation. For each extracted topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedges from \( \mathcal{F} \) , formally defined as follows:
\[
{\mathcal{F}}_{{e}_{s}} = \left\{ {{f}^{n} \in \mathcal{F} : {e}_{s} \in {f}^{n}}\right\} .
\]
Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) defines an \( n \) -ary relation over a subset of \( n \) entities. To enable pairwise reasoning, we derive a set of pseudobinary triples by enumerating ordered entity pairs within each hyperedge for query \( q \) as follows:
\[
{\mathcal{T}}_{q} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}^{n} \in {\mathcal{F}}_{{e}_{s}},{e}_{h} \in {f}^{n},{e}_{t} \in {f}^{n}}\right\} , \tag{3}
\]
where each pseudo-binary triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) consists of a head entity, the originating hyperedge, and a tail entity.
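A minimal sketch of Eq. (3), with a hypothetical hyperedge store: given a topic entity, we collect its incident hyperedges and enumerate ordered entity pairs inside each one. Skipping degenerate self-pairs \( e_h = e_t \) is our own illustrative choice.

```python
# Hypothetical hyperedge store: fact ID -> participating entity set.
hyperedges = {
    "f1": {"Bruce Seth Green", "V", "1984"},
    "f2": {"Hunter", "1986"},
}

def incident(topic, F):
    """Hyperedges incident to the topic entity (Section 3.1.2)."""
    return {fid for fid, members in F.items() if topic in members}

def pseudo_binary_triples(topic, F):
    """Eq. (3): enumerate ordered (head, hyperedge, tail) pairs within
    each hyperedge incident to the topic entity."""
    return {(e_h, fid, e_t)
            for fid in incident(topic, F)
            for e_h in F[fid]
            for e_t in F[fid]
            if e_h != e_t}  # skip self-pairs (illustrative choice)

T_q = pseudo_binary_triples("Bruce Seth Green", hyperedges)
print(len(T_q))  # 3 entities in f1 -> 3 * 2 = 6 ordered pairs
```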
3.1.3 Structural Proximity Encoding. To capture the structural proximity between entities in the hypergraph, we adapt the directional distance encoding (DDE) mechanism from SubGraphRAG [21], extending it from binary relations to \( n \) -ary hyperedges. Formally, for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q} \) , we compute its directional encoding in the following steps:
- One-Hot Initialization: For each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) , we initialize a one-hot indicator for the head entity:
\[
{s}_{e}^{\left( 0\right) } = \left\{ \begin{array}{ll} 1, & \text{ if }\exists \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \in {\mathcal{T}}_{q}\text{ such that }e = {e}_{h}, \\ 0, & \text{ otherwise. } \end{array}\right. \tag{4}
\]
- Bi-directional Feature Propagation: For each layer \( l = 0,\ldots , L \) , we propagate features over the set of derived triples \( {\mathcal{T}}_{q} \) . Forward propagation simulates how the head entity \( {e}_{h} \) reaches out to the tail entity \( {e}_{t} \) as follows:
\[
{s}_{e}^{\left( l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( {e}^{\prime },\cdot , e\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {{e}^{\prime },\cdot , e}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( l\right) }. \tag{5}
\]
In contrast, backward propagation updates head encodings based on tail-to-head influence:
\[
{s}_{e}^{\left( r, l + 1\right) } = \frac{1}{\left| \left\{ {e}^{\prime } \mid \left( e,\cdot ,{e}^{\prime }\right) \in {\mathcal{T}}_{q}\right\} \right| }\mathop{\sum }\limits_{{\left( {e,\cdot ,{e}^{\prime }}\right) \in {\mathcal{T}}_{q}}}{s}_{{e}^{\prime }}^{\left( r, l\right) }. \tag{6}
\]
- Bi-directional Encoding: After \( L \) rounds of propagation, we concatenate the forward and backward encodings to obtain the final vector for each entity \( e \) as follows:
\[
{s}_{e} = \left\lbrack {{s}_{e}^{\left( 0\right) }\begin{Vmatrix}{s}_{e}^{\left( 1\right) }\end{Vmatrix}\cdots \begin{Vmatrix}{s}_{e}^{\left( L\right) }\end{Vmatrix}{s}_{e}^{\left( r,1\right) }\parallel \cdots \parallel {s}_{e}^{\left( r, L\right) }}\right\rbrack , \tag{7}
\]
where \( \parallel \) denotes vector concatenation. Note that the backward propagation starts from \( l = 1 \) , as \( l = 0 \) is shared in both directions.
- Triple Encoding: For each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) , we define its structural proximity encoding as follows:
\[
\delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) = \left\lbrack {{s}_{{e}_{h}}\parallel {s}_{{e}_{t}}}\right\rbrack \tag{8}
\]
which is passed to a lightweight parametric neural function to compute the plausibility score for each candidate triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) given query \( q \) .
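The propagation steps above can be sketched as follows, under the simplifying assumption that each layer encoding \( s_e^{(l)} \) is a scalar indicator (one shared channel for all topic entities); the paper's released implementation may differ in these details.

```python
import numpy as np

def dde_encode(triples, entities, topics, L=2):
    """Sketch of Eqs. (4)-(7): propagate a one-hot topic indicator
    forward and backward over pseudo-binary triples, then concatenate
    the per-layer values for each entity."""
    idx = {e: i for i, e in enumerate(entities)}
    s0 = np.zeros(len(entities))
    for t in topics:
        s0[idx[t]] = 1.0                       # Eq. (4): one-hot init

    def propagate(edges):
        layers, s = [s0], s0
        for _ in range(L):
            nxt = np.zeros_like(s)
            deg = np.zeros_like(s)
            for e_h, _, e_t in edges:          # mean over in-neighbors
                nxt[idx[e_t]] += s[idx[e_h]]
                deg[idx[e_t]] += 1
            s = np.divide(nxt, deg, out=np.zeros_like(nxt), where=deg > 0)
            layers.append(s)
        return layers

    fwd = propagate(triples)                             # Eq. (5)
    bwd = propagate([(t, f, h) for h, f, t in triples])  # Eq. (6)
    # Eq. (7): forward layers 0..L, backward layers 1..L (layer 0 shared)
    return {e: np.array([l[i] for l in fwd + bwd[1:]]) for e, i in idx.items()}

enc = dde_encode([("A", "f1", "B"), ("B", "f1", "C")], ["A", "B", "C"], ["A"])
```

Here the chain A → B → C yields, for entity B, a forward signal at layer 1 and zeros elsewhere, so the concatenated vector localizes each entity's directed distance from the topic entity.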
3.1.4 Contrastive Plausibility Scoring. To reduce the search space in the hypergraph structure, we address the challenge that similarity-based retrieval often introduces noisy or irrelevant triples. To mitigate this, we train a lightweight MLP classifier \( {f}_{\theta } \) to estimate the plausibility of each triple candidate and prune uninformative ones.
To this end, the training set is prepared with positive and negative samples. Let \( {P}_{q}^{ * } \) denote the shortest path of triples connecting the topic entity to a correct answer in the hypergraph \( \mathcal{G} \) . The positive samples \( {\mathcal{T}}_{i}^{ + } \) at hop \( i \) consist of triples in \( {P}_{q}^{ * } \) , denoted as \( {\mathcal{T}}_{i}^{ + } = \left\{ \left( {{e}_{h, i},{f}_{i}^{n},{e}_{t, i}}\right) \right\} \) . Negative samples \( {\mathcal{T}}_{i}^{ - } \) consist of all other triples incident to the head entity \( {e}_{h, i} \) at hop \( i \) that are not in \( {P}_{q}^{ * } \) . At each exploration step, only positive triples are expanded, while negative ones are excluded. Each triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) is encoded as a feature vector by concatenating its contextual and structural encodings:
\[
\mathbf{x} = \left\lbrack {\varphi \left( q\right) \begin{Vmatrix}{\varphi \left( {e}_{h}\right) }\end{Vmatrix}\varphi \left( {f}^{n}\right) \begin{Vmatrix}{\varphi \left( {e}_{t}\right) }\end{Vmatrix}\delta \left( {{e}_{h},{f}^{n},{e}_{t}}\right) }\right\rbrack , \tag{9}
\]
where \( \varphi \) denotes an embedding model that maps the textual content of the query \( \left( q\right) \) , head entity \( \left( {e}_{h}\right) \) , hyperedge \( \left( {f}^{n}\right) \) , and tail entity \( \left( {e}_{t}\right) \) , into vector representations, forming the candidate pseudobinary triple \( \left( {{e}_{h},{f}^{n},{e}_{t}}\right) \) . The classifier outputs a plausibility score \( {f}_{\theta }\left( \mathbf{x}\right) \in \left\lbrack {0,1}\right\rbrack \) , trained using binary cross-entropy as follows:
\[
\mathcal{L} = - \frac{1}{N}\mathop{\sum }\limits_{{i = 1}}^{N}\left\lbrack {{y}_{i}\log \left( {{f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) + \left( {1 - {y}_{i}}\right) \log \left( {1 - {f}_{\theta }\left( {\mathbf{x}}_{i}\right) }\right) }\right\rbrack . \tag{10}
\]
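As an illustration of the scoring objective, the following is a minimal NumPy MLP trained with the binary cross-entropy of Eq. (10). The real \( f_{\theta} \) consumes the concatenated embeddings of Eq. (9); here we substitute random stand-in features and synthetic labels, so all data and hyperparameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(d_in, d_hid=16):
    """One-hidden-layer MLP f_theta with sigmoid output."""
    return {"W1": rng.normal(0, 0.1, (d_in, d_hid)), "b1": np.zeros(d_hid),
            "w2": rng.normal(0, 0.1, d_hid), "b2": 0.0}

def forward(p, X):
    h = np.maximum(X @ p["W1"] + p["b1"], 0.0)            # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ p["w2"] + p["b2"]))), h

def train_step(p, X, y, lr=0.1):
    """One full-batch gradient step on the BCE loss of Eq. (10)."""
    s, h = forward(p, X)
    g = (s - y) / len(y)                 # d(BCE)/d(logit) for sigmoid output
    p["w2"] -= lr * h.T @ g
    p["b2"] -= lr * g.sum()
    gh = np.outer(g, p["w2"]) * (h > 0)  # backprop through ReLU
    p["W1"] -= lr * X.T @ gh
    p["b1"] -= lr * gh.sum(axis=0)
    return -np.mean(y * np.log(s + 1e-9) + (1 - y) * np.log(1 - s + 1e-9))

# Stand-in features for x = [phi(q) || phi(e_h) || phi(f^n) || phi(e_t) || delta]
X = rng.normal(size=(64, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(float)   # synthetic plausibility labels
params = mlp_init(8)
losses = [train_step(params, X, y) for _ in range(200)]
```

The classifier output stays in \( [0, 1] \) by construction, matching the plausibility-score range used for pruning in Section 3.1.5.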
3.1.5 Adaptive Search. At inference time, we initiate the retrieval process with initial triples of topic entities and compute their plausibility scores using the trained MLP, \( {f}_{\theta }\left( \mathbf{x}\right) \) . Triples exceeding a plausibility threshold \( \tau \) are retained, and their tail entities are used as frontier entities in the next hop. This expansion-filtering cycle continues until no new triples satisfy the threshold. However, using a fixed threshold \( \tau \) can be problematic: it may be too strict in sparse hypergraphs, limiting retrieval, or too lenient in dense hypergraphs, leading to an overload of irrelevant triples. To mitigate this, we implement an adaptive thresholding strategy. We initialize with \( {\tau }_{0} = {0.5} \) , allow a maximum of \( {N}_{\max } = 5 \) threshold reductions, and define \( M = {50} \) as the minimum acceptable number of hyperedges per hop. At hop \( i \) , we retrieve the set of triples \( {\mathcal{T}}_{q, \geq {\tau }_{j}} = \left\{ {\left( {{e}_{h},{f}^{n},{e}_{t}}\right) \mid {f}_{\theta }\left( \mathbf{x}\right) \geq {\tau }_{j}}\right\} \) under the current threshold \( {\tau }_{j} \) . If \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| < M \) , we iteratively reduce the threshold as follows:
\[
{\tau }_{j + 1} = {\tau }_{j} - c,\;j = 0,\ldots ,{N}_{\max } - 1, \tag{11}
\]
where \( c = {0.1} \) is the decay factor. This process continues until \( \left| {\mathcal{T}}_{q, \geq {\tau }_{j}}\right| \geq M \) or the reduction limit is reached. To further adapt to structural variations in the hypergraph, we incorporate a density-aware thresholding policy. Given the density of the hypergraph \( \Delta \left( \mathcal{G}\right) \) and the predefined lower and upper bounds \( {\Delta }_{\text{ lo }} \) and \( {\Delta }_{\text{ up }} \) , we classify the hypergraph and adjust \( {\tau }_{0} \) accordingly to balance coverage and precision as follows:
\[
{\mathcal{M}}_{\mathcal{G}} = \left\{ \begin{array}{ll} {\mathcal{M}}_{\text{ low }}, & \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{lo}}}, \\ {\mathcal{M}}_{\text{ mid }}, & {\Delta }_{\mathrm{{lo}}} < \Delta \left( \mathcal{G}\right) \leq {\Delta }_{\mathrm{{up}}}, \\ {\mathcal{M}}_{\text{ high }}, & \Delta \left( \mathcal{G}\right) > {\Delta }_{\mathrm{{up}}} \end{array}\right. \tag{12}
\]
After convergence or exhaustion of threshold reduction attempts, the retrieval strategy is adjusted based on the assigned graph density category. For low-density graphs \( \left( {\mathcal{M}}_{\text{ low }}\right) \) , the retriever selects from previously discarded triples those that satisfy the final plausibility threshold. For medium and high-density graphs \( \left( {\mathcal{M}}_{\text{ mid }}\right. \) and \( \left. {\mathcal{M}}_{\text{ high }}\right) \) , the strategy additionally expands from the tail entities of these newly accepted triples to increase the depth of reasoning. This density-aware adjustment prevents over-retrieval in sparse graphs while enabling deeper and broader exploration in dense graphs. To further control expansion in high-density settings, where the number of candidate hyperedges may become excessive, we impose an upper bound on the number of retrieved triples per hop. This constraint effectively limits entity expansion, accelerates retrieval, and reduces the inclusion of low-utility information.
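The threshold-relaxation loop of Eq. (11) reduces to a few lines. In this sketch of our reading (not the released implementation), `score_fn` stands in for the trained classifier \( f_{\theta} \), and the density-aware mode adjustment of Eq. (12) is omitted for brevity.

```python
def adaptive_retrieve(score_fn, candidates, tau0=0.5, c=0.1, n_max=5, M=50):
    """Eq. (11) sketch: relax the plausibility threshold by c until at
    least M triples survive or n_max reductions are spent."""
    scored = [(t, score_fn(t)) for t in candidates]
    tau = tau0
    kept = []
    for j in range(n_max + 1):           # initial pass + up to n_max retries
        kept = [t for t, s in scored if s >= tau]
        if len(kept) >= M or j == n_max:
            break
        tau -= c                          # tau_{j+1} = tau_j - c
    return kept, tau
```

With the paper's defaults, the threshold can fall from 0.5 to 0.0 in five steps, at which point the retriever hands control to the density-aware policy.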
3.1.6 Budget-aware Contextualized Generator. After completion of the retrieval process, we organize the selected elements into a structured input for the generator. Following the context layout protocol of HyperGraphRAG [25], we include (i) entities and their associated descriptions, (ii) hyperedges along with their participating entities, and (iii) supporting source text chunks linked to each entity or hyperedge. Due to input length constraints, we prioritize components based on their utility. As shown in the ablation study of HyperGraphRAG, n-ary relational facts (i.e., hyperedges) contribute the most to reasoning performance, followed by entities and then source text. We therefore allocate the token budget accordingly: 50% for hyperedges, 30% for entities, and 20% for source chunks. To further maximize informativeness, we order hyperedges and entities according to their plausibility scores \( {f}_{\theta }\left( \cdot \right) \) , with graph connectivity as a secondary criterion. The selected components are then filled sequentially in priority order (hyperedges, entities, source chunks), and any unused budget is passed to the next category. The resulting contextualized evidence Context, together with the original query \( q \) , is then passed to the LLM to generate the final Answer:
\[
\text{Answer} \mathrel{:=} \operatorname{LLM}\left( \text{Context}, q\right) . \tag{13}
\]
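A possible sketch of the budget-aware filling policy described above: each category receives its share of the token budget in priority order, and unused budget rolls over to the next category. The whitespace token proxy and the category names are illustrative assumptions, not the paper's exact accounting.

```python
def build_context(items, total_tokens):
    """Sketch of the 50/30/20 budget split over hyperedges, entities,
    and source chunks, filled in priority order with rollover.
    items maps category -> list of (text, plausibility) pairs."""
    shares = [("hyperedges", 0.5), ("entities", 0.3), ("chunks", 0.2)]
    context, leftover = [], 0
    for cat, share in shares:
        budget = int(total_tokens * share) + leftover
        used = 0
        # highest plausibility first, mirroring the f_theta ordering
        for text, _ in sorted(items.get(cat, []), key=lambda p: -p[1]):
            cost = len(text.split())      # crude whitespace token proxy
            if used + cost <= budget:
                context.append(text)
                used += cost
        leftover = budget - used          # roll unused budget forward
    return context
```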
### 3.2 HyperMemory: Relational Chain Extraction
To improve interpretability and context awareness in path retrieval, we replace naive top- \( k \) heuristics with LLM-guided scoring that leverages the model's parametric memory to assess the salience of hyperedges and entities. This enables retrieval to be guided by contextual priors and query intent, facilitating more targeted and meaningful relational exploration.
3.2.1 Memory-Guided Beam Retriever. Specifically, we design a beam search with width \( w = 3 \) and depth \( d = 3 \) , where \( w \) denotes the number of top-ranked paths retained at each iteration, and \( d \) specifies the maximum number of expansion steps. Following the same procedure as HyperRetriever, we begin by identifying the set of topic entities \( {\mathcal{E}}_{q} \) from the input query \( q \) using an LLM-based entity extractor. For each topic entity \( {e}_{s} \in {\mathcal{E}}_{q} \) , we retrieve its incident hyperedge set \( {\mathcal{F}}_{{e}_{s}} \) . Each hyperedge \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}} \) is scored for relevance to both \( {e}_{s} \) and \( q \) using a prompt \( {p}_{\text{ edge }} \) :
\[
{\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ edge }},{e}_{s},{f}^{n}, q}\right) . \tag{14}
\]
We retain the top- \( w \) hyperedges, denoted \( {\mathcal{F}}_{{e}_{s}}^{ + } \) , based on the score \( {\mathcal{S}}_{\mathcal{F}}\left( \cdot \right) \) . Next, for each \( {f}^{n} \in {\mathcal{F}}_{{e}_{s}}^{ + } \) , we identify unvisited tail entities \( {e}_{t} \) and score their relevance using a second prompt \( {p}_{\text{ entity }} \) :
\[
{\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) \sim \operatorname{LLM}\left( {{p}_{\text{ entity }},{f}^{n},{e}_{t}, q}\right) . \tag{15}
\]
Next, each resulting candidate triple \( \left( {{e}_{s},{f}^{n},{e}_{t}}\right) \) receives a weighted composite score as follows:
\[
\mathcal{S}\left( {{e}_{s},{f}^{n},{e}_{t}}\right) = {\mathcal{S}}_{\mathcal{F}}\left( {{f}^{n} \mid {e}_{s}, q}\right) \cdot {\mathcal{S}}_{\mathcal{E}}\left( {{e}_{t} \mid {f}^{n}, q}\right) . \tag{16}
\]
From the current set of candidate triples, we retain the top- \( w \) based on the final triple scorer \( \mathcal{S}\left( \cdot \right) \) . The tail entities of these selected paths define the next expansion frontier. At each depth \( i \) , we evaluate whether the accumulated evidence suffices to answer the query. All retrieved triples are assembled into a contextualized component \( {C}_{i} \) , which is passed to the LLM for an evidence sufficiency check:
\[
\operatorname{LLM}\left( {{p}_{\text{ ctx }},{C}_{i}, q}\right) \rightarrow \{ \text{ yes, no }\} \text{ , Reason. } \tag{17}
\]
If the result is yes, we terminate the search and proceed to generation. Otherwise, if \( i < d \) , the search continues to the next iteration.
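Putting Eqs. (14)-(17) together, the retriever can be sketched as below. The LLM-based scorers \( \mathcal{S}_{\mathcal{F}} \), \( \mathcal{S}_{\mathcal{E}} \) and the sufficiency check are injected as callables; in the actual system these would be prompt-based LLM calls conditioned on the query \( q \), so the signatures here are simplified assumptions.

```python
def beam_search(topics, incident, score_edge, score_entity, sufficient,
                w=3, d=3):
    """Sketch of the memory-guided beam retriever (Eqs. 14-17)."""
    frontier = [(e, 1.0, []) for e in topics]   # (entity, path score, path)
    evidence, visited = [], set(topics)
    for _ in range(d):                           # at most d expansion steps
        candidates = []
        for e_s, ps, path in frontier:
            # Eq. (14): keep the top-w hyperedges incident to e_s
            edges = sorted(incident(e_s),
                           key=lambda f: score_edge(e_s, f), reverse=True)[:w]
            for f in edges:
                for e_t in f["entities"] - visited:
                    # Eq. (16): composite score S = S_F * S_E
                    s = score_edge(e_s, f) * score_entity(f, e_t)
                    candidates.append(
                        (e_t, ps * s, path + [(e_s, f["id"], e_t)]))
        frontier = sorted(candidates, key=lambda c: -c[1])[:w]
        for e_t, _, _ in frontier:
            visited.add(e_t)
        # assemble the contextualized component C_i from surviving paths
        evidence = list(dict.fromkeys(t for _, _, p in frontier for t in p))
        if sufficient(evidence):                 # Eq. (17) sufficiency check
            break
    return evidence
```

The tail entities of the retained paths form the next frontier, and the search stops early as soon as the accumulated evidence is judged sufficient.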
3.2.2 Contextualized Generator. The retrieved entities and hyperedges are organized into a fixed-format Context, which, combined with the original query \( q \) , is passed to the LLM to generate the final Answer as in Eq. (13).
## 4 Experiments
We quantitatively evaluate the effectiveness and efficiency of HyperRetriever against RAG baselines in both in-domain and cross-domain settings. Ablation studies highlight the benefits of adaptive expansion and \( n \) -ary relational chain learning, complemented by qualitative analyses that illustrate the precision and efficiency of the adaptive retrieval process.
### 4.1 Experimental Setup
4.1.1 Datasets. We conduct experiments under both open-domain and closed-domain multi-hop question answering (QA) settings. For in-domain evaluation, we use three widely adopted benchmark datasets: HotpotQA [42], MuSiQue [38], and 2WikiMultiHopQA [16]. To evaluate cross-domain generalization, we adopt the WikiTopics-CLQA dataset [11], which tests zero-shot inductive reasoning over unseen entities and relations at inference time. Comprehensive dataset statistics are summarized in Appendix B.2.
4.1.2 Evaluation Metrics. We employ four standard metrics to assess performance, aligning with established protocols for each benchmark type. For open-domain QA datasets, where the objective is precise answer generation, we report Exact Match (EM) and F1 scores. For WikiTopics-CLQA, which involves ranking correct entities from a candidate list, we utilize Mean Reciprocal Rank (MRR) and Hits@k to evaluate retrieval fidelity. All metrics are reported as percentages (%), with higher values indicating better performance.
4.1.3 Baselines. To evaluate the effectiveness of our approach, we compare HyperRAG with RAG baselines with varying retrieval granularities, enabling a systematic analysis of how evidence structure affects retrieval effectiveness and answer generation in both open- and closed-domain settings. Specifically, we include: RAPTOR [33], which retrieves tree-structured nodes; HippoRAG [14], which retrieves free-text chunks; ToG [37], which retrieves relational subgraphs; and HyperGraphRAG [25], which retrieves a heterogeneous mixture of entities, relations, and textual spans.
4.1.4 Implementation Details. All baselines and our proposed methods utilize gpt-4o-mini as the core model for both graph construction and question answering. For HyperRetriever, we additionally employ the pretrained text encoder gte-large-en-v1.5 to produce dense embeddings for entities, relations, and queries. With 434M parameters, this GTE-family model achieves strong performance on English retrieval benchmarks, such as MTEB, and offers an efficient balance between inference speed and embedding quality, making it well-suited for semantic subgraph retrieval. All experiments were implemented in Python 3.11.13 with CUDA 12.8 and conducted on a single NVIDIA RTX 3090 (24 GB). Peak GPU memory usage remained within 24 GB due to dynamic allocation.
### 4.2 Open-domain Answering Performance
4.2.1 Setup. For HyperRetriever, a lightweight MLP \( {f}_{\theta } \) scores the plausibility of candidate hyperedges, enabling aggressive pruning that reduces traversal complexity without compromising reasoning quality. For HyperMemory, we set beam width \( w = 3 \) and depth \( d = 3 \) to balance retrieval coverage against computational cost. Comprehensive prompt definitions for edge scoring \( \left( {p}_{\text{ edge }}\right) \) , entity ranking \( \left( {p}_{\text{ entity }}\right) \) , context evaluation \( \left( {p}_{\text{ ctx }}\right) \) , and generation are provided in the Appendix.
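The HyperMemory expansion can be sketched as a standard beam search over the hypergraph's incidence structure. The sketch below is our own abstraction: `edges_incident_to` stands in for the hypergraph adjacency lookup, and `score_edge` stands in for the LLM-prompted plausibility scoring (\( p_{\text{edge}} \)); neither name is from the paper.

```python
import heapq

def beam_search(query, seeds, edges_incident_to, score_edge, width=3, depth=3):
    """Expand query-aware paths over n-ary hyperedges, keeping the
    `width` best partial paths at each of `depth` expansion steps."""
    beams = [(0.0, [seed]) for seed in seeds]  # (cumulative score, path)
    for _ in range(depth):
        candidates = []
        for score, path in beams:
            frontier = path[-1]
            for edge in edges_incident_to(frontier):
                if edge in path:
                    continue  # avoid revisiting the same fact
                candidates.append((score + score_edge(query, edge), path + [edge]))
        if not candidates:
            break  # no further expansion possible
        beams = heapq.nlargest(width, candidates, key=lambda x: x[0])
    return beams
```

With `width=3, depth=3` as in our setup, at most \( w \cdot d \) scoring calls per step keep the LLM overhead bounded regardless of graph size.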
| Topic | RAPTOR | HippoRAG | ToG | HyperGraphRAG | HyperRetriever | HyperMemory | Rel. Gain (%) |
| | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 |
| ART | 3.44 | 4.13 | 8.42 | 9.77 | 2.99 | 3.20 | 17.18 | 21.68 | 19.31 | 24.31 | 15.63 | 19.17 | 12.40 | 12.13 |
| AWARD | 20.57 | 25.13 | 32.80 | 38.65 | 8.70 | 9.35 | 51.64 | 63.43 | 52.66 | 65.28 | 47.34 | 56.98 | 1.98 | 2.93 |
| EDU | 4.94 | 5.90 | 23.82 | 26.37 | 9.09 | 9.49 | 43.44 | 50.05 | 44.79 | 51.63 | 41.68 | 46.95 | 3.11 | 3.16 |
| HEALTH | 18.85 | 22.04 | 25.72 | 29.59 | 7.14 | 7.95 | 31.46 | 37.94 | 32.68 | 39.26 | 27.48 | 33.13 | 3.88 | 3.48 |
| INFRA | 10.95 | 12.79 | 23.88 | 27.11 | 9.87 | 10.67 | 37.18 | 44.82 | 38.92 | 45.77 | 35.77 | 41.69 | 4.68 | 2.12 |
| LOC | 16.55 | 18.68 | 19.88 | 23.08 | 3.45 | 3.83 | 29.92 | 34.38 | 31.80 | 36.85 | 30.73 | 35.95 | 6.28 | 7.18 |
| ORG | 12.00 | 14.54 | 36.20 | 41.70 | 6.61 | 7.33 | 64.68 | 74.89 | 62.87 | 71.21 | 52.26 | 59.84 | -2.80 | -4.91 |
| PEOPLE | 10.74 | 13.10 | 15.39 | 18.28 | 3.90 | 4.40 | 20.67 | 28.10 | 21.62 | 28.48 | 18.96 | 25.29 | 4.60 | 1.35 |
| SCI | 6.84 | 8.66 | 15.62 | 18.86 | 6.87 | 7.28 | 25.92 | 34.54 | 25.15 | 32.30 | 21.50 | 27.53 | -2.97 | -6.49 |
| SPORT | 11.31 | 13.28 | 22.78 | 26.01 | 7.51 | 8.53 | 37.40 | 44.91 | 39.37 | 45.56 | 33.64 | 39.72 | 5.27 | 1.45 |
| TAX | 10.48 | 11.08 | 24.77 | 26.65 | 6.22 | 6.50 | 35.15 | 40.94 | 37.20 | 40.98 | 33.65 | 38.19 | 5.83 | 0.10 |
| AVG | 11.52 | 13.58 | 22.66 | 26.01 | 6.58 | 7.14 | 35.88 | 43.24 | 36.94 | 43.78 | 32.60 | 38.59 | 2.95 | 1.23 |
Table 1: Performance comparison of domain generalization across 11 diverse topics. The "Rel. Gain" column highlights the substantial relative improvement of our approach over the best baseline, averaged across all domains (metrics in %).
| Model | HotpotQA | MuSiQue | 2WikiMultiHopQA |
| | EM(%) | F1(%) | EM(%) | F1(%) | EM(%) | F1(%) |
| RAPTOR | 35.50 | 41.56 | 15.00 | 16.31 | 22.50 | 22.95 |
| HippoRAG | 49.50 | 55.87 | 14.50 | 17.43 | 30.00 | 30.44 |
| ToG | 10.08 | 11.00 | 2.70 | 2.69 | 5.20 | 5.34 |
| HyperGraphRAG | 51.00 | 42.69 | 22.00 | 20.02 | 42.50 | 30.17 |
| HyperRetriever | 42.50 | 43.65 | 13.50 | 14.15 | 34.00 | 34.06 |
| HyperMemory | 35.50 | 41.51 | 8.00 | 12.96 | 31.50 | 32.56 |
| Rel. Gain (%) | -16.67 | -21.87 | -38.64 | -29.32 | -20.00 | 11.89 |
Table 2: Performance comparison on HotpotQA, MuSiQue, and 2WikiMultiHopQA. Rel. Gain (%) indicates the relative performance gains achieved by our model compared with the best baselines. The best results are bolded, and the second best are underlined.
4.2.2 Results. Table 2 details the Exact Match (EM) and F1 scores across three open-domain QA benchmarks. HyperRetriever consistently outperforms the HyperMemory variant on HotpotQA and MuSiQue, demonstrating superior capability in identifying evidential relational chains. This advantage is attributed to its learnable MLP-based plausibility scorer and density-aware expansion strategy, which affords precise control over retrieval depth. In contrast, HyperMemory relies on the fixed parametric memory of the LLM, rendering it less adaptable to domain-specific relational patterns. Compared with external KG-based RAG baselines, we observe a performance divergence based on graph topology. On HotpotQA and MuSiQue, HyperRetriever exhibits a performance gap (e.g., 38.64% lower EM on MuSiQue), likely because these datasets benefit from the rigid structural guidance of explicit KG priors for cross-document navigation. On 2WikiMultiHopQA, however, HyperRetriever reverses this trend, achieving an 11.89% relative F1 improvement. This suggests that while KG priors aid in sparse settings, HyperRetriever is uniquely effective at exploiting the denser, more complex relational contexts found in 2WikiMultiHopQA.
### 4.3 Closed-domain Generalization Performance
To evaluate adaptability to closed-domain \( n \)-ary knowledge graphs, we assess the performance of HyperRAG on the WikiTopics-CLQA dataset (Table 1). The results demonstrate strong generalization across diverse topic-specific hypergraphs. In particular, our learnable variant, HyperRetriever, achieves the highest overall answer precision, with average improvements of 2.95% (MRR) and 1.23% (Hits@10) over the second-best baseline, HyperGraphRAG. These gains are statistically significant \( \left( {p \ll {0.001}}\right) \), with paired \( t \)-test \( p \)-values of \( {1.46} \times {10}^{-{17}} \) for MRR and \( {2.41} \times {10}^{-6} \) for Hits@10, supporting the empirical reliability of our approach. HyperRetriever secures top performance in 9 of the 11 categories, achieving, for instance, relative gains of 12.40% (MRR) and 12.13% (Hits@10) in the ART domain, and consistently ranks second in the remaining two. This broad efficacy highlights the robustness of HyperRetriever's adaptive retrieval mechanism. Unlike baselines that are sensitive to domain-specific graph density, HyperRetriever's learnable MLP scorer dynamically calibrates its expansion strategy to suit varying \( n \)-ary topologies, ensuring high precision even in complex reasoning tasks. In contrast, our memory-guided variant, HyperMemory, consistently underperforms HyperRetriever. This variant serves as a critical ablation to probe the limitations of an LLM's intrinsic parametric memory for \( n \)-ary retrieval. The results confirm that prompt-based scoring alone, without the explicit structural learning provided by HyperRetriever, is insufficient for multi-hop reasoning in closed domains.
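The significance test above can be sanity-checked with a standard paired \( t \)-test over matched per-query metric scores. The sketch below is our own pure-stdlib version; it computes only the \( t \) statistic, from which the \( p \)-value follows via the \( t \)-distribution with \( n-1 \) degrees of freedom (e.g., via `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t-test statistic over matched per-query scores:
    mean of the per-query differences divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```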
| Topic | Full | w/o Entities | w/o Hyperedges | w/o Chunks | w/o Adaptive Search | w/ Binary KG |
| | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 | MRR | Hits@10 |
| ART | 26.03 | 31.00 | 27.28 | 31.00 | 24.03 | 27.00 | 24.17 | 27.00 | 26.33 | 31.00 | 14.00 | 15.00 |
| AWARD | 56.91 | 70.00 | 43.22 | 61.00 | 55.95 | 69.00 | 55.01 | 66.00 | 52.98 | 66.00 | 48.92 | 53.00 |
| EDU | 49.00 | 56.00 | 43.24 | 52.00 | 47.93 | 52.00 | 42.67 | 47.00 | 47.53 | 53.00 | 38.20 | 42.00 |
| HEALTH | 41.25 | 47.00 | 37.17 | 43.00 | 37.70 | 40.00 | 39.33 | 47.00 | 39.20 | 46.00 | 36.17 | 39.00 |
| INFRA | 34.85 | 43.00 | 35.17 | 43.00 | 30.87 | 39.00 | 38.75 | 44.00 | 35.50 | 45.00 | 30.50 | 32.00 |
| LOC | 38.75 | 42.50 | 44.58 | 47.50 | 37.50 | 40.00 | 33.13 | 37.50 | 41.67 | 47.50 | 39.58 | 42.50 |
| ORG | 46.79 | 58.97 | 58.75 | 65.00 | 45.92 | 55.00 | 53.00 | 60.00 | 38.07 | 45.00 | 47.50 | 47.50 |
| PEOPLE | 14.20 | 22.00 | 21.23 | 28.00 | 13.73 | 19.00 | 20.03 | 26.00 | 13.37 | 20.00 | 19.33 | 22.00 |
| SCI | 25.91 | 36.00 | 18.67 | 22.00 | 24.53 | 32.00 | 26.09 | 38.00 | 21.14 | 32.00 | 24.00 | 27.00 |
| SPORT | 31.04 | 40.00 | 35.83 | 40.00 | 35.00 | 45.50 | 29.58 | 40.00 | 33.33 | 37.50 | 42.08 | 47.50 |
| TAX | 36.25 | 40.00 | 29.17 | 35.00 | 33.54 | 36.25 | 33.13 | 36.25 | 36.88 | 40.00 | 35.42 | 37.50 |
| AVG | 36.45 | 40.59 | 35.85 | 42.50 | 35.15 | 41.34 | 35.90 | 42.61 | 35.64 | 42.91 | 34.15 | 36.82 |
Table 3: Ablation on the Contribution of Context Formation and Adaptive Search. The full model incorporates all components essential for context formation, including entities, hyperedges involved in learnable relational chains, and retrieved chunks. The best results in MRR are bolded, and the best in Hits@10 are underlined.
| Dimension | RAPTOR [33] | HippoRAG [14] | ToG [37] | HyperGraphRAG [25] | OG-RAG [34] | HyperRetriever / Memory |
| Structure type | Doc tree (summ.) | KG (binary) | KG (binary) | Hypergraph (\( n \)-ary) | Object graph (mostly bin.) | Hypergraph (\( n \)-ary) |
| Unit of fact | Passage / summary | Entity-entity edge | Step / subgoal | Hyperedge (\( n \)-ary fact) | Object-object edge | Hyperedge (\( n \)-ary fact) |
| Candidate growth | Additive (levels) | Additive on edge | LLM-var. | Additive on hyperedges | Additive on objects | Additive on hyperedges |
| Per-query overhead | Tokens only | \( O\left( {n - k}\right) \) | Var. | \( O{\left( 1\right) }^{ \dagger } \) | \( O\left( 1\right) \) | \( O{\left( 1\right) }^{ \dagger } \) |
| Depth for reasoning chain | Deep | Deep (pairwise) | LLM-var. | Shallow (\( n \)-ary edges) | Deep (pairwise) | Shallow (\( n \)-ary edges) |
| Retrieval strategy | Dense tree search | Graph walk + dense | LLM on graph | Static | Object-centric walk | Adaptive / LLM on graph |
| LLM at retrieval | Low-Med | Low | Med-High (LLM) | Low | Low | Low / Med (LLM) |
| Ontology | ✘ | ✘ | ✘ | ✘ | ✓ | ✘ |
Table 4: Method Comparison. HyperRetriever utilizes adaptive search on \( n \)-ary hyperedges, enabling higher-order reasoning with shallow chains and near-constant per-query retrieval overhead \( O\left( 1\right) \). In contrast, static or object-centric walks on binary graphs entail deeper pairwise chains and materialization cost. \( \dagger \) denotes bounded arity; \( \checkmark \) indicates an ontology requirement.
### 4.4 Ablation Study
To evaluate the effectiveness of our approach, we conduct a series of ablation studies targeting two key aspects: (i) the contribution of individual components to context formation, and (ii) the impact of the adaptive search policy on retrieval performance.
4.4.1 Higher-Order Reasoning Chains. Compared with binary KG RAG, HyperRAG supports higher-order reasoning on \( n \)-ary hypergraphs. An \( n \)-ary hyperedge jointly binds multiple entities and roles, capturing fine-grained dependencies beyond pairwise links. Exploiting this structure yields shallower yet more expressive reasoning chains, enabling the model to surface key evidence without lengthy multi-hop traversal. Empirically (Table 3), replacing the \( n \)-ary structure with a binary one lowers average MRR from 36.45% to 34.15% (-2.30 points) and average Hits@10 from 40.59% to 36.82% (-3.77 points), indicating gains in both accuracy and efficiency. Additional qualitative examples appear in Appendix C.
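The fragmentation cost of the binary alternative can be made concrete with a toy example. The schema below is our own illustration (not the paper's data format) of how one \( n \)-ary fact decomposes into multiple binary triples via the standard reification trick, turning a single hyperedge lookup into a multi-hop walk:

```python
# One n-ary fact: all arguments bound together in a single hyperedge.
nary_fact = {
    "relation": "award_received",
    "roles": {
        "recipient": "Marie Curie",
        "award": "Nobel Prize in Physics",
        "year": "1903",
        "shared_with": "Pierre Curie",
    },
}

def reify_to_binary(fact, event_id="ev1"):
    """Standard reification: introduce an event node and emit one
    binary triple per role, fragmenting the original fact."""
    return [(event_id, role, value) for role, value in fact["roles"].items()]

triples = reify_to_binary(nary_fact)
# Answering "who shared the 1903 prize with Marie Curie?" now requires
# a 2-hop walk through the event node `ev1` instead of reading the
# answer off a single hyperedge.
```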
4.4.2 Impact of Context Formation. Table 3 presents a componentwise ablation study conducted on a representative \( 1\% \) subset to isolate the contributions of (i) entities, (ii) structural relations (hyperedges), and (iii) textual context. We observe that removing any component consistently degrades Mean Reciprocal Rank (MRR), though Hits@10 exhibits higher variance. This divergence highlights the distinction between ranking fidelity (MRR) and candidate inclusion (Hits@10). For instance, in the ORG and LOC domains, certain ablated variants maintain competitive Hits@10 scores but suffer sharp declines in MRR. This indicates that while the correct answer remains within the top candidates, the loss of structural or semantic signals causes it to drift down the ranking list, degrading precision. Crucially, hyperedges emerge as the dominant factor in effective context formation. Their exclusion precipitates the most significant performance drops across both metrics, underscoring the necessity of high-order topological structure for reasoning. In contrast, removing entities yields less severe degradation, as entities primarily provide node-level descriptions, whereas hyperedges capture the joint dependencies between them. Text chunks offer complementary unstructured semantics but lack the relational precision of the graph structure. Ultimately, the superior performance of the full model validates the synergistic integration of entity-aware signals, hypergraph topology, and adaptive textual evidence.
4.4.3 Impact of Adaptive Search. Removing the adaptive search component results in a noticeable decline in MRR across most categories, whereas its impact on Hits@10 is minimal and in some cases (e.g., INFRA, LOC) even marginally positive. This pattern suggests that while correct answers remain retrievable among the top 10 candidates, they tend to be ranked lower in the absence of adaptive search, reducing overall ranking precision.

Figure 3: The visualization shows the efficiency-effectiveness tradeoff in multi-hop QA: retrieval time ( \( x \) -axis), answer quality (Hits@10, y-axis), and context volume (bubble size, log-scaled by retrieved tokens).
### 4.5 Efficiency Study
4.5.1 Setup. To assess retrieval efficiency, we draw a stratified 1% sample from each WikiTopics-CLQA category, yielding approximately 1,000 questions evenly distributed across 11 topic domains, and evaluate all baselines on this set. Figure 3 depicts the three-way trade-off among retrieval time ( \( x \)-axis), Hits@10 accuracy ( \( y \)-axis), and context volume (bubble size, logarithmically scaled by retrieved tokens). Models in the upper-left quadrant achieve the best balance between efficiency and effectiveness, combining low latency with high Hits@10 while retrieving compact contexts.
4.5.2 Empirical Evidence. HyperRetriever achieves the shortest retrieval time and the highest Hits@10. Although it retrieves more tokens than some baselines, top performers consistently rely on larger contexts, highlighting a common trade-off between answer quality and retrieval volume. Our empirical findings align with the theoretical analysis in §2.2. HyperRetriever employs adaptive search over \( n \)-ary hyperedges, enabling higher-order reasoning with shallow chains and near-constant per-query overhead \( O\left( 1\right) \). In contrast, static or object-centric walks on binary graphs require deeper pairwise chains and incur an event materialization cost of \( O\left( {n - k}\right) \). We further benchmark our approach against five publicly available graph-based RAG systems, covering both \( n \)-ary and binary KG designs, and summarize the comparison in Table 4.
## 5 Related Work
Retrieval-Augmented Generation. RAG fundamentally augments the parametric memory of LLMs with external data, serving as a critical countermeasure against hallucination in knowledge-intensive tasks. The standard pipeline operates by retrieving top- \( k \) document chunks via dense similarity search before conditioning generation on this augmented context [2, 12, 17]. However, conventional dense retrieval methods [6, 20] treat data as flat text, often overlooking the complex structural and relational signals required for deep reasoning. To address this, iterative multi-step retrieval approaches have been proposed [18, 36, 39]. Yet, these methods often suffer from diminishing returns: they increase inference latency and retrieve redundant information that dilutes the context signal. This noise contributes to the "lost-in-the-middle" effect, where finite context windows prevent the LLM from effectively attending to dispersed evidence [24, 41].
Graph-based RAG. Graph-based RAG frameworks incorporate inter-document and inter-entity relationships into retrieval to enhance coverage and contextual relevance \( \left\lbrack {3,{15},{31},{32}}\right\rbrack \) . Early approaches queried curated KGs (e.g., WikiData, Freebase) for factual triples or reasoning chains \( \left\lbrack {4,{22},{27},{40}}\right\rbrack \) , while recent methods fuse KGs with unstructured text [8, 23] or build task-specific graphs from raw corpora [7]. To improve efficiency, LightRAG [13], HippoRAG [14], and MiniRAG [10] adopt graph indexing via entity links, personalized PageRank, or incremental updates [28, 29]. However, KG-based RAGs often face a trade-off between breadth and precision: broader retrieval increases noise, while narrower retrieval risks omitting key evidence. Methods using fixed substructures (e.g., paths, chunks) simplify reasoning [33, 44] but may miss global context, and challenges are amplified by LLM context window limits, vast KG search spaces [18, 30, 37], and the high latency of iterative queries [37]. Moreover, most graph-based RAG methods rely on binary relational facts, limiting the expressiveness and coverage of knowledge. Hypergraph-based representations capture richer \( n \)-ary relational structures [26]. HyperGraphRAG [25] advances this line by leveraging \( n \)-ary hypergraphs, outperforming conventional KG-based RAGs, yet suffers from noisy retrieval and reliance on dense retrievers. OG-RAG [34] addresses these issues by grounding hyperedge construction and retrieval in domain-specific ontologies, enabling more accurate and interpretable evidence aggregation. However, its dependence on high-quality ontologies constrains scalability in fast-changing or low-resource domains. Most graph-based and hypergraph-based RAG methods still face challenges, particularly due to the use of static or object-centric walks on binary graphs, which entail deeper pairwise chains and higher materialization costs.
Table 4 compares existing methods with HyperRAG.
## 6 Conclusion
We introduced HyperRAG, a novel framework that advances multi-hop question answering by shifting the retrieval paradigm from binary triples to \( n \)-ary hypergraphs, featuring two strategies: HyperRetriever, designed for precise, structure-aware evidential reasoning, and HyperMemory, which leverages dynamic, memory-guided path expansion. Empirical results demonstrate that HyperRAG effectively bridges reasoning gaps by enabling shallower, more semantically complete retrieval chains. Notably, HyperRetriever consistently outperforms strong baselines across diverse open- and closed-domain datasets, proving that modeling high-order dependencies is crucial for accurate and interpretable RAG systems.