## IV. EXPERIMENTS

This section presents experiments and performance analysis for the Hyperbolic Spatial Hypergraph (HySH) construction and the Physics-Informed Conflict Triage (PICT) modules. AreoRAG is compared against SOTA multi-source retrieval, graph-based RAG, and conflict-resolution baselines. Extensive experiments assess the robustness and efficiency of AreoRAG and aim to answer the following questions.

- **Q1**: How does the overall retrieval and QA performance of AreoRAG compare with existing multi-source RAG and graph-based RAG methods on planetary spatial data?
- **Q2**: What are the respective impacts of spatial sparsity and inter-source conflict intensity on retrieval quality?
- **Q3**: How effective are the two core modules (HySH and PICT) of AreoRAG individually?
- **Q4**: Can PICT correctly preserve scientifically valuable conflicts while filtering noise, and how does this compare with conventional conflict-elimination approaches?
- **Q5**: What are the time costs of the various modules in AreoRAG?

### A. Experimental Settings

**a) Datasets:** To validate the effectiveness of AreoRAG in planetary multi-source spatial data retrieval, we construct three datasets from real Mars exploration archives and further evaluate on two general multi-hop QA benchmarks. The planetary datasets are summarized in Table I.

(1) **MarsRegion-QA**: A multi-source spatial QA dataset constructed from the Mars Orbital Data Explorer (ODE) archives. We select five scientifically significant regions on Mars — Jezero Crater, Gale Crater, Utopia Planitia (Zhurong landing site), Valles Marineris, and Olympus Mons — and aggregate observations from HiRISE (0.3 m), CTX (6 m), CRISM (18 m), MOLA (460 m), and Zhurong/Curiosity rover in-situ measurements.
Each query targets cross-source spatial reasoning (e.g., "What mineral signatures have been detected in the clay-bearing unit at the western delta of Jezero Crater, and do orbital and in-situ observations agree?"). We construct 200 queries with expert-annotated ground truth answers and conflict labels.

(2) **MarsConflict-50**: A curated subset of 50 observation pairs exhibiting known scientific conflicts documented in the planetary science literature (e.g., orbital detection of hydrated minerals vs. inconclusive in-situ results). Each pair is annotated with conflict type (instrument-inherent, scale-dependent, temporal-evolution, or noise) by domain experts. This dataset serves as the primary benchmark for evaluating PICT's conflict classification accuracy.

(3) **MarsTemporal-QA**: A temporal reasoning dataset comprising 150 queries about surface changes observed across different Mars Years (MY), such as recurring slope lineae (RSL) activity, dust storm impacts, and seasonal frost patterns. Each query requires integrating observations spanning $L_s$ ranges to assess temporal evolution.

TABLE I: Statistics of the planetary datasets
| Dataset | Data Source | Sources | Entities | Hyperedges | Queries |
|---|---|---|---|---|---|
| MarsRegion-QA | HiRISE (Orbital) | 1 | 12,847 | 8,213 | 200 |
| | CTX (Orbital) | 1 | 28,563 | 15,471 | |
| | CRISM (Orbital) | 1 | 6,329 | 4,182 | |
| | MOLA (Orbital) | 1 | 45,210 | 22,605 | |
| | Rover In-situ | 2 | 3,876 | 2,541 | |
| MarsConflict-50 | Mixed (all above) | 6 | 1,247 | 683 | 50 |
| MarsTemporal-QA | Mixed (all above) | 6 | 8,934 | 5,127 | 150 |
Additionally, to validate generalization on established benchmarks, we evaluate on HotpotQA [38] and 2WikiMultiHopQA [39], using the same 300-question subsamples as MultiRAG [14] for fair comparison. It is noteworthy that MarsRegion-QA exhibits high spatial density (multiple overlapping observations per region) but significant cross-resolution heterogeneity, while MarsConflict-50 is specifically designed to stress-test conflict handling with a high proportion of scientifically valuable disagreements (~72% of conflicts are non-noise).

**b) Evaluation Metrics:** We adopt multiple metrics to comprehensively evaluate retrieval quality, answer accuracy, and conflict handling:

- **F1 score**: The harmonic mean of precision and recall, assessing overall retrieval and answer quality:
$$F1 = 2 \times \frac{P \times R}{P + R} \tag{22}$$
- **Recall@K**: Recall at rank $K$, measuring the proportion of relevant documents retrieved within the top-$K$ results.
- **Conflict Preservation Rate (CPR)**: The proportion of scientifically valuable conflicts (annotated as instrument-inherent, scale-dependent, or temporal-evolution) that are correctly preserved rather than filtered:
$$CPR = \frac{|\mathcal{C}^{sci}_{preserved}|}{|\mathcal{C}^{sci}_{total}|} \tag{23}$$
- **Noise Rejection Rate (NRR)**: The proportion of noise conflicts that are correctly filtered:
$$NRR = \frac{|\mathcal{C}^{noise}_{filtered}|}{|\mathcal{C}^{noise}_{total}|} \tag{24}$$
- **Conflict Classification Accuracy (CCA)**: Four-class classification accuracy over the conflict types on MarsConflict-50.
- **Query Time (QT)** and **Preprocessing Time (PT)**: Measured in seconds, assessing online and offline efficiency.

**c) Hyper-parameter Settings:** All methods were implemented in a Python 3.10 and CUDA 12.1 environment. The base LLM is Llama3-8B-Instruct for all methods except where noted.
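For reference, the hyper-parameter settings detailed in this subsection can be collected into a single configuration mapping (a sketch; the key names are ours, the values follow the text):

```python
# Sketch of the experimental configuration (Section IV-A.c).
# Key names are illustrative; all values are taken from the text.
CONFIG = {
    # HySH construction
    "curvature_K": -1.0,            # hyperbolic curvature K
    "embed_dim_d": 64,              # embedding dimension d
    "oem_power_p": 2,               # resolution power for Spatial OEM
    # PICT triage
    "entropy_threshold_eps": 0.3,   # interaction entropy threshold
    "noise_penalty_eta": -0.5,
    "sci_boost_beta": 0.2,
    "tau_decay_Ls": 180,            # temporal decay, in L_s degrees (~1 Mars season)
    "authority_alpha": 0.5,
    # MLP conflict classifier: 256 -> 128 -> 4, ReLU, 5-fold CV
    "classifier_dims": (256, 128, 4),
    # Plausibility-scoring MLP for retrieval
    "tau_0": 0.5,                   # adaptive threshold
    "decay_c": 0.1,                 # decay factor
}

# The classifier's output width matches the four conflict categories.
assert CONFIG["classifier_dims"][-1] == 4
```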
For HySH construction, the hyperbolic curvature is set to $K = -1.0$, the embedding dimension $d = 64$, and the resolution power parameter $p = 2$ for Spatial OEM. For PICT, the interaction entropy threshold is $\epsilon = 0.3$, the noise penalty $\eta = -0.5$, the scientific boost coefficient $\beta = 0.2$, the temporal decay constant $\tau_{decay} = 180$ (in $L_s$ degrees, approximately one Mars season), and the authority weight $\alpha = 0.5$. The MLP conflict classifier uses a two-layer architecture ($256 \rightarrow 128 \rightarrow 4$) with ReLU activation, trained on MarsConflict-50 with 5-fold cross-validation. The plausibility scoring MLP $f_\theta$ for retrieval follows the architecture in [18] with adaptive threshold $\tau_0 = 0.5$ and decay factor $c = 0.1$. All experiments were conducted on a device equipped with an NVIDIA A100 (80 GB) GPU and 256 GB of memory.

**d) Baseline Models:** To demonstrate the superiority of AreoRAG, we compare with the following categories of methods.

*General RAG Methods:*
1) **Standard RAG** [6]: Conventional retrieval-augmented generation with dense vector retrieval.
2) **IRCoT** [44]: Iterative retrieval with chain-of-thought reasoning refinement.
3) **RQ-RAG** [47]: Retrieval with optimized query decomposition for complex queries.

*Graph-based RAG Methods:*
4) **MultiRAG** [14]: Multi-source line graph with multi-level confidence computing (the primary comparison target).
5) **HyperGraphRAG** [25]: Hypergraph-based RAG with $n$-ary relational fact retrieval.
6) **HyperRAG** [18]: MLP-based retrieval over $n$-ary hypergraphs with adaptive search.

*Conflict-Resolution Methods:*
7) **TruthfulRAG** [17]: Knowledge graph-based conflict resolution via entropy-based filtering.
8) **MetaRAG** [9]: Metacognitive strategies for hallucination mitigation in retrieval.
**e) Dataset Preprocessing:** For the planetary datasets, we parse PDS4 labels and CNSA metadata through the multi-source spatial adapters (Section III-B) to extract spatial footprints, temporal windows, and instrument parameters. All observations are projected to the Mars IAU 2000 areocentric coordinate system, and temporal references are unified to Solar Longitude $L_s$ using SPICE kernels. For the general QA benchmarks, we follow the same preprocessing pipeline as MultiRAG [14] to ensure fair comparison.

### B. Overall Retrieval and QA Performance (Q1)

To validate the effectiveness of AreoRAG, we assess F1 and Recall@5 across the planetary datasets and the two general multi-hop QA benchmarks. Table II summarizes the performance comparison.

TABLE II: Comparison with baseline methods on planetary and general QA datasets
| Method | MarsRegion-QA F1/% | Recall@5 | MarsTemporal-QA F1/% | Recall@5 | HotpotQA F1/% | Recall@5 | 2WikiMultiHopQA F1/% | Recall@5 |
|---|---|---|---|---|---|---|---|---|
| Standard RAG | 28.4 | 31.2 | 25.7 | 28.3 | 34.1 | 33.5 | 25.6 | 26.2 |
| IRCoT | 35.6 | 38.9 | 32.1 | 35.4 | 41.6 | 41.2 | 42.3 | 40.9 |
| RQ-RAG | 37.2 | 40.5 | 34.8 | 37.6 | 51.6 | 49.3 | 45.3 | 44.6 |
| MultiRAG | 42.3 | 46.8 | 38.5 | 42.1 | 59.3 | 62.7 | 55.7 | 61.2 |
| HyperGraphRAG | 44.1 | 48.3 | 40.2 | 43.7 | 51.0 | 42.7 | 42.5 | 30.2 |
| HyperRAG | 46.5 | 50.7 | 41.8 | 45.2 | 42.5 | 43.7 | 34.0 | 34.1 |
| TruthfulRAG | 40.8 | 44.6 | 37.9 | 41.3 | 60.2 | — | 55.4 | — |
| MetaRAG | 41.5 | 45.2 | 39.1 | 42.8 | 51.1 | 49.9 | 50.7 | 52.2 |
| AreoRAG | **55.8** | **61.3** | **52.4** | **57.6** | **61.7** | **64.2** | **57.3** | **62.8** |
*Bold represents optimal metrics. "—" indicates the metric is not reported by the original paper.*

Table II demonstrates that AreoRAG outperforms all comparative methods across both planetary and general QA datasets. On MarsRegion-QA, AreoRAG achieves an F1 score of 55.8%, a 13.5% absolute improvement over MultiRAG (42.3%) and a 9.3% improvement over the best graph-based baseline, HyperRAG (46.5%). This significant gap validates the effectiveness of HySH in capturing spatial relationships that discrete line graphs and standard hypergraphs miss.

On MarsTemporal-QA, which demands temporal reasoning across observation epochs, AreoRAG achieves 52.4% F1, outperforming all baselines by at least 10.6%. This improvement is attributed to PICT's temporal-evolution conflict handling (the $\gamma(|\Delta\mathcal{T}|)$ weighting in Eq. 20), which preserves temporal change signals rather than filtering them as inconsistencies.

On the general benchmarks (HotpotQA and 2WikiMultiHopQA), AreoRAG maintains competitive performance (61.7% and 57.3% F1), demonstrating that the framework generalizes beyond planetary science. The modest improvements over MultiRAG on these benchmarks (2.4% and 1.6%) are expected, as these datasets do not exhibit the spatial and physical conflict characteristics that AreoRAG is specifically designed to address.

Notably, HyperRAG and HyperGraphRAG perform well on the planetary datasets (46.5% and 44.1% F1 on MarsRegion-QA) but underperform on the general benchmarks. Their $n$-ary hypergraph structure naturally accommodates the multi-entity spatial observations in planetary data, yet they lack the conflict triage mechanism needed to handle inter-source disagreements correctly.

### C. Robustness Under Spatial Sparsity and Conflict Intensity (Q2)

AreoRAG demonstrates strong robustness under varying spatial sparsity and conflict intensity. We conduct experiments from two perspectives.
**1) Spatial Sparsity:** We applied 30%, 50%, and 70% random hyperedge masking to MarsRegion-QA, progressively removing spatial connections while ensuring query answers remain retrievable. As shown in Fig. 5(a-b), AreoRAG's F1 score on MarsRegion-QA decreased from 55.8% to 52.1%, 49.3%, and 45.6% under the three masking levels, respectively. In contrast, MultiRAG's F1 dropped more sharply from 42.3% to 37.8%, 32.5%, and 26.1%, and HyperRAG showed moderate degradation (46.5% to 42.7%, 38.9%, and 33.4%). The superior robustness of AreoRAG under sparsity is attributed to two factors: (i) hyperbolic embedding preserves proximity information even when explicit graph edges are removed, as geodesic distance in $\mathbb{H}_K^d$ encodes spatial proximity independently of graph connectivity; and (ii) the Spatial OEM aggregation maintains representational quality by amplifying high-resolution signals that survive masking.

**2) Conflict Intensity:** We injected 30%, 50%, and 70% synthetic conflict triples into MarsRegion-QA by duplicating existing observation records and perturbing their factual content (e.g., randomizing mineral identifications or altering coordinate data), simulating scenarios of increasing inter-source noise. As shown in Fig. 5(c-d), AreoRAG's F1 score decreased only moderately from 55.8% to 54.2%, 52.8%, and 50.1% under 30%, 50%, and 70% conflict injection, respectively. MultiRAG exhibited steeper degradation (42.3% to 40.1%, 36.4%, and 30.7%), and TruthfulRAG showed similar sensitivity (40.8% to 38.2%, 34.6%, and 29.3%). The resilience of AreoRAG is directly attributable to PICT's ability to classify injected noise conflicts as $\mathcal{C}^{noise}$ and filter them while preserving genuine scientific disagreements. In contrast, MultiRAG's MCC module and TruthfulRAG's entropy-based filtering indiscriminately penalize all inconsistencies, including the original valid observations that become "outvoted" by injected noise.

### D. Ablation Study (Q3)

To evaluate the individual contributions of HySH and PICT, we conduct systematic ablation experiments. Table III reports results on MarsRegion-QA and MarsTemporal-QA.

TABLE III: Ablation experiments of HySH and PICT modules
| Configuration | MarsRegion-QA F1/% | QT/s | PT/s | MarsTemporal-QA F1/% | QT/s | PT/s |
|---|---|---|---|---|---|---|
| AreoRAG (Full) | 55.8 | 3.42 | 86.5 | 52.4 | 4.17 | 72.3 |
| w/o HySH (use MLG) | 44.6 | 28.7 | 15.2 | 40.1 | 35.4 | 12.8 |
| w/o Hyperbolic (Euclidean hypergraph) | 49.2 | 4.85 | 51.3 | 45.6 | 5.72 | 43.7 |
| w/o Spatial OEM (standard Einstein) | 51.3 | 3.38 | 86.5 | 47.8 | 4.12 | 72.3 |
| w/o PICT (use MCC) | 45.9 | 3.15 | 86.5 | 39.7 | 3.89 | 72.3 |
| w/o Conflict Classification (uniform filter) | 48.1 | 3.28 | 86.5 | 42.3 | 4.01 | 72.3 |
| w/o Interaction Entropy (use $\Delta H_p$) | 50.4 | 3.51 | 86.5 | 46.2 | 4.25 | 72.3 |
| w/o Both (Standard RAG) | 28.4 | 1.23 | — | 25.7 | 1.56 | — |
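The "w/o Hyperbolic" configuration isolates the contribution of the negative-curvature geometry itself. As an illustration of what that geometry provides (and why proximity information survives hyperedge masking), the geodesic distance in the Poincaré ball at $K = -1$ can be sketched; this is a standard formula, and the 2-D coordinates below are made up for illustration only:

```python
import math

def poincare_dist(u, v):
    """Geodesic distance in the Poincare ball with curvature K = -1:
    d(u, v) = arccosh(1 + 2*|u-v|^2 / ((1-|u|^2) * (1-|v|^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - nu2) * (1.0 - nv2)))

# Hypothetical embeddings: radius encodes scale depth, so a fine-scale
# observation sits near the boundary and a coarse global product near
# the origin. Distances are properties of the embedding space, not of
# graph connectivity, so they survive hyperedge masking.
hirise = (0.80, 0.10)  # fine-scale orbital strip (deep in the hierarchy)
ctx    = (0.75, 0.12)  # overlapping medium-resolution strip
mola   = (0.10, 0.05)  # coarse global topography (near the root)

print(poincare_dist(hirise, ctx) < poincare_dist(hirise, mola))  # True
```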
**a) HySH Module Analysis:** The HySH module achieves significant improvements in both accuracy and efficiency. Replacing HySH with MultiRAG's MLG (w/o HySH) causes F1 drops of 11.2% on MarsRegion-QA and 12.3% on MarsTemporal-QA, while query time increases by 8.4$\times$ (3.42s to 28.7s) due to the edge explosion problem in pairwise spatial encoding. This validates the $O(k)$ vs. $O(k^2)$ complexity advantage of hyperedges. Within HySH, the hyperbolic embedding contributes a 6.6% F1 improvement over the Euclidean hypergraph (49.2% vs. 55.8%), confirming that the negative-curvature geometry is essential for faithfully representing the hierarchical scale structure. The Spatial OEM contributes an additional 4.5% F1 over standard Einstein midpoint aggregation (51.3% vs. 55.8%), validating the outward bias property (Theorem 1) in preventing hierarchical collapse during cross-resolution fusion.

**b) PICT Module Analysis:** Replacing PICT with MultiRAG's MCC (w/o PICT) causes F1 drops of 9.9% on MarsRegion-QA and 12.7% on MarsTemporal-QA. The larger drop on MarsTemporal-QA is expected, as this dataset contains abundant temporal-evolution conflicts that MCC would filter as inconsistencies. The ablation further reveals the contribution of each PICT component. Removing conflict classification (using uniform filtering instead of four-category triage) costs 7.7% F1 on MarsRegion-QA. Replacing cross-source interaction entropy with TruthfulRAG's $\Delta H_p$ metric costs 5.4% F1, confirming that the cross-source formulation (Eq. 14) is more appropriate for the all-external-knowledge setting of planetary observations.

**c) Module Interaction:** Notably, the sum of the individual module contributions (HySH: 11.2% + PICT: 9.9% = 21.1%) is smaller than the gap between the full model and Standard RAG (55.8% - 28.4% = 27.4%), indicating a positive synergy at the modules' coupling points. HySH's radial depth difference $\Delta r$ directly improves PICT's scale-conflict classification, and PICT's triage feedback improves HySH's retrieval priority; disabling either module degrades the other's performance more than isolated analysis suggests.

### E. Conflict Preservation Evaluation (Q4)

A defining capability of AreoRAG is the ability to preserve scientifically valuable conflicts rather than suppressing them. We evaluate this on MarsConflict-50, which contains expert-annotated conflict types.

TABLE IV: Conflict handling performance on MarsConflict-50
| Method | CCA/% | CPR/% | NRR/% | F1/% |
|---|---|---|---|---|
| Standard RAG | — | 100.0* | 0.0 | 26.3 |
| MultiRAG (MCC) | — | 8.3 | 85.7 | 35.2 |
| TruthfulRAG | — | 13.9 | 78.6 | 37.8 |
| MetaRAG | — | 11.1 | 82.1 | 36.5 |
| AreoRAG (PICT) | **84.0** | **91.7** | **85.7** | **53.1** |
*Standard RAG preserves all information indiscriminately (CPR=100%) because it has no conflict handling mechanism, resulting in noise contamination and low F1. "—" indicates the method does not perform explicit conflict classification.*

Table IV reveals the fundamental difference between AreoRAG and existing methods. MultiRAG achieves a high Noise Rejection Rate (85.7%) but at the cost of a catastrophically low Conflict Preservation Rate (8.3%): it filters 91.7% of scientifically valuable conflicts as "unreliable data." TruthfulRAG and MetaRAG show similar behavior (CPR of 13.9% and 11.1%), confirming that existing conflict-resolution methods systematically destroy scientific anomaly signals. In contrast, AreoRAG achieves a CPR of 91.7% while maintaining the same NRR (85.7%) as MultiRAG, demonstrating that PICT successfully decouples noise filtering from scientific conflict preservation.

The Conflict Classification Accuracy of 84.0% on the four-category task validates the separability claim in Proposition 2. Error analysis reveals that the primary source of misclassification is confusion between instrument-inherent and scale-dependent conflicts (12.3% confusion rate), which is expected since both involve differences in observation geometry. Misclassification between noise and scientific conflicts is rare (3.7%), confirming the robustness of the explainable/opaque distinction (Definition 7).

Furthermore, the F1 improvement (53.1% vs. 35.2% for MultiRAG) demonstrates that preserving scientific conflicts directly benefits answer quality: the LLM can generate more comprehensive and scientifically faithful answers when provided with both agreeing and legitimately disagreeing evidence, accompanied by physical bridging explanations.

### F. Efficiency Analysis (Q5)

TABLE V: Time cost analysis across modules
| Method | MarsRegion-QA QT/s | PT/s | MarsTemporal-QA QT/s | PT/s |
|---|---|---|---|---|
| Standard RAG | 1.23 | — | 1.56 | — |
| MultiRAG | 4.87 | 15.2 | 6.13 | 12.8 |
| HyperRAG | 2.95 | 142.7 | 3.41 | 118.5 |
| TruthfulRAG | 5.62 | 18.7 | 6.85 | 15.4 |
| AreoRAG | 3.42 | 86.5 | 4.17 | 72.3 |
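As a back-of-envelope check on amortization (our own calculation, using the MarsRegion-QA numbers in Table V), AreoRAG's larger preprocessing cost is repaid by its faster per-query time after roughly fifty queries against MultiRAG:

```python
# Total time = PT + QT * n_queries (MarsRegion-QA numbers from Table V).
areo_pt, areo_qt = 86.5, 3.42     # AreoRAG preprocessing / query time (s)
multi_pt, multi_qt = 15.2, 4.87   # MultiRAG preprocessing / query time (s)

def total_time(pt, qt, n_queries):
    """One-time preprocessing plus per-query cost over a workload."""
    return pt + qt * n_queries

# Break-even workload: (86.5 - 15.2) / (4.87 - 3.42) queries.
break_even = (areo_pt - multi_pt) / (multi_qt - areo_qt)
print(round(break_even, 1))  # 49.2

# Beyond ~50 queries the amortized cost favors AreoRAG.
assert total_time(areo_pt, areo_qt, 50) < total_time(multi_pt, multi_qt, 50)
```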
AreoRAG's query time (3.42s on MarsRegion-QA) is competitive with HyperRAG (2.95s) and substantially faster than MultiRAG (4.87s) and TruthfulRAG (5.62s). The faster online query is attributable to the $O(k)$ hyperedge traversal complexity and the lightweight MLP-based plausibility scoring, which avoids the expensive mutual information entropy computation required by MultiRAG's MCC at query time.

The preprocessing time (86.5s) is higher than MultiRAG's (15.2s) due to the hyperbolic embedding computation (Eqs. 6-8), but lower than HyperRAG's (142.7s) because we do not require the full contrastive training pipeline. Importantly, HySH construction is a one-time offline cost amortized across all queries. The PICT module adds minimal online overhead: the conflict classifier (Eq. 19) requires $<$0.1s per detected conflict pair, and the interaction entropy computation (Eq. 14) adds approximately 0.8s per query through parallel LLM forward passes.

### G. Case Study

AreoRAG's effectiveness in multi-source planetary data integration is demonstrated through a real-world query about the Jezero Crater western delta. The query and system response are detailed in Table VI.

TABLE VI: Case study — Jezero Crater mineral conflict resolution
| Field | Content |
|---|---|
| **Query** | "What is the mineral composition of the clay-bearing unit in the western delta of Jezero Crater? Do orbital and in-situ observations provide consistent results?" |
| **Data Sources** | Orbital: CRISM detection of Fe/Mg-smectites (confidence 0.91)<br>Orbital: HiRISE morphological evidence of layered phyllosilicates<br>In-situ: Perseverance PIXL showing olivine-carbonate assemblage with minor clay<br>In-situ: SHERLOC Raman spectra indicating organic-bearing minerals |
| **HySH Module** | Spatial hyperedge: binds CRISM cube, HiRISE strip, PIXL measurement point, SHERLOC spectrum at co-located footprint ($\Delta r = 4.2$, scale-dependent flag)<br>Embedding: CRISM at $r=3.1$ (18 m), HiRISE at $r=8.7$ (0.3 m), PIXL at $r=9.2$ (point) |
| **PICT Module** | Detected conflict: CRISM "dominant smectite" vs. PIXL "minor clay, dominant olivine-carbonate"<br>$\mathcal{H}_{inter} = 1.34 > \epsilon$; $\lvert\log(\ell_{res}^{CRISM}/\ell_{res}^{PIXL})\rvert = 4.1$; $\hat{c} = \mathrm{scale}$<br>Classification: scale-dependent conflict ($\mathcal{C}^{scale}$), preserved with bridging explanation<br>$C_{triage} = C_{base} + \beta \cdot \mathcal{H}_{inter}^{-1} = 0.72 + 0.15 = 0.87$ |
| **AreoRAG Answer** | "The western delta clay-bearing unit shows scale-dependent compositional heterogeneity. At orbital scale (CRISM, 18 m/pixel), the dominant spectral signature is Fe/Mg-smectite [DataID: FRT0000B3B1]. At in-situ scale (PIXL, sub-mm), the primary phases are olivine and carbonate with minor clay [DataID: sol-0347-pixl]. This discrepancy is consistent with a spatial mixing effect: the CRISM pixel integrates signal from both clay-rich and olivine-rich sub-units that are resolved at rover scale. HiRISE morphology supports layered heterogeneity [DataID: ESP_045994_1985]. Recommendation: cross-reference with a CRISM targeted observation at higher spectral resolution for the specific PIXL measurement location." |
| **MultiRAG Answer (for comparison)** | "Orbital data indicates dominant clay minerals. In-situ data shows olivine-carbonate. The in-situ result may be less reliable due to inconsistency with orbital consensus (confidence: 0.43, filtered)." |
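The triage confidence in the PICT row can be reproduced from the quantities shown in the table (a minimal numerical check, using $C_{base}=0.72$, $\beta=0.2$, and $\mathcal{H}_{inter}=1.34$ as reported):

```python
# Reproducing C_triage = C_base + beta * H_inter^{-1} from Table VI.
C_base = 0.72     # base confidence of the conflicting evidence pair
beta = 0.2        # scientific boost coefficient (Section IV-A.c)
H_inter = 1.34    # cross-source interaction entropy (exceeds eps = 0.3)

boost = beta / H_inter        # ~0.149, rounded to 0.15 in the table
C_triage = C_base + boost
print(round(C_triage, 2))  # 0.87, matching the table entry
```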
This case study exemplifies AreoRAG's core advantage: while MultiRAG filters the in-situ observation as "unreliable" due to its inconsistency with orbital data, AreoRAG recognizes this as a scale-dependent conflict, preserves both observations, and generates a scientifically meaningful explanation (the spatial mixing effect). The answer includes provenance metadata (DataIDs) for scientific traceability and proactively recommends follow-up data to resolve the ambiguity, a capability enabled by the PICT module's conflict-aware context construction.

### H. Limitations

We acknowledge several limitations inherent in the current framework:

1) **Dataset scale**: The planetary datasets are constructed from publicly available archives and may not cover the full diversity of Mars exploration scenarios. Larger-scale evaluation with comprehensive PDS holdings is planned as future work.
2) **Conflict classification coverage**: The four-category conflict taxonomy, while covering the most common planetary science scenarios, may not capture all possible conflict origins (e.g., processing artifact conflicts, calibration drift). Extending the taxonomy is a natural direction.
3) **LLM dependency**: The cross-source interaction entropy computation (Eq. 14) and conflict classification (Eq. 18) both rely on LLM forward passes, introducing potential biases from the base model's parametric knowledge about planetary science. Fine-tuning on domain-specific corpora may mitigate this issue.
4) **Generalization to other planetary bodies**: While designed for Mars, the framework's principles (hyperbolic scale hierarchy, physics-informed conflict triage) are applicable to other planetary bodies (Moon, Venus, icy moons). Validation on non-Mars datasets remains future work.