Revise the textual description of the indexing experiments

This commit is contained in:
2026-02-04 14:52:17 +08:00
parent c063a5599d
commit f7215bf1c3
6 changed files with 153 additions and 144 deletions


@@ -504,14 +504,22 @@ All experiments are conducted on a cluster with 9 homogenous nodes (1 master nod
\end{table}
\subsection{Evaluating the Data Indexing Structure}
To comprehensively evaluate the effectiveness of the proposed I/O-aware indexing structure, we conducted experiments on a single cluster node, as each node independently performs indexing for spatial retrieval in the distributed setting. We compare our approach against five representative baseline systems that span traditional database indexes, distributed NoSQL-based schemes, and state-of-the-art windowed I/O frameworks.
The comparative methods are categorized as follows:
\begin{enumerate}
\item \textbf{PostGIS (Full-file Retrieval):} A traditional relational database approach that employs R-tree spatial indexes for metadata filtering. While it efficiently identifies candidate images through spatial intersection tests, it retrieves entire image files during data extraction, incurring substantial I/O overhead even for small spatial queries.
\item \textbf{GeoMesa (Full-file Retrieval):} A distributed spatio-temporal index built on HBase, which encodes spatial footprints using Z-order space-filling curves for scalable metadata discovery. Despite its strong indexing performance on billion-scale datasets, it still relies on full-file data loading.
\item \textbf{MSTGI (Full-file Retrieval):} A recently proposed multi-scale spatio-temporal grid index model \cite{liu24mstgi} that enhances GeoMesa through hierarchical time granularity (year/month/day) and Hilbert curve-based linearization. It inherits the full-file retrieval limitation, where data extraction cost remains decoupled from index-level optimization.
\item \textbf{OpenDataCube (Window-based I/O):} A state-of-the-art data cube system that couples PostGIS indexes with windowed I/O via rasterio, enabling partial reads from monolithic image files. By leveraging GeoBox-based ROI computation and automatic overview selection, OpenDataCube represents the theoretical optimum for I/O selectivity but incurs runtime geospatial computation overhead to resolve pixel-to-geographic mappings.
\item \textbf{rio-tiler (Window-based I/O):} A lightweight raster reading engine optimized for dynamic tile generation. Similar to OpenDataCube, it employs PostGIS for spatial indexing and windowed I/O for partial data access, but features a streamlined execution path with minimal abstraction layers, resulting in lower per-query overhead. rio-tiler serves as a high-performance baseline for windowed reading without the complexity of full data cube management.
\item \textbf{Ours (I/O-aware Indexing):} The proposed approach leverages a dual-layer inverted index structure comprising Grid-to-Image (G2I) and Image-to-Grid (I2G) mappings. By pre-materializing grid-to-pixel correspondences at ingestion time, our method translates spatio-temporal predicates directly into byte-level read plans, completely eliminating runtime geometric computations while preserving minimal I/O volume through precise windowed access.
\end{enumerate}
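The dual-layer lookup described for our approach can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class and field names are ours, and the windows are toy `(row_off, col_off, height, width)` tuples standing in for the pre-materialized grid-to-pixel mappings.

```python
# Sketch of a dual-layer inverted index: G2I maps a grid cell to the
# images covering it, and I2G maps (image, cell) to a pixel window
# pre-computed at ingestion time, so a query resolves to read plans
# via pure dictionary lookups, with no runtime geometry.
from collections import defaultdict

class DualLayerIndex:
    def __init__(self):
        self.g2i = defaultdict(set)  # grid cell id -> {image ids}
        self.i2g = {}                # (image id, cell id) -> (row_off, col_off, h, w)

    def ingest(self, image_id, cell_windows):
        # cell_windows: {cell_id: (row_off, col_off, height, width)},
        # materialized once when the image is ingested.
        for cell_id, window in cell_windows.items():
            self.g2i[cell_id].add(image_id)
            self.i2g[(image_id, cell_id)] = window

    def plan_reads(self, query_cells):
        # Translate a set of grid cells into per-image read windows
        # with two lookups (G2I, then I2G) per cell.
        plan = defaultdict(list)
        for cell_id in query_cells:
            for image_id in self.g2i.get(cell_id, ()):
                plan[image_id].append(self.i2g[(image_id, cell_id)])
        return dict(plan)

idx = DualLayerIndex()
idx.ingest("scene_a", {(5, 9): (0, 0, 256, 256), (5, 10): (0, 256, 256, 256)})
idx.ingest("scene_b", {(5, 10): (128, 0, 256, 256)})
print(idx.plan_reads({(5, 10)}))
```

Each window in the returned plan can then be turned into a byte-range request against the stored file, which is the step the baselines resolve with runtime geospatial computation instead.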
\subsubsection{I/O Selectivity Analysis}\label{sec:Index_exp_1}
@@ -534,7 +542,7 @@ For comparison, we compare three representative execution schemes:
\label{fig:index_exp1}
\end{figure}
First, we evaluated the effectiveness of data reduction by measuring the I/O selectivity, defined as the ratio of the retrieved data volume to the total file size. Fig.~\ref{fig:index_exp1} compares our method against the baselines. As illustrated in Fig.~\ref{fig:index_exp1}(a), the full-file systems (PostGIS, GeoMesa, and MSTGI) exhibit constant selectivity: they always read the entire image regardless of the proportion of the intersection between the query range and the image. In contrast, OpenDataCube, rio-tiler, and our method significantly reduce I/O traffic by enabling partial reads. It is worth noting that our method incurs slightly higher I/O volume than the theoretically optimal windowed baselines (OpenDataCube and rio-tiler). This marginal data redundancy is attributed to the grid alignment effect: our index retrieves pixel blocks along fixed grid boundaries, whereas OpenDataCube and rio-tiler perform precise geospatial clipping. Fig.~\ref{fig:index_exp1}(b) further presents the distribution of the unnecessary data fraction. While our method introduces a small amount of over-reading due to grid padding, it avoids the massive data waste observed in the full-file retrieval systems. As we demonstrate in the next section, this slight compromise in I/O precision is a strategic trade-off that eliminates expensive runtime computations.
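The grid alignment effect can be made concrete with a small numeric sketch. The grid size, pixel depth, and query window below are illustrative values, not the experiment's parameters: a precise clip reads exactly the query window, while a grid-aligned read snaps the window outward to cell boundaries and pays a bounded over-read.

```python
# Sketch of I/O selectivity and the over-read caused by snapping a
# query window to fixed grid boundaries (illustrative parameters).
import math

GRID = 256  # grid cell edge in pixels (assumed)

def grid_aligned_bytes(q_row0, q_col0, q_h, q_w, bytes_per_px=2):
    # Expand the exact query window outward to full grid cells.
    r0 = (q_row0 // GRID) * GRID
    c0 = (q_col0 // GRID) * GRID
    r1 = math.ceil((q_row0 + q_h) / GRID) * GRID
    c1 = math.ceil((q_col0 + q_w) / GRID) * GRID
    return (r1 - r0) * (c1 - c0) * bytes_per_px

def io_selectivity(read_bytes, file_bytes):
    # Ratio of retrieved data volume to total file size.
    return read_bytes / file_bytes

exact = 300 * 300 * 2                             # precise geospatial clipping
padded = grid_aligned_bytes(100, 100, 300, 300)   # grid-aligned partial read
over_read = (padded - exact) / padded             # fraction wasted by padding
```

For this toy query both partial-read strategies retrieve kilobytes rather than the whole file; the grid-aligned read is merely a constant factor larger, which matches the small gap visible between the windowed curves.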
\subsubsection{End-to-End Retrieval Latency}\label{sec:Index_exp_2}
@@ -563,9 +571,11 @@ First, we evaluated the effectiveness of data reduction by measuring the I/O sel
\label{fig:index_exp2_3}
\end{figure}
We next measured the end-to-end retrieval latency to verify whether I/O reduction effectively translates into time efficiency across different indexing and retrieval strategies. Fig.~\ref{fig:index_exp2}(a) reports the mean latency, and Fig.~\ref{fig:index_exp2}(b) shows the 95th percentile (p95) across varying retrieval footprint ratios. The results reveal three distinct performance categories, each corresponding to a fundamental architectural design choice.
As shown in Fig.~\ref{fig:index_exp2}(a), all three full-file baselines (PostGIS, GeoMesa, MSTGI) exhibit high, flat latency curves ranging from approximately 4,480 to 4,850 ms, nearly independent of the query footprint ratio. This performance ceiling is imposed by the need to transfer complete image files regardless of the spatial extent requested. Among these methods, PostGIS achieves the lowest latency ($\approx 4,500$ ms) owing to the efficiency of R-tree traversal for metadata filtering. GeoMesa incurs slightly higher latency ($\approx 4,700$--4,850 ms) as a result of distributed query coordination overhead in its HBase-backed architecture. MSTGI falls between the two ($\approx 4,600$--4,750 ms), reflecting a multi-scale indexing optimization that partially compensates for the overhead of the underlying GeoMesa framework. Critically, all three methods are dominated by massive I/O transfer that entirely masks computational overheads, rendering index-level optimizations ineffective for end-to-end latency. In contrast, both OpenDataCube and rio-tiler show a strong correlation between retrieval latency and query footprint, confirming that partial reading successfully eliminates unnecessary data transfer. For small tile-level retrievals (footprint ratio $\le 10^{-3}$), rio-tiler achieves latencies of 34--41 ms, while OpenDataCube ranges from 43 to 52 ms. This 15--20\% advantage of rio-tiler stems from its streamlined execution path with minimal abstraction layers. However, as the query footprint grows beyond $10^{-2}$, both methods exhibit linear latency growth, reaching 5,090 ms (rio-tiler) and 5,270 ms (OpenDataCube) for full-image queries. Our method achieves the lowest latency across all query scales.
For typical tile-level retrievals (footprint ratio $10^{-4}$ to $10^{-2}$), latency ranges from 32 to 114 ms, representing a 3.8--12$\times$ speedup over rio-tiler and a 4.2--13$\times$ improvement over OpenDataCube. Crucially, our method preserves near-constant latency for small-to-medium queries and only exhibits noticeable growth when the footprint ratio exceeds $0.1$. At the extreme of full-image queries, our approach reaches 4,525 ms, comparable to full-file methods, as the grid-alignment overhead becomes negligible relative to complete data transfer.
To empirically validate the cost model in Eq.~\ref{eqn:cost_total}, we decomposed the retrieval latency into three components: metadata lookup ($C_{meta}$), geospatial computation ($C_{geo}$), and I/O access ($C_{io}$). Fig.~\ref{fig:index_exp2_3} presents the breakdown for a representative medium-scale retrieval involving approximately 50 image tiles. PostGIS, GeoMesa, and MSTGI all exhibit $C_{io}$-dominated profiles, with I/O access consuming approximately 1,200 ms and accounting for the overwhelming majority of total latency. PostGIS maintains the lowest $C_{meta}$ (20 ms) due to its single-node R-tree structure; GeoMesa incurs a higher $C_{meta}$ (120 ms) from distributed metadata coordination; and MSTGI achieves an intermediate $C_{meta}$ (70 ms) by shortening the query path through its hierarchical time-granularity design. All three methods maintain $C_{geo} \approx 0$, since no geometric computation is performed: entire images are returned directly. Both OpenDataCube and rio-tiler successfully reduce $C_{io}$ to approximately 550--580 ms by reading only the intersecting windows. However, this I/O advantage is partially offset by substantial $C_{geo}$ overhead (350 ms, or 38--40\% of total latency) incurred by runtime coordinate transformations and precise clipping computations. Their nearly identical $C_{meta}$ (20 ms) reflects their shared reliance on PostGIS for spatial indexing. Our method achieves a balanced profile with $C_{meta} = 35$ ms (slightly higher than the windowed baselines due to the two-phase G2I+I2G lookup) and $C_{io} = 600$ ms (comparable to the windowed I/O methods). Critically, $C_{geo}$ is eliminated entirely by the pre-materialized grid-to-pixel mappings, resulting in a total latency of 635 ms: a 1.7$\times$ speedup over OpenDataCube (1,070 ms) and a 1.5$\times$ improvement over rio-tiler (970 ms).
The decomposition validates our core thesis: by shifting computational burden from retrieval time to ingestion time, I/O-aware indexing achieves near-optimal I/O cost while avoiding the computational bottleneck that plagues existing windowed I/O systems.
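The additive structure of the cost model can be illustrated with a toy calculation. The component values below are illustrative round numbers in the spirit of the breakdown above, not the measured figures.

```python
# Toy instance of the additive cost model C_total = C_meta + C_geo + C_io:
# the model assumes the three phases contribute sequentially to latency.
def total_latency_ms(c_meta, c_geo, c_io):
    return c_meta + c_geo + c_io

# Windowed-I/O style profile: cheap metadata, heavy runtime geometry.
windowed = total_latency_ms(c_meta=20, c_geo=350, c_io=565)
# I/O-aware indexing style profile: two-phase lookup, no runtime geometry.
ours = total_latency_ms(c_meta=35, c_geo=0, c_io=600)
print(f"speedup: {windowed / ours:.2f}x")
```

The calculation shows where the gain comes from: the two profiles pay comparable $C_{io}$, so the speedup is driven almost entirely by trading a small increase in $C_{meta}$ for the elimination of $C_{geo}$.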
\subsubsection{Ablation Study}\label{sec:Index_exp_3}
\begin{figure}[tb]
@@ -618,7 +628,7 @@ Moreover, the choice of grid resolution (Zoom Level) is a critical parameter tha
\label{fig:index_exp4}
\end{figure}
Finally, we evaluated the scalability and cost of maintaining the index. Fig.~\ref{fig:index_exp4} compares our method against PostGIS (R-tree), GeoMesa (Z-order), and MSTGI during the ingestion of $7\times 10^5$ images. Fig.~\ref{fig:index_exp4}(a) illustrates the ingestion throughput. PostGIS exhibits a degrading trend as the dataset grows, bottlenecked by the logarithmic cost of R-tree rebalancing. In contrast, GeoMesa maintains a stable and high throughput ($\approx 2500$ img/s) owing to its append-only write pattern in HBase. MSTGI achieves a throughput of $\approx 2350$ img/s, slightly lower than GeoMesa due to the additional overhead of maintaining multi-scale temporal indexes. Our method demonstrates stable throughput at $\approx 2100$ img/s, lower than both GeoMesa and MSTGI due to the dual-table write overhead (G2I + I2G), yet it still exhibits linear scalability suitable for high-velocity streaming data. Regarding storage cost (Fig.~\ref{fig:index_exp4}(b)), our index occupies approximately 0.83\% of the raw data size. GeoMesa maintains the lowest storage footprint (0.15\%) by encoding only spatial footprints via Z-order curves. MSTGI incurs moderately higher storage cost, as it must maintain multi-scale temporal partitions alongside spatial indexes. While our method's storage overhead is higher than both GeoMesa and PostGIS (0.51\%) due to the storage of pre-materialized grid-window mappings, it remains strictly below the 1\% threshold. This result validates that the proposed method achieves significant performance gains with a negligible storage penalty.
\subsection{Evaluating the Concurrency Control}
In this section, we evaluate the proposed hybrid coordination mechanism on a distributed storage cluster to assess its scalability, robustness under contention, and internal storage efficiency.