Files
secondo/bin/Scripts/DistributedSimilarityClustering2.sec

72 lines
3.2 KiB
Plaintext
Raw Permalink Normal View History

2026-01-23 17:03:45 +08:00
####### Similarity Based Clustering
#
# Example: Cluster Buildings of NRW with parameters epsilon = 100 meters, n = 10. Hence a building is a core point if it has 10 buildings within 100 meters radius.
The algorithm is the following:
Let S (= Buildings) be the given set of objects.
1 Determine a set of core points C using the script SimilarityPartitioning. Core points are now available as set C = SmallCore.
2 Distribute S = Buildings as fast as possible to workers using round robin, for example. The set BuildingsB1 has been built in this way.
3 Share C = SmallCore with all workers.
For each partition S_i:
4 Assign each element s in S_i to its closest core point in C. Add to s the index N of its core point and the distance D to its core point (in Gauss-Krüger, hence meters).
5 Create an M-tree index over S_i for CenterGK, S_CenterGK_mtree_i
end
5 Repartition S by N, call it T. Now partitions correspond to core points.
6 For each partition T_i, determine the maximum value of D. Let this be Radius_i.
7 For each partition S_i, for each core point c_j in C, search on S_CenterGK_mtree_i with the radius Radius_j + epsilon. Repartition the result set by N into U.
On U_j, create an index U_CenterGK_mtree_j.
We now have partitions S_i with an arbitrary subset of S, T_j with the points assigned to some core c_j called members of the partition, and U_j with the points within distance Radius_j + epsilon from core c_j (those that are not members are called neighbors of the partition).
8 For each (member) point p in T_j, search on U_j with a circle of size epsilon. If the result set has size >= n, extend p by attribute IsCorePoint = true, else false. For each member point q_t found, add an edge (p, q_t) into a main memory graph V_j.
For each neighbor q encountered, store the triple (p_id, IsCorePoint, q_id). Call the set of triples Triple_j.
9 On the graph V_j, determine connected components, adding the component number cn to each node of V_j.
Assuming we have less than 100 core points, multiply each component number with 100 and add the index j of the partition. So we have components with distinct numbers globally.
10 Extend the triples of Triple_j with the component numbers for p, resulting in quadruples W_j.
On the master:
11 Collect all sets W_j on the master.
12 For each pair of triples (p_id, IsCorePoint, ClusterNo_p, q_id) and (q_id, IsCorePoint, ClusterNo_q, p_id) on the master:
(a) If IsCorePoint is false in both quadruples, just discard these quadruples.
(b) If p is a core point but not q, assign the component number of p to q.
(c) If p is not a core point, but q is, assign the component number of q to p
(d) If p and q are both core points, generate a task of merging clusters ClusterNo_p and ClusterNo_q.
13 Construct a graph G over tasks with pairs of cluster numbers (c_a, c_b); pairs are edges, cluster numbers are nodes.
14 Compute connected components in that graph; assign each component a new number c_x. Create a global table R of renumberings.
15 Send the table R with renumberings to all partitions.
On the workers:
16 For each partition T_u, apply the renumberings of R.
Now all points have received their correct cluster number.