secondo/bin/Scripts/DistributedSimilarityClustering2.sec

####### Similarity Based Clustering
#
# Example: Cluster Buildings of NRW with parameters epsilon = 100 meters, n = 10. Hence a building is a core point if it has 10 buildings within 100 meters radius.

The algorithm is the following:

Let S (= Buildings) be the given set of objects.

  1 Determine a set of core points C using the script SimilarityPartitioning. Core points are now available as set C = SmallCore.

  2 Distribute S = Buildings as fast as possible to workers using round robin, for example. The set BuildingsB1 has been built in this way.

  3 Share C = SmallCore with all workers.

For each partition S_i:

  4 Assign each element s in S_i to its closest core point in C. Add to s the index N of its core point and the distance D to its core point (in Gauss-Krüger, hence meters).

  5 Create an M-tree index over S_i for CenterGK, S_CenterGK_mtree_i

end 

  5 Repartition S by N, call it T. Now partitions correspond to core points.

  6 For each partition T_i, determine the maximum value of D. Let this be Radius_i. 

  7 For each partition S_i, for each core point c_j in C, search on S_CenterGK_mtree_i with the radius Radius_j + epsilon. Repartition the result set by N into U.

    On U_j, create an index U_CenterGK_mtree_j.

    We now have partitions S_i with an arbitrary subset of S, T_j with the points assigned to some core c_j called members of the partition, and U_j with the points within distance Radius_j + epsilon from core c_j (those that are not members are called neighbors of the partition).

  8 For each (member) point p in T_j, search on U_j with a circle of size epsilon. If the result set has size >= n, extend p by attribute IsCorePoint = true, else false. For each member point q_t found, add an edge (p, q_t) into a main memory graph V_j. 

For each neighbor q encountered, store the triple (p_id, IsCorePoint, q_id). Call the set of triples Triple_j.

  9 On the graph V_j, determine connected components, adding the component number cn to each node of V_j.

  Assuming we have less than 100 core points, multiply each component number with 100 and add the index j of the partition. So we have components with distinct numbers globally.

 10 Extend the triples of Triple_j with the component numbers for p, resulting in quadruples W_j.

On the master:

 11 Collect all sets W_j on the master.

 12 For each pair of triples (p_id, IsCorePoint, ClusterNo_p, q_id) and (q_id, IsCorePoint, ClusterNo_q, p_id) on the master:

(a) If IsCorePoint is false in both quadruples, just discard these quadruples.

(b) If p is a core point but not q, assign the component number of p to q.

(c) If p is not a core point, but q is, assign the component number of q to p

(d) If p and q are both core points, generate a task of merging clusters ClusterNo_p and ClusterNo_q.

 13 Construct a graph G over tasks with pairs of cluster numbers (c_a, c_b); pairs are edges, cluster numbers are nodes.

 14 Compute connected components in that graph; assign each component a new number c_x. Create a global table R of renumberings.

 15 Send the table R with renumberings to all partitions. 

On the workers:

 16 For each partition T_u, apply the renumberings of R. 

Now all points have received their correct cluster number.
firs commit 2026-01-23 17:03:45 +08:00			`####### Similarity Based Clustering`
			`#`
			`# Example: Cluster Buildings of NRW with parameters epsilon = 100 meters, n = 10. Hence a building is a core point if it has 10 buildings within 100 meters radius.`

			`The algorithm is the following:`

			`Let S (= Buildings) be the given set of objects.`

			`1 Determine a set of core points C using the script SimilarityPartitioning. Core points are now available as set C = SmallCore.`

			`2 Distribute S = Buildings as fast as possible to workers using round robin, for example. The set BuildingsB1 has been built in this way.`

			`3 Share C = SmallCore with all workers.`

			`For each partition S_i:`

			`4 Assign each element s in S_i to its closest core point in C. Add to s the index N of its core point and the distance D to its core point (in Gauss-Krüger, hence meters).`

			`5 Create an M-tree index over S_i for CenterGK, S_CenterGK_mtree_i`

			`end`

			`5 Repartition S by N, call it T. Now partitions correspond to core points.`

			`6 For each partition T_i, determine the maximum value of D. Let this be Radius_i.`

			`7 For each partition S_i, for each core point c_j in C, search on S_CenterGK_mtree_i with the radius Radius_j + epsilon. Repartition the result set by N into U.`

			`On U_j, create an index U_CenterGK_mtree_j.`

			`We now have partitions S_i with an arbitrary subset of S, T_j with the points assigned to some core c_j called members of the partition, and U_j with the points within distance Radius_j + epsilon from core c_j (those that are not members are called neighbors of the partition).`

			`8 For each (member) point p in T_j, search on U_j with a circle of size epsilon. If the result set has size >= n, extend p by attribute IsCorePoint = true, else false. For each member point q_t found, add an edge (p, q_t) into a main memory graph V_j.`

			`For each neighbor q encountered, store the triple (p_id, IsCorePoint, q_id). Call the set of triples Triple_j.`

			`9 On the graph V_j, determine connected components, adding the component number cn to each node of V_j.`

			`Assuming we have less than 100 core points, multiply each component number with 100 and add the index j of the partition. So we have components with distinct numbers globally.`

			`10 Extend the triples of Triple_j with the component numbers for p, resulting in quadruples W_j.`

			`On the master:`

			`11 Collect all sets W_j on the master.`

			`12 For each pair of triples (p_id, IsCorePoint, ClusterNo_p, q_id) and (q_id, IsCorePoint, ClusterNo_q, p_id) on the master:`

			`(a) If IsCorePoint is false in both quadruples, just discard these quadruples.`

			`(b) If p is a core point but not q, assign the component number of p to q.`

			`(c) If p is not a core point, but q is, assign the component number of q to p`

			`(d) If p and q are both core points, generate a task of merging clusters ClusterNo_p and ClusterNo_q.`

			`13 Construct a graph G over tasks with pairs of cluster numbers (c_a, c_b); pairs are edges, cluster numbers are nodes.`

			`14 Compute connected components in that graph; assign each component a new number c_x. Create a global table R of renumberings.`

			`15 Send the table R with renumberings to all partitions.`

			`On the workers:`

			`16 For each partition T_u, apply the renumberings of R.`

			`Now all points have received their correct cluster number.`