####### Similarity Based Clustering # # Example: Cluster Buildings of NRW with parameters epsilon = 100 meters, n = 10. Hence a building is a core point if it has 10 buildings within 100 meters radius. The algorithm is the following: Let S (= Buildings) be the given set of objects. 1 Determine a set of core points C using the script SimilarityPartitioning. Core points are now available as set C = SmallCore. 2 Distribute S = Buildings as fast as possible to workers using round robin, for example. The set BuildingsB1 has been built in this way. 3 Share C = SmallCore with all workers. For each partition S_i: 4 Assign each element s in S_i to its closest core point in C. Add to s the index N of its core point and the distance D to its core point (in Gauss-Krüger, hence meters). 5 Create an M-tree index over S_i for CenterGK, S_CenterGK_mtree_i end 5 Repartition S by N, call it T. Now partitions correspond to core points. 6 For each partition T_i, determine the maximum value of D. Let this be Radius_i. 7 For each partition S_i, for each core point c_j in C, search on S_CenterGK_mtree_i with the radius Radius_j + epsilon. Repartition the result set by N into U. On U_j, create an index U_CenterGK_mtree_j. We now have partitions S_i with an arbitrary subset of S, T_j with the points assigned to some core c_j called members of the partition, and U_j with the points within distance Radius_j + epsilon from core c_j (those that are not members are called neighbors of the partition). 8 For each (member) point p in T_j, search on U_j with a circle of size epsilon. If the result set has size >= n, extend p by attribute IsCorePoint = true, else false. For each member point q_t found, add an edge (p, q_t) into a main memory graph V_j. For each neighbor q encountered, store the triple (p_id, IsCorePoint, q_id). Call the set of triples Triple_j. 9 On the graph V_j, determine connected components, adding the component number cn to each node of V_j. Assuming we have less than 100 core points, multiply each component number with 100 and add the index j of the partition. So we have components with distinct numbers globally. 10 Extend the triples of Triple_j with the component numbers for p, resulting in quadruples W_j. On the master: 11 Collect all sets W_j on the master. 12 For each pair of triples (p_id, IsCorePoint, ClusterNo_p, q_id) and (q_id, IsCorePoint, ClusterNo_q, p_id) on the master: (a) If IsCorePoint is false in both quadruples, just discard these quadruples. (b) If p is a core point but not q, assign the component number of p to q. (c) If p is not a core point, but q is, assign the component number of q to p (d) If p and q are both core points, generate a task of merging clusters ClusterNo_p and ClusterNo_q. 13 Construct a graph G over tasks with pairs of cluster numbers (c_a, c_b); pairs are edges, cluster numbers are nodes. 14 Compute connected components in that graph; assign each component a new number c_x. Create a global table R of renumberings. 15 Send the table R with renumberings to all partitions. On the workers: 16 For each partition T_u, apply the renumberings of R. Now all points have received their correct cluster number.