# **Terminology:***Regression:*Regression regression.*Clustering:*Clustering is a process of partitioning the

**Terminology:***Regression:*Regression is a statistical approach for determining the relation between two attributes.*Regression Indetermination coefficient:*  Using the regression indetermination coefficient regression quality can be determined. Higher the indetermination coefficient lower the quality of regression.*Clustering:*Clustering is a process of partitioning the data points into meaningful sub-classes called clusters.*Clustering Algorithms:* Main goal of clustering algorithms is to categorize data into clusters such that objects are grouped in the same cluster when they are similar according to specific metrics, like Volume,Variety and velocity. ### Introduction:The aim of writing this report is to give a detailed description about a new clustering algorithm that has been introduced in order to deal with partitioning the data-set into clusters under the condition that, the regression indetermination coefficient for each cluster results in a minimal value.  I am also going to discuss the proposal and implementation of this algorithm.  This clustering algorithm is developed with generic programming approach. We use generic algorithms because it provides the sub-optimal solutions.  Most of the clustering algorithms try to minimize the inter clustering distances, However, a few applications such as cellular network planning require to decrease the indetermination coefficient. Regression is one of the most widely used techniques for data analysis. It has a wide application range from experimental data to modern data mining.  Below I am discussing  how  regression is helpful in dividing the data into clusters with an example.### The Idea of considering regression as a clustering method with an Example:The population of the fish has been considered during the ichthyology research. We assume all the fish belongs to one species because it is impossible to describe a few  attributes like size, shape, colour etc., of fish.  Under the condition – “effect of water temperature on the average swimming speed of the fish” we acquire a set of data after several measurements. Now we try to delineate (Figure 1) the data obtained considering the speed of fish on Y-axis and temperature on X-axis and build a function.  One of the approaches is to build a linear regression function for the data. After acquiring the function we try to find the relation that exists between the average swimming speed and temperature.  From the obtained linear regression function we predict that “as the temperature increases there is a decrease in the average speed of fish” though it results in a high error rate because of the residuals.!(F1.PNG)However, there exists another approach called clustering. Having a closer look at Figure 1, we can come up with an idea of splitting the data (fish population) into two clusters.  Now we try to obtain the linear regression function for two clusters separately until we get a satisfactory (minimized) regression indetermination coefficient instead of getting a rough regression function for the entire data items. After performing the clustering we assume that the population(fish) has two kinds of spices which react differently at different temperatures. The regression functions formed have a regression indetermination coefficient which is nearly equivalent to zero which means that these functions are a good fit model for this experimental data. Below I have delineated (Figure 2) how well the regression lines fit the two clusters formed.!(F2.PNG)The above-discussed procedure works well when the data set is small, with minimized indetermination coefficient that predicts the strength of the relationship between two attributes. Unfortunately, the data-sets with which we deal in real world i.e., in the data mining applications, it is rare to find a small data set, in order to conduct the analysis manually. We cannot just predict the number of clusters that should be formed from the data-set by visualizing, so we require this clustering algorithm to find the accurate relation between attributes. This algorithm considers error rates to measure the quality.###  Proposal of Algorithm :When the data-set is very huge it is manually impossible to perform visualization and decide the number of number of clusters to be formed as we saw in the example1, so we need an automated system that can predict the number of clusters. Another approach that can be followed is to perform the algorithm repeatedly as we perform in k means clustering  and consider the best solution.  Regression function parameters fail to determine the distance between the data-set items but can be used to determine the fitness function which is used as cluster quality measure. To further solve the problem we apply genetic problem approach using the cluster quality measure.### Implementation of Algorithm### Assumptions:1. Let’s assume there are two attributes in the given input data-set and the number of clusters that we have to divide is two. 2. Consider that D is the data-set containing objects {x,y} which have to be clustered.D = \$d_{1},d_{2},d_{3},d_{4}……,d_{n}\$3. Assume U is the binary vector which represents the distribution followed by the objects in D and split them into two sets C1 and C2 clusters.U = \$U_{1},U_{2},U_{3},…,U_{n}\$If \$U_i\$ is 0, \$d_i\$ belongs to \$C_{1}\$If \$U_i\$ is 1, \$d_i\$ belongs to \$C_2\$4. Let R be a quadruple consisting the coefficients (a,b,c,d) of the formed linear regression functions.(\$fr_1\$ and \$fr_2\$) . We can also consider R as a binary vector, by changing the coefficient values to binary.5.We denote P as a set of all possible vectors and name it objects population.6.We denote Q as the set of all vectors of R and name it function population.**Using the above-mentioned assumptions I am explaining the algorithm in 2 steps.****Step1:**  First we populate the P and Q with random data.Every vector of P is compared with binary vector U and grouped as cluster1(\$C_1\$) and cluster2 (\$C_2\$) based on the binary vector U. Regression functions (\$f_1\$ and \$f_2\$) are computed for both the clusters. Regression Indetermination coefficients(\$fi_1\$ and \$fi_2\$) of regression functions are computed respectively. Then we find the quality measure of each cluster with fit function, the cluster with maximum quality is considered.  We perform genetic operations on P vector using the fit function. The best individuals are selected from the P and form the regression function(\$f_1\$,\$f_2\$) and the coefficients(\$fr_1,fr_2\$) are sent into vector R. We insert the vector R values into Q vector.    **Step2:**  Using the above formed regression indetermination coefficient (\$fr_1,fr_2\$) and the data objects of the data-set D we find the Euclidean distance and divide them into clusters \$C_1\$ and \$C_2\$ based on this distance. Now find the regression function(\$f_1,f_2\$) and regression indetermination coefficients (\$fi_1,fi_2\$) and use the fit function to determine the quality. And perform the genetic operation on Q vector and get the individuals into \$U_x\$ using the \$C_1\$ and \$C_2\$ distributions for the best individuals from Q. Insert \$U_x\$ into P. Both step1 and step2 are performed repeatedly until we reach the satisfactory indetermination coefficients.Finally, this algorithm mainly competes two populations, i.e., *Object population P:*  This contains the individuals evolving the distribution of data-set objects. *Function population Q:* This contains evolving the regression functions. ### When we have Multiple Dimensions:In a multi-dimensional space, it is hard to recognize the linear patterns. In multiple regression functions, it forms a hyperplane which is hard to visualize. The dependent variable is a vector and the independent variable is a matrix.  As the dimensions increase it is polynomially hard to implement using the algorithm. Instead, regression approximation algorithm can be used when we are dealing with multiple dimensions.### Conclusion:The algorithm discussed above cannot be considered as universal clustering methodology. When dealing with classical regression if unsatisfactory results are obtained this algorithm can be used to show the correlation between attributes. The above-discussed algorithm is in its initial stage of evolution.  A better way should be found for the automatic selection of clusters. There have to be enhancements performed to the generic programming approach and fitness function.