Biclustering: methods, software and application
Autoři
Více o knize
Over the past 10 years, biclustering has become popular not only in the field of biological data analysis but also in other applications with high-dimensional two way datasets. This technique clusters both rows and columns simultaneously, as opposed to clustering only rows or only columns. Biclustering retrieves subgroups of objects that are similar in one subgroup of variables and different in the remaining variables. This dissertation focuses on improving and advancing biclustering methods. Since most existing methods are extremely sensitive to variations in parameters and data, we developed an ensemble method to overcome these limitations. It is possible to retrieve more stable and reliable bicluster in two ways: either by running algorithms with different parameter settings or by running them on sub- or bootstrap samples of the data and combining the results. To this end, we designed a software package containing a collection of bicluster algorithms for different clustering tasks and data scales, developed several new ways of visualizing bicluster solutions, and adapted traditional cluster validation indices (e. g. Jaccard index) for validating the bicluster framework. Finally, we applied biclustering to marketing data. Well-established algorithms were adjusted to slightly different data situations, and a new method specially adapted to ordinal data was developed. In order to test this method on artificial data, we generated correlated original random values. This dissertation introduces two methods for generating such values given a probability vector and a correlation structure. All the methods outlined in this dissertation are freely available in the R packages biclust and orddata. Numerous examples in this work illustrate how to use the methods and software.