Unsupervised learning of the relationships between cancer samples from chromosome-specific distances of copy number distributions
Somatic copy number alterations (SCNAs), a class of mutation, have been shown to correlate strongly with oncogenesis. We investigated whether the SCNA profiles of cancer samples can be used to identify cancer subtypes by comparing them with the Wasserstein distance. This distance is primarily used to measure the dissimilarity between two probability distributions, but it can be generalised to arbitrary finite distributions.
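As a rough illustration of the distance described above, the following minimal sketch computes the one-dimensional Wasserstein-1 distance between the empirical copy number distributions of two samples, one chromosome at a time. It is not the project's code: the function names, the sample dictionaries, and the copy number values are hypothetical, and the sketch assumes each chromosome is represented by a plain list of per-segment copy numbers.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between the empirical distributions of u and v.

    Computed as the area between the two empirical CDFs.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    # Every value at which either empirical CDF can jump.
    support = sorted(set(u_sorted) | set(v_sorted))
    total = 0.0
    for left, right in zip(support, support[1:]):
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        total += abs(cdf_u - cdf_v) * (right - left)
    return total


def chromosome_distances(sample_a, sample_b):
    """Per-chromosome distances for chromosomes present in both samples."""
    shared = sorted(set(sample_a) & set(sample_b))
    return {c: wasserstein_1d(sample_a[c], sample_b[c]) for c in shared}


# Hypothetical copy number values (illustrative only, not real data).
sample_a = {"chr1": [2, 2, 2, 2], "chr2": [2, 2, 2, 2]}
sample_b = {"chr1": [2, 2, 3, 4], "chr2": [2, 2, 2, 2]}

distances = chromosome_distances(sample_a, sample_b)
print(distances)
```

How the per-chromosome values are combined into a single sample-to-sample distance is deliberately left open here. For real data, `scipy.stats.wasserstein_distance` performs the same one-dimensional computation.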
We sourced the dataset from the Python library CNSistent, calculated the distances under various combinations of hyper-parameters, visualised the results with UMAP and MDS, and clustered the samples with HDBSCAN, K-Means, and GMM. The resulting distance matrix was also used to build an outlier detection method.
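The outlier detection method itself is not specified in this summary. One possible way to use a precomputed distance matrix for that purpose, shown here purely as an assumption, is to score each sample by its mean distance to its k nearest neighbours. The function name, the choice of k, and the toy matrix are all hypothetical.

```python
def knn_outlier_scores(dist, k):
    """Mean distance from each sample to its k nearest neighbours.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    scores = []
    for i, row in enumerate(dist):
        # Exclude the sample's zero distance to itself.
        others = sorted(d for j, d in enumerate(row) if j != i)
        scores.append(sum(others[:k]) / k)
    return scores


# Toy distance matrix (illustrative only): sample 3 is far from the rest.
dist = [
    [0.0, 1.0, 1.0, 9.0],
    [1.0, 0.0, 1.0, 9.0],
    [1.0, 1.0, 0.0, 9.0],
    [9.0, 9.0, 9.0, 0.0],
]

scores = knn_outlier_scores(dist, k=2)
# Index of the sample with the highest score, i.e. the most isolated one.
most_isolated = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_isolated)
```

A sample with a high score sits far from even its closest neighbours, which makes it a candidate outlier. Where to place the cut-off would depend on the data.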
The results show that the Wasserstein distance can capture the differences between the SCNA profiles of cancer samples. Although performance is similar to the benchmark, in which the SCNA profile is used directly as the input feature vector for UMAP and MDS, the method offers a more interpretable way to analyse the differences between these profiles.
Built in Python
Building an application package for simulating fluid flow in an enclosed space
Simulating fluid dynamics is a highly complex problem. The research group was planning to develop a device for measuring the heat capacity of windows, which requires a specific air circulation pattern in an enclosed system. We extended the lid-driven cavity problem to three dimensions and developed an application package that can simulate the airflow under various boundary conditions.
Built in MATLAB