Multimodal Data Fusion

Approaches to multimodal data fusion can be categorized by the underlying associations between data modalities. For example, we can record a user's eye movements with an eye tracker and their brain waves with EEG at the same time; these two modalities are therefore temporally corresponded. Definitions such as "eye fixation-related brain potentials" exploit this correspondence to ease subsequent analysis and modeling. Similarly, a dataset may contain images and their annotations, where the appearance of objects in the images and the human annotations are semantically related. We are interested in exploring the correspondence between physicians' eye movements over medical images (where and how they look) and their speech about the image content (what they see). For now, we assume an imperfect semantic correspondence between these two heterogeneous types of data and develop the model below; future work will study more advanced models.

To fuse data from multiple modalities, we develop a data fusion framework based on Laplacian sparse coding. The matrices E and V hold the eye gaze-filtered image features and the verbal features, respectively. The coefficient matrix C stores the unified data representations, each of which is a distribution over the latent topics learned and stored in the basis matrices P and Q; P and Q thus encode the transformation from the original feature spaces to the latent topic space. The framework is flexible: additional data modalities can be incorporated by adding reconstruction terms analogous to the first two.
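To make the roles of E, V, C, P, and Q concrete, the following is a minimal sketch (not the authors' actual implementation) of one common form of the Laplacian sparse coding objective, ||E - PC||² + ||V - QC||² + λ||C||₁ + γ·Tr(C L Cᵀ), solved by alternating minimization. The graph Laplacian construction, the optimizer (an ISTA-style proximal gradient step for C and closed-form least squares for the bases), and all parameter values are illustrative assumptions.

```python
import numpy as np

def fuse_laplacian_sparse_coding(E, V, n_topics=5, lam=0.1, gamma=0.1,
                                 n_iters=200, lr=0.01, seed=0):
    """Toy alternating minimization of
       ||E - P C||^2 + ||V - Q C||^2 + lam*||C||_1 + gamma*Tr(C L C^T).

    E: (d_e, n) gaze-filtered image features, one column per sample.
    V: (d_v, n) verbal features for the same samples.
    Returns basis matrices P (d_e, k), Q (d_v, k) and unified codes C (k, n).
    Hyperparameters and the similarity graph below are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    d_e, n = E.shape
    d_v, _ = V.shape
    P = rng.standard_normal((d_e, n_topics))
    Q = rng.standard_normal((d_v, n_topics))
    C = rng.standard_normal((n_topics, n))

    # Build a sample-similarity graph from the stacked features, then its
    # (unnormalized) graph Laplacian L = D - W for the smoothness term.
    X = np.vstack([E, V])
    Xn = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    W = np.clip(Xn.T @ Xn, 0.0, None)          # nonnegative cosine similarity
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W

    for _ in range(n_iters):
        # Gradient step on C for the smooth terms, then soft-threshold
        # (proximal operator of the L1 penalty) to induce sparsity.
        grad = P.T @ (P @ C - E) + Q.T @ (Q @ C - V) + gamma * (C @ L)
        C = C - lr * grad
        C = np.sign(C) * np.maximum(np.abs(C) - lr * lam, 0.0)

        # Closed-form least-squares update for each basis, with column
        # norms capped at 1 to keep the scale in C.
        for B, X_mod in ((P, E), (Q, V)):
            B[:] = X_mod @ C.T @ np.linalg.pinv(
                C @ C.T + 1e-6 * np.eye(n_topics))
            B /= np.maximum(np.linalg.norm(B, axis=0, keepdims=True), 1.0)
    return P, Q, C
```

Each column of C is then a shared latent-topic representation of one sample, jointly constrained by both modalities; a third modality would add one more reconstruction term and basis matrix to the loop above.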