Dr. Amin Honarmandi Shandiz
email: Amin.HonarmandiShandiz@gmail.com
Abstract
Molecular subtyping of breast cancer is crucial for personalized treatment and prognosis. Traditional classification approaches rely on either histopathological images or gene expression profiling, limiting their predictive power. In this study, we propose a deep multimodal learning framework that integrates histopathological images and gene expression data to classify breast cancer into BRCA.Luminal and BRCA.Basal / Her2 subtypes. Our approach employs a ResNet-50 model for image feature extraction and fully connected layers for gene expression processing, with a cross-attention fusion mechanism to enhance modality interaction. We conduct extensive experiments using five-fold cross-validation, demonstrating that our multimodal integration outperforms unimodal approaches in terms of classification accuracy, precision-recall AUC, and F1-score. Our findings highlight the potential of deep learning for robust and interpretable breast cancer subtype classification, paving the way for improved clinical decision-making.
The code for our proposed multimodal deep learning framework is available at https://github.com/AminHonarmandiShandiz/cancerpredict.
keywords:
Breast Cancer Subtyping, Multimodal Deep Learning, Histopathological Image Analysis, Gene Expression Profiling, Cross-Attention Fusion, Convolutional Neural Networks (CNN), Molecular Classification, Precision Oncology, Feature Alignment, Cancer Heterogeneity.
1 Introduction
Breast cancer is a heterogeneous disease, and precise molecular subtyping is essential for effective treatment selection and prognosis prediction[1]. Traditional classification approaches, including histopathological evaluation and genomic profiling, have been widely used in clinical practice[2, 3]. However, single-modality methods often fail to capture the full complexity of tumor biology. While histopathological images provide morphological and structural insights, gene expression data offer a detailed molecular landscape, including information on oncogenic pathways and tumor microenvironment interactions[4, 5]. Despite their individual strengths, these modalities have limitations when used independently, leading to suboptimal classification performance and reduced clinical applicability.
Recent advancements in deep learning have demonstrated promising results in cancer classification by leveraging convolutional neural networks (CNNs)[3] for imaging data and deep neural networks for genomic data[6]. Multimodal AI setups have also been explored in other domains, such as speech and ultrasound imaging[7, 8, 9, 10, 11, 12, 13, 14, 15]. Inspired by the DeepCC framework[16], which transformed gene expression profiles into functional spectra for cancer subtype classification, we propose a multimodal deep learning approach that integrates both histopathological images and gene expression data to improve breast cancer subtyping. This study explores the effectiveness of different fusion strategies, including cross-attention fusion, concatenation fusion, and late fusion, to determine the optimal method for integrating multimodal information. By enhancing feature alignment and leveraging complementary information from both data sources, our framework aims to improve the accuracy and robustness of breast cancer subtype classification.
2 Materials and Methods
2.1 Dataset and Pre-processing
This study utilizes a dataset containing histopathological images and gene expression profiles from breast cancer patients, shown in Figure 1 [17].

The histopathological images were processed by extracting 256×256-pixel patches from whole-slide images, using Otsu's thresholding to ensure that only regions with high tissue content were selected.
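A minimal sketch of this patch-selection step is given below, assuming a plain RGB array in place of a whole-slide reader and an illustrative 50% tissue-fraction cutoff; neither assumption is taken from the original pipeline.

```python
# Sketch of tissue-patch extraction with Otsu's thresholding.
# The 50% tissue-fraction cutoff and the use of an in-memory RGB image
# (rather than a whole-slide reader) are illustrative assumptions.
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu


def extract_tissue_patches(slide_rgb: np.ndarray, patch_size: int = 256,
                           min_tissue_fraction: float = 0.5):
    """Yield patch_size x patch_size patches whose tissue content exceeds the cutoff."""
    gray = rgb2gray(slide_rgb)                    # background is bright, tissue is dark
    tissue_mask = gray < threshold_otsu(gray)     # True where tissue is present
    h, w = gray.shape
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            mask_patch = tissue_mask[y:y + patch_size, x:x + patch_size]
            if mask_patch.mean() >= min_tissue_fraction:
                yield slide_rgb[y:y + patch_size, x:x + patch_size]
```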

To enhance model generalization, data augmentation techniques such as color jitter (brightness 20%, contrast 20%, saturation 20%, hue 10%) and random rotations were applied to the image patches, as shown in Figure 2. Feature extraction was performed using a ResNet-50 model pre-trained on ImageNet, producing a 35×2048 feature matrix per patient.
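The following sketch illustrates the augmentation settings listed above together with ResNet-50 feature extraction; stacking 35 patches per patient into a 35×2048 matrix is shown as a simple loop and is only an assumption about how the per-patient matrix was assembled.

```python
# Sketch of the augmentation and ResNet-50 feature extraction described above.
# The color-jitter parameters follow the text; the rotation range and the
# per-patient stacking of 35 patches are assumptions.
import torch
import torch.nn as nn
from torchvision import models, transforms

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomRotation(degrees=180),
    transforms.ToTensor(),
])

# ResNet-50 pre-trained on ImageNet, with the classification head removed so
# that each 256x256 patch yields a 2048-dimensional feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()
backbone.eval()


@torch.no_grad()
def patient_feature_matrix(patches):
    """patches: list of 35 PIL image patches for one patient."""
    batch = torch.stack([augment(p) for p in patches])   # (35, 3, 256, 256)
    return backbone(batch)                               # (35, 2048)
```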
For the gene expression data, preprocessing involved multiple steps to ensure consistency and accuracy. Negative values were adjusted by shifting all values by the absolute minimum plus one, followed by log transformation and Z-score normalization, as shown in Figure 3.
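The shift, log transformation, and Z-score normalization can be summarized in a few lines; the log base and the per-gene (column-wise) normalization axis in this sketch are assumptions.

```python
# Minimal sketch of the gene-expression preprocessing: shift by the absolute
# minimum plus one, log-transform, then Z-score normalize.
# The choice of log2 and of per-gene (column-wise) statistics are assumptions.
import numpy as np
import pandas as pd


def preprocess_expression(expr: pd.DataFrame) -> pd.DataFrame:
    shifted = expr + abs(expr.values.min()) + 1       # make all values positive
    logged = np.log2(shifted)                         # log transformation
    return (logged - logged.mean()) / logged.std()    # Z-score per gene (column)
```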

Outliers were identified and removed using the interquartile range (IQR) method (Figure 4), and missing values were imputed using either the mean or median strategy. To align samples across modalities, gene expression data were filtered to include only patients with corresponding histopathological images. Class labels were assigned based on molecular subtyping, with BRCA.Luminal labeled as 0 and BRCA.Basal/Her2 labeled as 1.
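The sketch below illustrates one plausible reading of these steps; treating IQR outliers as missing values before imputation, and the hypothetical helper names used here, are assumptions rather than the exact procedure.

```python
# Sketch of IQR-based outlier handling, imputation, and label assignment.
# Flagging IQR outliers as missing and then imputing them is one possible
# interpretation of the text; the function names are illustrative.
import pandas as pd


def remove_iqr_outliers(expr: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    q1, q3 = expr.quantile(0.25), expr.quantile(0.75)
    iqr = q3 - q1
    mask = (expr < q1 - k * iqr) | (expr > q3 + k * iqr)
    return expr.mask(mask)                      # flag outliers as missing


def impute(expr: pd.DataFrame, strategy: str = "mean") -> pd.DataFrame:
    fill = expr.mean() if strategy == "mean" else expr.median()
    return expr.fillna(fill)                    # column-wise imputation


def assign_labels(subtypes: pd.Series) -> pd.Series:
    # BRCA.Luminal -> 0, BRCA.Basal/Her2 -> 1
    return (subtypes != "BRCA.Luminal").astype(int)
```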

2.2 Model Implementation
The deep learning model consists of two parallel pathways for feature extraction, followed by a fusion mechanism (Figure 5). The histopathology pathway processes image features using ResNet-50, extracting a compact feature representation. The genomic pathway utilizes a fully connected neural network to encode gene expression profiles into a latent space. Three fusion strategies were evaluated: concatenation fusion, where features from both modalities are merged directly; late fusion, where predictions from separate unimodal models are combined; and cross-attention fusion, where a multi-head attention mechanism is applied to align features from histopathological images and gene expression data. The cross-attention mechanism was designed to enhance feature interactions by dynamically weighting the importance of each modality's features.
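A minimal PyTorch sketch of the cross-attention variant is shown below; the hidden sizes, the number of attention heads, and the choice of letting the gene embedding act as the query over the 35 patch features are assumptions, while the overall two-pathway structure follows the description above.

```python
# Sketch of the two-pathway model with cross-attention fusion.
# Hidden sizes, head count, and the query/key/value assignment are assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, n_genes: int, img_dim: int = 2048,
                 d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)          # project patch features
        self.gene_enc = nn.Sequential(                       # genomic pathway (MLP)
            nn.Linear(n_genes, 512), nn.ReLU(),
            nn.Linear(512, d_model), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, img_feats, gene_expr):
        # img_feats: (B, 35, 2048), gene_expr: (B, n_genes)
        img = self.img_proj(img_feats)                        # (B, 35, d_model)
        gene = self.gene_enc(gene_expr).unsqueeze(1)          # (B, 1, d_model)
        # Gene query attends over image-patch keys/values.
        fused, _ = self.attn(query=gene, key=img, value=img)  # (B, 1, d_model)
        x = torch.cat([fused.squeeze(1), gene.squeeze(1)], dim=-1)
        return self.classifier(x).squeeze(-1)                 # one logit per sample
```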

Model training was conducted using the Adam optimizer with a learning rate of 1e-5, and binary cross-entropy was used as the loss function. Five-fold cross-validation was implemented to ensure robust performance evaluation, and performance metrics included F1-score, MCC, and PR-AUC to assess classification effectiveness.
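The training and evaluation protocol can be sketched as follows, reusing the CrossAttentionFusion module from the previous sketch; the number of epochs, the full-batch updates, and the 0.5 decision threshold are simplifying assumptions, and the early-stopping callback mentioned in Section 3.1 is omitted for brevity.

```python
# Sketch of training and evaluation: Adam with lr 1e-5, binary cross-entropy,
# stratified five-fold cross-validation, and F1 / MCC / PR-AUC metrics.
# Epoch count, full-batch updates, and the 0.5 threshold are assumptions.
import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, matthews_corrcoef, average_precision_score


def cross_validate(img_feats, gene_expr, labels, epochs: int = 30):
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(gene_expr.numpy(), labels.numpy()):
        model = CrossAttentionFusion(n_genes=gene_expr.shape[1])
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
        loss_fn = torch.nn.BCEWithLogitsLoss()

        for _ in range(epochs):
            model.train()
            optimizer.zero_grad()
            logits = model(img_feats[train_idx], gene_expr[train_idx])
            loss = loss_fn(logits, labels[train_idx].float())
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            probs = torch.sigmoid(model(img_feats[val_idx], gene_expr[val_idx])).numpy()
        y_true = labels[val_idx].numpy()
        preds = (probs >= 0.5).astype(int)
        print("F1:", f1_score(y_true, preds),
              "MCC:", matthews_corrcoef(y_true, preds),
              "PR-AUC:", average_precision_score(y_true, probs))
```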
3 Results
3.1 Unimodal Performance
Initial experiments assessed the performance of single-modality models. The gene expression model achieved an F1-score of 0.8197 and a PR-AUC of 0.9435, demonstrating strong predictive power from genomic data alone. However, the histopathological image model underperformed, with an F1-score of 0.1780 and a PR-AUC of 0.3656 (Figure 6 and Figure 7, respectively), indicating that image features alone were insufficient for accurate subtype classification. The training curve of the image modality also appears less smooth because training used an early-stopping callback; the plotted curve is therefore the average over the five folds, truncated at the minimum number of training steps.


3.2 Multimodal Performance
The integration of both modalities significantly improved classification performance. The cross-attention fusion model achieved the highest accuracy, with an F1-score of 0.9379 and a PR-AUC of 0.9948 (Figure 9). The concatenation fusion model followed closely, with an F1-score of 0.8960 and a PR-AUC of 0.9684 (Figure 8), while the late fusion approach yielded lower performance, confirming that feature-level integration is more effective than decision-level fusion.


3.3 Ablation Study
To evaluate the contribution of each modality, ablation studies were conducted. Removing image features led to a 12% drop in classification accuracy, confirming the added value of histopathology data when combined with gene expression. Eliminating the cross-attention mechanism resulted in a 5% reduction in PR-AUC, demonstrating the effectiveness of attention-based feature alignment.
3.4 Confusion Matrix Analysis
Analysis of the confusion matrix showed high recall for BRCA.Luminal cases, while BRCA.Basal/Her2 cases exhibited moderate misclassification, suggesting further optimization may be required for certain subtypes.
4 Discussion
The results highlight the advantages of multimodal deep learning in breast cancer classification. The integration of histopathological and genomic features significantly enhances predictive performance compared to unimodal approaches. The superior performance of the cross-attention fusion model underscores the importance of aligning and weighting complementary features from different modalities.
Clinically, this approach has the potential to improve patient stratification and treatment selection by providing more accurate molecular subtype predictions. The study also demonstrates the importance of preprocessing steps, particularly outlier removal and normalization, in stabilizing model performance. However, some limitations remain, including dataset size constraints and the need for external validation on independent cohorts. Future work will explore transformer-based fusion models and interpretability techniques to enhance clinical applicability.
5 Conclusion
This study presents a novel deep multimodal learning framework for breast cancer subtype classification. By integrating histopathological images and gene expression data using a cross-attention fusion model, the proposed approach achieves state-of-the-art performance in classification accuracy and robustness. These findings highlight the potential of deep learning to advance precision oncology by providing more reliable and interpretable cancer subtyping solutions.
References
- [1] J. S. Parker, M. Mullins, M. C. Cheang, S. Leung, D. Voduc, T. Vickery, S. Davies, C. Fauron, X. He, Z. Hu et al., “Supervised risk predictor of breast cancer based on intrinsic subtypes,” Journal of Clinical Oncology, vol. 27, no. 8, pp. 1160–1167, 2009.
- [2] M. R. Young and D. L. Craft, “Pathway-informed classification system (PICS) for cancer analysis using gene expression data,” Cancer Informatics, vol. 15, p. CIN.S40088, 2016.
- [3] F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, “Breast cancer histopathological image classification using convolutional neural networks,” in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 2560–2567.
- [4] J. Krawczuk and T. Łukaszuk, “The feature selection bias problem in relation to high-dimensional gene data,” Artificial Intelligence in Medicine, vol. 66, pp. 63–71, 2016.
- [5] J. Jass, “Classification of colorectal cancer based on correlation of clinical, morphological and molecular features,” Histopathology, vol. 50, no. 1, pp. 113–130, 2007.
- [6] Y. Chen, Y. Li, R. Narayan, A. Subramanian, and X. Xie, “Gene expression inference with deep learning,” Bioinformatics, vol. 32, no. 12, pp. 1832–1839, 2016.
- [7] Y. Yu, A. H. Shandiz, and L. Tóth, “Reconstructing speech from real-time articulatory MRI using neural vocoders,” in 2021 29th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 945–949.
- [8] L. Tóth and A. H. Shandiz, “3D convolutional neural networks for ultrasound-based silent speech interfaces,” in Artificial Intelligence and Soft Computing: 19th International Conference, ICAISC 2020, Zakopane, Poland, October 12–14, 2020, Proceedings, Part I 19. Springer, 2020, pp. 159–169.
- [9] T. G. Csapó, G. Gosztolya, L. Tóth, A. H. Shandiz, and A. Markó, “Optimizing the ultrasound tongue image representation for residual network-based articulatory-to-acoustic mapping,” Sensors, vol. 22, no. 22, p. 8601, 2022.
- [10] A. H. Shandiz, L. Tóth, G. Gosztolya, A. Markó, and T. G. Csapó, “Neural speaker embeddings for ultrasound-based silent speech interfaces,” arXiv preprint arXiv:2106.04552, 2021.
- [11] ——, “Improving neural silent speech interface models by adversarial training,” in The International Conference on Artificial Intelligence and Computer Vision. Springer, 2021, pp. 430–440.
- [12] A. Honarmandi Shandiz and L. Tóth, “Voice activity detection for ultrasound-based silent speech interfaces using convolutional neural networks,” in Text, Speech, and Dialogue: 24th International Conference, TSD 2021, Olomouc, Czech Republic, September 6–9, 2021, Proceedings 24. Springer, 2021, pp. 499–510.
- [13] L. Tóth, A. H. Shandiz, G. Gosztolya, and T. G. Csapó, “Adaptation of tongue ultrasound-based silent speech interfaces using spatial transformer networks,” arXiv preprint arXiv:2305.19130, 2023.
- [14] C. Zainkó, L. Tóth, A. H. Shandiz, G. Gosztolya, A. Markó, G. Németh, and T. G. Csapó, “Adaptation of Tacotron2-based text-to-speech for articulatory-to-acoustic mapping using ultrasound tongue imaging,” arXiv preprint arXiv:2107.12051, 2021.
- [15] A. H. Shandiz and L. Tóth, “Improved processing of ultrasound tongue videos by combining ConvLSTM and 3D convolutional networks,” in International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer, 2022, pp. 265–274.
- [16] F. Gao, W. Wang, M. Tan, L. Zhu, Y. Zhang, E. Fessler, L. Vermeulen, and X. Wang, “DeepCC: a novel deep learning-based framework for cancer molecular subtype classification,” Oncogenesis, vol. 8, no. 9, p. 44, 2019.
- [17] “Dataset,” Figshare, 2025. [Online]. Available: https://doi.org/10.6084/m9.figshare.28050083