Disclaimer
In the initial proposal of this project, the aim was to develop a Convolutional Neural Network (CNN) model capable of classifying mammogram images into three distinct categories: normal, benign, and malignant. During the course of the project, however, the scope was revised to a binary classification system distinguishing only between normal and abnormal cases. This adjustment was made because of data limitations, time constraints, and limited computational resources. This change from the original proposal should be kept in mind when considering the outcomes and applications of the current model.
Processing Data by Convolutional Neural Network

Our CNN model, developed using the Keras Sequential API [1], is designed to distinguish between normal and abnormal mammograms. The Sequential API facilitates a linear stacking of layers, with each layer building upon its predecessor. The model processes grayscale images of 2250×1500 pixels, a resolution that preserves the fine detail needed for feature extraction in medical diagnostics.
At the heart of this model are the convolutional layers, essential for feature extraction. The initial convolutional layer utilizes 32 filters, each sized at 3×3, and employs Rectified Linear Unit (ReLU) activation [2]. This layer is instrumental in extracting a variety of features from the mammogram images. To enhance training stability and speed, batch normalization follows each convolutional layer. Subsequently, MaxPooling layers with a 2×2 pool size reduce the spatial dimensions of the feature maps, boosting computational efficiency and highlighting essential features [3].
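To make the pooling step concrete, the toy example below (values invented for illustration, not taken from the model) shows how 2×2 max pooling halves each spatial dimension while keeping the strongest activation in each block:

```python
# Illustrative sketch of 2x2 max pooling on a small feature map
# (toy values; not taken from the model itself).
def max_pool_2x2(fmap):
    """Downsample a 2D feature map by taking the max of each 2x2 block."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, cols - 1, 2)]
        for i in range(0, rows - 1, 2)
    ]

fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 3, 2],
    [2, 2, 4, 1],
]
print(max_pool_2x2(fmap))  # each 2x2 block collapses to its maximum: [[4, 5], [2, 4]]
```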
As the model deepens, its feature extraction capabilities expand with additional convolutional layers. The second layer comprises 64 filters, the third escalates to 256, and the fourth amplifies to 512 filters. This progressive increase in filters enables the network to learn more complex patterns from the input data.
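The spatial shrinkage across the four conv/pool stages can be tracked with simple arithmetic. The sketch below assumes Keras defaults (stride 1, 'valid' padding), since the text does not state the padding explicitly:

```python
# Rough feature-map size bookkeeping for the four conv/pool stages,
# assuming Keras defaults (stride 1, 'valid' padding) since the text
# does not state the padding explicitly.
def conv3x3_valid(h, w):
    return h - 2, w - 2      # a 3x3 'valid' convolution trims one pixel per side

def pool2x2(h, w):
    return h // 2, w // 2    # 2x2 max pooling halves each dimension (floor)

h, w = 2250, 1500
for filters in (32, 64, 256, 512):
    h, w = pool2x2(*conv3x3_valid(h, w))
    print(f"after {filters}-filter stage: {h} x {w}")
```

Under these assumptions the feature maps shrink from 2250×1500 down to 138×91 after the fourth pooling layer, while the channel count grows from 32 to 512.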
Following the convolutional layers, a Flatten layer transforms the 3D output tensor into a 1D tensor. This flattened data then proceeds through fully connected layers. The first Dense layer, containing 128 units with ReLU activation, learns intricate patterns from the features. In this enhanced version, the layer now includes an L2 regularization term where it penalizes large weights by adding the sum of the squares of the weights to the loss function, multiplied by the regularization factor 0.1 [4]. To combat overfitting, a Dropout layer with a 0.3 rate randomly deactivates a portion of input units during training [5] [6]. The second Dense layer includes 64 units with ReLU activation. The final Dense layer, essential for binary classification, has a single unit with sigmoid activation, outputting a probability score for categorizing mammograms as normal or abnormal.
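The full stack described above can be sketched in Keras as follows. This is a minimal sketch, assuming Keras defaults for padding and strides (the text does not specify them); the input shape of 2250×1500×1 is noted in a comment rather than fixed at construction, so the sketch stays lightweight:

```python
# Minimal sketch of the architecture described above, assuming Keras
# defaults (stride 1, 'valid' padding). Layer widths follow the text.
# Intended input: grayscale mammograms of shape (2250, 1500, 1).
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(512, (3, 3), activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l2(0.1)),  # L2 penalty, factor 0.1
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "abnormal"
])
```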
The model’s compilation, training, and evaluation are key to its performance. We use the Adam optimizer for effective parameter adjustments during training, helping the model converge toward a good solution [7]. The binary cross-entropy loss function is well suited to this binary classification task, measuring the dissimilarity between predicted probabilities and actual labels to guide training toward accurate classification.
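For a single example with label y and predicted probability p, binary cross-entropy is −[y·ln(p) + (1−y)·ln(1−p)]; a small worked example shows how it penalizes confident wrong predictions much more than confident correct ones:

```python
# Worked example of binary cross-entropy for a single prediction:
# BCE = -(y*ln(p) + (1-y)*ln(1-p)) for label y and predicted probability p.
import math

def binary_crossentropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(binary_crossentropy(1, 0.9), 4))  # confident and correct -> small loss (0.1054)
print(round(binary_crossentropy(1, 0.1), 4))  # confident and wrong -> large loss (2.3026)
```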
During the training phase, we employ a batch size of 32 over 50 epochs. The model learns from the training data (x_train and y_train) and fine-tunes its parameters for precise predictions. We also include validation data (x_val and y_val) to monitor performance and mitigate overfitting. This validation ensures the model’s generalizability to new data, enhancing its real-world applicability.
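The compile-and-fit sequence might look like the sketch below. The array names x_train, y_train, x_val, and y_val are the hypothetical variables referenced above; here we substitute a tiny stand-in model and small random arrays purely so the sketch is runnable — the actual setup uses the CNN described earlier on 2250×1500 images, with batch size 32 for 50 epochs:

```python
# Runnable sketch of the training call described above. A tiny stand-in
# model and random data are used for illustration only; the report's
# actual run uses the full CNN, batch_size=32, and epochs=50.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

x_train = np.random.rand(8, 64, 64, 1).astype("float32")
y_train = np.random.randint(0, 2, size=(8,)).astype("float32")
x_val = np.random.rand(4, 64, 64, 1).astype("float32")
y_val = np.random.randint(0, 2, size=(4,)).astype("float32")

model = keras.Sequential([
    layers.Flatten(input_shape=(64, 64, 1)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# validation_data lets us monitor generalization after every epoch.
history = model.fit(x_train, y_train, batch_size=32, epochs=1,
                    validation_data=(x_val, y_val), verbose=0)
```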
Results
At the outset of training, the model’s initial loss was recorded at 137.0950, yet this metric demonstrated significant volatility throughout the training period, ultimately reaching its nadir at 4.2985 in epoch 45 (as depicted in Figure 3). This trend was mirrored in the validation loss, also presented in Figure 3, which similarly fluctuated, achieving its lowest point of 5.2842 in epoch 24. Such variations in loss metrics suggest challenges in the model’s learning process, warranting further investigation and potential adjustments.
Turning our attention to accuracy, a critical measure of the model’s performance, it began with a training accuracy of 56.94%. This starting point, only modestly above chance for a binary task, was followed by considerable fluctuation over subsequent epochs, with accuracy peaking at 62.21% in epoch 2, as illustrated in Figure 4. The validation accuracy followed a similar pattern, albeit with a slightly higher peak of 66.16% in epoch 41. These inconsistencies highlight the model’s unstable learning trajectory and underscore the need for refinement in its training methodology.
Precision and recall, critical metrics for assessing the model’s predictive capabilities, fluctuated markedly across the training epochs, with the highest training precision reaching 76.45% in epoch 11 and the peak recall at 44.47% in epoch 46, as shown in Figures 5 and 6, respectively. The validation data exhibited comparable volatility, with precision and recall peaking at various points throughout the training process. This variability points to potential issues in the model’s ability to generalize from its training data, a factor that could significantly impact its real-world applicability.
When evaluating the model’s performance on the test dataset, it achieved 70 true positives, 52 false negatives, 102 false positives, and 75 true negatives. The resulting metrics were a loss of 32.85, accuracy of 48.49%, precision of 40.70%, recall of 57.38%, and an F1 score of 47.62%. These figures, particularly the low accuracy and precision, raise concerns about the model’s ability to differentiate between classes effectively. The comparatively higher recall indicates a tendency to flag positive instances; however, this comes at the cost of a large number of false positives. The F1 score, the harmonic mean of precision and recall, stood below 50%, reflecting the model’s suboptimal balance between these two metrics.
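The reported F1 score follows directly from the precision and recall figures, as the short check below shows:

```python
# F1 is the harmonic mean of precision and recall; plugging in the
# reported test-set values reproduces the reported F1 score.
precision, recall = 0.4070, 0.5738
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # F1 = 47.62%
```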
Furthermore, the area under the ROC curve, depicted in Figure 7, was recorded at 0.54, only marginally better than random chance (an AUC of 0.5). This indicates that the model in its current form has little discriminative power and requires substantial improvements in key areas such as accuracy, precision, and recall.
In conclusion, this analysis reveals that while the CNN model has potential, its current performance is marked by significant fluctuations in loss and accuracy, as detailed in Figures 3 and 4. These fluctuations suggest that the model could benefit from a re-evaluation of its training approach, data handling, and overall architecture. Modifications such as hyperparameter tuning and alternative architectural designs might enhance the model’s efficacy. The goal is to improve the model’s capacity for generalization and reliable prediction, which is essential for its application in practical scenarios.
- [1] Manaswi, N. K. (2018). Understanding and Working with Keras. Deep Learning with Applications Using Python, 31–43. https://doi.org/10.1007/978-1-4842-3516-4_2
- [2] Lau, M. M., & Lim, K. H. (2018). Review of Adaptive Activation Function in Deep Neural Network. 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES). https://doi.org/10.1109/iecbes.2018.8626714
- [3] Gholamalinezhad, H., & Khosravi, H. (2020). Pooling Methods in Deep Neural Networks, a Review. ArXiv Preprint. https://doi.org/10.48550/arXiv.2009.07485
- [4] Ying, X. (2019). An Overview of Overfitting and its Solutions. Journal of Physics: Conference Series, 1168(2), 022022. https://doi.org/10.1088/1742-6596/1168/2/022022
- [5] Garbin, C., Zhu, X., & Marques, O. (2020). Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimedia Tools and Applications, 79(19-20), 12777–12815. https://doi.org/10.1007/s11042-019-08453-9
- [6] Ha, C., Tran, V.-D., Ngo Van, L., & Than, K. (2019). Eliminating overfitting of probabilistic topic models on short and noisy text: The role of dropout. International Journal of Approximate Reasoning, 112, 85–104. https://doi.org/10.1016/j.ijar.2019.05.010
- [7] Kandel, I., Castelli, M., & Popovič, A. (2020). Comparative Study of First Order Optimizers for Image Classification Using Convolutional Neural Networks on Histopathology Images. Journal of Imaging, 6(9), 92. https://doi.org/10.3390/jimaging6090092