Discovering social determinants of health from case reports using natural language processing: algorithmic development and validation

Table 1 Comparative performance of NER methods across biomedical datasets. Values are F1-scores (%); the Mean ± SD column gives each method's mean and standard deviation (SD) across all datasets.

Method	NCBI	BC5CDR	tmVar	BC4CHEMD	BC2GM	i2b2-clinical	Our test set	Mean ± SD
BiLSTM-CRF [31]	85.81	86.18	86.28	89.48	81.08	85.66	87.24	85.74 ± 2.46*¹
BiLSTM-CNN-Char [30]	88.19	87.58	87.20	90.06	83.29	84.08	89.25	87.43 ± 2.09*
BiLSTM-CRF-MTL [32]	88.85	84.93	83.35	89.42	82.12	83.25	86.78	85.95 ± 2.71*
Doc-Att-BiLSTM-CRF [33]	88.61	87.33	83.31	88.20	81.80	85.18	86.94	86.89 ± 2.62*
CollaboNet [34]	84.08	87.12	81.75	87.12	79.73	85.61	87.13	85.21 ± 2.98**
BLUE-BERT [35]	88.37	87.62	87.24	90.19	82.93	86.09	88.10	87.26 ± 2.24*
ClinicalBERT [17]	87.01	84.19	79.10	80.13	78.13	84.10	84.93	82.51 ± 2.51**
BioBERT [15]	90.01	89.30	88.70	91.28	88.52	88.33	91.94	89.58 ± 2.05*
BioBERT + CRF [36]	89.71	88.39	88.58	90.28	88.01	87.33	90.94	89.03 ± 2.01*
BioBERT + MLP [37]	89.10	88.37	88.10	90.08	87.72	86.73	90.34	88.63 ± 2.10*
Our approach	90.08	89.98	89.13	91.58	89.15	89.17	92.98	90.31 ± 1.96

¹ Asterisks mark a statistically significant difference in mean performance: * p-value < 0.005; ** p-value < 0.001.
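
As an illustration of how a Mean ± SD summary of this kind can be derived from per-dataset F1-scores, the following minimal sketch computes the mean and the sample standard deviation for two rows of Table 1. The helper names `summarize` and `format_summary` are hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption: the table does not state how the column was computed (for example, whether repeated runs were involved), so the values printed here need not reproduce the published column.

```python
from statistics import mean, stdev

# F1-scores copied from two rows of Table 1, in column order:
# NCBI, BC5CDR, tmVar, BC4CHEMD, BC2GM, i2b2-clinical, our test set.
f1_scores = {
    "BioBERT": [90.01, 89.30, 88.70, 91.28, 88.52, 88.33, 91.94],
    "Our approach": [90.08, 89.98, 89.13, 91.58, 89.15, 89.17, 92.98],
}


def summarize(scores):
    """Return (mean, sample SD) of a list of F1-scores.

    Assumption: sample SD with an n - 1 denominator; the source table
    does not say which estimator was used.
    """
    return mean(scores), stdev(scores)


def format_summary(scores):
    """Format a summary in the 'mean ± SD' style of the table."""
    m, s = summarize(scores)
    return f"{m:.2f} ± {s:.2f}"


for method, scores in f1_scores.items():
    print(f"{method}: {format_summary(scores)}")
```

Swapping `stdev` for `statistics.pstdev` gives the population SD instead; which of the two matches the reported column, if either, is not determinable from the table alone.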

ISSN: 2731-684X