Table 1 Performance on the NER task, measured by precision (P), recall (R), and F1-score (F1). Bold indicates the best score. All baselines are tuned to their optimal settings, and the best result for each method is reported. The train–test splits for the public datasets follow the original papers; our test data use a standard 70–15–15 split. Fivefold cross-validation is conducted, and each measure is reported as mean ± standard deviation (SD).

From: A framework for multi-faceted content analysis of social media chatter regarding non-medical use of prescription medications

| Model | Metric | NCBI [67] | i2b2-clinical [68] | i2b2-2012 [69] | Our test set |
|---|---|---|---|---|---|
| BiLSTM-CRF [71] | P | 88.3 ± 1.9 | 89.2 ± 1.6 | 87.4 ± 2.2 | 90.1 ± 2.1 |
| | R | 90.1 ± 1.7 | 91.1 ± 1.8 | 89.1 ± 2.0 | 92.1 ± 1.9 |
| | F1 | 89.2 ± 1.8 | 90.1 ± 1.7 | 88.2 ± 2.1 | 91.1 ± 2.0 |
| Att-BiLSTM-CRF [72] | P | 85.5 ± 1.5 | 88.6 ± 1.7 | 89.8 ± 1.9 | 88.1 ± 1.6 |
| | R | 86.2 ± 1.4 | 90.0 ± 1.8 | 92.0 ± 1.7 | 92.2 ± 1.5 |
| | F1 | 85.9 ± 1.4 | 89.3 ± 1.7 | 90.9 ± 1.8 | 90.1 ± 1.5 |
| CollabNet [73] | P | 82.1 ± 1.2 | 84.2 ± 1.3 | 83.1 ± 1.4 | 86.7 ± 1.5 |
| | R | 84.1 ± 1.3 | 85.4 ± 1.2 | 84.3 ± 1.3 | 88.1 ± 1.4 |
| | F1 | 83.1 ± 1.2 | 84.8 ± 1.2 | 83.7 ± 1.3 | 87.4 ± 1.4 |
| BLUE [74] | P | 90.4 ± 1.7 | 92.1 ± 1.5 | 91.5 ± 1.6 | 91.0 ± 1.4 |
| | R | 92.9 ± 1.6 | 93.2 ± 1.4 | 92.1 ± 1.5 | 93.0 ± 1.3 |
| | F1 | 91.6 ± 1.6 | 92.6 ± 1.4 | 91.8 ± 1.5 | 91.9 ± 1.3 |
| BioBERT [45] | P | 91.2 ± 1.8 | 91.0 ± 1.7 | 91.4 ± 1.9 | 92.3 ± 1.6 |
| | R | 92.3 ± 1.7 | 91.1 ± 1.8 | 93.1 ± 1.7 | 93.1 ± 1.5 |
| | F1 | 91.7 ± 1.7 | 91.0 ± 1.7 | 92.3 ± 1.8 | 92.7 ± 1.5 |
| Our approach | P | 91.9 ± 1.6 | 92.1 ± 1.4 | 92.2 ± 1.5 | 94.8 ± 1.3 |
| | R | 93.4 ± 1.5 | 93.3 ± 1.3 | 92.1 ± 1.4 | 95.1 ± 1.2 |
| | F1 | 92.6 ± 1.5 | 92.7 ± 1.3 | 92.2 ± 1.4 | 94.9 ± 1.2 |
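The metrics and the mean ± SD aggregation described in the caption can be illustrated with a minimal Python sketch. It assumes that entity-level counts of true positives (tp), false positives (fp) and false negatives (fn) are available for each fold, and that SD denotes the sample standard deviation. The source states neither assumption. The helper names `prf1` and `mean_sd` and the fold counts are hypothetical and are not taken from the paper.

```python
import statistics


def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall and F1 (as percentages) from entity counts.

    Assumes exact-match entity counting (an assumption, not from the paper):
    tp = predicted entities matching a gold entity, fp = spurious predictions,
    fn = gold entities the model missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 100 * precision, 100 * recall, 100 * f1


def mean_sd(scores: list[float]) -> str:
    """Format per-fold scores as 'Mean ± SD' using the sample standard deviation."""
    return f"{statistics.mean(scores):.1f} ± {statistics.stdev(scores):.1f}"


# Hypothetical (tp, fp, fn) counts for five folds; not taken from the paper.
folds = [(930, 50, 70), (915, 60, 65), (940, 45, 60), (925, 55, 72), (935, 48, 66)]
per_fold = [prf1(tp, fp, fn) for tp, fp, fn in folds]
for name, idx in (("P", 0), ("R", 1), ("F1", 2)):
    print(name, mean_sd([scores[idx] for scores in per_fold]))
```

In this sketch, F1 is averaged over the folds, so a reported F1 need not equal the harmonic mean of the averaged P and R in the same cell. In the table the two are close. For example, for our approach on NCBI, 2 · 91.9 · 93.4 / (91.9 + 93.4) ≈ 92.6.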