Frequency Bias in MLM-trained BERT Embeddings for Medical Codes

Trevor Yu; Tia Tuinstra; Bing Hu; Ryan Rezai; Thomas Fortin; Rachel DiMaio; Brian Vartian; Bryan Tripp

Authors

Trevor Yu University of Waterloo
Tia Tuinstra University of Waterloo
Bing Hu University of Waterloo
Ryan Rezai University of Waterloo
Thomas Fortin University of Waterloo
Rachel DiMaio University of Waterloo
Brian Vartian McMaster University
Bryan Tripp University of Waterloo

Keywords:

Masked Language Modeling, Embeddings, BERT, Medical AI, Electronic Health Records

Abstract

Transformers are deep networks that operate on loosely structured data such as natural language and electronic medical records. Transformers learn embedding vectors that represent discrete inputs (e.g., words or medical codes). Ideally, a transformer should learn similar embedding vectors for two codes with similar medical meanings, as this helps the network make similar inferences given either code. Previous work has suggested that transformers do learn such embeddings, but this has not been analyzed in detail, and work with transformers in other domains suggests that unwanted biases can occur. We trained a Bidirectional Encoder Representations from Transformers (BERT) network on clinical diagnostic codes and analyzed the learned embeddings. The analysis shows that the transformer can learn an undesirable frequency-related bias in embedding similarities, so that the similarities fail to reflect true relationships between medical codes. This is especially true for infrequently used codes. It will be important to mitigate this issue in future applications of deep networks to electronic health records.
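
To make the notion of "similar embedding vectors" concrete, the following is a minimal sketch that assumes cosine similarity as the measure; the abstract does not state which similarity measure the analysis uses. The code names (`code_a`, `code_b`, `code_c`), the three-dimensional vectors, and the helper `most_similar` are hypothetical and chosen only for illustration. In practice the vectors would be rows of a trained model's input embedding matrix.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(code, embeddings):
    """Rank all other codes by cosine similarity to `code`, highest first."""
    query = embeddings[code]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != code
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings (assumption: not taken from the paper).
embeddings = {
    "code_a": [1.0, 0.0, 0.0],
    "code_b": [0.8, 0.6, 0.0],
    "code_c": [0.0, 0.0, 1.0],
}

# Nearest neighbours of code_a under cosine similarity.
ranking = most_similar("code_a", embeddings)
```

One way to look for a frequency-related bias with such a helper would be to check whether the nearest neighbours of rarely used codes are mostly other rarely used codes, regardless of medical meaning. That check is a suggestion here, not a description of the authors' procedure.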

Author Biographies

Trevor Yu, University of Waterloo

MASc Student, Department of Systems Design Engineering

Tia Tuinstra, University of Waterloo

MASc Student, Department of Systems Design Engineering

Bing Hu, University of Waterloo

PhD Student, Department of Systems Design Engineering

Ryan Rezai, University of Waterloo

Undergraduate Co-op Student, Department of Systems Design Engineering

Thomas Fortin, University of Waterloo

MASc Student, Department of Systems Design Engineering

Rachel DiMaio, University of Waterloo

MASc Student, Department of Systems Design Engineering

Brian Vartian, McMaster University

Assistant Clinical Professor (Adjunct), Department of Family Medicine, McMaster University

Adjunct Professor, Systems Design Engineering, University of Waterloo

Bryan Tripp, University of Waterloo

Associate Professor, Department of Systems Design Engineering
