Text Detoxification using Large Pre-trained Neural Models

Published in The 2021 Conference on Empirical Methods in Natural Language Processing, 2021

We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.
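The second method's mask-and-fill pipeline can be illustrated with a toy sketch. In the paper, BERT's masked-language-model predictions (ranked for non-toxicity) fill each masked slot, possibly with several words; here a hand-written lexicon stands in for the model, purely to show the pipeline shape. The word list and replacements below are illustrative assumptions, not the paper's actual vocabulary.

```python
import re

# Hypothetical toxic-word -> replacement lexicon. In the real method,
# candidates come from BERT's masked-LM predictions, not a fixed table.
REPLACEMENTS = {
    "stupid": ["unreasonable"],
    "idiotic": ["poorly", "thought", "out"],  # one mask -> several words
}
MASK = "[MASK]"

def detoxify(text):
    """Mask lexicon-listed toxic words, then fill each mask with a
    variable-length non-offensive substitute (mirroring the flexible
    BERT variant that emits a variable number of words per mask)."""
    tokens = re.findall(r"\w+|\W+", text)  # keep words and separators
    masked, filled = [], []
    for t in tokens:
        subst = REPLACEMENTS.get(t.lower())
        if subst:
            masked.append(MASK)
            filled.append(" ".join(subst))
        else:
            masked.append(t)
            filled.append(t)
    return "".join(masked), "".join(filled)

masked, detoxified = detoxify("your plan is stupid")
print(masked)      # your plan is [MASK]
print(detoxified)  # your plan is unreasonable
```

Swapping the lexicon lookup for a real masked-LM call (e.g. a fill-mask model scoring candidate tokens by style) recovers the structure of the paper's second method.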