Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification

Published in Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, co-located with the 47th International Conference on Very Large Data Bases (VLDB), 2021

One way to fight toxicity online is to automatically rewrite toxic messages. This is a sequence-to-sequence task, and the easiest way of solving it is to train an encoder-decoder model on a set of parallel sentences (pairs of sentences with the same meaning, where one is offensive and the other is not). However, such data does not exist, forcing researchers to resort to non-parallel corpora. We close this gap by suggesting a crowdsourcing scenario for creating a parallel dataset of detoxifying paraphrases. In our first experiments, we collect paraphrases for 1,200 toxic sentences. We describe and analyse the crowdsourcing setup and the resulting corpus.