Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

About me

Posts

Future Blog Post

less than 1 minute read

Published:

This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.

Blog Post number 4

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 3

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 2

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Blog Post number 1

less than 1 minute read

Published:

This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.

Portfolio

WIFI Single Access Point Positioning

Published:

The majority of WiFi positioning methods use several access points to estimate users' locations. However, interference from walls, other obstacles, and signal reflection sometimes makes it impossible to use several access points. Moreover, the private and public places where WiFi positioning can be useful, such as homes, cafes, and malls, have only one access point covering almost all of the space. It is therefore crucial to perform positioning with only one access point: this removes the restriction on the number of access points and makes WiFi positioning available in all public and private places. Several probabilistic methods exist to determine the position of the user from a sequence of signal strength measurements. We used statistical models such as Markov chains to track users' paths. You can see the results of experiments in real-life premises.

Figure. Example of the results of our approach inside a real-life building with only one WiFi access point.
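
The idea of tracking a path from a signal strength sequence with a Markov model can be illustrated with a small hidden Markov model decoded by the Viterbi algorithm. This is only a sketch under stated assumptions, not the model used in the project: the three zones, their mean signal strengths, the Gaussian noise level, and the transition probabilities are all invented for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_logp):
    """Return the most likely state sequence for a sequence of observations."""
    # best[s] = log-probability of the best path that ends in state s
    best = {s: math.log(start_p[s]) + emit_logp(s, observations[0]) for s in states}
    back = []  # one dict per step: state -> best predecessor
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # choose the predecessor that maximises the path log-probability
            p = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            best[s] = prev[p] + math.log(trans_p[p][s]) + emit_logp(s, obs)
            pointers[s] = p
        back.append(pointers)
    # follow the back-pointers from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical setup: three zones at increasing distance from the access point.
ZONES = ["near", "mid", "far"]
MEAN_RSSI = {"near": -40.0, "mid": -60.0, "far": -80.0}  # assumed values, in dBm
SIGMA = 5.0                                              # assumed noise level

def gaussian_logp(zone, rssi):
    # log of a Gaussian density, dropping the constant shared by all zones
    return -((rssi - MEAN_RSSI[zone]) ** 2) / (2 * SIGMA ** 2)

START = {z: 1 / 3 for z in ZONES}
TRANSITIONS = {  # assumed: users mostly stay put or move to an adjacent zone
    "near": {"near": 0.80, "mid": 0.19, "far": 0.01},
    "mid": {"near": 0.10, "mid": 0.80, "far": 0.10},
    "far": {"near": 0.01, "mid": 0.19, "far": 0.80},
}

rssi_sequence = [-42, -45, -58, -63, -79, -82]
path = viterbi(rssi_sequence, ZONES, START, TRANSITIONS, gaussian_logp)
print(path)
```

The transition matrix is what lets a single access point be useful here: a single reading is ambiguous, but implausible jumps between zones are penalised across the whole sequence.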

Aggregation and personalization of news text content

Published:

This work is devoted to exploratory search. There are many search engines today, but they do not satisfy all user needs. In particular, users are not always comfortable using well-known large search engines for educational purposes, for example to study a specific topic from scratch when they do not yet know anything about it. Exploratory search is intended to help with such tasks. In this work we examine the purposes of different exploratory search systems and the obstacles to using them. In addition, we build an index over a collection of Russian-language articles from Habr. Using this index, we test several popular search algorithms for their suitability for exploratory search tasks: TF-IDF indexing, BM25, fastText embeddings, and topic modeling. Below you can see an example of the resulting user interface.

Figure. Example of the user interface created for exploratory search over the Habr collection: 1) The user can interact with their personal collection: add or delete documents, or add their own text. 2) The user can also see the whole feed from the source. 3) After choosing a search algorithm, the user gets recommendations based on their collection.
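
BM25, one of the ranking algorithms tested in this project, can be sketched in a few lines. The version below is a minimal illustration, not the project's implementation: it assumes pre-tokenised documents, the common defaults k1 = 1.5 and b = 0.75, and the non-negative IDF variant with +1 inside the logarithm. The toy documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every tokenised document in `docs` against the tokenised `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term frequency saturates with k1; b controls length normalisation
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented toy collection
docs = [
    "topic modeling for news".split(),
    "search engines rank documents".split(),
    "exploratory search helps users learn a topic".split(),
]
scores = bm25_scores("exploratory search".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

For recommendations based on a personal collection, the query could be built from the terms of the documents the user has saved; that choice is an assumption, since the interface description does not say how the query is formed.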

Fake News Detection using Multilingual Evidence

Published:

Misleading information spreads on the Internet at an incredible speed, which can lead to irreparable consequences in some cases. As a result, it is becoming essential to develop fake news detection technologies. While substantial work has been done in this direction, one of the limitations of the current approaches is that these models are focused only on one language and do not use multilingual information. In this work, we propose a new technique based on multilingual evidence that can be used for fake news detection and can improve existing approaches. This approach improved baseline systems for fake news detection and made their decisions more explainable to users. Below you can examine the graphical abstract of this work.

Figure. The approach contains the following steps: 1) Text Extraction from the incoming article. 2) Text Translation into several languages. 3) Cross-lingual News Retrieval based on the translated text. 4) Content Similarity Computation between the retrieved articles and the original one. 5) News Classification into true if there is enough evidence, or fake if there is contradiction.
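
Steps 4 and 5 above can be illustrated with a deliberately simple stand-in. The sketch assumes the retrieved articles have already been translated into the language of the original, uses bag-of-words cosine similarity in place of whatever similarity measure the work actually uses, and decides by a hypothetical voting rule; the function names, thresholds, and example texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(original, retrieved, sim_threshold=0.5, min_support=0.5):
    """Return "true" if enough retrieved articles resemble the original, else "fake".

    Both thresholds are illustrative assumptions, as is treating
    "nothing retrieved" as a lack of evidence.
    """
    if not retrieved:
        return "fake"
    supporting = sum(cosine_similarity(original, r) >= sim_threshold for r in retrieved)
    return "true" if supporting / len(retrieved) >= min_support else "fake"

original = "city council approves new park budget"
retrieved = [
    "the city council approves the new park budget",
    "council approves park budget for the city",
    "local team wins football match",
]
print(classify(original, retrieved))
```

A side effect of this design is that the supporting articles themselves can be shown to the user as the reason for the verdict.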

Text Style Transfer: Detoxification

Published:

Style transfer is less explored for text than for images, although its applications can be quite broad. For social good purposes, we explored unsupervised methods for text detoxification for Russian and English. We are also continuing to collect a parallel corpus so that the problem can later be addressed as a seq2seq task.

Figure. Example of use cases where the detoxification technology can be applicable. (a) Offering the user a more civil version of a message. (b) Preventing chatbots from being rude to users when trained on open data.

Publications

Fake News Detection using Multilingual Evidence

Published in IEEE 7th International Conference on Data Science and Advanced Analytics, 2020

Nowadays, misleading information spreads over the internet at an incredible speed, which can lead to irreparable consequences. As a result, it is becoming more and more essential to combat fake news, especially in its early stages. Over the past years, a lot of work has been done in this direction. However, all existing solutions have their limitations. One of the main limitations of the current approaches is that the majority of the models are focused only on one language and do not use any multilingual information. In this work, we investigate a new approach to fake news detection based on multilingual evidence. We show the effectiveness of the proposed approach in both manual and automated evaluation experiments.

Paper presentation

SkoltechNLP at SemEval-2020 Task 11: Exploring Unsupervised Text Augmentation for Propaganda Detection

Published in International Workshop on Semantic Evaluation, 2020

This paper presents a solution for the Span Identification (SI) task in the “Detection of Propaganda Techniques in News Articles” competition at SemEval-2020. The goal of the SI task is to identify specific fragments of each article which contain the use of at least one propaganda technique. This is a binary sequence tagging task. We tested several approaches finally selecting a fine-tuned BERT model as our baseline model. Our main contribution is an investigation of several unsupervised data augmentation techniques based on distributional semantics expanding the original small training dataset as applied to this BERT-based sequence tagger. We explore various expansion strategies and show that they can substantially shift the balance between precision and recall, while maintaining comparable levels of the F1 score.
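
To make "binary sequence tagging" concrete: character-level propaganda spans can be turned into one 0/1 label per token, which is the form a sequence tagger is trained on. The sketch below assumes whitespace tokenisation and half-open `[start, end)` character spans; the paper's actual preprocessing is not described here, and the example sentence is invented.

```python
import re

def spans_to_token_labels(text, spans):
    """Label each whitespace-delimited token 1 if it overlaps any [start, end) span, else 0."""
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        tokens.append(match.group())
        # two half-open intervals overlap when each starts before the other ends
        overlaps = any(match.start() < end and start < match.end() for start, end in spans)
        labels.append(int(overlaps))
    return tokens, labels

text = "They are simply the worst leaders ever"
start = text.index("the worst")
tokens, labels = spans_to_token_labels(text, [(start, len(text))])
print(list(zip(tokens, labels)))
```

With a subword tokenizer such as the one BERT uses, the same overlap test would be applied to subword offsets instead of whitespace tokens.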

Cross-lingual Evidence Improves Monolingual Fake News Detection

Published in Proceedings of the ACL-IJCNLP 2021 Student Research Workshop, ACL 2021, 2021

Misleading information spreads on the Internet at an incredible speed, which can lead to irreparable consequences in some cases. Therefore, it is becoming essential to develop fake news detection technologies. While substantial work has been done in this direction, one of the limitations of the current approaches is that these models are focused only on one language and do not use multilingual information. In this work, we propose a new technique based on cross-lingual evidence (CE) that can be used for fake news detection and can improve existing approaches. The hypothesis that cross-lingual evidence is a useful feature for fake news detection is confirmed, first, by a manual experiment based on a set of known true and fake news. In addition, we compared our fake news classification system based on the proposed feature with several strong baselines on two multi-domain datasets of general-topic news and one new COVID-19 fake news dataset, showing that combining cross-lingual evidence with strong baselines such as RoBERTa yields significant improvements in fake news detection.

Download here

Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification

Published in Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, co-located with the 47th International Conference on Very Large Data Bases (VLDB), 2021

One of the ways to fight toxicity online is to automatically rewrite toxic messages. This is a sequence-to-sequence task, and the easiest way of solving it is to train an encoder-decoder model on a set of parallel sentences (pairs of sentences with the same meaning, where one is offensive and the other is not). However, such data does not exist, making researchers resort to non-parallel corpora. We close this gap by suggesting a crowdsourcing scenario for creating a parallel dataset of detoxifying paraphrases. In our first experiments, we collect paraphrases for 1,200 toxic sentences. We describe and analyse the crowdsourcing setup and the resulting corpus.

Download here

Methods for Detoxification of Texts for the Russian Language

Published in Multimodal Technologies and Interaction, 5(9), 2021

We introduce the first study of the automatic detoxification of Russian texts to combat offensive language. This kind of textual style transfer can be used for processing toxic content on social media or for eliminating toxicity in automatically generated texts. While much work has been done for the English language in this field, there are no works on detoxification for the Russian language. We suggest two types of models—an approach based on BERT architecture that performs local corrections and a supervised approach based on a pretrained GPT-2 language model. We compare these methods with several baselines. In addition, we provide the training datasets and describe the evaluation setup and metrics for automatic and manual evaluation. The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.

Download here

Text Detoxification using Large Pre-trained Neural Models

Published in The 2021 Conference on Empirical Methods in Natural Language Processing, 2021

We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.

Download here

Talks

ViTalk: Virtual Analyst or Human Business Intelligence

Published:

Putting NLP solutions into production is not always easy. We present our NLP application for BI, the virtual analyst ViTalk, and the path we took in developing it. ViTalk can handle natural language requests about the KPIs of your business (“what is our income”, “show me only last year”, “compare with current”, etc.). While the NLP components inside are fairly standard, the real-life implementation faced its own obstacles. We want to share our experience and show that even low-resource, unusual data can lead to a successful product.

DSAA Poster: Fake News Detection Using Multilingual Evidence

Published:

Nowadays, misleading information spreads over the internet at an incredible speed, which can lead to irreparable consequences. As a result, it is becoming more and more essential to combat fake news, especially in its early stages. Over the past years, a lot of work has been done in this direction. However, all existing solutions have their limitations. One of the main limitations of the current approaches is that the majority of the models are focused only on one language and do not use any multilingual information. In this work, we investigate a new approach to fake news detection based on multilingual evidence. We show the effectiveness of the proposed approach in both manual and automated evaluation experiments.

Teaching

Data Structures and Algorithms Course

Elective, Innopolis University, 2015

Supplementary course on Data Structures and Algorithms with elements of ACM ICPC training: an overview of the main program, more challenging tasks, and contest organization.

NLP course

Technical Course, Skoltech, CDISE, 2019

TA for the NLP course: seminar preparation, homework grading, advising students.

Statistical & Neural NLP course

Technical Course, Skoltech, CDISE, 2021

TA for the NLP course: seminar preparation, homework grading, advising students, mentoring final projects.