Publications

Zubchuk, E., Menshikov, D., & Mikhaylovskiy, N. (2022, February). Using a Language Model in a Kiosk Recommender System at Fast-Food Restaurants.

Kiosks are a popular self-service option in many fast-food restaurants: they save time for visitors and labor for the fast-food chains. In this paper, we propose an effective design of a kiosk shopping cart recommender system that combines a language model as a vectorizer and a neural network-based classifier. The model performs better than other models in offline tests and exhibits performance comparable to the best models in A/B/C tests. ...

Danilovich, I., Moshkin, V., Reimche, A., Tevelevich, M., & Mikhaylovskiy, N. (2021, November). Video monitoring over anti-decubitus protocol execution with a deep neural network to prevent pressure ulcer. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) (pp. 1384-1387). IEEE.

Video monitoring of patient position in intensive care units is complicated by obstacles covering the patient's body. Conventional posture detection algorithms do not work in this case. Reformulating posture detection as an object detection/image classification problem and using recent deep learning techniques allowed us to achieve 94.5% accuracy on a pre-clinical test classifying 4 postures using imagery from an off-the-shelf camera and edge processing, which is a 60% improvement over the result previously known in the literature. This in turn allowed us to build a system that is ready for clinical trials and based on inexpensive off-the-shelf cameras. Clinical Relevance: A cheap and practical system for automatic video monitoring of bedridden patients helps minimize the risk of pressure ulcers in the ICU. ...

Zubchuk, E., Menshikov, D., & Mikhaylovskiy, N. (2021, September). Efficiency of short text classifiers for payment classification. In 2021 International Conference on Information Technology and Nanotechnology (ITNT) (pp. 1-4). IEEE.

Traditionally, the Central Bank of Russia used regular expressions for payment classification as part of its supervisory activities. These regular expressions often spanned multiple pages to cover the varied relevant keywords and their forms. We compare this approach to two modern short text classification approaches, fastText and a BERT-based transformer, in terms of speed, accuracy and flexibility, including few-shot learning. ...

Bedyakin, R., & Mikhaylovskiy, N. (2021, June). Language ID Prediction from Speech Using Self-Attentive Pooling. In Proceedings of the Third Workshop on Computational Typology and Multilingual NLP (pp. 130-135).

This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, creating a need for domain- and speaker-invariant language ID systems. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results for the language identification task. ...

Bedyakin, R., & Mikhaylovskiy, N. (2021, June). Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions. In Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference “Dialogue 2021”, Moscow.

This memo describes the NTR/TSU winning submission for the Low Resource ASR challenge at the Dialog2021 conference, language identification track. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results in a low-resource setting for the language identification task and sets a new state of the art for the Low Resource ASR challenge dataset. Additionally, we compare the structure of the confusion matrices for this dataset and the significantly more diverse VoxForge dataset. We state and substantiate the hypothesis that whenever the dataset is diverse enough that other classification factors, such as gender and age, are well averaged, the confusion matrix of an LID system reflects a measure of language similarity. ...

Kolobov, R., Okhapkina, O., Omelchishina, O., Platunov, A., Bedyakin, R., Moshkin, V., Menshikov, D., & Mikhaylovskiy, N. (2021). MediaSpeech: Multilanguage ASR Benchmark and Dataset. arXiv preprint arXiv:2103.16193.

The performance of automated speech recognition (ASR) systems is well known to differ across application domains. At the same time, vendors and research groups typically report ASR quality results either for limited-use, simplistic domains (audiobooks, TED talks) or for proprietary datasets. To fill this gap, we provide an open-source 10-hour ASR system evaluation dataset, NTR MediaSpeech, for 4 languages: Spanish, French, Turkish and Arabic. The dataset was collected from the official YouTube channels of media in the respective languages and manually transcribed. We estimate that the WER of the dataset is under 5%. We have benchmarked many ASR systems available both commercially and freely, and provide the benchmark results. We also open-source baseline QuartzNet models for each language. ...