Meeting minutes are short texts summarizing the most important outcomes of a meeting. The goal of this work is to develop a module for automatic generation of meeting minutes from a meeting transcript produced by an Automated Speech Recognition (ASR) system. We consider minuting as a supervised machine learning task on pairs of texts: the transcript of the meeting and its minutes. No Russian minuting dataset was previously available. To fill this gap, we present DumSum, a dataset of meeting transcripts of the Russian State Duma and City Dumas, complete with minutes. We use a two-stage minuting pipeline and introduce semantic segmentation, which improves the ROUGE and BERTScore metrics of minutes on City Duma meetings by 1%-10% compared to naive segmentation. ...
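The abstract above describes a two-stage "segment, then summarize" pipeline. Below is a minimal sketch of how such a pipeline can be wired together, assuming a sentence encoder with an `encode` method (e.g. a sentence-transformers model) and an arbitrary `summarizer` callable; the similarity-threshold segmentation and all function names are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def semantic_segments(sentences, encoder, threshold=0.5):
    """Start a new segment wherever the cosine similarity between
    consecutive sentence embeddings falls below a threshold."""
    emb = np.asarray(encoder.encode(sentences), dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)      # normalize so dot product = cosine
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if float(np.dot(emb[i - 1], emb[i])) < threshold:   # topic shift detected
            segments.append(current)
            current = []
        current.append(sentences[i])
    segments.append(current)
    return segments

def minutes(transcript_sentences, encoder, summarizer):
    # Stage 1: split the ASR transcript into topically coherent segments.
    # Stage 2: summarize each segment and concatenate the per-segment minutes.
    return "\n".join(summarizer(" ".join(seg))
                     for seg in semantic_segments(transcript_sentences, encoder))
```

Naive segmentation in this setting would simply cut the transcript into fixed-length chunks; the semantic variant replaces those arbitrary cuts with boundaries at topic shifts.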
We show that the laws of autocorrelation decay in texts are closely related to the applicability limits of language models. Using distributional semantics, we empirically demonstrate that autocorrelations of words in texts decay according to a power law. We show that distributional semantics yields consistent autocorrelation decay exponents for texts translated into multiple languages. The autocorrelation decay in generated texts is quantitatively, and often qualitatively, different from that in literary texts. We conclude that language models exhibiting Markov behavior, including large autoregressive language models, may have limitations when applied to long texts, whether for analysis or generation. ...
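A minimal sketch of the kind of measurement described above, assuming pretrained word embeddings aligned with the tokens of a text; the autocorrelation estimator (mean cosine similarity at a given lag) and the log-log fit are illustrative, not the authors' exact procedure.

```python
import numpy as np

def autocorrelation_by_lag(vectors, max_lag=1000):
    """Mean cosine similarity between word vectors separated by a given lag.

    `vectors` is an (n_words, dim) array of L2-normalized embeddings for the
    words of the text, in their order of appearance.
    """
    acf = []
    for lag in range(1, max_lag + 1):
        sims = np.sum(vectors[:-lag] * vectors[lag:], axis=1)  # cosine, since normalized
        acf.append(sims.mean())
    return np.array(acf)

def fit_power_law(acf):
    """Estimate the exponent alpha in acf(lag) ~ lag**(-alpha)
    by a least-squares fit in log-log coordinates."""
    lags = np.arange(1, len(acf) + 1)
    mask = acf > 0                                   # log requires positive values
    slope, _ = np.polyfit(np.log(lags[mask]), np.log(acf[mask]), 1)
    return -slope                                    # alpha > 0 means decaying correlations
```

A power law appears as a straight line on a log-log plot, whereas a Markov process produces exponential decay, which bends sharply downward on the same axes; this contrast is what makes the decay law diagnostic of model applicability.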
We study the performance of BERT-like distributional semantic language models on anaphora resolution and related tasks with the purpose of selecting a model for on-device inference. We have found that lean (narrow and deep) language models provide the best balance of speed and quality for word-level tasks, and we open-source the RuLUKE-tiny and RuLUKE-slim models we have trained. Both are significantly (over 27%) faster than models with comparable accuracy. We hypothesise that model depth may play a critical role for performance since, according to recent findings, each layer behaves as a gradient descent step in an autoregressive setting. ...
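The speed comparison above can be reproduced with a simple timing loop of the kind sketched below, assuming Hugging Face `transformers` and PyTorch; the model name is a placeholder and the benchmarking protocol is deliberately minimal, not the authors' evaluation setup.

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

def mean_latency_ms(model_name, texts, runs=20):
    """Average forward-pass latency (ms) of an encoder over a fixed batch of texts."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        model(**batch)                                # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(**batch)
    return (time.perf_counter() - start) / runs * 1000.0
```

Running the same loop over candidate checkpoints on the target device gives directly comparable latency figures to set against task accuracy.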
Coreference resolution is an important task in natural language processing, since it underpins such vital tasks as information retrieval, text summarization, question answering, sentiment analysis, and machine translation. In this paper, we present a study on the effectiveness of several approaches to coreference resolution, focusing on the RuCoCo dataset as well as the results of our participation in Dialogue Evaluation 2023. We explore ways to increase the dataset size by using pseudo-labelling and data translated from another language. Using these techniques, we managed to triple the size of the dataset, make it more diverse, and improve the performance of autoregressive structured prediction (ASP) on the coreference resolution task. This approach allowed us to achieve the best results on the RuCoCo private test, with an increase in F1-score of 1.8, precision of 0.5, and recall of 3.0 points over the second-best leaderboard score. Our results demonstrate the potential of the ASP model and the importance of utilizing diverse training data for coreference resolution. ...
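The pseudo-labelling step mentioned above can be pictured with the schematic loop below; `CorefModel`, its `predict` signature, and the confidence threshold are hypothetical placeholders, not the actual RuCoCo or ASP training code.

```python
def pseudo_label(model, unlabeled_texts, threshold=0.9):
    """Run a trained coreference model over unlabelled texts and keep only
    predictions whose cluster confidence scores exceed a threshold."""
    augmented = []
    for text in unlabeled_texts:
        clusters, scores = model.predict(text)        # hypothetical API
        confident = [c for c, s in zip(clusters, scores) if s >= threshold]
        if confident:
            augmented.append({"text": text, "clusters": confident})
    return augmented

# Typical usage: train on the gold RuCoCo data, pseudo-label additional corpora
# and data translated from another language, then retrain on the union of gold
# and pseudo-labelled examples.
```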
Objective: to develop a methodology for building a video dataset of reflex facial mimic activity in a group of transport security (TS) personnel, with an assessment of their psychophysiological state confirming the presence of a functional state of fatigue. ...
In this short note we explore what is needed for the unsupervised training of graph language models based on link grammars. First, we introduce the termination-tags formalism required to build a language model based on the link grammar formalism of Sleator and Temperley [21] and discuss the influence of context on the unsupervised learning of link grammars. Second, we propose a statistical link grammar formalism allowing for statistical language generation. Third, based on the above formalism, we show that the classical dissertation of Yuret [25] on the discovery of linguistic relations using lexical attraction ignores contextual properties of the language, and thus that an approach to unsupervised language learning relying on bigrams alone is flawed. This correlates well with the unimpressive results in unsupervised training of graph language models based on Yuret's bigram approach. ...
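For concreteness, the bigram "lexical attraction" statistic discussed above is essentially pointwise mutual information over adjacent word pairs; the minimal sketch below (an illustration, not Yuret's original implementation) makes it plain that such a criterion sees only the two linked words and no wider context.

```python
import math
from collections import Counter

def bigram_lexical_attraction(tokens):
    """PMI of adjacent word pairs: log2 p(w1, w2) / (p(w1) * p(w2))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_pairs
        p1, p2 = unigrams[w1] / n, unigrams[w2] / n
        pmi[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return pmi
```

Because the score for a pair is fixed once the corpus counts are fixed, it cannot change with the surrounding sentence, which is exactly the contextual blindness the note argues against.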