This memo describes the NTR/TSU winning submission for the Low Resource ASR challenge at the Dialog2021 conference, language identification track. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer gives promising results in a low-resource setting for the language identification task and sets the state of the art (SOTA) for the Low Resource ASR challenge dataset. Additionally, we compare the structure of the confusion matrices for this dataset and the significantly more diverse VoxForge dataset. We state and substantiate the hypothesis that whenever the dataset is diverse enough for the other classification factors, such as gender and age, to be well averaged, the confusion matrix of an LID system reflects a measure of language similarity. ...
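To make the Self-Attentive Pooling idea concrete, here is a minimal sketch of one common formulation: each frame vector gets a scalar score from a small tanh layer, the scores are normalized with a softmax over time, and the utterance vector is the weighted sum of the frames. The parameter names `W`, `b`, and `v` and the exact scoring function are assumptions for illustration; the abstract does not specify the layer used in the submission.

```python
import math


def self_attentive_pooling(frames, W, b, v):
    """Pool T frame vectors (each of size D) into one vector of size D.

    frames: list of T lists of D floats (e.g. CNN outputs per time step)
    W: A x D projection matrix (list of rows), b: bias of size A,
    v: scoring vector of size A. All three are hypothetical parameters.
    """
    scores = []
    for h in frames:
        # Hidden representation u = tanh(W h + b)
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bias)
            for row, bias in zip(W, b)
        ]
        # Scalar attention score e = v . u
        scores.append(sum(vi * ui for vi, ui in zip(v, hidden)))

    # Numerically stable softmax over the time axis
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frame vectors
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights
```

With a zero scoring vector all frames receive equal weight, so the layer reduces to plain average pooling; training `v` and `W` lets the network emphasize the frames that are most informative for the language decision.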
The performance of automated speech recognition (ASR) systems is well known to differ across application domains. At the same time, vendors and research groups typically report ASR quality results either for limited-use, simplistic domains (audiobooks, TED talks) or for proprietary datasets. To fill this gap, we provide NTR MediaSpeech, an open-source 10-hour ASR system evaluation dataset for 4 languages: Spanish, French, Turkish and Arabic. The dataset was collected from the official YouTube channels of media outlets in the respective languages and manually transcribed. We estimate that the word error rate (WER) of the dataset is under 5%. We have benchmarked many ASR systems, both commercially and freely available, and provide the benchmark results. We also open-source baseline QuartzNet models for each language. ...
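For reference, WER is conventionally computed as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below follows that standard definition; it assumes whitespace tokenization with no text normalization, which may differ from the scoring used in the benchmark.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp (one-row Levenshtein dynamic programme).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.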
In the past few years, triplet loss-based metric embeddings have become a de facto standard for several important computer vision problems, most notably person re-identification. In speech recognition, on the other hand, the metric embeddings generated by the triplet loss are rarely used, even for classification problems. We fill this gap by showing that a combination of two representation learning techniques, a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss, significantly (by 26% to 38%) improves the classification accuracy of convolutional networks on the LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic similarity-based triplet mining approach. We also improve the current best published SOTA (for small-footprint models) for Google Speech Commands dataset V2 10+2-class classification by about 16%, achieving 98.37% accuracy, and the current best published SOTA for 35-class classification on Google Speech Commands dataset V2 by 47%, achieving 97.0% accuracy. ...
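The two building blocks named above can be sketched in a few lines. This is the textbook form of the triplet loss and a plain majority-vote kNN over embeddings, not the authors' implementation: the Euclidean distance, the margin value, and `k` are assumptions, and the phonetic similarity-based triplet mining and the kNN variant from the abstract are not reproduced here.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet loss: push the negative at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def knn_predict(query, embeddings, labels, k=3):
    """Classify `query` by majority vote among its k nearest embeddings."""
    nearest = sorted(zip(embeddings, labels), key=lambda pair: euclidean(query, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings for two hypothetical keyword classes.
train_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_labels = ["yes", "yes", "no", "no", "no"]
prediction = knn_predict([0.2, 0.3], train_embeddings, train_labels, k=3)
```

The loss is zero once a triplet already satisfies the margin, which is why the choice of triplets during training (the mining strategy) matters so much in practice.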
Creating Strong AI means developing artificial intelligence to the point where the machine's intellectual capability is, in a way, equal to a human's. Science is definitely one of the summits of human intelligence, the other being art. Scientific research consists in creating hypotheses, which are limited-applicability models (methods) implying lossy information compression. In this article, we show that this paradigm is not unique to science and is common to the most developed areas of human activity, such as business and engineering. Thus, we argue, a Strong AI should be capable of building such models. Still, the known tests for human-level AI do not address this consideration. Based on the above, we suggest a series of six tests of rising complexity to check whether an AI has achieved human-level intelligence. ...
ECG classification is increasingly important because the problem arises in many current medical applications. Currently, there are many machine learning (ML) solutions that can be used for analyzing and classifying ECG data. However, the main disadvantage of these ML approaches is their reliance on heuristic hand-crafted or engineered features with shallow feature-learning architectures. The risk is that the most appropriate features, those that would give high classification accuracy for the ECG problem, may not be found. One proposed solution is to use deep learning architectures in which the first layers of convolutional neurons act as feature extractors and several fully connected (FCN) layers at the end make the final decision about ECG classes. ...
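The division of labor described above, convolutional layers as feature extractors followed by a fully connected decision layer, can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the architecture from the paper: the kernels, the single dense layer, the global max pooling, and all weights are hypothetical, and in a real model they would be learned from data.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the operation deep learning
    frameworks call convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def forward(signal, kernels, dense_weights, dense_bias):
    """Extract one feature per kernel, then score each class."""
    # Feature extractor: convolution, ReLU, global max pooling over time.
    features = [max(relu(conv1d(signal, kernel))) for kernel in kernels]
    # Decision layer: one fully connected layer producing class logits.
    logits = [
        sum(w * f for w, f in zip(row, features)) + bias
        for row, bias in zip(dense_weights, dense_bias)
    ]
    return features, logits


# Hypothetical toy input: a single spike standing in for an ECG segment.
toy_signal = [0.0, 0.0, 1.0, 0.0, 0.0]
toy_kernels = [[1.0, -1.0], [1.0, 1.0]]
features, logits = forward(toy_signal, toy_kernels, [[1.0, 0.0], [0.0, 2.0]], [0.0, -1.0])
predicted_class = logits.index(max(logits))
```

The point of the design is that the kernels replace hand-crafted features: instead of choosing descriptors in advance, training adjusts the kernels so that the extracted features are the ones the decision layer needs.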
Project management issues are not considered here; instead, we focus on the last two problems, which come down to adequately estimating project cost. An adequate estimate of project cost is important for both the customer and the contractor. In this article, the author analyzes four main models for estimating the effort of information system development and proposes ways of using function point-type models in managing information system development projects and the contracts for their development. Business process optimization. Documentation, analysis, and management. ...