We study the performance of BERT-like distributional semantic language models on anaphora resolution and related
tasks, with the aim of selecting a model for on-device inference. We have found that lean (narrow and deep)
language models provide the best balance of speed and quality on word-level tasks, and we open-source1 the
RuLUKE-tiny and RuLUKE-slim models we have trained. Both are significantly (over 27%) faster than models of
comparable accuracy. We hypothesise that model depth may play a critical role in this performance, as recent
findings suggest that each layer behaves as a gradient descent step in the autoregressive setting.
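As a schematic illustration of this hypothesis (our notation, not an equation from the cited findings): if each of the $L$ layers implements one implicit gradient descent step on some internal objective $\mathcal{L}$, the hidden state $h^{(\ell)}$ evolves as

    h^{(\ell+1)} = h^{(\ell)} - \eta \, \nabla_{h} \mathcal{L}\bigl(h^{(\ell)}\bigr), \qquad \ell = 0, \dots, L-1,

so a deeper model performs more implicit optimisation steps per token, whereas added width does not increase the number of steps.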