Research Data

Experimental multi-dimensional scaling of web-scraping results from the A.A Zalizniak Grammatical Dictionary and the Russian National Corpus. Creating a corpus fragment of all possible word-forms of modified Russian sound verbs using web-scraping methodology. Compilation of a summary table for the present tense, future tense, imperative, imperfective and perfective gerund forms.

Alternative Title

Экспериментальное многомерное шкалирование результатов веб-извлечения из Грамматического словаря А.А. Зализняка и базы данных Национального корпуса русского языка (НКРЯ). Создание фрагмента корпуса словоформ модифицированных глаголов звучания русского языка методом веб-извлечения. Составление сводной таблицы форм настоящего и будущего времени, повелительного наклонения, деепричастий несовершенного и совершенного вида.

Author

Ivliyeva, I.V.
Koob, Perry

Abstract

The emergence and development of electronic versions of dictionaries and corpus databases finally allow the researcher to do what was technically impossible on paper: to collect, compile, and analyze the entire index of all possible verbal forms across different ranges and scales. This project attempts to improve the method of web extraction as applied to the source material (the lexical-semantic group of Russian sound verbs, semantically modified at the word-formation level) and to summarize the search results as an interactive summary table. A novel four-position system for numbering the verbal forms has been introduced, and a subsequent experimental multi-dimensional scaling of the results was successfully carried out. The output not only takes into account all documented (attested) modifications of sound verbs from the A.A. Zalizniak Grammatical Dictionary and the Russian National Corpus, but also reveals lacunae and doublets and indicates new (potential) units. The results of the study may be useful for the development of various web applications for the search, collection, and visualization of linguistic material. The possibilities of combinatorial optimization in the processing of open and closed linguistic databases can be particularly important when extracting information from various digital lexicographic sources (in a single language or across multiple languages), from national linguistic corpora, and from digital text collections.

The emergence and development of electronic versions of Russian dictionaries and corpus databases allow the researcher to do what was previously technically infeasible with paper dictionaries: to collect, compile (merge the results), and analyze a comprehensive index of all possible verb forms across various ranges and scales.

This project attempts not only to refine the web-extraction method as applied to the source material (the lexical-semantic group of Russian sound verbs modified at the word-formation level), but also to consolidate the search results.

A new four-position system for numbering the base forms of the verb and their word-forms has been developed and introduced. The search results were systematically extracted and recorded as a multi-dimensional, interactive summary table.

The output not only documents all word-forms of sound verbs presented in the A.A. Zalizniak Grammatical Dictionary and the Russian National Corpus, but also identifies existing doublets and intralingual lacunae and points to new (potential) units.

The results of the study may be useful in developing various web applications for searching, collecting, and visualizing linguistic material. The possibilities of combinatorial optimization of open and closed databases may be especially significant when extracting information from digital lexicographic sources (in one or several languages), from national language corpora, and from electronic text collections.
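The abstract says the summary table reveals doublets (several competing forms in one grammatical slot) and lacunae (slots with no form). The sketch below illustrates one possible way such a check could be done; it is not the authors' implementation. The slot labels, the `summarize` function, and the toy records are all assumptions made for illustration, and the toy records are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical slot labels; the columns of the real summary table may differ.
SLOTS = ["fut_1sg", "imperative_sg", "gerund_pfv"]

# Toy input of (lemma, slot, form) triples, made up for illustration only.
RECORDS = [
    ("прозвенеть", "imperative_sg", "прозвени"),
    ("прозвенеть", "gerund_pfv", "прозвенев"),
    ("прозвенеть", "gerund_pfv", "прозвеневши"),
]


def summarize(records, lemmas, slots=SLOTS):
    """Group forms by (lemma, slot) and report doublets and empty slots."""
    table = defaultdict(set)
    for lemma, slot, form in records:
        table[(lemma, slot)].add(form)

    # A doublet: more than one distinct form recorded for the same slot.
    doublets = {
        key: sorted(forms) for key, forms in table.items() if len(forms) > 1
    }
    # A lacuna (in this toy sense): a slot with no recorded form at all.
    lacunae = [
        (lemma, slot)
        for lemma in lemmas
        for slot in slots
        if (lemma, slot) not in table
    ]
    return doublets, lacunae


doublets, lacunae = summarize(RECORDS, ["прозвенеть"])
print(doublets)
print(lacunae)
```

Here an empty slot only means that the toy input has no form for it. Deciding whether a gap in real data is a genuine lacuna or a collection artifact would require checking the sources themselves.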

Start Date

01 April 2022

End Date

31 May 2023

Recommended Citation

Ivliyeva, Irina and Koob, Perry, "Experimental multi-dimensional scaling of web-scraping results from the A.A Zalizniak Grammatical Dictionary and the Russian National Corpus. Creating a corpus fragment of all possible word-forms of modified Russian sound verbs using web-scraping methodology. Compilation of a summary table for the present tense, future tense, imperative, imperfective and perfective gerund forms." (2023). Research Data. 11.
https://scholarsmine.mst.edu/research_data/11

Contact Information

Dr. Irina V. Ivliyeva, ivliyeva@mst.edu
Professor of Russian, Arts, Languages, and Philosophy Department
Missouri University of Science and Technology

Perry B. Koob, koobp@mst.edu
Database Administrator/System Administrator
Academic Technology Support Team
Missouri S&T Information Technology

Department(s)

Arts, Languages, and Philosophy

Document Type

Data

Document Version

Final Version

File Format

text

Language(s)

Russian

Language 2

English

Publication Date

05 June 2023

File 2 - CВОДНАЯ ТАБЛИЦА Ivliyeva Koob. Verb-extended-complete-2023-05-18.xlsx (442 kB)
File 3 - Приложение 1. Ivliyeva Koob. Appendix 1. Words not_in_morfologija-2023-05-18.xlsx (15 kB)
File 4 - Приложение 2. Ivliyeva Koob. Appendix 2. Double_pronoun -dual- action -2023-05-18.xlsx (42 kB)
File 5 - Приложение 3. Ivliyeva Koob. Appendix 3. Double_perfective_gerund-2023-05-18.xlsx (21 kB)
File 6 - Приложение 4. Ivliyeva Koob. Appendix 4. Double_imperfective_gerund-2023-05-18.xlsx (9 kB)

Included in

Russian Linguistics Commons
