Arts, Languages and Philosophy Faculty Research & Creative Works

Russian National Corpus Web Scraping Project 2019-2020

Abstract

Many websites, particularly those that serve content from a content management system or database, deliver that content as HTML with an underlying computer-generated structure, which is then visually formatted and styled using Cascading Style Sheets (CSS) and JavaScript.

Additionally, when a website uses a web form to run a query and return results, the web server or application server reads the requested web address and parses it for parameters. Those parameters are passed to the database behind the website and control which results are returned.
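The parameter parsing described above can be illustrated with a minimal Python sketch. It uses only the standard library. The URL, the parameter names `req` and `p`, and the function name `extract_query_parameters` are hypothetical and chosen for illustration; they are not taken from the project or from the Russian National Corpus site.

```python
from urllib.parse import parse_qs, urlsplit


def extract_query_parameters(url):
    """Return the query-string parameters of a URL as a dict of lists.

    This mirrors, in a simplified way, what a web or application server
    does before handing the values to the database layer.
    """
    query_string = urlsplit(url).query
    return parse_qs(query_string)


if __name__ == "__main__":
    # Hypothetical search URL; "req" and "p" are assumed parameter names.
    example_url = "https://example.org/search?req=kniga&p=2"
    print(extract_query_parameters(example_url))
```

Each value is returned as a list because a query string may repeat the same parameter name more than once.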

Techniques exist that exploit these two facts to extract large amounts of data from a website's backend database through a series of crafted web page requests.

Collectively, these techniques are called web scraping.

Two key web scraping techniques are URL hacking and HTML parsing.
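As a rough illustration of how the two techniques fit together, the following Python sketch first builds a series of crafted URLs by varying a page parameter (URL hacking) and then pulls the text out of marked elements in a returned page (HTML parsing). It is a sketch under stated assumptions, not the project's actual code: the base URL, the parameter names `req` and `p`, the CSS class `hit`, and the names `build_page_urls` and `ResultTextParser` are all hypothetical. No network request is made; the HTML is a hard-coded sample.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_page_urls(base_url, query, page_count):
    """URL hacking: craft one URL per result page by varying the page parameter."""
    return [
        f"{base_url}?{urlencode({'req': query, 'p': page})}"
        for page in range(page_count)
    ]


class ResultTextParser(HTMLParser):
    """HTML parsing: collect the text of every element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.open_tag = None  # tag name of the element currently being captured
        self.depth = 0        # nesting depth of that tag while capturing
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already capturing: only count nested tags of the same name.
            if tag == self.open_tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.target_class in classes:
            self.open_tag = tag
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag == self.open_tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


if __name__ == "__main__":
    # Hypothetical base URL and query term.
    for url in build_page_urls("https://example.org/search", "kniga", 2):
        print(url)

    # Hard-coded sample standing in for a page returned by the server.
    sample_html = (
        '<ul><li class="hit">first <b>result</b></li>'
        '<li class="other">skip</li><li class="hit">second</li></ul>'
    )
    parser = ResultTextParser("hit")
    parser.feed(sample_html)
    print(parser.results)
```

In practice each crafted URL would be fetched in turn and its response fed to the parser, subject to the target site's terms of use and request-rate limits.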

Recommended Citation

Koob, P., Ivliyeva, Russian National Corpus Web Scraping Project 2019-2020. Missouri S&T, IT and ALP departments. [Electronic resource].

Department(s)

Arts, Languages, and Philosophy

Comments

Document Type

Technical Report

Document Version

Final Version

File Type

text

Language(s)

English

Language 2

Russian

Rights

Publication Date

01 Feb 2020

Russian_National_Corpus_Web_Scraping_Project.pptx (1630 kB)
PowerPoint presentation

Included in

Russian Linguistics Commons
