Schemas for Web Data: A Reverse Engineering Approach

Abstract

In this paper, we show how to generate schemas for a set of HTML or XML documents retrieved from the web, in the context of our web warehousing system called Whoweda (WareHouse Of WEb DAta). A web schema is bound to a web table, which contains a collection of interlinked web documents called web tuples. The schema specifies the metadata, content and structural properties (in the form of predicates) shared by the web documents and hyperlinks in the web table. It also summarizes the hyperlink structure of these documents using the notion of connectivities. Web schemas are generated in three stages. In the first stage, a simple or complex web schema is generated from the user's query (the coupling query). In the second stage, a complex web schema is decomposed into a set of simple web schemas. These two stages are performed without inspecting the data instances, i.e., the web tuples. In the final stage, the set of simple web schemas is pruned by inspecting the hyperlink structure of the web tuples. We also discuss the formal algorithm for generating a set of simple web schemas from a coupling query. © 2001 Elsevier Science B.V. All rights reserved.
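The second and third stages described above can be pictured with a toy sketch. This is not the algorithm from the paper, whose data model is not given in this record. It assumes, purely for illustration, that a connectivity between two node variables allows a bounded range of hyperlink hops, that a simple schema fixes one hop count per connectivity, and that a web tuple can be summarized as a set of (source, target, hops) triples. The names `Connectivity`, `decompose` and `prune` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Connectivity:
    """Hypothetical model: `source` reaches `target` in min_links..max_links hops."""
    source: str
    target: str
    min_links: int
    max_links: int


def decompose(complex_schema):
    """Stage 2 (assumed form): expand every bounded connectivity into its
    fixed-length alternatives and combine them, one alternative per
    connectivity. No web tuples are inspected here."""
    choices = [
        [(c.source, c.target, n) for n in range(c.min_links, c.max_links + 1)]
        for c in complex_schema
    ]
    return [frozenset(combo) for combo in product(*choices)]


def prune(simple_schemas, web_tuples):
    """Stage 3 (assumed form): keep only the simple schemas whose link
    structure is exhibited by at least one web tuple."""
    observed = {frozenset(t) for t in web_tuples}
    return [s for s in simple_schemas if s in observed]


# Toy data: x reaches y in one or two hops, y reaches z in exactly one hop.
complex_schema = [Connectivity("x", "y", 1, 2), Connectivity("y", "z", 1, 1)]
simple_schemas = decompose(complex_schema)

# A single web tuple whose documents are linked x -> (2 hops) -> y -> z.
web_tuples = [{("x", "y", 2), ("y", "z", 1)}]
pruned = prune(simple_schemas, web_tuples)
```

With this toy input, decomposition produces two candidate simple schemas and pruning keeps only the one that matches the observed link structure. The first stage, deriving the initial schema from a coupling query, is left out because the record gives no detail on the query language.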

Recommended Citation

S. S. Bhowmick et al., "Schemas for Web Data: A Reverse Engineering Approach," Data and Knowledge Engineering, vol. 39, no. 2, pp. 105-142, Elsevier, Nov 2001.

The definitive version is available at https://doi.org/10.1016/S0169-023X(01)00036-2

Department(s)

Computer Science

Keywords and Phrases

Coupling query; Web schemas; Web table; Web tuples; Web warehouse

International Standard Serial Number (ISSN)

0169-023X

Document Type

Article - Journal

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Nov 2001

Computer Science Faculty Research & Creative Works
