Automatic hidden-web table interpretation, conceptualization, and semantic annotation

Cui Tao, David W. Embley

Research output: Contribution to journalArticle

29 Citations (Scopus)

Abstract

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains-car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queriable data. Thus, the entire process, which is fully automatic, transform facts embedded within tables into facts accessible by standard query engines.

Original languageEnglish (US)
Pages (from-to)683-703
Number of pages21
JournalData and Knowledge Engineering
Volume68
Issue number7
DOIs
StatePublished - Jul 2009
Externally publishedYes

Fingerprint

Conceptualization
Semantic annotation
World Wide Web
Siblings
Data base
Interrelationship
Query
Semistructured data
Car
Data management
Information extraction

Keywords

  • Automatic semantic annotation
  • Automatic table interpretation
  • Ontology generation
  • Web of data

ASJC Scopus subject areas

  • Information Systems and Management

Cite this

Automatic hidden-web table interpretation, conceptualization, and semantic annotation. / Tao, Cui; Embley, David W.

In: Data and Knowledge Engineering, Vol. 68, No. 7, 07.2009, p. 683-703.

Research output: Contribution to journalArticle

@article{4928a2907dc5488e8c20e20c7a0d2068,
title = "Automatic hidden-web table interpretation, conceptualization, and semantic annotation",
abstract = "The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains-car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5{\%}. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queriable data. Thus, the entire process, which is fully automatic, transform facts embedded within tables into facts accessible by standard query engines.",
keywords = "Automatic semantic annotation, Automatic table interpretation, Ontology generation, Web of data",
author = "Cui Tao and Embley, {David W.}",
year = "2009",
month = "7",
doi = "10.1016/j.datak.2009.02.010",
language = "English (US)",
volume = "68",
pages = "683--703",
journal = "Data and Knowledge Engineering",
issn = "0169-023X",
publisher = "Elsevier",
number = "7",

}

TY - JOUR

T1 - Automatic hidden-web table interpretation, conceptualization, and semantic annotation

AU - Tao, Cui

AU - Embley, David W.

PY - 2009/7

Y1 - 2009/7

N2 - The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains-car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queriable data. Thus, the entire process, which is fully automatic, transform facts embedded within tables into facts accessible by standard query engines.

AB - The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains-car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queriable data. Thus, the entire process, which is fully automatic, transform facts embedded within tables into facts accessible by standard query engines.

KW - Automatic semantic annotation

KW - Automatic table interpretation

KW - Ontology generation

KW - Web of data

UR - http://www.scopus.com/inward/record.url?scp=67349276460&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67349276460&partnerID=8YFLogxK

U2 - 10.1016/j.datak.2009.02.010

DO - 10.1016/j.datak.2009.02.010

M3 - Article

VL - 68

SP - 683

EP - 703

JO - Data and Knowledge Engineering

JF - Data and Knowledge Engineering

SN - 0169-023X

IS - 7

ER -