Automatic hidden-web table interpretation, conceptualization, and semantic annotation

Cui Tao; David W. Embley

doi:10.1016/j.datak.2009.02.010

Biomedical Statistics and Informatics

Research output: Contribution to journal › Article › peer-review

29 Scopus citations

Abstract

The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction, semantic annotation, and semi-structured data management. In this paper, we offer a solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. Our system compares them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%. Further, given that we can automatically interpret tables, we next show that this leads immediately to a conceptualization of the data in these interpreted tables and thus also to a way to semantically annotate these interpreted tables with respect to the ontological conceptualization. Labels in nested table structures yield ontological concepts and interrelationships among these concepts, and associated data values become annotated information. We further show that semantically annotated data leads immediately to queryable data. Thus, the entire process, which is fully automatic, transforms facts embedded within tables into facts accessible by standard query engines.
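The sibling-page idea in the abstract can be illustrated with a toy sketch. This is not the authors' system or its structure-pattern machinery; it only assumes that two sibling tables have already been parsed into grids of cell text with identical shape, and it treats a cell as a category label when both siblings hold the same text and as a data value otherwise. The function names (`classify_cells`, `interpret`) and the sample car-advertisement rows are invented for illustration.

```python
from typing import Dict, List

Table = List[List[str]]  # rows of cell text; hypothetical pre-parsed form


def classify_cells(a: Table, b: Table) -> List[List[str]]:
    """Mark each cell position 'label' if both siblings agree, else 'value'."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("this sketch assumes sibling tables of identical shape")
    return [
        ["label" if ca.strip() == cb.strip() else "value" for ca, cb in zip(ra, rb)]
        for ra, rb in zip(a, b)
    ]


def interpret(table: Table, kinds: List[List[str]]) -> Dict[str, str]:
    """Pair each value cell with the closest label to its left in the same row."""
    facts: Dict[str, str] = {}
    for row, row_kinds in zip(table, kinds):
        current_label = None
        for cell, kind in zip(row, row_kinds):
            if kind == "label":
                current_label = cell.strip()
            elif current_label is not None:
                facts[current_label] = cell.strip()
    return facts


# Invented sample data standing in for two sibling pages.
page_a = [["Make", "Honda"], ["Year", "2004"], ["Price", "$8,500"]]
page_b = [["Make", "Toyota"], ["Year", "2001"], ["Price", "$6,200"]]

kinds = classify_cells(page_a, page_b)
facts_a = interpret(page_a, kinds)
facts_b = interpret(page_b, kinds)
```

A known weakness of this simplification is that a data value that happens to coincide across both siblings would be misclassified as a label; comparing more than two sibling pages reduces that risk.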

Original language	English (US)
Pages (from-to)	683-703
Number of pages	21
Journal	Data and Knowledge Engineering
Volume	68
Issue number	7
DOIs	https://doi.org/10.1016/j.datak.2009.02.010
State	Published - Jul 2009

Keywords

Automatic semantic annotation
Automatic table interpretation
Ontology generation
Web of data

ASJC Scopus subject areas

Information Systems and Management

Access to Document

10.1016/j.datak.2009.02.010

Cite this

@article{4928a2907dc5488e8c20e20c7a0d2068,

title = "Automatic hidden-web table interpretation, conceptualization, and semantic annotation",

keywords = "Automatic semantic annotation, Automatic table interpretation, Ontology generation, Web of data",

author = "Cui Tao and Embley, {David W.}",

note = "Funding Information: Dr. David W. Embley received a B.A. in Mathematics (1970) and an M.S. in Computer Science (1972), both from the University of Utah. In 1976 he earned his Ph.D. in Computer Science from the University of Illinois. From 1976 to 1982 he was a faculty member in the Department of Computer Science at the University of Nebraska, where he was tenured in 1982. Since then he has been a faculty member in the Department of Computer Science at Brigham Young University. He teaches graduate and undergraduate classes in database systems and theory, discrete mathematics, and extraction and integration of web data. He is co-director of the Data Extraction research group and has been co-director of the Object-oriented Systems Modeling (OSM) research group. He has published widely and has made numerous presentations at national and international conferences. His research is supported in part by the National Science Foundation. He is the author of “Object Database Development: Concepts and Principles,” Addison-Wesley, Reading, Massachusetts, 1998, and a coauthor of “Object-oriented Systems Analysis: A Model-driven Approach,” Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1992. He is a member of the steering committee for the International Conferences on Conceptual Modeling (the ER Conferences), and has served as chair for the committee. He is serving or has served in various other capacities, including as an editorial board member, PC chair, PC member, and workshop coordinator. Funding Information: Dr. Cui Tao received a B.S. degree in 1997 from Beijing Normal University, where she majored in Biology and minored in Computer Science. She recently received her Ph.D. in Computer Science from Brigham Young University. In 2009, Dr. Tao joined Mayo Clinic College of Medicine, Division of Biomedical Statistics and Informatics. Her research focuses on ontology generation, conceptual modeling, and information extraction over the biomedical domain. 
She is also interested in the Semantic Web and its application on biomedical and clinical data. Her research is supported partially by NIH and NCI. She is serving and has served as a PC member for various international conferences and workshops. Funding Information: This work is supported in part by the National Science Foundation under Grant #0414644. We would like to thank Stephen W. Liddle, Yihong Ding, and Andrew Zitzelberger for their help and efforts in implementing the query system for TISP++. ",

year = "2009",

month = jul,

doi = "10.1016/j.datak.2009.02.010",

language = "English (US)",

volume = "68",

pages = "683--703",

journal = "Data and Knowledge Engineering",

issn = "0169-023X",

publisher = "Elsevier",

number = "7",

}

TY - JOUR

T1 - Automatic hidden-web table interpretation, conceptualization, and semantic annotation

AU - Tao, Cui

AU - Embley, David W.

PY - 2009/7

Y1 - 2009/7

KW - Automatic semantic annotation

KW - Automatic table interpretation

KW - Ontology generation

KW - Web of data

UR - http://www.scopus.com/inward/record.url?scp=67349276460&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=67349276460&partnerID=8YFLogxK

U2 - 10.1016/j.datak.2009.02.010

DO - 10.1016/j.datak.2009.02.010

M3 - Article

AN - SCOPUS:67349276460

SN - 0169-023X

VL - 68

SP - 683

EP - 703

JO - Data and Knowledge Engineering

JF - Data and Knowledge Engineering

IS - 7

ER -
