TY - GEN
T1 - Automatic hidden-web table interpretation by sibling page comparison
AU - Tao, Cui
AU - Embley, David W.
PY - 2007
Y1 - 2007
N2 - The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
AB - The longstanding problem of automatic table interpretation still eludes us. Its solution would not only be an aid to table processing applications such as large-volume table conversion, but would also be an aid in solving related problems such as information extraction and semi-structured data management. In this paper, we offer a conceptual modeling solution for the common special case in which so-called sibling pages are available. The sibling pages we consider are pages on the hidden web, commonly generated from underlying databases. We compare them to identify and connect nonvarying components (category labels) and varying components (data values). We tested our solution using more than 2,000 tables in source pages from three different domains: car advertisements, molecular biology, and geopolitical information. Experimental results show that the system can successfully identify sibling tables, generate structure patterns, interpret tables using the generated patterns, and automatically adjust the structure patterns, if necessary, as it processes a sequence of hidden-web pages. For these activities, the system was able to achieve an overall F-measure of 94.5%.
UR - http://www.scopus.com/inward/record.url?scp=38349008590&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=38349008590&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-75563-0_38
DO - 10.1007/978-3-540-75563-0_38
M3 - Conference contribution
AN - SCOPUS:38349008590
SN - 9783540755623
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 566
EP - 581
BT - Conceptual Modeling - ER 2007 - 26th International Conference on Conceptual Modeling, Proceedings
PB - Springer Verlag
T2 - 26th International Conference on Conceptual Modeling, ER 2007
Y2 - 5 November 2007 through 9 November 2007
ER -