Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources

Doina Caragea, Jun Zhang, Jie Bao, Jyotishman Pathak, Vasant Honavar

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

13 Citations (Scopus)

Abstract

Development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics (among others). However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles in acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees (relative to their centralized or batch counterparts). It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user's point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
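The abstract does not say how the performance guarantee relative to a centralized learner is obtained. One common way such a guarantee can arise, sketched here purely as an illustrative assumption, is to have each source report only summary counts and to combine them before fitting the classifier. The example uses Naive Bayes counts over horizontally partitioned rows; the function names, the toy data, and the choice of Naive Bayes are all hypothetical and are not taken from the paper or from INDUS.

```python
from collections import Counter


def local_counts(rows):
    """Count statistics for Naive Bayes, computed at a single data source.

    rows: list of (features_tuple, label) pairs.
    Only these counts, not the raw rows, would need to leave the source.
    """
    class_counts = Counter()
    feature_counts = Counter()  # keyed by (feature_index, value, label)
    for features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts


def merge_counts(stats):
    """Add up the per-source counts into one global pair of counters."""
    total_class, total_feature = Counter(), Counter()
    for class_counts, feature_counts in stats:
        total_class.update(class_counts)      # Counter.update adds counts
        total_feature.update(feature_counts)
    return total_class, total_feature


# Two hypothetical sources holding disjoint sets of rows (toy data).
source_a = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
source_b = [
    (("sunny", "mild"), "yes"),
    (("rainy", "hot"), "no"),
    (("sunny", "hot"), "no"),
]

distributed = merge_counts([local_counts(source_a), local_counts(source_b)])
centralized = local_counts(source_a + source_b)

# The merged counts match the counts computed on the pooled data, so a
# Naive Bayes model built from either would be identical.
print(distributed == centralized)
```

Because the counts are additive across sources, the distributed result in this sketch is exactly the one a centralized learner would compute; classifiers whose statistics do not decompose this simply would need a different strategy.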

Original language: English (US)
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 13-44
Number of pages: 32
Volume: 3734 LNAI
DOI: https://doi.org/10.1007/11564089_5
State: Published - 2005
Externally published: Yes
Event: 16th International Conference on Algorithmic Learning Theory, ALT 2005 - Singapore, Singapore
Duration: Oct 8 2005 to Oct 11 2005

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3734 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 16th International Conference on Algorithmic Learning Theory, ALT 2005
Country: Singapore
City: Singapore
Period: 10/8/05 to 10/11/05

Fingerprint

Information Storage and Retrieval
Software
Semantics
Informatics
Knowledge acquisition
Bioinformatics
Software packages
Ontology
Data acquisition
Classifiers
Decision making
Throughput
Communication
Industry
Computational Biology
Attribute
Learning
Technology
Performance Guarantee

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Caragea, D., Zhang, J., Bao, J., Pathak, J., & Honavar, V. (2005). Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3734 LNAI, pp. 13-44). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3734 LNAI). https://doi.org/10.1007/11564089_5

@inproceedings{234d9582f50a4d20a7a2dc0269e07457,
title = "Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources",
author = "Doina Caragea and Jun Zhang and Jie Bao and Jyotishman Pathak and Vasant Honavar",
year = "2005",
doi = "10.1007/11564089_5",
language = "English (US)",
isbn = "354029242X",
volume = "3734 LNAI",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "13--44",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN
T1 - Algorithms and software for collaborative discovery from autonomous, semantically heterogeneous, distributed information sources
AU - Caragea, Doina
AU - Zhang, Jun
AU - Bao, Jie
AU - Pathak, Jyotishman
AU - Honavar, Vasant
PY - 2005
Y1 - 2005
UR - http://www.scopus.com/inward/record.url?scp=33646515517&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33646515517&partnerID=8YFLogxK
U2 - 10.1007/11564089_5
DO - 10.1007/11564089_5
M3 - Conference contribution
SN - 354029242X
SN - 9783540292425
VL - 3734 LNAI
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 13
EP - 44
BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ER -