Termhood-based Comparability Metrics of Comparable Corpus in Special Domain

Liu, Sa; Zhang, Chengzhi

doi:10.1007/978-3-642-36337-5_15

arXiv:1302.4489 (cs)

[Submitted on 19 Feb 2013]

Abstract: Resources for cross-language information retrieval (CLIR) and machine translation (MT), such as dictionaries and parallel corpora, are scarce and hard to obtain for special domains. Moreover, these resources are limited to a few languages, such as English, French, and Spanish. Automatically obtaining comparable corpora for such domains could be an effective answer to this problem. Comparable corpora, whose subcorpora are not translations of each other, can be easily obtained from the web. Therefore, building and using comparable corpora is often a more feasible option in multilingual information processing. Comparability metrics are one of the key issues in building and using comparable corpora. Currently, there is no widely accepted definition of corpus comparability or method for measuring it. In fact, different definitions or measurement methods might be given to suit different natural language processing tasks. This paper proposes a new comparability metric, the termhood-based metric, oriented to the task of bilingual terminology extraction. In this method, words are ranked by termhood rather than frequency, and the cosine similarity calculated from the ranked lists of word termhood is used as the comparability. Experimental results show that the termhood-based metric performs better than traditional frequency-based metrics.
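The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the termhood formula, so the scores are treated here as precomputed inputs, and both corpora are assumed to have already been mapped into a shared vocabulary (for example through a bilingual dictionary). The names `top_terms`, `cosine_comparability`, and the cutoff `n` are hypothetical.

```python
import math


def top_terms(scores, n):
    """Keep the n words with the highest termhood score (word -> score)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


def cosine_comparability(scores_a, scores_b, n=100):
    """Cosine similarity between the top-n termhood lists of two corpora.

    Assumption: scores_a and scores_b map words in a shared vocabulary
    to precomputed termhood scores; how those scores are obtained is
    outside this sketch.
    """
    a = top_terms(scores_a, n)
    b = top_terms(scores_b, n)
    # Only words present in both ranked lists contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or all-zero list is treated as not comparable
    return dot / (norm_a * norm_b)
```

Replacing the termhood scores with raw word frequencies in the same function would give a frequency-based baseline of the kind the abstract compares against, under the same assumptions.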

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1302.4489 [cs.CL]
	(or arXiv:1302.4489v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1302.4489
Journal reference:	Lecture Notes in Computer Science Volume 7717, 2013, pp 134-144
Related DOI:	https://doi.org/10.1007/978-3-642-36337-5_15

Submission history

From: Chengzhi Zhang
[v1] Tue, 19 Feb 2013 00:30:57 UTC (171 KB)
