Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval

Abstract

Software artifact retrieval aims to locate software artifacts, such as a piece of source code, in a large code repository. The problem has traditionally been addressed through textual queries: information retrieval techniques exploit the textual similarity between a query and the textual representation of a software artifact, which is generated by collecting words from the comments, identifiers, and descriptions of programs. However, in addition to this semantic information, rich information is embedded in the source code itself. Source code, if analyzed properly, can be a rich source for enhancing software artifact retrieval. To this end, in this paper, we develop a feature extraction method for source code. Specifically, this method captures both the inherent information in the source code and the semantic information hidden in its comments, descriptions, and identifiers. Moreover, we design a heterogeneous metric learning approach that integrates code features and text features into the same latent semantic space. This, in turn, helps to measure artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method improves the performance of software artifact retrieval by a significant margin.

Recommended Citation

L. Wu et al., "Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval," Proceedings of the 2014 IEEE International Conference on Data Mining (2014, Shenzhen, China), pp. 610 - 619, Institute of Electrical and Electronics Engineers (IEEE), Dec 2014.

The definitive version is available at https://doi.org/10.1109/ICDM.2014.147

Meeting Name

2014 IEEE International Conference on Data Mining, ICDM 2014 (2014: Dec. 14-17, Shenzhen, China)

Department(s)

Computer Science

Comments

We are grateful for the support of the National Natural Science Foundation of China (91224006), the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06010202 and XDA05050601), the “12th Five Year” Plan for Science & Technology Support (2012BAK17B01 and 2013BAD15B02), the joint project by Foshan and the Chinese Academy of Sciences under Grant No. 2012YS23, and the China National 973 Program (2014CB340301).

Keywords and Phrases

Computer programming languages; Data mining; Feature extraction; Semantics; Content-based; Feature extraction methods; Latent semantics; Metric learning; Semantic information; Software artifacts; Textual representation; Textual similarities; Codes (symbols)

International Standard Serial Number (ISSN)

1550-4786; 2374-8486

Document Type

Article - Conference proceedings

Document Version

Citation

File Type

text

Language(s)

English

Rights

Publication Date

01 Dec 2014

Computer Science Faculty Research & Creative Works
