Volume 6 Number 8 (Aug. 2011)
JSW 2011 Vol.6(8): 1521-1528 ISSN: 1796-217X
doi: 10.4304/jsw.6.8.1521-1528

A Novel Weighted Phrase-Based Similarity for Web Documents Clustering

Ruilong Yang, Qingsheng Zhu, Yunni Xia

College of Computer Science, Chongqing University, Chongqing, China

Abstract—Phrases have been considered more informative feature terms for improving the effectiveness of document clustering. This paper proposes a weighted phrase-based document similarity that computes the pairwise similarities of documents based on the Weighted Suffix Tree Document (WSTD) model. The similarity is applied to the Group-average Hierarchical Agglomerative Clustering (GHAC) algorithm to develop a new Web document clustering approach. According to the structure of the Web documents, different document parts are assigned different levels of significance as structure weights, which are stored in the nodes of the weighted suffix tree; the tree is constructed from sentences rather than whole documents. By mapping each node and its weights in the WSTD model to a unique feature term in the Vector Space Document (VSD) model, the new similarity naturally inherits the TF-IDF term weighting scheme when computing document similarity with weighted phrases. Evaluation experiments indicate that the new clustering approach is very effective at clustering Web documents: its quality greatly surpasses that of the traditional phrase-based approach, which ignores the structure of Web documents.

Index Terms—suffix tree, web document clustering, weight computing, phrase-based similarity, document structure
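
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: word n-grams taken from each sentence stand in for suffix tree nodes, the structure weights in STRUCTURE_WEIGHTS are hypothetical values, the TF-IDF formula is one common variant, cosine similarity is assumed as the vector similarity, and the group-average step is simplified to the mean similarity over cross-cluster document pairs. All function names are illustrative.

```python
import math
from collections import defaultdict
from itertools import combinations

# Hypothetical structure weights; the abstract does not give actual values.
STRUCTURE_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_phrase_counts(doc, max_len=3):
    """Map each phrase (word n-gram) to its structure-weighted frequency.

    doc is a list of (part, sentence) pairs. N-grams of up to max_len words
    are a stand-in for the nodes of a sentence-level suffix tree.
    """
    counts = defaultdict(float)
    for part, sentence in doc:
        words = sentence.lower().split()
        weight = STRUCTURE_WEIGHTS[part]
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                counts[tuple(words[i:j])] += weight
    return counts


def tfidf_vectors(docs, max_len=3):
    """Turn each document into a sparse phrase -> TF-IDF weight vector."""
    counts = [weighted_phrase_counts(d, max_len) for d in docs]
    df = defaultdict(int)
    for c in counts:
        for phrase in c:
            df[phrase] += 1
    n = len(docs)
    # Assumed variant: (1 + log tf) * log(1 + N / df); tf >= 1 here because
    # every structure weight above is at least 1.
    return [
        {p: (1 + math.log(tf)) * math.log(1 + n / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[p] for p, x in u.items() if p in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ghac(vectors, num_clusters):
    """Naive agglomerative clustering with average linkage."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the highest average similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda ab: avg_sim(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


docs = [
    [("title", "suffix tree clustering"),
     ("body", "suffix tree clustering groups web documents")],
    [("title", "suffix tree clustering"),
     ("body", "web documents share phrases")],
    [("title", "chocolate cake recipe"),
     ("body", "bake the chocolate cake slowly")],
    [("title", "chocolate cake recipe"),
     ("body", "every cake recipe needs chocolate")],
]
clusters = ghac(tfidf_vectors(docs), num_clusters=2)
print(clusters)
```

In this toy input, a phrase that occurs in a title contributes three times as much to the weighted frequency as the same phrase in the body, which is the kind of effect the structure weights are meant to produce. A real suffix tree would avoid enumerating every n-gram explicitly and would not need the max_len cutoff.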


Cite: Ruilong Yang, Qingsheng Zhu, Yunni Xia, "A Novel Weighted Phrase-Based Similarity for Web Documents Clustering," Journal of Software, vol. 6, no. 8, pp. 1521-1528, 2011.
