Volume 8 Number 7 (Jul. 2013)
JSW 2013 Vol.8(7): 1666-1670 ISSN: 1796-217X
doi: 10.4304/jsw.8.7.1666-1670

The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance

Junxiu An, Pengsen Cheng

Chengdu University of Information Technology, Chengdu, P.R.China

Abstract—Redundant web pages increase the load on a search engine and degrade the user experience, so they need to be detected and handled. Most current near-replica detection algorithms depend on extracting the main content of a web page, but content extraction is costly and is becoming harder to perform correctly. This paper addresses these issues in three ways: it defines the largest number of common characters as the converse of edit distance; it builds the feature string of a web page from the Chinese character immediately preceding each period in lightly processed text; and it uses the largest number of common characters to calculate the overlap factor between the feature strings of two web pages. The aim is to detect near-replicas in a high-noise environment without extracting the page content. In our experiments, the algorithm reaches a recall of 96.7% and a precision of 97.8%.

Index Terms—Near-replica detection, edit distance, largest number of common characters, feature string of web page.
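
The three steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything here is an assumption made for illustration: the "largest number of common characters" is taken to be the length of the longest common subsequence (the natural counterpart of an insert/delete edit distance), the overlap factor is taken to be that count divided by the longer feature string's length, and the function names `feature_string`, `common_char_count`, and `overlap_factor` are hypothetical. The paper itself may define these quantities differently.

```python
def feature_string(text: str, stop: str = "。") -> str:
    """Collect the character immediately preceding each full stop.

    Assumption: `text` is already lightly cleaned (tags stripped), and the
    Chinese full stop is the only sentence delimiter considered.
    """
    return "".join(
        text[i - 1]
        for i, ch in enumerate(text)
        if ch == stop and i > 0 and text[i - 1] != stop
    )


def common_char_count(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Assumption: this stands in for the abstract's "largest number of
    common characters". Computed row by row to keep memory at O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def overlap_factor(a: str, b: str) -> float:
    """Assumed overlap factor: common characters over the longer length."""
    if not a or not b:
        return 0.0
    return common_char_count(a, b) / max(len(a), len(b))


# Two pages with the same body text; page_b carries one extra noise sentence.
page_a = "今天天气很好。我们去公园玩。晚上回家吃饭。"
page_b = "广告位招租。今天天气很好。我们去公园玩。晚上回家吃饭。"
fa, fb = feature_string(page_a), feature_string(page_b)
print(fa, fb, overlap_factor(fa, fb))
```

Under these assumptions, a page pair would be flagged as near-replicas when the overlap factor exceeds a chosen threshold; the threshold value is not stated in the abstract and would have to be tuned experimentally.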


Cite: Junxiu An, Pengsen Cheng, "The Chinese Duplicate Web Pages Detection Algorithm based on Edit Distance," Journal of Software vol. 8, no. 7, pp. 1666-1670, 2013.


  • Oct 22, 2024 News!

    Vol 19, No 3 has been published with online version   [Click]

  • Jan 04, 2024 News!

    JSW will adopt Article-by-Article Work Flow

  • Apr 01, 2024 News!

    Vol 14, No 4- Vol 14, No 12 has been indexed by IET-(Inspec)     [Click]

  • Apr 01, 2024 News!

    Papers published in JSW Vol 18, No 1- Vol 18, No 6 have been indexed by DBLP   [Click]

  • Jun 12, 2024 News!

    Vol 19, No 2 has been published with online version   [Click]