Volume 9 Number 8 (Aug. 2014)
Home > Archive > 2014 > Volume 9 Number 8 (Aug. 2014) >
JSW 2014 Vol.9(8): 页码 ISSN: 1796-217X
doi: 10.4304/jsw.9.8.页码

Algorithm for Fast Finding High-Frequency Strings from Large-Scale Corpus1

Haijun Zhang

School of Computer Science and Technology Xinjiang Normal University, Urumqi China, 830054

Abstract—In high-frequency string extraction, there exists enormous time and memory waste in taking statistics of tremendous low-frequency strings, which causes low efficiency. Based on the incremental n-gram model, this paper puts forward Hierarchical Pruning Algorithm (HPA) to filter out low-frequency garbage strings and to extract candidate repeats for reducing I/O reading-writing times and enhancing efficiency of memory usage. On the basis of candidate repeats, external sort method is applied to merge all of them in order to obtain the final repeat set. For improving the efficiency of candidate repeats merging, this paper proposes to employ improved Radix Sort method to process strings in O(dn). With 32 gigabyte plain text corpus, experiments show that the relationship between I/O reading-writing times of HPA and the corpus size is nearly linear, and the algorithm can efficiently extract repeats from corpus whose size is much larger than that of memory.

Index Terms—repeats, hash table, low-frequency string, hierarchical pruning algorithm

[PDF]

Cite: Haijun Zhang, "Algorithm for Fast Finding High-Frequency Strings from Large-Scale Corpus1," Journal of Software vol. 9, no. 8, pp. 2154-2159, 2014.

General Information

  • ISSN: 1796-217X (Online)

  • Abbreviated Title: J. Softw.

  • Frequency:  Quarterly

  • APC: 500USD

  • DOI: 10.17706/JSW

  • Editor-in-Chief: Prof. Antanas Verikas

  • Executive Editor: Ms. Cecilia Xie

  • Abstracting/ Indexing: DBLP, EBSCO,
           CNKIGoogle Scholar, ProQuest,
           INSPEC(IET), ULRICH's Periodicals
           Directory, WorldCat, etc

  • E-mail: jsweditorialoffice@gmail.com

  • Oct 22, 2024 News!

    Vol 19, No 3 has been published with online version   [Click]

  • Jan 04, 2024 News!

    JSW will adopt Article-by-Article Work Flow

  • Apr 01, 2024 News!

    Vol 14, No 4- Vol 14, No 12 has been indexed by IET-(Inspec)     [Click]

  • Apr 01, 2024 News!

    Papers published in JSW Vol 18, No 1- Vol 18, No 6 have been indexed by DBLP   [Click]

  • Jun 12, 2024 News!

    Vol 19, No 2 has been published with online version   [Click]