Journal - Volume I No.1 September 2014, Special Issue-1

  • Title

    :

    A SUPERVISED WEB-SCALE FORUM CRAWLER USING URL TYPE RECOGNITION

    Authors

    :

    A. Anitha, Mrs. R. Angeline

    Keywords

    :

    EIT path, forum crawling, ITF regex, page classification, page type, URL pattern learning, URL type.

    Issue Date

    :

    September - 2014

    Abstract

    :

    The main goal of the Supervised Web-Scale Forum Crawler Using URL Type Recognition is to discover relevant content from web forums with minimal overhead. The crawler's output is the information content of forum threads. Recent post information from each user is used to refresh crawled threads in a timely manner: for each user, a regression model is fitted to predict when the next post will arrive on the thread page, and this prediction drives the timely refresh of forum data. Although forums are powered by different forum software packages and have different layouts or styles, they always have similar implicit navigation paths. Implicit navigation paths are connected by specific URL types which lead users from entry pages to thread pages. Based on this observation, the web forum crawling problem is reduced to a URL-type recognition problem. We show how to learn regular expression patterns of implicit navigation paths from automatically generated training sets, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as three annotated forums. The forum crawler achieved over 98 percent effectiveness and 98 percent coverage on a large set of test forums powered by over 100 different forum software packages.
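
    As a rough illustration of what reducing crawling to URL-type recognition can look like, the sketch below assigns a URL path to one of three types (index, thread, page-flipping) with regular expressions and keeps only the links worth following. The patterns, the URL layout, and the names `URL_TYPE_PATTERNS`, `classify_url` and `select_links_to_follow` are all hypothetical and hand-written for an imaginary forum; in the approach described in the abstract, such patterns are learned from automatically generated training sets rather than written by hand.

```python
import re

# Hypothetical, hand-written patterns for one imaginary forum layout.
# In the described approach these would be learned, not hard-coded.
URL_TYPE_PATTERNS = {
    "index": re.compile(r"^/forum/\d+/?$"),
    "thread": re.compile(r"^/thread/\d+/?$"),
    "page_flipping": re.compile(r"^/(?:forum|thread)/\d+/page/\d+/?$"),
}


def classify_url(path):
    """Return the URL type whose pattern matches the path, or None."""
    for url_type, pattern in URL_TYPE_PATTERNS.items():
        if pattern.match(path):
            return url_type
    return None


def select_links_to_follow(paths):
    """Keep only links of a recognized type, preserving their order."""
    return [p for p in paths if classify_url(p) is not None]


if __name__ == "__main__":
    links = ["/forum/12", "/thread/345", "/thread/345/page/2", "/member/7"]
    for link in select_links_to_follow(links):
        print(link, classify_url(link))
```

    Links that match none of the patterns (for example a member profile page) are simply skipped, which is how a crawler of this kind can avoid fetching pages that carry no thread content.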

    Page(s)

    :

    6-15

    ISSN

    :

    2347-4734

    Source

    :

    Vol. 1, No.1 - Special Issue-1
