Web Content Extraction Using Machine Learning
Authors: Prof. Yagnesh Tiwari
DOI: https://doi.org/10.17605/OSF.IO/2A35N
Short DOI: https://doi.org/ggkv66
Country: -
Abstract: A content extraction model aims to separate the main content of a web page from noise. We define content as a continuous, meaningful body of text from a web page that can be used to summarize the required topic concisely. Noise, on the other hand, is any element of a web page that deviates from the main content, such as copyright disclaimers, advertisements, and navigation elements. Boilerplate templates form a major extraction criterion of the boilerplate detection algorithm.
Keywords: Content Extraction, Boilerplate Removal, Template Detection, CETR.
Paper Id: 549
Published On: 2014-04-27
Published In: Volume 2, Issue 2, March-April 2014
Cite This: Web Content Extraction Using Machine Learning - Prof. Yagnesh Tiwari - IJIRMPS Volume 2, Issue 2, March-April 2014. DOI 10.17605/OSF.IO/2A35N
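
The distinction the abstract draws between content and noise can be illustrated with a small text-to-tag ratio sketch, loosely in the spirit of CETR (Content Extraction via Tag Ratios), which appears in the keywords. This is not the paper's method: the function names, the sample HTML, and the fixed threshold of 10.0 are assumptions made for illustration only, and the published CETR approach smooths and clusters the ratios rather than applying a fixed cutoff.

```python
import re

# Matches any HTML tag; a rough approximation that ignores comments and scripts.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return the number of non-tag characters per tag on one line of HTML."""
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    # Treat a line with no tags as having one, to avoid division by zero.
    return len(text) / max(len(tags), 1)


def extract_content(html: str, threshold: float = 10.0) -> list[str]:
    """Keep the text of lines whose ratio meets the (assumed) threshold."""
    kept = []
    for line in html.splitlines():
        if text_to_tag_ratio(line) >= threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept


# Hypothetical page: a navigation bar, one content paragraph, a footer.
SAMPLE = """<div><a href="/">Home</a> <a href="/about">About</a></div>
<p>Content extraction separates the main text of a page from surrounding noise.</p>
<div><span>Copyright 2014</span></div>"""

if __name__ == "__main__":
    for text in extract_content(SAMPLE):
        print(text)
```

Lines dominated by markup, such as the navigation bar and the copyright footer in the sample, score low and are treated as noise, while the long paragraph with few tags is kept as content.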