Content and link based web spam detection

Date of Publication

2012

Document Type

Bachelor's Thesis

Degree Name

Bachelor of Science in Computer Science

College

College of Computer Studies

Department/Unit

Computer Science

Thesis Adviser

Arlyn Verina L. Ong

Defense Panel Chair

Clement Y. Ong

Abstract/Summary

Web spam refers to web pages that use various manipulation techniques to artificially raise their rankings in search engine results. These pages illegitimately manipulate the algorithms used by search engines so that they appear to contain trustworthy content and to be the most relevant to what the search engine user needs. Consequently, the quality of search engine results degrades and search engine users are inevitably misled. Human experts can do a good job of identifying spam pages and pages whose content is of doubtful quality. However, it is impractical to rely solely on human effort to classify millions of web pages, since doing so is too costly and time consuming. Most recently developed approaches to this problem use machine learning to detect web spam: a set of expert-classified pages (labeled either reputable or spam) is given as input to one or more algorithms, which learn from these examples and then classify other, unclassified pages on the web. While researchers in this field are mainly concerned with identifying new feature sets, the resources required to retrieve these feature sets are disregarded. This study identified a C4.5 classifier with a feature set containing more content-based features than link-based features of a page as the most efficient web spam detection design in terms of minimizing the required resource utilization, specifically the time complexity, while maintaining the quality of web spam detection.

Abstract Format

html

Language

English

Format

Print

Accession Number

TU16771

Shelf Location

Archives, The Learning Commons, 12F, Henry Sy Sr. Hall

Physical Description

1 v. (various foliations) ; 28 cm.

This document is currently not available here.
