Title: Web Classification The Web Unit Approach
1Web Classification The Web Unit Approach
- Ee-Peng Lim
- Division of Information Systems
- School of Computer Engineering
2Acknowledgement
- Dr Aixin Sun, University of New South Wales, Australia
- Asst Prof Dion Goh, School of Communication and Information, NTU
- Ming Yin, Graduate Student
3Outline
- Overview of Web Classification
- Web Page Classification
- Web Unit Mining
- Web Unit Relationship Mining
- Conclusion
4Introduction
- Users and businesses today depend on the World Wide Web for information and knowledge.
- Overloading syndrome in finding web information
- Web classification: to classify Web objects into pre-defined semantic structures
- Web page classification
- Web site classification
5Web Page Classification
[Figure: classifier training produces the classifier used for classification]
6Applications of Web Page Classification
- Web search engines/browsers
- Web directories
- Business intelligence
- Web information integration
7Web Page Classification Methods
- Web page classification approaches
- Text only approach
- Hypertext approach
- Neighborhood category approach
- Relational learning approach
8Existing Web Page Classification Approaches
- Text only approach
- Use term features only (Dumais and Chen 00)
- Performance is reasonable when web pages are like text documents
- Hypertext approach
- Using content and context features
- Context features: layout tag (Esposto et al 99), words from neighboring pages (Furnkranz 99, Glover et al 02, Chakrabarti et al 98)
9Existing Web Page Classification Approaches
- Relational learning approach
- Foil-Pilfs (Craven and Slattery 01)
- Neighborhood category approach
- Make use of neighboring page category labels (Oh et al 00)
10Research Questions
- Web page features for classification
- Which feature combinations give good performance?
- Category structure
- Can we have more complex category structures?
- Does a single Web page carry sufficient
information about a concept instance?
11Concept Relationship Graph
- Information is modeled by concepts and
relationships
12Which Page Feature Combination Gives Good Performance?
- Difficult to compare web page classification methods
- Web pages to be classified: Web pages from one/few/many web site(s)
- Different performance metrics
13Our Web Page Classification Research
- To conduct web page classification using different combinations of web page features
- To study their impact on classification accuracy
- Support Vector Machine (SVM) classifiers are used
14Data Set
- WebKB
- 4 Universities (4159 pages)
- Cornell, Texas, Washington, Wisconsin
- Classes
- WebKB consists of web pages classified into 7 classes
- Only 4 classes were used.
- Student, Faculty, Projects, Course
- Training documents
- Proportion of +ve training samples (Tr) ranges from 2% to 18% compared to -ve training samples.
15Web Page Features
- 4 combinations of web page features.
- Text Only (X)
- Text + Title (T)
- Title words are enclosed by <title> and </title>
- Title words of a web page may provide more semantic hints.
- Text + Anchor Words (A)
- Text words from the neighboring docs could be noise, so only the anchor words of incoming links are used.
- Text + Title + Anchor Words (TA)
- Text words, title words and in-link anchor words, all separately indexed
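The four feature combinations can be illustrated with a small sketch. This is not the authors' implementation: the function name `page_features`, the `text:`/`title:`/`anchor:` prefixes and the simple whitespace tokenization are assumptions made here only to show what "separately indexed" could look like, namely that the same word occurring in the body, the title and an in-link anchor becomes three distinct features.

```python
from html.parser import HTMLParser


class _TitleAndTextParser(HTMLParser):
    """Collects words inside <title> separately from all other text."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_words = []
        self.text_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        words = data.lower().split()
        if self._in_title:
            self.title_words.extend(words)
        else:
            self.text_words.extend(words)


def page_features(html, inlink_anchor_texts, combo="TA"):
    """Return a term-frequency dict for one page.

    combo is one of "X", "T", "A", "TA" (the labels used on the slide).
    Prefixes keep the three word sources in separate feature spaces
    (hypothetical naming, not taken from the source).
    """
    parser = _TitleAndTextParser()
    parser.feed(html)
    parser.close()

    features = {}

    def add(prefix, words):
        for word in words:
            key = prefix + word
            features[key] = features.get(key, 0) + 1

    add("text:", parser.text_words)
    if combo in ("T", "TA"):
        add("title:", parser.title_words)
    if combo in ("A", "TA"):
        for anchor in inlink_anchor_texts:
            add("anchor:", anchor.lower().split())
    return features
```

With `combo="X"` only the `text:` features are produced; the resulting dictionaries could then be converted to whatever sparse vector format the chosen SVM package expects.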
16Construction of Support Vector Machine (SVM) Classifier
- SVM is a binary classifier
- SVM has been shown to be accurate for text classification
- An SVM classifier is constructed for each category
- Joachims' SVMlight was used
17Construction of SVM Classifiers
- Due to the unbalanced training set
- Cost factor (parameter j in SVMlight)
- SCut thresholding
- Training set is further divided into construction set and validation set.
- Validation set is used to derive the appropriate threshold for the output score of SVM
18SCut Thresholding
- Performance Evaluation using leave-one-university-out strategy
- Pages from 3 universities are used as training pages
- Pages from the fourth are used as test pages.
- SCut Thresholding
- For the training dataset
- 2 universities -> training (train set)
- 1 university -> test (validation set)
- Find threshold that yields the optimal F1 value
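The thresholding step above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names are hypothetical, candidate thresholds are taken to be the observed validation scores, and the split generator enumerates every choice of validation university, which is an assumption because the slides do not say how the validation university was chosen.

```python
def f1_score(labels, predictions):
    """F1 for boolean labels and boolean predictions of equal length."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scut_threshold(scores, labels):
    """Pick the score threshold that maximizes F1 on the validation set.

    A page is predicted positive when its classifier score >= threshold.
    Ties are resolved in favour of the lowest threshold (assumption).
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(scores)):
        predictions = [s >= threshold for s in scores]
        f1 = f1_score(labels, predictions)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


def leave_one_university_out(universities):
    """Yield (test, construction, validation) splits.

    For each held-out test university, one of the remaining three is
    used as the validation set and the other two as the construction set.
    """
    for test in universities:
        train = [u for u in universities if u != test]
        for validation in train:
            construction = [u for u in train if u != validation]
            yield test, construction, validation
```

In use, a classifier would be trained on the construction universities, scored on the validation university, and the threshold returned by `scut_threshold` would then be applied to the scores of the test university.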
19Experimental Results
- F1 measure results
- Compared with CMU's FOIL-PILFS method
20Precision Results
21Recall Results
22F1 Results
23Discussion
- SVM performed very well on the WebKB data set
- SVM delivered better results than Foil-Pilfs when context features were used
- Title words are helpful to SVM compared to using text words only
- In-link anchor words lead to a significant increase in F1.
- Using text, title and anchor words resulted in the best classification performance
24What is the Impact of Concept-Relationship Graph?
- Tasks
- Find the Web pages forming a concept instance
- Assign it the appropriate concept label
- Identify the relationships among the concept instances
- Challenges
- Definition of concept instance is subjective
- Web sites organize Web pages in different ways
- Features for identifying relationship instances are limited
25Web Unit
- A Web page or a set of Web pages that jointly provides information on one concept instance
- One key page (homepage) and zero or more support pages.
Web Unit 2
Web Unit 1
http://..path/course/CS100/CS100.html
http://..path/course/CS100/lecture-programs.html
http://..path/course/CS100/officehours.html
http://..path/course/CS100/instructor.html
http://..path/course/CS100/exams/final.html
http://..path/course/CS100/exams/prelim.html
http://..path/user/johnson/index.html
http://..path/user/johnson/research.html
http://..path/user/johnson/publications.html
http://..path/user/johnson/activities.html
http://..path/user/johnson/students.html
http://..path/user/johnson/teaching.html
http://..path/user/johnson/contact.html
26Web Unit Mining
- Each concept instance is a Web unit
- Tasks
- Find Web pages that form a Web unit and determine the role of each page
- Assign concept labels to Web units
- Differences between Web Unit Mining and Web Page Classification
- Concept-relationship graph vs flat categories
- Web units are not given a priori
27Web Directory Structure
- Structure of Web site based on Web page URLs
- Example
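A minimal sketch of how a directory structure could be derived from page URLs is given here. It is an illustration under assumptions: the function name, the nested-dictionary layout and the host name in the usage note are hypothetical, and a URL ending in "/" is recorded as an unnamed default page of its folder.

```python
from urllib.parse import urlparse


def build_web_directory(urls):
    """Group page URLs into a nested folder tree based on their URL paths.

    Each node is {"folders": {name: node}, "pages": [file names]}.
    """
    root = {"folders": {}, "pages": []}
    for url in urls:
        path = urlparse(url).path
        parts = [p for p in path.split("/") if p]
        if path.endswith("/") or not parts:
            # Folder URL: store an empty file name for its default page (assumption).
            folder_parts, page = parts, ""
        else:
            folder_parts, page = parts[:-1], parts[-1]
        node = root
        for name in folder_parts:
            node = node["folders"].setdefault(name, {"folders": {}, "pages": []})
        node["pages"].append(page)
    return root
```

For example, feeding it `http://www.example.edu/course/CS100/CS100.html` and `http://www.example.edu/course/CS100/exams/final.html` places `CS100.html` in the `course/CS100` folder and `final.html` in its `exams` subfolder.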
28Observations on Web Units
- Observation 1
- Web pages from the same Web folder are more semantically related
- Observation 2
- Support pages are normally reachable from the key page
- Observation 3
- Key page is usually at the highest level of the Web folder containing the Web unit
29Observations on Web Units
- Observation 4
- Web units of the same concept seldom have links between them
- Observation 5
- Multi-page Web units of the same concept often reside in a set of folders (one for each) under a common parent folder
- One-page Web units of the same concept often appear in the same folder
- Observation 6
- Key pages of the Web units of the same concept are often the link targets of a hub page
30Iterative Web Unit Mining (iWUM)
31Iterative Web Unit Mining (iWUM)
32Web Fragment Generation
- Associate closely-related Web pages together
- Reduce the objects to be classified
- Reduce noise in training
- Criteria to determine Web fragments
- Web folder connectivity index
- Web page naming convention
- Common names for key pages are index.html, index.htm, etc.
33Web Folder Connectivity
34Web Folder Connectivity Index
- Connectivity from web page pa to web page pb
- Connectivity from a web page to a web folder
- Connectivity from a web folder to a web folder
35Web Fragment Generation
- Connectivity index of a web folder Fi: fFi
- Large fFi suggests pages and subfolders in Fi are closely linked
- Connectivity index threshold to determine the folders containing web unit(s)
- In experiments, 0.1667 was used (at least 5 items not connected)
36Web Fragment Generation
- Find Candidate Key Pages
- URL of the page ends with a /
- The folder containing the page and the page share the same name, e.g., path/cs100/cs100.html
- Page file name matches home, index, welcome, default, or homepage
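The three candidate key page rules above can be sketched as a single predicate. The function name and the constant are hypothetical, and two details are assumptions not stated on the slide: names are compared case-insensitively, and a file name "matches" a key page name only when its stem (the name without extension) is exactly equal to it.

```python
import posixpath
from urllib.parse import urlparse

# Key page names listed on the slide.
KEY_PAGE_NAMES = {"home", "index", "welcome", "default", "homepage"}


def is_candidate_key_page(url):
    """Apply the three candidate key page rules to a page URL."""
    path = urlparse(url).path

    # Rule 1: the URL ends with a "/".
    if path.endswith("/"):
        return True

    folder, filename = posixpath.split(path)
    stem = posixpath.splitext(filename)[0].lower()
    folder_name = posixpath.basename(folder).lower()

    # Rule 2: the page shares its name with the containing folder.
    if folder_name and stem == folder_name:
        return True

    # Rule 3: the file name is one of the common key page names.
    return stem in KEY_PAGE_NAMES
```

Under these assumptions, `.../course/CS100/CS100.html` and `.../user/johnson/index.html` are candidates, while `.../user/johnson/research.html` is not.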
37Web Fragment Generation and Classification
38Web Unit Construction
39Web Unit Construction
1. http://..path/user/johnson/index.html PROF
http://..path/user/johnson/research.html
http://..path/user/johnson/publications.html
http://..path/user/johnson/activities.html
http://..path/user/johnson/students.html
http://..path/user/johnson/teaching.html
http://..path/user/johnson/contact.html
1. http://..path/course/CS100/CS100.html COURSE
http://..path/course/CS100/lecture-programs.html
http://..path/course/CS100/officehours.html
http://..path/course/CS100/instructor.html
http://..path/course/CS100/exams/final.html
http://..path/course/CS100/exams/prelim.html
40Web Unit Classification
- Observations 5 and 6
- Multi-page Web units of the same concept often reside in a set of folders (one for each) under a common parent folder
- Key pages of the Web units of the same concept are often the link targets of a hub page
- Improve Web unit mining accuracy
- Web site structure features
- Content features
41Web Unit Classification
42Web Site Structure Features
- Normalized classification score (each web unit) for each concept
- Organization of the web units within the web site
- Closeness to the average depth for each concept
- Highest in-link hub value for each concept
- Precision support of the parent web folder for each concept
- Recall support of the parent web folder for each concept
- Word features in the web page names and URLs
- Each word (term) in page names and URLs
43Performance Metrics
- Given a mined web unit ui
- Is ui correctly constructed?
- Is ui correctly classified?
- Perfect web unit u'i of a constructed web unit ui: the labelled web unit that contains ui.k and has the same label as ui. ui.k can be either the key page or a support page of u'i.
44Performance Metrics
- Precision/Recall for a web unit
- Satisfaction variable (a)
- Degree of importance when the key page of a web unit is correctly identified
- a = 1/u: key page and support pages are equally important.
- a = 1: only the key page is important.
- a = 1, 0.5, 1/u are used in our experiments
45Performance Metrics
- Precision/Recall for a concept
46Experiments (dataset)
- UnitSet
- Pages in WebKB are manually grouped into Web units
- Most pages from the Others category are used as support pages of the corresponding Web units.
47Experiments (methods)
- Baseline method
- 3 steps: (1) train Web page classifiers, (2) classify Web pages, and (3) construct Web units
- Baseline with fragments method
- Non-iterative version of iWUM
- 5 steps: (1) train Web fragment classifiers, (2) build Web directory, (3) generate Web fragments, (4) classify Web fragments, and (5) construct Web units.
- Iterative Web Unit Mining method (iWUM)
48Results
[Results for a = 1, a = 0.5 and a = 1/u]
49Results
50iWUM Results
51Detailed Results
- Web unit label change rate
52Web Unit Relationship Mining
- Identify relationship instances among the mined web units that are concept instances
- Example: Instructor-of (Johnson, CS100)
- Challenges
- Approach? Method?
- Features to be used?
53Web Unit Relationship Mining
- Assumptions
- Relationships can be determined based on background relation knowledge
- Background relations are represented by inter-unit features
- Our proposed method
- Candidate Web unit pair generation
- Feature acquisition
- Classifier training
- Classification
54Inter-Unit Features
- Navigation Features (N)
- Relative Location Features (R)
- Parent-child
- Sibling
- Ancestor-descendent
- Common-item Features (E)
- Email addresses
55Navigation Features (N)
56Relative Location Features (R)
- Parent-child: h2 and h4
- Sibling: h2 and h3
- Ancestor-descendent: h1 and h4
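The three relative location features can be illustrated with a small sketch that compares the folders holding the key pages of two Web units. This is an assumption-laden illustration rather than the method from the talk: the function names and the "same-folder" and "none" outcomes are introduced here, and the folder layout in the usage note (h2 and h3 directly under h1, h4 directly under h2) is only one layout that agrees with the three pairs listed above, since the original figure is not available.

```python
import posixpath


def folder_of(path):
    """Folder part of a URL path, without a trailing slash ("" for the root)."""
    return posixpath.dirname(path).rstrip("/")


def relative_location(folder_a, folder_b):
    """Classify how two folders relate within the Web directory tree."""
    a = [p for p in folder_a.split("/") if p]
    b = [p for p in folder_b.split("/") if p]
    if a == b:
        return "same-folder"

    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if longer[:len(shorter)] == shorter:
        # One folder contains the other: directly, or several levels down.
        if len(longer) - len(shorter) == 1:
            return "parent-child"
        return "ancestor-descendent"

    # Different folders at the same depth that share a parent folder.
    if len(a) == len(b) and a[:-1] == b[:-1]:
        return "sibling"
    return "none"
```

With folders `/h1`, `/h1/h2`, `/h1/h3` and `/h1/h2/h4`, the function reports parent-child for h2 and h4, sibling for h2 and h3, and ancestor-descendent for h1 and h4. Each outcome could then be encoded as a binary feature for a candidate Web unit pair.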
57Experimental Dataset
- WebKB
- Department-of (people, department)
- Instructor-of (people, course)
- Member-of (people, project)
58Experimental Results
- On the manually labelled web units
59Experimental Results
- On the iWUM mined web units
60Conclusion
- Feature combinations for Web Page Classification
- Web Unit to model a concept instance
- Web unit mining and iWUM method
- Web fragment generation and classification
- Web unit construction and classification
- Web unit mining performance metric
- Web unit relationship mining
61Future Works
- Enhancement of the proposed solutions
- Evaluation of iWUM method on larger datasets
- Development of incremental Web unit mining methods
- Web units and applications
- Search Engines for Web Units and Web Unit Relationships
62Relevant Publications
- A. Sun, E.-P. Lim, W.-K. Ng, J. Srivastava, Blocking Reduction Strategies in Hierarchical Text Classification, IEEE TKDE 16(10):1305-1308, 2004.
- A. Sun, E.-P. Lim, Web Unit Mining: Finding and Classifying Subgraphs of Web Pages, ACM CIKM, 2003.
- A. Sun, E.-P. Lim, and W.-K. Ng, Performance Measurement Framework for Hierarchical Text Classification, JASIST 54(11):1014-1028, 2003.
- A. Sun, E.-P. Lim, Web Classification Using Support Vector Machine, ACM WIDM 2002.
63Thank You