Title: Web classification
1. Web classification
2. References
- Using Ontologies to Discover Domain-Level Web Usage Profiles. H. Dai, B. Mobasher (hdai, mobasher_at_cs.depaul.edu), DePaul University.
- Learning to Construct Knowledge Bases from the World Wide Web. M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery, Carnegie Mellon University, Pittsburgh, USA; D. Freitag, A. McCallum, Just Research, Pittsburgh, USA.
3. Definitions
- Ontology
  - An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.
- Taxonomy
  - A classification of organisms into groups based on similarities of structure, origin, etc.
4. Goal
- Capture and model behavioral patterns and profiles of users interacting with a web site.
- Why?
  - Collaborative filtering
  - Personalization systems
  - Improve the organization and structure of the site
  - Provide dynamic recommendations (www.recommend-me.com)
5. Algorithm 0 (by Rafa's brother Gabriel)
- Recommend pages viewed by other users with similar page rankings.
- Problems
  - New-item problem
  - Doesn't consider content similarity or item-to-item relationships.
6. User session
- User session s = ⟨w(p1,s), w(p2,s), …, w(pn,s)⟩
  - w(pi,s) is the weight associated with page pi in session s
- Session clusters cl1, cl2, …
  - Each cli is a subset of the set of sessions
- Usage profile pr_cl = {⟨p, weight(p, pr_cl)⟩ | weight(p, pr_cl) ≥ µ}
  - weight(p, pr_cl) = (1/|cl|) · Σ_{s ∈ cl} w(p, s)
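The profile computation above can be sketched in a few lines of Python. This is a minimal illustration (all names are ours, not the paper's): average each page's session weights over the cluster and keep pages whose mean weight reaches the threshold µ.

```python
def usage_profile(cluster, mu):
    """cluster: list of sessions, each a dict {page: weight}.
    Returns {page: mean_weight} for pages with mean weight >= mu."""
    totals = {}
    for session in cluster:
        for page, w in session.items():
            totals[page] = totals.get(page, 0.0) + w
    n = len(cluster)
    return {page: total / n for page, total in totals.items()
            if total / n >= mu}

# Three toy sessions; p2's mean weight (0.2) falls below mu and is dropped.
sessions = [{"p1": 1.0, "p2": 0.5}, {"p1": 0.8}, {"p2": 0.1, "p3": 0.9}]
profile = usage_profile(sessions, mu=0.3)
```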
7. Algorithm 1
- For every session, create a vector containing the viewed pages and a weight for each page.
- Each vector represents a point in an N-dimensional space, so we can identify clusters.
- For a new session, check which cluster its vector/point belongs to, and recommend high-scoring pages from that cluster.
- Problems
  - New-item problem
  - Doesn't consider content similarity or item-to-item relationships.
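A minimal sketch of the recommendation step of Algorithm 1, assuming the cluster profiles have already been computed (clustering itself is omitted; all names and the cosine similarity choice are illustrative assumptions, not from the slides):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse page-weight vectors (dicts)."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(session, profiles, top_n=2):
    """Assign the session to the most similar cluster profile and
    recommend its highest-weight pages the user has not yet seen."""
    best = max(profiles, key=lambda pr: cosine(session, pr))
    unseen = {p: w for p, w in best.items() if p not in session}
    return [p for p, _ in sorted(unseen.items(), key=lambda kv: -kv[1])][:top_n]

profiles = [{"p1": 0.9, "p2": 0.6, "p3": 0.4}, {"p4": 0.8, "p5": 0.7}]
print(recommend({"p1": 1.0}, profiles))  # p2 and p3 from the nearest profile
```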
8. Algorithm 2: keyword search
- Solves the new-item problem.
- Not good enough:
  - A page can contain info about more than one object.
  - Fundamental data may be pointed to by the page rather than included in it.
  - What exactly is a keyword?
- Solution
  - Domain ontologies for objects
9. Domain Ontologies
- Domain-Level Aggregate Profile: a set of pseudo objects, each characterizing objects of different types occurring commonly across the user sessions.
- Class C
- Attribute a = ⟨Da, Ta, ≤a, ψa⟩
  - Ta: type of the attribute
  - Da: domain of the values for a (red, blue, …)
  - ≤a: ordering relation among Da
  - ψa: combination function
10. Example: movie web site
- Classes
  - movies, actors, directors, etc.
- Attributes
  - Movies: title, genre, starring actors
  - Actors: name, filmography, gender, nationality
- Functions
  - ψ_actor(⟨{S: 0.7, T: 0.2, U: 0.1}, 1⟩, ⟨{S: 0.5, T: 0.5}, 0.7⟩) = Σi(wi · woi) / Σi(wi)
  - ψ_year({1991}, {1994}) = {1991, 1994}
  - ψ_is_a({person, student}, {person, TA}) = {person}
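A combination function in the spirit of ψ_actor can be sketched as follows. The exact normalization on the slide is ambiguous, so this implements one plausible reading, Σi(wi · woi) / Σi(woi), with illustrative names throughout:

```python
def combine_weighted(objects):
    """objects: list of (attribute_weights: dict, object_weight: float).
    Combine per-object attribute weights, scaling each by the
    object's significance and normalizing by total significance."""
    num, denom = {}, 0.0
    for attr_weights, wo in objects:
        denom += wo
        for value, w in attr_weights.items():
            num[value] = num.get(value, 0.0) + w * wo
    return {value: total / denom for value, total in num.items()}

# The two actor lists from the slide, with object weights 1 and 0.7.
combined = combine_weighted([({"S": 0.7, "T": 0.2, "U": 0.1}, 1.0),
                             ({"S": 0.5, "T": 0.5}, 0.7)])
```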
11. (No transcript; figure-only slide)
12. Creating an aggregated representation of a usage profile
- pr = {⟨o1, wo1⟩, …, ⟨on, won⟩}
  - oi: an object; woi: its significance in the profile pr
- Assume all the objects are instances of the same class.
- Create a new virtual object o with attributes ai = ψi(o1, …, on).
13. Item-level usage profile

Year       | Actor                          | Genre                                                   | Name
2002       | S 0.7, T 0.2, U 0.1            | Genre-all: Romance, Romance Comedy, Comedy, Kids family | A
1999       | S 0.5, T 0.5                   | Genre-all: Romance Comedy                               | B
2001       | W 0.6, S 0.4                   | Genre-all: Romance                                      | C
1999, 2002 | S 0.58, T 0.27, W 0.09, U 0.05 | Genre-all: Romance                                      | A1 B1 C1
14. A real (estate property) example
15. Item-level usage profile

Room num | Location          | Price | Weight
5        | Chicago           | 475K  | 1
4        | Chicago           | 299K  | 0.7
4        | Evanston          | 272K  | 0.18
3        | Chicago           | 99K   | 0.18
4        | Chicago, Evanston | 365K  | 1
16. Algorithm 2
- Do not just recommend other items viewed by other users; recommend items similar to the class representative.
- Advantages
  - Higher accuracy
  - Needs fewer examples
  - No new-item problem
  - Also considers content similarity (item-to-item relationships).
17. Item-level usage profile

Room | Location          | Price | Weight
5    | Chicago           | 475K  | 1
4    | Chicago           | 299K  | 0.7
4    | Evanston          | 272K  | 0.18
3    | Chicago           | 99K   | 0.18
4    | Chicago, Evanston | 365K  | 1
4    | Chicago           | 370K  | 1
18. Final Algorithm
- Given a web site:
  - Classify its contents into classes and attributes.
  - Merge the objects of each user profile and create a pseudo object.
  - Recommend according to this pseudo-object.
19. Problems
- A per-topic solution.
- Found patterns can be incomplete.
- User patterns may change with time (for movies, the "I loved E.T." problem).
- Needs cookies and other methods to identify users.
- How is the weight calculated? It can need many examples (the "I loved American Beauty" problem).
- How to automatically group the web pages?
20. (No transcript)
21. Constructing a Knowledge Base from the WWW
- Goal
  - Automatically create a computer-understandable knowledge base from the web.
- Why?
  - To use in the previously described work, and similar.
  - "Find all universities that offer Java programming courses."
  - "Make me hotel and flight arrangements for the upcoming Linux conference."
22. Constructing a Knowledge Base from the WWW
- How?
  - Use machine learning to create information-extraction methods for each of the desired types of knowledge.
  - Apply them to extract symbolic, probabilistic statements directly from the web, e.g. Student-of(Rafa, sdbi) with confidence 0.99.
- Method used
  - Provide an initial ontology (classes and relations).
  - Training examples: 3 out of 4 university sites (8000 web pages, 1400 web-page pairs).
23. Example of web pages

Jim's Home Page: "I teach several courses: Fundamentals of CS, Intro to AI. My research includes intelligent web agents."
Fundamentals of CS Home Page: "Instructors: Jim, Tom"

- Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other
- Relations: instructor-of, members-of-project, department-of
24. Ontology
Web KB instances (figure)
25. Problem Assumptions
- Class instances: one instance / one web page. But:
  - Multiple instances in one web page
  - Multiple linked/related web pages for one instance
  - The Elvis problem
- Relation R(A,B) is represented by:
  - Hyperlinks A→B or A→C→D→…→B
  - Inclusion in a particular context ("I teach Intro2cs")
  - A statistical model of typical words
26. To Learn
- Recognizing class instances by classifying bodies of hypertext
- Recognizing relation instances by classifying chains of hyperlinks
- Extracting text fields
27. Recognizing class instances by classifying bodies of hypertext
- Statistical bag-of-words approach
  - Full text
  - Hyperlinks
  - Title/head
- Learning first-order rules
- Combine the previous 4 methods
28. Statistical bag-of-words approach
- Context-less classification.
- Given a set of classes C = {c1, c2, …, cN}
- Given a document consisting of n ≈ 2000 words w1, w2, …, wn
- c* = argmax_c Pr(c | w1, …, wn)
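The argmax above is the standard naive Bayes decision rule. A toy sketch with add-one smoothing and equal class priors (both are our assumptions, not stated on the slide; the training words are invented):

```python
import math

def train(docs_by_class):
    """Estimate smoothed Pr(word | class) from bags of words."""
    vocab = {w for words in docs_by_class.values() for w in words}
    model = {}
    for c, words in docs_by_class.items():
        counts = {}
        for w in words:
            counts[w] = counts.get(w, 0) + 1
        total = len(words)
        # add-one (Laplace) smoothing over the vocabulary
        model[c] = {w: (counts.get(w, 0) + 1) / (total + len(vocab))
                    for w in vocab}
    return model

def classify(model, doc):
    """argmax over classes of sum of log word probabilities."""
    return max(model, key=lambda c: sum(math.log(model[c][w])
                                        for w in doc if w in model[c]))

model = train({"course": ["homework", "lecture", "exam", "lecture"],
               "faculty": ["professor", "research", "publications"]})
print(classify(model, ["lecture", "exam"]))  # -> course
```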
29. Accuracy
(Rows: predicted class; columns: actual class.)

Predicted | Cour | Stud | Facu | Staff | Rese | Dept | Other | Accuracy
Course    | 202  | 17   | 0    | 0     | 1    | 0    | 552   | 26.2
Student   | 0    | 421  | 14   | 17    | 2    | 0    | 519   | 43.3
Faculty   | 5    | 56   | 118  | 16    | 3    | 0    | 264   | 17.9
Staff     | 0    | 15   | 1    | 4     | 0    | 0    | 45    | 6.2
Rese      | 8    | 9    | 10   | 5     | 62   | 0    | 384   | 13.0
Dept      | 10   | 8    | 3    | 1     | 5    | 4    | 209   | 1.7
Other     | 19   | 32   | 7    | 3     | 12   | 0    | 1064  | 93.6
Coverage  | 82.8 | 75.4 | 77.1 | 8.7   | 72.9 | 100  | 35    |
30. Statistical bag-of-words approach
- Vocabulary selection: rank each word wi by Pr(wi | c) · log(Pr(wi | c) / Pr(wi | ¬c))
31. Accuracy/coverage tradeoff for full-text classifiers
32. Accuracy/coverage tradeoff for hyperlink classifiers
33. Accuracy/coverage tradeoff for title/heading classifiers
34. Learning first-order rules
- The previous method doesn't consider relations between pages.
- Example: a page is a course home page if it contains the words "textbook" and "TA" and points to a page containing the word "assignment".
- FOIL is a learning system that constructs Horn-clause programs from examples.
35. Relations
- has_word(Page). Words are stemmed (computer, computing → comput). Kept words occur 200 times but fewer than 30 times in pages of other classes.
- link_to(Page, Page)
- m-estimate accuracy = (nc + m·p) / (n + m)
  - nc: number of instances correctly classified by the rule
  - n: total number of instances classified by the rule
  - m = 2
  - p: proportion of instances in the training set that belong to that class
- Predict each class with confidence best_match / total_#_of_matches
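The m-estimate above is a one-liner; a minimal sketch for scoring a learned rule (the example numbers are illustrative):

```python
def m_estimate(n_correct, n_covered, prior, m=2):
    """m-estimate accuracy of a rule: (nc + m*p) / (n + m)."""
    return (n_correct + m * prior) / (n_covered + m)

# A rule covering 10 pages, 9 correctly, for a class with prior 0.1:
score = m_estimate(9, 10, 0.1)  # (9 + 2*0.1) / (10 + 2) = 9.2 / 12
```

The smoothing toward the class prior penalizes rules that cover very few examples, which matters when FOIL proposes many highly specific clauses.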
36. Newly learned rules
- student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).
- faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).
- course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B,_)), has_assign(B).
37. Accuracy/coverage tradeoff for FOIL page classifiers
38. Boosting
- The best-predicting classification method depends on the class.
- Combine the predictions using the confidence measure.
39. Accuracy/coverage tradeoff for combined classifiers (2000-word vocabulary)
40. Boosting
- Disappointing: somehow it is not uniformly better.
- Possible solutions:
  - Use reduced-size dictionaries (next)
  - Use other methods for combining predictions (voting instead of best_match / total_#_of_matches)
41. Accuracy/coverage tradeoff for combined classifiers (200-word vocabulary)
42. Multi-page segments
- The group is the longest prefix (indicated in parentheses):
  - (@/user,faculty,people,home,projects/)/.html,htm
  - (@/cs???,www/,)/.html,htm
  - (@/cs???,www/,)/
  - …
- A primary page is any page whose URL matches:
  - @/index.html,htm
  - @/home.html,htm
  - @/1/1.html,htm
  - …
- If no page in the group matches one of these patterns, then the page with the highest score for any non-other class is the primary page.
- Any non-primary page is tagged as Other.
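The primary-page heuristic above can be sketched with regular expressions. The patterns here are our rough approximations of the slide's patterns (which are partly garbled), not the exact ones:

```python
import re

# Approximations of the slide's primary-page URL patterns.
PRIMARY_PATTERNS = [
    re.compile(r"/index\.html?$"),
    re.compile(r"/home\.html?$"),
    re.compile(r"/([^/]+)/\1\.html?$"),  # /name/name.html
]

def primary_page(group, scores):
    """group: list of URLs in one segment; scores: url -> best
    non-'other' class score. Returns the segment's primary page."""
    for url in group:
        if any(p.search(url) for p in PRIMARY_PATTERNS):
            return url
    # no pattern matched: fall back to the highest-scoring page
    return max(group, key=lambda u: scores.get(u, 0.0))

group = ["http://cs.edu/~jim/pubs.html", "http://cs.edu/~jim/index.html"]
print(primary_page(group, {}))  # the index.html page
```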
43. Accuracy/coverage tradeoff for full text after URL-grouping heuristics
44. Conclusion: recognizing classes
- Hypertext provides redundant information.
- We can classify using several methods:
  - Full text
  - Heading/title
  - Hyperlinks
  - Text in neighboring pages
  - Grouping pages
- No single method is good enough.
- Combining predictions (classification methods) gives better results.
45. Learning to Recognize Relation Instances
- Assume relations are represented by hyperlinks.
- Given the following background relations:
  - class(Page)
  - link_to(Hyperlink, P1, P2)
  - has_word(H): the word is part of the hyperlink
  - all_words_capitalized(H)
  - has_alphanumeric_word(H): "I teach CS2765"
  - has_neighborhood_word(H): the word appears in the paragraph around the hyperlink
46. Learning to Recognize Relation Instances
- Try to learn the following:
  - members_of_project(P1, P2)
  - instructors_of_course(P1, P2)
  - department_of_person(P1, P2)
47. Learned relations
- instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
  - Test set: 133 Pos, 5 Neg
- department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
  - Test set: 371 Pos, 4 Neg
- members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).
  - Test set: 18 Pos, 0 Neg
48. Accuracy/coverage tradeoff for learned relation rules
49. Learning to Extract Text Fields
- Sometimes we want a small fragment of text, not the whole web page or class (names like Jon, Peter, etc.).
- "Make me hotel and flight arrangements for the upcoming Linux conference."
50. Predefined predicates
- Let F = w1, w2, …, wj be a fragment of text.
- length(Relop, N), Relop ∈ {<, >, =}
- some(Var, Path, Feat, Value), e.g. some(A, next_token, numeric, true)
- position(Var, From, Relop, N)
- relpos(Var1, Var2, Relop, N)
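A toy sketch of evaluating predicates like these over a token fragment. The feature names and simplifications (no paths, only two features) are ours, not the paper's:

```python
def length_lt(fragment, n):
    # length(<, N): the fragment has fewer than N tokens
    return len(fragment) < n

def some(fragment, feature, value):
    # simplified some(Var, _, Feat, Value): some token in the
    # fragment has the given feature value
    checks = {"numeric": lambda t: t.isdigit(),
              "capitalized": lambda t: t[:1].isupper()}
    return any(checks[feature](t) is value for t in fragment)

fragment = ["Bruce", "Randall", "Donald"]
ok = length_lt(fragment, 4) and some(fragment, "capitalized", True)
```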
51. A wrong example

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title> Bruce Randall Donald </title>
<h1> <img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"> <p> Bruce Randall Donald<br> Associate Professor<br>

- ownername(Fragment) :-
  - some(A, prev_token, word, "gmt"),
  - some(A, [], in_title, true),
  - some(A, [], word, unknown),
  - some(A, [], quadrupletonp, false),
  - length(<, 3)
52. Accuracy/coverage tradeoff for name extraction
53. Conclusions
- Used machine-learning algorithms to create information-extraction methods for each desired type of knowledge.
- WebKB achieves 70% accuracy at 30% coverage.
- Bag-of-words (hyperlinks, web pages and full text) and first-order learning can be combined to boost confidence.
- First-order learning can be used to look outward from the page and consider its neighbors.
54. Problems
- Not as accurate as we want.
- You can get more accuracy at the cost of coverage.
- Use linguistic features (verbs).
- Add new methods to the booster (e.g. predict the department of a professor based on the departments of his students/advisees).
- A per-topic, per-language, per-method solution; needs hand-made labeling to learn.
- Learners with high accuracy can be used to teach learners with low accuracy.