Title: Structure Based Information Extraction (SBIE)
1Structure Based Information Extraction (SBIE)
Hua Lei 06/09/2004
2The disadvantages of BYU tool
- Define ontologies, lexicons and data patterns for each domain.
- Define and update ontologies, lexicons and data patterns manually.
- Results rely heavily on lexicons and data patterns.
4HTML code of a data group
<TD vAlign=top width="33">
<A class=prodtitle href="http://www.kmart.com/product/index.jsp?productId=1789425&cp=784867.784872.784574&parentPage=family">
<CENTER><IMG src="kmart_files/p1472053th.gif" border=0></CENTER><BR>
Sharp LC-20B4U-S 20-Inch Flat LCD Television in Silver</A><BR><BR>
<SPAN class=listprice>List Price 1299.99 <BR></SPAN>
<SPAN class=ourprice><B>Our Price 1199.99</B></SPAN>
</TD>
Phenomenon: all data groups in the same web page have the same structure.
Idea: extract data from web pages based on the data group structure.
5Method of SBIE
Step 1. Choose a data group with a typical data structure as the initial one.
6Method of SBIE
Step 2. Analyze the HTML code of the web page and find the structure of the data group.
- Recognize the annotations, data patterns and relative positions of each data item.
- The algorithm integrates all this structure information to obtain the structure of the data group.
- Validate the data group structure by recognizing other data groups based on it.
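One way to picture Step 2 is to reduce a data group to a "signature" of its tags and class attributes, then check that other data groups produce the same signature. The sketch below is only an illustration under that assumption: the slides do not specify the actual algorithm, and the names StructureRecorder and group_signature, as well as the shortened sample fragments, are hypothetical.

```python
from html.parser import HTMLParser


class StructureRecorder(HTMLParser):
    """Records each start tag, with its class attribute if present."""

    def __init__(self):
        super().__init__()
        self.signature = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        cls = dict(attrs).get("class")
        self.signature.append(f"{tag}.{cls}" if cls else tag)


def group_signature(fragment):
    """Return the tag/class sequence of one data group (hypothetical helper)."""
    recorder = StructureRecorder()
    recorder.feed(fragment)
    recorder.close()
    return tuple(recorder.signature)


# Initial data group (Step 1), shortened from the Kmart sample.
initial = (
    '<TD vAlign=top width="33"><A class=prodtitle href="index.jsp">'
    '<CENTER><IMG src="a.gif" border=0></CENTER><BR>Sharp LC-20B4U-S</A>'
    '<BR><BR><SPAN class=listprice>List Price 1299.99 <BR></SPAN>'
    '<SPAN class=ourprice><B>Our Price 1199.99</B></SPAN></TD>'
)
# A second, made-up data group with the same layout but different data.
other = (
    '<TD vAlign=top width="33"><A class=prodtitle href="other.jsp">'
    '<CENTER><IMG src="b.gif" border=0></CENTER><BR>Some other product</A>'
    '<BR><BR><SPAN class=listprice>List Price 499.99 <BR></SPAN>'
    '<SPAN class=ourprice><B>Our Price 449.99</B></SPAN></TD>'
)

# Validation: the structure found in the initial group recognizes the other.
print(group_signature(initial) == group_signature(other))
```

The signature ignores the text and attribute values that differ between products, so only the layout is compared.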
7
<TD vAlign=top width="33">
<A class=prodtitle href="http://www.kmart.com/product/index.jsp?productId=1789425&cp=784867.784872.784574&parentPage=family">
<CENTER><IMG src="kmart_files/p1472053th.gif" border=0></CENTER><BR>
Sharp LC-20B4U-S 20-Inch Flat LCD Television in Silver</A><BR><BR>
<SPAN class=listprice>List Price 1299.99 <BR></SPAN>
<SPAN class=ourprice><B>Our Price 1199.99</B></SPAN>
</TD>
8Method of SBIE
Step 3. Use the result of step 2 to extract the other data groups in that web page.
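As a rough illustration of Step 3, suppose Step 2 found that each data group is a TD cell whose fields are marked by the classes prodtitle, listprice and ourprice. A minimal sketch of the extraction could then look like the following; GroupExtractor, extract_groups and the second product in the sample page are hypothetical, and the code assumes that a field element does not contain a nested element with the same tag name.

```python
from html.parser import HTMLParser


class GroupExtractor(HTMLParser):
    """Collects the text of known field classes, one record per group tag."""

    def __init__(self, group_tag, field_classes):
        super().__init__()
        self.group_tag = group_tag
        self.field_classes = set(field_classes)
        self.records = []
        self._open = []  # stack of (tag, field) for open field elements

    def handle_starttag(self, tag, attrs):
        if tag == self.group_tag:
            self.records.append({})  # a new data group starts here
        cls = dict(attrs).get("class")
        if cls in self.field_classes:
            self._open.append((tag, cls))

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._open and self.records:
            field = self._open[-1][1]
            record = self.records[-1]
            record[field] = (record.get(field, "") + " " + text).strip()


def extract_groups(page, group_tag, field_classes):
    """Extract one dict per data group from the page (hypothetical helper)."""
    extractor = GroupExtractor(group_tag, field_classes)
    extractor.feed(page)
    extractor.close()
    return extractor.records


# Sample page: the first cell follows the Kmart sample, the second is made up.
page = (
    "<TABLE><TR>"
    '<TD><A class=prodtitle href="a.jsp"><CENTER><IMG src="a.gif"></CENTER>'
    "<BR>Sharp LC-20B4U-S 20-Inch Flat LCD Television in Silver</A><BR><BR>"
    "<SPAN class=listprice>List Price 1299.99 <BR></SPAN>"
    "<SPAN class=ourprice><B>Our Price 1199.99</B></SPAN></TD>"
    '<TD><A class=prodtitle href="b.jsp"><CENTER><IMG src="b.gif"></CENTER>'
    "<BR>Example 15-Inch LCD Television</A><BR><BR>"
    "<SPAN class=listprice>List Price 499.99 <BR></SPAN>"
    "<SPAN class=ourprice><B>Our Price 449.99</B></SPAN></TD>"
    "</TR></TABLE>"
)

records = extract_groups(page, "td", ["prodtitle", "listprice", "ourprice"])
for record in records:
    print(record)
```

Each record keeps the annotation ("List Price", "Our Price") together with its value, so a later step could split them with a data pattern.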
9Method of SBIE
11Method of SBIE
- Machine Learning
- A machine learning technique will be combined with the structure-recognizing algorithm.
- A database of data group structures, lexicons and data patterns will be created or updated after each extraction.
- The machine learning tool can analyze a new data group structure based on the structure information in the database.
- When SBIE is well trained, it could analyze the data group structures in the HTML code and extract data without the initial data group (step 1).
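The slides do not say which machine learning technique or database is intended. As a very rough stand-in, the sketch below uses a plain frequency table of previously seen data group structures and picks, among the candidate structures found in a new page, the one seen most often before. The class StructureStore, its methods and the sample structures are all hypothetical, and a counter is not a learning algorithm; it only illustrates the "update after each extraction, reuse later" idea.

```python
from collections import Counter


class StructureStore:
    """In-memory stand-in for the database of data group structures."""

    def __init__(self):
        self.counts = Counter()

    def record(self, signature):
        # Called after each successful extraction.
        self.counts[tuple(signature)] += 1

    def best_match(self, candidates):
        """Return the candidate seen most often before, or None if all are new."""
        known = [tuple(c) for c in candidates if tuple(c) in self.counts]
        if not known:
            return None
        return max(known, key=lambda sig: self.counts[sig])


# Made-up structures, written as sequences of tags and classes.
product_cell = ("td", "a.prodtitle", "span.listprice", "span.ourprice")
nav_cell = ("td", "a.navlink")
unseen_cell = ("td", "img")

store = StructureStore()
store.record(product_cell)
store.record(product_cell)
store.record(nav_cell)

# On a new page, no initial data group is supplied; the store suggests one.
print(store.best_match([nav_cell, unseen_cell, product_cell]))
print(store.best_match([unseen_cell]))
```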
12Conclusion
SBIE
- Extend the BYU tool.
- Extract information without defining ontologies.
- Create and update data patterns and lexicons automatically.
- Extract data based on their relative positions.
- A machine learning technique makes it smart.
13Questions