Title: Some studies on Vietnamese multi-document summarization and semantic relation extraction
1Some studies on Vietnamese multi-document
summarization and semantic relation extraction
- Laboratory of Data Mining Knowledge Science
2Content
- Vietnamese multi-document summarization
- Vietnamese VNSEN search engine
- Clustering
- Semantic similarity
- Multi-document summarization
- Semantic relation extraction
- Vietnamese medical ontology
- Object relation extraction
- Cause-and-effect relations
- Vietnamese entity search engine
3Vietnamese multi-document summarization
- Vietnamese VNSEN search engine
- Based on NUTCH
- Integrated Vietnamese word segmentation tool
- JvnSegmenter
- Indexed 500,000 pages from vi.wikipedia.org
4Vietnamese multi-document summarization
- Clustering
- Clustering integrated into the VNSEN search engine
- Using snippet results from the VNSEN search engine
- Hierarchical Agglomerative Clustering (HAC) algorithm
- Evaluation against clustering on the Vivisimo search engine
- Cluster labeling
- Compactness of clusters
- Isolation of clusters
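A minimal sketch of agglomerative clustering over search snippets, to make the HAC step and the two quality notions concrete. The slides do not state the linkage criterion, the similarity measure, or how compactness and isolation are computed, so average linkage, cosine similarity over word counts, the 0.2 stopping threshold, and both quality formulas below are assumptions, as are the toy snippets.

```python
import math
from collections import Counter


def bag_of_words(snippet):
    """Lowercased token counts; assumes the snippet is already word-segmented."""
    return Counter(snippet.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    return sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2) / (len(c1) * len(c2))


def hac(vectors, threshold=0.2):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sim = average_link(clusters[x], clusters[y], vectors)
                if sim > best:
                    best, pair = sim, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def compactness(cluster, vectors):
    """Assumed definition: mean pairwise similarity inside one cluster."""
    pairs = [(i, j) for k, i in enumerate(cluster) for j in cluster[k + 1:]]
    if not pairs:
        return 1.0
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def isolation(cluster, others, vectors):
    """Assumed definition: 1 minus the highest average-link similarity to any other cluster."""
    if not others:
        return 1.0
    return 1.0 - max(average_link(cluster, other, vectors) for other in others)


# Toy snippets (illustrative only, not VNSEN output).
snippets = [
    "ha noi weather today",
    "weather forecast ha noi",
    "football match result",
    "football result tonight",
]
vectors = [bag_of_words(s) for s in snippets]
clusters = hac(vectors)
print(clusters)  # [[0, 1], [2, 3]]
print(compactness(clusters[0], vectors), isolation(clusters[0], clusters[1:], vectors))
```

The quadratic search for the best pair is acceptable for a page of snippets; a larger result set would call for a cached similarity matrix.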
5Vietnamese multi-document summarization
- Implementation of semantic similarity measures
- Semantic similarity between words based on a semantic network
- Path length (PL)
- Information content (IC)
- Semantic similarity between sentences based on topic analysis
- Word order similarity between sentences
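Two of the measures listed above can be sketched in a few lines. The slides do not give formulas, so this uses common formulations as assumptions: path-length similarity as 1 / (1 + shortest path length) in the semantic network, and word order similarity as 1 - ||r1 - r2|| / ||r1 + r2||, where r1 and r2 hold each sentence's word positions over the joint word set. The toy network and sentences are hypothetical, and the information content measure is not shown because it needs corpus frequencies.

```python
import math
from collections import deque


def path_length_similarity(graph, a, b):
    """1 / (1 + shortest path length) between two nodes; 0.0 if unconnected."""
    if a == b:
        return 1.0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == b:
                return 1.0 / (1 + dist + 1)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0


def word_order_similarity(s1, s2):
    """Compare word positions of two sentences over their joint word set."""
    w1, w2 = s1.split(), s2.split()
    joint = list(dict.fromkeys(w1 + w2))  # distinct words, first-seen order
    # Position (1-based) of each joint word in the sentence, 0 if absent.
    r1 = [w1.index(w) + 1 if w in w1 else 0 for w in joint]
    r2 = [w2.index(w) + 1 if w in w2 else 0 for w in joint]
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r1, r2)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r1, r2)))
    return 1.0 - diff / total if total else 1.0


# Hypothetical undirected semantic network as an adjacency list.
network = {
    "flu": ["disease"],
    "measles": ["disease"],
    "disease": ["flu", "measles"],
}
print(path_length_similarity(network, "flu", "measles"))  # two edges apart
print(word_order_similarity("dog bites man", "man bites dog"))
```

The word order score is 1.0 for identical ordering and drops as shared words move apart, which lets it separate sentences that a bag-of-words measure would treat as equal.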
6Vietnamese multi-document summarization
- Building Vietnamese semantic corpus
- Hidden topic corpus
- Using Latent Dirichlet Allocation (LDA) model
- Using the JGibbsLDA tool to analyze topics
- Vietnamese Wikipedia corpus
- Using category graph model
- Result
- Hidden topic corpora with 120/150/200 topics, based on the Vnexpress/Wikipedia data sets
- Category graph with 14,000 category nodes and 200,000 articles
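To illustrate what hidden topic analysis produces, here is a small collapsed Gibbs sampler for LDA in plain Python. It is a sketch of the general technique, not the JGibbsLDA implementation or its interface; the function name, the hyperparameter defaults, and the toy documents are assumptions, and real corpora need far more data and iterations.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic distributions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # words assigned to topic k overall
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed topic distribution for each document.
    return [
        [(doc_topic[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy, already-segmented documents (illustrative only).
docs = [
    ["fever", "cough", "flu", "fever"],
    ["flu", "vaccine", "cough"],
    ["match", "goal", "team"],
    ["team", "goal", "coach", "match"],
]
theta = lda_gibbs(docs, num_topics=2, iterations=100)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of the result is a topic distribution for one document; comparing two such rows (for example with cosine similarity) is one way to obtain a topic-based similarity between texts.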
7Vietnamese multi-document summarization
- Multi-document summarization
- Maximal Marginal Relevance (MMR) method
- Improved with semantic similarity measures based on hidden topic analysis
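The MMR selection loop can be sketched as follows. Each step picks the sentence that maximizes lam * relevance to the query minus (1 - lam) * the highest similarity to a sentence already selected. The similarity function is a parameter, so a topic-based measure could be plugged in; the Jaccard word overlap used here is only a stand-in, and the function names, the lam value, and the example sentences are assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for a semantic similarity measure."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(sentences, query, k, similarity=jaccard, lam=0.7):
    """Greedily pick k sentences that are relevant to the query but not redundant."""
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = similarity(sentences[i], query)
            redundancy = max(
                (similarity(sentences[i], sentences[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Toy input (illustrative only).
sentences = [
    "flu causes fever and cough",
    "flu causes fever and cough too",
    "vaccines prevent flu",
]
print(mmr_select(sentences, "flu fever", k=2))
# ['flu causes fever and cough', 'vaccines prevent flu']
```

The near-duplicate second sentence is skipped even though it is more relevant to the query than the third, which is the behavior that makes MMR suitable for multi-document input.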
8Vietnamese multi-document summarization
- Multi-document summarization for a simple Vietnamese medical QA system
- Semantic similarity measures based on the Vietnamese Wikipedia corpus
- Medical ontology
- Hidden topic analysis
- Clustering
10Vietnamese multi-document summarization
- Table-of-Contents generation
- Using text segmentation and title generation techniques to generate a table of contents automatically
11Vietnamese multi-document summarization
- Some of our Vietnamese language processing utilities
- Nguyen Cam Tu, Phan Xuan Hieu. JvnSegmenter: A Java-based Vietnamese Word Segmentation tool
- Nguyen Cam Tu. JVnTextpro: A Java-based Vietnamese Text Processing Toolkit
- Nguyen Cam Tu. JGibbsLDA: A Java and Gibbs Sampling based Implementation of Latent Dirichlet Allocation (LDA)
- http://203.113.130.205:8080/sise VNSEN Search Engine (Implementers: Nguyen Thu Trang, Nguyen Cam Tu, Nguyen Viet Cuong, Tran Mai Vu, Nguyen Minh Tuan, etc.)
12Semantic Relation Extraction
- Vietnamese Medical Ontology
- 23 entity classes
- 14 relations
- 200 entities
- Techniques to improve the ontology
- Named Entity Recognition
- Relation extraction
14Semantic Relation Extraction
- Object relation extraction
- Product domain
- Medical domain
- Technique
- Using wrapper techniques for structured data (HTML/XML/tables)
- NLP for unstructured data (text)
- HMM Model
- CRF Model
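For the structured-data side, a wrapper can be as simple as a parser that reads attribute/value rows out of an HTML table. The sketch below uses Python's standard html.parser module; the class name, the two-column row layout, and the sample product page are assumptions for illustration and do not describe the wrappers used in the project. The HMM and CRF models for unstructured text are not shown.

```python
from html.parser import HTMLParser


class TableWrapper(HTMLParser):
    """Collects the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def extract_attributes(html):
    """Turn two-column rows into an attribute -> value mapping."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


# Hypothetical product page fragment.
page = (
    "<table>"
    "<tr><th>Name</th><td>Phone X</td></tr>"
    "<tr><th>Price</th><td>200 USD</td></tr>"
    "</table>"
)
print(extract_attributes(page))  # {'Name': 'Phone X', 'Price': '200 USD'}
```

A wrapper like this depends on the page layout staying fixed, which is why free text needs the statistical models listed above instead.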
15Semantic Relation Extraction
- Cause-and-effect relations
- Using the research results of Corina Roxana Girju to investigate some cause-and-effect relations, such as
- Adverbial causal link
- Preposition causal link
- Subordination causal link
- Clause integrated link
- [Rox08] Corina Roxana Girju (2008). Semantic Relation Extraction and its Applications, Invited tutorial at the European Summer School in Logic, Language and Information (ESSLLI 2008), Hamburg, Germany, August 2008.
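The four link types can be illustrated with a small pattern-based extractor. This is only a sketch: the cue phrases are English examples chosen for illustration, not the Vietnamese patterns used in the work, and the pattern table, function name, and sample sentences are assumptions. A real system would need a larger cue inventory and disambiguation, since many cue words are not always causal.

```python
import re

# (link type, pattern) pairs; tried in order, so "because of" is tested before "because".
CUES = [
    ("prepositional", r"(?P<effect>.+?)\s+(?:because of|due to)\s+(?P<cause>.+)"),
    ("subordination", r"(?P<effect>.+?)\s+because\s+(?P<cause>.+)"),
    ("adverbial", r"(?P<cause>.+?)[,;]\s+(?:therefore|hence)\s+(?P<effect>.+)"),
    ("clause-integrated", r"(?P<cause>.+?)[.;,]\s+that is why\s+(?P<effect>.+)"),
]


def extract_causal(sentence):
    """Return the link type, cause, and effect for the first matching cue, else None."""
    text = sentence.strip().rstrip(".")
    for link, pattern in CUES:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            return {
                "link": link,
                "cause": match.group("cause").strip(),
                "effect": match.group("effect").strip(),
            }
    return None


print(extract_causal("The patient has a fever because of an infection."))
print(extract_causal("He smoked for years, therefore his lungs are damaged."))
print(extract_causal("It is sunny."))  # None
```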
16Semantic Relation Extraction
- Vietnamese entity search engine in the medical and health care domain
- Using the medical ontology, object relation extraction, and cause-and-effect relation extraction
- In association with the UIUC-DBIS Lab (University of Illinois at Urbana-Champaign)
- Object Search
- Query Log Mining
- Object Extraction
- [Cha08] Kevin C. Chang (2008). Data-Aware Search on the Web, Act. 2 Entity Search, Technical Report, University of Illinois at Urbana-Champaign (a talk at the College of Technology, Vietnam National University, Hanoi, July 08, 2008).
17Some articles in 2008
- [LNH08] Dieu-Thu Le, Cam-Tu Nguyen, Quang-Thuy Ha, Xuan-Hieu Phan, and Susumu Horiguchi (2008). Matching and Ranking with Hidden Topics towards Online Contextual Advertising, The 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-08), University of Technology, Sydney, Australia, December 9-12, 2008 (accepted).
- [PNL08] Xuan-Hieu Phan, Cam-Tu Nguyen, Dieu-Thu Le, Le-Minh Nguyen, Susumu Horiguchi, and Quang-Thuy Ha (2008). Classification and Contextual Match on the Web with Hidden Topics from Large Data Collections, IEEE Transactions on Knowledge and Data Engineering (submitted).
- [VUH08] Tran Mai Vu, Pham Thi Thu Uyen, Hoang Minh Hien, Ha Quang Thuy (2008). Semantic similarity of sentences and its application to multi-document summarization for evaluating the clustering component of a Vietnamese search engine, Workshop on Information Communication Technology (ICTFIT08), College of Science, Vietnam National University, Ho Chi Minh City, November 14, 2008 (in Vietnamese, accepted).
18THANK YOU