Title: Min Song, Ph.D.
1IS698 Web Mining
- Min Song, Ph.D.
- Course Web Page
- http://web.njit.edu/song/courses/web_mining/is698_webmining_syllabus.html
- Moodle
2Course structure
- The course has two parts:
- Lectures: introduction to the main topics
- One project (done either individually or in a group)
- 1 research project
- Lecture slides will be made available on the
course web page and on Moodle.
3Grading
- Class Participation: 10%
- Assignments: 20%
- Midterm: 25%
- Projects: 45%
4Prerequisites
- Knowledge/Experience of
- Java programming
5Teaching materials
- Required Text
- Web Data Mining: Exploring Hyperlinks, Contents,
and Usage Data, by Bing Liu, Springer, ISBN
3-540-37881-2.
- References
- Data Mining: Concepts and Techniques, by Jiawei
Han and Micheline Kamber, Morgan Kaufmann, ISBN
1-55860-489-8.
- Principles of Data Mining, by David Hand, Heikki
Mannila, and Padhraic Smyth, The MIT Press, ISBN
0-262-08290-X.
- Introduction to Data Mining, by Pang-Ning Tan,
Michael Steinbach, and Vipin Kumar,
Pearson/Addison Wesley, ISBN 0-321-32136-7.
- Machine Learning, by Tom M. Mitchell,
McGraw-Hill, ISBN 0-07-042807-7.
6Topics
- Introduction
- Data pre-processing
- Association rules and sequential patterns
- Classification (supervised learning)
- Clustering (unsupervised learning)
- Post-processing of data mining results
- Question Answering
- Full-Text mining
- Partially (semi-) supervised learning
- Opinion mining and summarization
- Link analysis
7Feedback and suggestions
- Your feedback and suggestions are most welcome!
- I need it to adapt the course to your needs.
- Let me know if you find any errors in the
textbook.
- Share your questions and concerns with the class;
very likely others have the same ones.
- No pain, no gain
- The more you put in, the more you get
- Your grades are proportional to your efforts.
8Rules and Policies
- Statute of limitations: No grading questions or
complaints, no matter how justified, will be
listened to more than one week after the item in
question has been returned.
- Cheating: Cheating will not be tolerated. All
work you submit must be entirely your own. Any
suspicious similarities between students' work
will be recorded and brought to the attention of
the Dean. The MINIMUM penalty for any student
found cheating will be a 0 for the item in
question and a one-letter drop in the final
course grade. The MAXIMUM penalty will be
expulsion from the University.
- Late assignments: Late assignments will not, in
general, be accepted. They will never be accepted
if the student has not made special arrangements
with me at least one day before the assignment is
due. If a late assignment is accepted, it is
subject to a reduction in score as a late
penalty.
9Web mining Examples
- Link analysis
- How does Google work?
- How to find communities on the Web?
- Structured data extraction
- Web information integration
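As a rough illustration of the link analysis question above ("How does Google work?"), the sketch below runs a simple PageRank power iteration in Java. The slide does not name an algorithm; PageRank is assumed here because it is the best-known link analysis method associated with Google. The class name, the four-page toy graph, the damping factor of 0.85, and the iteration count are all illustrative assumptions, not course material.

```java
import java.util.Arrays;

class PageRankSketch {
    /**
     * links[i] lists the pages that page i links to.
     * Returns one score per page; the scores sum to 1.
     */
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // Every page receives a small share from random jumps.
            Arrays.fill(next, (1.0 - damping) / n);
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Page without out-links: spread its rank over all pages.
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[i] / n;
                    }
                } else {
                    // Split the page's rank evenly among the pages it links to.
                    for (int j : links[i]) {
                        next[j] += damping * rank[i] / links[i].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical toy web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2
        int[][] links = { {1, 2}, {2}, {0}, {2} };
        double[] rank = pageRank(links, 0.85, 50);
        for (int i = 0; i < rank.length; i++) {
            System.out.printf("page %d: %.4f%n", i, rank[i]);
        }
    }
}
```

In this toy graph page 2 collects links from three other pages and ends up with the highest score, while page 3, which nobody links to, keeps only the random-jump share.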
10Example Web data extraction
[Figure: example web page annotated with "Data region1", two instances of "A data record", and "Data region2"]
11Align and extract data items (e.g., region1)
image1 | EN7410         | 17-inch LCD Monitor Black/Dark charcoal | 299.99 |                                                | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image2 |                | 17-inch LCD Monitor                     | 249.99 |                                                | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image3 | AL1714         | 17-inch LCD Monitor, Black              | 269.99 |                                                | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
image4 | SyncMaster 712n | 17-inch LCD Monitor, Black             | 299.99 | Was 369.99; Save 70 After 70 mail-in-rebate(s) | Add to Cart (Delivery / Pick-Up) | Penny Shopping | Compare
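The idea of splitting each data record into aligned items can be sketched in Java with a couple of hand-written rules. This is only an illustration, not the extraction method taught in the course or the textbook: the class name, the price pattern, and the use of "Add to Cart" as a marker for the action links are assumptions chosen to fit the records on this slide.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordAligner {
    // Assumed price shape: digits, a dot, two digits (e.g. 299.99).
    private static final Pattern PRICE = Pattern.compile("\\d+\\.\\d{2}");
    // Assumed marker for where the action links begin.
    private static final String ACTIONS_MARKER = "Add to Cart";

    /** Splits one flattened record into {image, description, price, actions}. */
    static String[] align(String record) {
        // The first token is the image placeholder; the rest is the record text.
        String[] parts = record.trim().split("\\s+", 2);
        String image = parts[0];
        String rest = parts.length > 1 ? parts[1] : "";

        // Everything from the marker onward is treated as action links.
        int actionsStart = rest.indexOf(ACTIONS_MARKER);
        String body = actionsStart >= 0 ? rest.substring(0, actionsStart) : rest;
        String actions = actionsStart >= 0 ? rest.substring(actionsStart).trim() : "";

        // The text before the first price-like token is the description.
        Matcher m = PRICE.matcher(body);
        String description = body.trim();
        String price = "";
        if (m.find()) {
            description = body.substring(0, m.start()).trim();
            price = m.group();
        }
        return new String[] { image, description, price, actions };
    }

    public static void main(String[] args) {
        String[] records = {
            "image1 EN7410 17-inch LCD Monitor Black/Dark charcoal 299.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
            "image2 17-inch LCD Monitor 249.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare",
            "image3 AL1714 17-inch LCD Monitor, Black 269.99 Add to Cart (Delivery / Pick-Up ) Penny Shopping Compare"
        };
        for (String record : records) {
            System.out.println(String.join(" | ", align(record)));
        }
    }
}
```

Fixed rules like these break quickly: a record with a "Was ... Save ..." price block, such as the fourth one on this slide, would need extra handling, which is one reason automatic alignment across records is studied as a web mining problem.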
12Resources
- ACM SIGKDD
- Data mining related conferences
- Data mining: KDD, ICDM, SDM, ...
- Databases: SIGMOD, VLDB, ICDE, ...
- AI: AAAI, IJCAI, ICML, ACL, ...
- Web: WWW, ...
- Information retrieval: SIGIR, CIKM, ...
- KDnuggets: http://www.kdnuggets.com/
- News and resources. You can sign up!
- Our text and reference books
13What is web mining?
- The process of discovering knowledge from web
page content, hyperlink structure, and usage data
- Builds on existing data and text mining
techniques, but adds many new tasks and
algorithms
- Three types, based on sources of data (often
combined in practice)
- Web structure mining
- Web content mining
- Web usage mining
14Importance of web data mining
- The web is unique!
- Amount of information is huge and still growing,
on almost any topic, and changes continuously
- No single editorial control: significant
variations in quality, much duplication, and
widely varying data formats
- Significant information is linked (within and
between web sites)
- The Web reflects a virtual society: interactions
among people, organizations, and automated
systems, no longer limited by geography
- The Web presents challenges and opportunities for
mining
15How to make best use of data?
- Knowledge discovered from web data can be used
for competitive advantage.
- Online retailers (e.g., amazon.com) are largely
driven by data mining.
- Web search engines are based on information
retrieval (text mining), data mining, and NLP.
- Web surfers/searchers need tools to find,
recommend, organize, and extract useful
information from the Web
16Semester Research Project
- Individual, or groups of two (will grade each
other)
- Plus formal and informal feedback from instructor
- Should be the beginning of what could be a
publishable project.
- On some aspect of web mining
- Topic will be given by instructor or proposed by
student and approved by instructor
- Students present
- Ideas early in the semester for feedback
- Completed project at the end of the semester
- Write a scientific paper at the end.
- Publish as a technical report, if not more (some
have been published at AMIS and some are under
review)
17Project Biomedical Fulltext Mining
- Input data for Web Mining (particularly web
content mining) consists of document surrogates,
short web pages, email messages, etc.
- Fulltext data (books and online articles) has
become publicly available.
- Currently fulltext mining is not well studied.
- Study fulltext mining in the context of
Biomedical research problems.
18BioFulltextMiner
19Required Software
- Java (jdk1.6.0 or above)
- Tomcat 6
- Apache-ant-1.7.1
- Eclipse 3.4
- BioFulltextMiner.zip (http://base.njit.edu/vline/BioFullTextMiner.zip)