Title: J. Bengel, S. Gauch, E. Mittur, R. Vijayaraghavan
1 Chat Room Topic Detection Using Classification
- J. Bengel, S. Gauch, E. Mittur, R. Vijayaraghavan
- Presented by Dr. Susan Gauch
2 Outline
- Motivation, Goals
- Related Work
- ChatTrack Architecture
- Scenarios
- Conclusion
- Future Work
3 Motivation
- Popularity of chat rooms and IM is soaring for both adults and teenagers
- Consumers: entertainment value, meeting new people globally
- Corporations: business contacts, virtual meetings
- 2005: IM will surpass e-mail for primary electronic communications (Gartner Group)
- Difficult to retain and track conversations
- Criminals embedded in chat culture remain hidden
- Disguise identity
- Lure children
- Commit corporate/homeland espionage
4 Motivation (2)
- Chat provides a new tool for collaboration between criminals
- Recruit new terrorists and sympathizers
- Disseminate criminal plots
- Detection of illegal activities may enhance crime prevention
- Human monitors could monitor chat discussions
- Invasive (privacy concerns)
- Human error
- Expensive!
5 Goals
- Assist in crime detection by producing a concept-based profile summarizing chat data
- over entire public chat sessions
- over individuals' 1:1 chat
- to analyze silent listeners' interests
- Chat Profile allows focus on domains of criminal interest, national security, protection of children
- Examples:
- science/microbiology (anthrax)
- music/instruments/violins (Stradivarius heist)
6 Related Work
- Topic identification: a longstanding problem
- Chat is structured differently than normal writing
- Butterfly: samples 30 seconds of chat data to determine chat room topics and search for topics of interest [Van Dyke et al.]
- Wayne, C. L.: text classification for discovering, threading, and retrieving from data streams
- Bingham, E., et al.: topic ID (from news sources) using complexity pursuit
7 Related Work (2)
- PeopleGarden: creates data portraits that represent interactions between users [Xiong, R., et al.]
- IBM WebFountain: assembles bodies of knowledge from various sources (web, chat rooms, email)
- Elnahrawy, E.: evaluated several text classification techniques on chat data across four categories
- Our work complements these systems
- combination → powerful analysis
8 Architecture
[Architecture diagram. Components shown: Client, Internet, IRC Client (x2), Chat Server (with ChatLog), Chat Data Archive (XML/SQL), Concept Database, Classifier, Indexer, ChatProfile, ChatRetrieve, Administrator / Intelligence Agent]
9 Chat Archival
- IRC bot and IRC server augmented with the ChatLog library
- Bot collects public chat data
- Server collects both public and private data
- All activity recorded (e.g., joins, parts, messages, nickname changes)
10 Chat Archival (2)
[Sample XML chat log; the markup was lost in extraction. Surviving field values:]
2004-04-17 08:57:50 jmb jayhawk
2004-04-17 08:58:14 Jason jayhawk
There is going to be weather, whether or not. Uh oh, I'll be RIGHT back!
2004-04-17 08:58:19 jmb
2004-04-17 08:59:46 Aliceer jayhawk
Poof..Left in a hurry! Must be a tornado outside his door or something. lol
- XML chat data → SQL database
- ChatLog library XML schema can be used for almost any client/server-based system
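The XML-to-SQL step might look roughly like the sketch below. The element names, attribute names, table layout, and sample events are all hypothetical, invented for illustration; they are not the actual ChatLog schema, which is not shown on the slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical log format and invented sample data (NOT the real ChatLog schema).
SAMPLE_XML = """
<chatlog>
  <event type="join" date="2004-04-17" time="08:57:50" nick="alice" channel="lobby"/>
  <event type="message" date="2004-04-17" time="08:58:14" nick="bob" channel="lobby">back in a minute</event>
  <event type="part" date="2004-04-17" time="08:58:19" nick="alice" channel="lobby"/>
</chatlog>
"""

def load_events(xml_text, conn):
    """Parse the XML log and insert one row per event; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "ts TEXT, nick TEXT, channel TEXT, type TEXT, body TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    rows = [
        (f"{e.get('date')} {e.get('time')}", e.get("nick"), e.get("channel"),
         e.get("type"), e.text or "")
        for e in root.iter("event")
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
count = load_events(SAMPLE_XML, conn)
messages = conn.execute(
    "SELECT nick, body FROM events WHERE type = 'message' ORDER BY ts"
).fetchall()
```

Storing joins and parts alongside messages in one table is one way to keep all recorded activity queryable, including participants who never speak.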
11 Chat Profile
- Text classification to create a profile of chat data
- Classification based on the vector space model
- Training
- Use pre-defined concepts and example text from the ODP (Open Directory Project)
- Each category: vector of representative keywords and weights (tf × idf)
- Trained on 1,565 concepts from the top 3 levels of the ODP
- Selection based on empirical studies
- Additional self-trained categories can be supplied
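As a minimal sketch of the training step, each concept can be turned into a keyword vector weighted by tf × idf. The tokenizer, the idf formula (log of concept count over document frequency), and the tiny training texts below are assumptions for illustration; the slides do not specify ChatProfile's exact weighting or its ODP training data.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic words only."""
    return [w for w in text.lower().split() if w.isalpha()]

def train_concepts(training_docs):
    """training_docs: {concept: example text}. Returns {concept: {term: tf * idf}}."""
    term_counts = {c: Counter(tokenize(t)) for c, t in training_docs.items()}
    n = len(term_counts)
    # Document frequency: in how many concepts does each term appear?
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())
    return {
        concept: {term: tf * math.log(n / df[term]) for term, tf in counts.items()}
        for concept, counts in term_counts.items()
    }

# Invented example text; real training text would come from the ODP.
training = {
    "science/microbiology": "bacteria spores anthrax culture bacteria",
    "music/instruments/violins": "violin strings orchestra culture",
}
vectors = train_concepts(training)
```

With this weighting, a term that occurs in every concept (here "culture") gets weight zero, so only terms that discriminate between concepts contribute to a match.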
12 Chat Profile (2)
- User profile focuses on one chat participant
- Session profile filters by chat room name only
- Analyst selects criteria of interest
13 Chat Profile (3)
- Chat data collected from archive (stop word removal and Porter stemming)
- Classification performed once chat utterances are collected
- Classifier creates a vector of keywords from the chat data
- Similarity measure
- Determines similarity between the chat data vector and the vectors for each trained concept
- Concepts sorted and top matches returned
- Represented in hierarchical fashion
- Asterisks represent relative concept weights at each level
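The classification steps above can be sketched as follows. Cosine similarity is assumed here because it is the usual choice with the vector space model; the slides do not name the measure. The stop-word list, concept names, and weights are invented, Porter stemming is omitted to keep the sketch dependency-free, and the hierarchical display is reduced to a flat ranking.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}  # tiny illustrative list

def chat_vector(utterances):
    """Term-frequency vector over all utterances, minus stop words (no stemming)."""
    words = " ".join(utterances).lower().split()
    return Counter(w for w in words if w.isalpha() and w not in STOP_WORDS)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def classify(utterances, concept_vectors, top_n=3):
    """Return the top_n (concept, score) pairs, best match first."""
    query = chat_vector(utterances)
    scored = [(c, cosine(query, vec)) for c, vec in concept_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def stars(score, best, width=5):
    """Asterisk bar showing a score relative to the best score."""
    return "*" * round(width * score / best) if best else ""

# Hypothetical concept vectors (weights invented for illustration).
concepts = {
    "computers/security": {"exploit": 2.0, "password": 1.5, "firewall": 1.0},
    "music/instruments/violins": {"violin": 2.0, "bow": 1.0},
}
ranked = classify(["got the password from an exploit", "firewall is down"], concepts)
best = ranked[0][1]
bars = [(concept, stars(score, best)) for concept, score in ranked]
```

A hierarchical profile could be built on top of this by summing each concept's score into its parent categories before drawing the bars.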
14 Chat Profile (4)
Classification for hacking
15 Chat Profile (5)
- Intelligence agencies' interests constantly evolve
- New world events
- Threats
- Updated intelligence
- ChatProfile allows new concepts to be added for profile recognition
- Agent would make use of a concept database tuned for categories of interest only
- Omit uninteresting concepts (idle chat)
16 Chat Profile (6)
Classification for hacking (Augmented Training)
17 Classification in american-politics (two hours in January 2004, Undernet): SESSION PROFILE
18 Classification in american-politics (two hours in January 2004, Undernet): USER PROFILE
selected one public chat participant
19 Chat Retrieve
- Some profiles warrant further analysis
- Agent/admin needs the ability to recall the chat session linked to the profiles in question
- Indexing system keeps data current
- Traditional indexing systems index from scratch every time new data appears
ChatRetrieve provides session details
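To illustrate the contrast with indexing from scratch, here is a minimal sketch of an inverted index that is updated in place as each new utterance arrives. The class name, method names, and data layout are hypothetical; the slides do not describe how ChatRetrieve's indexer is implemented.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that accepts new utterances without rebuilding."""

    def __init__(self):
        # term -> {doc_id: term frequency}
        self.postings = defaultdict(lambda: defaultdict(int))
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one new utterance; existing postings are left untouched."""
        for term in text.lower().split():
            self.postings[term][doc_id] += 1
        self.doc_count += 1

    def lookup(self, term):
        """Return {doc_id: term frequency} for a term, or {} if unseen."""
        return dict(self.postings.get(term.lower(), {}))

index = IncrementalIndex()
index.add(1, "tornado outside")
index.add(2, "tornado tornado warning")  # arrives later; no rebuild needed
```

Each `add` touches only the postings for the terms in the new utterance, so the cost of keeping the index current depends on the size of the new data rather than the size of the archive.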
20 Chat Retrieve (2)
[Architecture diagram. Components shown: Chat Participants, IRC Client (x2), Internet, Chat Server (ChatLog), Chat Data Archive (XML/SQL), Incremental Indexer, Inverted Index, Web-Based Retrieval System]
21 Chat Retrieve (3)
- Queries based on
- speaker name
- keywords
- date/time range
- (combination of the above)
- Keyword retrieval based on tf × idf
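A combined query of this kind might be sketched as below: the speaker and date/time filters are ANDed together, and matching utterances are ranked by the summed tf × idf of the keywords. The function name, record layout, and the smoothed idf (log(1 + N/df), chosen so a keyword found in every utterance still scores above zero) are assumptions, not ChatRetrieve's actual scoring. The sample records reuse text from the log excerpt, with the timestamps read as HH:MM:SS.

```python
import math
from collections import Counter
from datetime import datetime

def search(utterances, keywords=(), speaker=None, start=None, end=None):
    """utterances: list of dicts with 'time' (datetime), 'speaker', 'text'.
    Filters are ANDed; results are ranked by summed tf * idf of the keywords."""
    tokens = [Counter(u["text"].lower().split()) for u in utterances]
    n = len(utterances)
    results = []
    for u, counts in zip(utterances, tokens):
        if speaker is not None and u["speaker"] != speaker:
            continue
        if start is not None and u["time"] < start:
            continue
        if end is not None and u["time"] > end:
            continue
        score = 0.0
        for kw in keywords:
            kw = kw.lower()
            df = sum(1 for c in tokens if kw in c)  # utterances containing kw
            if df:
                score += counts[kw] * math.log(1 + n / df)
        if keywords and score == 0.0:
            continue  # keywords were given but none matched this utterance
        results.append((score, u))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return [u for _, u in results]

log = [
    {"time": datetime(2004, 4, 17, 8, 58, 14), "speaker": "Jason",
     "text": "there is going to be weather"},
    {"time": datetime(2004, 4, 17, 8, 59, 46), "speaker": "Aliceer",
     "text": "must be a tornado outside his door"},
]
by_keyword = search(log, keywords=["tornado"])
by_speaker = search(log, speaker="Jason")
by_time = search(log, start=datetime(2004, 4, 17, 8, 59))
```

In a deployed system the document frequencies would come from the inverted index rather than being recomputed per query, as this sketch does for brevity.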
22 Chat Retrieve (4)
- Selecting chat room name replays chat session
- Includes utterances spoken by all participants
- Tracks all chat room participants
- Even if they do not contribute to sessions
- Ability to search by listener
23 Conclusions
ChatTrack technology is a framework that enhances today's client and server-based chat systems.
- ChatTrack provides agents with new tools for vigilance against crime
- ChatProfile generates conceptual profiles from chat data
- Filter based on listener/sender, date/time, chat room session name
- ChatRetrieve facilitates manual analysis through session retrieval by agents, administrators, parents
- Reduces manual effort to classify chat sessions
24 Future
- Automate temporal analysis of chat room topics (ChatTrend)
- Sudden change of interests may indicate odd behavior
- e.g., an upswing in discussions about public water utilities
- Visualize user profiles to inspect interests, topic shifts
25 Future (ChatTrend)
- CHATTREND SCREENSHOT HERE
26 Future (ChatTrend 2)
- CHATTREND SCREENSHOT 2 HERE
27 Future (2)
- Preprocess seemingly meaningless chatter into meaningful words
- brb → be right back
- lol → laughing out loud
- heehehehee → laughing
- Chat utterance clustering: identify similar chat content (thesis project)
- Provide ability to retrieve and replay sessions from archive based on topics
- Keyword search: "nitroglycerin" or "pipe bomb"
- Topic search: "bomb making"
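The proposed preprocessing of chat shorthand could be sketched as a lookup table plus a pattern for drawn-out laughter. The table holds only the two abbreviations listed above, and the regular expression is an illustrative guess at what counts as laughter; a real system would need a much larger dictionary.

```python
import re

# Abbreviations taken from the slide; a real table would be far larger.
ABBREVIATIONS = {"brb": "be right back", "lol": "laughing out loud"}

# Two or more repetitions of h + a/e vowels, e.g. "hehe", "haha", "heehehehee".
LAUGHTER = re.compile(r"(?:h+[ae]+){2,}")

def normalize(utterance):
    """Expand known abbreviations and map laughter tokens to 'laughing'."""
    out = []
    for token in utterance.lower().split():
        word = token.strip(".,!?")
        if not word:
            continue  # token was punctuation only
        if word in ABBREVIATIONS:
            out.append(ABBREVIATIONS[word])
        elif LAUGHTER.fullmatch(word):
            out.append("laughing")
        else:
            out.append(word)
    return " ".join(out)

cleaned = normalize("brb heehehehee lol!")
```

Running this step before stop-word removal and stemming would let the expanded words take part in classification like any other text.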
28 Future (3)
- Proactively monitor children
- Restrict types of messages children send/receive
- Corporate monitoring
- Archive virtual meetings (corporate liability)
- Chat networks
- Identify rogue users / spambots
- Analyze metrical habits, rhetorical devices, polysyllabic words
- Language modeling: identify user fingerprint (such as in analysis of Shakespeare literature)
- Detect on-line predators
- Investigate users having both adult and child-like language models
29 More Information
- The ChatTrack website has many components available for download: http://www.ittc.ku.edu/chattrack/
- XML Chat Log schema
- ChatLog Library (in C)
- Modified IRC chat server
- ChatTrackBot (IRC data collection)