J. Bengel, S. Gauch, E. Mittur, R. Vijayaraghavan - PowerPoint PPT Presentation

1
Chat Room Topic Detection Using Classification
  • J. Bengel, S. Gauch, E. Mittur, R. Vijayaraghavan
  • Presented by Dr. Susan Gauch


2
Outline
  • Motivation, Goals
  • Related Work
  • ChatTrack Architecture
  • Scenarios
  • Conclusion
  • Future Work

3
Motivation
  • Popularity of chat rooms and IM is soaring among both
    adults and teenagers
  • Consumers: entertainment value, meeting new people
    globally
  • Corporations: business contacts, virtual
    meetings
  • 2005: IM will surpass e-mail for primary
    electronic communications (Gartner Group)
  • Difficult to retain and track conversations
  • Criminals embedded in chat culture remain hidden
  • Disguise identity; lure children
  • Commit corporate/homeland espionage

4
Motivation (2)
  • Chat provides a new tool for collaboration
    between criminals
  • Recruit new terrorists and sympathizers
  • Dissemination of criminal plots
  • Detection of illegal activities may enhance crime
    prevention
  • Human monitors could monitor chat discussions
  • Invasive (privacy concerns)
  • Human error
  • Expensive!

5
Goals
  • Assist in crime detection by producing a
    concept-based profile, summarizing chat data
      • over entire public chat sessions
      • over individuals' 1:1 chat
      • to analyze silent listeners' interests
  • Chat Profile allows focus on domains of criminal
    interest, national security, protection of
    children
  • Examples:
      • science/microbiology (anthrax)
      • music/instruments/violins (Stradivarius heist)
6
Related Work
  • Topic identification: a longstanding problem
  • Chat is structured differently than normal writing
  • Butterfly: samples 30 seconds of chat data to
    determine chat room topics; searches for topics of
    interest (Van Dyke et al.)
  • Wayne, C. L.: text classification for
    discovering, threading, and retrieving from data
    streams
  • Bingham, E., et al.: topic ID from news
    sources using complexity pursuit
7
Related Work (2)
  • PeopleGarden: creates data portraits that
    represent interactions between users (Xiong, R.,
    et al.)
  • IBM WebFountain: assembles bodies of knowledge
    from various sources (web, chat rooms, email)
  • Elnahrawy, E.: evaluated several text
    classification techniques on chat data across four
    categories
  • Our work complements these systems
  • Combination → powerful analysis

8
Architecture
[Diagram components: Client, Internet, IRC Clients, Chat Server (with ChatLog), Chat Data Archive (XML/SQL), Concept Database, Classifier, Indexer, ChatProfile, ChatRetrieve, Administrator / Intelligence Agent]
9
Chat Archival
  • IRC bot and IRC server augmented with the ChatLog
    Library
  • Bot collects public chat data
  • Server collects both public and private data
  • All activity recorded (joins, parts,
    messages, nickname changes, etc.)

10
Chat Archival (2)
[Sample ChatLog entries; XML markup lost in extraction]
2004-04-17 08:57:50  jmb  jayhawk
2004-04-17 08:58:14  Jason  jayhawk
    There is going to be weather, whether or not. Uh oh, I'll be RIGHT back!
2004-04-17 08:58:19  jmb
2004-04-17 08:59:46  Aliceer  jayhawk
    Poof..Left in a hurry! Must be a tornado outside his door or something. lol
  • XML chat data → SQL database
  • ChatLog Library XML schema can be used for
    almost any client/server-based system
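
The XML-to-SQL step could look roughly like the sketch below. The slides do not show the ChatLog schema, so the element name `message`, its attributes, and the `messages` table are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the real ChatLog schema is not shown in the slides.
SAMPLE = """
<chatlog>
  <message date="2004-04-17" time="08:58:14" nick="Jason" channel="jayhawk">Uh oh, I'll be RIGHT back!</message>
  <message date="2004-04-17" time="08:59:46" nick="Aliceer" channel="jayhawk">Poof..Left in a hurry!</message>
</chatlog>
"""

def load_into_sql(xml_text, conn):
    """Parse logged messages and insert one row per message."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(date TEXT, time TEXT, nick TEXT, channel TEXT, body TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (m.get("date"), m.get("time"), m.get("nick"), m.get("channel"), m.text or "")
        for m in root.iter("message")
    ]
    # Parameterized inserts keep chat text from being interpreted as SQL.
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
inserted = load_into_sql(SAMPLE, conn)
```

Keeping the archive in SQL makes the later filters (by speaker, room, or time range) simple `WHERE` clauses.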

11
Chat Profile
  • Text classification to create profile of chat
    data
  • Classification based on Vector Space model
  • Training
  • Use predefined concepts and example text from
    ODP (Open Directory Project)
  • Each category: vector of representative keywords
    and weights (tf * idf)
  • Trained on 1,565 concepts from top 3 levels of
    ODP
  • Selection based on empirical studies
  • Additional self-trained categories can be supplied
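
As a minimal sketch of the training step, the code below builds one tf * idf keyword vector per concept. The input format, the function name `train_concept_vectors`, and the choice to treat each concept as one "document" when computing idf are assumptions; the slides do not specify the exact weighting variant.

```python
import math
from collections import Counter

def train_concept_vectors(training_text):
    """Build one tf*idf keyword vector per concept.

    training_text maps a concept name to the list of tokens drawn from
    that concept's example text (hypothetical input format).
    """
    term_counts = {c: Counter(tokens) for c, tokens in training_text.items()}
    n_concepts = len(term_counts)

    # Document frequency: in how many concepts does each term appear?
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    vectors = {}
    for concept, counts in term_counts.items():
        vectors[concept] = {
            term: tf * math.log(n_concepts / df[term])
            for term, tf in counts.items()
        }
    return vectors

# Toy training data using two of the concept names from the Goals slide.
toy = {
    "science/microbiology": "anthrax spore culture bacteria spore".split(),
    "music/instruments/violins": "violin bow string culture".split(),
}
vectors = train_concept_vectors(toy)
```

With this weighting, a term that occurs in every concept (here "culture") gets weight zero, so only distinguishing keywords contribute to a concept's vector.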

12
Chat Profile (2)
  • User profile focuses on one chat participant
  • Session profile filters by chat room name only
  • Analyst selects criteria of interest

13
Chat Profile (3)
  • Chat data collected from archive (stop-word
    removal and Porter stemming)
  • Classification performed once chat utterances
    are collected
  • Classifier creates vector of keywords from chat
    data
  • Similarity measure
  • Determines similarity between the chat data vector
    and the vectors for each trained concept
  • Concepts sorted; top matches returned
  • Represented in hierarchical fashion
  • Asterisks represent relative concept weights at
    each level
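
The classification steps on this slide can be sketched as follows. The slides name a "similarity measure" without saying which one; cosine similarity is assumed here as the usual choice for the vector space model. Porter stemming is left out to keep the sketch dependency-free, and the stop-word list, concept weights, and function names are illustrative only.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "i"}  # toy list

def chat_vector(utterances):
    """Term-frequency vector for a set of chat utterances, minus stop words."""
    tokens = []
    for line in utterances:
        tokens += [w for w in line.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def classify(utterances, concept_vectors, top_n=3):
    """Score every concept against the chat data and return the top matches."""
    chat = chat_vector(utterances)
    scores = [(cosine(chat, vec), name) for name, vec in concept_vectors.items()]
    return sorted(scores, reverse=True)[:top_n]

def stars(score, best, width=10):
    """Render a weight relative to the best match as a row of asterisks."""
    if best <= 0:
        return ""
    return "*" * round(width * score / best)

# Hypothetical concept vectors; real ones would come from training on ODP.
concepts = {
    "computers/hacking": {"exploit": 2.0, "password": 1.5, "root": 1.0},
    "recreation/weather": {"tornado": 2.0, "storm": 1.5, "rain": 1.0},
}
ranked = classify(["There is a tornado outside", "storm and rain"], concepts)
```

Here `ranked` lists the weather concept first and the hacking concept with a score of zero; `stars` mimics the relative-weight display described on the slide.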

14
Chat Profile (4)
Classification for hacking
15
Chat Profile (5)
  • Intelligence agencies' interests constantly
    evolve
  • New world events
  • Threats
  • Updated intelligence
  • ChatProfile allows new concepts to be added for
    profile recognition
  • Agent would make use of concept database tuned
    for categories of interest only
  • Omit uninteresting concepts (idle chat)

16
Chat Profile (6)
Classification for hacking (Augmented Training)
17
Classification in american-politics (two hours
in January 2004, Undernet): SESSION PROFILE
18
Classification in american-politics (two hours
in January 2004, Undernet): USER PROFILE
(one selected public chat participant)
19
Chat Retrieve
  • Some profiles warrant further analysis
  • Agent/admin needs ability to recall chat
    session linked to profiles in question
  • Indexing system keeps data current
  • Traditional indexing systems index from
    scratch every time new data appears

ChatRetrieve provides session details
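
To illustrate the contrast with re-indexing from scratch, here is a minimal sketch of an incremental inverted index that only processes messages it has not seen before. The class name, the use of increasing message ids as a high-water mark, and whitespace tokenization are assumptions, not details taken from ChatRetrieve.

```python
from collections import defaultdict

class IncrementalIndex:
    """Inverted index that is extended in place as new chat data arrives."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of messages containing it
        self.indexed_up_to = 0            # highest message id already indexed

    def update(self, messages):
        """Index only unseen messages; messages are (id, text) with increasing ids."""
        added = 0
        for msg_id, text in messages:
            if msg_id <= self.indexed_up_to:
                continue  # already indexed on an earlier pass
            for term in set(text.lower().split()):
                self.postings[term].add(msg_id)
            self.indexed_up_to = msg_id
            added += 1
        return added

    def lookup(self, term):
        """Return the sorted ids of messages that contain the term."""
        return sorted(self.postings.get(term.lower(), ()))

archive = [(1, "tornado outside his door"), (2, "be right back")]
index = IncrementalIndex()
index.update(archive)

# New data appears: only message 3 is processed on the second pass.
archive.append((3, "the tornado passed"))
newly_indexed = index.update(archive)
```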
20
Chat Retrieve (2)
[Diagram components: Chat Participants, IRC Clients, Internet, Chat Server (ChatLog), Chat Data Archive (XML/SQL), Incremental Indexer, Inverted Index, Web-Based Retrieval System]
21
Chat Retrieve (3)
  • Queries based on:
      • speaker name
      • keywords
      • date/time range
      • any combination of the above
  • Keyword retrieval based on tf * idf
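
A combined query over speaker, keywords, and time range might be sketched as below. The `Message` record and `search` function are hypothetical, and keyword matching here is a plain Boolean filter; the tf * idf ranking mentioned on the slide is not reproduced.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    when: datetime
    speaker: str
    text: str

def search(messages, speaker=None, keywords=(), start=None, end=None):
    """Return messages matching every criterion that was supplied."""
    results = []
    for m in messages:
        if speaker is not None and m.speaker != speaker:
            continue
        if start is not None and m.when < start:
            continue
        if end is not None and m.when > end:
            continue
        words = set(m.text.lower().split())
        if any(k.lower() not in words for k in keywords):
            continue  # every keyword must be present
        results.append(m)
    return results

log = [
    Message(datetime(2004, 4, 17, 8, 58, 14), "Jason", "Uh oh, I'll be RIGHT back!"),
    Message(datetime(2004, 4, 17, 8, 59, 46), "Aliceer", "Must be a tornado outside his door"),
]
hits = search(log, keywords=["tornado"], start=datetime(2004, 4, 17, 8, 59))
by_speaker = search(log, speaker="Jason")
```

Leaving a criterion as `None` (or an empty keyword list) disables it, which gives the "any combination" behavior with a single function.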

22
Chat Retrieve (4)
  • Selecting chat room name replays chat session
  • Includes utterances spoken by all participants
  • Tracks all chat room participants
  • Even if they do not contribute to sessions
  • Ability to search by listener

23
Conclusions
ChatTrack technology is a framework that
enhances today's client- and server-based chat
systems.
  • ChatTrack provides agents with new tools for
    vigilance against crime
  • ChatProfile generates conceptual profiles from
    chat data
  • Filter based on listener/sender, date/time, chat
    room session name
  • ChatRetrieve facilitates manual analysis for
    session retrieval by agents, administrators,
    parents
  • Reduces manual efforts to classify chat sessions

24
Future
  • Automate temporal analysis of chat room topics
    (ChatTrend)
  • Sudden change of interests may indicate odd
    behavior
  • e.g., an upswing in discussions about public
    water utilities
  • Visualize user profiles to inspect interests,
    topic shifts
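
One simple way to flag a sudden change of interests is to compare profile weights across two consecutive time windows. The sketch below is only an illustration of that idea: the function name, the threshold, the concept names, and the weights are all invented, and the slides do not describe how ChatTrend would actually work.

```python
def rising_concepts(previous, current, min_increase=0.2):
    """Return concepts whose profile weight rose by at least min_increase.

    previous and current map concept names to normalized weights for two
    consecutive time windows (hypothetical representation).
    """
    flagged = []
    for concept, weight in current.items():
        delta = weight - previous.get(concept, 0.0)
        if delta >= min_increase:
            flagged.append((concept, round(delta, 3)))
    # Largest increases first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Invented example weights for one chat room over two weeks.
last_week = {"society/politics": 0.6, "recreation/sports": 0.3, "business/water-utilities": 0.1}
this_week = {"society/politics": 0.5, "recreation/sports": 0.1, "business/water-utilities": 0.4}
alerts = rising_concepts(last_week, this_week)
```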

25
Future (ChatTrend)
  • CHATTREND SCREENSHOT HERE

26
Future (ChatTrend 2)
  • CHATTREND SCREENSHOT 2 HERE

27
Future (2)
  • Preprocess seemingly meaningless chatter into
    meaningful words
      • brb → be right back
      • lol → laughing out loud
      • heehehehee → laughing
  • Chat utterance clustering: identify similar chat
    content (thesis project)
  • Provide ability to retrieve, replay sessions from
    archive based on topics
      • Keyword search: 'nitroglycerin' or 'pipe bomb'
      • Topic search: 'bomb making'
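
The abbreviation expansion listed on this slide could be prototyped as below. The two dictionary entries and the laughter example come from the slide; the regular expression, punctuation handling, and function name are assumptions for illustration.

```python
import re

# Expansions taken from the slide; a real table would be much larger.
ABBREVIATIONS = {"brb": "be right back", "lol": "laughing out loud"}

# Repeated "he"/"ha" runs such as "hehe", "heehehehee", "hahaha".
LAUGHTER = re.compile(r"^(?:h+e+){2,}h*$|^(?:h+a+){2,}h*$")

def normalize(utterance):
    """Replace chat shorthand with plain words before classification."""
    words = []
    for token in utterance.lower().split():
        bare = token.strip(".,!?")
        if not bare:
            continue  # token was punctuation only
        if bare in ABBREVIATIONS:
            words.append(ABBREVIATIONS[bare])
        elif LAUGHTER.match(bare):
            words.append("laughing")
        else:
            words.append(bare)
    return " ".join(words)

expanded = normalize("brb lol")
laugh = normalize("heehehehee that was funny")
```

Requiring at least two repetitions keeps ordinary words such as "he" from being rewritten.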

28
Future (3)
  • Proactively monitor children
  • Restrict types of messages children send/receive
  • Corporate monitoring
  • Archive virtual meetings (corporate liability)
  • Chat Networks
  • Identify rogue users / Spambots
  • Analyze metrical habits, rhetorical devices,
    polysyllabic words
  • Language modeling: identify a user fingerprint
    (as in analysis of Shakespeare's works)
  • Detect on-line predators
  • Investigate users having both adult and
    child-like language models

29
More Information
  • The ChatTrack website has many
    components available for download
  • XML Chat Log schema
  • ChatLog Library (in C)
  • Modified IRC chat server
  • ChatTrackBot (IRC data collection)

http://www.ittc.ku.edu/chattrack/