Title: Studying users behavior in chat rooms
1Studying users behavior in chat rooms
- DANSS
- January 25, 2004
- Michael Rochkind
2Agenda
- Motivation
- Project goals
- What was done
- Results
- Conclusions
3Motivation
- Need for simulations of interactive end users to
evaluate algorithms and system designs (e.g.,
algorithms for estimating multicast group
size)
- Real data is difficult to obtain (for both technical and
administrative reasons)
- Most researchers use a trace collected from the audio
multicast of IETF conference talks in 1996
4Problems with the trace
- A complete research field is based on a single
trace
- The trace is quite old (from 1996)
- Collected from one specific type of service
(audio conference). The exact nature of the users is
unknown. The behavior is not necessarily the same
as in other applications.
- Impossible to validate the data or collect new
data
- Relatively little member activity
- Percentage of spurious joins/leaves is very high
5Statistical analysis of the trace
- Different researchers obtained different statistical
models for various parameters.
- Ammar and Almeroth (the original trace creators)
obtained an exponential model for most parameters
and a Zipf distribution for long-session stay time.
- Aluf, Altman, and Nain recently obtained from the
same long trace a lognormal distribution for both
inter-arrival times and stay times. For a short
multicast session they obtained Weibull
distributions for both inter-arrival and stay
times.
- Assumed a uniform (spatial) distribution of users
6Project goals
- To find a publicly available system that
reasonably approximates multicast user behavior.
- To develop data retrieval tools that
can be run by anyone, anytime.
- To analyze the collected data
7Parameters of interest
- Inter-arrival time
- Session duration (on-time)
- Number of logged in users (group size)
- User activity (messages, bytes)
- Geographical distribution of users
- Lifespan of multicast event (for short events)
- Comparison with the famous trace
8First try - message boards (Yahoo)
- Difficult to define the notion of a user session. Many
users send just one message.
- Only active users (writers) can be seen
- A lot of information is missing (about 50)
- Activity peaks when outstanding events happen
9Chat rooms
- The model is similar to a multicast group
- Users explicitly join the room and leave it
- Join/leave time and stay time are well-defined.
- Every message sent to the room is received by all
room members
10IRC- Internet Relay Chat protocol
- Runs over TCP/IP
- Text-based teleconferencing
- Client-server model
- Can run in a distributed fashion
- Five big networks with many tens of thousands of
users and thousands of channels (rooms)
11IRC Servers
- Form the backbone of an IRC network
- Connected together without cycles (in the form
of a spanning tree)
- Handle client connections
- Each server knows about all other servers and all
clients.
(Figure: example network of servers S1-S6 and clients C1-C4)
12IRC clients
- An IRC client is anything connected to an IRC server
that is not another IRC server.
- Any TCP-enabled device can be an IRC client
- Distinguished by a unique nickname
- Each IRC server has the following info about each
IRC client:
- Nickname
- Real name of the host where the client is running
- Username of the client on that host
- IRC server to which the client is connected
13IRC Channels
- Equivalent to the term "chat room"
- A named group of one or more users who all
receive messages addressed to that channel.
- Created when the first user joins the channel
- Ceases to exist when the last user leaves it
- In case of a network split, the channel on each side
has only those clients connected to the servers
on the corresponding side. After network
reconnection the channel is joined together again.
14IRC network example
(Figure: example network of servers S1-S6 and clients C1-C4)
15IRC message sending
(Figure: a message travelling through the network of servers S1-S6 and clients C1-C4)
16IRC new member joins to a channel
- Channel X with members C1, C2, C3
- Client C4 joins the channel X
(Figure: step 1, "Join X"; step 2, "names c1, c2, c3"; "join c4" is
propagated across servers S1-S6 to clients C1, C2, C3)
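The propagation on this slide can be sketched in a few lines of Python. This is only an illustration, not IRC server code: the `links` topology is hypothetical (the exact links in the figure are not assumed), and `flood` is an illustrative helper name. Because the servers form a tree, each server forwards the message on every link except the one it arrived on, and no server receives it twice.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical spanning tree of six servers (bidirectional links).
links: Dict[str, List[str]] = {
    "S1": ["S2", "S5"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S2"],
    "S4": ["S2", "S6"],
    "S5": ["S1"],
    "S6": ["S4"],
}

def flood(links: Dict[str, List[str]], origin: str) -> List[Tuple[str, str]]:
    """Return the (sender, receiver) server hops used to spread one message."""
    hops: List[Tuple[str, str]] = []
    pending: List[Tuple[str, Optional[str]]] = [(origin, None)]
    while pending:
        server, came_from = pending.pop()
        for peer in links[server]:
            if peer != came_from:          # never send back where it came from
                hops.append((server, peer))
                pending.append((peer, server))
    return hops

# A client attached to S6 joins a channel; S6 originates the notification.
hops = flood(links, "S6")
```

With n servers in a tree the message crosses exactly n - 1 links, which is why the acyclic topology needs no duplicate suppression.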
17IRC Channel Monitoring
- Monitoring client written in Perl, running under
cron
- We randomly chose 3 channels from the group of
all channels with more than 100 users: israel,
canada, bosnia
- Channel activity data was collected for a period
of about 6 weeks.
18Log file format
- START
- EXIT
- JOIN
- PART / QUIT / KICK
- PUBLIC
- NICK
- NAMES
19Log file example
1053586971 START
1053587032 JOIN wponiw IL
1053587032 NAMES wponiw Teo_ i-NA mr_shark _kNibAL_ kaye_22 Old-Man CHA_555 klent Leila19f Dan kalanko1 Manifa21f jennider1 eu_sunt mangko18 hotguy holly20f sad_beaut swimgirl ghazde swt_guy pseudonym bing_23 topgirl23 sexYica creatza sergio9 ZaRa glance cookie aileen Ugly-GirL AFNAN EclipseM laurra-f garden cai applej SHUNSY fatcock kikelph mhaelee16 aGaTa Ercko lonebabe shellaine juulia priti2 HuntI2ess
1053587032 NAMES gienah Amanda Jamali lishat18 cute_ashf jhen Horbit Sana18 AloneMan3 Errikka ext-ex Maysmile ynet02 poem_37M ann3 jelle love_less dreeve18 indai adze LiWeiYi TokyoBoy blossom dummee man__ marichu earp danone jackdaw faraz ANGELA25 boby27 leah_ jossie shyrgil jade-17 kian arnulpo ally16 FiNG Carmina42 bangd sohail Janine33 anne--- joyce22 LUIE_M Travioli corn HOMBREJ2 sexybabes spyk2000 barbi3
1053587032 NAMES tumbleWED Gaby3 chynna babyTH lenjie jherome Certified dj_france jane36 micay shah goerge24 bluediamo master_po Jypsy bassma Bobson Fil24f dimple2 _THERE_ AloneGirL Naked_f shark_nyk morena23 Danniel_m Arwen_ ofw_park jimbern m40usa restie _at_PacZzZzZz blackstud davis He11razor MultiMind mater Fearless Adnan_pk Ermya Helena BrainDead CStrixAW wooden birkof Cute_Girl Lisa_-- Megaframe barbara-
1053587032 NAMES Simple Loren23 Diana27 Cozzo NateDogg legendh Angel19 Mariah19 fedfed SUNSEEKER PRONET7 bestofmi D0gGi3 Don_Juan MrNylons teapot SkiPerZ Br0Th4 Linutech ShowerMia JenJen Mariahhh optimist _at_X
1053587032 JOIN D-A-D-I IN
1053587045 JOIN sydneyguy AU
1053587047 PUBLIC Certified 17 US
1053587053 PUBLIC Certified 13 US
1053587059 JOIN Mckay28 MT
1053587063 NICK CHA_555 zHTe
1053587068 PUBLIC Certified 31 US
1053587076 PART zHTe
1053587080 JOIN villain PH
1053587082 JOIN cryn PH
1053587095 JOIN staticx US
1053587098 PUBLIC Certified 31 US
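As a rough illustration of how such a log could be read back for analysis, here is a minimal Python sketch. It assumes one record per line in the form `timestamp EVENT fields...`. The field meanings are inferred from the excerpt and are assumptions, not a documented format: JOIN carries a nickname and a country code, PUBLIC a nickname, a message size, and a country code, and NICK the old and new nicknames. `LogEvent` and `parse_line` are illustrative names, not part of the project's scripts.

```python
from typing import List, NamedTuple

class LogEvent(NamedTuple):
    timestamp: int    # Unix time in seconds
    kind: str         # START, EXIT, JOIN, PART, QUIT, KICK, PUBLIC, NICK, NAMES
    args: List[str]   # remaining whitespace-separated fields

def parse_line(line: str) -> LogEvent:
    """Split one log record into timestamp, event type, and arguments."""
    fields = line.split()
    return LogEvent(int(fields[0]), fields[1], fields[2:])

sample = """\
1053586971 START
1053587032 JOIN wponiw IL
1053587047 PUBLIC Certified 17 US
1053587063 NICK CHA_555 zHTe
1053587076 PART zHTe
"""

events = [parse_line(line) for line in sample.splitlines() if line.strip()]
join_times = [e.timestamp for e in events if e.kind == "JOIN"]
```

Keeping the arguments as an uninterpreted list avoids committing to field meanings that the slides do not spell out.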
20Inter-arrival time
21Inter-Arrival distribution bosnia
(Two plots: occurrences vs. time in sec)
22Inter-Arrival distribution israel
(Two plots: occurrences vs. time in sec)
23Inter-Arrival distribution canada
(Two plots: occurrences vs. time in sec)
24Inter-Arrival distribution
- Distribution looks similar for all three channels
- The distribution is heavy-tailed for two main
reasons
- Network splits add zero values (during
reconnection) and large values (during the split)
- Periods of low activity add a tail (more relevant for
channels with non-uniform geographical
distribution, like bosnia)
25Inter-arrival time fits
- LogNormal distribution is the best in almost all
cases
- The only exception is the InvGauss distribution under the
Anderson-Darling (A-D) and Kolmogorov-Smirnov (K-S) tests for israel
- Exponential distribution is very far from being
optimal
(Panels: israel, canada, bosnia)
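The slides rank candidate distributions with Chi-Square, K-S, and A-D goodness-of-fit tests. As a simpler stand-in (not the procedure behind the results above), the following Python sketch compares an exponential and a lognormal model by their maximized log-likelihood, using only the standard library. Dropping zero gaps is an assumption of the sketch, needed because the lognormal is undefined at zero. All function names are illustrative.

```python
import math
from typing import List, Sequence

def inter_arrivals(join_times: Sequence[int]) -> List[int]:
    """Gaps between consecutive join timestamps (seconds)."""
    ts = sorted(join_times)
    return [b - a for a, b in zip(ts, ts[1:])]

def exp_loglik(xs: Sequence[float]) -> float:
    """Log-likelihood of an exponential fit; MLE rate is 1 / mean."""
    n = len(xs)
    rate = n / sum(xs)
    return n * math.log(rate) - rate * sum(xs)

def lognormal_loglik(xs: Sequence[float]) -> float:
    """Log-likelihood of a lognormal fit; MLE uses mean/variance of log(x)."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

# Toy join times (seconds), not data from the study.
toy_joins = [0, 1, 2, 4, 104, 105, 107, 1107]
gaps = [g for g in inter_arrivals(toy_joins) if g > 0]   # assumption: drop zeros
better = "lognormal" if lognormal_loglik(gaps) > exp_loglik(gaps) else "exponential"
```

On this toy sample, where a few large gaps sit among many short ones, the lognormal attains the higher log-likelihood. A real comparison would use the logged join times and a proper goodness-of-fit test.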
26The audio trace inter-arrival fits
- Inter-arrival time distribution is similar to IRC
channels
- LogNormal / InvGauss
27Session Duration
28Session duration distribution- israel
(Two plots: occurrences vs. duration, one in units of 10^5 sec and one in sec)
29Session duration distribution- canada
(Two plots: occurrences vs. duration, one in units of 10^5 sec and one in sec)
30Session duration distribution- bosnia
(Two plots: occurrences vs. duration, one in units of 10^5 sec and one in sec)
31Session duration distribution
- Very heavy tail for two reasons
- Many users spent a lot of time in the channel
- Robots
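Session durations of this kind can be derived from the event log by pairing each JOIN with the next PART, QUIT, or KICK of the same user. The Python sketch below is one possible way to do it, under stated assumptions: events are `(timestamp, kind, args)` tuples, NICK carries the old and new nicknames, and users already present at START (listed by NAMES) are ignored because their join times are unknown. `session_durations` is an illustrative name.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, List[str]]
LEAVE_KINDS = {"PART", "QUIT", "KICK"}

def session_durations(events: Iterable[Event]) -> List[int]:
    """Return completed session lengths in seconds, in order of departure."""
    joined_at: Dict[str, int] = {}   # nickname -> join timestamp
    durations: List[int] = []
    for ts, kind, args in events:
        if kind == "JOIN":
            joined_at[args[0]] = ts
        elif kind == "NICK" and args[0] in joined_at:
            # Same user under a new nickname: keep the original join time.
            joined_at[args[1]] = joined_at.pop(args[0])
        elif kind in LEAVE_KINDS and args[0] in joined_at:
            durations.append(ts - joined_at.pop(args[0]))
    return durations

toy_events: List[Event] = [
    (100, "JOIN", ["alice", "IL"]),
    (130, "NICK", ["alice", "alice_away"]),
    (160, "PART", ["alice_away"]),
    (170, "QUIT", ["unknown_user"]),   # joined before monitoring: skipped
]
durations = session_durations(toy_events)
```

Sessions still open when the log ends are left out here. Treating them as censored observations would be a reasonable refinement for heavy-tailed data.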
32Session duration fits
- BetaGeneral distribution gives the best fit using
Chi-Square and K-S tests whenever we limit
the data samples
- LogNormal is always in second place (and the best
fit using A-D tests)
- When we don't limit the data samples, LogNormal is
the best.
- Exponential is very far from being optimal
(Panels: israel, canada, bosnia)
33The audio trace session duration fits
- Session durations are not similar: extremely heavy
tail.
- 90th percentile is similar to IRC channels
(Plot: occurrences vs. time in sec)
34The audio trace session durations
Long sessions (1 min)
- Long sessions are similar to IRC channels
- The phenomenon of short sessions is unique to the
audio trace. No analog in the IRC Channels
Short sessions (
35Main affecting factors
- Network failures (splits)
- Robots and long staying users
- Geographical distribution of users
36IRC network splits
- Any IRC server failure or link failure causes a
split.
- To a channel member, a split looks like a mass
departure of users, and reconnection looks like a mass
join of users.
- Contribute a large number of zeros to inter-arrival
times (about 2 percent of joins come in groups)
- Decrease session durations
- Most splits last up to 20 minutes
37Short (temporal) Splits
- Heuristic: find a group of quits followed by a
group of joins with the same users.
- Finds only part of the failures
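One way this heuristic might look in Python is sketched below. The details are assumptions rather than the project's actual script: a "group" is taken to be events sharing the same timestamp, `min_users` (3) is an arbitrary illustrative threshold, and `max_gap` (1200 seconds) reflects the observation that most splits last up to 20 minutes. Events are `(timestamp, kind, args)` tuples with the nickname as the first argument.

```python
from typing import Dict, Iterable, List, Set, Tuple

Event = Tuple[int, str, List[str]]

def find_splits(events: Iterable[Event], min_users: int = 3,
                max_gap: int = 1200) -> List[Tuple[int, int, Set[str]]]:
    """Return (quit_time, rejoin_time, nicknames) for each suspected split."""
    quits_at: Dict[int, Set[str]] = {}
    joins_at: Dict[int, Set[str]] = {}
    for ts, kind, args in events:
        if kind == "QUIT":
            quits_at.setdefault(ts, set()).add(args[0])
        elif kind == "JOIN":
            joins_at.setdefault(ts, set()).add(args[0])

    splits = []
    for q_ts in sorted(quits_at):
        gone = quits_at[q_ts]
        if len(gone) < min_users:
            continue                      # too few quits to count as a group
        for j_ts in sorted(joins_at):
            if q_ts < j_ts <= q_ts + max_gap:
                back = gone & joins_at[j_ts]
                if len(back) >= min_users:
                    splits.append((q_ts, j_ts, back))
                    break                 # first matching rejoin group wins
    return splits

toy_events: List[Event] = [
    (100, "QUIT", ["a"]), (100, "QUIT", ["b"]), (100, "QUIT", ["c"]),
    (150, "QUIT", ["d"]), (160, "JOIN", ["d", "IL"]),
    (400, "JOIN", ["a", "IL"]), (400, "JOIN", ["b", "US"]), (400, "JOIN", ["c", "CA"]),
]
splits = find_splits(toy_events)
```

As the slide notes, a heuristic like this finds only part of the failures: users who return under a different nickname, or not at all, are missed.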
38Split durations
(Plot: occurrences vs. duration in sec)
39Robots
We define a robot as any client that is logged in
more than 8 hours per day on average.
- Add a constant to the number of logged-in users
- Add a heavy tail to session durations
- Don't affect inter-arrival and join statistics
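The robot definition above translates into a short filter. The Python sketch below assumes sessions are available as `(nickname, start, end)` tuples with Unix timestamps and that a nickname identifies a client for the whole observation period, which is a simplification. `find_robots` and its parameters are illustrative names.

```python
from typing import Dict, Iterable, Set, Tuple

Session = Tuple[str, int, int]   # (nickname, start, end), Unix seconds

def find_robots(sessions: Iterable[Session], period_days: float,
                threshold_hours: float = 8.0) -> Set[str]:
    """Nicknames whose average logged-in time exceeds the daily threshold."""
    total_seconds: Dict[str, int] = {}
    for nick, start, end in sessions:
        total_seconds[nick] = total_seconds.get(nick, 0) + (end - start)
    return {
        nick for nick, secs in total_seconds.items()
        if secs / 3600 / period_days > threshold_hours
    }

HOUR = 3600
toy_sessions = [
    ("bot", 0, 20 * HOUR),                   # 20 h over 2 days -> 10 h/day
    ("human", 0, 2 * HOUR),
    ("human", 30 * HOUR, 32 * HOUR),         # 4 h over 2 days -> 2 h/day
    ("edge", 0, 16 * HOUR),                  # exactly 8 h/day: not "more than"
]
robots = find_robots(toy_sessions, period_days=2)
```

Removing the flagged nicknames before fitting session durations is one way to see how much of the heavy tail the robots account for.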
40Distribution of logged robots number
(Plot: occurrences vs. number of bots)
41Robots session durations (channel canada)
42Geographical distribution
43Geographical distribution during day hours
44Number of logged in users (channel size)
45Number of user joins per hour
46User traffic (Israel)
(Two plots: joins per hour and channel size vs. hour of day)
47User traffic (bosnia)
(Two plots: joins per hour and channel size vs. hour of day)
48User traffic (canada)
(Two plots: joins per hour and channel size vs. hour of day)
49User traffic as a function of time of day: observations
- The function is very stable across different days
- The graph shape is mainly determined by the geographical
distribution of users
- Has great influence on the distribution of other
parameters, such as the number of on-line users and the number
of joins per hour.
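A time-of-day profile like the one discussed here can be built by bucketing join timestamps into the 24 hours of the day. The Python sketch below assumes Unix timestamps and a fixed hypothetical `utc_offset_hours` parameter for shifting to a channel's local time. It returns raw counts, so dividing by the number of observed days gives the average joins per hour.

```python
import time
from collections import Counter
from typing import Iterable, List

def joins_by_hour(join_times: Iterable[int], utc_offset_hours: int = 0) -> List[int]:
    """Count joins in each hour of the day (index 0..23)."""
    counts: Counter = Counter()
    for ts in join_times:
        hour = time.gmtime(ts + utc_offset_hours * 3600).tm_hour
        counts[hour] += 1
    return [counts.get(h, 0) for h in range(24)]

# Toy timestamps (seconds since the epoch), not data from the study.
profile_utc = joins_by_hour([0, 10, 3600])
profile_shifted = joins_by_hour([0, 10, 3600], utc_offset_hours=2)
```

Comparing profiles computed with different offsets gives a quick check on whether a channel's daily peak lines up with the local evening of its presumed audience.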
50Joins per hour distribution - israel
(Plot: occurrences vs. joins per hour)
51Joins per hour distribution - bosnia
(Plot: occurrences vs. joins per hour)
52Joins per hour distribution - canada
(Plot: occurrences vs. joins per hour)
53Data traffic (Israel)
(Two plots: messages per hour and bytes per hour vs. hour of day)
54Data traffic (bosnia)
(Two plots: messages per hour and bytes per hour vs. hour of day)
55Data traffic (canada)
(Two plots: messages per hour and bytes per hour)
56Data traffic observations
- The two graphs are highly correlated due to the
nature of the messages.
- Some exceptions come from robots violating the
game rules.
- Some correlation with the number of logged-in users,
but much flatter.
57Users activity (writers)
58Users activity (part 2)
59Short multicast event
- 10 start joining
- 40 most participants joined
- 50 last particip. joins. Event starts.
- 110 event ends
- 120 participants leave
- 190 users leave
Time (minutes)
60Short multicast event (data traffic)
(Plots: msgs and bytes vs. time in minutes)
61Conclusions
- Modeling multicast group behavior through IRC
users is possible.
- It's difficult to fit empirical data to pure
analytical models due to the combination of
different factors (user types, system failures,
etc.). The simulation process must take all these
factors into account
- The famous audio log is inadequate with respect
to some important parameters
- The traditional assumption of a uniform
spatial distribution is not always correct
- Data logs and scripts are available for use