Title: Clustering semantic spaces of suicide notes and newsgroups
Pawel Matykiewicz, John P. Pestian,
Wlodzislaw Duch
Methods Differences in semantic and syntactic
spaces were tested using LIWC software. Documents
were converted to a matrix representation using
Perl Natural Language Processing modules. Vector
space normalization, multidimensional scaling
(MDS) and performance measures were calculated
using R software. Clustering algorithms came from
the Weka machine learning package.
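The matrix conversion and vector space normalization described above can be sketched as follows. This is a minimal illustration in Python, not the poster's Perl and R pipeline. The whitespace tokenizer, raw term counts, unit-length (L2) normalization and cosine distance are all assumptions, since the poster does not state which weighting, normalization or distance was used. The function names and the three toy documents are hypothetical. The resulting distance matrix is the kind of input an MDS routine would take.

```python
import math
from collections import Counter


def term_count_matrix(documents):
    """Return (vocabulary, rows); each row holds raw term counts for one document."""
    # Assumption: lowercase whitespace tokenization stands in for real NLP preprocessing.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


def l2_normalize(row):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm > 0 else list(row)


def cosine_distance_matrix(rows):
    """Pairwise cosine distance (1 - cosine similarity) between document vectors."""
    unit = [l2_normalize(row) for row in rows]
    size = len(unit)
    return [
        [1.0 - sum(a * b for a, b in zip(unit[i], unit[j])) for j in range(size)]
        for i in range(size)
    ]


# Hypothetical toy documents, used only to exercise the functions.
documents = ["gun control law", "gun law debate", "sorry i love you"]
vocabulary, matrix = term_count_matrix(documents)
distances = cosine_distance_matrix(matrix)
```

Documents that share no terms end up at distance 1, and identical documents at distance 0, so the matrix can be passed to any scaling or clustering step that accepts pairwise dissimilarities.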
Introduction Suicide is the third leading cause
of death in adolescents and a leading cause of
death in the United States. Those who attempt
suicide usually arrive at the Emergency
Department seeking help. These individuals are at
risk for a repeated attempt, which may lead to a
completed suicide. Emergency Medicine clinicians
are often left to manage suicidal patients by
clinical judgment alone. This research aims to
better understand a large collection of suicide
notes in order to support clinicians' judgment.
This is done by comparing the notes with a
non-suicidal control group.
Results The table below shows differences in
semantic and syntactic spaces from the LIWC
software. There are five syntactic features
(number of articles, words longer than 6 letters,
pronouns, prepositions and verbs) and four
semantic features
(biological, affective, cognitive and social
processes). Clustering was done by combining each
of the four newsgroups with the suicide notes
into the following data sets:
talk.politics.guns + suicide notes = guns,
talk.politics.mideast + suicide notes = mideast,
talk.politics.misc + suicide notes = politics,
talk.religion.misc + suicide notes = religion.
Figures and tables in the center of the poster
show MDS and clustering results.
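The construction of the four combined data sets can be sketched as below. This is a hedged Python illustration, not the poster's actual code: the function name, the label strings and the placeholder texts are hypothetical, and only the newsgroup names and the short data set names (guns, mideast, politics, religion) come from the poster.

```python
# Short data set names used on the poster, keyed by newsgroup.
SHORT_NAMES = {
    "talk.politics.guns": "guns",
    "talk.politics.mideast": "mideast",
    "talk.politics.misc": "politics",
    "talk.religion.misc": "religion",
}


def build_combined_data_sets(suicide_notes, newsgroups):
    """Pair the full set of suicide notes with each newsgroup in turn.

    Returns a dict mapping a short data set name to a list of
    (text, label) tuples, where the label records the document's origin.
    """
    data_sets = {}
    for group, posts in newsgroups.items():
        labeled = [(text, "suicide_note") for text in suicide_notes]
        labeled += [(text, group) for text in posts]
        data_sets[SHORT_NAMES[group]] = labeled
    return data_sets


# Placeholder strings only; no real notes or posts are reproduced here.
suicide_notes = ["<note 1>", "<note 2>"]
newsgroups = {
    "talk.politics.guns": ["<guns post 1>", "<guns post 2>"],
    "talk.politics.mideast": ["<mideast post 1>"],
    "talk.politics.misc": ["<politics post 1>"],
    "talk.religion.misc": ["<religion post 1>"],
}
data_sets = build_combined_data_sets(suicide_notes, newsgroups)
```

Keeping the origin label next to each text means it can be hidden from the clustering algorithm and used afterwards to check which documents each cluster contains.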
Data The notes in the suicide note database were
collected from around the United States. They
were either handwritten or typed. Once a note was
acquired, it was scanned and typed into the
database exactly as seen. As a non-suicidal
control group, four of the twenty newsgroups from
the University of California, Irvine (UCI)
Machine Learning Repository were chosen. The
selected newsgroups were
talk.politics.guns, talk.politics.mideast,
talk.politics.misc, and talk.religion.misc.
Conclusion LIWC software showed statistically
significant differences in the semantic and
syntactic spaces of suicide notes and newsgroups.
Sequential information bottleneck clustering was
able to find the same subgroups of suicide notes
even when different types of newsgroups were
present. In our analysis, one subgroup showed no
emotional content while the other was emotionally
charged.
This finding is consistent with the 1959 work of
Tuckman et al., which showed that suicide notes
fall into six emotional categories, including
emotionally neutral, emotionally positive,
emotionally negative directed inward, emotionally
negative directed outward, and emotionally
negative directed inward and outward (Tuckman et
al., 1959).