Title: Michael P. Oakes
1 Michael P. Oakes
2 Contents
- Proposals for a Masters programme in Natural Language Processing
- Future research plans / link with Wolverhampton
- Plans for publications
- Plans for grant proposals
- Other funding ideas
3 Proposals for a Masters programme in Natural Language Processing
- Some preliminaries
- Entry requirements: first or second class degree in a related discipline. Computer programming will be taught from scratch.
- Funding: Erasmus, European Social Fund, ESRC Masters training package scheme for programme development, work-based learning.
- Students must receive an accurate idea of the content of the programme beforehand.
- Induction week: meet the teaching team, familiarity with the University, formal registration, etc.
- Diploma, Certificate and Masters awards. 8 taught modules (24 lectures, 18 hours practical, 58 directed reading, 50 self-directed research).
4 Certificate Stage
5 Diploma Stage
6 Project
- Close links with industry established through 3-month industrial placements, based either with the company or at the University.
- The sponsor will be from either industry or academia, and there will also be a staff member from Wolverhampton to act as supervisor.
- Project management (TOR, reviews), poster, viva, dissertation (typically introduction, research, analysis, implementation, evaluation / experiments, reflective conclusions).
7 Administration
- Programme board of studies: Institute Director or deputy, student representatives, one or more employers' representatives, module leaders, programme leader. Responsible for the management of the programme and the well-being of each module.
- Board of assessment: decides student progression. Includes the External Examiner; no student representatives.
- Internal (prior to hand-out) and External (sample work shown prior to programme assessments) moderation.
- Other quality control: student and staff feedback, External Examiner's report, programme annual report.
- Each student has a personal tutor and student handbook.
- Timely, face-to-face assessment may improve student satisfaction.
8 Future Research Plans,
- And how these might complement the research topics of the Research Group in Computational Linguistics.
9 Automatic Summarisation
- CAST Project: produced an automatic summarisation tool; term-based summarisation.
- Concept-Based Abstracting (Paice).
- TRESTLE (Gaizauskas).
- David Evans: evaluation of information extraction.
- Query-based summaries. Intrinsic (representativeness) vs. extrinsic (judgeability) evaluation (Liang).
- SumTrain reached second round of EU evaluation.
- Extraction of statistics-related phrases, e.g. "greater than", "significant reduction in", "was directly proportional to", "did not affect".
10 Concept-Based Abstracting Project
- window length 4
- STOP 6 "and foliar treatment AGEN"
- 5 "foliar treatment AGEN "
- 5 "treatment AGEN AGEN"
- 4 "effect of mildew AGEN"
- 3 "AGEN gave a significant"
- 2 "AGEN was the most"
- 2 "AGEN at different sowing"
- 2 "AGEN increased fertile tillers"
- LOW-FQ 1 "effect of AGEN sprays"
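The list above reads like a frequency table of 4-word windows around the AGEN placeholder. A minimal Python sketch of how such a table might be built follows. It rests on assumptions not stated on the slide: that STOP marks a window beginning with a stop word, that LOW-FQ marks a window seen fewer than two times, and that the stop-word list, `MIN_FREQ` and all function names are hypothetical.

```python
from collections import Counter

# Assumed, illustrative stop-word list and frequency threshold.
STOP_WORDS = {"and", "the", "of", "a", "in"}
MIN_FREQ = 2


def agen_windows(tokens, size=4, marker="AGEN"):
    """Yield each run of `size` consecutive tokens that contains `marker`."""
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        if marker in window:
            yield " ".join(window)


def rank_patterns(sentences, size=4):
    """Count marker-bearing windows and label each as STOP, LOW-FQ or unlabelled."""
    counts = Counter()
    for sentence in sentences:
        counts.update(agen_windows(sentence.split(), size))
    ranked = []
    for pattern, freq in counts.most_common():
        if pattern.split()[0].lower() in STOP_WORDS:
            label = "STOP"      # assumed meaning: window starts with a stop word
        elif freq < MIN_FREQ:
            label = "LOW-FQ"    # assumed meaning: too rare to be a reliable pattern
        else:
            label = ""
        ranked.append((label, freq, pattern))
    return ranked


if __name__ == "__main__":
    demo = [
        "and foliar treatment AGEN",
        "and foliar treatment AGEN",
        "AGEN was the most effective",
        "AGEN was the most useful",
        "effect of AGEN sprays",
    ]
    for label, freq, pattern in rank_patterns(demo):
        print(f'{label:7}{freq} "{pattern}"')
```

The demo sentences are made up from fragments on the slide; a real run would take sentences in which agent terms have already been replaced by AGEN.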
11 Automatic Terminology Processing
- Le An Ha: looked at the concept of a terminology rather than individual terms. Knowledge patterns from glossaries: store of terms and relations between them.
- David Evans: identification of terms using TF.IDF and other statistical methods (see slide 20).
- Shiyan Ou: sentiment classification (see slide 20).
- Constantin Orasan: corpus of junk mail (spam filters, Farrow).
- Constantin Orasan: analysis of genre differences; project on Language, Computation and Style (authorship).
- Englishes, Scrip newsfeeds, BELGA: feature extraction for text classification.
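As a reminder of what TF.IDF term weighting involves, here is a minimal Python sketch. It assumes the common formulation tf × log(N / df) with whitespace tokenisation; the slides do not say which variant was used, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, score = tf * log(N / df)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


if __name__ == "__main__":
    docs = [
        "mildew treatment reduced mildew",
        "the treatment was cheap",
        "the harbour was clean",
    ]
    for term, score in sorted(tf_idf(docs)[0].items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{score:.3f}")
```

Terms that are frequent in one document but rare across the collection score highest, which is why the measure is a natural first candidate for term identification.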
12 Annotation tools
- Constantin Orasan: PALinkA, automatic annotation of anaphoric links.
- Lewandowska, Oakes and Rayson: part-of-speech and semantic code tagging in English; alignment enables partial semantic tagging of L2.
13 Annotation: Aligned and Partially Tagged Polish text (Lewandowska, Oakes and Rayson)
- Tak jest_A3 mowi Polemarch_Z99 a do_Z5 tego jeszcze urzadra nocne nabozenstwo, ktore_Z8 warto zobaczyc
- __PUNC That_DD1_Z8 s_VBZ_A3 the_AT_Z5 way_NN1_X4.2 of_IO_Z5 it_PPH1_Z8 ,_,_PUNC __PUNC said_VVD_Q2.1 Polymarchus_NP1_Z99 ,_,_PUNC __PUNC and_CC_Z5 ,_,_PUNC besides_RR_Z5 ,_,_PUNC there_EX_Z5 is_VBZ_A3 to_TO_Z5 be_VBI_A3 a_AT1_Z5 night_NNT1_T1.3 festival_NN1_K1/S1.1.3 which_DDQ_Z8 will_VM_T1.1.3 be_VBI_A3 worth_II_I1.3 seeing_VVG_X3.4 ._._PUNC
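The idea of copying semantic codes from tagged English onto aligned Polish words can be sketched as below. This is only an illustration under assumptions: the alignment is given as hypothetical (English index, Polish index) pairs, unaligned Polish words stay untagged, and `project_tags` is not the name of any tool mentioned on the slides.

```python
def project_tags(l2_tokens, tagged_l1, alignment):
    """Copy the semantic tag of each aligned L1 word onto its L2 partner.

    l2_tokens: list of L2 words.
    tagged_l1: list of (word, semantic_tag) pairs for the L1 sentence.
    alignment: list of (l1_index, l2_index) links (assumed to be given).
    """
    projected = list(l2_tokens)
    for l1_index, l2_index in alignment:
        tag = tagged_l1[l1_index][1]
        projected[l2_index] = f"{l2_tokens[l2_index]}_{tag}"
    return " ".join(projected)


if __name__ == "__main__":
    polish = ["Tak", "jest", "mowi", "Polemarch"]
    english = [("is", "A3"), ("said", "Q2.1"), ("Polymarchus", "Z99")]
    links = [(0, 1), (2, 3)]  # hypothetical word alignment
    print(project_tags(polish, english, links))
```

With these hypothetical links the output is `Tak jest_A3 mowi Polemarch_Z99`, which mirrors the partial tagging shown on the slide: only aligned words receive a code.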
14 Mobile Devices
- Laura Hasler and Dalila Mekhaldi: QALL-ME, Question-Answering for Digital Phones.
- Chufeng Chen: annotation of digital photographs taken with a GPS camera. A gazetteer translated latitude and longitude data into place name and geographical feature, e.g. Lat 54.91, Long -1.4, place Sunderland, feature harbour. Episodic memory.
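A gazetteer lookup of this kind can be sketched as a nearest-entry search. The Python below is a minimal illustration, not the system described: it assumes great-circle (haversine) distance, reads the Sunderland example as latitude 54.91, longitude -1.4, and adds a second, made-up entry purely so that the search has something to choose between.

```python
import math

# (latitude, longitude, place, feature)
GAZETTEER = [
    (54.91, -1.4, "Sunderland", "harbour"),          # example from the slide
    (52.59, -2.13, "Wolverhampton", "city centre"),  # illustrative entry only
]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points on Earth."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def lookup(lat, lon, gazetteer=GAZETTEER):
    """Return the place name and feature of the nearest gazetteer entry."""
    entry = min(gazetteer, key=lambda e: haversine_km(lat, lon, e[0], e[1]))
    return {"place": entry[2], "feature": entry[3]}


if __name__ == "__main__":
    print(lookup(54.90, -1.38))
```

A production gazetteer would need a distance cut-off and a spatial index; the linear scan here is only meant to show the mapping from coordinates to annotation.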
15 Other Related Work
- Andrea Mulloni: Corpus Linguistics.
- Empirical vs. Chomskyan.
- Own interest: Statistics for Corpus Linguistics.
- Driving the process rather than merely testing for statistical significance, e.g. Mutual Information to find collocations.
- Irina Temnikova: Machine Translation.
- Alignment for example-based machine translation (Lewandowska, Oakes).
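For the Mutual Information point above, a minimal Python sketch of scoring adjacent word pairs is given here. It assumes pointwise mutual information, log2(P(x,y) / (P(x)P(y))), over bigrams, with a hypothetical `min_count` filter because the measure overrates very rare pairs.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information for each adjacent word pair seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_tokens
        p_y = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


if __name__ == "__main__":
    text = "strong tea and strong tea with strong coffee and weak tea"
    for pair, score in pmi_scores(text.split()).items():
        print(pair, round(score, 3))
```

A positive score means the pair occurs together more often than its parts would predict by chance, which is what makes it usable for driving collocation discovery rather than only testing a hypothesis.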
16 Plans for Publications (1)
- Book chapters in press:
- Processing Multilingual Corpora, Chapter 32 of Corpus Linguistics: An International Handbook, eds. Anke Lüdeling and Merja Kytö, Mouton de Gruyter.
- Corpus Linguistics and Stylometry, Chapter 52, ibid.
- Corpus Linguistics and Language Variation, in Contemporary Approaches to Corpus Linguistics, ed. Paul Baker, Continuum.
- Javanese, in Languages of the World, ed. Bernard Comrie, Routledge.
- J. Vilares, M. Oakes and M. Vilares, A Knowledge-Light Approach to Query Translation in CLIR. RANLP V, ed. N. Nicolov, Benjamins.
17 Plans for Publications (2)
- Under second review:
- S-W. Ke, C. Bowerman and M. Oakes, Automatic classification of personal email with PERC and time-related strategies, ACM Transactions on Information Systems.
- W-C. Lin, M. Oakes and J. Tait, Improving image annotation via representative feature selection, Cognitive Processing.
18 Plans for Publications (3)
- Future plans
- VITALAS: Video and image Indexing and reTrievAl in the LArge Scale.
- Update Statistics for Corpus Linguistics: sold over 1500 copies, but now 10 years old.
- Last chapter was Literary Detective Work, which could be a book in its own right: disputed authorship (compendium of techniques, Shakespeare, religious texts, still unsolved mysteries e.g. The Quiet Don, Marxism and the Philosophy of Language), unknown languages (Linear B, Voynich manuscript). JLLC, QL.
19 Plans for Grant Proposals (1)
- Closing the Semantic Gap
- Related to machine learning (boosting), caption analysis, gazetteers, alignment of low-level image content features and high-level semantic features (words).
- Son of VITALAS?
20 Plans for Grant Proposals (2)
- Which words are truly characteristic of a corpus? χ² (chi-squared) etc.
- Countable linguistic features.
- Measures from IR, e.g. PageRank (Łódź, Palomino).
- AHRC (if theoretical, Englishes), ESRC (if applied, e.g. spam filters).
- Sentiment analysis (Thijs Westerveld at Teezir): mining online opinions. Cheerful, chic, cheap, clean vs. chaos, cranky, cumbersome, damaged.
- Interface between NLP and IR: sentence analysis, e.g. adjectives, negatives; follow links to navigate websites.
- IR: relevant vs. irrelevant documents.
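To make the "characteristic words" question concrete, the chi-squared comparison of one word's frequency in two corpora can be sketched as follows. This is a minimal Python illustration assuming the plain Pearson statistic on a 2×2 table (word vs. all other words, corpus A vs. corpus B) with no continuity correction; the function name and the counts in the demo are hypothetical.

```python
def chi_squared(freq_a, size_a, freq_b, size_b):
    """Pearson chi-squared for one word: freq_* occurrences in corpora of size_* tokens."""
    observed = [
        [freq_a, size_a - freq_a],
        [freq_b, size_b - freq_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


if __name__ == "__main__":
    # Made-up counts: a word seen 30 times in 100 tokens vs. 10 times in 100 tokens.
    print(chi_squared(30, 100, 10, 100))
```

With one degree of freedom, a value above about 3.84 corresponds to p < 0.05, although with corpus-sized counts almost every word passes that test, which is part of why the question on the slide is worth a proposal.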
21 Plans for Grant Proposals (3)
- Temporal relations in query language modelling (Dawei Song).
- Temporal similarity combined with semantic similarity gives overall similarity.
- The temporal similarity between texts (e.g. query and document) can be estimated by a) time stamp, b) temporal logic between the texts (Andrea Setzer).
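One way the time-stamp option could feed into an overall score is sketched below in Python. Everything specific here is an assumption made for illustration: the exponential decay with a 30-day half-life, the linear mixing weight, and both function names. The slides do not commit to any particular formula.

```python
from datetime import datetime


def temporal_similarity(stamp_a, stamp_b, half_life_days=30.0):
    """Similarity in (0, 1] that halves for every half_life_days between the two time stamps."""
    gap_days = abs((stamp_a - stamp_b).total_seconds()) / 86400.0
    return 0.5 ** (gap_days / half_life_days)


def overall_similarity(semantic, temporal, weight=0.5):
    """Linear mix of semantic and temporal similarity; weight is the share given to semantics."""
    return weight * semantic + (1.0 - weight) * temporal


if __name__ == "__main__":
    query_time = datetime(2008, 10, 18)
    document_time = datetime(2008, 9, 18)
    t_sim = temporal_similarity(query_time, document_time)
    print(round(overall_similarity(0.8, t_sim), 3))
```

Option b), temporal logic between the texts, would replace `temporal_similarity` with a score derived from extracted temporal relations rather than from metadata.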
22 Plans for Grant Proposals (4)
- Corpus Profiling Workshop on October 18th.
- Exploring how corpus characteristics affect the behaviour of techniques in IR and NLP, and setting out a roadmap for a shared research agenda.
- Data set profile impacts on automatic classification, IR, anaphora resolution, automatic summarisation and word sense disambiguation.
23 Other Funding Ideas
- IRSG-like Industry Day to foster industrial contacts (consultancy? grant proposals?).
- Organise conferences, e.g. bid for Corpus Linguistics, CLEF, ECIR.
- Exploitation of Intellectual Property.
- Is there an equivalent of CEDEC (Computing and Engineering Distance Education Centre) with whom we can discuss marketing programmes world-wide / part-time? Work-based learning?