Title: Text to speech to text: a third orality?
1. Text to speech to text: a third orality?
Lawrie Hunter, Kochi University of Technology
http://www.core.kochi-tech.ac.jp/hunter
2. Current state
Fragmentation of knowledge as a result of the ongoing creation of research niches.
A voracious, yet protective and covetous knowledge industry.
3. Current state
Isn't CALL just a subset of User Experience (UX)?
4. URGENT: Just-in-time learner sociology
URGENT: Near-instant learner profiling
Upgrade: Learner > USER
User Experience (UX) practice: UZANTO's MindCanvas
- user profiling for a large target group in a matter of hours
- RUMM: rapid user mental modelling
- GEMS: game emulation
This may be very fruitfully adapted to the foundation explorations leading to CALL decision-making.
Hunter (2006) Learners are evolving. The expanding palette: Emergent CALL paradigms (invited virtual presentation). Antwerp CALL 2006.
http://www.core.kochi-tech.ac.jp/hunter/professional/CALLparadigms/index.html
5. Now text-to-speech and speech-to-text (T2S2T) software has become truly usable in a very practical sense. This blurs the line between speech and text in a very immediate way.
http://www.nextuptech.com/
http://www.nuance.com/naturallyspeaking/
6. Usable T2S2T
No more typing. No more reading. No more hands.
Composition by speaking...ooh!
Information acquisition by listening...ahh!
If we do this, we will be in a new orality.
7. What?? Audio is lame: VIDEO is the game. We are in the YouTube era. Get a Second Life!
8. T2S will be fully usable in 2 (or x) years: we must assume the future and shift our place of work there.
9. QUESTION: For second language learning systems development, is audio going out?
10. TODAY: A search for principles governing the use of voice in CALL
11. Investigation of voice and cognition
12. Walter Ong, 1982, Orality and Literacy: The Technologizing of the Word
PRIMARY ORAL cultures (no system of writing) think differently from CHIROGRAPHIC cultures.
14. Walter Ong, 1982, Orality and Literacy: The Technologizing of the Word
Electronic media (e.g. telephone, radio and television) brought about a second orality. (paraphrase)
Both primary and secondary oralities afford a strong sense of membership in a group. (paraphrase)
BUT secondary orality is "essentially a more deliberate and self-conscious orality, based permanently on the use of writing and print," and produces much larger groups.
15. Kathleen Welch rejects claims that Ong posits a mutually exclusive, competitive, reductive orality-literacy divide. Welch argues that Ong emphasizes
- a mingling of these types of consciousness
- the tenacity of established forms as new ones appear
Welch, K. (1999) Electric Rhetoric: Classical Rhetoric, Oralism, and a New Literacy. MIT Press. p. 59
16. Welch argues that TV's ubiquity has resulted in
a new, electronic literacy. We shall not go
there today.
17. Workable T2S2T promises to change the nature of cognitive load constraints in text production and decoding, and hence in language learning tasks.
19. Workable T2S2T
There is now S2T (Dragon Voice) for Indian English, British English... but not for Japanese English yet. (Ever?)
http://labnol.blogspot.com/2007/01/dragon-naturallyspeaking-9-speech.html
So the tech is there for computers to decode human speech better than humans can...?
20. HOWEVER: we don't know much about how orality works. Perhaps that is because orality is so ingrained in us.
21. Walter Ong, 1982, Orality and Literacy: The Technologizing of the Word
The three stages of consciousness
- Primary orality: 200,000 years
- Literacy: 2800 years (invention of the phonetic alphabet in the 8th century BCE)
- Secondary orality: 163 years (telegraphy, USA, 1844)
Carpenter, R. (1933) The antiquity of the Greek alphabet. American Journal of Archaeology 37: 8-29.
22. WIRED FOR SPEECH
Orality has been part of human life for a long time.
After 200,000 years of evolution ...humans have become voice-activated, with brains that are wired to equate voices with people and to act quickly on that information.
Nass, C. and Brave, S. (2005) Wired for speech. MIT Press.
23. Writing: a secondary modelling system
Lotman, J., trans. R. Vroon (1977) The structure of the artistic text. Michigan Slavic Studies, 7.
Writing can never exist without orality. p. 8
Speeches that were studied as rhetoric could only be studied if they were transcribed.
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge.
24. Writing: a secondary modelling system
Lotman, J., trans. R. Vroon (1977) The structure of the artistic text. Michigan Slavic Studies, 7.
...to this day no concepts have yet been formed for effectively, let alone gracefully, conceiving of oral art as such without reference, conscious or unconscious, to writing. p. 10
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge.
25. Psychodynamics of orality
...you know what you can recall.
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge.
26. Psychodynamics of orality: Pythagoras and the acousmatics
The term acousmatic dates back to Pythagoras, who is believed to have tutored his students from behind a screen so as not to let his presence distract them from the content of his lectures.
wikipedia.org, May 20, 2007, edited from Chion, M. (1994) Audio-Vision: Sound on Screen. Columbia University Press.
27. Psychodynamics of orality: Pythagoras and the acousmatics
In cinema, acousmatic sound is sound one hears without seeing an originating cause: an invisible sound source. Radio, phonograph and telephone, all of which transmit sounds without showing the source cause, are acousmatic media.
wikipedia.org, May 20, 2007, edited from Chion, M. (1994) Audio-Vision: Sound on Screen. Columbia University Press.
28. Psychodynamics of orality
Acousmatic sound is ubiquitous in CALL. Aren't there situations where acousmatic sound is appropriate, and situations where it is not?
29. Orality and writing production
Kellogg: Sentence Production Demands Verbal Working Memory
- Orthographic as well as phonological representations must be activated for written spelling. (Bonin, Fayol & Gombert, 1997)
- Verbal WM is necessary to maintain representations during grammatical, phonological, and orthographic encoding. (Levy & Marek, 1999; Chenoweth & Hayes, 2001; Kellogg, Olive & Piolat, 2006)
Kellogg, R. (2006) Training writing skills: A cognitive developmental perspective. EARLI SigWriting 2006, Antwerp.
http://webhost.ua.ac.be/sigwriting2006/Kellogg_SigWriting2006.pdf
30. Audio sources in life
John Thackara tells of Ivan Illich's finding that
- In the 1930s, 9 out of 10 words a man heard by age 20 were spoken directly to him.
- In the 1970s, 9 out of 10 words a man heard by age 20 were spoken through a loudspeaker.
Illich (1982): Computers are doing to communication what fences did to pastures and what cars did to streets.
Book: In the Bubble. Blog: http://www.doorsofperception.com/
34. We are innately orate
Human beings can quickly distinguish one person's voice from another. p. 3
- Even in the womb we can distinguish our mother's voice from that of another.
- A few days after birth, newborns prefer their mother's voice to that of others, and can distinguish one unfamiliar voice from another.
- By 8 months of age we can attend to one voice even when another is speaking at the same time.
We know these things from differing heartbeat responses.
Nass, C. and Brave, S. (2005) Wired for speech. MIT Press.
35. Humans: experts at extracting social information from speech
Word choice carries social information. UX work makes choices such as blaming:
1. Speak up.
2. I'm sorry, I didn't catch that.
3. We seem to have a bad connection. Could you please repeat that?
Nass, C. and Brave, S. (2005) Wired for speech. MIT Press.
36. Humans: experts at extracting social information from speech
Word choice carries social information. UX work makes choices such as voice quality:
- Booming deep voice: "Could I possibly ask you if you wouldn't mind doing a tiny favor?"
- High-pitched, soft voice: "Pick up that shovel and start digging!"
Nass, C. and Brave, S. (2005) Wired for speech. MIT Press.
37. Humans automatically react socially to voice
...the conscious knowledge that speech can have a non-human origin is not enough for the brain to overcome the historically appropriate activation of social relationships by voice, even when voice quality is low and speech understanding is poor.
Nass, C. and Brave, S. (2005) Wired for speech. MIT Press.
38. Interiority of sound
...in an oral noetic economy, mnemonic serviceability is sine qua non... p. 70
In other words, oral information must be arranged in a certain way, a visual way, if it is to be remembered.
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge.
39. Incorporating interiority
The eye cannot perceive interiority, only surfaces.
Taste and smell are not much help in registering interiority/exteriority.
Touch can detect interiority but in the process damages it.
Hearing can register interiority without violating it.
Sight isolates, sound incorporates.
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge.
40. Incorporating interiority
41. Oral memory
In primary oral cultures, need for an aide-memoire:
- heavily rhythmic speech
- balanced patterns
- epithetic expressions
- formulary expressions
- standard thematic settings
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge. p. 33
42. Oral memory
In primary oral cultures, thought and expression are additive rather than subordinative.
Ong, W. (1982) Orality and literacy: The technologizing of the word. 1997 reprint, Routledge. p. 37 ff.
43. Tentative observations based on the exploratory hands-on experience of second language users.
Innisfree 1, Innisfree 2, Innisfree 3, Coney Island 1, Coney Island 2, Coney Island 3
PhD technical writing class, KUT, May 24, 2007
46. Tentative observations based on the exploratory hands-on experience of second language users.
Self-reported estimates of comprehension of samples.
PhD technical writing class, KUT, May 24, 2007
48. How might language learning support systems be
influenced by the new T2S2T technological
reality?
49. Articulation at the phrase level
In the learner's awareness:
- S2T software foregrounds articulation
- T2S foregrounds intonation, blending, pausing
50. Articulation at the phrase level
Can S2T be used to improve pronunciation?
Mitra, S., Tooley, J., Inamdar, P. and Dixon, P. (2003) Improving English Pronunciation: An Automated Instructional Approach. Information Technologies and International Development, Volume 1, Number 1, Fall 2003, 75-84. Massachusetts Institute of Technology.
http://www.mitpressjournals.org/doi/abs/10.1162/itid.2003.1.1.75
51. Articulation at the phrase level
Can S2T be used to improve pronunciation?
Mitra, S., Tooley, J., Inamdar, P. and Dixon, P. (2003) Improving English Pronunciation: An Automated Instructional Approach. Information Technologies and International Development, Volume 1, Number 1, Fall 2003, p. 82. Massachusetts Institute of Technology.
http://www.mitpressjournals.org/doi/abs/10.1162/itid.2003.1.1.75
52. The looming prospect of a text-reduced world
Specificity as a foreign language
e.g. University X web site: Japanese interview > English web site
53. T2S2T brings richness to materials design.
T2S2T should imply that there will be a broad, instantaneous choice of interface with data.
Aside from tangible choices of medium, other parameters demand attention:
input density
- number of communication objects per signal
input complexity
- degree of text reduction
- visual field richness
- number of simultaneous signals
54. Sometimes signals are
- 1. complementary, e.g. Chang's sound track supplies one of many possible intonations for a hypertext (http://www.yhchang.com)
- 2. conflicting, e.g. phone user in a movie theater
- 3. mutually irrelevant, e.g. Muzak vs. supermarket sale signs
- 4. channel competing, e.g. PowerPoint text and speech; e.g. mosquito buzz vs. TV images
- 5. internal-external conflicting, e.g. on-screen text back-checking during S2T writing
55. A cubist look at text and attention
Chang, Young-Hae. NIPPON.html
Here, is text so reduced as to be iconic? How is this parallel to sound objects?
http://www.yhchang.com
56. A marvel in this age of niche books: many
answers from one source
evolving from
Nass and Brave, Wired for speech.
57. Improving voice interfaces by applying knowledge of human speech
- gender choice
- gender stereotyping
- voice personalities
- accent, race, ethnicity
- user emotion / voice emotion
- voice and content emotion
- synthetic vs. recorded
- variation of synthetic voice character
- assignment of humanity
- input type
- error and blame
Nass and Brave, Wired for speech.
58. Improving voice interfaces by applying knowledge of human speech
User emotion / voice emotion:
- Emotion can direct users towards or away from an aspect of an interface.
- Emotion affects cognition, e.g. in vehicle driving support software.
- Finding: people find it easier and more natural to attend to voice emotions consistent with their own present emotions. p. 77
Nass and Brave, Wired for speech.
59. A promising task design tool: Baddeley and Hitch's 1986 model of working memory, with its 3 components.
Three-component model of working memory
- assumes an attentional controller, the central executive, aided by two subsidiary systems:
- the phonological loop, capable of holding speech-based information, and
- the visuospatial sketchpad, which performs a similar function for visual information.
The two subsidiary systems form active stores that are capable of combining information from sensory input and from the central executive. Hence a memory trace in the phonological store might stem either from a direct auditory input or from the subvocal articulation of a visually presented item such as a letter.
Please read this on Hunter's web site.
60. Working memory model extended
Phonological loop
- important for short-term storage
- ALSO for long-term phonological learning
Associated with
- development of vocabulary in children
- speed of FLA in adults
[Diagram labels: Central Executive; Phonological Loop; Visuo-spatial Sketchpad; Visual semantics; Episodic LTM; Language]
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11): 417-423.
61. Working memory model extended
Phonological loop effects
- Phonological similarity
- Word-length
- Articulatory suppression
- Code transfer
- Central rehearsal code, not operation
[Diagram labels: Central Executive; Phonological Loop; Visuo-spatial Sketchpad; Visual semantics; Episodic LTM; Language]
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11): 417-423.
62. A most promising task design tool: Baddeley's model of working memory, with its (since 2000) 4 components.
The episodic buffer
- is assumed capable of storing information in a multi-dimensional code
- thus provides a temporary interface between the slave systems and LTM
- is assumed to be controlled by the central executive
- serves as a modelling space that is separate from LTM, but which forms an important stage in long-term episodic learning
[Diagram labels: Central Executive; Phonological Loop; Visuo-spatial Sketchpad; Episodic Buffer; Visual semantics; Episodic LTM; Language]
Shaded areas: crystallized cognitive systems capable of accumulating long-term knowledge.
Unshaded areas: fluid capacities (such as attention and temporary storage), themselves unchanged by learning.
Baddeley, A. D. (2000) The episodic buffer: a new component of working memory? Trends in Cognitive Sciences 4(11): 417-423.
63. Current state
Isn't CALL just a subset of User Experience (UX)?
64. Thank you for your kind attention.
Don't hesitate to write to me.
Lawrie Hunter, Kochi University of Technology
http://www.core.kochi-tech.ac.jp/hunter