Singer Similarity: A Brief Literature Review
- Catherine Lai
- MUMT-611 MIR
- March 24, 2005
Outline of Presentation
- Introduction
- Motivation
- Related research
- Recent publications
- Kim & Whitman, 2002
- Liu & Huang, 2002
- Tsai, Wang, Rodgers, Cheng, & Yu, 2003
- Bartsch & Wakefield, 2004
- Discussion
- Conclusion
Introduction
- Motivation
- Multitude of audio files in circulation on the Internet
- Replace human documentation efforts and organize collections of music recordings automatically
- Singer identification is relatively easy for humans but not for machines
- Related research
- Speaker identification
- Musical instrument identification
Kim & Whitman, 2002
- "Singer Identification in Popular Music Recordings Using Voice Coding Features" (MIT Media Lab)
- Automatically establish the identity of the singer using acoustic features extracted from songs in a database of pop music
- Perform segmentation of the vocal regions prior to singer identification
- Classifier uses features drawn from voice coding based on Linear Predictive Coding (LPC)
- Good at highlighting formant locations
- Regions of resonance are perceptually significant
Kim & Whitman, 2002: Detection of Vocal Region
- Detect regions of singing by detecting energy within frequencies bounded by the range of vocal energy
- Filter the audio signal with a band-pass filter
- Used a Chebyshev IIR digital filter of order 12
- Attenuates other instruments that fall outside the vocal range, e.g. bass and cymbals
- Voice is not the only remaining instrument in that region
- Discriminate the other sounds, e.g. drums, using a measure of harmonicity
- A vocal segment is roughly 90% voiced, and voiced sound is highly harmonic
- Measure the harmonicity of the filtered signal within each analysis frame and threshold it against a fixed value (a sketch follows this list)
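Below is a minimal sketch of this detection step using NumPy and SciPy, assuming an order-12 Chebyshev Type I band-pass filter and a crude autocorrelation-based harmonicity measure; the pass band (200-2000 Hz), frame size, and the 0.5 threshold are illustrative values, not those of Kim & Whitman.

    import numpy as np
    from scipy.signal import cheby1, sosfiltfilt

    def detect_vocal_frames(x, sr, frame_len=2048, hop=1024, threshold=0.5):
        # Order-12 Chebyshev Type I band-pass roughly covering vocal energy.
        sos = cheby1(N=12, rp=1, Wn=[200, 2000], btype="bandpass",
                     fs=sr, output="sos")
        y = sosfiltfilt(sos, x)
        vocal = []
        for start in range(0, len(y) - frame_len, hop):
            frame = y[start:start + frame_len]
            # Crude harmonicity: strongest autocorrelation peak in a plausible
            # pitch-lag range, relative to the zero-lag energy.
            ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
            if ac[0] <= 0:
                vocal.append(False)
                continue
            harmonicity = np.max(ac[int(sr / 500):int(sr / 60)]) / ac[0]
            vocal.append(harmonicity > threshold)
        return np.array(vocal)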
Kim & Whitman, 2002: Feature Extraction
- 12-pole LP analysis, based on the general principle behind LPC for speech, used for feature extraction
- LP analysis performed on linear and warped frequency scales
- Linear scale treats all frequencies equally
- Human ears are not equally sensitive to all frequencies
- Warping function adjusts the analysis to follow the Bark scale, which approximates the frequency sensitivity of human hearing
- Warped analysis better captures formant locations at lower frequencies (a sketch of the linear-scale extraction follows this list)
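A minimal sketch of the linear-scale 12-pole LP feature extraction, assuming librosa and arbitrary frame/hop sizes; the warped (Bark-like) variant, which would replace the unit delays with first-order all-pass sections, is not implemented here.

    import numpy as np
    import librosa

    def lpc_features(x, order=12, frame_len=2048, hop=1024):
        frames = librosa.util.frame(x, frame_length=frame_len, hop_length=hop)
        feats = []
        for frame in frames.T:
            windowed = frame * np.hanning(frame_len)
            # librosa.lpc returns the prediction polynomial [1, a_1, ..., a_order];
            # the leading 1 carries no information, so keep only the rest.
            a = librosa.lpc(windowed, order=order)
            feats.append(a[1:])
        return np.array(feats)  # shape: (n_frames, order)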
Kim & Whitman, 2002: Experiments
- Data set includes 17 different singers and 200 songs
- 2 classifiers, a Gaussian Mixture Model (GMM) and an SVM, used on 3 different feature sets
- Linear-scale, warped-scale, and combined linear and warped data
- Run on entire songs and on segments classified as vocal only (a sketch of the two classifiers follows this list)
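A minimal sketch of the two classifier setups with scikit-learn, assuming frame-level feature matrices from the LP analysis above: one GMM per singer scored by average log-likelihood, and a single multi-class SVM. The number of mixture components and the SVM kernel are placeholder choices, not the paper's configuration.

    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVC

    def train_gmms(features_by_singer, n_components=16):
        # One GMM per singer, fit on that singer's frame-level feature matrix.
        return {singer: GaussianMixture(n_components=n_components).fit(X)
                for singer, X in features_by_singer.items()}

    def classify_song_gmm(gmms, X_song):
        # Pick the singer whose model gives the highest average log-likelihood
        # over the song's frames.
        scores = {singer: gmm.score(X_song) for singer, gmm in gmms.items()}
        return max(scores, key=scores.get)

    def train_frame_svm(X, y):
        # Frame-level multi-class SVM; a song label can then be taken as the
        # majority vote over its frame predictions.
        return SVC(kernel="rbf").fit(X, y)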
Kim & Whitman, 2002: Results
- Linear frequency features tend to outperform the warped frequency features when each is used alone; the combination of both is best
- Song and frame accuracy increase when using only vocal segments with the GMM
- Song and frame accuracy decrease when using only vocal segments with the SVM

[Results table from Kim & Whitman, 2002]
Kim & Whitman, 2002: Discussion and Future Work
- The better performance of linear frequency scale features vs. warped frequency scale features indicates that
- The machine finds the increased accuracy of the linear scale at higher frequencies useful
- Contrary to the human auditory system
- The decreased performance of the SVM on vocal-only segments is puzzling
- The SVM may be using aspects of the features not specifically related to the voice
- Future work: add high-level musical knowledge to the system
- Attempt to identify song structure, such as locating verses or choruses
- Higher probability of vocals in these sections
Liu & Huang, 2002
- "A Singer Identification Technique for Content-Based Classification of MP3 Music Objects"
- Automatically classify MP3 music objects according to their singers
- Major steps
- Coefficients extracted from the compressed raw data are used to compute the MP3 features for segmentation
- These features are used to segment MP3 objects into a sequence of notes or phonemes
- For each MP3 phoneme in the training set, its MP3 features are extracted and stored with its associated singer in a phoneme database
- Phonemes in the phoneme database are used as discriminators in an MP3 classifier to identify the singers of unknown MP3 objects

[Figure: waveform of two phonemes, from Liu & Huang, 2002]
Liu & Huang, 2002: Classification
- The number of different phonemes a singer can sing is limited, and singers with different timbres possess unique phoneme sets
- Phonemes of an unknown MP3 song can be associated with similar phonemes of the same singer in the phoneme database
- kNN classifier used for classification (a sketch follows this list)
- Each unknown MP3 song is first segmented into phonemes
- The first N phonemes are used and compared with every discriminator in the phoneme database
- The k closest neighbors are found
- For each of the k closest neighbors, if its distance is within a threshold, a weighted vote is given
- The k x N weighted votes are accumulated per singer
- The unknown MP3 song is assigned to the singer with the largest score
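A minimal sketch of this weighted-vote kNN decision, assuming phonemes are represented as fixed-length feature vectors compared with Euclidean distance and weighted by 1/(1 + distance); these choices, and the default k and threshold, are illustrative only.

    import numpy as np

    def identify_singer(query_phonemes, db_features, db_singers, k=80, threshold=0.2):
        scores = {}
        for q in query_phonemes:                   # first N phonemes of the song
            dists = np.linalg.norm(db_features - q, axis=1)
            for idx in np.argsort(dists)[:k]:      # k nearest discriminators
                if dists[idx] <= threshold:        # vote only if close enough
                    weight = 1.0 / (1.0 + dists[idx])
                    scores[db_singers[idx]] = scores.get(db_singers[idx], 0.0) + weight
        # Singer with the largest accumulated score, or None if no votes were cast.
        return max(scores, key=scores.get) if scores else None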
Liu & Huang, 2002: Experiments
- Data set consists of 10 male and 10 female Chinese singers, each with 30 songs
- 3 factors dominate the results of the MP3 music classification method
- Setting of k in the kNN classifier (best k = 80, resulting in about a 90% precision rate)
- Threshold for the vote decision used by the discriminator (best threshold = 0.2)
- Number of singers allowed in a music class (the larger the number, the higher the precision)
- Allowing more than 1 singer in a music class, i.e. grouping several singers with similar voices, provides the ability to find songs with singers of similar voices
Liu & Huang, 2002: Results and Future Work
- Results within expectations
- Songs sung by a singer with a very unique style resulted in the highest precision rate (over 90%)
- Songs sung by a singer with a common voice resulted in a precision rate of only 50%
- Future work: use more music features
- Pitch, melody, rhythm, and harmonicity for music classification
- Try to represent MP3 features according to the syntax and semantics of the MPEG-7 standard
Tsai et al., 2003
- "Blind Clustering of Popular Music Recordings Based on Singer Voice Characteristics" (ISMIR)
- Technique for automatically clustering undocumented music recordings based on their associated singers, given no information about the singers or the singer population
- Clustering method based on the singer's voice rather than background music, genre, or other factors
- 3-stage process proposed
- Segmentation of each recording into vocal/non-vocal segments
- Suppression of the characteristics of the background music within the vocal segments
- Clustering of the recordings based on singer-characteristic similarity
Tsai et al., 2003: Classification
- Classifier for vocal/non-vocal segmentation
- Front-end signal processor converts the digital waveform into spectrum-based feature vectors
- Back-end statistical processor performs modeling, matching, and decision making
Tsai et al., 2003: Classification
- Classifier operates in 2 phases: training and testing
- During the training phase, a music database with manual vocal/non-vocal transcriptions is used to form two separate GMMs: a vocal GMM and a non-vocal GMM
- In the testing phase, the recognizer takes as input feature vectors extracted from an unknown recording and produces as output the per-frame log-likelihoods under the vocal GMM and the non-vocal GMM (a sketch follows this list)
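A minimal sketch of the two-GMM recognizer with scikit-learn, assuming frame-level feature matrices for the manually labeled vocal and non-vocal training data; the number of mixture components is a placeholder.

    from sklearn.mixture import GaussianMixture

    def train_vocal_nonvocal_gmms(vocal_frames, nonvocal_frames, n_components=32):
        vocal_gmm = GaussianMixture(n_components=n_components).fit(vocal_frames)
        nonvocal_gmm = GaussianMixture(n_components=n_components).fit(nonvocal_frames)
        return vocal_gmm, nonvocal_gmm

    def frame_log_likelihood_ratio(vocal_gmm, nonvocal_gmm, X):
        # Per-frame log-likelihood under each model; a positive ratio
        # suggests the frame is more likely vocal than non-vocal.
        return vocal_gmm.score_samples(X) - nonvocal_gmm.score_samples(X)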
Tsai et al., 2003: Classification
[Figure from Tsai et al., 2003]
Tsai et al., 2003: Decision Rules
- Decision for each frame is made according to one of three decision rules: 1. frame-based, 2. fixed-length-segment-based, and 3. homogeneous-segment-based
- The segment-based rules assign a single classification per segment (a sketch of a fixed-length-segment rule follows the figure below)
[Figure from Tsai et al., 2003]
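A minimal sketch of a fixed-length-segment-based rule in this spirit: per-frame log-likelihood ratios (vocal minus non-vocal) are summed over each segment and one label is shared by all frames in the segment; the segment length and the zero threshold are assumptions.

    import numpy as np

    def segment_decisions(frame_llr, seg_len=50):
        labels = []
        for start in range(0, len(frame_llr), seg_len):
            seg = frame_llr[start:start + seg_len]
            # One vocal/non-vocal label per segment, shared by all its frames.
            labels.extend([bool(np.sum(seg) > 0.0)] * len(seg))
        return np.array(labels)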
Tsai et al., 2003: Singer Characteristic Modeling
- Characteristics of the voice must be modeled to cluster the recordings
- Let V = {v1, v2, v3, ...} be the feature vectors from a vocal region; V is a mixture of
- solo feature vectors S = {s1, s2, s3, ...}
- background accompaniment feature vectors B = {b1, b2, b3, ...}
- S and B are unobservable
- B can be approximated from the non-vocal segments
- S is subsequently estimated given V and B
- A solo model and a background music model are generated for each recording to be clustered
Tsai et al., 2003: Clustering
- Each recording is evaluated against each singer's solo model
- The log-likelihood of the vocal portion of one recording against one solo model is computed (for all solo models)
- k-means algorithm used for clustering
- Starts with a single cluster and recursively splits clusters
- Bayesian Information Criterion (BIC) employed to decide the best value of k (a simplified sketch follows this list)
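A simplified sketch of this clustering step, assuming each recording is represented by its vector of log-likelihoods against all solo models; instead of recursive splitting, it sweeps k and scores each k-means solution with a rough BIC-style penalty, which stands in for the criterion used in the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_recordings(likelihood_matrix, max_k=10):
        n, d = likelihood_matrix.shape
        best_k, best_bic, best_labels = 1, np.inf, np.zeros(n, dtype=int)
        for k in range(1, min(max_k, n) + 1):
            km = KMeans(n_clusters=k, n_init=10).fit(likelihood_matrix)
            # Rough BIC-style score: a data-fit term from the within-cluster
            # error plus a penalty that grows with the number of clusters.
            bic = n * np.log(km.inertia_ / n + 1e-12) + k * d * np.log(n)
            if bic < best_bic:
                best_k, best_bic, best_labels = k, bic, km.labels_
        return best_k, best_labels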
Tsai et al., 2003: Experiments
- Data set consists of 416 tracks from Mandarin pop music CDs
- Experiments run to validate the vocal/non-vocal segmentation method
- Best accuracy achieved was 78%, using the homogeneous-segment-based method
Tsai et al., 2003: Results
- System evaluated on the basis of average cluster purity (a sketch of the purity computation follows the figure below)
- When k equals the singer population, the highest purity is 0.77
[Figure from Tsai et al., 2003]
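A minimal sketch of average cluster purity, assuming each cluster is credited with its most frequent true singer and the credited counts are averaged over all recordings.

    import numpy as np

    def average_cluster_purity(cluster_labels, singer_labels):
        cluster_labels = np.asarray(cluster_labels)
        singer_labels = np.asarray(singer_labels)
        credited = 0
        for c in np.unique(cluster_labels):
            singers_in_c = singer_labels[cluster_labels == c]
            # Count of the dominant (most frequent) singer within this cluster.
            _, counts = np.unique(singers_in_c, return_counts=True)
            credited += counts.max()
        return credited / len(cluster_labels)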
Tsai et al., 2003: Future Work
- Test method on a wider variety of data
- Larger singer population
- Richer set of songs from different genres
Discussion and Conclusion
- Singer similarity techniques can be used to
- Automatically organize a collection of music recordings based on the lead singer
- Label guest performers, information that is usually omitted in music databases
- Replace human documentation efforts
- Extend to handle duets, choruses, background vocals, and other musical data with multiple simultaneous or non-simultaneous singers
- Parts of rock band songs sung by the guitarist, drummer, or other band members can be identified
Bibliography
- Bartsch, M., and G. Wakefield (2004). Singing voice identification using spectral envelope estimation. IEEE Transactions on Speech and Audio Processing 12 (2): 100-9.
- Kim, Y., and B. Whitman (2002). Singer identification in popular music recordings using voice coding features. In Proceedings of the 2002 International Symposium on Music Information Retrieval.
- Liu, C., and C. Huang (2002). A singer identification technique for content-based classification of MP3 music objects. In Proceedings of the 2002 Conference on Information and Knowledge Management (CIKM), 438-445.
- Tsai, W., H. Wang, D. Rodgers, S. Cheng, and H. Yu (2003). Blind clustering of popular music recordings based on singer voice characteristics. In Proceedings of the 2003 International Symposium on Music Information Retrieval.