Acoustic variability of voice signal as factor of information security for automatic speech recognition systems with tuning to user voice

Vladimir V. Savchenko

doi:10.3103/S0735272720100039

Authors

Vladimir V. Savchenko, Nizhny Novgorod State Linguistic University, Russian Federation, https://orcid.org/0000-0003-3045-3337

DOI:

https://doi.org/10.3103/S0735272720100039

Keywords:

digital signal processing, random signal, voice signal, automatic speech processing, speech technology, information protection, voice verification

Abstract

This paper considers the acoustic variability of the voice signal in automatic speech recognition systems. Two varieties are distinguished: intra-speaker and inter-speaker speech variability. A probabilistic cluster model of minimal speech units in the Kullback–Leibler information metric is used to describe them mathematically and to compare their magnitudes. On this basis, theoretical estimates of the acoustic variability are obtained separately for each variety. The information security effect in systems tuned to the voice of an authorized user is described and quantitatively characterized. Intra-speaker variability is negligible compared with inter-speaker variability, and therefore has no noticeable harmful effect on the effectiveness of automatic speech recognition. A computational experiment on two speech streams from two different speakers, implemented with the author's software, is set up to confirm and develop the theoretical results. The experimental results show that in a number of cases the level of inter-speaker variability exceeds the inter-phonemic differences within a homogeneous speech flow. Therefore, in systems tuned to the speaker's voice, the effect of acoustic variability is not only unambiguously positive, in that it protects information from unauthorized access, but also significant in probability-theoretic terms. The results are intended for developing new and modernizing existing automatic speech recognition systems designed to work in standalone mode.
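
The comparison of intra- and inter-speaker variability in the Kullback–Leibler metric can be illustrated with a minimal sketch. It assumes zero-mean Gaussian signals described by autoregressive (AR) power spectra, for which the Kullback–Leibler divergence rate takes a spectral form; whether the paper uses exactly this expression is an assumption. The function names and all AR coefficients below are hypothetical, synthetic values chosen only for illustration, not data or results from the paper.

```python
import cmath
import math


def ar_power_spectrum(coeffs, n_bins=256, noise_var=1.0):
    """Power spectrum of the AR model x[t] = sum(a_k * x[t-k]) + e[t],
    sampled at n_bins normalized frequencies in [0, pi)."""
    spectrum = []
    for i in range(n_bins):
        omega = math.pi * i / n_bins
        denom = 1.0 - sum(
            a * cmath.exp(-1j * omega * (k + 1)) for k, a in enumerate(coeffs)
        )
        spectrum.append(noise_var / abs(denom) ** 2)
    return spectrum


def kl_spectral_divergence(g_x, g_r):
    """Discrete approximation of the Kullback-Leibler divergence rate between
    two zero-mean Gaussian signals with power spectra g_x and g_r:
    0.5 * mean(g_x/g_r - ln(g_x/g_r) - 1)."""
    total = 0.0
    for gx, gr in zip(g_x, g_r):
        ratio = gx / gr
        total += ratio - math.log(ratio) - 1.0
    return 0.5 * total / len(g_x)


# Hypothetical AR(2) coefficients (synthetic, NOT taken from the paper).
# All three models are stable, so the spectra are finite everywhere.
reference = ar_power_spectrum([0.90, -0.50])      # speaker A, reference phoneme
same_speaker = ar_power_spectrum([0.88, -0.50])   # speaker A, another utterance
other_speaker = ar_power_spectrum([0.30, -0.70])  # speaker B, same phoneme

intra = kl_spectral_divergence(same_speaker, reference)
inter = kl_spectral_divergence(other_speaker, reference)

print(f"intra-speaker divergence: {intra:.4f}")
print(f"inter-speaker divergence: {inter:.4f}")
```

With these synthetic models the inter-speaker divergence comes out larger than the intra-speaker one, which mirrors the qualitative relation stated in the abstract; the numbers themselves carry no experimental meaning.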
