Signal and Data Processing, Volume 19, Issue 4, pages 179-196

Persian title (translated): Persian Phone Recognition Using Acoustic Landmarks and Neural Network-Based Variability Compensation Methods
Persian abstract (translated): Speech evidence and experiments show that information is distributed non-uniformly in the speech signal, and that humans can recognize speech robustly by focusing on its information-rich regions. Accordingly, this study presents a Persian phone recognition system based on robust recognition of information-rich, distinct acoustic regions. These regions are called acoustic landmarks. To this end, a set of suitable acoustic landmarks is first selected for the Persian speech signal, and a deep neural network is trained on them. Then, to remove variability in the acoustic landmarks, the model structure and its training procedure are modified in four different schemes. Scheme one uses a separate neural network, and scheme two uses a multi-task learning structure, for nonlinear compensation of landmark variability. Scheme three uses a recurrent connection in a hidden layer of the network to reconstruct the input, and scheme four uses a structure based on deep attractor networks to reduce unwanted variability. The experiments are carried out on the Persian speech corpus "Farsdat", and the results are reported as phone error rate. The best trained model is a feed-forward neural network with five hidden layers. Its phone error rate on the test data is 21.74 percent. The four variability-filtering schemes reduce the phone error rate by 0.39, 0.58, 0.43, and 1.3 percent absolute, respectively.
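The abstract reports its results as phone error rate. The record does not say which scoring tool was used, so the following is only a minimal sketch of the standard definition: the edit distance between reference and recognized phone sequences, divided by the number of reference phones. The function names and the example phone labels are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(refs, hyps):
    """PER in percent over paired reference/hypothesis phone sequences."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_phones = sum(len(r) for r in refs)
    if total_phones == 0:
        raise ValueError("reference transcriptions are empty")
    return 100.0 * total_errors / total_phones


# Hypothetical example: one deleted phone out of five reference phones.
refs = [["s", "a", "l", "a", "m"]]
hyps = [["s", "a", "l", "m"]]
print(phone_error_rate(refs, hyps))  # 20.0
```

Under this definition, an absolute reduction of 0.39 percent means the error rate itself drops by 0.39 points, not by 0.39 percent of its previous value.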
Persian keywords (translated): phone recognition, acoustic landmarks, deep learning, robust recognition, nonlinear filtering

English title: Persian Phone Recognition Using Acoustic Landmarks and Neural Network-Based Variability Compensation Methods
English abstract: Speech recognition is a subfield of artificial intelligence that develops technologies for converting spoken utterances into transcriptions. So far, various methods such as hidden Markov models and artificial neural networks have been used to build speech recognition systems. Most of these systems process all speech signal frames uniformly, even though information is not evenly distributed across them. Auditory experiments have also shown that the human brain pays more attention to information-rich areas. By focusing on these areas instead of processing the signal uniformly, the brain recognizes speech more robustly under intrinsic and environmental variations such as speaker differences and noise. In contrast, the performance of most speech recognition systems degrades dramatically under these conditions. Therefore, to improve robustness, some researchers have developed speech recognition systems that model these informative parts of the speech signal, called landmarks. Similarly, in this article, we implemented a landmark-based system, inspired by human perception, to obtain a robust Persian speech recognition system. We also applied neural network-based variability compensation methods to boost its performance. In this article, acoustic landmarks are classified into two categories, events and states, defined as follows. Events are areas of the speech signal in which the spectral characteristics change drastically while their length does not vary much. The transition areas between some adjacent pairs of phones (phone boundaries) are primarily selected as events. States are areas of the speech signal in which the spectral characteristics do not change significantly. Here, the nuclei of phones are considered states. Previous research, linguistic sources, and implementation results were used to determine appropriate landmarks for the Persian language.
Finally, a set of 313 landmarks was selected and used in our acoustic landmark-based phone recognition system. The neural network used to recognize acoustic landmarks is a feed-forward, fully connected structure with ReLU activations in its hidden layers and a linear function in its final layer. The number of layers and neurons was determined experimentally. The best structure is composed of 5 fully connected layers with 1000 neurons per layer. In this study, instead of dedicating 313 output neurons to the 313 landmarks, a heuristic labeling method is used to reduce the number of output neurons and exploit the information shared between landmarks. In the test phase, the landmark recognition model slides over the speech feature sequence to produce the output landmark sequence. Finally, three rule-based post-processing steps convert the obtained landmark sequence into a phone sequence. Variabilities are among the essential sources of quality degradation in speech recognition; therefore, we proposed two approaches to reduce them and improve phone recognition quality in our landmark-based system. To this aim, we exploited the nonlinear filtering characteristic of neural networks by implementing four neural network schemes. In scheme 1, a feed-forward neural network is first trained to map training landmarks to their corresponding well-recognized samples. This structure can then act as a nonlinear filter before the landmark recognition block. In scheme 2, a unified structure is trained to learn the landmark labels and the filtering part simultaneously. In both of these schemes, we used a recurrent loop to increase the chance of attractor manipulation in the structures. In scheme 3, a recurrent loop is added to one hidden layer. This loop acts as an input variability simulator and forces the network to correctly recognize the input data and its variations.
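The feed-forward structure described above (fully connected hidden layers with ReLU, followed by a linear output layer) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the abstract gives neither the input dimension nor the size of the reduced output code, so `n_in` and `n_out` are hypothetical parameters, the random initialization is a placeholder for trained weights, and all function names are invented for this sketch.

```python
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [x if x > 0.0 else 0.0 for x in v]


def linear(v, weights, bias):
    """Affine map; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    # He-style scaling is an assumption; the abstract does not describe initialization.
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_in, n_out, n_hidden_layers=5, hidden_size=1000, seed=0):
    """Return a list of (weights, bias) pairs: hidden layers, then the output layer."""
    rng = random.Random(seed)
    sizes = [n_in] + [hidden_size] * n_hidden_layers + [n_out]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def forward(network, features):
    """ReLU on every hidden layer, linear function on the final layer."""
    h = features
    for weights, bias in network[:-1]:
        h = relu(linear(h, weights, bias))
    weights, bias = network[-1]
    return linear(h, weights, bias)


# Small hypothetical sizes so the example runs quickly.
net = build_network(n_in=6, n_out=4, hidden_size=8)
scores = forward(net, [0.1] * 6)
print(len(scores))  # 4
```

The defaults mirror the reported best structure (five hidden layers of 1000 units); a practical implementation would use a numerical library instead of nested lists.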
Finally, in scheme four, a deep attractor neural network-based structure is proposed to shape the structure's hidden layer components so that it can compensate for variabilities. The experiments are carried out on a Persian database named Farsdat, and the results are reported as phone error rate (PER). From every 25-millisecond speech frame, an acoustic feature called LHCB is extracted and combined with the delta and delta-delta features of that frame. Every frame's features are concatenated with those of fourteen adjacent frames and are finally fed to our neural network-based landmark extraction model. The best trained model obtained a PER of 21.74% on the test data. Using schemes one to four, we achieved absolute PER reductions of 0.39, 0.58, 0.43, and 1.30 percent, respectively. Comparing our landmark-based system's performance with other Persian phone recognition systems shows that this method can perform efficiently as a Persian phone recognition system. In future work, we intend to compare our acoustic landmark-based phone recognition system's performance with conventional methods such as CTC in noisy conditions. In addition, it seems that acoustic landmarks can be used to create an alignment between the input speech sequence and the output transcription. Therefore, we will present a combination of CTC-based methods and acoustic landmarks to exploit the complementary information in acoustic landmarks. This information might boost the performance and speed of CTC-based speech recognition methods, particularly in low-resource languages.
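The concatenation of each frame with its adjacent frames can be sketched as a context-window stacking step. The abstract does not say how the fourteen adjacent frames are split or how utterance edges are handled, so this sketch assumes seven frames on each side and repeats the first and last frame at the boundaries. The function name `stack_context` and its parameters are hypothetical.

```python
def stack_context(frames, left=7, right=7):
    """Concatenate each frame with its neighbours into one input vector.

    frames: list of per-frame feature lists (for example, static features
    already combined with delta and delta-delta values).
    left/right: number of context frames on each side (assumed 7 + 7 = 14).
    """
    if not frames:
        return []
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for k in range(t - left, t + right + 1):
            idx = min(max(k, 0), last)  # clamp index so edge frames are repeated
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Tiny example with one feature per frame and one context frame per side.
frames = [[0.0], [1.0], [2.0]]
print(stack_context(frames, left=1, right=1))
# [[0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 2.0, 2.0]]
```

With the assumed 7 + 7 split, each stacked vector holds 15 frames' worth of features, so a per-frame dimension of d yields an input of size 15d.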
English keywords: Phone Recognition, Acoustic Landmarks, Deep Learning, Robust Recognition, Nonlinear Filtering

Authors: شقایق رضا | Shaghayegh Reza
علی سید صالحی | Ali Seyyedsalehi
زهره سید صالحی | Zohreh Seyyedsalehi

URL: http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-798-2&slc_lang=fa&sid=1
Article file: no file has been stored for this article
Article code (DOI):
Language of published article: fa
Subjects of published article: speech processing articles
Type of published article: research