Iran Research Information System



Signal and Data Processing, Volume 19, Issue 3, pp. 175-188


Persian title	Producing a Tagged Persian Tokenizer Corpus with Regard to Its Computational Linguistics Considerations

Persian abstract	Written Persian texts usually have two simple but important problems. The first is multi-unit words, which result from attaching a word to the words that follow it. The second is multi-word units, which result from separating words that together form a single lexical unit. The tokenizer, one of the preprocessing tools for Persian, is widely used in text analysis and must be able to identify lexical units. In other words, this tool detects word boundaries in a text and converts the text into a sequence of words for later analysis. Variation in Persian orthography and failure to follow the rules for writing words separately or joined, on the one hand, and the lexical complexity of Persian, on the other, pose many challenges for language processing tasks, including tokenization. Therefore, for this tool to perform optimally, the computational linguistics considerations of tokenization in Persian must first be specified, and a data set for training and testing must then be built on the basis of these considerations. In this article, we explain these considerations and prepare a corpus accordingly. The resulting corpus contains 21.183 words, and the average sentence length is 40.28.
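The two problems described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the two lexicons below are hypothetical toy data, and the function name `normalize_units` is an assumption made for this example. The sketch splits a fused multi-unit word into its parts and joins a separated multi-word unit with a zero-width non-joiner (U+200C), the character Persian orthography uses to keep such parts together without a visible space.

```python
ZWNJ = "\u200c"  # zero-width non-joiner

# Hypothetical toy lexicons, for illustration only.
# Separated words that together form one lexical unit.
MULTI_WORD_UNITS = {("می", "شود"): "می" + ZWNJ + "شود"}
# Fused written forms that actually contain several units.
MULTI_UNIT_WORDS = {"اینکار": ["این", "کار"]}


def normalize_units(words):
    """Return lexical units from a list of whitespace-separated words."""
    units = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in MULTI_WORD_UNITS:
            # Two separated words form one unit: merge them.
            units.append(MULTI_WORD_UNITS[pair])
            i += 2
        elif words[i] in MULTI_UNIT_WORDS:
            # One fused word holds several units: split it.
            units.extend(MULTI_UNIT_WORDS[words[i]])
            i += 1
        else:
            units.append(words[i])
            i += 1
    return units


if __name__ == "__main__":
    print(normalize_units("اینکار انجام می شود".split()))
```

For the four-word sample, the sketch returns four units: the first word is split in two, and the last two words are merged into one. A real tokenizer would need far richer resources than a lookup table, which is the gap a tagged corpus is meant to fill.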

Persian keywords	Persian tokenizer corpus, Persian language processing, computational linguistics

English title	Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

English abstract	The main task of tokenization is to divide the sentences of a text into their constituent units and to remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical written string that forms an independent semantic unit. Tokenization operates at the word level, and the extracted units can serve as input to other components, such as a stemmer. Building this tool requires identifying the units that count as independent semantic units in Persian. The tool detects word boundaries in a text and converts the text into a sequence of words. For English, much work has been done on text tokenization and many tools have been developed, such as Stanford, Ragel, ANTLR, JFlex, JLex, Flex and Quex. In recent decades, valuable research has also been conducted on tokenization in Persian, all of which has worked on the lexical and syntactic layers. In the current research, we tried to address the semantic layer in addition to those two layers. Persian texts usually have two simple but important problems. The first is multi-unit words, which result from attaching one word to the next. The other is multi-word units, which result from separating words that together form a single lexical unit. The tokenizer is one of the language preprocessing tools and is widely used in text analysis. This component recognizes word boundaries in a text and turns the text into a sequence of words for later analysis. Variation in Persian script and non-observance of the rules for writing words separately or joined, on the one hand, and the lexical complexities of Persian, on the other, pose many challenges for language processing tasks such as tokenization.
Therefore, to obtain optimal performance from this tool, it is necessary first to specify the computational linguistics considerations of tokenization in Persian and then, based on these considerations, to provide a data set for training and testing. In this article, while explaining these considerations, we prepared such a data set. The prepared data set contains 21.183 tokens, and the average sentence length is 40.28.
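The basic task named at the start of the abstract (split text into units, drop punctuation) and the two corpus statistics it reports (token count, average sentence length) can be sketched as follows. This is a deliberately naive illustration under stated assumptions, not the authors' tool: the names `tokenize` and `corpus_stats` and both regular expressions are choices made for this example.

```python
import re

# Sentence-final punctuation, including the Persian question mark.
SENTENCE_END = re.compile(r"[.!?؟]+")
# A token is a run of word characters; the zero-width non-joiner (U+200C)
# is kept inside tokens because Persian uses it within a single word.
TOKEN = re.compile(r"[\w\u200c]+")


def tokenize(text):
    """Split text into sentences, each a list of tokens without punctuation."""
    sentences = []
    for chunk in SENTENCE_END.split(text):
        tokens = TOKEN.findall(chunk)
        if tokens:  # skip empty chunks, e.g. after the final period
            sentences.append(tokens)
    return sentences


def corpus_stats(sentences):
    """Return (total token count, average sentence length in tokens)."""
    total = sum(len(sentence) for sentence in sentences)
    return total, total / len(sentences)


if __name__ == "__main__":
    sample = "Tokenizers split text, drop punctuation. Do they? Yes!"
    sents = tokenize(sample)
    print(sents)
    print(corpus_stats(sents))
```

Splitting on every period is a known weakness of this sketch: it would also break a decimal number such as 40.28 into two tokens. Handling such cases, along with fused and separated Persian words, is what a purely lexical approach like this one leaves unsolved.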

English keywords	Persian text tokenization corpus, Natural Language Processing (NLP), computational linguistics

Authors	Mojgan Farhoodi (ICT Research Institute), Maryam Mahmoudi (ICT Research Institute), Mona Davoudi (ICT Research Institute)

Web address	http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-2101-1&slc_lang=fa&sid=1
Language of the published article	fa
Topics of the published article	Text processing articles
Type of the published article	Applied
