Iranian Research Information System



Signal and Data Processing, Volume 19, Issue 4, pages 143-154


Persian title	A New Document Embedding Method for News Text Classification

Persian abstract	Text classification is one of the important applications of natural language processing. To classify news texts, they must first be represented in a suitable way. Various methods exist for text representation, but most of them are general-purpose and use only local, first-order word co-occurrence information. This paper presents an unsupervised method for representing news texts that uses global co-occurrence information and topical information to represent documents. Besides providing a more abstract representation of the text, topical information also carries higher-order co-occurrence information. Global co-occurrence information and topical information complement each other, so both are used in this paper to produce a richer representation for text classification. The proposed method was tested on the R8 and 20-Newsgroups corpora, which are well-known corpora for text classification, and compared with several other methods. Compared with those methods, the proposed method showed an accuracy improvement of 3%.

Persian keywords	Document representation, document embedding, word embedding, word co-occurrence, topical information, text classification

English title	A New Document Embedding Method for News Classification

English abstract	Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into predefined categories. A great deal of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a form that a classifier can distinguish. The literature offers many methods for document representation, which can be divided into bag-of-words models, graph-based methods, word-embedding pooling, neural-network-based methods, and topic-modeling-based methods. Most of these methods use only local word co-occurrences to generate document embeddings. Local word co-occurrences miss the overall view of a document and its topical information, which can be very useful for classifying news articles. In this paper, we propose a method that uses the term-document and document-topic matrices to generate richer document representations. The term-document matrix represents a document in a specific way, in which each word plays a role in representing the document. The generalization power of this type of representation for text classification and information retrieval is not very good. This matrix is built from global (document-level) co-occurrences, which are more suitable for text classification than local co-occurrences. The document-topic matrix represents a document in an abstract way, and higher-order co-occurrences are used to generate it. This type of representation therefore generalizes well for text classification, but it is so high-level that it misses rare words, which can be very useful features for text classification.
The proposed approach is an unsupervised document-embedding model that combines the benefits of the document-topic and term-document matrices to generate a richer representation of documents. The method constructs a tensor from these two matrices and applies tensor factorization to reveal hidden aspects of the data. The proposed method is evaluated on text classification on the 20-Newsgroups and R8 datasets, which are benchmark datasets for news classification. The results show that the proposed model outperforms the baseline methods, improving text classification accuracy by 3%.
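
The abstract does not say how the tensor is built from the two matrices, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes raw term counts for the term-document matrix, hand-written (hypothetical) topic proportions in place of a fitted topic model, and one term-by-topic slice per document formed as the outer product of that document's two rows. The function names `term_document_matrix` and `build_tensor` are illustrative. The factorization step is omitted because the abstract does not name the decomposition used.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, rows): one row of raw term counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def build_tensor(doc_term, doc_topic):
    """Stack one term-by-topic slice per document.

    Assumption: each slice is the outer product of the document's
    term-count row and its topic-proportion row.
    """
    if len(doc_term) != len(doc_topic):
        raise ValueError("both matrices need one row per document")
    return [
        [[count * prob for prob in topic_row] for count in term_row]
        for term_row, topic_row in zip(doc_term, doc_topic)
    ]


docs = [
    "stocks fall as markets react",
    "team wins the final match",
    "markets rally as stocks rise",
]
vocab, doc_term = term_document_matrix(docs)

# Hypothetical document-topic proportions for 2 topics. In practice these
# would come from a topic model fitted on the same corpus.
doc_topic = [
    [0.9, 0.1],
    [0.1, 0.9],
    [0.8, 0.2],
]

tensor = build_tensor(doc_term, doc_topic)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)  # documents x terms x topics
```

In this layout a rare word keeps its own row in every slice while the topic axis carries the abstract, higher-order signal, which is one way to read the complementarity the abstract describes. A low-rank factorization of such a tensor would then yield one embedding vector per document.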

English keywords	Text classification, Document representation, Document embedding, Topic modeling, Word co-occurrences

Authors	Zahra Rahimi (Amirkabir University of Technology); Mohammad Mehdi Homayounpour (Amirkabir University of Technology)

URL	http://jsdp.rcisp.ac.ir/browse.php?a_code=A-10-31-5&slc_lang=fa&sid=1
Article file	No file has been stored for this article
Article code (DOI)
Language of published article	fa
Topics of published article	Text processing papers
Type of published article	Applied
