The Qur'an is the book revealed by God, and to this day many scholars and researchers have worked to know and understand it. The availability of computer systems is a valuable opportunity: by speeding researchers along their path, it can help them reach higher peaks. Clustering is one of the methods used to understand the structure of data. In this article, we cluster the surahs of the Holy Qur'an based on the co-occurrence of words in them, and to achieve this goal we use an existing graph-based approach. In the present study, we first represent each surah as a weighted undirected graph, then form the vector of each surah from its graph, and finally cluster the surahs. To evaluate the quality of the clustering we use the Silhouette score. By this criterion, the best clustering among the different runs obtained a Silhouette score of 0.91. This research provides a suitable structural foundation for describing the semantic layer of the surahs and verses of the Qur'an for researchers in computational linguistics working in the domain of Qur'anic studies.
The Holy Qur'an is revealed by God Almighty. To date, many scholars and researchers have worked to understand and comprehend it. The availability of computer systems is a great opportunity to help researchers reach higher peaks by speeding them along their way. Clustering is one of the methods that has been used to understand the structure of data. In clustering, we want to divide data samples into groups so that the members of each cluster are similar to one another and different from the members of the other clusters. Clustering of Qur'anic surahs has been the subject of several computational studies on the Qur'an. These studies have taken different approaches to vectorizing the surahs. In one study, Thabet formed a vector for each surah by taking some stems of Qur'anic words as features and the normalized probability of their occurrence in the surah as feature values, and clustered just 24 surahs because of the sparseness of the resulting data matrix. With a similar approach to vectorizing the surahs, Moisl used concepts from statistical sampling theory to calculate a minimum surah length threshold per feature in order to address the problem of shorter surahs, and was able to cluster more surahs. Instead of using words as features, Sharaf considered 13 features, including whether the surah refers to the story of Adam and Ebliys and the number of occurrences of the phrase «یا أَیُّهَا الَّذینَ آمَنُوا» (O you who believe), and determined how each feature should be measured. He then formed the data matrix and clustered the Qur'anic surahs. In another study, Sufi et al. took the topics identified for each verse in the Tafsir Rahnama as features, constructed a binary data matrix based on the presence or absence of each topic in the Tafsir of each surah, and applied clustering. In this article, we cluster the surahs of the Holy Qur'an based on the co-occurrence of words in them. To achieve this goal, we use an existing graph-based approach.
In the present study, we first represent each surah as a weighted undirected graph. We then form the vector of each surah by taking closed frequent sub-graphs as features and their relative occurrence in each surah as feature values, and finally cluster the surahs. We used the Silhouette score to evaluate the quality of the clustering. By this criterion, the best clustering among the different runs obtained a Silhouette score of 0.91. This research provides a suitable structural foundation for specifying the semantic layer of the surahs of the Holy Qur'an for computational linguistics researchers in the domain of Qur'anic studies.
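The pipeline summarized above (co-occurrence graph, feature vector, Silhouette evaluation) can be illustrated with a minimal Python sketch. It is not the implementation used in this study and rests on several assumptions: co-occurrence is counted within a verse, single edges shared by at least `min_support` texts stand in for closed frequent sub-graphs (an edge is only the simplest possible sub-graph), distances are Euclidean, and the cluster labels are given by hand rather than produced by a clustering algorithm. The function names and the toy token data are hypothetical and are not Qur'anic text.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean


def cooccurrence_graph(verses):
    """Weighted undirected graph as a Counter: (word_u, word_v) -> number of
    verses in which both words occur (assumed definition of co-occurrence)."""
    graph = Counter()
    for verse in verses:
        words = sorted(set(verse.split()))  # sorted so (u, v) == (v, u)
        for u, v in combinations(words, 2):
            graph[(u, v)] += 1
    return graph


def vectorize(graphs, min_support=2):
    """Simplified stand-in for frequent sub-graph features: keep edges that
    appear in at least `min_support` graphs; the feature value is the edge
    weight relative to the total edge weight of that graph."""
    support = Counter(edge for g in graphs for edge in g)
    features = sorted(e for e, s in support.items() if s >= min_support)
    vectors = []
    for g in graphs:
        total = sum(g.values()) or 1
        vectors.append([g[e] / total for e in features])
    return features, vectors


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(vectors, labels):
    """Mean silhouette s = (b - a) / max(a, b) over all samples, where a is
    the mean distance to the sample's own cluster and b the mean distance to
    the nearest other cluster. Requires at least two clusters."""
    scores = []
    for i, (vi, li) in enumerate(zip(vectors, labels)):
        dists = {}  # cluster label -> distances from sample i
        for j, (vj, lj) in enumerate(zip(vectors, labels)):
            if i != j:
                dists.setdefault(lj, []).append(euclidean(vi, vj))
        if li not in dists:  # singleton cluster: silhouette taken as 0
            scores.append(0.0)
            continue
        a = mean(dists[li])
        b = min(mean(d) for label, d in dists.items() if label != li)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Toy "texts": each is a list of verses made of placeholder tokens.
texts = [
    ["a b c", "a b"],
    ["a b", "a b d"],
    ["x y", "x y z"],
    ["x y w", "x y"],
]
graphs = [cooccurrence_graph(t) for t in texts]
features, vectors = vectorize(graphs)

good = silhouette_score(vectors, [0, 0, 1, 1])  # groups texts that share edges
bad = silhouette_score(vectors, [0, 1, 0, 1])   # mixes the two groups
print(features, good, bad)
```

On this toy input the grouping that keeps texts with shared edges together scores higher than the mixed grouping, which is the behaviour the Silhouette score is meant to capture: values near 1 indicate compact, well-separated clusters, and negative values indicate samples that sit closer to another cluster than to their own.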