A topic-specific web crawler using deep convolutional networks

dc.contributor.author: ALkaraleh, Saed
dc.contributor.author: Şirin, Hatice Meltem Nergiz
dc.date.accessioned: 2023-10-13T06:26:26Z
dc.date.available: 2023-10-13T06:26:26Z
dc.date.issued: MAY 2023 [en_US]
dc.department: HKÜ, Faculty of Engineering, Department of Computer Engineering [en_US]
dc.description.abstract: This paper presents a new focused crawler that efficiently supports the Turkish language. The developed architecture is divided into multiple units: a control unit, a crawler unit, a link extractor unit, a link sorter unit, and a natural language processing unit. The crawler's units can work in parallel to process the massive number of published websites. In addition, the proposed Convolutional Neural Network (CNN) based natural language processing unit can proficiently classify Turkish text and web pages. Extensive experiments using three datasets were performed to illustrate the performance of the developed approach. The first dataset contains 50,000 Turkish web pages downloaded by the developed crawler, while the other two are publicly available and consist of 28,567 and 22,431 Turkish web pages, respectively. Furthermore, Vector Space Model (VSM) techniques in general, and state-of-the-art word embedding techniques in particular, were investigated to find the most suitable one for the Turkish language. Overall, the results indicate that the developed approach achieved good performance, robustness, and stability when processing the Turkish language. Bidirectional Encoder Representations from Transformers (BERT) was found to be the most appropriate embedding for building an efficient Turkish language classification system. Finally, the experiments showed that the developed natural language processing unit outperformed seven state-of-the-art CNN classification systems, with an accuracy improvement of 10% over the second-best system and 47% over the lowest-performing one. [en_US]
dc.identifier.citation: ALqaraleh, S & Sirin, HMN. (MAY 2023). A topic-specific web crawler using deep convolutional networks. International Arab Journal of Information Technology. (20, 3, 310-318). https://doi.org/10.34028/iajit/20/3/3. [en_US]
dc.identifier.doi: 10.34028/iajit/20/3/3
dc.identifier.endpage: 318 [en_US]
dc.identifier.issn: 1683-3198
dc.identifier.issue: 3 [en_US]
dc.identifier.orcid: 0000-0002-7354-1364 [en_US]
dc.identifier.scopus: 2-s2.0-85160273908
dc.identifier.scopusquality: Q2
dc.identifier.startpage: 310 [en_US]
dc.identifier.uri: https://doi.org/10.34028/iajit/20/3/3
dc.identifier.uri: https://hdl.handle.net/20.500.11782/3887
dc.identifier.volume: 20 [en_US]
dc.identifier.wos: WOS:001046095400003
dc.identifier.wosquality: Q3
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Zarka Private Univ [en_US]
dc.relation.ispartof: International Arab Journal of Information Technology
dc.relation.publicationcategory: Article - International Refereed Journal - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: CNN [en_US]
dc.subject: natural language processing [en_US]
dc.subject: text classification [en_US]
dc.subject: topic specific crawler [en_US]
dc.subject: focused crawler [en_US]
dc.subject: web crawling [en_US]
dc.title: A topic-specific web crawler using deep convolutional networks
dc.type: Article

Files

Original bundle

Now showing 1 - 1 of 1
Name: 001046095400003.pdf
Size: 962.01 KB
Format: Adobe Portable Document Format
Description: Article file
