BERT Based Topic-Specific Crawler

dc.contributor.author: Tawil, Yahya
dc.contributor.author: Alqaraleh, Saed
dc.contributor.institutionauthor: Tawil, Yahya
dc.contributor.institutionauthor: Alqaraleh, Saed
dc.date.accessioned: 2023-03-13T05:54:14Z
dc.date.available: 2023-03-13T05:54:14Z
dc.date.issued: 2021 [en_US]
dc.department: HKÜ, Faculty of Engineering, Department of Computer Engineering [en_US]
dc.description.abstract: Retrieving information with search engines is very popular and is one of the main applications of the Internet. To speed up the process of getting the required information (web pages), a topic-specific crawler is essential for fetching and indexing only the relevant pages. This paper presents a multi-threaded web crawler that uses Sentence Bidirectional Encoder Representations from Transformers (S-BERT). S-BERT is used to calculate the similarity between the predefined classes and the text of the downloaded web pages. This provides a lightweight model compared to using word embeddings with deep learning for text classification. © 2021 IEEE. [en_US]
dc.identifier.citation: Tawil, Y., Alqaraleh, S. (2021). BERT Based Topic-Specific Crawler. Proceedings - 2021 Innovations in Intelligent Systems and Applications Conference, ASYU 2021: Code 174400. [en_US]
dc.identifier.doi: 10.1109/ASYU52992.2021.9599076
dc.identifier.isbn: 978-166543405-8
dc.identifier.orcid: 0000-0003-0321-0866 [en_US]
dc.identifier.orcid: 0000-0002-7146-3905 [en_US]
dc.identifier.scopus: 2-s2.0-85123208764
dc.identifier.scopusquality: N/A
dc.identifier.uri: https://hdl.handle.net/20.500.11782/3119
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Institute of Electrical and Electronics Engineers Inc. [en_US]
dc.relation.ispartof: Proceedings - 2021 Innovations in Intelligent Systems and Applications Conference, ASYU 2021
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member [en_US]
dc.rights: info:eu-repo/semantics/openAccess [en_US]
dc.subject: document classification [en_US]
dc.subject: search engine [en_US]
dc.subject: text categorization [en_US]
dc.subject: text classification [en_US]
dc.subject: topic-specific crawler [en_US]
dc.subject: web crawler [en_US]
dc.title: BERT Based Topic-Specific Crawler
dc.type: Conference Object
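
The abstract describes scoring each downloaded page against predefined classes by embedding similarity, inside a multi-threaded crawler. The record does not include the authors' implementation, so the following is only a minimal sketch of that idea under stated assumptions. The `toy_embed` function is a bag-of-words stand-in for an S-BERT encoder so the example runs offline; the page texts, class descriptions, URLs, threshold, and all function names are hypothetical and chosen purely for illustration.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def toy_embed(text):
    """Stand-in for an S-BERT encoder (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_class(page_text, class_vectors, embed=toy_embed):
    """Return (class_name, score) for the class most similar to the page text."""
    page_vec = embed(page_text)
    return max(
        ((name, cosine(page_vec, vec)) for name, vec in class_vectors.items()),
        key=lambda pair: pair[1],
    )


def crawl(urls, fetch, class_descriptions, threshold=0.2, workers=4, embed=toy_embed):
    """Fetch pages in a thread pool; keep those whose best class score passes the threshold."""
    class_vectors = {name: embed(desc) for name, desc in class_descriptions.items()}

    def process(url):
        name, score = best_class(fetch(url), class_vectors, embed)
        return url, name, score

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process, urls))  # map preserves input order
    return [r for r in results if r[2] >= threshold]


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/a": "The football team won the match after a late goal.",
    "http://example.com/b": "New processor chips speed up software and computer hardware.",
    "http://example.com/c": "Recipe: bake the bread for forty minutes.",
}
CLASSES = {
    "sports": "football match team goal player league",
    "technology": "computer software hardware processor internet",
}

relevant = crawl(list(PAGES), PAGES.get, CLASSES)
for url, name, score in relevant:
    print(url, name, round(score, 2))
```

In this sketch the recipe page shares no terms with either class description, so it scores zero and is dropped, while the other two pages are kept under their best-matching class. To approximate the approach the abstract names, `toy_embed` could be replaced by a real sentence encoder (for example the `encode` method of a `SentenceTransformer` model from the sentence-transformers library) together with a cosine function over dense vectors, and `fetch` by an HTTP download routine.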

Files

Original bundle

Name: makale - yayıncı sürümü70.pdf
Size: 334.91 KB
Format: Adobe Portable Document Format
Description: Article file

License bundle

Name: license.txt
Size: 1.44 KB
Format: Item-specific license agreed upon to submission