Abstract
The dark web is the part of the World Wide Web that offers freedom of content hosting. It is accessible only through specially designed browsers and tools built on peer-to-peer network technology, such as TOR, I2P, and Freenet. These tools let users exchange information on the dark web while remaining anonymous. To study the dark web, we used the TOR (The Onion Router) browser, which provides excellent anonymity to its users. Users mainly use the dark web for illegal activities; although accessing it is legal in most countries, its use can arouse the suspicion of law enforcement. Categories such as adult content, counterfeits, illicit markets, and weapons are prevalent. This research provides an analytical framework that automates the classification of dark web pages by scraping and analyzing their hosted content. The method crawls data, classifies the hosted content with a machine learning model, and categorizes it as illegal or legal. The proposed framework includes four machine learning classifiers: Naïve Bayes with an accuracy of 0.87, Random Forest with an accuracy of 0.91, Linear SVM with an accuracy of 0.91, and Logistic Regression with an accuracy of 0.94. In this study, models were trained to classify the text content hosted on dark web websites and were evaluated on accuracy. The results validate the effectiveness of the proposed classification framework for analyzing text data, which has more relevance with smaller datasets. They also encourage further study of this growing phenomenon and support investigators examining illegal activities on the dark web.
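The classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a pure-Python multinomial Naïve Bayes classifier (one of the four model families named above) with add-one smoothing, trained on a tiny hypothetical corpus. The sample texts, the label names `illegal` and `legal`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real preprocessing."""
    return text.lower().split()


def train_naive_bayes(samples):
    """Count documents per label and words per label from (text, label) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        doc_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return doc_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, samples):
    """Fraction of (text, label) pairs that the model labels correctly."""
    correct = sum(predict(model, text) == label for text, label in samples)
    return correct / len(samples)


# Hypothetical toy corpus; real input would be text scraped from crawled pages.
TRAIN = [
    ("counterfeit passports and stolen cards for sale", "illegal"),
    ("buy weapons and counterfeit currency here", "illegal"),
    ("stolen accounts and weapons marketplace", "illegal"),
    ("privacy blog about anonymous communication", "legal"),
    ("open source software mirror and documentation", "legal"),
    ("news blog and discussion forum about privacy", "legal"),
]
TEST = [
    ("counterfeit cards marketplace", "illegal"),
    ("documentation for open source privacy software", "legal"),
]

model = train_naive_bayes(TRAIN)
test_accuracy = accuracy(model, TEST)
print(f"accuracy on toy test set: {test_accuracy:.2f}")
```

Accuracy here is the fraction of held-out pages labeled correctly, which is the metric the abstract reports for each classifier. A figure obtained on a toy corpus like this one says nothing about the accuracies reported in the paper.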
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Murty, C.A.S., Rana, H., Verma, R., Pathak, R., Rughani, P.H. (2022). Building an AI/ML Based Classification Framework for Dark Web Text Data. In: Bashir, A.K., Fortino, G., Khanna, A., Gupta, D. (eds) Proceedings of International Conference on Computing and Communication Networks. Lecture Notes in Networks and Systems, vol 394. Springer, Singapore. https://doi.org/10.1007/978-981-19-0604-6_9
DOI: https://doi.org/10.1007/978-981-19-0604-6_9
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0603-9
Online ISBN: 978-981-19-0604-6
eBook Packages: Engineering, Engineering (R0)