Building an AI/ML Based Classification Framework for Dark Web Text Data

Murty, Ch. A. S.; Rana, Harmesh; Verma, Rachit; Pathak, Roshan; Rughani, Parag H.

doi:10.1007/978-981-19-0604-6_9

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 394)

Abstract

The dark web is an integral part of the World Wide Web that provides freedom of content hosting. It is accessible only through specially designed browsers and tools built on peer-to-peer network technology, such as TOR, I2P, and Freenet. These tools let users exchange information on the dark web while remaining anonymous. We used the TOR (The Onion Router) browser, which provides excellent anonymity to its users, to study the dark web. Users mainly utilize the dark web for illegal activities; although accessing it is legal in most countries, its use can arouse the suspicion of law enforcement. Categories such as adult content, counterfeits, illicit markets, and weapons are prevalent. This research provides an analytical framework that automates the classification of dark web pages by scraping and analyzing their hosted content. The method crawls data, classifies the hosted content with a machine learning model, and categorizes that content as legal or illegal. The proposed framework includes four machine learning classifiers: Naïve Bayes with an accuracy of 0.87, Random Forest with an accuracy of 0.91, Linear SVM with an accuracy of 0.91, and Logistic Regression with an accuracy of 0.94. The models were trained to classify text content hosted on dark web sites and were evaluated on accuracy. The results validate the effectiveness of the proposed classification framework for analyzing text data, which is particularly relevant for smaller datasets. They also encourage further study of this growing phenomenon and support investigators examining illegal activities on the dark web.
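To make the classify-and-evaluate step concrete, the following is a minimal sketch of a Naïve Bayes text classifier scored on accuracy, one of the four model families named in the abstract. It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing, the class name `NaiveBayesText`, and the toy documents and labels are all assumptions made for illustration, and the resulting accuracy has no relation to the figures reported above.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Tokens never seen in training carry no evidence, so skip them.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_counts[label] / self.total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy data, invented for this sketch (not from the paper).
train_docs = [
    "counterfeit banknotes for sale",
    "weapons shipped worldwide no questions",
    "stolen cards and counterfeit documents",
    "privacy blog about encryption tools",
    "open source software mirror and documentation",
    "forum for journalists and secure messaging",
]
train_labels = ["illegal", "illegal", "illegal", "legal", "legal", "legal"]

test_docs = ["counterfeit documents for sale", "blog about secure software"]
test_labels = ["illegal", "legal"]

model = NaiveBayesText().fit(train_docs, train_labels)
predictions = [model.predict(d) for d in test_docs]
test_accuracy = accuracy(model, test_docs, test_labels)
print(predictions, test_accuracy)
```

Accuracy here is the share of held-out pages labeled correctly, so a value such as 0.94 corresponds to 94 of every 100 test pages. On a real corpus the same evaluation would be applied to each candidate classifier on a shared train/test split.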

Author information

Authors and Affiliations

Centre for Development of Advanced Computing (C-DAC), Hyderabad, India
Ch. A. S. Murty, Harmesh Rana, Rachit Verma & Roshan Pathak
National Forensic Sciences University (NFSU), Gandhinagar, India
Parag H. Rughani

Corresponding author

Correspondence to Ch. A. S. Murty.

Editor information

Editors and Affiliations

Manchester Metropolitan University, Manchester, UK
Ali Kashif Bashir
University of Calabria, Rende, Italy
Giancarlo Fortino
Maharaja Agrasen Institute of Technology, New Delhi, Delhi, India
Ashish Khanna
Maharaja Agrasen Institute of Technology, New Delhi, Delhi, India
Deepak Gupta

About this paper

Cite this paper

Murty, C.A.S., Rana, H., Verma, R., Pathak, R., Rughani, P.H. (2022). Building an AI/ML Based Classification Framework for Dark Web Text Data. In: Bashir, A.K., Fortino, G., Khanna, A., Gupta, D. (eds) Proceedings of International Conference on Computing and Communication Networks. Lecture Notes in Networks and Systems, vol 394. Springer, Singapore. https://doi.org/10.1007/978-981-19-0604-6_9

DOI: https://doi.org/10.1007/978-981-19-0604-6_9
Published: 09 July 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-0603-9
Online ISBN: 978-981-19-0604-6
eBook Packages: Engineering, Engineering (R0)