Building an AI/ML Based Classification Framework for Dark Web Text Data

  • Conference paper
Proceedings of International Conference on Computing and Communication Networks

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 394)

Abstract

The dark web is the part of the World Wide Web that provides freedom of content hosting. It is accessible through specially designed browsers and tools built on peer-to-peer network technology, such as Tor, I2P, and Freenet. These tools help users exchange information on the dark web while remaining anonymous. We used the Tor (The Onion Router) browser in our research to study the dark web. It provides excellent anonymity to its users. Users mainly use the dark web for illegal activities; although accessing it is legal in most countries, its use can arouse suspicion from law enforcement. Categories such as adult content, counterfeits, illicit markets, and weapons are prevalent. This research provides an analytical framework that automates the classification of dark web pages by scraping and analyzing their hosted content. The method crawls data, classifies the hosted content with a machine learning model, and categorizes that content as illegal or legal. The proposed framework includes four machine learning classifiers: Naïve Bayes with an accuracy of 0.87, Random Forest with an accuracy of 0.91, Linear SVM with an accuracy of 0.91, and Logistic Regression with an accuracy of 0.94. The models were trained to classify hosted text content on dark web sites and were evaluated on accuracy. The results validate the effectiveness of the proposed classification framework for analyzing text data, and the framework is most relevant for smaller datasets. The results also encourage further study of this growing phenomenon and can support investigators examining illegal activities on the dark web.
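The classify-and-evaluate step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how pages were tokenized or which features were used, so the whitespace tokenizer, the class name NaiveBayesText, the helper accuracy, and the toy documents and labels below are all assumptions made up for illustration. The sketch shows only one of the four named classifiers, a multinomial Naïve Bayes with add-one smoothing, scored by accuracy on a held-out set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesText:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Toy data invented for this sketch; it is not the paper's dataset.
train_docs = [
    "buy counterfeit passports and fake ids",
    "weapons and ammunition for sale",
    "stolen credit cards market",
    "privacy blog about anonymous browsing",
    "open source software mirror and documentation",
    "forum for secure email and privacy tools",
]
train_labels = ["illegal", "illegal", "illegal", "legal", "legal", "legal"]

test_docs = ["fake passports for sale", "documentation for privacy tools"]
test_labels = ["illegal", "legal"]

model = NaiveBayesText().fit(train_docs, train_labels)
test_accuracy = accuracy(model, test_docs, test_labels)
print(f"accuracy: {test_accuracy:.2f}")
```

Accuracy here is the share of held-out pages labeled correctly, which is the metric the abstract reports for each classifier. On a dataset this small the number says nothing about real-world performance; the other three classifiers would be compared the same way on the same split.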

Author information

Corresponding author

Correspondence to Ch. A. S. Murty.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Murty, C.A.S., Rana, H., Verma, R., Pathak, R., Rughani, P.H. (2022). Building an AI/ML Based Classification Framework for Dark Web Text Data. In: Bashir, A.K., Fortino, G., Khanna, A., Gupta, D. (eds) Proceedings of International Conference on Computing and Communication Networks. Lecture Notes in Networks and Systems, vol 394. Springer, Singapore. https://doi.org/10.1007/978-981-19-0604-6_9

  • DOI: https://doi.org/10.1007/978-981-19-0604-6_9

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-0603-9

  • Online ISBN: 978-981-19-0604-6

  • eBook Packages: Engineering, Engineering (R0)