ABSTRACT
Communication on online social networks typically takes place in short messages, often written in non-standard language varieties. These characteristics make this text genre challenging for natural language processing. Moreover, in these digital communities it is easy to provide a false name, age, gender and location in order to hide one's true identity, which gives criminals such as pedophiles new possibilities to groom their victims. It would therefore be useful if user profiles could be checked on the basis of text analysis, and false profiles flagged for monitoring. This paper presents an exploratory study in which we apply a text categorization approach to the prediction of age and gender on a corpus of chat texts, which we collected from the Belgian social networking site Netlog. We examine which types of features are most informative for a reliable prediction of age and gender on this difficult text type, and we perform experiments with different data set sizes to gain more insight into the minimum data size requirements for this task.
Index Terms
- Predicting age and gender in online social networks