Abstract
The increasing availability of online data has prompted researchers in the social sciences and humanities to adopt text mining tools for studying digital social phenomena. While these methods require technical expertise, ChatGPT-4’s Data Analysis module now offers a more accessible alternative, allowing researchers to process data through natural language commands. However, generative AI introduces methodological challenges, including results that are not reproducible because the underlying models keep evolving, and algorithmic opacity that complicates interpretability. This paper investigates ChatGPT-4’s performance in text mining and its implications for computational social science. Using 7,731 Reddit comments from 16 discussion threads about a polarizing event in an online gaming community, we prompted ChatGPT-4 through multiple rounds of topic modeling (a text mining technique) to generate 15 refined topics, three of which were selected for further analysis. We assessed how ChatGPT-4 summarized these three topics and extracted representative comments. Manual verification showed an average accuracy of 84.4%, with variations across topics: topics #1 and #2 had high concordance (90.6% and 94.3%), while topic #3 had a lower match rate (67.5%). Our results suggest that ChatGPT-4 is effective at capturing dominant patterns in homogeneous discussions but may overlook nuanced aspects of more diverse topics, emphasizing the need for human oversight. Based on these findings, we propose a structured approach for using generative AI in research to ensure scientific rigor. We conclude by discussing the limitations of big data science and text mining methods. Drawing on critical data studies, we examine how scientific standards and values underpin a productivity-driven ideology in generative AI technology.
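To make the verification step concrete, the sketch below shows one plausible way to compute per-topic concordance between comments ChatGPT-4 flagged as representative and a manual check of those same comments. It is a minimal illustration only: the record structure, topic identifiers, and example values are hypothetical and do not reproduce the authors' actual pipeline or data.

```python
# Minimal sketch (hypothetical data, not the study's pipeline): per-topic
# concordance between ChatGPT-4's "representative comment" labels and a
# manual verification of the same comments.

from collections import defaultdict

# Each record: (topic_id, chatgpt_label, human_label), where the labels
# indicate whether the comment was judged representative of the topic.
verified = [
    ("topic_1", True, True),
    ("topic_1", True, False),
    ("topic_2", True, True),
    ("topic_3", True, True),
    ("topic_3", True, False),
    # ... one entry per manually checked comment
]

matches = defaultdict(int)   # comments where AI and human labels agree
totals = defaultdict(int)    # all manually checked comments per topic

for topic, ai_label, human_label in verified:
    totals[topic] += 1
    if ai_label == human_label:
        matches[topic] += 1

for topic in sorted(totals):
    rate = matches[topic] / totals[topic]
    print(f"{topic}: {rate:.1%} concordance ({matches[topic]}/{totals[topic]})")

overall = sum(matches.values()) / sum(totals.values())
print(f"Overall concordance: {overall:.1%}")
```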
Presenters
Antoine Jobin
Student, PhD in Communication, Université du Québec à Montréal, Quebec, Canada
Details
Presentation Type
Paper Presentation in a Themed Session
Theme
Keywords
Text Mining; Generative AI; Computational Social Science; Scientific Rigor