Computational Analysis of Chat Transcripts

Library Assessment Conference 2024 poster presentation

Design & Methodology

The plan began with a conversation about anonymizing and tagging the transcript data. We downloaded chat transcripts from Springshare covering April 2015 (the beginning of its use at FGCU) through the end of 2023. Each transcript already included useful fields such as the user's name, referring URL, answerer's name, timestamp, wait time, duration, rating, initial question, message count, and the full text of the chat. There was also a field for tags, but FGCU librarians have never used it, unfortunately.

Because the raw transcript data is primarily textual and would require an exorbitant amount of work to process and tag manually, we consulted the literature for methods to analyze it more efficiently. Several methods came to light, including sentiment analysis, word frequency, and topic modeling using tools such as R, NVivo, Voyant, and Python (Taskin and Al, 2019; Brousseau, Jonson, and Thacker, 2021; Koh and Fienup, 2021; Sharma, Barrett, and Stapelfeldt, 2022; Wang, 2022; Watson, 2023). Because the Digital Humanities (DH) librarian has some experience working with Python and natural language processing (NLP), we decided to start there.

There were multiple steps to the process, and the plan evolved as we learned more about the data and its limitations.


Quantitative Analysis

Although the quantitative data included in the statistics was useful, the Springshare system only allows downloading and analyzing one year of data at a time. To analyze all nine years, an external program was needed. Our initial plan was to use Tableau; however, we were unable to obtain an institutional license for Tableau in the time allotted, so we used Matplotlib and Flourish instead.
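Combining the yearly exports is straightforward once they are outside Springshare. A minimal sketch with pandas and Matplotlib, assuming hypothetical filenames like `chats_2015.csv` and a `Timestamp` column (the real export schema may differ); tiny in-memory frames stand in for the files so the example is self-contained:

```python
import pandas as pd

# In practice the nine yearly Springshare exports would be read from disk, e.g.:
#   frames = [pd.read_csv(f"chats_{year}.csv") for year in range(2015, 2024)]
# (filenames and column names here are assumptions, not the actual schema).
# Two tiny fabricated "yearly exports" keep this sketch self-contained:
frames = [
    pd.DataFrame({"Timestamp": ["2015-04-02 10:00", "2015-09-15 14:30"]}),
    pd.DataFrame({"Timestamp": ["2016-01-20 09:05"]}),
]

chats = pd.concat(frames, ignore_index=True)
chats["Timestamp"] = pd.to_datetime(chats["Timestamp"])

# Chats per year across the combined dataset
per_year = chats["Timestamp"].dt.year.value_counts().sort_index()

# Charting with Matplotlib (skipped gracefully if it is not installed)
try:
    import matplotlib
    matplotlib.use("Agg")  # headless backend
    import matplotlib.pyplot as plt
    per_year.plot(kind="bar", xlabel="Year", ylabel="Chats")
    plt.savefig("chats_per_year.png")
except ImportError:
    pass
```

The same per-year table can be exported as CSV and uploaded to Flourish for interactive charts.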

Qualitative Analysis

Cleaning the dataset was an undertaking for a relatively new coder. There have been 9,079 transactions in total since 2015; 7,771 of those are from the past five years. Many users did not enter a question when they began chatting, so that field was of limited use. Although the average message count across all transcripts is only 10-11 messages (back and forth between user and answerer), the fifty longest transcripts from the last five years average 68 messages per chat, and the longest chat is a whopping 158 messages over 42 minutes. The questions and transcripts included a great deal of personally identifiable information (PII), including the answerer's name, the user's name (when provided), names of other people (professors and mentors), email addresses, student university ID numbers (UINs), and even physical addresses.

Due to the learning curve that this project entailed, we decided to take a sample of the transcripts on which to practice. We chose the ten longest transcripts from each of the last five years, for a total of fifty transcripts. Once the process is perfected, it will be applied to the full corpus, which will then be fed into a GuidedLDA semi-supervised topic model (Koh and Fienup, 2021). Eventually, we intend to use ChatGPT to classify the corpus and compare the results to those from GuidedLDA.
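Pulling the ten longest transcripts from each year is a one-liner with a pandas group-by. A sketch on toy data, where the `Year`, `Message Count`, and `Transcript` column names are assumptions rather than the actual Springshare field names:

```python
import pandas as pd

# Toy stand-in for the transcript statistics table: one row per chat with a
# year and a message count (column names are assumptions).
df = pd.DataFrame({
    "Year": [2019, 2019, 2019, 2020, 2020, 2020],
    "Message Count": [5, 40, 12, 8, 60, 33],
    "Transcript": ["a", "b", "c", "d", "e", "f"],
})

# Longest transcripts per year: sort by length, then take the top N per group
# (N=10 in the actual sampling; N=2 here to fit the toy data).
sample = (
    df.sort_values("Message Count", ascending=False)
      .groupby("Year", group_keys=False)
      .head(2)
)
```

On the toy data this keeps the two longest chats from each year; swapping `head(2)` for `head(10)` reproduces the fifty-transcript sample.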

To process the data, we used Constellate, a text analysis platform that uses the Jupyter environment for coding. FGCU recently subscribed to the platform and will be using it to teach text analysis and other coding workshops. To clean the data, a combination of pandas, regex, spaCy, NLTK, and Counter was used. Some simple Python code replaced answerer names with categorical labels such as Librarian1 and Intern1. The transcripts included HTML character entities and URLs; those were removed using regex. spaCy was used to remove named entities, including names, email addresses, and UINs. NLTK was used to tokenize the text and remove stopwords and punctuation. Finally, Counter was used to generate word frequencies.
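The cleaning steps can be sketched as a single pipeline. To keep the example self-contained, stdlib regex patterns stand in for spaCy's named-entity removal and a toy stopword list stands in for NLTK's; the sample text, name map, and `[EMAIL]`/`[UIN]` placeholders are illustrative assumptions, not the project's actual rules:

```python
import re
from collections import Counter
from html import unescape

def clean_transcript(text, answerer_map):
    """Anonymize and clean one chat transcript (stdlib-only sketch)."""
    # Replace answerer names with categorical labels (Librarian1, Intern1, ...)
    for name, label in answerer_map.items():
        text = re.sub(re.escape(name), label, text, flags=re.IGNORECASE)
    text = unescape(text)                            # HTML entities: &amp; -> &
    text = re.sub(r"https?://\S+", "", text)         # strip URLs
    text = re.sub(r"\S+@\S+\.\S+", "[EMAIL]", text)  # redact email addresses
    text = re.sub(r"\b\d{8,}\b", "[UIN]", text)      # redact long ID numbers
    return text

# Toy stopword list standing in for NLTK's; spaCy's named-entity removal is
# likewise approximated above by the name/email/UIN patterns.
STOPWORDS = {"the", "a", "an", "to", "is", "my", "i", "me", "at", "or", "and", "of"}

def word_frequencies(text):
    tokens = re.findall(r"[a-z']+", text.lower())    # crude tokenizer
    return Counter(t for t in tokens if t not in STOPWORDS)

cleaned = clean_transcript(
    "Hi Jane Doe, email me at student@fgcu.edu &amp; my UIN is 123456789",
    {"Jane Doe": "Librarian1"},
)
freqs = word_frequencies(cleaned)
```

In the real pipeline the name/ID patterns are replaced by spaCy's entity recognizer and the tokenizer and stopword list by NLTK's, but the overall shape of the processing is the same.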