
Computational Analysis of Chat Transcripts

Library Assessment Conference 2024 poster presentation

Project Challenges

The initial leg of this project presented several challenges.

First, the transcript data is messy. There have been 9,079 total transactions since 2015; 7,771 of those occurred in the last five years. FGCU librarians have never used the tagging feature, so there are no categories to filter by, and complaints about noise are mixed in with genuine research consultations. Chat length gives some indication of “importance,” but a review of the fifty longest chats from the last five years shows that length does not always equal academic rigor. Analyzing only the initial question does not work either, as many users do not enter a question when they begin chatting.

Although transcripts average only 10–11 messages overall, the fifty longest transcripts from the last five years average 68 messages per chat, and the longest is a whopping 158 messages over 42 minutes. Upon review, that chat took place during the COVID lockdown: a professor was asking about assignment design and requesting the creation of instructional videos for their composition classes.
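The length-based triage described above, averaging message counts and ranking the longest chats for manual review, can be sketched in a few lines. The transcript structure and identifiers here are hypothetical stand-ins for the actual export format, not the project's real data:

```python
# Minimal sketch: summarize message counts per transcript and rank the
# longest chats as candidates for manual review.
# The chat IDs, message structure, and counts below are illustrative only.
from statistics import mean

transcripts = {
    "chat_001": [{"sender": "user", "text": "Hi"}] * 12,
    "chat_002": [{"sender": "user", "text": "Hi"}] * 68,
    "chat_003": [{"sender": "user", "text": "Hi"}] * 158,
}

# Message count per chat
counts = {cid: len(msgs) for cid, msgs in transcripts.items()}
print(f"Average messages per chat: {mean(counts.values()):.1f}")

# Rank chats by length to surface the longest ones for human review
longest = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(longest[:3])
```

As the poster notes, this ranking only surfaces candidates; a long chat still needs human review to judge whether it reflects academic rigor.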

A final concern is that each chat transcript may include personally identifiable information (PII) such as the librarian’s full name, the user’s full or first name, the names of other people such as professors and mentors, email addresses, and students’ university ID numbers (UINs). Here is an example of a portion of a single transcript that has been manually redacted. In this sample, you can see that each response carries a timestamp and a name. In addition, the transcripts contain HTML character entities, HTML tags, and URLs, all of which presented some challenge in cleaning the data.
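The cleaning steps just described, unescaping HTML entities, stripping tags and URLs, and masking obvious PII patterns, might look something like the following. The regexes, placeholder labels, and assumed UIN length are illustrative assumptions, not the project's actual redaction rules; names, in particular, still require manual review or named-entity recognition:

```python
# Hedged sketch of transcript cleaning: HTML entities, tags, URLs, and
# pattern-based PII masking. Rules here are assumptions for illustration.
import html
import re

def clean_transcript(text: str) -> str:
    text = html.unescape(text)                           # &amp; -> &
    text = re.sub(r"<[^>]+>", " ", text)                 # drop HTML tags
    text = re.sub(r"https?://\S+", "[URL]", text)        # mask URLs
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", "[EMAIL]", text)
    text = re.sub(r"\b\d{8,9}\b", "[UIN]", text)         # assumed UIN length
    # Names are NOT handled here; they need manual redaction or NER.
    return re.sub(r"\s+", " ", text).strip()

sample = ("Please email jdoe@eagle.fgcu.edu &amp; my UIN is 81234567. "
          "See <b>this guide</b>: https://library.fgcu.edu/ask")
print(clean_transcript(sample))
```

Applying the masks after tag stripping keeps URLs inside `href` attributes from leaking back into the text.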

In the time since the project’s inception, the data analysis landscape has changed drastically. The continual evolution of artificial intelligence has, in some ways, superseded the older machine learning methods we originally planned to use. It is difficult to find time to keep up with continual training on new technologies while also doing our “day jobs” as librarians.