Analyzing 3 Years of Text Messages: The Metadata of Staying Social at a Distance
As we approach our seventh week of social distancing here in New Hampshire, I’m starting to reflect on the creative ways we’re managing our social time.
I know I am not alone in the number of digital parties I’m now regularly invited to. My calendar currently has a recurring weekly game night over Google Hangouts and a work happy hour through Slack. My wife and I also have FaceTime dinners with our moms, who both live alone. Lastly, there is the occasional Zoom get-together with family or friends.
Of all of the forms of “distance socializing” that I am participating in, the one that I think is the most exceptional is actually the one that has been happening since before COVID-19 ever entered into anyone’s lexicon.
The Nerd Thread
Since June of 2018 I’ve been part of a group text thread with seven close friends from college. This thread has become a place for us to share ideas and topics that we find funny or interesting. Mostly these conversations are jokes about current events, our lives, comic books, and movies.
We call our text group “the nerd thread,” and at this point the thread has itself become a joke to the group and a source of eye-rolling for our spouses and significant others. We don’t take ourselves too seriously and don’t think anyone else should either.
Rapid Insight’s Product Manager, Jon MacMillan, recently wrote a blog post about tracking COVID-19, and in it he shared his Tableau dashboard showing the rate of infection in different countries. “This is pretty cool and nerdy,” I said to myself. So I shared the dashboard with my nerd text thread.
Soon one of my friends asked if it was possible for me to do an analysis on our group thread. I pondered the possibility and became curious about what some of the summary statistics of the metadata would be.
For instance, who sends the most GIFs? When are individuals most active, during the day or night? What day of the week has the most activity? What is the highest number of texts sent in a single day?
With some help from Construct I was able to capture these details and show my socially distant friends who took top marks in which categories.
Getting the texts into Construct
The first step was the trickiest – how do I get the data into Construct?
The answer came by using a Mac tool for iOS called PhoneView. I already own it for backing up and saving content from my phone and tablet to my laptop. I connected my phone to my computer, ran this tool, and found a way to extract the entire thread into an XML file.
The XML file had two tables. One held the message content: the text of each message and the names of any attached files, such as GIFs, pictures, and videos. The other held the metadata for each message, such as the datetime it was sent, the sender’s contact name, and their phone number.
Next it was time to see how the rows and columns looked inside Construct. I was able to merge both tables together using a message ID field. From there I ran the job to see what other data cleaning needed to happen. It turned out there were a few corrupted messages that I could identify easily, and I was able to filter them out immediately using their message IDs.
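For anyone who wants to try something similar without Construct, the merge-and-filter step could be sketched in plain Python roughly like this. The field names (`message_id`, `text`, `attachment`, `sender`, `sent_at`) and the sample rows are hypothetical, since the actual layout of the exported XML isn't shown here. This illustrates an inner join on a shared ID, not the Construct job itself.

```python
def merge_on_message_id(content_rows, metadata_rows, corrupted_ids=frozenset()):
    """Inner-join two lists of dicts on 'message_id', skipping known-bad IDs."""
    # Index the metadata table by message ID for fast lookups.
    metadata_by_id = {row["message_id"]: row for row in metadata_rows}
    merged = []
    for row in content_rows:
        msg_id = row["message_id"]
        if msg_id in corrupted_ids or msg_id not in metadata_by_id:
            continue  # drop corrupted messages and rows with no metadata match
        merged.append({**metadata_by_id[msg_id], **row})
    return merged


# Hypothetical sample data standing in for the two exported tables.
content = [
    {"message_id": 1, "text": "Did you see the trailer?", "attachment": None},
    {"message_id": 2, "text": "", "attachment": "reaction.gif"},
    {"message_id": 3, "text": "###", "attachment": None},  # pretend this one is corrupted
]
metadata = [
    {"message_id": 1, "sender": "Craig", "sent_at": "2020-04-24 19:05:00"},
    {"message_id": 2, "sender": "Dave", "sent_at": "2020-04-24 19:06:30"},
    {"message_id": 3, "sender": "Clint", "sent_at": "2020-04-24 19:07:10"},
]

messages = merge_on_message_id(content, metadata, corrupted_ids={3})
print(len(messages))
```

Each merged row carries both the content and the metadata fields, which is the shape the later counting steps rely on.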
Next I noticed there were some strange spacing issues caused by certain text characters, like linefeeds and carriage returns. Using a transform node, I was able to quickly remove all of them, giving me a very clean, structured dataset.
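Outside of Construct, the same kind of cleanup could be approximated with a regular expression. This is only a sketch: the function name `clean_text` is made up for illustration, and it replaces each run of line breaks with a single space (rather than deleting it outright) so that words on either side don't get fused together.

```python
import re


def clean_text(text):
    """Collapse carriage returns / linefeeds (and whitespace around them) into one space."""
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


print(clean_text("Hello\r\nworld"))
```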
Now that my text data was structured inside Construct it was time to answer some questions that had come to mind. Who is leading the group with specific kinds of behavior? Using various configurations of the aggregate node I was able to find answers to the following questions.
Who sends the most texts?
It turns out that Craig fires off more texts than anyone else in the group by far: more than Jon and me combined, and more than Shawn and Mike combined.
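A count like this is essentially a group-by on the sender. As a rough Python equivalent of that aggregate step, assuming each message is a dict with a hypothetical `sender` key (the sample rows below are invented and do not reflect the real totals):

```python
from collections import Counter


def texts_per_sender(messages):
    """Count messages per sender, most active first."""
    return Counter(m["sender"] for m in messages).most_common()


# Invented sample rows, only to show the shape of the result.
sample = [
    {"sender": "Craig"},
    {"sender": "Dave"},
    {"sender": "Craig"},
    {"sender": "Clint"},
    {"sender": "Craig"},
]
print(texts_per_sender(sample))
```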
Who sends the longest texts and what was the longest message sent?
This category took an extra step of data preparation. Originally there were two messages with around 500 words, but they seemed like outliers to me. I found those messages and discovered that in both cases somebody had copied and pasted an article from the internet into their message. Wanting to return only words that were entered by hand, I removed these outlying records from the dataset. The result was this:
Craig may send the most individual texts, but Clint is the clear leader in the number of words that he puts into his messages. He also takes the top spot for longest message sent.
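Here is one way the word-count step might look in Python. One assumption to flag: the two pasted articles were found and removed by inspection, whereas this sketch uses a hypothetical `max_words` cutoff (400 is an arbitrary choice) as a stand-in for that manual step. The `sender` and `text` keys and the sample rows are also invented.

```python
from collections import defaultdict


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def average_words_per_sender(messages, max_words=400):
    """Mean words per message by sender, ignoring messages longer than max_words."""
    totals = defaultdict(lambda: [0, 0])  # sender -> [total words, message count]
    for m in messages:
        n = word_count(m["text"])
        if n > max_words:
            continue  # treat very long messages as pasted content
        totals[m["sender"]][0] += n
        totals[m["sender"]][1] += 1
    return {sender: words / count for sender, (words, count) in totals.items()}


def longest_message(messages, max_words=400):
    """Return the message with the most words among those within the cutoff."""
    kept = [m for m in messages if word_count(m["text"]) <= max_words]
    return max(kept, key=lambda m: word_count(m["text"]))


# Invented sample rows.
sample = [
    {"sender": "Clint", "text": "I have a long and detailed opinion about this movie"},
    {"sender": "Craig", "text": "lol"},
    {"sender": "Craig", "text": "so true"},
    {"sender": "Dave", "text": "word " * 500},  # stands in for a pasted article
]
averages = average_words_per_sender(sample)
print(averages)
```

A fixed cutoff is cruder than reading the suspicious messages yourself, so it is worth spot-checking anything it drops.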
Who loves to send the most GIFs?
I was also curious as to who sends the most additional content, like pictures and GIFs. I can identify the type of file sent by using a transform node to return the rightmost four characters of the file name, then bin those file types and count the records in the bins:
It turns out that Craig and Dave send the most GIFs and picture files to the group. When we remember that Craig and Dave also send the most texts overall in the group, this activity makes sense.
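The "last four characters, then bin" idea translates to Python fairly directly. In this sketch the bin table is hypothetical (the actual file types in the thread aren't listed), as are the `attachment` and `sender` keys. Note that taking exactly four characters means `.gif` keeps its dot while `jpeg` does not, so the table lists both forms.

```python
from collections import Counter

# Hypothetical bins keyed by the last four characters of the file name.
FILE_TYPE_BINS = {
    ".gif": "GIF",
    ".jpg": "Picture",
    "jpeg": "Picture",
    ".png": "Picture",
    ".mov": "Video",
    ".mp4": "Video",
}


def file_type(attachment_name):
    """Bin an attachment by the rightmost four characters of its name."""
    suffix = attachment_name[-4:].lower()
    return FILE_TYPE_BINS.get(suffix, "Other")


def attachments_per_sender(messages):
    """Count attachments for each (sender, file type) pair."""
    counts = Counter()
    for m in messages:
        name = m.get("attachment")
        if name:  # skip text-only messages
            counts[(m["sender"], file_type(name))] += 1
    return counts


# Invented sample rows.
sample = [
    {"sender": "Craig", "attachment": "reaction.GIF"},
    {"sender": "Craig", "attachment": "dance.gif"},
    {"sender": "Dave", "attachment": "IMG_0042.jpeg"},
    {"sender": "Dave", "attachment": None},
]
counts = attachments_per_sender(sample)
print(counts)
```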
What days of the week were messages sent?
Having explored some summary statistics of the messages that my friends are sending, I next became curious about when those messages were sent. This shifted my attention from the messages and words to the timestamp of each message. Using some additional operations in the transform node, I was able to easily return the day of the week on which each message was sent.
With a new variable of the day of week, I can use a chart node to draw up a histogram of the number of texts that were sent by each group member by the day of the week.
We know by now that Craig and Dave are the Chatty Kathys of the group, and now we can see that both do most of their texting on Fridays. Actually, Friday is a popular day for everyone. Mondays and Tuesdays see the lowest amount of traffic.
My thoughts took me one step further, and I wanted to see what time of day was most active. Again using the time that a text was sent, I was able to flag texts as being sent in the AM or PM. Using the aggregate node, I could count the number of AM texts vs. PM texts.
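For the curious, deriving a day-of-week variable from a timestamp takes only a few lines of Python. The timestamp format and the `sent_at` and `sender` keys below are assumptions for illustration, and the sample rows are invented. The resulting counts are the kind of table a bar chart by sender and day could be drawn from.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_of_week(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the English day name for a timestamp string."""
    # datetime.weekday() numbers Monday as 0 and Sunday as 6.
    return DAY_NAMES[datetime.strptime(timestamp, fmt).weekday()]


def texts_by_sender_and_day(messages):
    """Count messages for each (sender, day of week) pair."""
    return Counter((m["sender"], day_of_week(m["sent_at"])) for m in messages)


# Invented sample rows.
sample = [
    {"sender": "Craig", "sent_at": "2020-04-24 19:05:00"},  # a Friday
    {"sender": "Craig", "sent_at": "2020-04-24 21:40:00"},
    {"sender": "Dave", "sent_at": "2020-04-27 08:15:00"},  # a Monday
]
counts = texts_by_sender_and_day(sample)
print(counts)
```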
I chose to look specifically at Craig’s texts since he has the most texts of anyone in the group. When I bring Craig’s texts into a stacked bar chart, I can see that he is far more active in the afternoon and evening than in the morning. I can also infer that he may go to bed relatively early, since any texts sent after midnight would count toward the AM category.
If I wanted to set a custom cutoff for what is considered an AM text and a PM text, I could use the transform node and create IF statements or use the binning feature. From there I could set a rule that returns the category of AM or PM, or different times of the day (like morning, early afternoon, late night, etc.), based on a window of time of my choice.
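The AM/PM flag boils down to checking whether the hour is before noon. A minimal Python sketch of the flag and the per-person count, again assuming hypothetical `sender` and `sent_at` fields and using invented sample rows:

```python
from collections import Counter
from datetime import datetime


def am_or_pm(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Flag a timestamp as AM (00:00 to 11:59) or PM (12:00 to 23:59)."""
    return "AM" if datetime.strptime(timestamp, fmt).hour < 12 else "PM"


def am_pm_counts(messages, sender):
    """Count one sender's AM and PM messages."""
    return Counter(am_or_pm(m["sent_at"]) for m in messages if m["sender"] == sender)


# Invented sample rows.
sample = [
    {"sender": "Craig", "sent_at": "2020-04-24 00:30:00"},  # after midnight counts as AM
    {"sender": "Craig", "sent_at": "2020-04-24 14:00:00"},
    {"sender": "Craig", "sent_at": "2020-04-24 20:15:00"},
    {"sender": "Dave", "sent_at": "2020-04-24 09:00:00"},
]
craig = am_pm_counts(sample, "Craig")
print(craig)
```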
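As an illustration of what such a rule might look like outside of Construct, here is a small Python sketch that bins the hour into named windows. The window boundaries and labels are entirely hypothetical, so adjust them to taste.

```python
from datetime import datetime

# Hypothetical windows as (first hour inclusive, label), in ascending order.
TIME_WINDOWS = [
    (0, "late night"),
    (5, "morning"),
    (12, "early afternoon"),
    (15, "late afternoon"),
    (18, "evening"),
    (23, "late night"),
]


def time_of_day(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the label of the last window whose start hour is <= the message hour."""
    hour = datetime.strptime(timestamp, fmt).hour
    label = TIME_WINDOWS[0][1]
    for start, name in TIME_WINDOWS:
        if hour >= start:
            label = name
    return label


print(time_of_day("2020-04-24 13:00:00"))
```

Keeping the cutoffs in one table plays the same role as the binning feature: changing a boundary means editing a single row rather than rewriting a chain of IF statements.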
The most valuable parts of friendships are qualitative, not quantitative. You don’t become a better friend because you send more texts, or send texts with longer messages. Friendship is an intangible connection that changes with time. Living in a digital age, however, we can peer behind the curtain of socialization over video call platforms, emails, and yes, text messages.
Until the time comes when we can sit around a table again and play a board game, or share a meal, social distancing with technology is going to be the most convenient way to share time together.
When I shared the results of my little data analysis project with my nerd friends we had a big nerd laugh and then went on to talk about the next nerd topic. I’m sorry that due to social distancing, I won’t be able to see these guys in person for a little while, and I miss having in-person conversations with them. But in the meantime, I’ll keep my eye on my phone and try hard to send more GIFs.