The software vendor used a research grant to set up the project with WBS and replicate typical conditions for dealing with customer feedback. Rapide Communication had been offering a free “sentiment challenge”, which generated a pool of text for the study. Of the 20-plus clients that took up the offer, three were chosen for the comparison – a supermarket, a car company and a train company – each supplying around 1,000 pieces of customer feedback.
“In most companies, feedback gets analysed by a couple of people in the marketing department giving up their evenings to look through what people have said,” says Chris Worth, co-researcher and consultant who ran the study with WBS’s Dr Temi Abimbola. “Rapide Communication didn’t just want to test it against other software, because that is doing it in a vacuum. They wanted to look at how their application might perform against what happens in the real world.”
The first step of the study involved preparing the data by classifying it into 15 or 16 categories to reflect typical customer service issues. The human analysts then read the comments and scored them on a five-point scale, where one represented a highly negative response and five a highly positive one. The software used the same approach: each entity in a comment was assigned a score, and an aggregate score was taken for the comment overall.
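The entity-level scoring described above can be sketched in a few lines. This is an illustrative reconstruction, not the study’s actual implementation: the article does not specify how the per-entity scores were aggregated, so a simple mean is assumed here, and the entity names and scores are invented examples.

```python
from statistics import mean

def aggregate_comment_score(entity_scores: dict[str, int]) -> float:
    """Combine per-entity scores (1 = highly negative, 5 = highly
    positive) into a single comment-level score, assuming a simple
    mean as the aggregation rule."""
    if not entity_scores:
        raise ValueError("comment contains no scored entities")
    return mean(entity_scores.values())

# Hypothetical comment mentioning three entities, each already scored.
comment = {"staff": 4, "queue": 2, "parking": 1}
print(aggregate_comment_score(comment))  # mean of 4, 2 and 1
```

Other aggregation rules (a weighted mean, or taking the most extreme entity score) would change the comment-level result without changing the per-entity analysis.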
“The most significant finding was that there was a lot of variability in the human scores and that Rant & Rave’s scores fell well within that range. So the application is behaving like an average human,” says Worth.
Where the software differed in an important way from the human analysts was in its appetite for work. “The humans couldn’t analyse the full data set, only one quarter of it. That was a decision they made because of time and energy. The software ran the full 1,000-comment data set each time,” he says.
This is an important benefit of automating the task. To make the workload manageable for the human analysts, the data had to be cut down, but there is no way of doing so without risking the exclusion of a potentially vital insight. The researchers simply took the first and last 125 comments in each set.
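The head-and-tail selection the researchers used can be sketched as follows. The function name and parameter are ours, not the study’s; the article says only that the first and last 125 comments of each roughly 1,000-comment set were kept.

```python
def head_tail_sample(comments: list[str], k: int = 125) -> list[str]:
    """Keep the first k and last k comments, cutting a ~1,000-comment
    set down to the quarter the human analysts could manage."""
    if len(comments) <= 2 * k:
        return list(comments)  # set is already small enough
    return comments[:k] + comments[-k:]

# A 1,000-comment set reduces to 250 comments.
sample = head_tail_sample([f"comment {i}" for i in range(1000)])
print(len(sample))  # 250
```

A selection like this is simple and reproducible, but as the text notes, any such cut risks discarding the comments that happen to carry the vital insight; a random sample would trade reproducibility for less positional bias.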
One example of how the software identified a usable insight better than the human analysts came from the train company feedback. Many complaints related to the closure of a buffet car. The researchers classified these as either customer service or catering issues and scored the strength of sentiment. Rant & Rave was able to drill down and identify a broken kettle, mentioned in several comments, as the reason for the closure.
“Something costing a few pounds may have been losing the business thousands of pounds in revenue,” notes Worth.
The human analysts also tended to score feedback more cautiously, while the application emphasised both positives and negatives: it gave more scores of five or one where the analysts would assign a four or two to the same comment. “That turns out to be a positive outcome because it provides extra emphasis and gives the findings more value,” he says.
It was also the case that as the human analysts became tired, their ability to judge which category a comment should fall into, and its relative strength, declined. Their scoring was not consistent, unlike the application’s. Worth concludes: “It gets to the point where the software is seeing something the humans don’t.”