AI Ethics: Trusting AI for Employee Assessments

In his book Talking to Strangers, Malcolm Gladwell exposes the faults we have in assessing people. We might infer incorrect meanings from their words or assign them traits they don’t actually possess. In one example, Gladwell shows how poorly judges predicted recidivism among the accused. Worse, these judges were dramatically outperformed by artificial intelligence (AI)-based prediction engines. The files might have been complicated, but the judges were experienced in law and had seen hundreds, if not thousands, of cases. Despite this, a machine that could not look into the eyes of the accused was better at predicting whether someone would re-offend. More information, in particular being able to see the alleged perpetrator, led to worse outcomes.

So, you might wonder if AI and machines are better than humans when it comes to assessing how we will perform at our jobs.

The Art of Employee Assessments
When it comes to contact centers, we must look at how employee assessments are performed. One part of an employee’s role might be clearly measurable through KPIs. These might include total sales, number of interactions handled per hour, average handle time and first interaction resolution. Assessing against a KPI can vary widely depending on the type of queue the employee is working on, the product being sold, or even the time or day the employee is working. A customer support queue will have a different type of interaction than an inbound sales inquiry queue or an outbound appointment setting queue. Handle times, the types of questions being raised or the sentiment of the customer will vary widely. These interactions will also vary quite a bit from company to company, even for those providing similar products or services.

A business with a strategy of driving customers to retail or service centers might want to end interactions as quickly as possible to reduce costs. However, another business that prides itself on white-glove customer service might not care about handle time but will put a premium on first interaction resolution. Many influences on these KPIs might be outside the control of the employee.

A more direct, but also more subjective, part of an employee’s assessment might involve determining a net promoter score (NPS). A post-interaction survey might elicit answers from a customer about how well the employee resolved the issue or whether the employee was sympathetic to the customer’s needs. Depending on how the questions are framed, the responses might reflect more directly on a company’s products, services or approach to customer service rather than the performance of an individual employee. If the product is a problem, it might not be the employee’s fault. Likewise, a policy that the company has implemented in customer support might tie the employee’s hands in resolving the issue. However, an employee might bear the brunt of this through a poor post-interaction survey score.

The most direct assessment of an employee’s performance is the interaction review. This is when the employee’s manager or a quality assessor listens to an interaction recording or a live interaction and then tries to score how well the employee handled the interaction. The employee might be assessed against a single rating or the assessor might answer multiple questions related to the interaction, with some answers eventually rolled into a single score.

While the direct assessment of an interaction is likely the most faithful way to determine how an employee performs in an interaction with a customer, it has limitations.

This type of assessment usually can only be done at real-time speed. The result is that, with only a limited amount of time in a day, each employee might only have one or two interactions out of hundreds assessed every few weeks. This creates a significant problem: Any interaction that a manager listens to represents a tiny sample of the employee’s total interactions. If an employee was on a bad interaction (an unusually angry customer) or was having a bad day, it could have a dramatic and over-represented effect on their performance rating. And that can negatively affect their performance scoring, career advancement and remuneration.
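
To make the sampling problem concrete, here is a minimal Python sketch. It assumes a hypothetical employee with 300 scored interactions (the numbers are invented for illustration) and compares how much the reported average swings when an assessor reviews 2 of them versus 100.

```python
import random
import statistics

def spread_of_sampled_ratings(true_scores, sample_size, trials=2000, seed=0):
    """Standard deviation of the average score an assessor would report
    after reviewing `sample_size` randomly chosen interactions."""
    rng = random.Random(seed)
    averages = [
        statistics.mean(rng.sample(true_scores, sample_size))
        for _ in range(trials)
    ]
    return statistics.pstdev(averages)

# Hypothetical data: 285 good interactions (scored 4 or 5) and 15 bad ones (scored 1).
rng = random.Random(42)
scores = [rng.choice([4, 5]) for _ in range(285)] + [1] * 15

spread_small = spread_of_sampled_ratings(scores, sample_size=2)
spread_large = spread_of_sampled_ratings(scores, sample_size=100)
print(f"review 2 interactions:   spread = {spread_small:.2f}")
print(f"review 100 interactions: spread = {spread_large:.2f}")
```

The smaller the sample, the more a single bad interaction can move the rating, which is the over-representation effect described above.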

Similarly, the subjectivity of the assessor could also lead to poor evaluations. Suppose that, on that particular day, the assessor wasn’t feeling well or was tired. He or she could skew the evaluations to the negative. However, the problem could be larger. Human assessors might hold biases — and these biases could be reflected in their scoring of employees over time.

For an employee, three things can be discouraging: rarely being assessed, having the assessment of a single interaction affect their advancement, and knowing that there’s a potential for bias in the judging of their work. Over time, this could lead to employee attrition or, worse, liability to the company.

New Assessment Options Emerge
AI offers some options for companies that want to change how they handle assessments. Imagine if every interaction was evaluated and an employee scored without requiring a manager to constantly listen in on the employee. This could free up the assessor to work on interaction coaching or taking on interaction escalation.

By assessing every interaction, the system could also provide a holistic picture of the employee’s performance, instead of a snapshot from a single interaction. This would be more equitable to the employee.

By analyzing transcripts, a system can manage the quality of the interactions and make sure that an employee follows the required protocol. Did the employee state a disclaimer? Did they follow the up-sell script? Speech recognition and keyword or phrase spotting can be performed automatically — and at scale — to flag any issues or confirm compliance.
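
As a rough illustration of phrase spotting, the Python sketch below checks already-transcribed agent utterances against a set of required phrases. The rule names and phrases are hypothetical, and a production system would also have to cope with transcription errors and paraphrases.

```python
import re

# Hypothetical compliance rules: each rule maps to phrases that satisfy it.
REQUIRED_PHRASES = {
    "recording_disclaimer": ["this call may be recorded", "this call is being recorded"],
    "upsell_offer": ["extended warranty", "premium plan"],
}

def normalize(text):
    """Lowercase and collapse punctuation and whitespace so matching is lenient."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def check_compliance(agent_utterances, required=REQUIRED_PHRASES):
    """Return {rule: bool} showing whether any accepted phrase was spoken."""
    # Pad with spaces so phrases only match on whole-word boundaries.
    spoken = " " + " ".join(normalize(u) for u in agent_utterances) + " "
    return {
        rule: any(f" {normalize(p)} " in spoken for p in phrases)
        for rule, phrases in required.items()
    }

transcript = [
    "Thanks for calling. This call may be recorded for quality purposes.",
    "I can reset that password for you right now.",
]
result = check_compliance(transcript)
print(result)
```

In this example the disclaimer is found and the up-sell is not, so the interaction would be flagged for the missing script step.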

You could apply technologies like sentiment analysis to determine whether an employee is broadcasting a positive disposition to the customer — or whether they could maintain or improve the sentiment of the customer during the interaction. Emotion classification can be used in the same vein. An even more ambitious approach would be to learn how assessors rate employees’ interactions and then apply this rubric against all interactions.
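
One way to picture "improving the sentiment of the customer" is to compare the customer’s tone early and late in the interaction. The sketch below is a toy: two small word lists stand in for a real sentiment model, and both the lists and the scoring rule are assumptions made for illustration.

```python
# Toy word lists standing in for a real sentiment model (assumption).
POSITIVE = {"thanks", "great", "perfect", "helpful", "appreciate"}
NEGATIVE = {"angry", "ridiculous", "unacceptable", "frustrated", "terrible"}

def turn_sentiment(text):
    """Score one utterance: +1 per positive word, -1 per negative word."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def sentiment_shift(customer_turns):
    """Mean score of the second half of the turns minus the first half."""
    if len(customer_turns) < 2:
        return 0.0
    mid = len(customer_turns) // 2
    first = [turn_sentiment(t) for t in customer_turns[:mid]]
    second = [turn_sentiment(t) for t in customer_turns[mid:]]
    return sum(second) / len(second) - sum(first) / len(first)

customer_turns = [
    "This is ridiculous, I am so frustrated.",
    "It has been broken for a week.",
    "Oh, that worked. Great!",
    "Thanks, I appreciate the help.",
]
shift = sentiment_shift(customer_turns)
print(f"sentiment shift: {shift:+.1f}")
```

A positive shift suggests the customer ended the interaction in a better mood than they started, though a score like this only means something if the underlying model suits the people being assessed.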

Use Caution When Implementing AI-Based Assessments
Using AI for employee assessments creates its own risks. While these AI-based methods provide a new level of automation and accuracy to employee assessments in contact centers, they can’t be blindly implemented. Instead, they need to be one part of a larger picture of an employee’s performance. They’re prone to error and bias, so they shouldn’t completely replace all other assessments.

If the heart of an assessment tool relies on accurate transcription of an employee’s conversation with a customer, a mismatch between the sample population used to train the tool and the background of the employee could negatively affect the employee’s score. For example, if the speech recognition model was trained primarily on women from a particular region who spoke a specific dialect of a language, and the employee is a man from a different area with a different accent, the speech recognition might be less accurate. And this could affect the resulting score.
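
A simple way to look for this kind of mismatch is to compare word error rate (WER) across groups of speakers, using a small set of human-verified transcripts as the reference. The Python sketch below assumes such a set exists; the group labels and sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference word missing from the hypothesis
                curr[j - 1] + 1,         # extra word in the hypothesis
                prev[j - 1] + (r != h),  # substitution, or free if the words match
            ))
        prev = curr
    return prev[-1] / len(ref)

def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis). Returns mean WER per group."""
    per_group = {}
    for group, ref, hyp in samples:
        per_group.setdefault(group, []).append(word_error_rate(ref, hyp))
    return {g: sum(v) / len(v) for g, v in per_group.items()}

# Hypothetical verified transcripts paired with speech recognition output.
samples = [
    ("group_a", "your order ships on monday", "your order ships on monday"),
    ("group_b", "your order ships on monday", "your order chips on sunday"),
]
rates = wer_by_group(samples)
print(rates)
```

A large gap between groups is a signal that scores built on those transcripts should not be compared across employees without further checks.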

Likewise, sentiment analysis and emotion detection services trained on a population that differs from your employees could invert the analysis of an employee’s work. For example, such a service could assign a negative sentiment when the employee is very accommodating and helpful to the customer.

At the extreme, AI-based systems can learn and mimic human biases and prejudice. Tay, the Microsoft Twitter bot, provides a well-known example of this. Within hours, Microsoft engineers had to pull the plug after Tay began spewing misogynistic and racist epithets that it had picked up from Twitter. When it comes to automatic assessments of interactions, a service could be trained on data with embedded systemic biases and then apply those biases blindly when assessing future interactions.

To avoid this, humans should always have a dose of healthy skepticism toward machine-derived scores. It should be very clear where the service’s training data came from, whether that sample is reflective of the population being assessed and whether there were any systemic biases that could be carried forward by the service. These services also need to show their weightings whenever providing a score.
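
What "showing the weightings" could look like is sketched below in Python: a score that is always returned together with each component’s contribution. The component names and weights are hypothetical and are not taken from any particular product.

```python
from dataclasses import dataclass

@dataclass
class ScoreReport:
    total: float
    contributions: dict  # component name -> share of the total it contributed

def explainable_score(components, weights):
    """Weighted average of component scores (each in 0..1), returned
    together with the contribution of every component."""
    if set(components) != set(weights):
        raise ValueError("components and weights must name the same items")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")
    contributions = {
        name: weights[name] / weight_sum * components[name]
        for name in components
    }
    return ScoreReport(total=sum(contributions.values()), contributions=contributions)

# Hypothetical components and weights for a single interaction.
report = explainable_score(
    components={"compliance": 1.0, "sentiment_shift": 0.5, "resolution": 0.0},
    weights={"compliance": 2, "sentiment_shift": 1, "resolution": 1},
)
print(report.total)
print(report.contributions)
```

With the breakdown attached, an employee or manager can see which component drove a score and challenge it if that component is known to be unreliable.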

Ultimately, machines could become better than humans at assessing employee performance. But they’ll still need to convince us that they’re right in their assessments. AI-based technologies are powerful. Don’t wield them haphazardly, especially when using them to assess the performance of your contact center agents.

For more information on our AI Ethics efforts, read all the blogs in the series and join in the AI Ethics discussion online.