{"id":493900,"date":"2023-10-02T04:32:49","date_gmt":"2023-10-02T11:32:49","guid":{"rendered":"https:\/\/www.genesys.com\/blog\/post\/measuring-ai-quality-bias-accuracy-and-benchmarking-for-conversational-ai"},"modified":"2023-10-02T05:32:03","modified_gmt":"2023-10-02T12:32:03","slug":"measuring-ai-quality-bias-accuracy-and-benchmarking-for-conversational-ai","status":"publish","type":"blog","link":"https:\/\/www.genesys.com\/en-gb\/blog\/post\/measuring-ai-quality-bias-accuracy-and-benchmarking-for-conversational-ai","title":{"rendered":"Measuring AI Quality: Bias, Accuracy and Benchmarking for Conversational AI"},"content":{"rendered":"<div class=\"wpb-content-wrapper\"><p>[vc_section full_width=&#8221;stretch_row&#8221;][vc_row][vc_column][vc_column_text]<span data-contrast=\"auto\">Artificial intelligence (AI) practitioners are often asked to show their work. They have to prove that their AI technology works and is on par with \u2014 or better than \u2014 an alternative AI solution. It seems like a reasonable request. But measuring AI quality is difficult at best and, in some cases, it\u2019s just impossible. There are measures that are used for testing AI \u2014 error rates, recall, lift, confidence \u2014 but many of them are meaningless without context. And with AI, the real KPI is ROI.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">That\u2019s not to say that all AI technology is built the same or that quality is irrelevant. The quality of your AI solution has a material impact on your ability to use AI to achieve ROI. In this blog, I\u2019ll examine AI quality benchmarks and concepts as well as some best practices. 
This can serve as a reference point for those at any stage of the AI implementation journey.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2><span data-contrast=\"auto\">Start with Realistic Expectations<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">Some expect AI to be consistently accurate. The perception is that AI will correct human flaws and, since human error is inevitable and expected, AI must be its opposite. Achieving this level of perfection is an impossible standard. Expectations need to be realistic; the best way to measure AI success is business impact.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">There are some things that AI can do in the contact centre that a human can\u2019t. For example, even if a chatbot can only answer one question, it can still answer that one question 24\/7 without ever stopping for a break. If that question is important to a large percentage of customers, or a small but important customer segment, then that chatbot has value well beyond its ability to accurately understand and respond conversationally to a wide set of requests. <\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">For conversational AI, expectations of perfection are sure to disappoint.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Conversational AI bots are trained on data and the quality of the underlying natural language understanding (NLU) model depends on the data set used for training and testing. 
You might have seen some reports that show NLU benchmarks. When reviewing the numbers, make sure you understand what data was used.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Let\u2019s say Vendor A used the same training and testing data for the analysis, but vendors B and C had a different training data set. The results for Vendor A will likely outperform Vendor B and C. Vendor A is (essentially) using birth year to predict age, which is a model that is 100% accurate but likely not the best use of AI.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<h2><span data-contrast=\"auto\">Measuring Quality and the Basics of Benchmarks<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/h2>\n<p><span data-contrast=\"auto\">NLU models are measured using these <\/span><a href=\"https:\/\/towardsdatascience.com\/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">dimensions<\/span><\/a><span data-contrast=\"auto\">:<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<ul>\n<li><span data-contrast=\"auto\"><strong>Accuracy:<\/strong> Number of correct predictions over all predictions<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\"><strong>Precision:<\/strong> Number of positive predictions that are correct<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\"><strong>Recall<\/strong>: How many of the positive cases the classifier 
correctly predicted, over all the positive cases in the data \u2014 also known as sensitivity<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<li><span data-contrast=\"auto\"><strong>F1 score:<\/strong> The harmonic mean of precision and recall, combining both into a single measure\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/li>\n<\/ul>\n<p><span data-contrast=\"auto\">In the case of conversational AI, a positive prediction is a match between what a customer said and what a customer meant. Quality analysis compares how well the NLU model understands natural language. It doesn\u2019t measure how the conversational AI responds to what has been asked.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Your NLU model might be able to capture what the customers want accurately. However, your framework might not be able to connect to the systems it needs to satisfy those requests, transition the call to the right channel with context preserved or identify the right answer to the question. NLU accuracy is not a proxy for customer satisfaction or first-contact resolution.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">Scores are typically a point-in-time evaluation and are highly dependent on the data used for the analysis. Differences in scores might be difficult to interpret (unless you\u2019re a linguistic model expert). For example, if one NLU has a score of 79% and another has a score of 80%, what does that mean? 
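<\/span><\/p>
<p>To make these dimensions concrete, here is a minimal sketch that computes all four for a single intent treated as a binary classifier. The labels are hypothetical and purely illustrative:</p>

```python
# Hypothetical test labels for one intent, e.g. BalanceInquiry
# (1 = utterance is / was predicted as that intent, 0 = it is not / was not).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]  # ground truth from human labelling
y_pred = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]  # what the NLU model predicted

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(y_true)        # correct predictions over all predictions
precision = tp / (tp + fp)                # positive predictions that are correct
recall = tp / (tp + fn)                   # positive cases the model caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

<p>Note that every one of these numbers depends on the mix of positives and negatives in the test set, which is why the data behind a benchmark matters as much as the score itself.</p>
<p><span data-contrast=\"auto\">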
Published comparisons often leave out the scope of the test and how many times it was run, or bury them in very fine print, and they rarely provide the actual data used.\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">If you\u2019re considering an off-the-shelf, pre-trained bot, then having these benchmarks could be useful. But you may need to expand your evaluation to incorporate other factors such as the ability to personalise, analyse and optimise. More about that later.<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">For those who are using Genesys AI to build and deploy bots, NLU models are as unique as our customers. They\u2019re trained for purpose \u2014 using data that\u2019s either customer-specific or specific to the use case the customer is trying to solve. Standard benchmark reports wouldn\u2019t do Genesys AI justice. However, teams regularly test the performance of <\/span><a href=\"https:\/\/www.genesys.com\/article\/set-bot-confidence-thresholds-with-confidence\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Genesys NLU<\/span><\/a><span data-contrast=\"auto\"> versus others using a standard corpus (Figure 1).\u00a0<\/span><span data-ccp-props=\"{&quot;201341983&quot;:0,&quot;335559739&quot;:160,&quot;335559740&quot;:259}\">\u00a0<\/span>[\/vc_column_text][vc_single_image image=&#8221;491296&#8243; css=&#8221;.vc_custom_1694532622553{margin-top: 1.5em !important;margin-bottom: 1.5em !important;}&#8221;][vc_column_text]\u201cBanking bot\u201d is trained using a standard data set that represents the many self-service requests that are typical for a bank. This test shows that Genesys NLU is on par with some third-party AI and performs better on this test than others. 
When we expand the test to other languages, results vary.<\/p>\n<p>This test was performed with the same dataset translated (by a human) into the various languages represented below. If trying to compare NLU providers based on benchmarks alone, language is an important dimension. Some NLUs work particularly well with one set of languages but not as well with others (Figure 2).[\/vc_column_text][vc_single_image image=&#8221;491297&#8243; css=&#8221;.vc_custom_1694532610171{margin-top: 1.5em !important;margin-bottom: 1.5em !important;}&#8221;][vc_column_text]These and many other test results are used to fine-tune the underlying components to ensure that those who are creating bots can achieve the same or better level of NLU with their bot than they would through other popular NLU options.<\/p>\n<h2>How to Train a Conversational AI Bot<\/h2>\n<p>Bots are built for real-world use, not test data sets. We need to have a feel for how a model performs on data that is representative of the actual mix of incoming end-user queries for the use case that the bot has been created to address.\u00a0A &#8220;representative&#8221; test set will take a snapshot of actual (anonymised) customer utterances. This can be manual or automated.<\/p>\n<p>The intent distribution will be very unbalanced, since there&#8217;s usually a small number of intents that are the most frequent reasons customers are initially contacting your bot. And after a bot has gone through fine-tuning, the most frequent customer utterances will be part of the model; only those that aren\u2019t addressed will remain. This type of test set measures in-the-field performance and is <a href=\"https:\/\/www.genesys.com\/blog\/post\/optimizing-your-bot-an-ai-love-story\" target=\"_blank\" rel=\"noopener\">critical for maintaining quality over time<\/a>.<\/p>\n<p>NLU models are the foundation for conversational AI bots. 
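<\/p>
<p>The effect of an unbalanced intent distribution is easy to sketch. In this illustrative example (intent names and numbers are hypothetical), a balanced lab test set and a traffic-weighted, representative test set give noticeably different pictures of the same model:</p>

```python
# Hypothetical per-intent accuracy from a lab test, and each intent's share of
# real customer traffic. Names and numbers are illustrative only.
lab_accuracy = {'CheckBalance': 0.95, 'ReportFraud': 0.90, 'OrderChequebook': 0.60}
traffic_share = {'CheckBalance': 0.70, 'ReportFraud': 0.25, 'OrderChequebook': 0.05}

# A balanced test set weights every intent equally...
balanced_score = sum(lab_accuracy.values()) / len(lab_accuracy)

# ...while in-the-field performance weights each intent by how often customers ask it.
field_score = sum(lab_accuracy[i] * traffic_share[i] for i in lab_accuracy)

print(round(balanced_score, 3), round(field_score, 3))
```

<p>Here the traffic-weighted score is higher because the rarest intent is also the weakest one; it could just as easily go the other way, which is why a representative test set matters.</p>
<p>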
An NLU model predicts the end user&#8217;s <a href=\"https:\/\/help.genesys.cloud\/articles\/intents-overview\/\" target=\"_blank\" rel=\"noopener\">intent<\/a> and extracts <a href=\"https:\/\/help.genesys.cloud\/articles\/slots-and-slot-types-overview\/\" target=\"_blank\" rel=\"noopener\">slots<\/a> (data) by being trained using a set of example <a href=\"https:\/\/help.genesys.cloud\/articles\/work-with-utterances\/\" target=\"_blank\" rel=\"noopener\">utterances<\/a>, which typically consist of different ways a customer asks a question.<\/p>\n<p>Comprehensive bot authoring tools give the bot creator the ability to add and train the intents and slots required for their bot, while also providing analytical tools to learn how the bot is performing \u2014 and tools to improve the NLU model over time.<\/p>\n<p>Generic NLU models might do a good job understanding basic questions, but customer service isn\u2019t generic. The specificity that bots need to be effective comes from the training data (corpus) used during the training process. The closer the training data represents actual conversations, the better the bot will perform. One way to get better data is to use the actual conversations.<\/p>\n<p>A best practice would be to implement conversational AI technology with a tool that can extract intents and the utterances that represent those intents from actual conversations \u2014 from voice or digital. It&#8217;s important to have a way to test the bot prior to deployment and to capture any missed intent identification post-deployment. Conversational AI quality isn\u2019t static.<\/p>\n<p>Bots can improve over time if there\u2019s a way to optimise them post-deployment. This can be difficult with off-the-shelf bots that need custom development.<\/p>\n<p>A major advantage of having an integrated bot framework is that you can course correct bots that aren\u2019t performing as expected in real time without disruption. 
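<\/p>
<p>As a caricature of that training idea (a toy sketch only, in no way representative of a production NLU), an intent can be matched by word overlap with its example utterances:</p>

```python
# Toy intent matcher: pick the intent whose example utterances share the most
# words with the incoming utterance. Intent names and examples are hypothetical.
training_utterances = {
    'CheckBalance': ['what is my balance', 'how much money do I have'],
    'BlockCard': ['block my card', 'my card was stolen'],
}

def predict_intent(utterance):
    words = set(utterance.lower().split())
    scores = {}
    for intent, examples in training_utterances.items():
        example_words = set(w for e in examples for w in e.lower().split())
        scores[intent] = len(words & example_words)  # crude overlap score
    return max(scores, key=scores.get)

print(predict_intent('how much is in my account'))
```

<p>Even this caricature shows why the example utterances matter so much: the model can only generalise from the phrasings it has seen.</p>
<p>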
When thinking about quality, ask about optimisation. Is optimisation automatic? Is there a human-in-the-loop process? How can you, as the business, \u201csee\u201d what\u2019s happening?<\/p>\n<h2>Understanding Bias in AI<\/h2>\n<p>Bias should be part of the quality discussion. Bias in AI is inevitable. What\u2019s critical isn\u2019t whether bias exists (it does), but how well the bot can recognise it and trace it back to the source.<\/p>\n<p>If you are using a pre-trained model (an off-the-shelf bot), you likely don\u2019t know its training set. Even a large corpus drawn from a wide set of industry content can be biased if that source represents a single geography or a specific point in time. Examples of built-in bias that have derailed <a href=\"https:\/\/www.kdnuggets.com\/2022\/11\/expect-ai-quality-trends-2023.html\" target=\"_blank\" rel=\"noopener\">AI projects are out there<\/a>; some of those projects proved impossible to correct enough to be practical.<\/p>\n<p>Bias is of particular interest for those looking to use pre-trained large language models (LLMs). The advantage of having off-the-shelf, ready-to-go large models is that they\u2019re highly conversational and have been exposed to a wide variety of conversational patterns. However, the training sets are so large that they\u2019re hard to curate, and they depend on the ability of the vendors to find and use data that\u2019s truly impartial. A <a href=\"https:\/\/aclanthology.org\/2023.acl-long.656.pdf\" target=\"_blank\" rel=\"noopener\">recent paper<\/a> shows that LLMs are partisan \u2014 which you may want to consider when asking questions about politics or events. 
A partisan bias could alter the information you\u2019re receiving (Figure 3).[\/vc_column_text][vc_single_image image=&#8221;491299&#8243; css=&#8221;.vc_custom_1694532594022{margin-top: 1.5em !important;margin-bottom: 1.5em !important;}&#8221;][vc_column_text]<\/p>\n<h2>How to Deal with Conversational AI Bias<\/h2>\n<p>When embarking on an AI project, start with the goal and the output. What is the AI project meant to accomplish (automate), and what would the impact on that outcome be if the data used contains bias that materially alters the decision?<\/p>\n<p>For example, a biased model that\u2019s used to automate loan approval or hiring can yield a decision that\u2019s unethical and likely illegal in most countries. The mitigation strategy is to avoid using data such as age, gender and racial background. But often, the data contains other measures that are related to these protected categories and could drive the output, introducing the wrong type of bias into the model.<\/p>\n<p>For some sensitive decisions, such as employment, the use of AI is highly regulated and monitored (as it should be). When considering the outcome of conversational AI, bias might not have a significant impact on the conversation. While it should be considered, it is unlikely to carry the kind of business risk a biased employment model would. To evaluate and consider bias, start with the outcome and work your way back to the data.<\/p>\n<p>The source of bias is the training data. An employment model that was trained on data from a point in time when employment practices precluded certain groups from employment or from specific roles will bring this bias forward into the modern day. Making sure data is well balanced is one way to control for bias.<\/p>\n<p>For conversational AI and for AI-enabled customer experience automation, use actual conversational data from your own customer base as that represents the possible conversations. 
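<\/p>
<p>A simple first check is whether the training corpus mirrors the customer base at all. This sketch (locales and counts are hypothetical) flags segments whose share of the training utterances is far below their share of actual customers:</p>

```python
# Hypothetical counts of training utterances per locale, and each locale's
# share of the real customer base.
training_counts = {'en-GB': 9000, 'en-IN': 600, 'en-AU': 400}
customer_share = {'en-GB': 0.60, 'en-IN': 0.25, 'en-AU': 0.15}

total = sum(training_counts.values())
under_represented = [
    locale
    for locale, share in customer_share.items()
    if training_counts[locale] / total < 0.5 * share  # under half its real share
]

print(under_represented)
```

<p>The same idea extends to any dimension along which the bot could treat customers unevenly, such as channel, product line or region.</p>
<p>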
Built-in analytics enable users to assess whether there is bias in how the bot is responding. Watching how intents are derived from utterances, as well as which utterances are understood and which aren\u2019t, will show whether there is potential for bias.<\/p>\n<p>A built-in feedback mechanism can help capture any issues, and the optimisation tools enable organisations to adjust the bot. This isn\u2019t an unattended process that can run amok.<\/p>\n<p>Instead, it\u2019s controlled, measured and optimised with a human-in-the-loop process. This means it can work for both highly sensitive information and for general information. Some forms of AI can only do one or the other, which limits their efficacy.<\/p>\n<h2>Conversational AI Best Practices and Recommendations<\/h2>\n<p>Contact centres need conversational AI that provides a quality response to your customers and advances your business objectives. The temptation to search for benchmark reports and engineering specifications might yield a lot of data that could be hard to understand and is unlikely to help you meet your goals.<\/p>\n<p>It\u2019s important to have a solution that has the following characteristics:<\/p>\n<ol>\n<li>A human-in-the-loop process ensures there\u2019s oversight and control throughout the build, deploy, measure and optimise steps critical to effective conversational AI implementations.<\/li>\n<li>Ease of use is critical to quality, as it puts the control in the hands of the business. Having an easy-to-use, no-code bot framework democratises AI and removes skill blockers.<\/li>\n<li>Domain expertise means that the AI has built-in accelerators, analytics and connectors to make it easier to create a quality AI solution.<\/li>\n<li>Data transparency is a critical factor. You need to know where the data is coming from and what kind of data is being used during the training process. 
Opaque, pre-built models may seem easy, but they could contain issues that will derail a project \u2014 with no way to course correct.<\/li>\n<\/ol>\n<p>Learn more about the <a href=\"https:\/\/www.youtube.com\/watch?v=VmKKmKg6B4I\" target=\"_blank\" rel=\"noopener\">Genesys approach to conversational AI<\/a> with this video.[\/vc_column_text][\/vc_column][\/vc_row][vc_row][vc_column]<a class=\"component-cta-block card w-100 h-100 bgc-teal centered cta-text-center \"  href=\"https:\/\/www.genesys.com\/en-gb\/resources\/increase-your-cx-effectiveness-with-conversational-ai?ost_tool=blog&ost_campaign=ft-blog\" target=\"_blank\" rel=\"\"><div class=\"card-body text-center col-content\"><h4 class=\"font-roboto font-swb\">Increase CX effectiveness with conversational AI<\/h4>\n<p>Explore how to deliver better automated and human-assisted conversational experiences.<\/p>\n<div class=\" btn-container justify-content-center mt-2\"><div class=\"btn btn-white\">Read the ebook<\/div><\/div><\/div><\/a>[\/vc_column][\/vc_row][\/vc_section]<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>[vc_section full_width=&#8221;stretch_row&#8221;][vc_row][vc_column][vc_column_text]Artificial intelligence (AI) practitioners are often asked to show their work. They have to prove that their AI technology works and is on par with \u2014 or better than \u2014 an alternative AI solution. It seems like a reasonable request. 
But measuring AI quality is difficult at best and, in some cases, it\u2019s just [&hellip;]<\/p>\n","protected":false},"author":954,"featured_media":491322,"template":"","tax_priority":[54],"tax_blogtype":[17751],"tax_blogcategory":[15939],"tax_contenttheme":[14913],"tax_bundle":[],"tax_contenttheme2":[],"tax_capability_sitewide":[16209],"tax_products_programs":[16489],"tax_buying_job":[16658],"tax_buyer_persona":[16900],"tax_sector":[],"tax_segment":[17096,17121,17123],"class_list":["post-493900","blog","type-blog","status-publish","has-post-thumbnail","hentry","tax_priority-54","tax_blogtype-genesys-en-gb","tax_blogcategory-ai-and-machine-learning-en-gb","tax_contenttheme-level-up-your-technology-en-gb","tax_capability_sitewide-ai-and-automation-en-gb","tax_products_programs-genesys-ai-en-gb","tax_buying_job-job-2-solution-exploration-en-gb","tax_buyer_persona-technical-en-gb","tax_segment-enterprise-en-gb","tax_segment-midsized-en-gb","tax_segment-smb-en-gb","tax_content_type-blog-en-gb"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/blog\/493900","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/users\/954"}],"version-history":[{"count":5,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/blog\/493900\/revisions"}],"predecessor-version":[{"id":493905,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/blog\/493900\/revisions\/493905"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/media\/491322"}],"wp:attachment":[{"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/media?parent=493900"}],"wp:term":[{"taxonomy":"tax_priority","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_priority?p
ost=493900"},{"taxonomy":"tax_blogtype","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_blogtype?post=493900"},{"taxonomy":"tax_blogcategory","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_blogcategory?post=493900"},{"taxonomy":"tax_contenttheme","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_contenttheme?post=493900"},{"taxonomy":"tax_bundle","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_bundle?post=493900"},{"taxonomy":"tax_contenttheme2","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_contenttheme2?post=493900"},{"taxonomy":"tax_capability_sitewide","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_capability_sitewide?post=493900"},{"taxonomy":"tax_products_programs","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_products_programs?post=493900"},{"taxonomy":"tax_buying_job","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_buying_job?post=493900"},{"taxonomy":"tax_buyer_persona","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_buyer_persona?post=493900"},{"taxonomy":"tax_sector","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_sector?post=493900"},{"taxonomy":"tax_segment","embeddable":true,"href":"https:\/\/www.genesys.com\/en-gb\/wp-json\/wp\/v2\/tax_segment?post=493900"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}