Author: Chris Sheehan, EVP High Tech and AI, Applause
Reviewed by: Vaishnavi Nashte
Generative AI (Gen AI) has taken the world by storm. New Gen AI services designed to meet the requirements of almost every use case and profession imaginable are appearing all the time. Analysts estimate the market value will reach around $52.2B by 2028. However, to fully the benefits of Gen AI, organisations should implement the key elements of Responsible AI. Without adherence to Responsible AI, organisations run a high degree of risk across inaccuracy, safety, compliance, security and fairness.
In order to effectively mitigate these risks, organisations should develop a Responsible AI testing and evaluation process that generates trust internally and externally. Human-based testing is an essential part of this process and should be implemented at various stages of development, as well as after the application is released. Including wide-ranging, diverse perspectives is critical to reliable, inclusive output, but it’s a significant challenge to apply at scale.
The wisdom of crowds
Many businesses have attempted to train their Gen AI systems in-house, using only their employees to add the “human touch.” But even those with large workforces are unable to produce datasets that are large and diverse enough to meet the needs of their product teams. This is why organisations are turning to crowdtesting solutions that give them access to a global community of independent testers who can provide the levels of diversity and scrutiny required to test and validate their Gen AI systems.
These experts and end users provide the necessary input and feedback to navigate complex ethical issues and fine-tune Gen AI responses to meet human expectations on a scale that would be impossible to achieve in-house. The concept of incorporating crowd-sourced insights ahead of the release of a Gen AI application can uncover issues and opportunities for improvement that development and QA teams wouldn’t necessarily think of themselves.
Harnessing data and human feedback
High-quality training and testing data from experts and end users is often needed to power the large language models (LLMs) that underpin Gen AI services. LLMs generally require an incredible amount of data and information to make accurate predictions. Different datasets serve different purposes. Take financial services for example – preparing an LLM or algorithm to make predictions means collecting lots of quality data from experts within the financial services sector. As businesses increasingly trust AI in their decision-making processes, it’s even more important to make sure the dataset is accurate, large and diverse enough to reduce any biases that can lead to harmful conclusions.
Another layer of testing and validation can be achieved using a process known as red teaming, an adversarial technique designed to find points of failure in Gen AI. Red teaming involves a team of experts who execute a series of tests to identify problems related to security, safety, accuracy, functionality or performance. Red teaming can unearth points of failure that would be difficult to identify through regular testing methods.
Gen AI inclusivity and accessibility
Building ethical Gen AI systems requires proactive testing practices to uncover biases and mitigate risks. Risk reduction can be achieved by employing diverse testing practices, like recruiting testers from a range of demographic, geographic and psychographic backgrounds, including different skin tones, genders, ages, languages and more. For example, facial recognition systems trained on non-representative datasets can produce discriminatory results. Diverse testing is a strategy that can help mitigate various risks, although the technology will always have some level of risk associated with it.
Another consideration is accessibility – digital products, including Gen AI tools, can’t be considered inclusive if they are not accessible. Companies should optimise their applications for people with disabilities (PWD) through accessibility testing and inclusive design. Organisations can provide genuine value for users with both permanent and temporary disabilities by including PWD in testing and research. This approach naturally helps organisations deliver more reliable and inclusive AI systems.
The value of applying rigorous testing and training techniques involving expert testers and end users at scale is clear. Organisations would prefer to identify and resolve potential Gen AI safety and ethical issues during testing, before they occur in the real world. Incorporating elements of Responsible AI throughout development can help avoid unintended harms. By involving a large number of participants from a range of backgrounds and locations, who are willing to test and provide input early and often, organisations can achieve the level of diversity required for less biased, more effective results.