Testing the Future: A Guide to Testing AI Products with Users

June 5, 2023

Do you remember a world in which smartphones couldn’t recognize your face to unlock automatically, dictionaries were essential for translating foreign languages, and you relied on magazines for music and TV-show recommendations? Our interactions with the world have changed dramatically, and it’s now hard to imagine life before these advancements. Artificial intelligence (AI) products are revolutionizing the way we live, work, and interact with technology.

Artificial intelligence is a broad term that refers to machine-learning (ML) algorithms and other technologies that mimic human intelligence and automate tasks that would typically require human intervention. [1] AI products, ranging from virtual assistants such as Siri and Alexa to self-driving cars and smart-home appliances, are becoming increasingly prevalent in our daily lives. Forecasts indicate that the global AI software market will grow rapidly, reaching an estimated US$126 billion by 2025. [2]

But how do these AI products function, and how can designers and developers ensure that they are reliable and effective? The development and operation of AI products involve several key steps, as shown in Figure 1.

Now, let’s consider the following key steps in the development and operation of AI products in greater detail:

Data collection—Everything starts with data, the fuel for machine-learning algorithms. The system collects, preprocesses, cleans, and transforms as much data as possible, making it suitable for use by the ML algorithms.
Model training—The system trains the AI model, using the collected data and an appropriate machine-learning algorithm. The processing adjusts the model’s parameters to minimize the differences between the predicted and actual results.
User-interface development—Once the system trains the AI model, it embeds the model into the final product, where it interacts with real user data. This involves creating a user interface that enables the model to receive input and display its results.
AI model operation—At this stage, the algorithms receive data through the user interface, process it, and display the results in the form of recommendations, predictions, or advice.
Feedback collection—The product collects feedback from users, which it can use to retrain the model and improve its performance.

Feedback collection initiates a feedback loop, a key aspect of AI product development, enabling continuous learning and improvement of the model. By gathering feedback from users, we can identify areas where the algorithm makes incorrect, inaccurate, or biased predictions and work to eliminate those flaws, building a better and better experience with each iteration.
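To make the feedback loop a little more concrete, here is a minimal Python sketch of the feedback-collection step. It is only an illustration under stated assumptions: the `FeedbackEvent` record, the `error_rate` and `retraining_examples` helpers, and the idea that an uncorrected prediction counts as accepted are all hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FeedbackEvent:
    """One piece of user feedback on a single prediction (hypothetical schema)."""
    user_input: str
    prediction: str
    # None means the user accepted the prediction (an assumption of this sketch).
    correction: Optional[str] = None


def error_rate(events: List[FeedbackEvent]) -> float:
    """Share of predictions that users corrected to a different value."""
    if not events:
        return 0.0
    wrong = sum(
        1 for e in events
        if e.correction is not None and e.correction != e.prediction
    )
    return wrong / len(events)


def retraining_examples(events: List[FeedbackEvent]) -> List[Tuple[str, str]]:
    """Build (input, label) pairs, preferring the user's correction as the label."""
    return [
        (e.user_input, e.correction if e.correction is not None else e.prediction)
        for e in events
    ]


if __name__ == "__main__":
    log = [
        FeedbackEvent("upbeat songs for running", "Playlist A"),
        FeedbackEvent("calm music for reading", "Playlist B", correction="Playlist C"),
    ]
    print("Error rate:", error_rate(log))
    print("Examples for the next training round:", retraining_examples(log))
```

In practice, the team would decide how much to trust silent acceptance as a signal. Many users never correct a wrong prediction, so treating uncorrected output as correct can overstate the model's quality.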

In essence, the product designer is the architect of the feedback loop, shaping the way users interact with the AI and ensuring its continuous development. This work is vital not only to the product’s initial success but also to its continued high performance and relevance in an ever-changing technology landscape.

In this article, I’ll explore the process of testing AI products with users and discuss key metrics and techniques that designers can use to ensure their accuracy. Choosing the right testing method helps AI products to continue improving our lives by minimizing the risks and pitfalls that accompany any new technology.

Testing AI Products with Users

Ensuring that digital products are both user friendly and useful is crucial, and testing them with actual users is vital to achieving this goal. However, unlike traditional digital products, AI products are typically highly personalized and able to learn and adapt to new data, making them more complex. Therefore, their results can be unpredictable. For each new user, the results are slightly different.

Thus, testing AI products requires a slightly different approach than testing traditional digital products. As Figure 2 shows, you might sometimes have the opportunity to test a finished product, at other times only an idea for a product, and in some cases, just a smart algorithm that will be the basis for a future product.

Even so, the fundamentals of testing AI products should not differ significantly from those of testing any other digital product. The principle of user-centered design (UCD) requires that testing evaluate a product from the perspective of ordinary, actual users, without interfering with the product's underlying technologies.

Preparing for Testing

Preparing to test AI products requires careful planning to produce relevant results. To ensure that the testing process is effective and yields actionable insights, consider following these steps:

Determining product goals—Planning for testing should always begin by identifying the product’s core value and primary purpose. This focus helps guide the study and maintain its relevance throughout the process.
Developing a test plan—A detailed action plan should cover testing methods and procedures, the desired research audience, the metrics to use in measuring the success of the testing, and the expected outcomes. A well-structured plan also facilitates effective communication among all the departments who are involved in the testing.
Preparing your equipment—Ensure all the necessary hardware and software components are in place and functioning properly for the testing process.
Validating the test script—Conduct several internal test rehearsals with colleagues acting as users, and try to anticipate various user behaviors. This approach helps make your test script more flexible and better aligned with real-world scenarios. For instance, consider how users might interact with a virtual assistant such as Siri, asking different types of questions and expecting various typical responses.
Engaging users—Start inviting users who represent your target audience to participate in the testing. To ensure a smooth testing process for both users and the testing team, create a schedule, allocate breaks between sessions, and allow time for any necessary schedule adjustments and notetaking.

Testing the Finished Product

It’s easiest to test a smart product once it’s complete. In fact, testing a finished product is the most comprehensive approach and can provide a full assessment of its usefulness and convenience. However, this approach requires the availability of a finished product, which might not yet be ready or could be too expensive to develop without full confidence in its usefulness.

Whenever you have the opportunity to use a finished product for testing, you should take advantage of it. The most useful testing methods include usability testing, system-speed testing, and A/B testing. Let’s look at each of these methods in turn.

Usability Testing

Users perform tasks using the product’s user interface, while an observer captures notes on their errors and determines the severity of those errors. Usually, you’ll use a three-level scale for this purpose, as follows:

The most serious errors prevent the user from completing a task.
Less serious errors cause delays in execution.
The least serious errors are simply cosmetic.
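If the observer records notes in a structured form, tallying them by severity takes only a few lines of code. The following Python sketch assumes a hypothetical note format of (task, severity) pairs. The `Severity` names and their numeric values are illustrative choices, not a standard.

```python
from collections import Counter
from enum import IntEnum


class Severity(IntEnum):
    """Three-level scale described above; names and values are assumptions."""
    COSMETIC = 1  # least serious: cosmetic only
    DELAY = 2     # less serious: slowed the user down
    BLOCKER = 3   # most serious: the user could not complete the task


def summarize_errors(observations):
    """Count observed errors per task and severity.

    `observations` is an iterable of (task_name, Severity) tuples.
    Returns {task_name: Counter({Severity: count})}.
    """
    summary = {}
    for task, severity in observations:
        summary.setdefault(task, Counter())[severity] += 1
    return summary


def tasks_with_blockers(summary):
    """Return the sorted names of tasks that had at least one blocking error."""
    return sorted(
        task for task, counts in summary.items()
        if counts[Severity.BLOCKER] > 0
    )


if __name__ == "__main__":
    notes = [
        ("set an alarm", Severity.BLOCKER),
        ("set an alarm", Severity.COSMETIC),
        ("play a song", Severity.DELAY),
    ]
    summary = summarize_errors(notes)
    print("Tasks to fix first:", tasks_with_blockers(summary))
```

Sorting the findings this way helps the team address blocking errors before spending time on cosmetic ones.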

System-Speed Testing

Since the performance of the finished product varies for each user, it makes sense to pay special attention to evaluating the system’s performance, or speed, for different users. For instance, when testing a voice-activated virtual assistant such as Alexa, you might assess response times for users with different accents or speech patterns.
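One simple way to compare speed across groups of users is to summarize the response times you measured for each group. The Python sketch below assumes that you have already logged (group, seconds) pairs during test sessions. The group labels, the nearest-rank percentile method, and the choice of the median and 95th percentile are all illustrative assumptions.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile; `pct` is in the range (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def latency_by_group(samples):
    """Summarize response times in seconds per user group.

    `samples` is an iterable of (group_name, seconds) tuples.
    """
    grouped = {}
    for group, seconds in samples:
        grouped.setdefault(group, []).append(seconds)
    return {
        group: {
            "n": len(times),
            "median": statistics.median(times),
            "p95": percentile(times, 95),
        }
        for group, times in grouped.items()
    }


if __name__ == "__main__":
    measured = [
        ("accent A", 0.8), ("accent A", 1.1), ("accent A", 0.9),
        ("accent B", 1.4), ("accent B", 2.3), ("accent B", 1.6),
    ]
    for group, stats in latency_by_group(measured).items():
        print(group, stats)
```

A large gap between groups in the median or the slowest responses suggests that the product serves some users noticeably worse than others, even if the average looks acceptable.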

A/B Testing

Given that the operation of ML algorithms is often a black box, comparing two versions of the product with two different groups of users can be useful. For example, you could test two different recommendation algorithms for a music-streaming app, presenting each group with personalized playlists and measuring user engagement and satisfaction.
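When you compare two groups, it helps to check whether the difference you observe could be due to chance. The sketch below uses a two-proportion z-test, which is one common choice rather than the only valid one. It assumes that engagement is recorded as a yes-or-no outcome per user, for example, whether the user saved a recommended playlist. The function name and the sample numbers are hypothetical.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (rate_a, rate_b, z, p_value).
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both versions perform equally.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # All users engaged or none did, so there is no difference to test.
        return rate_a, rate_b, 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


if __name__ == "__main__":
    # Made-up numbers: users who saved a playlist, out of users in each group.
    rate_a, rate_b, z, p = two_proportion_z_test(120, 400, 150, 400)
    print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p:.3f}")
```

A small p-value suggests that the difference between the two algorithms is unlikely to be random noise, but it says nothing about why users preferred one version. Satisfaction ratings and interviews remain necessary for that.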

Testing Product Ideas

There are situations where the product is not yet ready or is only at the idea stage. Testing a product idea with real users can help determine whether it meets the needs of potential users, prevent future mistakes, or even influence the decision whether to continue working on the idea.

In this case, testing a simulation of the system and collecting user feedback about the product’s value can be useful. However, simulating all aspects of an artificial intelligence’s operation might not be possible, and user feedback might not accurately reflect the actual performance of the product.

Prototype Testing

In preparation for prototype testing, the product designer must design a user interface and create a clickable prototype that demonstrates how the AI component works. Participation in such testing requires a certain level of human empathy from the user because the product demonstration is not based on their personal data. Instead, the test data is that of a more abstract person whose goals and tasks must be clear to the user throughout testing. For example, when testing an idea for a personalized fitness app, the prototype might include predefined workout routines and nutrition plans that users can explore, then provide feedback on.

In addition to conducting prototype testing, you can explore a product idea through in-depth user interviews or focus groups, especially if the designer is looking for deep insights, and through surveys if the designer wants to obtain numeric data from a wide audience. For instance, when validating the idea for a smart travel-planning app, you could conduct focus groups to understand users’ pain points and expectations, then use surveys to gather broader data on desired features and preferences.

Testing Product Algorithms

Now let’s imagine the opposite situation: the team has trained an ML algorithm and an AI model on a large amount of data, but the algorithm is not yet integrated with a user interface, so it isn’t part of a fully realized digital product. Would testing the algorithm be worthwhile, and does the algorithm have any use for the average user?

You can assess algorithms by working with individual users. The designer should recruit several people, organize the collection of their data, and transfer the data to the team that is responsible for the ML algorithm. The ML team can then process the data and generate the model’s results, perhaps in the form of a simple spreadsheet (XLS) file. Finally, the designer should arrange review sessions with the recruited users to share the algorithm’s results and gather their feedback on the relevance and accuracy of the algorithm’s output.

For example, imagine that you’ve developed an ML algorithm to predict stock prices. To test the algorithm, you could recruit users with different investment portfolios, process their data through the model, and analyze the accuracy and relevance of the predictions together with the users.

Testing an ML algorithm for a product can provide valuable insights into the usefulness of an AI system’s core features. This approach can help you identify any problems with the accuracy and performance of the algorithm using real data. However, it’s worth remembering that this approach does not test the usability of the future user interface, which can significantly affect the overall usability of the product.

Key Metrics to Assess

To which metrics should you pay the most attention? As I’ve already mentioned, the process of testing an AI product should not differ significantly from the usability testing for any other product because the main objective of this research is to understand users’ needs, preferences, and personal experiences to create a more convenient, effective product. Although any user research should take into account many more metrics than I’ll present in this article, I want to highlight the key metrics that can help the product designer understand a bit more about the effectiveness of the AI aspect of a product.

algorithm accuracy—This metric represents the percentage of correct predictions, classifications, or recommendations the algorithm makes. [3] A higher value indicates a more accurate AI. For example, in a movie-recommendation system, you could measure accuracy by comparing the AI’s recommendations to the users’ actual preferences.
bias and fairness—Assessing the algorithm’s performance across users from different demographic groups ensures that the AI treats users equitably and avoids biased behavior. For instance, you must ensure that a job-matching AI doesn’t favor specific genders or ethnicities when presenting job opportunities.
user-error rate—Tracking the user-error rate helps identify usability issues and areas where the user interface might require improvement. High error rates could suggest that users are struggling to understand the AI’s recommendations or to navigate the user interface.
speed—If you expect your AI system to operate in real time, it is important to measure the speed at which it processes data to ensure efficiency and a positive user experience. For example, a real-time, language-translation app should provide translations quickly enough to maintain a smooth conversation.
explainability—Ensuring that your AI can explain how it arrived at a particular prediction or decision is essential, especially in high-stakes applications or those subject to regulatory scrutiny. For instance, a credit-scoring AI should provide clear reasons for its decisions to both users and regulators.
learnability—Measuring a system’s learnability helps determine whether your design is user friendly and easy for new users to understand, enabling them to become proficient with your AI user interface quickly and easily. To determine the system’s learnability, you could measure how long users take to complete tasks or ask them to rate the ease of learning the system.
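The first two metrics in this list lend themselves to a short worked example. The Python sketch below computes overall accuracy and then breaks it down by group as a rough first check on fairness. The record format, with `predicted`, `actual`, and `group` keys, is a hypothetical one, and a gap in accuracy between groups is only one of several possible fairness indicators.

```python
def accuracy(records):
    """Share of records where the prediction matched the actual outcome."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    correct = sum(1 for r in records if r["predicted"] == r["actual"])
    return correct / len(records)


def accuracy_by_group(records, group_key="group"):
    """Accuracy computed separately for each value of `group_key`."""
    grouped = {}
    for r in records:
        grouped.setdefault(r[group_key], []).append(r)
    return {group: accuracy(rows) for group, rows in grouped.items()}


def accuracy_gap(per_group):
    """Largest difference in accuracy between any two groups."""
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    # Made-up results from test sessions with a movie-recommendation system.
    results = [
        {"group": "18-34", "predicted": "liked", "actual": "liked"},
        {"group": "18-34", "predicted": "liked", "actual": "liked"},
        {"group": "55+", "predicted": "liked", "actual": "disliked"},
        {"group": "55+", "predicted": "disliked", "actual": "disliked"},
    ]
    per_group = accuracy_by_group(results)
    print("Overall accuracy:", accuracy(results))
    print("Accuracy by group:", per_group)
    print("Largest gap:", accuracy_gap(per_group))
```

With the small samples typical of user testing, treat such numbers as prompts for further investigation rather than as proof of bias or fairness.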

Challenges of Testing AI Products

During the process of testing an AI product with users, UX researchers might encounter two main challenges that they must be prepared to address.

One challenge that product designers could face when testing AI products with users is managing user expectations. Users might have unrealistic expectations about the capabilities and performance of an AI system that could impact their perception of the product’s usefulness and effectiveness. Therefore, designers must provide clear, accurate information about the AI’s capabilities and limitations during testing. Doing so helps users to understand what the AI can and cannot do and ensures that the feedback the testing gathers is relevant and valuable in improving the product. Figure 3 shows the impact of testing an AI product over time.

Another challenge in testing AI products with users is accounting for the dynamic and adaptive nature of any AI system. Because AI models learn and evolve over time, their performance and behaviors can change, potentially impacting the user experience. To address this challenge, designers should consider incorporating iterative testing and ongoing monitoring of the AI system into their testing process. This lets designers track the performance of the AI over time, identify any issues that might arise as the AI adapts, and make any adjustments necessary to maintain a consistent, high-quality user experience.
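Ongoing monitoring can start very simply. The sketch below keeps a rolling window of recent outcomes and flags the model when its recent accuracy falls noticeably below a baseline. The `AccuracyMonitor` class, the window size, and the tolerance value are all hypothetical choices for illustration. A real product would tune them and would likely track more than one metric.

```python
from collections import deque


class AccuracyMonitor:
    """Track recent prediction outcomes and flag a drop below a baseline."""

    def __init__(self, baseline, window=50, tolerance=0.05):
        self.baseline = baseline      # accuracy measured in an earlier test round
        self.tolerance = tolerance    # how far below baseline is still acceptable
        self.outcomes = deque(maxlen=window)  # keeps only the latest outcomes

    def record(self, was_correct):
        """Store whether the latest prediction was correct."""
        self.outcomes.append(bool(was_correct))

    def current_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def has_degraded(self):
        """True once the window is full and accuracy is below the allowed floor."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.current_accuracy() < self.baseline - self.tolerance


if __name__ == "__main__":
    monitor = AccuracyMonitor(baseline=0.9, window=4)
    for outcome in [True, True, True, False, False, False]:
        monitor.record(outcome)
        print(monitor.current_accuracy(), monitor.has_degraded())
```

A flag from a monitor like this is a cue to schedule another round of testing with users, not a replacement for it.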

Conclusion

Big Data and machine-learning algorithms offer the potential to foresee future events and changes before they happen. Domains such as finance, medicine, and the automotive industry are currently leading the way in adopting these technologies and are investing heavily in their development. However, for these AI technologies to be effective, collaboration between developers, data scientists, and product designers is crucial to ensure that they are comprehensible and accessible to everyday users.

As AI technologies advance, testing with users becomes increasingly critical in creating user-centric, unbiased, transparent AI products. It is essential to recognize that we cannot attribute the success of an AI/ML product solely to either the data scientists or the designers who have created a user-friendly interface. Instead, it is the combined effort of both that truly drives the success of an AI product. Understanding the fundamentals of product design is just as important for developers as grasping the basic principles of ML algorithms is for designers. This is why collaboration and cooperation play a critical role in the development of AI products.

Endnotes

[1] You’ll find a little more about the definition of AI in the Forbes 2018 article “The Key Definitions of Artificial Intelligence (AI) That Explain Its Importance,” by Bernard Marr.

[2] Worldwide revenues from the artificial intelligence (AI) software market, from 2018 to 2025.

[3] Read more about algorithm accuracy.
