Adopt the Right Testing Strategies for AI/ML Applications


The adoption of systems based on Artificial Intelligence (AI) and Machine Learning (ML) has risen exponentially in the past few years and is expected to keep climbing. As per the forecast by MarketsandMarkets, the global AI market size will grow from USD 58.3 billion in 2021 to USD 309.6 billion by 2026, at a CAGR of 39.7% during that forecast period. In a recent Algorithmia survey, 71% of respondents reported an increase in budgets for AI/ML initiatives, and some organizations are even looking to double their investments in these areas. With such rapid growth in these applications, QA practices and testing strategies for AI/ML applications and models also need to keep pace.

An ML model life-cycle involves multiple steps. The first is training the model on a set of feature sets. The second involves deploying the model, assessing its performance, and modifying it constantly so that it makes more accurate predictions. This differs from traditional applications: the model's outcome is not a single guaranteed-correct value but is only as right as the feature sets used for its training allow. The ML engine is built on predictive outcomes from datasets and focuses on constant refinement based on real-life data. Further, since it is impossible to obtain all possible data for a model, the ability to use a small percentage of data to generalize results to the larger picture is paramount.
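
To make this lifecycle concrete, here is a minimal sketch (assuming a synthetic dataset and scikit-learn, neither of which is prescribed by the article): a model is trained on part of the available data, its accuracy is assessed at deployment, and it is refined when newly arriving real-life data shows degradation.

```python
# Minimal sketch of the ML lifecycle: train, deploy, assess, refine.
# The synthetic dataset and the retraining tolerance are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=42)
X = rng.normal(size=(1000, 5))            # stand-in feature set
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy at deployment: {baseline_accuracy:.2f}")

# Later, real-life data arrives; refine the model if accuracy has degraded.
X_live = rng.normal(size=(200, 5))
y_live = (X_live[:, 0] + X_live[:, 1] > 0).astype(int)
live_accuracy = accuracy_score(y_live, model.predict(X_live))
if live_accuracy < baseline_accuracy - 0.05:   # illustrative tolerance
    model.fit(np.vstack([X_train, X_live]), np.concatenate([y_train, y_live]))
```

The retraining trigger used here (an accuracy drop of 0.05) is purely illustrative; in practice the tolerance would come from the business acceptance criteria.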

Since ML systems have their architecture steeped in constant change, traditional QA techniques need to be replaced with approaches that take the following nuances into account.

The QA approach in ML

Traditional QA approaches require a subject matter expert to understand the possible use case scenarios and outcomes. These instances across modules and applications are documented from the real world, which makes test case creation easier. Here the emphasis is more on understanding the functionality and behavior of the application under test. Further, automated tools that draw from databases enable the rapid creation of test cases with synthesized data. In a Machine Learning (ML) world, the focus is mainly on the decision made by the model and on understanding the various scenarios and data that could have led to that decision. This calls for an in-depth understanding of the possible outcomes that lead to a conclusion, along with knowledge of data science.

Secondly, the data available for creating a Machine Learning model is only a subset of real-world data. Hence, the model needs to be re-engineered consistently using real data. Rigorous manual follow-up is necessary once the model is deployed in order to continuously enhance its prediction capabilities. This also helps overcome trust issues with the model, as the equivalent decision in real life would have been taken with human intervention. QA focus needs to move in this direction so that the model comes closer to real-world accuracy.

Finally, business acceptance testing in a traditional QA approach involves creating an executable module and testing it in production. This traditional approach is more predictable, as the same set of scenarios continues to be tested until a new addition is made to the application. The scenario is different with ML engines, however. Business acceptance testing, in such cases, should be seen as an integral part of refining the model to improve its accuracy, drawing on real-world usage of the model.

The different phases of QA

Every machine learning model's creation is characterized by three phases, and the QA focus, be it functional or non-functional, applies to the ML engine across all three:

  • Data pipeline: The quality of input data sets plays a significant role in a Machine Learning system's ability to predict. The success of an ML model lies in testing the data pipelines that ensure clean and accurate data availability through big data and analytics techniques (see the data-quality sketch after this list).
  • Model building: Measuring the effectiveness of a model is very different from traditional techniques. Of the datasets available, 70-80% is used to train the model, while the remainder is used to validate and test it. The model's reported accuracy is therefore based on the accuracy shown on this smaller portion of the data. Ensuring that the data sets used for validating and testing the model are representative of the real-world scenario is essential; it shouldn't come to pass that the model, when pushed into production, fails for a particular category that was represented in neither the training nor the testing data sets. There is a strong need to ensure equitable distribution and representation in the data sets (see the stratified-split sketch after this list).
  • Deployment: Since all-round coverage of scenarios determines the accuracy of an ML model, and the ability to achieve that in real life is limited, the system cannot be expected to be performance-ready in one go. A host of tests needs to be run on the system, such as candidate testing and A/B testing, to ensure that it works correctly and can ease into a real-life environment (see the A/B testing sketch after this list). The concept of a sweat drift becomes relevant here, whereby we arrive at a measure of time by which the model starts behaving reliably. During this time, the QA person needs to manage data samples and validate model behavior appropriately. The tool landscape that supports this phase is still evolving.
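
For the data-pipeline phase, a QA suite can assert basic data-quality expectations before a batch of records reaches the training step. The column names, thresholds, and pandas-based check below are illustrative assumptions, not part of the article.

```python
# Illustrative data-quality checks for an ML data pipeline.
# Column names and thresholds are assumptions made for this sketch.
import pandas as pd

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality problems found in an incoming batch."""
    problems = []
    expected_columns = {"age", "income", "label"}
    if not expected_columns.issubset(df.columns):
        problems.append(f"missing columns: {expected_columns - set(df.columns)}")
        return problems
    if df[["age", "income"]].isnull().mean().max() > 0.01:
        problems.append("more than 1% missing values in feature columns")
    if not df["age"].between(0, 120).all():
        problems.append("age values outside the plausible 0-120 range")
    if not set(df["label"].unique()).issubset({0, 1}):
        problems.append("unexpected label values")
    return problems

batch = pd.DataFrame({"age": [34, 52, 130], "income": [40_000, None, 75_000], "label": [0, 1, 1]})
print(validate_batch(batch))  # flags the missing income and the out-of-range age
```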
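
For the model-building phase, the 70-80% / 20-30% split mentioned above can be made with stratification so that minority categories are represented in both the training and the validation/test sets. A minimal sketch, with a synthetic imbalanced label as an assumption:

```python
# Hold out ~20% of the data for validation/testing, stratifying on the
# label so that rare categories appear in both splits. The synthetic
# dataset and proportions are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 4))
y = rng.choice([0, 1], size=1000, p=[0.9, 0.1])   # imbalanced label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# QA check: the minority class should appear in the test set in roughly
# the same proportion as in the full dataset.
print("minority share, full set:", y.mean())
print("minority share, test set:", y_test.mean())
```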
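
For the deployment phase, candidate and A/B testing can be approximated by routing a small slice of live traffic to the candidate model and comparing its outcomes with the production model before promotion. The traffic split, the two models, and the promotion rule below are assumptions made for illustration only.

```python
# Simplified A/B comparison between a production model and a candidate.
# Traffic split, models, and the promotion rule are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=1)

# Pretend both models were trained earlier on historical data.
X_hist = rng.normal(size=(2000, 4))
y_hist = (X_hist[:, 0] > 0).astype(int)
production = LogisticRegression(C=0.01).fit(X_hist, y_hist)
candidate = LogisticRegression(C=1.0).fit(X_hist, y_hist)

# Live traffic with observed outcomes (stand-in data).
X_live = rng.normal(size=(500, 4))
y_live = (X_live[:, 0] > 0).astype(int)

# Route ~10% of live traffic to the candidate (the B arm).
b_arm = rng.random(len(X_live)) < 0.10
acc_production = accuracy_score(y_live[~b_arm], production.predict(X_live[~b_arm]))
acc_candidate = accuracy_score(y_live[b_arm], candidate.predict(X_live[b_arm]))

print(f"production accuracy: {acc_production:.2f}, candidate accuracy: {acc_candidate:.2f}")
if acc_candidate >= acc_production:
    print("candidate can be promoted after further soak-period monitoring")
```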

QA approaches need to emphasize the following to ensure the development and deployment of a robust ML engine.

Fairness:

The ideal ML model should be nonjudgmental and fair. Since it learns largely from data received from real-life scenarios, there is a strong chance that the model becomes biased if it receives data skewed towards a particular category or feature set. For example, if a chatbot that learns through an ML engine is made live and receives many racist inputs, the datasets the engine learns from become heavily skewed towards racism. The feedback loops that power many of these models then ensure that the racist bias makes its way into the ML engine. There have been instances of such chatbots being pulled down after noticeable changes in their behavior.

In a financial context, the same can be extended to biases developed when the model receives too many loan approval requests from a particular category of requestors, for example. Adequate effort needs to be made to remove these biases while aggregating, or slicing and dicing, these datasets and adding them to the ML engine.

One approach commonly followed to remove bias that can creep into a model is to build another model (an adversary) that learns, from the various parameters, where bias can arise and captures that bias within itself. By frequently moving back and forth between these two models as real-life data becomes available, the chances of arriving at a model with the bias removed become higher.
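
A minimal sketch of the detection half of this adversarial idea follows: a second (adversary) model tries to recover a sensitive attribute from the main model's scores alone, and adversary accuracy well above chance signals that bias is leaking into the predictions. The synthetic data, the sensitive attribute, and the interpretation threshold are assumptions; a full adversarial debiasing setup would alternate training between the two models rather than only measuring leakage.

```python
# Hedged sketch: use an adversary model to detect bias leakage.
# Synthetic data, sensitive attribute, and thresholds are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=7)
n = 2000
sensitive = rng.choice([0, 1], size=n)              # e.g., a protected category
X = rng.normal(size=(n, 3)) + sensitive[:, None]    # features correlated with it
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=7
)

# Main model: predicts the business outcome (e.g., loan approval).
main_model = LogisticRegression().fit(X_train, y_train)

# Adversary: tries to recover the sensitive attribute from the main
# model's scores alone. Accuracy well above 0.5 signals bias leakage.
adversary = LogisticRegression().fit(
    main_model.predict_proba(X_train)[:, [1]], s_train
)
leakage = accuracy_score(
    s_test, adversary.predict(main_model.predict_proba(X_test)[:, [1]])
)
print(f"adversary accuracy (0.5 = no leakage): {leakage:.2f}")
```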

Security:

Many ML models are finding widespread adoption across industries and are already beginning to be used in critical real-life situations. ML model development is very different from the process adopted for conventional software development. It is more error-prone on account of loopholes that can be exploited by malicious attacks, and it has a higher propensity to err on the wrong side when fed erroneous input data.

Many of these models do not start from scratch. They are built atop pre-existing models through transfer learning methods. If the underlying model was created by a malicious actor, a transfer-learned model built on it has every chance of having its purpose corrupted. Further, even after the model goes into production, maliciously crafted data fed into the model can change the predictions it generates.
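
One QA safeguard against maliciously crafted inputs is to screen live records against the distribution the model was trained on before they are used for prediction or queued for retraining. The anomaly detector and its parameters below are assumptions made for the sketch.

```python
# Hedged sketch: screen live inputs for anomalies before they reach the
# model or its retraining queue. Detector choice and parameters are assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=3)
X_train = rng.normal(size=(1000, 4))          # data the model was trained on

screen = IsolationForest(contamination=0.01, random_state=3).fit(X_train)

X_live = np.vstack([
    rng.normal(size=(50, 4)),                 # normal traffic
    rng.normal(loc=8.0, size=(5, 4)),         # suspicious outliers
])
flags = screen.predict(X_live)                # -1 marks anomalous records

accepted = X_live[flags == 1]
quarantined = X_live[flags == -1]
print(f"accepted {len(accepted)} records, quarantined {len(quarantined)} for review")
```

For the transfer-learning risk, a complementary control is to verify the provenance and checksum of any pre-trained base model before building on it.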

In conclusion, assuring the quality of AI/ML-based models and engines needs a fundamentally different approach from traditional testing. It needs to change continuously, focusing on the data being fed into the system and on the predictive outcomes made from it. Continuous testing that focuses on the quality of data, the ability to affect the predictive outcome, and removing biases in prediction is the answer.

(This blog was originally published in EuroStar)

Author

  • Raghukiran B

    Raghukiran works as a Test Architect with Trigent Software and has over 11 years of experience in QA and Test Automation. He is experienced in building automation strategies and scalable test automation frameworks, and has worked in domains such as Healthcare, Retail, Education, Supply Chain Management, Fleet Management, and Telematics. He has worked extensively on the automation of web, desktop, and mobile applications on Android and iOS platforms. He is interested in exploring new automation tools, continuous integration setups, and automation coverage.