A quick scan of the application landscape shows that customers are more empowered, more digitally savvy, and eager to get superior experiences faster. To achieve and maintain leadership in this landscape, organizations need to update applications constantly and at speed. This is why reliance on agile, DevOps, and CI/CD practices has increased sharply, which in turn has driven wider adoption of test data management initiatives. In a CI/CD pipeline, any new code that is developed is automatically integrated into the main application and tested continuously. Automated tests are critical to success, and agility is lost when test data delivery does not keep pace with code development and integration.
Why Test Data Management?
Industry data shows that up to 60% of development and testing time is consumed by data-related activities, with a significant portion dedicated to test data management. This helps explain why the global test data management market is expected to grow at a CAGR of 11.5% over the forecast period 2020-2025, according to the ResearchandMarkets TDM report.
Best Practices for Test Data Management
Any organization focusing on making its test data management discipline stronger and capable of supporting the new age digital delivery landscape needs to focus on the following three cornerstones.
The principle of shift left mandates that each phase in an SDLC has a tight feedback loop that keeps defects from moving down the development/deployment pipeline, making it less costly to detect and rectify errors. Its success hinges to a large extent on how closely test data maps to the production environment. Replicating or cloning production data is manually intensive, and as the World Quality Report 2020-21 shows, 79% of respondents create test data manually with each run. Scripts and automation tools can take on most of the heavy lifting and reduce this manual effort considerably when done well. Because production-quality data is very close to reality, defect leakage is reduced vastly, ultimately translating to a significant reduction in defect triage cost at later stages of development/deployment.
However, using production-quality data at all times may not be possible, especially for applications that are only a prototype or are being built from scratch. Additionally, using a complete copy of the production database is time- and effort-intensive; it is more worthwhile to identify relevant subsets for testing. A strategy that brings together the right mix of production-quality data and synthetic data closely aligned to production data models is the best bet. While production data maps to narrower testing outcomes in realistic environments, synthetic data is much broader and enables you to simulate conditions beyond the ambit of production data. Using test data automation platforms that allocate apt dataset combinations for tests can bring further stability to testing.
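As a rough illustration of mixing a production subset with synthetic data, the sketch below samples a few parent records, keeps only the child records that reference them, and generates synthetic rows that cover boundary values. The table shapes (`customers`, `orders`), the field names, and the edge-case ages are all hypothetical and chosen only for the example; a real schema would need its own relationships and rules.

```python
import random


def subset_with_children(customers, orders, sample_size, seed=0):
    """Sample customers and keep only the orders that reference them."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    picked = rng.sample(customers, min(sample_size, len(customers)))
    picked_ids = {c["id"] for c in picked}
    # Keeping only matching orders preserves referential integrity.
    related = [o for o in orders if o["customer_id"] in picked_ids]
    return picked, related


def synthetic_customers(start_id, count, seed=0):
    """Generate made-up customers, starting with boundary values for 'age'."""
    rng = random.Random(seed)
    edge_ages = [0, 17, 18, 120]  # hypothetical boundary values
    rows = []
    for i in range(count):
        age = edge_ages[i] if i < len(edge_ages) else rng.randint(18, 90)
        rows.append({"id": start_id + i, "age": age, "synthetic": True})
    return rows


if __name__ == "__main__":
    customers = [{"id": i, "age": 30} for i in range(1, 11)]
    orders = [{"order_id": i, "customer_id": (i % 10) + 1} for i in range(30)]
    picked, related = subset_with_children(customers, orders, sample_size=3)
    test_data = picked + synthetic_customers(start_id=1000, count=6)
    print(len(test_data), "customers,", len(related), "orders")
```

The fixed seed is a design choice: it makes the same subset come out on every run, so a failing test can be reproduced with the same data.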
Tight coupling with production data is also complicated by a host of data privacy laws like GDPR, CCPA, CPPA, etc., that mandate protecting customer-sensitive information. Anonymizing or obfuscating data to remove sensitive information is the usual approach to addressing this issue. Non-production environments are usually less secure, so data masking to protect PII becomes paramount.
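One common way to mask data is deterministic pseudonymization, where the same input always maps to the same masked value so that joins across tables still line up. The following is a minimal sketch under stated assumptions: the field list, the key, and the output format are hypothetical, and a real deployment would load the key from a secrets store and decide per field whether masking, generalization, or removal is appropriate.

```python
import hashlib
import hmac

# Hypothetical key for illustration; in practice it would not live in code.
MASKING_KEY = b"example-masking-key"

# Hypothetical set of fields treated as sensitive.
PII_FIELDS = {"name", "email", "phone"}


def mask_value(field: str, value: str) -> str:
    """Return a stable pseudonym for a sensitive value."""
    digest = hmac.new(
        MASKING_KEY, f"{field}:{value}".encode("utf-8"), hashlib.sha256
    ).hexdigest()
    token = digest[:12]
    if field == "email":
        # Keep an email-like shape so format validation in the app still passes.
        return f"user_{token}@example.test"
    return f"{field}_{token}"


def mask_record(record: dict) -> dict:
    """Copy a record, replacing PII fields and leaving the rest untouched."""
    return {
        key: mask_value(key, str(value))
        if key in PII_FIELDS and value is not None
        else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    customer = {"id": 17, "name": "Jane Doe", "email": "jane@corp.example", "plan": "gold"}
    print(mask_record(customer))
```

Using a keyed hash rather than a plain hash means the pseudonyms cannot be recomputed by someone who only knows the original values, as long as the key stays out of the non-production environment.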
Accuracy is critical in today’s digital transformation-led SDLC, where app updates are launched to market faster and need to be as error-free as possible, a nearly impossible feat without accurate test data. The technology landscape is also more complex and integrated than ever before, which adds to the complexity of data model relationships and of the environments in which they are used. What is needed is a single source of data truth. Many organizations create a gold master for data and then derive data subsets based on the needs of the application. Adopting tools that validate and update data automatically during each test run further ensures the accuracy of the master data.
Accuracy also entails ensuring the relevance of data in the context of the application being tested. Decade-old data formats might be applicable to an insurance application that needs historic policy data formats. However, demographic data or data related to customer purchasing behavior in a retail application is highly dynamic. A centralized data governance structure addresses this issue, at times sunsetting data that has served its purpose and thereby preventing any unintended usage. This also reduces the maintenance cost of archiving large amounts of test data.
Also important is a proper data governance mechanism that provides the right provisioning capability and ownership driven at a central level, thereby helping teams use a single source of data truth for testing. Applying the same provisioning techniques across teams can further remove cross-team constraints and ensure accurate data is available on demand.
The rapid adoption of digital platforms and the migration of applications to cloud environments have been driving exponential growth in user-generated data and cloud data traffic. The pandemic has accelerated this trend by moving the majority of application usage online. The ResearchandMarkets report states that for every terabyte of data growth in production, ten terabytes are used for development, testing, and other non-production use cases, thereby driving up costs. Given this magnitude of test data usage, it is essential to align data availability with the release schedules of the application so that testers don’t need to spend a lot of time tweaking data for every code release.
The other crucial element in ensuring data availability is version control of the data, which helps overcome the confusion caused by multiple, conflicting versions of local databases and datasets. A centrally managed test data team helps ensure a single source of data truth and provides subsets of data as applicable to various subsystems or based on the needs of the application under test. The central data repository also needs to keep changing and learning, since the APIs and interfaces of the application keep evolving, driving the need to update test data consistently. After every test, the quality of the data can be evaluated and the central repository updated, making it more accurate. This further drives reusability of data across a plethora of similar test scenarios.
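One simple way to tell dataset versions apart is to derive a fingerprint from the data itself, so that two teams holding the same rows see the same version and any change produces a new one. The sketch below assumes rows are lists of dictionaries with JSON-serializable values; the function name and the 12-character fingerprint length are arbitrary choices made for illustration, not a feature of any particular TDM tool.

```python
import hashlib
import json


def dataset_version(rows):
    """Return a short fingerprint that changes whenever the data changes."""
    # Serialize each row with sorted keys, then sort the rows themselves,
    # so neither key order nor row order affects the fingerprint.
    serialized_rows = sorted(
        json.dumps(row, sort_keys=True, separators=(",", ":")) for row in rows
    )
    canonical = "\n".join(serialized_rows)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


if __name__ == "__main__":
    master = [{"id": 1, "plan": "gold"}, {"id": 2, "plan": "basic"}]
    local_copy = [{"plan": "basic", "id": 2}, {"id": 1, "plan": "gold"}]
    drifted = [{"id": 1, "plan": "gold"}, {"id": 2, "plan": "premium"}]

    print(dataset_version(master) == dataset_version(local_copy))
    print(dataset_version(master) == dataset_version(drifted))
```

Recording such a fingerprint alongside each test run makes it possible to tell afterwards exactly which data a result was obtained with.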
The Importance of Choosing the Right Test Data Management Tools
In DevOps and CI/CD environments, delivering accurate test data at high velocity is an additional critical dimension of continuous integration and deployment. Choosing the right test data management framework and tool suite helps automate the various stages of making data test-ready: data generation, masking, scripting, provisioning, and cloning. The World Quality Report 2020-21 indicates that the adoption of cloud and tool stacks for TDM has increased, but more maturity is needed to use them effectively.
In summary, for test data management, as for many other disciplines, there is no one-size-fits-all approach. An optimum mix of production-mapped data and synthetic data, created and housed in a centrally managed repository, is an excellent way to go. However, this approach, particularly when it focuses on synthetic data generation, comes with its own set of challenges, including the need for strong domain and database expertise. Organizations have also been taking TDM to the next level by deploying AI and ML techniques, which scan through the data sets in the central repository and suggest the most suitable data for a particular application under test.