Introduction
There's one logical recursion I keep encountering in test automation. Test automation is about developing software targeted at testing some other software. So, the output of test automation is yet another piece of software. This is one of the reasons for treating test automation as a development process (which is one of the best practices for test automation). But how are we going to make sure that the software we create for testing is good enough? Indeed, when we develop software we use testing (and test automation) as one of the tools for checking and measuring the quality of the software under test.
So, what about the software we create for test automation?
On the other hand, we use testing to make sure that the software under test is of acceptable quality, and in the case of test automation we use another piece of software for this. In some cases this software becomes complicated as well. So, how can we rely on untested software for making any conclusions about the target product we develop? Of course, we can keep test automation simple, but that's not a universal solution. So, we should find a compromise where we use reliable software to check the target software (the system under test). Also, we should find a way to determine how deep testing should go and how we can measure that.
So, the main questions which appear here are:
- How can we identify that the automated tests we have are enough to measure the quality of the end product?
- How can we identify that our tests are really good?
- How can we keep quality control over our automated tests?
- How can we identify if our tests are of acceptable complexity?
What are tests applied to?
Before starting to describe how we can measure the quality of our tests, we should identify what exactly we should measure, or what our metrics should be based on. The main artifacts tests are bound to are:
- Requirements - any formal definition of how the system under test should work. It can be a dedicated document, a set of descriptions, or simply knowledge based on previous experience with similar systems. In any case, there should be some kind of description of how the system should behave.
- Implementation - the set of source code and corresponding resources which implements all items defined in the requirements
- Tests - any form of instructions targeted at verifying the correspondence between the requirements and the actual behavior of the system under test.
Although the implementation is a reflection of the requirements, tests can be mapped not just to requirements but also to separate parts of the implementation which are not strictly bound to any piece of functionality. This typically concerns auxiliary utility code used across the project: it is used by various functional parts representing business logic but is not dedicated to any of them. At the same time it's necessary to cover such utilities with tests to make sure nothing is broken after any change, as such a change may affect the business logic implementation.
So, given all the above, tests cover requirements and should be mapped to them somehow. In addition, tests cover implementation modules and should be mapped to them as well. This is the basis for answering the next question.
How can we identify that the automated tests we have are enough to measure the quality of the end product?
How do we cover requirements?
There's a common practice for requirements coverage: the Traceability Matrix. It normally sets the correspondence between requirements and tests. In the case of test automation it also sets the correspondence to automated tests. So, this matrix can be represented with a table like:
Requirement ID | Test Case ID | Auto-test ID |
---|---|---|
REQ-1 | TC-1 | ATC-1 |
REQ-1 | TC-1 | ATC-2 |
REQ-1 | TC-2 | ATC-3 |
REQ-2 | TC-3 | ATC-4 |
REQ-3 | TC-4 | - |
REQ-4 | - | - |
In the general case, each requirement may have multiple test cases verifying different aspects of it (e.g. positive/negative tests). Each test case may have multiple automated tests assigned, especially when the test case plays out several scenarios.
With such a scheme we can't get a single simple measure saying how good we are at requirements coverage, especially for automated tests. All we can use is two separate (and only loosely related) measures plus their combination:
- Test Case coverage - the relation of the number of requirements covered by test cases to the overall number of requirements. It can be reflected with the following formula:

  RCOVtc = Rtc / R

  where:
  - RCOVtc - requirements coverage by test cases
  - Rtc - the number of requirements covered by test cases
  - R - the overall number of requirements
- Automated Tests coverage - the part of the requirements covered by tests which have an automated implementation. It can be reflected with the following formula:

  RCOVatc = RCOVtc * TCCOVauto = RCOVtc * TCatc / TC

  where:
  - RCOVatc - requirements coverage by automated tests
  - RCOVtc - requirements coverage by test cases
  - TCCOVauto - test cases coverage by automated tests
  - TCatc - the number of test cases with an automated implementation
  - TC - the total number of test cases
- Overall Requirements Satisfaction Rate - the result we get after an entire test set run, showing which part of the requirements is met at all. The formula combines the previous values and looks like:

  ORSR = PassRate * RCOVtc * TCCOVauto

  where:
  - ORSR - the Overall Requirements Satisfaction Rate value
  - PassRate - the relation of passed tests to the total number of tests executed
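To make the arithmetic concrete, here is a minimal Java sketch (class and method names are illustrative) computing these values for the sample matrix above:

public class RequirementsCoverage {
    // RCOVtc = Rtc / R
    static double testCaseCoverage(int requirementsCovered, int requirementsTotal) {
        return (double) requirementsCovered / requirementsTotal;
    }

    // RCOVatc = RCOVtc * TCatc / TC
    static double automatedTestCoverage(double rcovTc, int automatedTestCases, int totalTestCases) {
        return rcovTc * automatedTestCases / totalTestCases;
    }

    // ORSR = PassRate * RCOVtc * TCCOVauto
    static double overallRequirementsSatisfactionRate(double passRate, double rcovTc, double tcCovAuto) {
        return passRate * rcovTc * tcCovAuto;
    }

    public static void main(String[] args) {
        // From the sample matrix: 3 of 4 requirements have test cases,
        // and 3 of the 4 test cases (TC-1..TC-3) have automated implementations.
        double rcovTc = testCaseCoverage(3, 4);                                     // 0.75
        double rcovAtc = automatedTestCoverage(rcovTc, 3, 4);                       // 0.5625
        double orsr = overallRequirementsSatisfactionRate(1.0, rcovTc, 3.0 / 4.0);  // 0.5625 at a 100% pass rate
        System.out.printf("RCOVtc=%.4f RCOVatc=%.4f ORSR=%.4f%n", rcovTc, rcovAtc, orsr);
    }
}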
How can we make these measures more precise and simple?
The above measures have some distortions and inconsistencies for the following reasons:
- A requirement is considered covered when at least one test is associated with it. But the requirement can be too general, and the test may cover just some part of it
- A test case is considered covered by automation when it has at least one automated test associated with it. If a test case involves several scenarios where only some of them have an automated implementation, it still counts, so the coverage number is not precise
- Coverage like this doesn't reflect cases which may occur due to technical implementation specifics
Requirements decomposition
Each requirement is split into atomic items, each requiring just a single check-point. In order to achieve better mapping between requirements and tests, it's better to perform such a split based on the testing techniques used. Thus, we can identify the range of valid inputs, invalid inputs, border conditions etc. Once we have a definition of the expected behavior in all of those cases, we can already make quite atomic and targeted tests. Thus, the above table is transformed into something like:
Requirement ID | Test Case ID | Auto-test ID |
---|---|---|
REQ-1-1 | TC-1-1 | ATC-1 |
REQ-1-2 | TC-1-2 | ATC-2 |
REQ-1-3 | TC-1-3 | ATC-3 |
REQ-2-1 | TC-2-1 | ATC-4 |
REQ-3 | TC-3 | - |
REQ-4 | - | - |
Map auto-tests to test cases
Make a 1:1 correspondence between each test scenario and its automated implementation so that it can be tracked easily. Thus, we get a matrix like:
Requirement ID | Test Case ID | Auto-test ID |
---|---|---|
REQ-1-1 | TC-1-1 | ATC-1-1 |
REQ-1-2 | TC-1-2 | ATC-1-2 |
REQ-1-3 | TC-1-3 | ATC-1-3 |
REQ-2-1 | TC-2-1 | ATC-2-1 |
REQ-3 | TC-3 | - |
REQ-4 | - | - |
But we still have untracked areas which we don't cover at all, and when we run the tests our results don't include any information about requirements coverage, so we still have to track requirements and their correspondence to tests separately. Generally, this stage is quite OK and a lot of projects stop here. But that doesn't mean it's really the maximum we can reach.
Make test cases and automated implementation a single unit
The idea is that each test case is created in a specific form which can be read and interpreted automatically by a test engine, which runs specific test instructions based on the test case steps description. This leads us to Keyword-driven testing, where each test case is a set of keywords processed by an automated engine. Thus, we collapse test cases and their automated implementation into a single unit where the test case itself is just an input resource for the automated tests (a minimal sketch of such an engine follows the table below). After such a transformation our table looks like:
Requirement ID | Test ID |
---|---|
REQ-1-1 | KTC-1-1 |
REQ-1-2 | KTC-1-2 |
REQ-1-3 | KTC-1-3 |
REQ-2-1 | KTC-2-1 |
REQ-3 | KTC-3 |
REQ-4 | - |
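As a minimal sketch of the idea (all names are illustrative, not a real framework): each test case is a plain-text resource of "keyword arg1 arg2 ..." lines which the engine maps to registered actions, so the test case and its automated implementation become a single unit.

import java.util.*;
import java.util.function.Consumer;

public class KeywordEngine {
    private final Map<String, Consumer<String[]>> keywords = new HashMap<>();

    // Register an action implementation behind a keyword
    public void register(String keyword, Consumer<String[]> action) {
        keywords.put(keyword, action);
    }

    // Interpret a keyword-driven test case line by line
    public void run(List<String> testCaseLines) {
        for (String line : testCaseLines) {
            String[] parts = line.trim().split("\\s+");
            Consumer<String[]> action = keywords.get(parts[0]);
            if (action == null) {
                throw new IllegalStateException("Unknown keyword: " + parts[0]);
            }
            action.accept(Arrays.copyOfRange(parts, 1, parts.length));
        }
    }

    public static void main(String[] args) {
        KeywordEngine engine = new KeywordEngine();
        engine.register("open", a -> System.out.println("Opening " + a[0]));
        engine.register("verifyTitle", a -> System.out.println("Verifying title " + a[0]));
        // KTC-1-1 expressed purely as keywords:
        engine.run(Arrays.asList("open loginPage", "verifyTitle Login"));
    }
}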
Make requirements executable
Previously we unified test cases and automated tests, which collapsed the table to just 2 columns and 2 major items: requirements and tests. But what if requirements are created in such a way that the tests covering them are generated automatically in a form suitable for automated execution? This approach is called Executable Requirements. Thus, requirements are automatically expanded into test cases, and test cases are expanded into automated tests. Eventually, we get a representation like:
Requirement ID |
---|
REQ-1-1 |
REQ-1-2 |
REQ-1-3 |
REQ-2-1 |
REQ-3 |
REQ-4 |
Since requirements now directly produce their executable tests, both RCOVtc and TCCOVauto become equal to 1, and the formula simplifies to:

ORSR = PassRate * RCOVtc = PassRate
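As one possible illustration, executable requirements are often implemented with a BDD framework; a minimal sketch assuming Cucumber-JVM and JUnit (the feature text, step class and inlined operation are hypothetical) could look like:

import io.cucumber.java.en.Given;
import io.cucumber.java.en.Then;
import io.cucumber.java.en.When;
import static org.junit.Assert.assertEquals;

// Hypothetical feature file (subtraction.feature) that *is* the requirement:
//   Scenario: Subtraction of two numbers
//     Given the operands 5 and 3
//     When they are subtracted
//     Then the result is 2
public class SubtractionSteps {
    private double a, b, result;

    // stand-in for the system under test operation, inlined to keep the sketch self-contained
    private static double subtract(double x, double y) {
        return x - y;
    }

    @Given("the operands {double} and {double}")
    public void theOperands(double first, double second) {
        a = first;
        b = second;
    }

    @When("they are subtracted")
    public void theyAreSubtracted() {
        result = subtract(a, b);
    }

    @Then("the result is {double}")
    public void theResultIs(double expected) {
        assertEquals(expected, result, 1e-9);
    }
}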
How do we cover implementation?
All of the above was related to binding requirements to tests. But we haven't covered the implementation at all. In some cases we may have implementation parts which aren't covered by any requirement, or some specifics which are not detailed in the requirements but exist in the code.
Why is this important? OK, let's keep only the ORSR metric and use nothing else. In this case we may get 100% coverage even when all tests are empty and don't do anything. So, in order to prevent such a situation we should also take into account code coverage metrics indicating that each specific code item is invoked at least once during the test run.
Mainly we can take line and branch coverage values as the most frequently used ones. We could also use class and function/method coverage, but that would actually be another reflection of the line coverage metric. We could also involve more complicated coverage metrics, but that's a matter for a separate chapter. For now we'll take the most frequently used metrics. The Overall Code Coverage may be calculated as the multiplication of all independent coverage metrics. Since all coverage metrics take values from 0 to 1 (or from 0% to 100%), the final value also fits this range. So, the formula is:

OCC = CCOVline * CCOVbranch

where:
- OCC - overall code coverage as an integrated code coverage measure
- CCOVline - code line coverage
- CCOVbranch - code branch coverage
Now we can combine this with the Overall Requirements Satisfaction Rate to cover both requirements and implementation. Let's name this unified metric the Overall Product Satisfaction Rate (OPSR) - the unified coverage of requirements and their implementation, which can also be interpreted as Overall Product Readiness. It is calculated as:
- ORSR = PassRate
- OCC = CCOVline * CCOVbranch
- OPSR = ORSR * OCC = PassRate * CCOVline * CCOVbranch
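A similarly minimal sketch of the combined metric (illustrative names; the coverage figures are arbitrary sample values, not real measurements):

public class ProductReadiness {
    // OCC = CCOVline * CCOVbranch
    static double overallCodeCoverage(double lineCoverage, double branchCoverage) {
        return lineCoverage * branchCoverage;
    }

    // OPSR = ORSR * OCC = PassRate * CCOVline * CCOVbranch (with executable requirements)
    static double overallProductSatisfactionRate(double passRate, double lineCoverage, double branchCoverage) {
        return passRate * overallCodeCoverage(lineCoverage, branchCoverage);
    }

    public static void main(String[] args) {
        // E.g. a run with a 95% pass rate, 80% line and 70% branch coverage:
        System.out.printf("OPSR = %.3f%n", overallProductSatisfactionRate(0.95, 0.80, 0.70)); // 0.532
    }
}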
Is that enough?
No. Although the coverage we measure is already complex and covers different aspects of the system under test, there are still gaps which may lead to inconsistent and wrong interpretation of results. One thing left uncovered here is the tests themselves. The next paragraphs describe this in more detail.
How can we identify that our tests are really good?
When can tests be bad?
Let's take a look at a small example of a requirement, its implementation and a test covering it to see why the OPSR metric is not enough to say that the system under test is of good quality. Let's say we have some system with a requirement that states:
Subtraction: for the given input A and B the result C is received as C = A - B

Let's assume we've already described all the necessary details regarding input format and acceptable values, and we already have tests for all those parts. Now we concentrate on the operation itself. Its implementation may look like:
double subtract(double a, double b) {
return a + b;
}
And now let's assume we have a test which covers the implementation:

void testSubtract() { subtract(2, 3); }

Firstly, note that the implementation sample uses the + operation, which is the opposite of subtraction. But also notice that the test simply invokes the operation without checking the result. If we measure overall coverage we'll see that the test covers all lines of the implementation, and it also covers the requirement. But you can see that the functionality is wrong and the test doesn't detect that.
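For contrast, a minimal JUnit sketch of the same test with a real check-point (the faulty implementation is inlined only to keep the example self-contained):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class SubtractionTest {
    // the faulty implementation from above, inlined for a self-contained example
    private double subtract(double a, double b) {
        return a + b;
    }

    @Test
    public void testSubtract() {
        // Now the wrong "+" implementation is detected: the call returns 5.0 while -1.0 is expected.
        assertEquals(-1.0, subtract(2, 3), 1e-9);
    }
}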
That's why the quality of our tests must also be estimated.
How can we detect that a test is good?
There are several criteria indicating that each specific test is of good quality:
- Test does what it's supposed to do - sometimes a test is designed for one thing but actually checks something else. This may happen for a bad reason (a mistake during automation) or a good one (the test case was updated without changing the automated implementation). Either way, we should be able to control such situations;
- Test operates with valid data - when we design our tests we should make sure that we use proper input and proper expectations for the output. In some cases we may operate with improper data or set improper results as expected (especially during test automation, when some people aim to make all tests pass assuming the data is correct rather than verifying data consistency).
- Test has a sufficient number of check points - it is a very frequent case that our tests have some check points but not enough to verify all items in the output. So, we should make sure that our tests can detect any potential errors in the output results;
- Test fails if the functionality under test is inaccessible or substantially changed - obviously, if the system under test doesn't work at all, a test interacting with it should fail. And if we replace a working module with something that doesn't work, there should be at least one test which detects that something went wrong;
- Test is independent - a test runs the same way both separately and in any combination with other tests, so it's independent of other tests. This is important as a lot of test engines (like any of the xUnit family or similar) do not give any guarantee regarding the sequence in which tests are performed. Additionally, we may need different sets of tests for different situations. And finally, if there's a test which depends on the results of another test, isn't it more correct to treat those two tests as one?
- Test runs the same way multiple times with the same result - each test should be predictable and reliable. At the very least it is useful to be able to reproduce the situation which happened during a test run.
So, what are the methods which may ensure the above items? Some of them are:
- Review - the most universal way of confirming test quality, at least because it can be done anywhere and applied to the widest range of potential problems. At the same time it's one of the most time-consuming ways and it doesn't mitigate the human factor.
- Cross-checks - some tests may be designed in such a way that they perform actions which produce similar or comparable results. So, additionally we can reconcile results by comparing related operations.
Example: Imagine we have some module supporting 2 operations:

  Operation 1: add(a, b) = a + b
  Operation 2: mult(a, b) = a * b

  We may add some tests verifying their functionality separately:

  Test 1: Expression add(a, b) = c is valid for a, b, c where

  a | b | c |
  ---|---|---|
  1 | 1 | 2 |
  2 | 0 | 2 |
  ... | ... | ... |

  Test 2: Expression mult(a, b) = c is valid for a, b, c where

  a | b | c |
  ---|---|---|
  1 | 1 | 1 |
  2 | 0 | 0 |
  ... | ... | ... |

  At the same time the above operations are related, and multiplication can be expressed through addition, e.g.: 2 * 3 = 2 + 2 + 2 (add 2 three times). A sketch of such a reconciliation test follows this list.
- Resource sharing across independent teams - this is rather a process item which means that input data and automated test implementation are produced by different people independently. When two people work in the same direction but from different sides and their results match, it increases the probability that both did their part properly. At the very least it avoids the risk of adapting data to the test from the implementation side while strictly controlling the data definition. There may be several examples of resource sharing:
  - Input data for data-driven tests - a test designer may prepare a data sheet with inputs and expected outputs, while a test automation engineer works on the common work flow based on some test samples.
  - Keyword-driven or similar approaches - using this approach the test designer creates test cases independently of the implementation. Test design and test automation are separate activities here. Thus, the test performs predictable actions with known and validated data.
- Mutation Testing - a testing type based on artificial error injection, used to check how good the tests are at detecting potential problems we know about. This approach is quite time and resource consuming, but it can be fully delegated to a machine. A hand-rolled illustration also follows this list.
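Here is the reconciliation test promised above as a minimal JUnit sketch; the add/mult operations are inlined stand-ins for the module from the example:

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class CrossCheckTest {
    // hypothetical module under test, inlined to keep the sketch self-contained
    static double add(double a, double b) { return a + b; }
    static double mult(double a, double b) { return a * b; }

    @Test
    public void multIsConsistentWithRepeatedAdd() {
        int a = 2, b = 3;
        double viaAdd = 0;
        for (int i = 0; i < b; i++) {
            viaAdd = add(viaAdd, a); // 2 + 2 + 2
        }
        // Reconciliation: a defect in either operation (or in the test data) shows up as a mismatch.
        assertEquals(viaAdd, mult(a, b), 1e-9);
    }
}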
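And here is the hand-rolled mutation illustration, also promised above. Real mutation testing tools (e.g. PIT for Java) inject such faults and collect the statistics automatically, so this sketch only shows the principle:

// The principle of mutation testing on the subtraction example: a mutant replaces
// "-" with "+", and the test suite must "kill" it, i.e. at least one check must fail.
public class SubtractMutantDemo {
    static double subtract(double a, double b, boolean mutantActive) {
        return mutantActive ? a + b  // injected fault
                            : a - b; // original implementation
    }

    public static void main(String[] args) {
        // The only check-point of our suite: 2 - 3 must be -1.
        boolean killed = Math.abs(subtract(2, 3, true) - (-1.0)) > 1e-9;
        System.out.println(killed
                ? "Mutant killed: the tests detect this class of fault"
                : "Mutant survived: the tests are too weak");
    }
}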
All those approaches have different ways and areas of influence. Also, some of the above items are techniques while others are process items, so it's hard to put them all in one place to see the entire picture. But the list below shows how each of the above items covers requirements, their implementation and the tests for them:
- Review is something that can be applied everywhere, not just to tests, and it can cover almost all aspects of the functionality and the tests for it
- Resource sharing and cross-checks touch all items to some degree. We can make various cross-checks to verify consistency between requirements, we can make more detailed tests based on the actual implementation, and we can verify the consistency of our tests. But these are rather technical and process items, and they are not applied everywhere
- Mutation testing is targeted at covering tests only
What can we measure there?
Generally, most of the items listed in this paragraph are about how to do things. Only one of them produces measurable results and says what should be covered, what's already covered and how much: Mutation Testing and the metric we can get from this practice. This metric can be called the Mutation Coverage Rate, and it shows how many of the potential mutations we can inject into the system under test are detected by the tests. We'll denote this value as M%.
Having this value calculated, we may say how thoroughly we check each system under test code line we invoke. Thus, this value actually complements the results given by the Overall Code Coverage metric we received before. Earlier we also combined the Overall Code Coverage characteristic with requirements coverage and got the joint Overall Product Satisfaction Rate value. So, now we can get a new quality characteristic named Overall Satisfaction Rate (OSR) indicating our assurance of the requirements and implementation covered. This value can be calculated as:

OSR = OPSR * M% = PassRate * OCC * M%

where:
- ORSR - How many expectations are met at all?
- OCC - Which part of the entire application code have we invoked?
- OPSR - Did we check all capabilities of our system for expectation satisfaction? If not, which part of the actual system under test meets expectations?
- M% - How many potential problems do we cover and stand ready to detect with our tests?
- OSR - Which part of the actual system under test are we sure meets expectations?
With these metrics in place:
- We can detect and measure which requirements are covered well enough and which require more tests
- We can detect and measure which functionality wasn't implemented (non-covered requirements)
- We can detect and measure which tests require more check points
Is that enough?
No.
Firstly, the above metric is coverage-based and we actually used about 5 coverage metrics in it. But, for instance, the ISO/IEC/IEEE DIS 29119-4:2013 standard defines about 20 coverage metrics which can be applied depending on the techniques used. And even if we integrated all of those metrics, we would still just minimize the probability of leaving something uncovered, as there can always be some coverage item which is a superposition of the items already used.
Secondly, it cannot be an absolute quality metric as it doesn't cover such technical aspects as maintainability, testability and many other software characteristics (here is an example model for maintainability).
So, there is always room for further work. But we have a restricted budget, so we should always think not about absolute coverage but about coverage of an acceptable level.
How can we keep quality control over our automated tests?
What to test in tests?
This is another main topic of the chapter. Since automated tests are another form of software, similar practices should be applied to them, and testing shouldn't be an exception. Logically, we should apply a similar approach. But subjectively, testing for testing looks like an overhead. Imagine: we do testing for software, then testing for testing, then (if we keep the same logic) testing for testing for testing, and so on. It's insanity! We are not making software for the purpose of testing it. The initial software is the product we make; the tests for it are just targeted at simplifying our lives, not making them more complicated.
What should we do here? The simplest way is to forget about such testing: everything works fine, I've checked that. Yes, we can always use an excuse like that. But in this chapter I'm looking for objective criteria stating that our testing solution is of appropriate quality. Earlier we described an entire way to measure system under test quality. So, now imagine our testing solution is that system under test, and let's apply the same approach, just for lulz, to prove how our theory can be applied to a specific case.
The overall automated testing solution structure consists of the following components:
- Engine - the core driver of the system which is responsible for test organization, execution, reporting and event handling. In some cases it's a completely external module (e.g. any engine of the xUnit family); in some cases it's something custom-written (even if based on an existing engine).
- Core Library - the set of utility libraries and various wrapper functions which are not bound to the application under test but operate at a higher level of abstraction than the engine. Typically these are data conversion functions, UI wrapper libraries, and additional functions which are not specific to the application under test but exist just to minimize copy/paste
- Business Functions - a set of functionality which reflects application-specific behavior and actually represents the actions to perform with the system under test
- Tests - the final implementation of test scenarios
- Technology-specific - the group of components which is not really bound to the application under test and can be applied to similar applications or applications using a similar technology stack
- Application-specific - the group of components which reflect the application under test functionality and cannot be used anywhere outside the application under test
How can this all be tested?
Each of the test automation solution components can have an individual approach to testing, but mainly testing can be applied in the following way:
Group | Structure Component | Testing Approach |
---|---|---|
Technology-specific | Engine | There are 2 major ways of testing this part: if the engine is an external module (e.g. any engine of the xUnit family), it is already covered by its own tests and we simply rely on a stable version; if the engine is custom-written (even if based on an existing one), it should be treated as separate software and covered with its own unit and integration tests |
Technology-specific | Core Library | Since the core library is also a kind of software which can be used outside the specific project, we can treat it as a separate library and apply the same unit, integration and system tests to it, considering that we're not bound to any specific application |
Application-specific | Business Functions | Business functions are actually a reflection of the application under test functionality. So, the tests themselves are a kind of unit, integration, system or whatever tests for all those business functions |
Application-specific | Tests | Normally each test is a kind of function which doesn't return any value and doesn't accept parameters (or at least it can be expanded to that form in the case of data-driven tests). The test result is either pass or fail depending on whether we encounter an error during execution. So, a hypothetical test which tests this test would be a single instruction call and nothing else, which doesn't differ from a normal test run. So, if we want tests for the tests themselves, we just need to make trial test runs on some test environment |
- The application-specific components (Business Functions and Tests) do not require any additional tests to be created: the testing solution tests itself
- The technology-specific components (Engine and Core Library) can be treated as separate software, and we can apply all the same practices we use for testing our application under test. So, test solution components which are not specific to the application under test should be tested separately as stand-alone software (see the sketch below)
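For example, a core-library utility can be covered by ordinary unit tests with no reference to any application under test; a minimal sketch (the converter itself is a hypothetical example, inlined to stay self-contained):

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class DateConverterTest {
    // hypothetical core-library utility: converts ISO dates to a display format
    static String isoToDisplay(String isoDate) {
        if (!isoDate.matches("\\d{4}-\\d{2}-\\d{2}")) {
            throw new IllegalArgumentException("Not an ISO date: " + isoDate);
        }
        String[] p = isoDate.split("-");
        return p[2] + "/" + p[1] + "/" + p[0];
    }

    @Test
    public void convertsIsoDateToDisplayFormat() {
        assertEquals("31/12/2024", isoToDisplay("2024-12-31"));
    }

    @Test(expected = IllegalArgumentException.class)
    public void rejectsMalformedInput() {
        isoToDisplay("not-a-date");
    }
}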
NOTE: Actually it's not entirely correct to say that application-specific functionality and resources never require separate testing activities. There may be different cases. E.g. in one of my previous projects we used to run tests verifying that our window definitions were up to date with the current application. That was done for GUI-level testing, and it served as a kind of unit test for that test type. Normally, though, for GUI testing there should be a separate test which just navigates through different screens with minimal business actions and verifies that all controls which are supposed to be there actually exist. So, it doesn't break anything said above; it's more about the proper interpretation of the tests.
How can we identify if our tests are of acceptable complexity?
Good. We now know what to test and how to detect when our tests and all subsidiary components have reached a sufficient reliability level. Thus, we are not only confident about our system under test quality but also about the quality of the tools we use. But despite this confidence we shouldn't forget that our main goal is developing the system under test, not the tests for it. So, if testing activities take more resources than the actual development, there's probably something wrong. From the technical side this problem may be caused by testing solution complexity. In order to control the situation and prevent such a problem we need to measure this complexity.
If we talk about code complexity we can use a metric named Cyclomatic Complexity. For each function it shows the number of possible flows through which the function can be performed. There is a common practice stating that each method/function should have a Cyclomatic Complexity Number (further CCN) value less than or equal to 10. If the CCN is between 10 and 20 the method is moderately good. If it is higher, the method is treated as non-testable. This is a good metric for keeping our code granular. But we can also use it for complexity comparison between the testing solution and the application under test.
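As a quick illustration of how the CCN is counted (the helper below is purely illustrative), each decision point adds one flow to the single straight-line path:

// CCN = 1 (straight-line path) + 1 (for loop) + 1 (if) + 1 (&& condition) = 4
static int countPositiveEvens(int[] values) {
    int count = 0;
    for (int v : values) {          // +1
        if (v > 0 && v % 2 == 0) {  // +1 for the if, +1 for &&
            count++;
        }
    }
    return count;
}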
Complexity of tests
In previous paragraphs we defined some criteria of good tests, and one of them sounds like:
- Test runs the same way multiple times with the same result

The most predictable test is one with a single execution flow, and the number of possible flows is exactly what CCN counts. So:
- Each test has CCN >= 1
- In the ideal case all tests have CCN = 1
- The more tests with CCN > 1 we have, the lower the TSR value

Based on this we can express the Tests Simplicity Rate as:

TSR = TCatc / (CCN(1) + CCN(2) + ... + CCN(TCatc))

where:
- TSR - the tests simplicity rate value
- CCN(i) - the CCN value of the test with index i
- TCatc - the number of automated tests

With the above calculation we may express test complexity with the TSR value, which is 100% when all tests have just one flow and approaches 0 when tests are too complicated.
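A minimal Java sketch of the TSR calculation, assuming the formula reconstructed above (the CCN values are arbitrary samples):

public class TestSimplicityRate {
    // TSR = TCatc / sum(CCN(i)); equals 1.0 only when every test has a single flow.
    static double tsr(int[] ccnPerTest) {
        int sum = 0;
        for (int ccn : ccnPerTest) {
            sum += ccn;
        }
        return (double) ccnPerTest.length / sum;
    }

    public static void main(String[] args) {
        System.out.println(tsr(new int[] {1, 1, 1, 1})); // 1.0  - all tests are single-flow
        System.out.println(tsr(new int[] {1, 3, 5, 7})); // 0.25 - complex tests drag TSR down
    }
}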
Complexity of subsidiary testing solution components
For subsidiary testing solution components like the Engine or Core Library there's one major criterion of acceptable complexity: the subsidiary module should have less complexity than the application under test. This criterion applies only to modules developed as part of the project, so e.g. we don't need to measure the complexity of JUnit if we use it. But as soon as we write a custom extension of any JUnit class, we should take it into account while calculating complexity.
For better comparison we can aggregate the CCN numbers for all the code of the system under test and do the same for the subsidiary module. After that we may derive a Test Component Simplicity Rate (TCSR) value from the ratio of these aggregated CCN numbers.
Is that enough?
No. The above characteristic is based on a single factor. We can include many more factors to make the measure more precise and visible. And the main thing which should be of interest is the value that any testing effort brings; everything spins around that value.
Where to go next?
In this chapter we've described several testing solution quality metrics which give us some visibility into how good we are with our testing. Eventually, we've managed to consolidate multiple metrics into one to get a short and compact result. We may involve many other metrics and consolidate them too, but we should always take the following into account:
- No matter how many metrics we add, there are always areas to grow in. So, if we haven't reached the top, we should expand our testing to reach it; if we have reached the top, we need to find other metrics.
- We should always interpret results properly: 100% doesn't always mean a perfect result
- Any number we get should be used for a purpose. We should clearly understand what each number shows and what it doesn't