Tuesday 27 December 2022

Toy Examples

 

Goal

Most people work from within whatever framework has been built at their workplace. This has the disadvantage that it is hard to try new techniques or to imagine how new types of tests could be integrated. Furthermore, most frameworks have layer upon layer of abstraction, so I made a repo that is meant as a playground of sorts. I wanted to split the task of learning in two and separate the how from the complexity of a full framework. The repo explores how to write automated tests against an incredibly simple system, so that each test is easier to manipulate and easier to hold “in memory”. These are not unit tests but integration-level API tests.

Link to repo

Structure of repo:

toy_example

  • app.py

  • database_creation.py

  • requirements_toy_example.txt

  • README.md

  • tests

    • load_testing_file.py

    • test_faker.py

    • test_property_based.py

Getting Started

  • Create a virtual environment

  • Install the dependencies listed in requirements_toy_example.txt (e.g. pip install -r requirements_toy_example.txt)

  • Open the database_creation.py file and run the create_db_table function (see the sketch after this list)

  • Run the Flask app in app.py

  • test_faker.py and test_property_based.py are both meant to be run with pytest

  • load_testing_file.py is meant to be run with the Locust framework (more info later)
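A minimal sketch of the database step from a Python shell, assuming you run it from the toy_example directory and that create_db_table is importable from database_creation.py at the top level (the repo may wire this up slightly differently):

from database_creation import create_db_table

create_db_table()  # builds the SQLite database and its single table

After that, running app.py starts the Flask server and the tests can be pointed at http://127.0.0.1:5000/artists.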

The App

The application under test is a very simple Flask-based API backed by an equally simple SQLite database. The routes defined in app.py provide basic CRUD (Create, Read, Update, and Delete) functionality. The subject matter is comic book artists. The only information that can be passed in is:

  • First name

  • Last name

  • Birth year

  • Primary key (an integer that iterates upwards automatically)

All calls are made to the same local endpoint:

http://127.0.0.1:5000/artists

The basic functions available are:

  • Get (show all records available)

  • Get by ID (show a specific record, based on the primary key)

  • Post (insert a new record into database)

  • Put (modify an existing record, requires primary key id)

  • Delete (remove a record)

Each of these maps to a specific database function in a pretty close to 1-to-1 relationship. This setup is pretty similar to our own API at Broadsign. Almost every call that ATS makes to achieve our goals fits into one of these categories (with a few exceptions).

The idea is to make an application so simple that it takes no more than a few minutes to understand what is going on, with no stored functions that require lots of mental space to parse. There is only one table right now, and it is a dead simple table with very little data.
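To make the round trip concrete, here is roughly what exercising these routes might look like with the requests library. This is a sketch only: the exact JSON keys (first_name, last_name, birth_year), the /artists/<id> URL shape, and the id field in the response are assumptions based on the fields listed above.

import requests

BASE_URL = "http://127.0.0.1:5000/artists"

# Create a new artist (POST)
new_artist = {"first_name": "Jack", "last_name": "Kirby", "birth_year": 1917}
created = requests.post(BASE_URL, json=new_artist)
artist_id = created.json()["id"]  # assumes the API echoes back the new primary key

# Read (GET all records, then GET one by primary key)
requests.get(BASE_URL)
requests.get(f"{BASE_URL}/{artist_id}")

# Update (PUT) and Delete an existing record
requests.put(f"{BASE_URL}/{artist_id}", json={"first_name": "Jacob"})
requests.delete(f"{BASE_URL}/{artist_id}")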

Regression Testing (Faker tests)

The main use case of writing tests with the Faker library is to ensure that, when given reasonable data, the app continues to perform the expected actions. These are the majority of the tests we run in ATS: we have established a specification and written tests to make sure that spec continues to be true over time as we change certain parts of DML. For these tests we follow the Arrange, Act, Assert format. Essentially we are giving structural information about what is under test and what is merely incidental code. This is a common theme in BDD (behaviour-driven development), which sometimes uses the terms given-when-then from the Gherkin specification.

Each test has a specific state that it wants to push the system into (arrange). It then performs a change to the system (act), and finally collects data about how the system responds (assert).

Ideally these tests should not be tied to specific implementation details, or if they need to be, those details should be abstracted out to a single method so that when something changes we only need to update one place. These tests should be as fast as possible; however, the fastest tests will also generally cover the least code. A full end-to-end test will be slow, require lots of network activity and database writes, and provide lots of places for items not under test to cause false failures. So we need to strike a balance when writing these types of tests.

Here is a link to common fake data that the Faker library can provide. We mostly use it to create reasonable first and last names. This will not stress the system, but it will ensure that “normal” data input does not cause the system to enter a bad state. Here is an example of how we generate a fake first name:

import faker

fake = faker.Faker()
first_name = fake.first_name()

I have stuck that code into the method create_faker_artist, which then feeds a JSON-like object back to be used in a REST call. After each call I check that it returns a 200 status code, which indicates that it completed successfully. The simplest test is test_insert: I provide all the required info in a hardcoded format, just for ease of inspection. The next simplest is test_list_all, which simply does a GET against the endpoint and returns everything in the database. After that we actually start using the Faker library: test_insert_faker grabs two fake names (first and last) and a random birth year somewhere between 1850 and 2022. I was pretty flexible in how I treat the birth year info because JSON treats it like a string anyway, and SQLite does not really care what data type it is either.
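As a rough sketch of how that fits together (the JSON keys and the exact shape of the helper are assumptions, not the code in the repo):

import random

import faker
import requests

ENDPOINT = "http://127.0.0.1:5000/artists"

def create_faker_artist():
    # Build a JSON-like payload with believable, "normal" data.
    fake = faker.Faker()
    return {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "birth_year": random.randint(1850, 2022),
    }

def test_insert_faker():
    # Insert the fake artist and check the API reports success.
    response = requests.post(ENDPOINT, json=create_faker_artist())
    assert response.status_code == 200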

Now we move on to the more complicated tests: test_search_by_id, test_update_artist, and test_delete_artist. All of these require the creation of a specific object so that it can later be manipulated. For the search test I need to create a new comic book artist in order to search for it, so I start with a POST (create) and then make a search call (GET) based on the ID returned by the POST. Then I assert that the object I sent in the POST matches what was returned by the GET. Updating and deleting follow basically the same steps, except instead of a GET I do a PUT or a DELETE.
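Written out in the Arrange/Act/Assert style, the search test might look something like this. It is a sketch only, reusing the imports, ENDPOINT constant, and create_faker_artist helper from the sketch above, and it assumes the POST response returns the new record's id and that the GET-by-ID route is /artists/<id>:

def test_search_by_id():
    # Arrange: push the system into a known state by creating an artist.
    artist = create_faker_artist()
    created = requests.post(ENDPOINT, json=artist)
    assert created.status_code == 200
    artist_id = created.json()["id"]

    # Act: search for the record we just created.
    found = requests.get(f"{ENDPOINT}/{artist_id}")

    # Assert: the API returns the same data we sent in.
    assert found.status_code == 200
    assert found.json()["first_name"] == artist["first_name"]
    assert found.json()["last_name"] == artist["last_name"]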

Exploration Tests (Hypothesis)

The Hypothesis library offers a lot of options for creating random data. The basic premise is that you feed in a recipe defined using the @given decorator. The value you want passed into the test becomes the test function's argument (test_first or test_last in this case), and you define whether it should be text, an integer, or something else; read more about it in this article here. Once you define the scope of what is acceptable input, pytest will then run the test dozens, hundreds, or even thousands of times. The number of times it runs and how long it will wait are controlled with the @settings decorator, via the max_examples and deadline arguments. Examples are stored in a local sub-folder, and failing inputs are re-tried and shrunk until the minimal version of what is causing the problem is found; you can learn more in this Pycon talk.

So now that we have looked at the headers and given a short idea of what this does, let’s tackle the why. We know that purely random data is not going to hit the exact sweet spot needed to cause a bug every time, so why feed random data into our automation at all? The goal of Hypothesis is to find exactly the right “random” data in the shortest amount of time possible, so it keeps track of what it has tried before and tries something different next time. Given that we can run this thousands and thousands of times, it is clearly going to be most effective when the test is incredibly short, in the milliseconds range rather than the minutes range. So this is a tool that will help us when we are making calls against an API or feeding unit-test level inputs into a class, not something that we want to use while setting up a BSP to run for 30-90 seconds per test iteration. So why bother with this kind of test? It lets us make sure there are no hidden gotchas somewhere in the pipeline, and we can then run more minimal tests at the longer and more costly level that involves BSP, BSS, and all the networking in between. It also means that we do not need to think of every case and write a specific test for that input. We let the computer do the drudge work and iterate over every possible input because that is what it is good at! We might not find very many new bugs with this method, but it is a new tool in our arsenal that allows us to state with confidence that a feature will not break in this way, and it frees up head space to concentrate on other types of testing.

The actual mechanics of the test are very simple: after I define what to feed in, I call insert_artist, which accepts either a first or a last name. The JSON that will be fed into the API has two key-value pairs that can be overwritten, either first or last name, and depending on which one you pass in, the Hypothesis-generated value is fed into that key-value pair. Failure in this case is anything other than a status code of 200. If we created a few helper functions for these types of tests we could cover the majority of our API calls in relatively few tests, each of which would differ very little. I don’t know if that is the most effective use of our time, but as we continue to add new API calls I think this type of testing might be a good sanity check before we release them to the wild. There are other libraries that combine Hypothesis and Swagger, like this one. The goal here should be that we don’t need to know about and exhaustively test every end state; we simply define the bounds and let the computer iterate through the somewhat limitless options.
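A sketch of what one of these tests might look like (the strategy bounds, the insert_artist signature, and the JSON keys are assumptions for illustration, not the exact code in test_property_based.py):

import requests
from hypothesis import given, settings, strategies as st

ENDPOINT = "http://127.0.0.1:5000/artists"

def insert_artist(first_name="Jack", last_name="Kirby"):
    # Default key-value pairs; either name can be overwritten by the test.
    payload = {"first_name": first_name, "last_name": last_name, "birth_year": 1917}
    return requests.post(ENDPOINT, json=payload)

@settings(max_examples=200, deadline=None)  # how many runs, and how long each may take
@given(test_first=st.text(min_size=1, max_size=50))  # the recipe for the generated input
def test_insert_hypothesis_first_name(test_first):
    response = insert_artist(first_name=test_first)
    # Failure is anything other than a 200 status code.
    assert response.status_code == 200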

Load Testing (Locust)

The goal of this library (and this type of test) is to put strain on your endpoints and see if they continue to return the right data. In this case the “right data” is mostly determined by a 200 status code. This test cannot easily be run by pytest, and the outcome is generally not an assert-style answer. We try different levels of stress and see if the code is capable of responding. Each iteration or change can then be re-tested to make sure nothing we changed has a negative performance impact.

Example command line to run the file:

locust -f .\toy_example\tests\load_testing_file.py

This will spawn a web interface at localhost:8089 that allows you to control what is happening as well as see metrics graphed (and outputted if you desire).

Web Interface:



This is the main screen. It is where you indicate how many threads you want active and how quickly you want them to spawn. It also allows you to pass in a host endpoint; I am not using that functionality since everything is hardcoded, but it exists.

Once you have created a new test run you will be presented with some output in a data table that looks like this:



Generally, the most important info comes from the outlier data, so the 99th percentile or the max, but it will depend on what you are looking for. We can also graph the outcome by clicking on the charts tab in the top left, which will output something that looks like this:

In terms of the code there are a few items to note:



class ArtistSpawn(HttpUser):

This is the class that the actual requesting code inherits from. Under the hood it is basically just the requests library with some layers on top to help with threading and usage within the framework. If we wanted to swap it out for gRPC we could do so very easily by inheriting from a different class of requester.

    wait_time = between(1, 3)

The wait time defines how much time there should be between calls on each of the threads. If we set it down to 0 then it will make requests as fast as possible. This is set to give 1 to 3 seconds between requests.

    def on_start(self):
        fake = faker.Faker()
        self.first_name = fake.first_name()
        self.last_name = fake.last_name()
        self.birth_year = random.randint(1850, 2022)

This is called once at the start of the run. The goal here is to generate fake data to pass into the insert call. We could hard-code it but this way we get a slight bit of variability so we can look at the database and see patterns if we want.

    @task(1)

This defines how frequently each method should be run, in this case with a weight of 1. When contrasted with @task(3), it will run once for every three times the other function runs, so the inserts should only happen on about 1 out of 4 requests on any given thread.

The two functions are a basic insert and a get call (returning all artists). We can define any behaviour we want so long as the base class we are inheriting from supports it. Then gevent, the coroutine-based concurrency library Locust is built on, will spam whatever we choose at our endpoint.
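Pulled together, the whole file might look roughly like this. The task method names, the JSON keys, and the hardcoded host are assumptions for illustration, not the exact contents of load_testing_file.py:

import random

import faker
from locust import HttpUser, between, task

class ArtistSpawn(HttpUser):
    host = "http://127.0.0.1:5000"  # base URL; could also be supplied via the web UI
    wait_time = between(1, 3)       # 1 to 3 seconds between requests on each thread

    def on_start(self):
        # Runs once per simulated user: generate slightly varied fake data for the inserts.
        fake = faker.Faker()
        self.first_name = fake.first_name()
        self.last_name = fake.last_name()
        self.birth_year = random.randint(1850, 2022)

    @task(1)
    def insert_artist(self):
        # POST a new artist roughly 1 out of every 4 requests.
        self.client.post("/artists", json={
            "first_name": self.first_name,
            "last_name": self.last_name,
            "birth_year": self.birth_year,
        })

    @task(3)
    def list_artists(self):
        # GET all artists for the remaining requests.
        self.client.get("/artists")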

Monday 23 May 2022

The Case for Regression

How to Prove Software is Ready?

Shipping software is hard, making sure that what is being sent to the end-user works as intended is extremely difficult. Do they use your product exactly as it was tested? Were the needs captured during the initial planning stages correctly? Are there use cases that were not originally obvious? Does the software behave the same under load? Does it work the same the 35th time as the 1st? 

One of the ways a QA (or tester) group can do this is by running exhaustive regression testing. It is possible to move through every state of some small projects, but that becomes impossible with anything beyond a trivial size. Generally we are constrained by time and the number of personnel available. We also cannot accept quality dropping. So where does that leave us?

We need to select a sub-set of all possible states and run through them in a coherent and communicable way. This is where test cases can come in.

Regression vs. Feature Work

In a regression mode of testing the only thing we are looking for is, did it change? We literally do not care if there are other bugs in the software. Is there a UI element that looks wonky? File it if you have time, but we don't care. The workflow doesn't make sense? Do not care right now, if it is the same as last time, not a regression bug. All of the feedback or subjective opinions should have been expressed when the feature was being worked on. The only question right now is does it work in the way that test case says it should? Ok, then it passes. Move on.

Outcome

So what does this result in? After having written and executed all of these tests, what is the hoped-for outcome? The goal of writing and running regression test cases is to show that the product is in working order and that end-users will not spot issues that block them from using the product. That is the whole goal; it should not take more than a few minutes to determine whether a test passed or failed. There should be no double or triple checking for minor glitches. It either did it or it did not. The only real question here is: can we ship?

Consistency is the Point!

Identifying and checking for a specific set of parameters means that you are narrowing the mental bandwidth of the testers so they only need to pick up a very few items. This comes with risks: if you pick the wrong set of parameters, it can lead to end-users being unhappy with the product (and maybe rejecting it altogether). So the goal should be to re-use these parameters from release to release, especially if the previous releases were successful. The template we should be using is whatever is currently in the Production environment and making the company money.

Ready to Ship

So now all the Regression test cases have been run, everything is green (or waived). The release now gets a stamp of approval and we can ship. All of the tests should be documented and updated to reflect the current state of the software at the time of shipping. That way they are all ready for next release.

Saturday 14 May 2022

Testing 1-2-3 is this thing on?

So now that we have established that this thing is working, what now?

My plan for this space is to outline practices I have tried, ideas that occur to me, reviews of books or articles, and generally explore the world of software testing. So who am I and why should you care?

I started in QA about 2007, since then I have done a wide variety of types of testing. Recently centering around backend, API, and a little bit of frontend UI on desktop. In particular, I have spent a lot of time working on and building within Python frameworks for doing automated testing. In no way am I a language purist, if there is a better or easier way to do a task I am all for jumping onboard. Having said that, I generally find Python pretty flexible and able to achieve a large number of my goals.

In terms of my educational background, I did not do a computer science or even a very technical program at school. Instead, most of what I have learned has been through reading, trying stuff out, and failing... a lot. Patience and a willingness to take on a task without a clear idea of how to complete it will take you far, in my opinion.

So let's begin...

