April 20, 2017

Performance Testing 101 - 5 min intro & example

When developing and deploying web services, apps or sites the following questions come up: "How will it perform?", "How many concurrent users will it support?", "If I tweak this setting, will it be faster?", "Do these new features effect performance?". The list could go on and on and on. Performance questions are common, solid answers are not.

Performance testing can take many different shapes, from dead-simple one-liners to complex setups, tests, tear-downs and analysis. While this article focuses on quick, easy and straightforward testing, future articles will address more advanced topics.

There are some great easy tools to get first ball-park answers to performance questions getting at the number of concurrent users as well as how the response time changes as load increases. Here I'll give a short intro to ApacheBench.

Let us begin with setting up some basic terminology, first let's refer to our machine under test as the host, this can be any kind of http-accessible server you have. Second we will want an agent machine to drive our tests from.
When performance testing, it is key to limit the number of possible variables which could distort our results. Ideally your agent is a separate dedicated machine and as close as possible (network distance wise) to you host system in order to minimize the amount of networking you test. This is especially relevant when you are testing applications hosted in a shared environment (ie cloud). The performance impact of noisy neighbors can be surprising, but that is a topic we will explore in detail in the future.

ApacheBench

ApacheBench is a command line tool (ab) which allows for simple load driving against HTTP hosts. It's great at producing large numbers of REST requests, capable of producing thousands of requests per second. Generally I find ApacheBench most useful for getting a rough idea of how many requests an application can handle. It's extremely simple to use and therefore a great tool while debugging configurations.

To install on Debian/Ubuntu:

sudo apt-get install apache2-utils

To Install on RHEL/Centos/Fedora

sudo yum install httpd-tools

Usage:

Usage: ab [options] [http[s]://]hostname[:port]/path
Options are:
    -n requests     Number of requests to perform
    -c concurrency  Number of multiple requests to make at a time
    -t timelimit    Seconds to max. to spend on benchmarking
                    This implies -n 50000
    -s timeout      Seconds to max. wait for each response
                    Default is 30 seconds
    -b windowsize   Size of TCP send/receive buffer, in bytes
    -B address      Address to bind to when making outgoing connections
    -p postfile     File containing data to POST. Remember also to set -T
    -u putfile      File containing data to PUT. Remember also to set -T
    -T content-type Content-type header to use for POST/PUT data, eg.
                    'application/x-www-form-urlencoded'
                    Default is 'text/plain'
    -v verbosity    How much troubleshooting info to print
    -w              Print out results in HTML tables
    -i              Use HEAD instead of GET
    -x attributes   String to insert as table attributes
    -y attributes   String to insert as tr attributes
    -z attributes   String to insert as td or th attributes
    -C attribute    Add cookie, eg. 'Apache=1234'. (repeatable)
    -H attribute    Add Arbitrary header line, eg. 'Accept-Encoding: gzip'
                    Inserted after all normal header lines. (repeatable)
    -A attribute    Add Basic WWW Authentication, the attributes
                    are a colon separated username and password.
    -P attribute    Add Basic Proxy Authentication, the attributes
                    are a colon separated username and password.
    -X proxy:port   Proxyserver and port number to use
    -V              Print version number and exit
    -k              Use HTTP KeepAlive feature
    -d              Do not show percentiles served table.
    -S              Do not show confidence estimators and warnings.
    -q              Do not show progress when doing more than 150 requests
    -l              Accept variable document length (use this for dynamic pages)
    -g filename     Output collected data to gnuplot format file.
    -e filename     Output CSV file with percentages served
    -r              Don't exit on socket receive errors.
    -m method       Method name
    -h              Display usage information (this message)
    -Z ciphersuite  Specify SSL/TLS cipher suite (See openssl ciphers)
    -f protocol     Specify SSL/TLS protocol
                    (TLS1, TLS1.1, TLS1.2 or ALL)

Example usage:

 ab -c 1 -n 1000 http://example.com/

Putting ApacheBench to use:

The above section should be plenty to get you started, but lets look at a quick example of testing caching. Below I've setup a simple flask server example, which on request calculated a random Fibonacci number between 1 and 30.

#!/usr/bin/env python
#
# To start this server, you must have python and flask installed
# Start server: python testserver-fib.py
#
# To install flask use the pip line below:
# pip install Flask
# or visit: http://flask.pocoo.org/docs/0.12/installation/

from flask import Flask
import random
app = Flask(__name__)

# snagged from: http://stackoverflow.com/a/499245
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)

@app.route('/')
def hello_world():
    r = random.randint(1,30)
    fib = F(r)
    # ApacheBench expects constant output
    return 'fib({0:02}):{0:06}'.format(r,fib)

if __name__ == "__main__":
    app.run(debug=True)

Now let's see how we do performance wise. We set the concurrency to 1 using -c 1 and specify the number of requests to 500 by setting -n 500. Note that we are using the simple flask dev-server, which is single threaded.

m@test:~$ ab -c 1 -n 500 http://127.0.0.1:5000/
-- snip --
Concurrency Level:      1
Time taken for tests:   17.821 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      85000 bytes
HTML transferred:       7000 bytes
Requests per second:    28.06 [#/sec] (mean)
Time per request:       35.642 [ms] (mean)
Time per request:       35.642 [ms] (mean, across all concurrent requests)
Transfer rate:          4.66 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     1   35  88.2      1     471
Waiting:        0   35  88.2      1     471
Total:          1   36  88.2      1     471

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      5
  75%     16
  80%     27
  90%    107
  95%    277
  98%    447
  99%    461
 100%    471 (longest request)

In the above example, we see the average request time was 34ms, the median was 1ms and we had 28rps (requests per second). What happens if instead of a single connection we have 10 concurrent connections (setting -c 10)?

m@test:~$ ab -c 1 -n 500 http://127.0.0.1:5000/
-- snip --
Concurrency Level:      10
Time taken for tests:   18.579 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      85000 bytes
HTML transferred:       7000 bytes
Requests per second:    26.91 [#/sec] (mean)
Time per request:       371.583 [ms] (mean)
Time per request:       37.158 [ms] (mean, across all concurrent requests)
Transfer rate:          4.47 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     3  360 294.4    294    1470
Waiting:        2  360 294.4    294    1470
Total:          3  360 294.4    294    1470

Percentage of the requests served within a certain time (ms)
  50%    294
  66%    443
  75%    529
  80%    579
  90%    746
  95%   1018
  98%   1136
  99%   1160
 100%   1470 (longest request)

While the RPS remains very similar to before at ~27rps, our response times have gone through the roof (mean of 371ms and median of 294ms)! Here we have a situation, where multiple parallel connections get serialized and processed one at a time, while the overall rate remains unchanged, the quality of service delivered to each client degrades by a factor roughly similar to the number of concurrent connections.

Let's see if we can do better. Since we repeatedly calculate the same 30 Fibonacci numbers, let's add some caching into the mix. Generally, if you have long-running requests that will always return an unchanging value, it is a good idea to cache these. With the caching in place, the first few requests will still have the same slow response time, but all of the following requests will benefit from the cache and therefore be as fast as our cache lookup. See the modified code below:

#!/usr/bin/env python
#
# * To start this server, you must have python and flask installed
# * Copy this into a file named testserver-fib-cached.py
# * Start server: python testserver-fib-cached.py
#
# * To install flask use the pip line below:
#      `pip install Flask`
#   or visit: http://flask.pocoo.org/docs/0.12/installation/
from flask import Flask
import random
app = Flask(__name__)

cache = {}

# snagged from: http://stackoverflow.com/a/499245
def F(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return F(n-1)+F(n-2)

@app.route('/')
def hello_world():
    r = random.randint(1,30)
    if r in cache:
        print('hit')
        # ApacheBench expects constant output
        return 'Cache Hit!  fib({0:02}):{0:06}'.format(r,cache[r])
    else:
        fib = F(r)
        cache[r] = fib
        print('miss')
        # ApacheBench expects constant output
        return 'Cache Miss! fib({0:02}):{0:06}'.format(r,fib)

if __name__ == "__main__":
    app.run(debug=True)

Now let's run our single connection, 500 request benchmark again:

-- snip --
Concurrency Level:      1
Time taken for tests:   1.680 seconds
Complete requests:      500
Failed requests:        0
Total transferred:      91000 bytes
HTML transferred:       13000 bytes
Requests per second:    297.55 [#/sec] (mean)
Time per request:       3.361 [ms] (mean)
Time per request:       3.361 [ms] (mean, across all concurrent requests)
Transfer rate:          52.89 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     1    3  27.7      1     497
Waiting:        0    3  27.7      1     497
Total:          1    3  27.7      1     497

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      7
  99%     85
 100%    497 (longest request)

The results are quite impressive: the mean is down to 3.3ms, the median down to 1ms and the request rate is at 297rps! That is 10x faster. Once the cache is initialized and our benchmark no longer includes the cache seeding time, we get even higher performance, which at this point is likely limited only by the cache-lookups. My local testing gets me up to 1100rps with median and mean both less than1ms. While this is a simple example for demonstration, it is important to note that part of what we are seeing is a misleading flaw in how most load driving tools generate requests and record latencies, this is known as the coordinated-omission problem, but that is a topic for another day.

This concludes our short introduction into performance testing, but soon to follow will be more articles addressing more complex setups, benchmarking methods and types, metrics to be evaluating and considerations for repeatability.