Coverage Guided, Property Based Testing


[MUSIC]>>Really happy to welcome Mike Hicks back to MSR
to give us a talk here. I’m especially happy because Mike is my PhD advisor and it’s great
to have him here again. I think you all know Mike very well.>>I said I just never
go. I’m always there.>>So all of you probably
know Mike pretty well. So I was thinking about
like a couple of things I could tell you about Mike
that you maybe don’t know. So one, he’s a pretty badass prog-rock drummer.>>Badass.>>You see him like tapping really fast when you’re talking
to him in meetings like whacking out some crazy rhythm
on the side. That’s cool. But also maybe more technically, I think Mike was one of the original designers of the Cyclone programming language, in particular its linear typing discipline among other things, and that was about 20 years ago by now.>>Fifteen, yeah.>>2000-ish.>>2003, yeah.>>You’re getting there. I think things like Cyclone back then were way ahead of their time, and now these ideas are seeing major adoption in things like Rust and so on: the basic mechanisms for exercising control in low-level programming languages, ideas that Mike came up with, along with several others, a long time ago. So it’s great to have you here.>>Thanks.>>You’re here to talk about testing.>>Right. Here for the DeepSpec
Workshop at PLDI. Thanks, Nick. That was probably the best
introduction I’ve ever gotten. High bar for the next person. So this paper is with
these wonderful people. So Leo is a postdoc at Maryland now, Benjamin is at UPenn, and Leo was Benjamin’s student and now Leo’s my postdoc. So I’m going to talk about how we can do the Reese’s peanut butter
cup thing with fuzz testing, coverage guided fuzz
testing in the style of AFL and property testing in
the style of QuickCheck. So that’s the question.
Is it a good idea to combine these two things
and the answer is, of course, yes or I
wouldn’t be standing here. So let’s talk about what they are. So fuzz testing is extremely
popular these days. It’s a form of random testing
that’s surprisingly dumb. It’s slightly smarter today. The most dumb incarnation of it
came about in the early ’90s. Bart Miller is credited
with starting it and basically what they found
is if you send lots of random inputs to Unix utilities, they would crash and
those crashes would reveal important bugs that they had. Things have gotten a
little more sophisticated. Actually, it’s a little sad. You can still do the
really dumb thing and find bugs in programs today. But modern fuzzers
like AFL or libFuzzer are a little bit smarter in
the bit-level mutations and such that they do to generate
these particular inputs. That is they look at what’s
called the coverage. So looking at the layout of choices
you can make in your fuzzing, you could do the Bart Miller thing
that’s completely random here. No user effort. But maybe in
the broad scheme of things, not as good at finding bugs. Then you could also do
something like here in the upper right, as done by many tools these days, where you have some custom input generator that tries to be smarter about the inputs that it sends. So instead of using
completely random garbage that you send to a program
maybe you have a grammar, so that you know that
it accepts XML files. So I should send it only XML inputs or it accepts JSON or
something like that. So I’m going to get better
results rather than having my inputs rejected all the time over and
over and over again. If I adhere to this grammar, then I’ll do a little bit better. Now the most popular version
of fuzzing today, I believe, is this style that’s exemplified by AFL which
is no user effort at all. Same as this except that there’s a little bit more
smarts involved in what the fuzzer is doing and that’s watching how the program
executes to keep track of what branches and edges and control-flow box and
such that it reaches and using that information
to be a little smarter about deciding what the
next input should be. You can also imagine enhanced fuzzing that uses
more sophisticated techniques. So for example as Patrice Godefroid’s pioneered using symbolic execution
in combination with fuzzing. Exemplified by the Driller paper which was a few years
back where you do random fuzzing for a while and if you notice that you’re starting
to get a little bit stuck, you’re not going to new control
flow paths or what not, you can use symbolic
execution to try to find a new control flow
path and then revert back to the more dumb fuzzing. That also seems to work
well and it also has this nice feature that it requires no additional
user effort to employ it. So the kind of fuzzing
I’m interested in, at least for step one for the
combination with QuickCheck, is this AFL style. So let’s look more carefully
about how it works. So here’s the program
that I’m testing. What I’m going to do is I’m going
to have an initial seed file. So in the limit, you can imagine that the seed file is empty.
It’s an empty file. So maybe this program is Acrobat Reader and it’s expecting PDF inputs but
I won’t give it a PDF. I’ll just give it an empty file and then Acrobat Reader will reject that file and as a result that
information, the rejection, will go back to the fuzzer and the fuzzer will then try to
figure out okay how can I learn from that rejection
to try to come up with a better input that gets me closer
and closer to finding a bug. What it tries to do is it
looks at the input and it says did I see any new paths
that I haven’t seen before, any new control-flow edges, any new basic blocks and
if so I’m going to retain that input file as the
basis of a mutation and those mutations are going
to be things like, well, let’s double the contents of the file and then XOR half the file with
the other part of the file. Let’s lop off parts of the file. Let’s flip some bits. Just do very weird interesting things and then we’ll use that
as the new seed and then feed it back in
again and then, what? See how many new paths were created? Were any new paths created? If the answer is no, no
new paths are created, we’ll just throw away
that new mutated seed and we’ll stick with the
one that we already had. There’s a scheduling algorithm that
decides how long to stick with one particular seed and how often to try to generate something
that’s completely new. So it goes around for a long, long time so a fuzzing campaign will be at least a day
if not weeks or longer. Here, at Microsoft, the fuzzing is used regularly for
Microsoft products and basically the fuzzers are
running all the time trying to find wacky inputs to send to a
PowerPoint to get it to crash. This is animating, sorry. What’s QuickCheck?
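The coverage-guided loop just described, retain a mutated input as a new seed only when it exercises coverage you have not seen before, can be sketched as a toy. This is only an illustration of the idea, not AFL itself: the target program and its “coverage” signal here are stand-ins.

```python
import random

random.seed(0)  # fixed seed so the toy run is reproducible

def target(data: bytes) -> set:
    # Stand-in for an instrumented program under test: it returns the set
    # of "branches" this input exercised (the coverage signal).
    covered = {"entry"}
    for threshold in (1, 2, 4, 8):
        if len(data) >= threshold:
            covered.add(f"len>={threshold}")
    return covered

def mutate(seed: bytes) -> bytes:
    # Dumb bit-level mutations in the AFL spirit: flip a bit, double the
    # contents, or lop off part of the file.
    data = bytearray(seed) or bytearray(b"\x00")
    choice = random.randrange(3)
    if choice == 0:
        i = random.randrange(len(data))
        data[i] ^= 1 << random.randrange(8)    # flip one bit
    elif choice == 1:
        data.extend(data)                      # double the file
    else:
        data = data[: max(1, len(data) // 2)]  # keep only the first half
    return bytes(data)

# The loop: mutate a retained seed, run it, and keep the mutant as a new
# seed only if it reached something new; otherwise throw it away.
seen, pool = set(), [b""]                      # start from an empty seed file
for _ in range(5000):
    candidate = mutate(random.choice(pool))
    covered = target(candidate)
    if not covered <= seen:                    # any new path?
        seen |= covered
        pool.append(candidate)

print(sorted(seen))
```

Real fuzzers add a scheduling policy on top of this (how long to stick with a seed, when to start fresh), but the keep-if-new-coverage feedback is the core.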
What’s property testing? So property testing
was popularized by these two gentlemen from
Chalmers for Haskell, and the idea is that
instead of writing, if we step back and we think about the tests that I write to test my program, oftentimes they’re not in and of themselves interesting; rather, they’re instances of some high-level idea
we have in our heads. So if we write a sorting function, what we might do is
you might say, well, let’s look at one boundary
case, the empty list. Let’s look at a size one list, let’s look at a two element
list where I already have it sorted or I really
don’t have it sorted, and we think through the
way that we write tests to try to cover all of
these different cases. So what the property testing
tool will do is you can specify these high-level
properties logically, and then you can ask the
property testing tool to synthesize inputs that satisfy that to see whether or not
that property is satisfied. So that way, you’re raising the level of abstraction of your testing. You’re still considering
correctness as you’re writing these properties
out but you’re not responsible for generating the instances of the tests
that push those properties. So it’s an attractive midpoint between unit tests and
formal verification. You can specify the
high-level things, but they’re going to be tested
rather than proved correct. So for our study, we were interested in a version
of property testing called QuickChick that’s for the
Coq development environment. When I heard about this, I thought this is the most brilliant
thing ever, which is: what do I do when I program in Coq or F*
or any of these environments? My goal is to prove stuff correct. The problem is, oftentimes when you write all
these definitions down, they’re not correct, they’re wrong. They don’t satisfy that property, and you only realize
this when you’ve put in a huge amount of
time trying to figure out whether this is true
or not and then you realize you can’t prove it and you have to go back to the drawing board. Maybe that effort is reusable, maybe you can get back to
where you were quickly but it also might be entirely frustrating and you have to throw
away a lot of work. So the idea of QuickChick is the
properties you want to prove, just state them and then
attempt to falsify them by generating random instances that
show that that property is wrong. If you can’t falsify it, then
that’s a good candidate for something that you actually
want to prove is correct. So all of this is
integrated into Coq. This picture here shows that it’s a module in the so-called
Software Foundations framework. So you can go online to
Software Foundations and go through this module, and
learn how to use QuickChick. So this is an example of a
property that you might attempt to prove about your sorting function. I have this sorting
function, I hope it works. What I want to show is that when I take a particular input and pass it to my sorting function, the result is sorted: is “sorted” true? So the way that QuickChick will
work is it will have a generator for inputs that you’re
universally quantifying over in your property. So in that particular case,
we say “for all lists”. Is it the case that sorting
that list yields a sorted list? What will happen is the generator
will randomly generate lists, the universally quantified thing, and then it will test that property: is it the case that when I sort it, sorted holds? If the answer is yes, it’ll go right back around and generate a new random input. It’s not doing mutation,
it’s generating inputs from scratch using so-called
generator functions. If it turns out at some point, it generates an input that
when you sort that input, it’s not actually sorted,
then it will say, whoops, there’s a bug in your sorting routine and it reports that bug to the user. Here again. So that’s great. People love property
testing, QuickCheck. It was the Haskell version. It took advantage of Haskell
type classes to make generators easy to use
and easy to write. But now QuickCheck style property
testing is all over the place. They have them for Java
and Scala and OCaml, and pretty much any language that you can think
of. So that’s great. Where does this not work so well? Well, it doesn’t work
so well when you have what we call a sparse property. So if your property has a
precondition that has to be satisfied before I’m going to consider whether the thing I’m trying to prove works, for example my insertion
routine there. Then what will happen
is QuickCheck will generate many possible random inputs, see whether they satisfy the
precondition, and if they do, then it will see whether or not the postcondition is satisfied. The problem is that
if it’s very hard to generate an input that
satisfies the pre-condition, you’ll just waste a
huge amount of time generating tons and tons
of inputs that never get around to doing the part
that you care about which is to see whether your insert
routine actually works. So what’s the solution to that? Well, the solution
to that normally is, you have to write a smart
generator that does a better job of satisfying
that precondition in advance. You have to do something that’s
specific to that property. So Naive Random Testing is
here, basically no user effort. You can use the built-in
type classes and whatnot for the generating
of your inputs, and it’s going to do a
fine job of finding bugs. But if I really want to do a
great job for sparse properties, almost certainly I’m going to have
to do a handwritten generator, and actually it turns
out that writing handwritten generators is not
obvious how to do correctly. There are lots of pitfalls
in getting that right. So hey, that looks like the
chart that we had for fuzzing. Is there a way that we can, up in the upper right
there for fuzzing, we had these so-called grammar-based fuzzers where the inputs that we generated, say for our PDF reader, lots
of the time the PDF reader is just going to throw
away those inputs because they’re total garbage. It notices that it’s not
really a PDF file and it throws it away and so we
never get into the guts of the actual Acrobat Reader program to see whether it’s rendering
PDF files correctly. That’s the same problem that
we’re dealing with here. The way that they solved it in
the fuzz testing world was, “Hey, maybe we can add a little
bit of smarts in generating inputs based on our observation
of the program execution.” I wonder, could that
same idea work to solve this sparse-precondition problem for property testing? So that’s the key insight here. Can we use the same tricks
that fuzz testing uses for fuzzing normal executables to property testing
sparse properties? Our idea is that instead of using bit-level mutations and
one-size-fits-all stuff, what we’re going to do is we’re
going to use synthesized mutations that are specific to datatypes
inside of your property tester. So we’ll get two examples
of this in a minute. But if you think for example of our sorting routine, we have lists. A mutation for a list is take that list and concatenate
it with itself. Get rid of the first
element of the list, get rid of the last element, change one of the elements of the list. I can mutate from a
particular input that I started with that seemed
to gain some coverage, I liked that input and so I’ll
mutate from it rather than the way property testing works
now which is just generate lists from scratch
every time you run. So here’s the complicated
merged picture between fuzzing and property testing as exemplified in QuickCheck
and we call it FuzzChick. So how are things different? So first of all, we’re going
to start off the process by using the generator. So remember in fuzz testing what happened was, some person picked a random seed to start the whole process off with, and then all of the mutations started to happen. Maybe the easiest seed to pick is the empty file. For FuzzChick, what we can
do is just run the generator to pick an initial random input
to start the process off, and then that goes into our system under test and our
particular property. Now, if there was a bug, we
report it just as normal. Otherwise, we check, did we see any new paths using the coverage
information that we gathered? If we did see some new paths, this is a good seed to use
as the basis of mutation, and we’ll throw it
into our seed pool. Now we can use our so-called semantic mutator to look at that seed and add the list to itself, remove an element, change an element, and so on, and then we can go back around the loop once again to
keep on trying this. There are some interesting
options that we have here that you don’t have in regular
fuzzing, which are that, if at some point the semantic mutator gets stuck
and it’s not making any progress, you can always just fall back
to the regular generator. So in some sense, I think
of this as like with classical SAT solvers when you do restarts or resets, you’re working your
way down a tree and you’re kind of stuck
in a local minimum, you can always just start over. So that’s not something that normal fuzz testers normally have available to them, because they have no way to generate a random input that’s sensible. So let’s look at the
semantic mutator. I’ve already been waving my hands, so now I’m going to
wave them but I’m going to show you with pictures instead. So here’s like a
tree-based data structure or graph-based data structure. I want to consider all of the ways that I can mutate
this data structure. So what might it look like? Well, one thing I
could do is I can just modify each of the elements
at the particular nodes. So over there I have zero
instead of one, for the root, I have a five rather than
a four at this location, so that’s one kind of mutation. Another kind of mutation is I
could ditch one of the nodes. So I can kill this one,
I could kill this one, I could split it, pulling out an individual node and just using it on its own. Another thing I could do is I
could synthesize a new node, and I could add this tree as
a subtree to that new node, or I could break up parts of the tree and synthesize a new
node to add those two parts too. So we see that all of
these examples over here. So we have a paper
that’s on my web page that goes through formally how do you come up
with these mutations. So basically, the way
to look at it is, if you have a datatype definition like you might have in Coq, then for
all the possible variants, there’s a way that
you can work through. There’s a definition
in the paper of all of the single-step mutations for that datatype, and our mutator just randomly
picks one and uses that.>>I have a question.>>Yeah.>>Is that sense of mutations only for zero-order types, like, no functions inside?>>That’s correct. Yeah, it’s datatypes, no functions inside.>>Okay.>>Yeah. Well, that’s
a good question. We certainly have not tested with
things with functions inside. I wonder if what we would do
is just treat the function as a black box, abstract, and then
just shuffle it around. But we have not tried that. So that’s the idea. Let’s see how well it works. Well, so not quite a year ago, I was a co-author on a paper that criticized fuzz testing evaluation, so I heeded my own advice. So we were curious about how to
evaluate this in a systematic way. So what we did was, we took a system that
we knew was correct, that we had developed in Coq, we had already proved it correct, and then we tried
to systematically mutate elements of the
system to introduce bugs and then see how effective ours and competing strategies
were at finding those bugs. The system that we proved correct was, well, there were two actually: abstract machines whose goal is
to enforce non-interference. So those of you who might remember
the DARPA CRASH SAFE program. The goal of this
program was to develop secure processor where a feature of the processor was to
have support that would allow you to prove
non-interference style properties, security properties about code
that would run on those machines. So what Benjamin and Leo had been doing over the
last several years is designing these machines and
then coding them up in Coq, and proving that
indeed they do satisfy the noninterference property
that you want them to satisfy. So we have these two machines
at our disposal that we could use as the basis of an evaluation. One of the machines was a very simple stack-based
machine and the other was a more sophisticated
register-based machine that looks much closer to real hardware, and we evaluated on
both of these machines. So that’s the manual creation of these buggy variants. I guess the way that we tried to be systematic about it is that, essentially, in the machines, there are a bunch of cases when you evaluate an instruction that have side conditions that are there for enforcing security, and those side conditions you
could put in a big table. Basically, what we did was we mutated those side conditions to introduce bugs and then we can
see whether or not the fuzz testers would find the bugs. So we compared our stuff
to the purely random case, that is the normal QuickChick, and we also used
handcrafted generators. One of these generators
took an entire year to write and was the subject
of its own research paper. So that’s what we assume
is the best you can do. This is the status quo, and then this is what we’re trying to improve on. So this is showing you the
more sophisticated machine, just to give you an idea
of what it looks like. So it’s going to have a register file and it’s going to have a heap. We’re going to be able to
annotate data inside of our abstract machine with
whether it’s public, and so the adversary is allowed to see it versus whether it’s secret, and so therefore we’re trying to keep that information from the observer. The way that we can state
that property is using this ubiquitous
non-interference property, and the idea is to say that
any values that are labeled as secret as far as the adversary can
tell from observing execution, they should not be able to tell the difference
between whether a secret, for example, is actually one
or whether it’s actually zero. So if I have two states
that are low equivalent, that’s what the little
tilde is saying, then those states agree
on the public stuff, because that’s the stuff
the adversary can see, but they may disagree on the secret stuff because the
adversary can’t see that anyway. What we want to show is we want
to show that low equivalence is preserved by execution. Therefore, the adversary is basically never able
to tell the difference between system that has one is it secret and a state that
has zero as a secret. That property is written like this. It’s saying that if I have two machines and those machines
are indistinguishable, so they agree on everything
other than the secrets, then if I take a step
in those two machines, so they evaluate an instruction, then the result is also
indistinguishable. So then this is a very strong
non-interference property because it’s considering
each individual step of that abstract machine. Then we’re going to compare
these three things; we’re going to try to fuzz
this property, that is, to run a bunch of tests after
we seed with the bugs, and we run this particular property, will we find the bugs? Yeah.>>So I have a question about this, it seems like this
particular definition, I mean it’s as
super-strong definition, but it also seems maybe
super helpful for fuzzing.>>Oh, really?>>Well, just in the sense
that you’re going step-by-step rather than doing end-to-end
noninterference, because you could have
a gigantic state-space that you have to explore to be able to put it off at the
end whereas here you’re just focused on one
small step at a time. I’m just asking whether
this is a standard approach or if this was maybe biased
towards making it easier to prove?>>That’s a good question. So I think that this property
is the property you want. Well, this is not a
nonstandard property, I guess in the PL literature, it might be that that’s
the reason they chose it. I’m struggling to come up
with a good explanation for why it wouldn’t be okay
to do the other thing. I think it would take longer, because then step would
take a lot longer to run. I guess the question is, if you had that longer multi-step
big step function, would you find more bugs somehow? I don’t know why you would. I don’t know, I have
to think about it. It’s an interesting question.>>So the main challenge in trying to fuzz
the problem like this is to find better inputs that
validate the precondition?>>This is true, I’ve led you down the primrose path so that you’re
thinking that way, well done.>>Right.>>So the problem is if I can’t satisfy those.>>Yeah.>>So let’s imagine that I can satisfy them, let’s say that that part is trivial.>>Yeah.>>The fact that step just
takes a single step means that running a single test
is going to be very fast.>>Yeah.>>Now, let’s imagine that
step takes two minutes, I’m not going to be able
to run that many tests. So even if I’m really fast at
getting past that precondition, fuzzing might be a terrible
way of finding bugs.>>Yeah.>>So then the question is, so I’m looking ahead to the
next question which is, well, are we losing anything by
doing the single-step property? I think that single-step property is a pretty standard way of defining an operational semantics
for these machines. So I don’t think that this
is cheating in any way, but I hadn’t thought about it. Okay. Those are the results. So these are the 33 variants of
the machines that we fuzzed. Sorry, gosh. The target machine
where we added these mutations, each one of these things
is a distinct change to the machine that introduces
a bug, it’s a distinct bug. So the Y axis is the time spent, the mean time to failure
for finding the bug. So what we did basically is we ran
each. I didn’t write this down. This is the average of 30
runs for some period of time, I think somewhere around 10
hours or something like that. I guess it’s not 10 hours, it’s
10,000 seconds at the very top, and we did the average of 30 runs, and we did mean time to failure, the average mean time to failure. So it’s an average of an average. Mean time to failure is
just what the fuzzing is doing. It’s generating a random input; if that didn’t work, it generates another random input, over and over again, until one fails and I’ve found the bug.
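As an aside on the metric, the mean time to failure for a bug can be computed from per-run failure times. A minimal sketch, with made-up numbers, using one simple convention for timed-out runs (counting them at the timeout bound; this is an illustration, not necessarily the paper’s exact accounting):

```python
import statistics

def mean_time_to_failure(times, timeout):
    # Average time to find the bug across runs; a run that timed out
    # without finding it (None) is counted at the timeout bound.
    return statistics.mean(t if t is not None else timeout for t in times)

# Hypothetical failure times (seconds) for one bug across five runs;
# None marks a run that never found the bug before the timeout.
runs = [12.0, 48.5, 7.2, None, 31.1]
print(mean_time_to_failure(runs, timeout=10_000))
```

Averaging this quantity over the 30 runs per bug is what gives the “average of an average” mentioned above.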
How long did that take? When I use the
handwritten generators, the ones that took a
whole year to write and got an ICFP paper out of it, you can see the blue lines very fast. This is a log scale on this side. It’s taking less than a second
to discover these bugs. On the other hand, the naive case, that’s the green bars, basically it’s never
finding the bugs before timing out in every
case but two cases. Whereas the coverage guided case, the performance of it is crappy
compared to the handwritten case, but actually it’s perfectly
reasonable in terms of, do I actually want to
be able to find bugs? It’s somewhere between
10 and 100 seconds in most cases and then it is
taking longer in some cases, there in the upper right, and it is the case that some bugs are
never actually found. So we think that’s good. If you look at the difference between the bug finding ability
and how long it takes, and also in terms of the
usability of the system, it’s no less usable than the
completely random testing. I don’t have to write
any generators by hand, I do all the same stuff that
was there before and yet, I find the bugs a lot faster. Okay. So somewhere on one of the
slides which I didn’t highlight, I said that the generator
was almost automatic. So actually, it isn’t
zero user effort. For this particular property, there was one key insight to
making this work which is that generating two
indistinguishable states, what’s the best way to do that? The best way to do that is well, there’s actually several
ways that you could do that. In this original generator here, how do I start the process off by generating two
indistinguishable states? So I can only start mutating once I have a seed that will possibly work. What’s going to get me
to that seed better? So I could do two things. I could generate two random states and I could ask
whether they’re equal, or I could generate one random
state, make a copy of it, and mutate the other one, and then ask whether
or not the two things are equal or indistinguishable. This is way better. This way, you’re not going to waste all kinds of time generating two random states and not finding the right thing, and this was an insight that was in the handwritten generators paper. So basically for the FuzzChick case, what we do is write a handwritten generator
that just works like this. That just says take one state, and then make a copy
of it, and that’s it. Take those two states,
they’re literally the same thing and then
mutate that instead, that’s what starts off the process, and then FuzzChick
takes it from there. So if you wanted to have
the completely naive case, FuzzChick would also not do as well. But I should say in
the other direction, even if you used this particular case which we did do for the
purely random case, you’re still screwed because
there is no notion of mutation in the purely
random testing case. So you can’t use this approach, you’re forced to use this approach. Okay. So I want to point out
that there’s another way that you could combine coverage guided fuzzing
and property testing and that’s exemplified by some
work that’s called Crowbar. Actually, there’s another paper that just came out at ISSTA, which I’ll talk about in a second, called Zest, that does something similar. The idea here is that instead
of doing semantic mutation, what you do is you focus your
mutation on the source of randomness into your random
generator in the first place. So here, I have a normal
property tester and I have my generator and my
system under test and so on, and that’s the same thing as before. Normally, what happens
is the random API that you use for normal random
testing is actually random. You, like, sample from /dev/random or something like that, so you get some random
bits every time that are the basis of you
generating a random input. However, you don’t
actually have to do that, you could actually pull
away the real random, the backend of the
random API and instead, you could put a file there with
a whole bunch of random bits, and then you could mutate
the file so that now, your source of randomness
is this file, and the bits that you draw coin flips from continuously change. So that’s what’s happening in Crowbar. It’s looking to see whether it’s gotten new paths, and then the seed is
this file of randomness, and it will mutate that seed, feed it back into the random API
and then just continue going. So this is a way that you can much more easily retrofit coverage guided fuzzing into an
existing property tester. You don’t have to implement this semantic mutation idea
that I was showing you before, where I look at actual trees
and I mutate the trees. Instead, I’m just doing bit flips and everything else that
AFL normally does, but now on this random file. Okay. So in our paper,
I’m not showing it here, this doesn’t work as well.
I think I have a slide. Right. I guess this
doesn’t work as well. Why doesn’t it work as well? I mean, I think,
maybe you can imagine the inside of why it
might not work as well. The problem is that when you’re
doing semantic mutation, you have some sense of the structure
of the data you’re mutating, and you can mutate it in a sort
of semantically meaningful way. I can take one tree and
turn it into two trees, and why might that matter? Well, probably the one tree was instrumental in causing a
certain amount of coverage. So a tree-level mutation is going to keep the coverage that
I had before in some sense, but then make some variation to it. Whereas if I mutate the
source of randomness, I could get a completely
different data structure the next time I run. So it’s unclear that you’re retaining as much prior information as you
might like to when you mutate the source of randomness because
you’re just going to end up generating potentially
very different inputs. So there’s actually a paper that just appeared, this Zest paper, that uses the Crowbar approach, and they find that it does work better than purely random property testing. When we compare Crowbar against our system in
this work and FuzzChick, we find that it’s not very
much better, at least for these two benchmark cases,
than random testing. Our hypothesis is the
reason I just gave you. Well, another thing that’s nice
about Zest in addition to it using the mutation of the
randomness approach is that it can tell
the difference between discarded and non discarded runs. So normal fuzzers, you
can’t tell the difference between when Acrobat Reader rejects your input as being ill-formed, when it is unhappy with your perfectly well-formed input for some other reason, or when it actually succeeds
those two out of the box. AFL basically says, if it
terminates normally, that’s fine. If it doesn’t terminate
normally and it’s a crash, that’s cool, that’s a bug. But for all the terminate-normally cases, whether it’s for rejecting the input or properly processing the input, it can’t tell the difference. So therefore, there’s no notion of a discarded run, that is, a run that didn’t satisfy the precondition. So what Zest does is instrument the target
programs to know once you’ve gotten past the semantic
checking phase and you’ve noticed that the input
is well-formed, awesome. Now if that program terminates, I know it’s sort of
an acceptable run as opposed to a discarded run. What they find is that knowing the difference between
discarded runs and acceptable runs does make a huge difference in terms
of overall performance. So our approach does this as well. Okay. So here we are: here we had naive random testing, there we have the
handwritten generator and here we have FuzzChick
which is way better, but not exactly the same: here, at least in our particular case, there was a tiny bit of writing that you had to do for creating your generator, and it is important for performance to be able to do that sometimes. So of course now that
we’ve done step one, you can immediately
imagine: let’s bring in all of the advances from fuzzing. Let’s look at symbolic
execution-based approaches, let’s look at ones that involve a grammar; all of these advances that improve on AFL, we could potentially bring
into property testing as well. So who knows what
that might look like. That’ll be something
that we work on next. We’re also thinking about other
ways to generate mutators. I didn’t really talk at all about the complicated loop and deciding
when should I do generation, when should I do mutation? Which mutations should I do? How long should I do
those two things? When should I discard the seeds? There’s lots of policies that
you can imagine exploring that we haven’t really done yet
that might make a big difference. Same thing, there’s a
bunch of insights from the regular fuzzing community
that can make sense there. Okay. So in summary, coverage guided property-based
testing is a pretty straightforward combination of random testing and
coverage guided fuzzing. The key insight is the notion
of a semantic mutation. The even more naive combination is the Crowbar approach, and our improvement to that is to realize that semantic mutations
are likely to do better. We implemented this in FuzzChick
and showed that it works well. Actually, one thing I should
have mentioned that we’re doing now is we have a whole corpus of other examples that we’re working
on to further evaluate this, basically to test those different
scheduling strategies and so on. So for example we have a
development of System F, and we’re seeding the System F type-checker and operational semantics with bugs, and seeing: can we find those bugs when
trying to falsify type soundness? I think that is all I have. Yeah. So thank you.>>Professor, the coverage information is used to decide whether you want to add this generated input into the seed pool or not. But other than that, if a new coverage path is discovered, that coverage
information is not being used in the mutation itself?>>No. That’s correct. It’s just whether to keep it in the seed pool.>>Yes.>>Yeah.>>That could direct the mutation.>>So at the moment the coverage information is really simple in order
for it to be fast. All that it’s doing, the way that AFL works, is it instruments the program; we’re using the AFL-style coverage. It instruments the program so that there is a global variable that is a dynamically determined indicator of an edge in the control-flow graph: basically, every time you reach a basic block, there’s a number seeded for that basic block, and the edge is the XOR of that block’s number and the previous block’s number, and then that’s thrown into a hash table to keep track of how often that edge is covered. So the upshot of this
is we have no idea what that edge is actually doing
inside of the program and so knowing that there is that
new edge doesn’t tell us what about the input
precipitated that new edge, what that part of the
program is doing. So I think we would have to
figure that out if we were to use the fact that it was that edge
to decide what mutation to add. But that’s an interesting idea. Maybe changing the way coverage
works to be a little bit slower but maybe a little smarter, communicating more information, could help.>>That sounds like
something like SAGE, right?>>It does actually.
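To make the AFL-style bookkeeping described above concrete, here is a toy model of the edge-coverage scheme. This is a sketch, not AFL’s actual implementation (which injects instrumentation at compile time and counts hits in a shared-memory bitmap); the class, the block names, and the 16-bit IDs are all illustrative.

```python
import random

class EdgeCoverage:
    """Toy model of AFL-style edge coverage: each basic block gets a
    random ID, and an edge is keyed by XOR-ing the current block's ID
    with the (shifted) previous block's ID, counted in a table."""

    def __init__(self):
        self.block_ids = {}  # basic block name -> random ID
        self.prev = 0        # shifted ID of the previously executed block
        self.hits = {}       # edge key -> number of times covered

    def block_id(self, name):
        # "There's a seeded number for that basic block."
        if name not in self.block_ids:
            self.block_ids[name] = random.getrandbits(16)
        return self.block_ids[name]

    def visit(self, name):
        cur = self.block_id(name)
        edge = cur ^ self.prev                        # the edge key
        self.hits[edge] = self.hits.get(edge, 0) + 1  # the hash table
        self.prev = cur >> 1  # shift so A->B and B->A get distinct keys

cov = EdgeCoverage()
for block in ["entry", "loop", "loop", "exit"]:
    cov.visit(block)
# The edge keys are opaque numbers: they say nothing about what those
# blocks do, which is why new coverage alone can't suggest a mutation.
print(len(cov.hits))
```

The final right-shift mirrors what AFL’s whitepaper describes, so that the two directions of an edge between the same pair of blocks get different keys.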
Yeah. That’s right. That’s basically what a
symbolic executor would do.>>Oh. Yeah. Yeah. Great.>>Okay. So there’s a lot of work on sampling out of constraint spaces. They are trying to do uniform sampling out of some Boolean
formula, or something like that. Is any of that useful? Is it just too slow to apply in a testing framework?>>So you’re thinking of that precondition on the left-hand side. That’s a state space. Can I sample out of that space?>>That’s right. There are
techniques you could use to sample out of that
space uniformly.>>I don’t know. That’s
an interesting idea.>>Well, so these
programs that you are testing are effectively taking a single input and producing an output. So has this kind of
technique been used in anything for any kind
of reactive program? Something where you put
in an input and you get an output and you have to
respond with another input. Can you imagine using a
mutation-based technique for that?>>So you can make your properties work that way; they’re still finite. This is kind of Emory’s point: you just then have a much longer run. You either have to break up the property so that you’re looking at some step-wise invariant that you’re trying to test against, and so then you have multiple properties, maybe one considering the first step, one considering the second step, or you have to actually run multiple steps, which then
slows down your testing time. So my suspicion is that most of the random-testing work is for as-quick-as-possible properties that might be for a reactive system, but they’re unit-testing elements of that system rather than looking at the actual end-to-end performance of that thing, and I think that’s just because the testing works better when the tests run as fast as possible.>>I guess what I’m imagining
is you get an output and you have to produce a responsive input, right? The precondition will depend on the previous output. So now if I had my previous test, it might be that the mutation of that previous second input is not useful because the new output that I got is completely different.>>In some ways the way that I’m
thinking of it is that you’re saying that it’s a stateful system and then the output implies that the next input is going to be processed
into that new state. So I could translate that
into a single property by including in my precondition: the system is in this state, here’s the input, what happens. I have another property: the system is in that state, here’s the input, what happens. And I break it up that way. I imagine that’s what people who use property testing are doing: they’re thinking about
those different states, they’re adding them
to the precondition. So I could take an idea from
your suggestion which is, “Hey, could I have even a
higher level property which is about the system as
a whole and could I break it into parts that I can test individually in order
to make it quick?”>>Is there another question? Emory?>>Yes. So I have a
random meta question, since we are talking random all the time. So I know this isn’t in the context of the provers and so on, but it seems like there could be some guidance, in a sense, from the way fuzz testing is used to how programmers write code. So have you thought about that? Like, here’s a methodology for writing code or producing APIs or something that would make them more amenable to fuzzing.>>That’s interesting.
I’m a little bit reminded of a paper, though I don’t remember the name or where it came from. So one problem with building resilient systems is that when
you have lots of components that have sparse preconditions
and you think that this invariant is being satisfied and some component of the
system is not doing it. So locally those components
are not resilient. I assume my inputs are always non-null, I assume that this number is within this range, and I just take on faith that that’s true; and if I get something that’s not that way, then the
whole thing falls apart. So you could imagine writing systems that were not that way, where you did redundant null checks because you didn’t trust yourself. You didn’t fully formally prove the system to see whether or not your inputs do satisfy the precondition you had
otherwise been assuming. Certainly if you did that, that would make fuzz testing
easier because it would make it so that more inputs would be acceptable to those
functions in the program. But now that I say that, those inputs are not going to be interesting because then they’re just
going to be discarded right away once you
enter the function. Yeah. I don’t know. I don’t have any good ideas
other than that bad idea.>>So you mentioned that you are working now on doing the same for a calculus, trying to falsify type soundness. Are you hopeful this would be a good approach? Are you hopeful that you’ll be able to satisfy these perhaps stronger preconditions? Or, I guess to rephrase the question, how does the precondition that you had for this thing compare, in terms of difficulty, to what, say, these programs will take?>>So it depends. So we did a case study with the
simply typed lambda calculus, and we found (this is a good thing, but it’s a little disappointing for us) that totally random testing was fine. For the simply typed lambda calculus, it was actually easy to generate terms that were well-typed right from
the beginning, and you could very quickly find the bugs. It wouldn’t be nanoseconds, but in eight seconds or something like that, it would find the bug. Now, for System F that’s not true. It’s much harder to
generate stuff that’s a well-typed System F term that will catch, like, wacko De Bruijn index errors, or things like that that you could very easily make. So in that case the
random tester is garbage. We’re still in the
process of evaluating whether our technique
is better than garbage. It seems like it is. Could
we do more than System F? Well, we could do whatever the name of that calculus is that you work on for Scala.>>The Vector Calculus?>>Yes, exactly. I don’t know. I think that’s where we’re headed. I would like for that to
be the case and those are certainly going to be case
studies that we use going forward.>>Yeah.>>On slide 33, for example, it shows those were bugs that you had generated. So have you done any analysis of whether those bugs were things that you were likely to actually encounter, if you will?>>Right. So Leo, the first author, wrote those systems
and did those proofs, participated a lot in the
proofs and could attest that some of those bugs were
bugs that were in the system. So they did this paper in 2013. That’s where the handwritten generator came from. They started doing this testing
stuff then because they realized, “Gosh, we’re introducing these bugs. We keep trying to do the
proofs, they’re not working. I wonder if we could do
testing to find these bugs.” So many of those bugs in that list were actual bugs that had happened before, and then we were trying to be systematic about how to produce those bugs. I don’t know if somebody
else doing the proof might have produced one
of the other bugs or not. It’s a single-person case study. We did a little bit of
work looking at student-written interpreters for my programming languages class, where they have bugs, but unfortunately we had the same problem as with the simply typed lambda calculus: this interpreter was not sophisticated enough that coverage guidance provided any advantage over normal random testing. Random testing could satisfy the preconditions well enough that, even with real student bugs, it would just find them. So ideally what you would do is you would be
able to test on things that were interesting
in the real world, like these real, bigger programming languages you’re trying to prove stuff about, and then you could test them on real bugs, and hopefully we’ll get to that. I just don’t know how many
examples there might be.>>I’m wondering if this could
be applied to QuickCheck itself rather than in Coq. It’s a bigger community of
people doing QuickCheck.>>So for sure the reason that we did it in QuickChick was purely pragmatic, and that is that there is an OCaml compiler extension that adds the AFL-style coverage
and we could therefore do this comparison against
Crowbar directly and we could also extract the Coq code to OCaml and then instrument
that extracted code. If we did it for QuickCheck, we would need an instrumentor
for Haskell programs to add all this coverage information
and that might also be, I mean, maybe that would be cool, but it would be a little interesting given that Haskell is lazy; what would that mean, maybe it doesn’t make any difference. So there just isn’t one of these instrumentors. So it was a pragmatic choice.>>Let’s thank Mike
again. Is there one?>>I just want to
make one comment, that I worked with Erlang QuickCheck primarily.>>Oh. Okay.>>I believe [inaudible]
has it built in. You can do it distributed and non-distributed and you can get
coverage while you’re running the tests and we
actually use this to run QuickCheck in a distributed mode
and then aggregate the coverage. So you would have
that facility there.>>Great. Okay. Well,
then we should try that.>>Yeah. So that might be
another alternative to Haskell.>>Send me an e-mail.>>I will.>>Great. Thank you.>>Great. I’m glad I waited
for that. All right.>>Let’s thank Mike again.>>All right.
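Putting together the pieces discussed in the talk (semantic mutation, the seed pool, coverage feedback, and discarded runs), here is a minimal sketch of a coverage-guided property-based testing loop. It is hypothetical and heavily simplified, in Python rather than the Coq/OCaml of FuzzChick; the function names, the 80/20 generate-versus-mutate split, and the keep-every-new-coverage seed policy are illustrative stand-ins for exactly the scheduling policies the talk describes as unexplored.

```python
import random

def fuzz_loop(generate, mutate, precondition, prop, coverage_of, budget=1000):
    """Coverage-guided property-based testing, heavily simplified.

    generate()      -> a fresh random input
    mutate(x)       -> a semantic mutation of input x
    precondition(x) -> False means the run is *discarded*, not a failure
    prop(x)         -> False means x is a counterexample
    coverage_of(x)  -> set of (opaque) edge keys covered when running on x
    """
    seen = set()  # all coverage edges observed so far
    pool = []     # seed pool of coverage-increasing inputs
    for _ in range(budget):
        # Scheduling policy: when to mutate a seed vs. generate afresh.
        if pool and random.random() < 0.8:
            x = mutate(random.choice(pool))
        else:
            x = generate()
        if not precondition(x):
            continue                 # discarded run: not a bug, not a seed
        if not prop(x):
            return x                 # counterexample found
        cov = coverage_of(x)
        if cov - seen:               # new edges: keep x as a seed
            seen |= cov
            pool.append(x)
    return None                      # budget exhausted, no counterexample

# Toy use: inputs are even integers; the property fails only at 6, which
# mutation reaches by walking retained seeds upward (0 -> 2 -> 4 -> 6).
found = fuzz_loop(
    generate=lambda: 0,
    mutate=lambda x: x + 2,
    precondition=lambda x: x % 2 == 0,
    prop=lambda x: x != 6,
    coverage_of=lambda x: {min(x, 8)},
)
print(found)
```

The toy usage shows why semantic mutation plus the seed pool matters here: fresh generation only ever produces 0, while mutating retained seeds walks toward the failing input.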
