One-Button Change

Preface

We all want to code quickly. If someone gave us the code we wanted right now, we would gladly accept it. We also want to design things that are easy to understand, modify, and reuse. That includes refactoring existing code so that it is all of the above. Similarly, we want to understand unfamiliar code bases of the type we find when starting out on a new part of some project so that we can get to work. In the process, we also want to test code so that it is more robust and debug any errors we get as efficiently as possible.

What causes these things?

How to Code Quickly

Summary

Make one encapsulated change at a time to a working product to get one additional property, setting input and getting output quickly, perhaps duplicating code to avoid testing for regression, and deducing the output instead of running tests every so often.

Make one small change at a time so you don’t waste time debugging, i.e., narrowing the causes of the buggy output. Don’t change stuff deep down in the code. Do it only at the top-level - at the level of the domain language. For example, switch between two scheduling algorithms using one high-level change instead of three changes scattered through the code. Similarly, don’t add a feature and refactor it into an abstract, beautiful design at the same time. Of course, you’d write the cleanest design you are capable of generating without any errors. For example, I can write a foldr call in Haskell without breaking a sweat whereas a novice might want to start with an explicit recursive function and then abstract it into a foldr.
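To make the foldr point concrete, here’s a Python analogue (a sketch; the function names are mine): the explicit loop is the novice-safe first pass, and the folded form is the abstraction you’d write directly only if you can produce it without error.

```python
from functools import reduce

# Novice-friendly first pass: an explicit loop, easy to get right.
def sum_of_squares_loop(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

# The abstracted form - write this directly only if you can do it without error.
def sum_of_squares_fold(xs):
    return reduce(lambda acc, x: acc + x * x, xs, 0)
```

Either way, abstracting is a separate step from adding the feature: get the loop passing first, then refactor it into the fold.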

Have a working implementation at all times because you can say useful things only when you have a positive exemplar. Specifically, start with a Hello World program that does the bare minimum thing from end to end. And settle on your high-level design before you have a lot of low-level code and, if you can’t avoid that, start with a blank slate and migrate the old code so that you always have something working.

Implement and test a new function in isolation so that you can change the code, set input, and get output quickly. Use first-class values so that you can set the input and test the output easily. Have a pipeline that automatically creates the final output from your code so that you can quickly see the effect of each change on the output instead of having to run a dozen steps before you see anything. Similarly, work in the final medium of a PDF or a publishable essay or a report that your machine learning project is supposed to generate. Experiment on a tiny version of your project so that you can alter features cheaply (this is about speed of changing, whereas the earlier suggestion to change the high-level design in isolation was about knowing what causes bugs). Otherwise, you would have to change a lot of the low-level bits even though you get the same high-level feature change. Similarly, if it takes a long time to compile a LaTeX file, have a separate, small file for experiments.
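As a sketch of the first-class-values point (names invented for illustration): a function that takes data in and returns a value is easy to test in isolation, unlike one that reads and prints from the middle of a pipeline.

```python
# Hard to test: input is a file on disk, output goes straight to the screen.
def print_report(path):
    with open(path) as f:
        for line in f:
            print(line.upper(), end="")

# Easy to test: first-class input (any iterable of lines) and
# first-class output (a string you can assert on).
def format_report(lines):
    return "".join(line.upper() for line in lines)

# The side effects stay in a thin shell around the testable core.
def print_report_v2(path):
    with open(path) as f:
        print(format_report(f), end="")
```

Now you can set the input and check the output of format_report directly, with no files or printing involved.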

The number of properties you want to satisfy determines how long it will take to implement. When you want to write a new function and you’re struggling or out of time, add one test for one property you need to satisfy (or however many properties you can comfortably implement right away) and make it pass. Don’t wait for the day when you will perfectly satisfy all properties on the first try. Minimize the number of callers so that you have fewer relations to satisfy.
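A hypothetical sketch of satisfying one property at a time: implement median for odd-length lists first, make that test pass, and add the even-length property in a later pass.

```python
def median(xs):
    ys = sorted(xs)
    n = len(ys)
    if n % 2 == 1:                                # property 1: odd-length lists
        return ys[n // 2]
    return (ys[n // 2 - 1] + ys[n // 2]) / 2      # property 2, added in a later pass
```

Each property gets its own small test, so you know exactly which one you’ve satisfied so far.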

If you find it hard to refactor a function before adding your new feature, duplicate it, write your test, pass it, and then refactor. When you duplicate, you’re guaranteed not to regress (which is what takes a long time to ensure), but if you don’t refactor soon, the code may diverge. This may not be the best way. I need empirical evidence.

Predict the behaviour of some code by deduction, not by blank-slate empiricism. You know the language well enough to tell what a sequence of familiar statements will do.

TODO: Learn powerful abstractions so that you can satisfy a bunch of properties in one step instead of having to walk through multiple iterations. Also, powerful abstractions mean less code.

One, Small, Encapsulated Change at a Time

Coarse-grained Tests can cause Costly Debugging?

Hypothesis: Debugging seems to take up a lot of time.

The effect you observed was that you spent a lot of time on variations that still failed your test. But what was the cause? What did you observe in advance that could have predicted those failed attempts?

Hypothesis: If it’s a test you’ve never passed before, that’s implementing a new feature. If it’s a test you’ve passed before but which fails now, that’s debugging.

When implementing a new feature, a test that needs multiple spec features will cause more failed attempts than individual tests for those spec features.

When debugging a test that newly fails, having more changes since the last known-good version will cause more failed attempts.

So, we should have finer-grained tests when implementing a new feature and finer-grained changes when trying to debug a failing test.

Of course, there’s also your knowledge of how to implement the spec features.


Test: Suppose you’re writing an arithmetic expression calculator. One test could be that “1 + 2” should return 3. Another could be that “1-23+4/56” should return -22 (assuming integer division). If your calculator implementation fails the first test, you know it’s because you’ve wrongly implemented addition. But if it fails the second test, you don’t know if it’s because you’ve wrongly implemented addition, subtraction, division, or operator precedence. You may have to check all of their implementations and try several variations till you pass the test.
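A minimal sketch of such a calculator (my own toy code, assuming integer division): the fine-grained test pinpoints addition, while the coarse test leaves several suspects.

```python
import re

def calc(s):
    """Toy integer calculator for +, -, *, / with the usual precedence."""
    tokens = re.findall(r"\d+|[+\-*/]", s)
    # First pass: fold * and / into the term on their left.
    terms, ops = [int(tokens[0])], []
    for i in range(1, len(tokens), 2):
        op, rhs = tokens[i], int(tokens[i + 1])
        if op == "*":
            terms[-1] *= rhs
        elif op == "/":
            terms[-1] //= rhs      # integer division, as assumed above
        else:
            ops.append(op)
            terms.append(rhs)
    # Second pass: apply + and - left to right.
    result = terms[0]
    for op, t in zip(ops, terms[1:]):
        result = result + t if op == "+" else result - t
    return result

# Fine-grained test: a failure here points straight at addition.
assert calc("1+2") == 3
# Coarse test: a failure here could be subtraction, division, or precedence.
assert calc("1-23+4/56") == -22
```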

Test: Say you’ve passed the test for “1-23+4/56”. When implementing general operator precedence, if you change the code for division and the test fails, you know that the latest change was at fault. But if you change many things and then find that the test fails, you won’t know which changes to blame. You’ll have to try toggling many of them to pass the test again.

What if you have a smaller test like “1 + 2”? Can’t you ignore the changes outside the addition code? I guess you can if you know how to map the spec to the code. If you know that all the addition code is in a particular file and that there has been only one change there, that must be the culprit. So, it’s a question of the number of changes to the part of the code implementing that spec.

Observation: Debugging - trying variations that still fail your test - seems to be pretty hard and time-consuming. It’s a lot more confusing and frustrating than normal coding, where a variation usually passes your test. Notice that I don’t have an advance prediction of when I’ll have to debug and when I’ll have to code “normally”.

17 out of the 30 hours I spent on OS HW2 were lost in debugging.

Test: [2019-03-02 Sat] One of my worst debugging sessions in the last year (but still not as bad as my debugging sessions with XINU before I started writing unit tests for my OS code).

Turns out that you were supposed to load and store tagged values as such (duh! That’s their whole point) but I didn’t pick up on that and got all kinds of errors.

I’ll note that testLibLists11 was an end-to-end test, not a unit test. It involved all kinds of irrelevant and unfamiliar library code. Didn’t know which features I’d implemented wrongly - block-set, block-get, literals, or something else.

First session:

testLibLists11 and a bunch of other lib lists tests failed

5m - the problem seems to be what I’d noticed earlier - BlockV in the interpreter seems to take three arguments when I give it only two and it is, in fact, defined to take only two
20m - caught up on Piazza - realized that BlockV is defined twice in the interpreter
22m - decide to narrow the test case
40m - restart
56m - lots of fails because I couldn’t print the CPS tree
57m - realized that a list is a bunch of conses
1h - need another hour, I think

CLOCK: [2019-02-26 Tue 17:29]–[2019-02-26 Tue 17:52] => 0:23
CLOCK: [2019-02-23 Sat 16:50]–[2019-02-23 Sat 17:27] => 0:37
:PROPERTIES: :Effort: 3:00 :END:
[error] Test miniscala.test.CPSValueRepresentation_Blackbox.testLibLists11 failed: scala.MatchError: (block-tag,List(IntV(57)))

Second session:

debug!

2m - realize that it is a null-terminated linked list - probably fail because my block-alloc is not returning the right type of value - fail twice when bringing in the next line because of putChar vs putchar and trailing semicolon
5m - l.tail.head works - probably fail because of tail.tail - but why?
6m - again the semicolon shit
6m - actually CPSHigh interpretation worked and the CPS code looked correct, so it probably failed because of the CPS low code
9m - realize that I need to extract this into a general function for later use
10m - start with a positive exemplar because I’m smart like that [Ed: Doing this after 1h10m of hair-pulling - yeah, real smart.]
12m - haven’t tested the return value of block-set
13m - fail because block pointer needs to be tagged with 00, which I saw from the unit test I wrote
21m - coded block allocate and fixed multiple bugs in the same iteration
22m - fail because I still missed a valp
28m - code block-tag but fail because of v[3 instead of v]5
29m - start block-length
30m - write failing test
32m - done copy and pasting
33m - I had known this would be a problem - “TODO(pradeep): Confirm whether you need to untag and tag here.”
35m - failing test done
37m - test passed without a hitch
39m - failing test done
42m - failed because I didn’t replace blockName with blockPointerUntagged and because I used v[1 instead of v]3 - both cases of not fully propagating changes
43m - different fail in the list black-box test because block-set is getting pointer as 0 - maybe I should convert the block-alloc return value to block-pointer not integer
49m - block-tag seems to take in a tagged pointer
50m - as do the others - why is it failing?!
52m - start analyzing the CPS low code
57m - block-set of O itself is getting 0 as a pointer - fail because of that and because I didn’t expect to be debugging an hour later

CLOCK: [2019-02-26 Tue 18:44]–[2019-02-26 Tue 18:55] => 0:11
CLOCK: [2019-02-26 Tue 17:52]–[2019-02-26 Tue 18:39] => 0:47
:PROPERTIES: :Effort: 1:00 :END:

valp v_11 = block-set(l_1, i_13, v_12);
valp x_3 = id(v_11);
valp v_8 = block-set(l_4, i_9, l_1);
valp x_6 = id(v_8);
valp v_5 = block-set(l_7, i_5, l_4);

Third session:

debug

5m - block-set is passing in an untagged block pointer which is somehow 0
8m - long cycle time - if I want to find out what block-set returns, I can just test it out in isolation
13m - realize that I don’t have a CPS parser - have to decompile my CPS low code to MiniScala code
16m - fail because I can’t deal with block-pointers directly in MiniScala
17m - try writing the CPS low tree by hand
22m - a simple bit of documentation could have saved me all this work
23m - fail because the interpreter can’t see size - fair enough since its value isn’t defined
29m - ok, the slides for block-alloc and block-tag don’t untag the block pointer
30m - damn, did I make a bunch of changes without external feedback?
41m - ah, block-get can sometimes return an int and sometimes a block pointer - that’s why l.head worked but not l.tail.head - so maybe block-get shouldn’t tag its output as an integer
49m - confused by the extra block-alloc-2 in the low CPS code
50m - ok, it’s just the variable declared
52m - rather, it’s Nil
54m - I think I get it - failed because the value passed to block-set is untagged like an integer even though it’s a pointer
57m - low interpreter printed apostrophe (39) instead of O (79)
1h - even char is not being set properly - it’s being treated as an integer
1h03m - I guess we save the tagged value itself!
1h15m - all black-box tests pass apart from those that use composition, which needs closures

good estimation - I really didn’t expect it to take even an hour - took 1.5 hours
failed repeatedly because I wasn’t clear whether we saved tagged or untagged values
took unnecessarily long because I was coding against the broad feedback of a test involving an unfamiliar library - I should have translated that into a simple failing test using only primitives

CLOCK: [2019-02-26 Tue 18:55]–[2019-02-26 Tue 20:19] => 1:24
:PROPERTIES: :Effort: 2:00 :END:

Test: [2019-03-02 Sat] Another contender for worst debugging session of the last year - when I wrote a great unit test for my implementation of Naive Bayes but wimped out at the last minute by just copying whatever model accuracy I got on the test input and using that as the expected output. Turns out that that killed me and cost over four hours of debugging over two days to recover from. I changed all kinds of factors like the input file, the number of columns, the number of rows, and so on. I measured correlations between individual columns and the output variable. Nothing pinpointed the error. Realized at last that my likelihood calculation was wrong.

Test: [2019-03-02 Sat] Changed a bunch of things in my Emacs configuration on my lab computer and pushed the code. Will it work right on my laptop? Don’t know.

Encapsulate your Change

Hypothesis: One key factor in programming efficiently - aka processing evidence about our program - is to make one high-level change at a time before you get evidence.

This means that you should be able to choose between your two design choices, say process scheduler A and process scheduler B with just one small change. No changing variable assignments in three different places, no hasty commenting of 15 lines of code, no tweaking of multiple configuration files - just one button press.

Basically, you’re giving two different inputs to the function, analogous to two different input strings. Now, you can either switch between them by changing lots of words in that string manually, or you can press a button to swap the entire first input with the second input (while not wasting time on copying the common parts). In other words, you get fast state transition.

(Why would this weird condition help write programs faster? We’ll see.)

How will this affect the way you code?

For one, you can’t change multiple lines of code within some other function. Because then, if you wanted to roll back those changes, you would have to edit multiple lines.

Corollary: Any change you make must be encapsulated within a new function before you run the program.

Now, what if you wanted to switch from scheduler A to scheduler B? Sure, A is within a function that you can comment out (or in a class method that you can change at runtime via polymorphism). But B could be a bunch of lines of code strewn across your project. So, to switch to B, you would have to uncomment all those lines. Not allowed.
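A sketch of what the one-button version looks like (hypothetical scheduler code; names are mine): each design lives behind one name, so switching is a single assignment rather than edits scattered across the project.

```python
# Two complete scheduler implementations behind one name (hypothetical example).
def scheduler_round_robin(ready_queue):
    # Pick the process that has waited longest (head of the queue).
    return ready_queue[0]

def scheduler_priority(ready_queue):
    # Pick the highest-priority process.
    return max(ready_queue, key=lambda p: p["priority"])

# The one-button change: switching designs is this single assignment,
# not variable edits in three places or 15 hastily commented lines.
pick_next = scheduler_round_robin
```

Rolling back to scheduler A after trying B is the same single edit in reverse, which is exactly what makes the comparison cheap.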

Hypothesis: Adding a new feature in an encapsulated way gives you a high-level variable that you can debug more efficiently. If the code fails after adding it, you can try toggling it quickly. If the code now passes, then at least one culprit is within the new code. You have narrowed down the suspects to the new code. If the code fails even without the new code, then you don’t have to suspect it unduly.

I guess the difference is in the speed with which you can eliminate factors. In the encapsulated case, it takes one step to eliminate or narrow it down to the newly-added code. In the shotgun case, it takes several steps to toggle all the newly-added lines of code and eliminate or narrow it down to them.

Conservative-focusing gives you guaranteed information about a variable. A high-level variable allows you to get guaranteed information about a bunch of variables. Your alternatives are to (a) toggle a bunch of variables in one step using a high-level variable, (b) toggle a bunch of variables manually in several steps, or (c) toggle one variable manually in one step.

I suspect that option (b) - taking several steps to get one piece of feedback - will not happen too often. Options (a) and (c) should be fine because every step gets you feedback. It’s just that (c) eliminates factors too slowly and will thus require a lot more iterations before hitting the right answer.


Test: [2019-03-02 Sat] I can generate, with a simple command, two versions of a midterm exam that I designed: one with just the questions and one with the questions and answers. All I have to do is run make midterm-exam.pdf or make midterm-exam-solutions.pdf to get the two versions. I don’t need to go in and comment out the answers each time I want to get just the questions. I just wrap the answers in a macro and disable the printing of stuff within that macro when I do make midterm-exam.pdf.

Test: [2019-03-02 Sat] I could turn Laplacian smoothing on or off in my Naive Bayes implementation by changing its flag. I didn’t have to go to each place that differed based on whether smoothing was on or off (including the test code).
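A sketch of what that flag might look like (parameter names invented for illustration; not my original code): the smoothing decision lives in one place, behind one boolean.

```python
def likelihood(count, class_total, vocab_size, smoothing=True, alpha=1.0):
    """Estimate P(word | class); Laplace smoothing sits behind a single flag."""
    if smoothing:
        return (count + alpha) / (class_total + alpha * vocab_size)
    return count / class_total
```

Both the main code and the test code call likelihood, so toggling smoothing is one change in one place rather than edits at every call site.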

Switch between Versions using a Button that Applies the Diff

Question: How long does it take you to switch from one version to another?

Hypothesis: If you arrange a button that applies certain diffs, you can switch to a different input with less effort than if you apply the diffs manually.

Keeping two separate versions in their entirety will also let you switch with one button, but it will require double the changes whenever you change your design and will cause bugs when you update one but not the other.

One factor is the number of times the differing piece of code is used; another is whether you know precisely where it is used; yet another factor is the effort required to apply a diff as opposed to a one-button change.

Applying a diff using version control history takes more work in finding the right diff than a one-button software change.

Hypothesis: Inheritance allows you to apply a diff of new methods and instance variables by changing from the parent class to the child class.

Polymorphism allows you to apply a diff of a class of one type to another class of the same type without changing the receiving variable’s type.


Test: One concrete example is that of a function abstraction:

squarePlusCube x = x * x + x * x * x
print $ squarePlusCube 7

Imagine if you didn’t have function abstraction:

print $ 7 * 7 + 7 * 7 * 7

Now, if you wanted to change 7 to 8, you’d have to go and change all five occurrences, which you can do with a search-and-replace here, but not when the variable is distributed over a large library. Contrast that to changing squarePlusCube 7 to squarePlusCube 8. That’s one-fifth the effort.

Test: Another example is when the variable is not duplicated but is buried deep in a function:

def f(W, N):
    for i in range(N):
        for w in range(W // 2, wt_list[i] - 1, -1):
            dp_arr[w] = min(dp_arr[w - wt_list[i]], dp_arr[w])
f(sum(wt_list), len(wt_list))

If you wanted W to be the product of wt_list, you could change it right in the function call. Contrast that to:

def f(N):
    for i in range(N):
        for w in range(sum(wt_list) // 2, wt_list[i] - 1, -1):
            dp_arr[w] = min(dp_arr[w - wt_list[i]], dp_arr[w])
f(len(wt_list))

You would have to search for the place where sum(wt_list) is being used. That takes time (and is error-prone).

Test: You’re applying a diff throughout the program. In the function abstraction case, the diff was just a one-variable change. In the general case, it could be many variables and many lines of code changed.

Contrast

sumOfEvenSquares = sum . map square . filter even

to

productOfOddCubes = product . map cube . filter odd

It’s not just one value or even one function name that changes. It’s three function names in three different positions.

Imagine if you didn’t name those two chains separately. Instead, you had just f = sum . map square . filter even. Now, if you wanted to switch to the other formulation, you’d have to go in and apply the diff manually. Change sum to product, square to cube, and even to odd. That’s three times as much work as changing sumOfEvenSquares to productOfOddCubes.

Test: Hardware versions vs software versions. There are two ways you can go back to a previous version of your program. You can either go up your version control history and retrieve the old version or you can change a flag so that the program executes a different function. The problem with the former is that the old version may no longer type-check given the other changes you have made.

For example, if your old version was sumOfEvenSquares = f . sum . map square . filter even, and you’ve later changed f so that it accepts a String, not an Integer, then sumOfEvenSquares won’t work as it is. You’ll have to re-apply the diff for f that you applied to productOfOddCubes. If sumOfEvenSquares were part of your code, your type-checker would warn you about it when you changed f.

More importantly, you can’t switch between the two versions with one change in the input. You have to search the version control history and apply the diff, which takes more steps.

Test: One alternative to keeping just the diffs is keeping the inputs in their entirety. So, instead of having one high-level function map f = foldr ((:) . f) [] and passing different arguments like in map sum and map product, you would keep two functions mapSum = foldr ((:) . sum) [] and mapProduct = foldr ((:) . product) []. Now, this too allows you to switch between the two versions with just one change. However, it duplicates the code! Tomorrow, if you decide to update the function to work on all functors, not just lists, you would have to update both the versions, for double the work.

Test: In OOP, you would do this via interfaces or inheritance, where you simply switch the class that is implementing the interface, and the caller code is none the wiser.

Consider

class Employee {
    var salary = 100000
}

class Programmer extends Employee {
    var numBugsFixed = 9001
}

We can switch between a Programmer type and an Employee type just by typing the different name. Keeping the two versions in their entirety would look like:

class Employee {
    var salary = 100000
}

class Programmer {
    var salary = 100000
    var numBugsFixed = 9001
}

So, inheritance allows us to switch between two largely similar types using just a change in their name instead of a tedious change in all the differing instance variables and methods.

Of course, it also helps in polymorphism. How does that save us effort when switching between versions? It means we don’t have to change the caller’s type. One less thing!

val fooEmployee: Employee = new Employee()
// to
val fooEmployee: Employee = new Programmer()

Without polymorphism, we would also have to change the variable’s type:

val fooEmployee: Employee = new Employee()
// to
val fooEmployee: Programmer = new Programmer()

Test: Higher-order functions allow you to switch between two functions without much effort, as in map sum and map product.

TODO: Implementing in Isolation

When you implement in isolation, you get a separate function. Now, that’s a clear diff from the existing code, so any failed tests must be because of that function. You can isolate the input that goes into that function and change it till it returns the right output. This makes debugging easier. Next, it implements only one feature of the spec, which means that you can write a test for just that feature, with a direct input and a direct output. You can try to pass that test by changing only this function, not any other part of the code.

How would you get one without the other? Simple.

For one, implement it in isolation but insert it into the rest of the code instead of keeping it as a separate function or class. Now, debugging failed tests will be harder because you will have to look at a lot of changes in a lot of places.

Conversely, have it as a separate function but implement it using a coarse end-to-end test. Now, debugging of failed tests will be easier because you will know you’ve changed only that one function or class, but implementing will be harder, I think, because your new test could fail because of any of the features involved.


Test: [2019-07-30 Tue] Was trying to debug my implementation of choose. Actually, no, I wasn’t debugging, since I had never passed the test. I was actually trying to implement the function and pass the test for the first time.

let rec choose n xs = match n, xs with
  | 0, _ -> []
  | n, [] -> []
  | n, x::xs -> List.append (List.map (fun ys -> x::ys) (choose (n-1) xs)) (choose n xs);;

assert (choose 2 ["a";"b";"c";"d"] = [["a"; "b"]; ["a"; "c"]; ["a"; "d"]; ["b"; "c"]; ["b"; "d"]; ["c"; "d"]]);;

Given that this was the implementation of a new feature, I needed to focus on making the test narrower.

Tried assert (choose 1 ["a";"b"] = [["a"]; ["b"]]);; but even that failed! Tried assert (choose 1 ["a"] = [["a"]]);;, which also failed. Finally realized that there must be something wrong with the base case, so I tried assert (choose 0 ["a"] = [[]]);;, which pinpointed the exact branch in which the problem lay.

So, finer feedback helped when implementing. Otherwise, I would have been changing multiple things trying to pass the original coarse test.
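For reference, here’s a Python rendering of choose with the base case the narrow tests pinpointed (returning one empty choice for n = 0, rather than no choices at all):

```python
def choose(n, xs):
    """All n-element sublists of xs, preserving order."""
    if n == 0:
        return [[]]      # corrected base case: exactly one way to choose nothing
    if not xs:
        return []        # can't choose n > 0 items from an empty list
    first, rest = xs[0], xs[1:]
    # Either take the first element or skip it.
    return [[first] + ys for ys in choose(n - 1, rest)] + choose(n, rest)

assert choose(0, ["a"]) == [[]]
assert choose(1, ["a"]) == [["a"]]
```

Note how each of the narrower tests above maps to exactly one branch of the function.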

Test: [2019-07-30 Tue] Trying to pass this test.

assert (possible_groups ["a";"b";"c";"d"] [2;1] = [
      [["a"; "b"]; ["c"]]; [["a"; "c"]; ["b"]]; [["b"; "c"]; ["a"]];
      [["a"; "b"]; ["d"]]; [["a"; "c"]; ["d"]]; [["b"; "c"]; ["d"]];
      [["a"; "d"]; ["b"]]; [["b"; "d"]; ["a"]]; [["a"; "d"]; ["c"]];
      [["b"; "d"]; ["c"]]; [["c"; "d"]; ["a"]]; [["c"; "d"]; ["b"]]]);;

That’s a pretty big test when implementing a new, unfamiliar feature. I need to test each part of the feature separately.

Took me a total of 53 minutes.

Have a First-Class Default

Question: What if you’re not creating alternatives, like two types of schedulers, but rather just one algorithm, like pipes for an operating system?

Question: What kind of change are you making at each point? What was the program like earlier and what is it like after the change?

Hypothesis: You should have a default stub that has the same type as your final code, but doesn’t affect the behaviour of the system. Then, you should have a one-button change that allows you to switch to the final code.

Corollary: Your unit test for a feature should fail when you switch to the default implementation and pass when you switch to the feature-filled implementation. As usual, it should be a one-button change.

Hypothesis: Add unit tests (or assert contracts) for everything that your new feature changes.

Hypothesis: Having a one-button change between feature and no-feature allows you to roll it back quickly to see what was caused by the feature itself as opposed to the earlier code.


Test: For example, my test for pipe deletion passed even though my pipdelete function was empty! I’d mixed up types in other places, which meant that my desired pipes were not getting initialized in the first place. So, the test passed whether or not I implemented my pipdelete feature. Good to know. Otherwise, I’d be scratching my head wondering which of my seven changes in different parts of the system had caused the bug.

[2019-03-02 Sat] However, that’s a negative exemplar. I don’t have a positive exemplar for a default stub. Specifically, when I say that the test must pass when I implement the feature and fail when I switch to the default stub, it’s not clear if I should switch to the pipes version just without this one small feature or if I should switch to the version without pipes at all.

I think you should have a way to get back to any point along your tree of development. Specifically, if you’ve added features A + B + C, you should be able to go back to A + B or A. You may not be able to go to A + C since that requires that C doesn’t depend on B at all, but that’s fine. And just going back in your version control history may not do the trick because you may have changed A to A' sometime later and would want to go to A' + B and not your original A + B.

This needn’t be as elaborate as a global flag for every feature you add. Maybe if you implement pipdelete as an isolated function and create a stub pipdelete_empty that does nothing, you can pass in the latter during testing to switch to your implementation without pipdelete.
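A sketch of that stub idea (names hypothetical; the real thing was OS code): the stub has the same signature as the feature but changes no behaviour, and switching between them is one assignment.

```python
# Do-nothing default: same type as the real feature, no behaviour change.
def pipdelete_empty(pipes, pipid):
    return pipes

# The real feature: remove the pipe from the table.
def pipdelete(pipes, pipid):
    return {k: v for k, v in pipes.items() if k != pipid}

# One-button switch. The unit test for pipe deletion should fail when this
# points at the stub and pass when it points at the real implementation.
delete_pipe = pipdelete
```

This gives you the corollary above for free: flipping delete_pipe to the stub is the quick rollback that tells you whether a behaviour comes from the feature or from the surrounding code.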

Test: What happens when you’re starting out with a new feature, like pipes in an operating system (completely hypothetical example; not related to my coursework at all)? If you press the button, you will switch to your neatly encapsulated function or class that implements pipes. But what if you want to go back to a pipe-less operating system, just to check for correctness or performance? You may not be able to, if you have pipe initialization and destruction code lying around, and if processes have information about pipes they may own.

That’s a negative exemplar. What would it take to make it a positive exemplar? (Note that it is controllable because it depends just on the code and not on my mental model.) You’ll probably need a macro because you will need to add or remove pipes from the struct definition of processes based on a variable. Fine. So, a macro PIPES_ENABLED that enables or disables pipes in the struct definition of processes and a boolean flag (set by the macro) that toggles between a version of the initialization or destruction code that uses pipes and a version that doesn’t. Now, you can switch between the pipes version and the non-pipes version with a single button press. And your default for every pipes-based function would be the original non-pipes version, such that the boolean flag for pipes chooses between them.

Upshot

Hypothesis: Single, easily-reversible change -> any behaviour that is in the program with the feature on and not in the program with the feature off must be caused (at least partly) by the feature.

Contrast that to shotgun change, where you aren’t sure which change broke your code.

Hypothesis: First-class objects -> make it easier to switch.

For example, a value or function or object of the same type.

Question: Do you really need to roll back that often? Yes, you will want to toggle a new feature while implementing it. But after implementing A, B, and C in sequence, will you need to check out AB’C? Or will you treat the existing program as correct and indivisible?

Observation: You will need to roll back if you think of a test or find a bug after the implementation.

Hypothesis: Hard to come up with good tests or properties in advance.

Hypothesis: If you come up with a test or property in the future, you need to toggle back through your features to see when it stops passing.


Test: [2019-03-02 Sat] Added a failing test that depended on tagged values being saved directly to memory by my compiler. Changed my implementation of block-set and block-get. The test passed.

But, before that, I had changed a bunch of functions to always untag and tag before calling block-get or block-set because I was convinced that that was the problem. Not only did my failing test still fail, seven other tests started failing too. Which change had caused the problem? Was it all of them? Just one? Sure, I could tell that all my changes put together were responsible for the eight failing tests, but what would happen if I had made just one change? I couldn’t tell. (This is still a negative exemplar.)

Imagine if I had “encapsulated” them into one change (somehow; maybe hidden behind a boolean flag). How would that help, though? Even if I wanted to include only three of the six changes I made, I would still have to go and delete those bits of code manually.

What I think would have helped is finer-grained tests. I was going around changing six functions in order to pass one test. If I could have updated their individual unit tests (or written new ones, in case I had added a new branch), then I could have given myself feedback after every change. Then, I would know if change 1 had worked as planned, whether change 2 had worked as planned, all the way up to change 6, and then check if their combination worked as planned on the failing test.

In short, if you want to know the exact effect of a change, you need to have a unit test that isolates that function. Otherwise, if you have just one test with which to judge your code, you have to toggle different parts of your code manually to see the result before and after. So, toggling features madly is a sign that you haven’t come up with fine enough tests and are trying to optimize a big program with just a few feedback mechanisms (tests).
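To make that concrete, here is a toy sketch of what fine-grained tests buy you. The tag/untag scheme below is a made-up stand-in for my compiler's tagging functions, not the real code:

```python
# Hypothetical tag/untag scheme: each function gets its own check,
# so a failure points at exactly one change.
def tag(n):
    return n << 1          # tag an integer by shifting in a tag bit

def untag(v):
    return v >> 1          # inverse of tag

# Fine-grained: one check per function.
assert tag(21) == 42       # only tag() is suspect if this fails
assert untag(42) == 21     # only untag() is suspect if this fails

# Only after each piece passes do we check the combination.
assert untag(tag(7)) == 7
```

Each assertion exercises exactly one function, so when one fails, only that change is suspect - instead of eight end-to-end tests failing for reasons scattered across six functions.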

The problem, of course, is that someone needs to know how to write those unit tests. That would involve figuring out the input that reaches function A, which is easy enough to do, and then figuring out the output that should leave function A, which is what takes some thinking. For example, I would have to think about the CPSLow AST that would leave transformBlockTag.

This sort of white-box testing is much more work than just writing the black-box test saying that “makeAdder should print ‘OK’” because you have to come up with a more detailed output value than just “OK”. In the CPS case above, you would need to know how to represent a let-statement in CPS and how to write a function call and how to define a continuation. You need none of that when checking whether the output of makeAdder is “OK”. On the other hand, you don’t get stuck in horrible debugging sessions where you change a bunch of things and nothing seems to work anymore. You’re always just a few changes away from a passing test.

One Dimension at a Time

Speed: Coding Hat vs Refactoring Hat

Hypothesis: adding a new feature, one long chain of function calls -> will pass the test case (one branch), very little time, will not overengineer. Refactor -> easier to test going forward -> quick. h/t Kent Beck.

adding a new feature, abstract code (using separate functions or classes) -> will pass the test case (one branch), will take more time because you’re changing the design too, may overengineer. Debugging -> slow.

adding a new feature, optimized code (low-memory quick functions or detailed, convincing examples) -> will pass the test case (one branch), will take more time because you’re changing the design at the same time you’re making it faster; may overengineer. Debugging -> slow.

Tests that already pass, ugly code, refactor -> really quick.


Test: [2017-11-05 Sun] Tried to write quick and dirty code to pass one simple unit test. Nearly got a heart attack in the process. Here it is:

def batsman_total(match, player):
	d = match.match_details['innings']['1st innings']['deliveries']
	print(d)
	total = 0
	for ball_num, ball in d.items():
		if ball['batsman'] == player:
			total += ball['runs']['batsman']
	return total

I couldn’t bring myself to write total += anything. Felt so fucking gauche. Ditto for the for-loop and direct testing (if ball['batsman'] == player). I felt like reaching for a map and filter. But that would probably be overkill. I needed to bring myself to cut through to the actual output.

But this code is fine for my purposes! I usually overengineer. I can code faster by writing quick and dirty code and then refactoring, instead of throwing in abstractions like map and filter and so on, especially in a predominantly-imperative language like Python.

I can now write a quick test for it: self.assertEqual(game.batsman_total(match, player), 72).
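Fleshed out as a self-contained sketch - the fixture below is made up (the real test ran against a full match and expected 72):

```python
# Minimal stand-in for the real match object (a hypothetical fixture,
# shaped like the nested data the code above indexes into).
class FakeMatch:
    def __init__(self, deliveries):
        self.match_details = {
            'innings': {'1st innings': {'deliveries': deliveries}}}

match = FakeMatch({
    '0.1': {'batsman': 'Sachin', 'runs': {'batsman': 4}},
    '0.2': {'batsman': 'Sachin', 'runs': {'batsman': 2}},
    '0.3': {'batsman': 'Dravid', 'runs': {'batsman': 1}},
})

def batsman_total(match, player):
    d = match.match_details['innings']['1st innings']['deliveries']
    total = 0
    for ball_num, ball in d.items():
        if ball['batsman'] == player:
            total += ball['runs']['batsman']
    return total

# The quick test: one assertion against the dirty implementation.
assert batsman_total(match, 'Sachin') == 6
```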

Then, I can refactor quickly against that test:

def batsman_total(match, player):
	d = match.match_details['innings']['1st innings']['deliveries']
	print(d)
	return sum(ball['runs']['batsman'] for _, ball in d.items() if ball['batsman'] == player)

That was easy! After I’d gotten going in Python again, I would have probably written that the first time around but, when I was getting started, I needed to get some code working quickly even if it was dirty.

Contrast that to the case where I refactored as I added code. I would have to prevent myself from writing any statement unless it was already refactored. So, I wouldn’t write down d = match.match_details['innings']['1st innings']['deliveries'] without recognizing the (valid) complaints that the variable name was bad and that I was indexing deep into an object (violating the Law of Demeter). To address the latter complaint, I would have to extract a match.deliveries() method. No, actually, I’d have to extract match.first_innings() and then innings.deliveries(), which would mean a whole new class for Innings. And, at that point in my design, I hadn’t actually decided if I should write a class to capture the fields of the JSON object or if I should use the nested object as is. Wow, that’s a long detour before I can write my first line of code.

We’d have the same problem with for ball_num, ball in d.items() - I’m relying on the fact that d is an iterable object. With ball['batsman'], we again have the problem of naked indexing. I’d need to decide whether to extract a Ball class with a batsman() field. And suppose I wrote it as a map and filter, I’d have to write a lambda to test if player was the batsman facing the ball.
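For comparison, the map-and-filter version I felt like reaching for would look something like this (FakeMatch is a made-up stand-in with the same nested shape as the real object):

```python
# Hypothetical stand-in for the real match object.
class FakeMatch:
    def __init__(self, deliveries):
        self.match_details = {
            'innings': {'1st innings': {'deliveries': deliveries}}}

# The map-and-filter version: a lambda to pick the batsman's balls,
# another to project out the runs. Overkill for this job.
def batsman_total(match, player):
    d = match.match_details['innings']['1st innings']['deliveries']
    balls = filter(lambda ball: ball['batsman'] == player, d.values())
    return sum(map(lambda ball: ball['runs']['batsman'], balls))
```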

All this would take me much longer to get any code working. What if it turned out that I didn’t need those new classes at all? All this effort would have been wasted.

Debugging would be a pain because I would either have to write a new test for every single method I extracted or have to use one unit test to debug multiple changes.

Hmm… So, when you’re refactoring after passing a unit test with a quick-and-dirty implementation, you’re still using just one test to get feedback about multiple changes (i.e., the extracted method batsman.deliveries() and the original method batsman.total() that calls the extracted method). It’s just that because you know that your old method passed the test and your only change is your method extraction, any test failure must have been caused by that method extraction. If you had tried to do both without passing the test first with dirty code, you wouldn’t know if the failure was due to the method extraction or due to some logic fail in your original code.

That means an alternative way to ease debugging pain when refactoring is to write unit tests for each part that you extract. But that requires a lot of thinking about the expected output of each of those parts.

Test: Optimizing at the same time as you’re coding: I might check whether the batsman was actually playing in the match before iterating over all 600 balls of the match. What could go wrong if I decided to optimize things this way before passing my first test?

First, I’d have to go and find out how to test if a batsman was playing in a match. I’d probably have to find the list of players and check if the batsman was in it. That takes time. And suppose I got it wrong - I checked among the players in one team but not the other. All of a sudden, my test would fail and I wouldn’t know if it was because my batsman total calculation was wrong or because my test for “is batsman playing in this match” was wrong. Once again, I’d be using one unit test to get feedback about two changes. And, again, to avoid this problem, I’d have to write new unit tests for “is batsman playing in this match”.

Test: Want to write down my completeness proof in LaTeX - hesitating because I feel like I need to write a complete report that flows from start to end, and maybe have diagrams and stuff. No! All I need to write is the proof alone. Everything else can come later. – one thing at a time.

Test: Spent the last one hour coming up with a crystal-clear model that would show how \(U_Z\) and Y are related even though that’s not the point of my proof. Dammit. – I was trying to abstract? Was I? I was trying to optimize for a great proof - one that would convince any snotty critics. But I didn’t really need that. This isn’t my final draft; this is just a proof of concept.

Test: Others seem to finish assignments sooner than me. Is it because they’re writing different code than me or because they’re writing the same code faster?

Test: OS HW3 - I’m taking a long time to pass each test. Is it some essential difficulty of the problem (with respect to my current skill) or am I just being slow somehow?

Test: Refactoring is really cheap! - it’s taking me a matter of 2-4 minutes to extract functions from what looked like irredeemably-tangled code (that actually passed the tests). – huh. One thing at a time.

Why is writing Abstract Code time-consuming?

Question: What is the exact difference in coding time between passing the test and then refactoring vs passing the test with clean code?

Hypothesis: Pass the test with ugly code -> 1 test (though you may sometimes test different parts of the chain too). Refactor the ugly code that works -> m tests (one for each abstraction like function or class).

You would go from no code to code that works. 1 test. Then, you would go from code that works to code that works plus one abstraction. Then, you would go again from code that works to code that works plus abstraction. And so on till you’re satisfied. m tests.

Pass the test with clean code -> suppose you add m abstractions and the test fails. You will need to toggle many of those abstractions (which will take time) before you isolate the bug.

You would go from no code to code that doesn’t work plus one abstraction. Then, from code that doesn’t work to code that doesn’t work plus another abstraction. And so on.

Even if you ran a test at each point, you wouldn’t know if it doesn’t work because the current abstraction is broken or because the prior abstractions are.

Corollary: This whole problem would be solved if you could add abstractions like you add code. Then, you could unit test each change and romp ahead.

Hard to Toggle Abstractions

Hypothesis: It’s easy to toggle tests.

I did it effortlessly with my pipe tests.

Hypothesis: It’s hard to toggle abstractions.

Except for the last few edits or so, our editors don’t allow us to roll back abstractions. And even when undo reaches that far, it’s a pain to use.

Hypothesis: Abstractions are usually set in stone.

Once you add an abstraction (like a class or function), it’s so hard to get back the original code that you stick with the abstraction.

But you don’t know if your abstraction itself was correct. So, you’re now uncertain about both your abstraction and your code’s behaviour.

Hypothesis: Hard to decide whether or not to go for an abstraction unless you know the potential use cases.

Corollary: You will dilly-dally. You won’t spend your time coding; you will spend it pondering.

One Thing at a Time: Abstract while Coding -> More Uncertain

For example, when testing a class in Python, I wrote the code to generate test data in the setUp method. But that meant the data was regenerated for each test case, which was wasteful. So, I decided to move it to the class constructor. But that failed because the TestCase constructor expected two inputs (I think. I’m not sure. That’s the point). Which ones? I don’t know. So, I looked online for more details. People recommended that I use a setUpClass method. But that was a “class method”. I wasn’t sure what that was - a static method like in Java or something else. Anyway, I tried using self.foo everywhere and the code failed. Why? Because self wasn’t even visible in the class method (obviously). So, I replaced every instance of self with cls (since that was the name of the argument). Then my code worked.

Things would have been different if I had tried a Hello World program that simply showcased a setUpClass method. I would know exactly what a “class method” did and whether it used self or cls. There would be no confounders like the rest of my code.
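For the record, such a Hello World program is tiny. This is standard unittest behaviour; the fixture data is a stand-in:

```python
import unittest

class TestWithClassFixture(unittest.TestCase):
    # setUpClass runs once for the whole class, not once per test.
    # It is a classmethod, so it receives cls (the class), not self.
    @classmethod
    def setUpClass(cls):
        cls.expensive_data = list(range(3))  # stand-in for costly setup

    def test_uses_fixture(self):
        # Instances can read class attributes, so self works here.
        self.assertEqual(self.expensive_data, [0, 1, 2])
```

Run it with `python -m unittest` and you learn, with no confounders, that setUpClass takes cls and that test methods can still read the fixture through self.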

Hypothesis: We are more uncertain about abstractions than about normal code. (I suspect this is true only if you have no positive exemplars.)

Corollary: So we are likely to make more mistakes when abstracting than when writing usual functions. That explains my struggle above.

Observation: When calculating the total runs for a player in the first and second innings, I wanted to abstract the loop into a function and call that function on the two parts. Instead, I forced myself to write two duplicate loops (cringe). But the code still failed the test. So, if I had abstracted the loop into a function, I would have been unsure if the problem was because of my abstraction or because of the code itself.

Observation: Holy crap. Refactoring was a breeze! I could see exactly what I needed to change (extract a Match::balls iterator) and did it without any worries about the behaviour.

Observation: Again, when trying to aggregate stats from 70 matches, I’m trying to look for the ideal solution where I pickle the YAML object for future use and so on. Instead, I should just run it in a quick and dirty way, no matter what the cost.

Observation: Lost my way after a while and reverted to my old over-engineering habits.

Test: [2018-10-30 Tue] Wanted to annotate the nodes of a tree with the path from the root. Struggled to do it using foldTree. Couldn’t imagine how to pass down information to a node from above. Finally just decided to do it the dumb, explicitly-recursive way. Done in minutes. I don’t think it can be done using a simple fold because it relies on information other than the current node and its children. I guess this is the tree equivalent of some morphism.
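Here is the same dumb, explicitly-recursive idea as a Python sketch (my actual code was in Haskell; the Node class is hypothetical). The trick is that the path is passed down as an extra argument - exactly the top-down information a simple bottom-up fold doesn’t give you:

```python
# Hypothetical rose-tree node.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def annotate_paths(node, path=()):
    """Return a new tree whose labels are (path-from-root, label) pairs."""
    here = path + (node.label,)
    return Node((path, node.label),
                [annotate_paths(child, here) for child in node.children])
```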

Default Values for Abstractions

Hypothesis: Have default values for abstractions -> can add an abstraction without making mistakes and without changing behaviour vs have to add an abstraction (perhaps making mistakes) and change behaviour too.

Maybe duplicate the concrete code you want to abstract and change one of them so that you can always revert to the other version for comparison.

Observation: Still hard to reverse an abstraction.

Premature Abstraction Considered Harmful

Hypothesis: Refactoring recursively (before or after you pass your failing test) is a relevant factor for finding it harder to debug and thus taking longer.


Test: What if I refactor indefinitely?

Let’s say refactoring is about getting the minimum set of relevant factors for some output variable. Roughly, it’s about minimizing the number of variables in scope while keeping all the variables you actually need. Just deleting variables at random is not refactoring.

So, I could clean up batsman_total by extracting batsman_balls and runs:

def batsman_total(match, player):
	return sum(runs(batsman_balls(match, player), player))

def batsman_balls(match, player):
	d = match.match_details['innings']['1st innings']['deliveries']
	return [ball for ball_num, ball in d.items() if ball['batsman'] == player]

def runs(balls, player):
	return [ball['runs']['batsman'] for ball in balls]

This makes batsman_total pretty clean but leaves some overly-concrete code in batsman_balls and runs. I could refactor them recursively:

def batsman_balls(match, player):
	return [ball for ball_num, ball in first_innings_balls(match).items() if ball['batsman'] == player]

def first_innings_balls(match):
	return match.match_details['innings']['1st innings']['deliveries']

def runs(balls, player):
	return [ball_runs(ball, player) for ball in balls]

def ball_runs(ball, player):
	return ball['runs']['batsman']

Great. Now, batsman_balls and runs are clean, but I could refactor first_innings_balls and ball_runs further, and so on.

Quick and dirty way of doing it - total runs = map over deliveries in both innings and if batsman is the given batsman, add runs scored by batsman. – This gives you a sufficient set of factors for passing the test. Now, your job is to eliminate irrelevant factors, i.e., refactor the code. Specifically, you don’t need to know whether the temporary variable for summing over the list of balls is called total or whatever. You don’t need to have that in scope.

Observation: What about abstract code that still just passes one test case? For example, Match:get_balls() and Match:map_over_balls() and runs_for_ball().

Observation: The abstraction Match:get_balls() may or may not work. You may have made a mistake in defining the method or in the module scope (if Match is in another file) or whatever. Source of uncertainty - have to test it.

Observation: You have multiple design choices - different levels of abstraction. For example, you could deal with separate methods for get_balls() and map_over_balls(), but not a separate class for PlayerStats. Or you could go the whole way and have subclasses for BowlerStats and BatsmanStats (and of course FielderStats) and have, god forbid, visitor methods to do complicated processing. But, for my problem of extracting data for machine learning, that is simply overkill.

Observation: Each time you add a new abstraction, you have to test it (since you may or may not have implemented the abstraction correctly, or may have even picked one that doesn’t fit your problem).

Quick-and-Dirty vs Clean: Why I Procrastinate

Hypothesis: When you have just one source of feedback, releasing a version with a few properties is a relevant factor for getting a (low-value) positive exemplar quickly.


Observation: I had the idea that mock objects were actually unnecessary if you strictly separated pure code from impure code, like in Haskell or other languages. That’s why you didn’t see Haskellers go ga-ga over mocks. (In fact, you don’t see anybody except Agile fellows.)

But then I got caught up in writing some elisp code (which took over two hours). Then I felt like searching online for some precision teaching books. And finally I was late for a video call. So I didn’t write down my full ideas about mock objects and didn’t commit that code. And it weighed on my mind even as I was freshening up.

Why didn’t I simply jot down the rough idea as I did above? It would have taken me a couple of minutes, tops. It may not have been perfect, but I could have committed the change and freed my mind.

Jotting down my mock objects idea – I wanted to get a general hypothesis that fit with all my other “complex” programming ideas.

This is a negative exemplar of getting a quick-and-dirty, working product. A positive exemplar would have been if I noted down my proposed relevant factor, marked it as TODO, and committed it. Not that it would have a controlled experiment or fit in well with my other hypotheses, but at least it would have a few good properties (that of being committed in version control and being placed in the right section). Contrast that to not writing it down at all, where it would have no properties at all.

A close negative exemplar would be if I didn’t note it down in my essays but rather in some loose paper or a stray file somewhere (like my lab notes) or didn’t mark it as TODO (like my Keep notes) or didn’t commit it in version control (like my usual additions).

Premature Code Reuse Considered Harmful

Hypothesis: If you can pass your test without using an existing but unfamiliar function, trying to use it is a relevant factor for taking longer to pass the test. (TODO)

If you can’t pass your test without using an existing but unfamiliar function, testing it in isolation is a relevant factor for passing your test quickly. (TODO)

If there’s an unfamiliar function that seems to do what you want to do, but you can write code quickly enough, just pass the test and then try to use the unfamiliar function. Otherwise, you’ll be changing two things at a time. However, if it’s a function that you have to use, such as an HTTP request function, test your understanding of it in isolation.
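For instance, a couple of throwaway assertions in isolation settle how an unfamiliar function behaves before you wire it into your code. These use standard Python builtins, not typeConform:

```python
# Before reusing an unfamiliar function, pin down its behaviour with a
# few throwaway asserts, in isolation from your real code.
# Does zip truncate or pad when lengths differ? It truncates:
assert list(zip([1, 2, 3], ['a', 'b'])) == [(1, 'a'), (2, 'b')]

# Does str.split() with no argument behave like split(' ')? No:
assert 'a  b'.split() == ['a', 'b']         # collapses runs of whitespace
assert 'a  b'.split(' ') == ['a', '', 'b']  # keeps empty fields
```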


Test: [2019-01-30 Wed] I was trying to use the method typeConform even though I could very well pass my current test case using a simple zip and map. And because that function was “unfamiliar”, I struggled with it for 15 minutes before realizing that it didn’t actually fit my purpose. – took much longer because I was trying to reuse code without understanding it fully. I would have gone along much faster with my own code even if it duplicated stuff. Refactoring it to reuse the existing method would be trivial after a passing test.

Again, this is a negative exemplar. A positive exemplar would be to pass my test case using zip and map and then refactor. But I don’t have an empirical test of that.

Have a Working Implementation at all Times

Code Transformation: Fail to Fail gives No Information

Hypothesis: Going from fail to fail doesn’t tell you whether your latest change was right or wrong.

Going from fail to pass tells you that your latest change is correct.

Going from pass to pass tells you that your latest change is correct. This is true if you changed the code being exercised by the test, but not necessarily if you added new code (you may have forgotten to test it).

Going from pass to fail tells you that your latest change is wrong.

Test: Code fails a test. You use the right sort function. The code may still fail if you made a mistake in a for-loop. Similarly, if code fails a test and you use a wrong function, the code will still fail. So, the fact that it fails even after a change doesn’t tell you whether the change was right or wrong.

How to Maintain a Working Product: Continuous Delivery

TODO: What is continuous delivery?

Hypothesis: Continuous delivery -> low risk and low tension when it comes to release day.

Release only the day before -> high risk and high tension.


Test: JWZ about Netscape - they released their browser for all platforms at the same time, whereas the conventional practice was to release Windows first and handle Mac later. – That’s what failed companies did.

The other big thing was we always shipped all platforms simultaneously; that was another thing they thought was just stupid. “Oh, 90 percent of people are using Windows, so we’ll focus on the Windows side of things and then we’ll port it later.” Which is what many other failed companies have done. If you’re trying to ship a cross-platform product, history really shows that’s how you don’t do it. If you want it to really be cross-platform, you have to do them simultaneously. The porting thing results in a crappy product on the second platform.

– JWZ, Coders at Work

A controlled experiment would be if Netscape decided to handle Mac later. That way, we would control for other differences between Netscape and other companies. The above example is not exactly controlled. For that, we can look at the example company from the Continuous Delivery book where they switched from last-day, manual releases to frequent, automated releases. That seemed to relieve tension and reduce errors.

However, JWZ’s case is a positive exemplar of someone releasing a non-crappy product on other platforms (according to his story above). So, if someone were to try to figure out some of the relevant factors for that, they might look at the key differences between Netscape and other companies, one of which is releasing on all platforms simultaneously.

Test: CS373 HW1 - Note how much time I’m spending just wrestling with R and CSV. I haven’t done any actual work yet. – Need to release a hello world before you do anything else. Here, it would have been a simple installation of R itself, then reading CSV in R, CSV with arrays, converting JSON to CSV, and then other stuff. All because I didn’t release sooner.

Continuous Delivery: Run Acceptance Tests in a Production-Like Environment

Hypothesis: Continuous delivery = running acceptance tests in a production-like environment.

(You still have to build and stage all the way to a production-like environment, running unit tests and integration tests along the way).


Test: Writing a research report in LaTeX - production-like environment is the LaTeX PDF (not your handwritten sheaf of papers); running acceptance tests - getting feedback from yourself by looking at it from the eye of a reader and by getting feedback from others. – yup. Lack of acceptance tests would be not looking at the final PDF from a reader’s point of view. Lack of production-like environment would be working with handwritten notes instead of a PDF.

Test: Negative exemplar - continuous integration where you simply run unit tests – neither is that a production-like environment nor are they acceptance tests.

How to Start with a Working Product: Hello-World Program with One-Button Installation

Hypothesis: A blank file is a negative exemplar because nothing works. You have to change a lot of things before it will work.

A Hello World program (set up with all the tools you want for your project) is a positive exemplar because it works after you install it. But you may have to run the installation steps manually, which leaves you with a negative exemplar until you get everything right.

A Hello World program with one-button installation is a positive exemplar right away.


Test: Wanted to test pre-commit. Could download their demo repo and test the pre-commit hooks with zero fuss. – easy.

Contrast that to downloading their code and installing their pre-commit hooks into my repo. I would have to work for a bit to figure out what went where.

Contrast that to writing my own pre-commit hooks from scratch. It would take me much longer to see the results.

Test: Node JS, wanted to use RESTful API, no working blank project - stuck for hours. Abandoned it. – spent all our time trying to get the RESTful API to work even though that wasn’t really the point of our project.

Have a Working System At Each Point

Hypothesis: Having a system that you know passes certain property tests is a relevant factor for knowing which properties a change breaks or maintains. It is also a relevant factor for narrowing down the relevant factors for each property that fails.

Hypothesis: Having a system where you know exactly which properties fail is a relevant factor for knowing which properties a change newly satisfies. It is also a relevant factor for narrowing down the relevant factors for each property that passes.


Test: Negative exemplar - designing the questions for CS373 HW3 - I initially dumped all the text from my notes and the past reference assignment - had a tough time deciding how to go ahead. It didn’t pass any property test like “look well-formatted” or “define every term you use”. (In fact, I couldn’t tell whether it passed that second property because it was all so messy that I couldn’t tell which terms had already been defined.)

If I added a couple of questions about decision trees, I wouldn’t know if I had used an undefined term or even if I had inadvertently defined the same term twice (which is another undesirable property). So, I couldn’t tell if the questions hurt or helped that property. Similarly, I wouldn’t be able to tell which other properties my new questions affected - maybe they added some weird formatting that makes LaTeX break; I wouldn’t know that they did so until I eliminated formatting errors from the rest of the assignment. Conversely, I wouldn’t know the narrow relevant factors for a failing property. That is, if I wanted to know all the places in which I had used undefined terms, I would have to read the assignment from top to bottom.

Positive exemplar - when I removed the old stuff, I could tell what properties my assignment now satisfied (such as “look well-formatted”) and what properties it didn’t (like “ask two theoretical questions about decision trees”). When I added two theoretical questions about decision trees, I could tell that it now satisfied one extra property (“ask two theoretical questions about decision trees” - check). I could tell that it maintained the old properties (such as “look well-formatted”) since they used to pass and they still passed after the change. When I added a question about “post-pruning” without defining it, I could tell that it failed the property of “define every term you use”. And since that property used to pass, I could tell that my latest change was the culprit. I didn’t need to read the entire assignment to search for undefined terms. I just had to look at my latest change.

Test: 2017 - Current notes in this essay - pretty unstructured - hard to tell how much I’ve improved by adding one more idea – fail to fail.

To Fix a Mess: Change High-level Design Separately and Migrate One at a Time

Hypothesis: When you check whether your project matches some high-level design, you have one source of feedback. Also, changing the high-level design involves experimenting with the design till it does what you want (i.e., till it passes the test you have in mind) and modifying your project till it matches the high-level design.

You have two options: (a) changing the design within your existing project so that it passes the test saying that the high-level design looks good and the test that the code matches the high-level design and (b) changing the high-level design on a toy project till it looks good and then changing your code till it matches the high-level design.

How long will each option take and what happens when you make mistakes (i.e., fail the test you’re trying to pass)?

If you make no mistakes in updating the code and make no mistakes in getting the high-level design to look good (and thus change the high-level design only once), both options take the same time.

If you make no mistakes in updating the code and make a mistake in getting the high-level design to look good (and thus change the high-level design many times), option (a) is a relevant factor for taking a longer time because you will have to change potentially all the code every time you change the high-level design whereas option (b) is a relevant factor for taking less time because you will have to change only the toy project.

If you make a mistake in updating the code (i.e., fail the test of matching the high-level design), option (a) is a relevant factor for taking longer because you won’t know which parts failed. You will have to change multiple things till they pass the test, like cracking a safe. (This can happen, for instance, if you’re writing a compiler that generates assembly code and the only feedback you get is whether the program worked or crashed, without specific information about which part caused the crash.) Option (b) is a relevant factor for taking less time because you will know that any error is partly due to the latest change, so you will have to massage only that part (and perhaps some of the earlier code) till the test passes. You won’t have to blindly change multiple things.

Finally, if you make a mistake in both the high-level design and the code migration, option (a) will take a lot longer because the mistakes in the code migration will prevent you from checking if the high-level design works.

Note that type-checkers change the game by giving you multiple sources of precise feedback. You don’t just get told that your code doesn’t type-check; you get told exactly which parts don’t type-check and why. Imagine having a compiler that told you “Compilation error(s) found” with nothing else. That’s the difference between having just one source of feedback and having multiple sources of feedback.

Unit tests do the same thing by giving you multiple sources of feedback. It’s the difference between “your entire project failed a test” vs “your function foo failed a test”. In the former case, the bug could be anywhere. In the latter, you know it’s only in foo (at least with respect to the tests you’ve written).


Observation: Want to format a long LaTeX file to match a given research paper template. All of a sudden, I have to deal with a lot of changes, like diagrams and tables and headings, all at once.

If I’d started with an empty research paper template, then I would have had to deal with each problem one at a time.

Test: Cleaned causal hypothesis essay - started from scratch - easy to tell how things fit together – pass to pass.

Test: Causal hypothesis essay - a giant dump of ideas - hard to tell how things fit together – fail to fail.

Maintain a Short Cycle Time

Why Cycle Time Matters

Hypothesis: Cycle time = time needed to change the code, set the desired inputs, get the outputs, and understand the outputs of a mechanism.

TODO: Show that total time taken is a product of the cycle time and number of iterations. Show that a working product with one change at a time leads to the least number of iterations because of faster debugging.

Test your Mechanism in Isolation

TODO: Refactor the two hypotheses below.

Corollary: Unit test that tests your new code directly -> minimum work to set the inputs, minimum time to get the outputs, minimum work to understand the outputs. However, if the output is not first-class, it might take more work to get the direct output.

Integration tests -> extra work to set the desired inputs, because you have to change the rest of the program and concoct overall inputs that will deliver your desired input to the module; a long time to get the outputs, since other code may run before the final output; and harder-to-understand outputs, because they could be caused by any part of the program.

Hypothesis: Got the type signature for a mechanism (using your integration test), write and test it in isolation -> less work to get to a correct implementation, can exhaustively test all the relevant input configurations -> quick, correct.

Got the type signature for a mechanism (using your integration test), write and test it as part of your overall program -> more work to get to a correct implementation, costly to exhaustively test all the relevant input configurations -> slower, may have errors.

(Note also that isolated mechanisms have smaller interfaces than in-place mechanisms. You’re forced to minimize what you will accept as explicit arguments.)
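As a sketch in Python (the scheduling mechanism and all names here are invented for illustration): once a mechanism has a first-class interface, you can test it exhaustively on its own, without spawning processes or running the surrounding program.

```python
# Hypothetical mechanism extracted from a larger scheduler:
# pick the pid of the highest-priority ready process.
def pick_next(ready):
    """ready: list of (pid, priority) tuples. Returns the pid with the
    highest priority, breaking ties by earliest position (FIFO)."""
    best_pid, best_prio = ready[0]
    for pid, prio in ready[1:]:
        if prio > best_prio:
            best_pid, best_prio = pid, prio
    return best_pid

# Isolated tests: no processes to set up, no waiting for output.
assert pick_next([(1, 5)]) == 1
assert pick_next([(1, 5), (2, 9)]) == 2
assert pick_next([(1, 9), (2, 9)]) == 1  # FIFO tie-break
```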


Test: My friend kept changing code within his (already large) functions. I had to keep telling him to extract the code and test it separately. (For example, plotting figures or using an array of arrays or getting all combinations of some configurations.) – changing code was easy enough, even though the interpreter would have been much faster; getting the output took longer because he had to train the classifiers first; didn’t know if the error was because of the code or something else.

Test: I ended up with a script that was costly to test when I was extracting features from a dataset - I couldn’t test the matrix-saving code until I ran the 15-minute script. And if it failed for some trivial reason after 15 minutes, like it did, I was fucked. – pretty easy to change the inputs; took ages to get the output; not too hard to understand the output.

Test: OS HW3 - inverted page table – had to do a lot of work to get its inputs (in fact, I didn’t even separate its interface; I just used it inline wherever I wanted); had to wait for the whole program to run before I could see the output; and had to parse the whole table output to see what was going on.

Test: OS HW3 - FIFO order – didn’t write any unit tests. Had to set inputs only via the tests I ran; had to wait for the output; couldn’t understand the output directly because I wasn’t using first-class reference strings. 1 hour 50 minutes right there.

Test: rectangle numbering in Emacs - wanted it to start from 1 instead of 0, but kept putting it off because it felt like it wouldn’t be worth it to dive into some foreign code and change what I wanted. Just realized that it’s my code that starts from 0. Changed it in a jiffy. – somebody else’s code - have to do a lot of work to set the input (in case configuration variables don’t exist). my code - can just dive in and change whatever I want.

Test: When trying to figure out how to send in a function pointer to my benchmark function, I decided to test it in a stand-alone program first. Fewer changes. In contrast, lots of stuff could go wrong in the benchmark code. – Faster output.

Test: Negative exemplar - wrote a long function when I could have easily extracted hasFoo(post) into its own function. – didn’t implement it in isolation.

Test: In my initial test for the block-get instruction, I had to wrap it in a let-block so that I could return the value I read from the array.

(Let("arr", IntType, ArrayDec(Lit(5), IntType),
  Prim("block-get", List(Ref("arr"), Lit(2)))), 2, compilerWithVals())
  -> List(
    s"movq $$5, ${defaultCompiler.avRegs(2)}",
    s"movq ${defaultCompiler.avRegs(2)}, ${defaultCompiler.avRegs(3)}",
    s"movq heap(%rip), ${defaultCompiler.avRegs(2)}",
    s"imul $$8, ${defaultCompiler.avRegs(3)}",
    s"addq ${defaultCompiler.avRegs(3)}, ${defaultCompiler.avRegs(2)}",
    s"movq ${defaultCompiler.avRegs(2)}, heap(%rip)",
    s"movq ${defaultCompiler.avRegs(2)}, ${defaultCompiler.avRegs(3)}",
    s"movq $$2, ${defaultCompiler.avRegs(4)}",
    s"movq (${defaultCompiler.avRegs(3)}, ${defaultCompiler.avRegs(4)}, 8), ${defaultCompiler.avRegs(3)}",
    s"movq ${defaultCompiler.avRegs(3)}, ${defaultCompiler.avRegs(2)}",
  ),

But that would be if I could run only an end-to-end test. We can do better with unit tests:

(Prim("block-get", List(Ref("arr"), Lit(2))),
  2, compilerWithVals(List(("arr", Left(1)))))
  -> List(
    s"movq ${defaultCompiler.avRegs(1)}, ${defaultCompiler.avRegs(2)}",
    s"movq $$2, ${defaultCompiler.avRegs(3)}",
    s"movq (${defaultCompiler.avRegs(2)}, ${defaultCompiler.avRegs(3)}, 8), ${defaultCompiler.avRegs(2)}",
  ),

Now, I can test exactly the assembly output of this one little instruction with 3 lines instead of 10.

How to Set Inputs and Test Outputs Quickly: Use First-Class Values

Hypothesis: First-class input -> can set the input without doing any other work.

No first-class input -> have to set the input by changing a lot of other variables.

Hypothesis: First-class output -> can test the output without doing much work.

No first-class output -> have to test in a round-about manner.

Corollary: Impure function (it can do I/O) -> input and output are obviously not first-class.

Corollary: Extract code into a function -> forced to make your input and output first-class.


Test: OS HW1 scheduler – couldn’t test it with a particular configuration of ready processes with different priorities. Had to change the whole system just to get that configuration right.

Test: [2019-04-06 Sat] Interpreter - I could evaluate a particular AST fragment with a particular environment. Before that, I needed to write a complete program to indirectly produce that environment at that point in the AST.

Test: pipputc – couldn’t really test with first-class inputs. Had to set up processes. Work.

Test: batsmen average – could just test using the 3-over mini-game. Little work. (Though I couldn’t really test it with arbitrary inputs. Had to work with what I was given.)

Test: OS HW1 scheduler – no first-class output of (next process, next ready list) that I could test directly. Had to run all the processes for several seconds and observe their steady-state CPU usage ratio. schedule_next_process needed to deal with an interface if I wanted to catch its output behaviour, since it deferred sometimes or evicted and stuff. Plus, you had to deal with interrupts. But perhaps that could still have been a first-class output denoting the action to take.

Test: pipgetc – could test the output of the pipe without even using another process.

Test: org drill - get session data - it inserts stuff into a buffer instead of returning a string (no first-class output). It also uses (current-time) - not a first-class input. – have to run the whole drill session to set the input; have to look at the log file to check the output. An alternative would be to break it up before and after the call to (current-time) so that you could test those parts separately (the functional way to do that would be to take in the timer function as an argument). For the output, make it take the output buffer as an argument too.

Test: Python - had a for-loop in main to try different things with print statements. Extracted it into a function - had to think about building and returning a string instead of just printing to standard output. – first-class outputs.
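The refactoring described in the last test can be sketched like this (the function names are hypothetical): building and returning a string makes the output first-class and directly testable, and printing stays at the top level.

```python
# Before: output goes straight to stdout, so it isn't first-class
# and can only be checked by capturing the terminal.
def report_squares_print(ns):
    for n in ns:
        print(n * n)

# After: build and return a string, making the output first-class.
def report_squares(ns):
    return "\n".join(str(n * n) for n in ns)

assert report_squares([1, 2, 3]) == "1\n4\n9"
print(report_squares([1, 2, 3]))  # printing moves to the caller
```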

Multiple Manual Steps vs One Step

Hypothesis: Running multiple steps manually is a relevant factor for having to enforce constraints manually across steps, which probably leads to silly mistakes (like not enforcing the constraint in some steps) and to having to try multiple combinations when fixing a bug, because you have only one source of feedback.

Running a single step that encapsulates multiple steps is a relevant factor for not having to enforce constraints manually across steps and having fewer combinations of inputs.


Test: Joel Spolsky: – get feedback less frequently and find it harder to code and make more silly errors.

If the process takes any more than one step, it is prone to errors. And when you get closer to shipping, you want to have a very fast cycle of fixing the “last” bug, making the final EXEs, etc. If it takes 20 steps to compile the code, run the installation builder, etc., you’re going to go crazy and you’re going to make silly mistakes.

One example of making silly mistakes when there are multiple steps instead of one step: I wanted to get the importance of different features in my Naive Bayes classifier. For that, I needed to load the label encoder and tokenizer I had saved earlier so that I could convert the label numbers to the label names and convert the feature numbers to their names. I also had to load the classifier I had saved. But I had different versions of the encoder and tokenizer and classifier corresponding to different parameters and datasets. I mixed up the three from different versions and got weird results. I had to go back and try a few variations before I got sensible results. If I had had a single function that took as argument the version name, I could have ensured that the encoder, tokenizer, and classifier were of the same version. Also, the interpreter had a lot of variables in context, so I could have auto-completed the wrong variable and got a perplexing error.

In short, I was manually enforcing the constraint that all three should be loaded using the same version name. It’s only a matter of time before a human slips up and inadvertently violates the constraint; not so for the computer.

It also drove me somewhat crazy to see the nonsensical results. I didn’t know if my classifier was wrong or if I had muddled up the analysis somehow. Basically, I wasn’t able to isolate the source of the error. If I had a function for calculating feature importance and could check that it worked on a sample input, I would know that any error was due to my classifier.

Note how I had to try several variations. I had multiple suspects for the bug because I was changing more than one factor at a time (thanks to the multiple manual steps), and had to go through many of their configurations to hit a positive exemplar. If it were one function, then I would have to go through the configurations of its inputs, but not all the configurations of the code within.

It’s the difference between a function:

def get_feature_importance_for_version(model_version):
    label_encoder = load_label_encoder(model_version)
    tokenizer = load_tokenizer(model_version)
    model = load_model(model_version)
    return get_feature_importance(model, tokenizer, label_encoder)

get_feature_importance_for_version('10k-rows')

and a manual session in the Python REPL:

label_encoder = load_label_encoder('label-encoder-10k-rows.joblib')
# ...
tokenizer = load_tokenizer('tokenizer-100k-rows.joblib')
# ...
model = load_model('model-10k-rows.joblib')
# ...
get_feature_importance(model, tokenizer, label_encoder)

Notice how I mistakenly passed ‘…100k-rows.joblib’ when loading the tokenizer instead of ‘…10k-rows.joblib’.

The only possibilities with the function were the different versions: 10k-rows, 100k-rows, 1M-rows, and so on. But there were a combinatorial number of possibilities with the manual session: I could have given any triplet of versions when loading the label encoder, tokenizer, and model. Granted, I wouldn’t give wildly different versions to them, but if I wanted to give (A, A, A) but gave (A, B, A), then if I tried to fix the last call alone, I might make it (A, B, B), which would still fail. Then, I might change the second version to A, making it (A, A, B), and so on. It’s not quite combinatorial in the number of possibilities I actually try, but still much more than the simple iteration through different versions (A, A, A) and (B, B, B).

My actual session had even more complicated version names. I didn’t know what was wrong, so I had to try loading another version of the label encoder. Didn’t work. Change the tokenizer version this time; didn’t work. What was wrong?! Tried a couple more variations before I finally got it. I got results alright, but they didn’t make any sense. In fact, it showed similar features as important for different labels, which was weird as hell. So, yeah, multiple manual steps can lead to fun debugging sessions.

Deployment Pipeline: Automatically get Output from Input

Hypothesis: A deployment pipeline means that you can get the overall output from the overall input with one button (or zero). You don’t have to manually produce the output from the input.

This is obvious for a program like square, where you can feed in 3 and get 9.

But it’s more subtle for a project where your output is a report. Any change you make to the project should create a new instance of the final report. For example, if you add more training data, that should obviously affect the report. If you add a new machine learning algorithm, that should too. In short, if you set up an automated pipeline, you can measure your progress simply through the final result (your report). Otherwise, you will have to judge a multi-factor project on things like the training data, the machine learning algorithms, and so on. It’s much harder than judging a single report.
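A sketch of what “one button” might look like for the report project (every function here is a hypothetical stand-in for real work): any change to the data or the steps flows automatically into a fresh final report.

```python
def load_data(path):
    # Stand-in for reading the training data from `path`.
    return [1, 2, 3]

def train_model(data):
    # Stand-in for training a model on the data.
    return {"mean": sum(data) / len(data)}

def build_report(model):
    # Stand-in for rendering the final report.
    return f"Report: mean = {model['mean']:.2f}"

def pipeline(path):
    """The one button: overall input in, finished report out."""
    return build_report(train_model(load_data(path)))

print(pipeline("data.csv"))  # -> Report: mean = 2.00
```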

Corollary: Daily build -> zero-button execution.


Test: Joel Spolsky - why developers always want faster hardware – faster cycle time.

A crucial observation here is that you have to run through the loop again and again to write a program, and so it follows that the faster the Edit-Compile-Test loop, the more productive you will be, down to a natural limit of instantaneous compiles. That’s the formal, computer-science-y reason that computer programmers want really fast hardware and compiler developers will do anything they can to get super-fast Edit-Compile-Test loops.

Test: Math proofs - long time to get the output. You have to be the human interpreter and run the proof line by line. Sometimes you may not know either way and may have to wait for an experienced mathematician to give you feedback. You can’t (without automatic theorem provers, which I don’t think most mathematicians use) get the output in an instant. In this regard, we programmers are fortunate. – more uncertain about your math proofs. That’s why they have more than one proof for the same theorem (according to Terence Tao).

Test: [2019-02-28 Thu] An extreme example of this is where you can toggle any part of your code and see the change in output instantaneously.

See Bret Victor’s talk Inventing on Principle. He shows how you can change the constant for the length of a branch in a program that draws a tree and see the branch lengthen immediately.

He can use a magnifying-glass feature to pinpoint which part of the output is caused by some particular input.

He can do the same thing in reverse to see which parts of the input caused some part of the output.

I can make these changes as quickly as I think of them.

Bret Victor, Inventing on Principle

FUCK MY ASS! He can pause execution and rewind and play after changing some settings. That’s what you’d want when, say, practicing some strategy in AOE - you’d want to pause it before some battle, change some settings, and see how well you can do.

And he can see the future effects of changes in code. That way he gets immediate feedback about stuff that would usually come after a cycle time of several minutes.

In fact, I bet I can fiddle with any part of this code and come up with an idea for a game.

You bet! That’s what a scientist does. He toggles variables to get novel configurations.

For binary search, he showed the domain-level concrete values corresponding to your abstract code. So, array.length - 1 showed up as 5 for his sample case. This sort of short cycle time is what we try to get at using unit tests, though without such interactive support. We can also get it, at considerable cost, by sprinkling print statements throughout our code.

Hypothesis: He’s doing away with all logical inference by eliminating the cycle time for all the things you want to see, whether it be the current state of the tree you’re drawing or the future path of your Mario character. This is what empiricism would like to be. However, we have to note that it comes at a cost. You can’t view the simulation of your career in your prospective field for forty years before you decide whether or not to accept a job offer. There are simply too many variables.

He’s bridging the cycle time between your source code in one language and your output in another language (maybe the language of pixels or numbers).

Perhaps someday this will change. Perhaps IDE makers will focus on dynamic exploration instead of static analysis, rich visualization instead of line debugging. Perhaps language theorists will stop messing around with arrows and dependent types, and start inventing languages suitable for interactive development and discovery.

That’s empiricism vs inference right there.

Test: WYSIWYG vs modal editing - zero cycle time.

Test:

A designer needs direct, interactive control over the independent variables of the system. We must not be slaves to real time.

We step up to an abstraction by devising a representation that depicts the system across all values of a parameter. This gives us a broad view of the system’s behavior, and allows us to discover high-level patterns.

Your program is a point in configuration space. You want to look at the value of some parameter across all the configurations. For example, he showed the time taken to reach the end of the road for different values of road curvature and car turn angle. What was more impressive was that you could drop back down to the other ways to visualize that configuration by clicking on a particular point in configuration space.

Test: [2019-03-10 Sun] Humble example: make submit instead of copying over files into directory and running turnin myself.

Automated Property Tests to Notice Failures Quickly and Surely

Hypothesis: If you have a “dashboard”, i.e., something that shows you which of your automatically-checked properties have failed, and some change breaks a property, you will always see a note about the failure quickly. (The same holds for a property-check passing.)

If you don’t check such properties regularly (let alone automatically) and some change breaks the property, you may not notice the failure at all. Note that such a property check is sufficient but not necessary to notice a failure, because you may have noticed the violation by looking at the code.

If you can’t check a property automatically and some change breaks the property, you will have to spend some time checking the property manually if you want to notice the failure.


Test: OS HW1 - tons of output data. – Can’t really observe variables I care about. Hard to parse the output. The feature I was implementing was scheduling according to the aging scheduling policy. I wanted to check if its output on 10 processes of different types matched what I expected. However, I couldn’t test that automatically. I had to sift through the logging output to see if the statements printed by each process were in the correct order.

Imagine if I had a function that would sift through the logging output for me and check the order of the statements. Then, I could speedily check if my implementation of the scheduling policy produced logging output that passed that function.

However, to notice whenever that property was violated, I would need to set the property check to run every time I changed the code. This is what they call continuous integration.
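The imagined log-sifting function might look like this (the log format and names are invented): it checks the property “process statements appear in the expected order” automatically instead of by eye.

```python
def statements_in_order(log_lines, expected_pids):
    """True iff the process ids that appear in log_lines,
    in order, match expected_pids."""
    observed = [int(line.split()[1]) for line in log_lines
                if line.startswith("proc")]
    return observed == expected_pids

log = [
    "proc 3 started",
    "boot message",      # unrelated noise gets ignored
    "proc 1 started",
    "proc 2 started",
]
assert statements_in_order(log, [3, 1, 2])
assert not statements_in_order(log, [1, 2, 3])
```

Run that check on every code change and you have the beginnings of continuous integration.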

Work in a Production-like Environment to get Properties about Production

Hypothesis: Working in a production-like environment is a relevant factor for learning about the properties of your program in production. Otherwise, you may not learn about them (or may learn the wrong things).


Test: When I wrote a small Emacs mode (org-autoclock) for another user, the question I wanted to answer was: will this work correctly on the user’s computer? Got misleading feedback about that when I ran it in my Emacs and it worked. I acted like this meant it would work on the user’s computer. Got the true feedback that it doesn’t actually work when the user tried it out on their computer. I could have gotten the same feedback by running it in a blank Emacs instance (using emacs -Q).

Test: Research report - handwritten notes vs LaTeX notes. – handwritten notes - won’t get feedback from others; may not even get harsh feedback from myself because I don’t look at it from a reader’s point of view. It seemed like it would look good in a report format, but I wasn’t sure. The only way to know for sure if it would make a good report is to create the report and see.

Change High-level Code in Isolation: Less Code to Move Around

Hypothesis: If some feedback mechanism depends only on some property of the output, then the detailed output of a program for some input is conditionally independent of the feedback, given that property.

Hypothesis: If some feedback mechanism depends only on some property of the output, then the parts of the program that produce the rest of the detailed output are conditionally independent of the feedback, given the parts that produce that property.

Hypothesis: If some design feedback mechanism depends only on the high-level code, then the lower-level code of a program is conditionally independent of the feedback, given the high-level code.

Hypothesis: If you represent the output for an input using a “spec”, you will take a short time. If you represent it using a program, you will take longer because each term of a spec corresponds to multiple terms of a program.

Hypothesis: If you change the output for an input in a “spec”, you will take a short time. If you change it in a program, you will take longer. Not exactly sure why.

Hypothesis: A design choice that applies to the high-level descriptions in a spec may not apply to the actual implementation of the spec. But a design choice that applies to a program - i.e., a refactoring that seems to apply - will probably lead to a valid program.

Hypothesis: If you get feedback about a high-level design using a spec or a skeleton of your code, you will arrive at the best design faster than if you get feedback using a full implementation because, in the latter scenario, you have to modify the rest of the code each time you change the design even though it is conditionally independent of the design. (Don’t have a concrete example.)

Corollary: You can still get those benefits after you’ve begun by experimenting in a new blank project and moving in your old changes.


Test: Imagine you’re writing a program to produce a supermarket checkout bill. The initial test might be to see if every item in the cart is included and whether the total is accurate. Now, your program will pass the test whether or not you print the supermarket’s name or the current date on the bill. So, those parts are conditionally independent given the itemized total.

Similarly, the parts of your program that produce the supermarket’s name or the current date are conditionally independent of the test given the parts that produce the itemized total, because the program will pass the test whether or not they’re present.
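A sketch of the checkout example (all names invented): the check below looks only at the total line, so it passes whether or not the bill includes the supermarket’s name. The header is conditionally independent given the itemized total.

```python
def make_bill(cart, include_header=False):
    # cart: list of (item, price) pairs.
    lines = ["SuperMart"] if include_header else []
    lines += [f"{item}: {price}" for item, price in cart]
    lines.append(f"TOTAL: {sum(p for _, p in cart)}")
    return "\n".join(lines)

def bill_total_ok(bill, cart):
    """The feedback mechanism: it inspects only the TOTAL line."""
    total_lines = [line for line in bill.splitlines()
                   if line.startswith("TOTAL")]
    return total_lines == [f"TOTAL: {sum(p for _, p in cart)}"]

cart = [("milk", 3), ("bread", 2)]
# Passes with or without the header: the header is invisible to the test.
assert bill_total_ok(make_bill(cart), cart)
assert bill_total_ok(make_bill(cart, include_header=True), cart)
```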

Test: Imagine you’re writing a class hierarchy for visualization. Your feedback mechanism for the design might be whether the subclass relations make sense, e.g., whether ImageData is a subclass of RectilinearData. However, you would get the same feedback regardless of your implementation of the methods in those classes.

Test: Joel Spolsky:

Now, Mr. Rogers over at Well-Tempered Software Company (colloquially, “WellTemperSoft”) is one of those nerdy organized types who refuses to write code until he’s got a spec. He spends about 20 minutes designing the backwards compatibility feature the same way Speedy did, and comes up with a spec that basically says:

When opening a file created with an older version of the product, the file is converted to the new format.

The spec is shown to the customer, who says “wait a minute! We don’t want to switch everyone at once!” So Mr. Rogers thinks some more, and amends the spec to say:

When opening a file created with an older version of the product, the file is converted to the new format in memory. When saving this file, the user is given the option to convert it back.

Another 20 minutes have elapsed.

Mr. Rogers’ boss, an object nut, looks at this and thinks something might be amiss. He suggests a different architecture.

The code will be factored to use two interfaces: V1 and V2. V1 contains all the version one features, and V2, which inherits from V1, adds all the new features. Now V1::Save can handle the backwards compatibility while V2::Save can be used to save all the new stuff. If you’ve opened a V1 file and try to use V2 functionality, the program can warn you right away, and you will have to either convert the file or give up the new functionality.

20 more minutes.

Mr. Rogers is grumpy. This refactoring will take 3 weeks, instead of the 2 weeks he originally estimated! But it does solve all the customer problems, in an elegant way, so he goes off and does it.

The moral of the story is that when you design your product in a human language, it only takes a few minutes to try thinking about several possibilities, revising, and improving your design. Nobody feels bad when they delete a paragraph in a word processor. But when you design your product in a programming language, it takes weeks to do iterative designs. What’s worse, a programmer who’s just spent 2 weeks writing some code is going to be quite attached to that code, no matter how wrong it is.

When Mr. Rogers tried out the design in a spec, he could get feedback from his customer and his boss and change it twice within an hour. He estimated that implementing the final design would take three weeks. If he had implemented the program the first way, he might have taken two weeks, after which changing it according to the customer’s needs would have taken another week, and another week after that to accommodate his boss’ design changes. So, he saved himself at least a week by getting feedback about the design using a spec and not a program.

What could have accounted for the time savings? Firstly, he could get feedback after 20 minutes, not 2 weeks. The more such changes he makes in the spec, the more time he saves on implementing fully and getting feedback. Secondly, he had to change one line in his spec, not a bunch of lines in his program. An hour of spec revisions + 3 weeks for the right version < 2 weeks for the first version + 2 weeks of changes to reach the right version.

Note that you may not know the true costs or benefits of a design by writing a spec. For example, I wrote a spec for a programming assignment, with the names of the functions I was going to write and how they would work together. But when I actually started implementing the program, I found that most of them were already implemented better as library functions. I never looked back once at the spec. It was completely out of touch with my actual implementation.

Similarly, Mr. Rogers’ boss suggested having V2 subclass V1. Maybe the Save method is not the only one that differs, which means subclassing might be the wrong move. But you’re unlikely to know the exact set of methods that differ unless you implement both classes.

Test: I wanted to try different formats for a file that was going to be read by another program. Easy to do given that all I had in the file was one line. If I wanted to do it after I had a long file, I’d have to keep modifying the data in the file according to the new format.

Test: Rearrange sections in an essay - I find it easier to write just the top-level headings and reorder them and merge some of them based on how they look. If I tried to do that while moving the contents along with the headings as they got reordered and merged, I would have to do a lot more work.

Test: This is probably what writers call outlining.

Test: This is probably what filmmakers call story-boarding.

Spec vs Prototype

Test: PG on spec vs prototype:

  1. Start small. A program gets easier to hold in your head as you become familiar with it. You can start to treat parts as black boxes once you feel confident you’ve fully explored them. But when you first start working on a project, you’re forced to see everything. If you start with too big a problem, you may never quite be able to encompass it. So if you need to write a big, complex program, the best way to begin may not be to write a spec for it, but to write a prototype that solves a subset of the problem. Whatever the advantages of planning, they’re often outweighed by the advantages of being able to keep a program in your head.

What and how should not be kept too separate. You’re asking for trouble if you try to decide what to do without understanding how to do it. But hacking can certainly be more than just deciding how to implement some spec. At its best, it’s creating the spec– though it turns out the best way to do that is to implement it.

Explanation: Write a prototype to pass one integration test (such as evaluating some small Lisp program) and then abstract your code into well-separated modules. – You can test different high-level design ideas using your integration test without much trouble (because you don’t have too many branches).

CORAL: All business plans for startups turn out to be wrong, but you still need them - and not just as works of fiction. They represent the written form of your current beliefs about your key assumptions. Writing down your business plan checks whether your current beliefs can possibly be coherent, and suggests which critical beliefs to test first, and which results should set off alarms, and when you are falling behind key survival thresholds. The idea isn’t that you stick to the business plan; it’s that having a business plan (a) checks that it seems possible to succeed in any way whatsoever, and (b) tells you when one of your beliefs is being falsified so you can explicitly change the plan and adapt. Having a written plan that you intend to rapidly revise in the face of new information is one thing. NOT HAVING A PLAN is another.

Test: PG on the Viaweb editor:

The Viaweb editor must be one of the most extreme cases of incremental development. It began with a 120-line program for generating Web sites that I had used in an example in a book that I finished just before we started Viaweb. The Viaweb editor, which eventually grew to be about 25,000 lines of code, grew incrementally from this program. I never once sat down and rewrote the whole thing. I don’t think I was ever more than a day or two without running code. The whole development process was one long series of gradual changes.

Explanation: His initial integration test was passed by a 120-line program. He then added more integration tests and improved the design of the program as needed. – He figured out exactly which pieces he needed - the high-level design - before he implemented them in detail. For example, he needed macros to wrap HTML code in different tags, etc.

Test: Haskell programmers use type signatures to do this kind of thing. Sudoku - get a matrix filled with possibilities for each cell, give each cell the possibilities of its row, column, and box neighbours, remove certain possibilities, repeat till you get either a solved matrix or a matrix that doesn’t improve anymore. Each of these steps is doable. – you’ve got the high-level design for passing the test of “solving a Sudoku grid”.

Test: PG web page generator - if it initially created just one set of web pages, he may want to have code that generates new pages on demand, based on clicks. That would be a new integration test that he may not have thought of beforehand. – empirical usage gives you new ideas.

Add One Property at a Time

Can’t Get Feedback until you Implement something Runnable

Hypothesis: Implementing features so that your feedback mechanism can run on them is a relevant factor for getting feedback from it.


Test: [2019-02-01 Fri] [Making Your First Game: Minimum Viable Product](https://www.youtube.com/watch?v=UvCri1tqIxQ) - You can’t find out how engaging your game elements, such as jumping on tortoises or saving a princess, will be until you actually implement them and put them in front of a user (even yourself). For example, even if you implement a feature but don’t integrate it into the game, you won’t be able to see how much the user likes it. And if it’s just a bunch of words on paper, the user’s reaction to it may not match his reaction to the eventual game.

Additional Constraints delay Feedback

Hypothesis: The number of separate features you implement before you get feedback is a relevant factor for the delay until you get any feedback.

Hypothesis: The number of constraints you satisfy for a feature before you get feedback is a relevant factor for the delay until you get any feedback.


Test: [2019-02-01 Fri] [Making Your First Game: Minimum Viable Product](https://www.youtube.com/watch?v=UvCri1tqIxQ) - pare features like saving a princess or going down pipes from a full version of your Mario-like game so that you can immediately get user feedback about just running and jumping. Otherwise, you would have to wait until you finished the other features. Similarly, bells and whistles like game music and crisp graphics and fast rendering will delay your implementation of a minimum viable product. Even enhancements like double-jumping or jumping with a running start will delay feedback.

Test: Try to write a perfect schema the first time around - take really long – lots of constraints with high standards for each.

The Most-Used Feature is Probably the Most Impactful

Hypothesis: A change in the value of the most-used feature will probably change the value of your product a lot. A change in the value of a less-used feature will probably change it less.

So, implementing the feature that you think will be most-used will probably give you the maximum value for your time.


Test: [2019-02-01 Fri] [Making Your First Game: Minimum Viable Product](https://www.youtube.com/watch?v=UvCri1tqIxQ) - One way to do this is to keep only those features that will always be used, whether in your minimum version, an intermediate version, or a full-fledged game. For example, Mario will always run and jump, but he won’t always have a princess to rescue or even pipes to crawl down. So, have only running and jumping in your MVP.

Why? Because the player is going to run and jump a lot, and if that’s not fun in itself, they might not have too much fun playing the game. A poor running experience - say, where you have to press the key multiple times to keep moving instead of long-pressing it - will drag down the rest of the game. However, a slow or buggy pipe-crawling experience won’t hurt the game as much. (Note that we measure the value of a feature by seeing its effect in isolation when used only once or a couple of times.)

Adding Constraints makes it Less Likely to Happen

Hypothesis: Adding a constraint to a task makes it less likely that you will complete the task, i.e., satisfy all of its constraints. Note that adding a constraint means specifying that some aspect of the performance be changed to a level that would make it harder even in isolation.

Conversely, removing a constraint will make it more likely to happen.

Hypothesis: Raising the bar on a constraint for a task makes it less likely that you will complete the task.

I can predict that the higher the bar, the less likely you will be to complete the task, but I can’t predict how much less. It depends on the other constraints. Wow. That suggests that we don’t usually compute from scratch how long something will take. We usually judge it based on how long similar things have taken in the past.


Test: [2019-02-01 Fri] Take out the trash - easy enough. Take out the trash and get back into the house within 2 minutes - pretty hard. Note that adding the constraint “get back in within 2 years” does not increase the time taken. And the constraint “take it to the nearer trash bin” actually decreases the time taken. We have to look at the fact that just going out to the curb and getting back within 2 minutes is less frequently done (i.e., is harder) than doing it in 5 minutes. And doing it that quickly while lugging trash is even less frequently done.

Similarly, decreasing the time allotted from 2 minutes to 1 minute makes it even less likely to happen.

Test: Cooking right now - wanted it to taste great, be nutritious, finish as quickly as possible, and leave minimal trash. – too many constraints. Less likely to do it at all.

Test: Friends - don’t have as many constraints on what you can say or do together – More likely to do and say things that you won’t say in front of others.

Test: Learn about different types of people who read your blog – less likely to write (h/t Stevey).

Possible Inputs from Callers decide the Inputs you have to Handle

Hypothesis: The set of possible inputs from callers decides the inputs you have to handle to avoid errors.


Test: [2019-01-18 Fri] I don’t have to worry about the getNum function (for parsing a number from a string) being given “-3” because parseUAtom already extracts the “-” part.

If a negative number never reaches getNum, it would be a waste of time to handle that case or test for it. However, if there’s even a small chance that it happens, the entire program might crash unless you handle it. So, whether or not you test that input in getNum depends on who calls getNum and with what inputs. And since more cases means more work, it’s in your best interest to narrow it down as accurately as possible.
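
The division of labour can be sketched as follows. This is a hypothetical Python rendering (only the names `getNum` and `parseUAtom` come from the text; the bodies are mine): the caller strips the sign, so the inner function never has to handle it.

```python
# parse_uatom handles the '-', so get_num only ever sees digit strings
# and need not handle "-3" itself. (Hypothetical sketch; the original
# functions are part of a larger parser.)

def get_num(s: str) -> int:
    # Contract: s is a non-empty string of digits. Negative numbers
    # never reach this function, so we don't test for them here.
    assert s.isdigit(), "caller must strip the sign first"
    return int(s)

def parse_uatom(s: str) -> int:
    # The caller-facing entry point: it extracts the '-' part.
    if s.startswith("-"):
        return -get_num(s[1:])
    return get_num(s)
```

The set of inputs `get_num` must handle is exactly the set its caller can pass it, no more.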

Test: Hard to change the behaviour of pipe_set without knowing how and where it’s been used. I don’t want to break its “contract”. If I assume that it will always be in the free state but it turns out to be in a connected state on one occasion, I might break something.

Number of Spec Features per Test determines your Speed

Hypothesis: Given a spec, the number of features of the spec exercised in one test is a relevant factor for the number of configurations you will walk through before you pass the test, and is thus a relevant factor for the time taken to pass the test.

You’re actually implementing an implicit compiler to translate the spec to an implementation of it.


Test: A spec doesn’t give you feedback about your translation of the high-level terms to the programming language. For example, take the spec for parsing an arithmetic expression:

<exp> ::= <num>
| <exp> + <exp>
| <exp> - <exp>
| <exp> * <exp>
| <exp> / <exp>

Precedence for (+) and (-): 1; for (*) and (/): 2.
All the operators are left-associative.

The two implementations:

def parseExpressionA: Exp = parseExpressionA(0)
def parseExpressionA(min: Int): Exp = {
  var res = parseNum
  while (in.hasNext(...)) {
    val op = getOperator
    ...
  }
  res
}

and

def parseExpressionB: Exp = parseExpressionB(0)
def parseExpressionB(min: Int): Exp = {
  val exp1 = parseExpressionB
  val op = getOperator
  val exp2 = parseExpressionB
  val res = ...
  res
}

both match the given spec or grammar. How do I distinguish the two? I need another source of feedback, namely a test. Even on a simple input 2 + 3, parseExpressionB would go into an infinite loop, whereas parseExpressionA would handle it fine. Note that somebody else has to give you the expected output for the test. You don’t know it a priori.

I suspect that we use the test to decide how to translate the high-level spec into code.

Oh. Is it that the granularity of the test determines how quickly you can code up the spec?

Imagine that you had a test saying that "1-2*3+4/5*6" parses to Plus(Minus(Lit(1), Times(Lit(2), Lit(3))), Times(Div(Lit(4), Lit(5)), Lit(6))). It tests Plus, Minus, Times, Div, associativity, and precedence. If the test passes, you will have high confidence that your implementation of all those features was correct. But if the test fails, you will have a hard time finding out where you made a mistake.

Contrast that to "1+2" -> Plus(Lit(1), Lit(2)) and "1+2+3" -> Plus(Plus(Lit(1), Lit(2)), Lit(3)). The first one tests for a basic Plus expression and the second one tests for simple associativity. So, if the first one fails, you know that you messed up in simple Plus parsing. If the second one fails, you know you messed up in your implementation of associativity. You no longer have to search. So, you can use these narrow tests to learn how to translate each part of the spec into code, in isolation.
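
These narrow tests can be run against a working implementation. Below is a hedged Python transcription (the document's examples are in Scala) of a precedence-climbing parser for the grammar above; the tagged-tuple AST standing in for Plus, Minus, Times, Div, and Lit, and the regex tokenizer, are my own choices, not from the text.

```python
import re

# Precedence from the spec: 1 for + and -, 2 for * and /.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
NODE = {"+": "Plus", "-": "Minus", "*": "Times", "/": "Div"}

def tokenize(s):
    return re.findall(r"\d+|[+\-*/]", s)

def parse(s):
    tokens = tokenize(s)
    pos = 0

    def parse_exp(min_prec):
        nonlocal pos
        res = ("Lit", int(tokens[pos]))
        pos += 1
        # Consume operators at or above min_prec. Recursing with
        # PREC[op] + 1 on the right keeps equal-precedence chains
        # left-associative, as the spec requires.
        while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            rhs = parse_exp(PREC[op] + 1)
            res = (NODE[op], res, rhs)
        return res

    return parse_exp(1)
```

With this in hand, the narrow tests ("1+2", "1+2+3") localize a failure to simple Plus parsing or to associativity, while the broad test ("1-2*3+4/5*6") exercises everything at once.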

The tradeoff between the two test styles depends, of course, on how often you make mistakes and on how long it takes to write the narrow tests.

Notice how you’re imitating a compiler while implementing the spec. The final program can be seen as a translation of a program written in the spec language to one written in Scala. Writing the spec in an executable language would make the compiler explicit. For example, the configuration file for Emacs is itself written in Emacs Lisp. It could have been a simple YAML file with fields and values, where you would have to write an interpreter to change the parameters of the programs based on the config values or write a compiler to translate the configuration into the language of the program. But writing the config in Emacs Lisp obviates the need for that interpreter or compiler.

What you’re uncertain about, then, is the implementation of that spec compiler. After all, if you had such a compiler, you could simply run it on the spec and hand away the translated program without doing any work. This means that the narrow feedback you need to get is about the different branches of that compiler.

Wow. That suggests that a “programmer” is simply a human compiler for specs.

Similarly, knowing how to program in a language probably means knowing how to translate a given spec to a program in that language. For example, when I saw the requirement “get all the phrases within square brackets”, I automatically wrote awk -F'[][]' '{for(i=1; i<=NF; i++) if (i % 2 == 0) print $i}' and got it right on the first try. High skill probably means that you can translate it from memory, without much experimentation.

Test: (TODO) When the problem “seems somewhat hard”, I don’t break it into pieces and test my implementation of each part, i.e., I don’t get intermediate feedback. I also don’t implement it all and check whether it works, i.e., I don’t get overall feedback. Basically, it’s a question of how big a change I’m trying to explain or simulate with my program. Clearly, if it’s a vast change, spread across multiple variables and multiple files, I probably won’t be able to simulate it all. If it’s a small change, such as squaring each element of a list, I can do it with one statement.

So, what I don’t do is break up the big change into smaller changes. Note that this changes the problem from having one overall source of feedback to multiple intermediate sources of feedback. Getting a positive exemplar for a multi-variable problem from scratch with just one source of feedback is always going to take time and lots of experimentation. Getting a positive exemplar for a problem with fewer variables will take less time and fewer iterations.

A simple example is when you need to implement a pipeline of compiler operations, such as scanning, parsing, type-checking, and compiling. You can come up with the intermediate tokens, AST, and types and work on each operation in isolation till it generates the right intermediate output. Contrast that to sending in an input program and changing all the operations in your compiler till they generate the right ultimate output of assembly code. The latter will take a lot more time and experimentation before you get it right.

How much more time, though? Well, the pain will be in debugging. If you didn’t make any errors at all, there would be no difference between the two approaches. But if you do make a mistake, you will have fewer suspects in the former than in the latter.
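
The pipeline idea can be sketched in miniature. The toy stages and their intermediate formats below are invented for illustration; the point is only that each stage has its own checkable output, so a failing stage test has exactly one suspect while a failing end-to-end test has three.

```python
# Toy three-stage pipeline: source -> tokens -> AST -> result.
# Each stage can be tested in isolation against its intermediate output.

def scan(src):
    # "1 + 2" -> ["1", "+", "2"]; whitespace is dropped.
    return [ch for ch in src if not ch.isspace()]

def parse_binop(tokens):
    # ["1", "+", "2"] -> ("+", 1, 2); a single binary expression only.
    return (tokens[1], int(tokens[0]), int(tokens[2]))

def evaluate(ast):
    op, a, b = ast
    return a + b if op == "+" else a - b

def compile_and_run(src):
    # The end-to-end chain: a bug here could live in any of the three
    # stages, which is why the isolated tests above are cheaper to debug.
    return evaluate(parse_binop(scan(src)))
```
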

That was an example of breaking up a long chain of actions into smaller chains. Another way is to decouple the parts of the output and get each part separately. For example, implementing a Mario game would involve controlling the movement and rendering the graphics for the same. You could do them separately by having a rectangle for Mario and later filling in the rectangle with an appropriate image. Or you could do them together, which might involve changing the graphics every time you change the way he moves or changing the movement when you change the graphics.

Similarly, when trying to “understand” an existing program that looks “pretty complicated”, I don’t break it into pieces and “understand” the data flow across each piece. In other words, I don’t get narrow feedback to help me induce exactly what each part does.

When do you Need to Test Code?

Hypothesis: If your response to a particular part of a spec has been tested in the past, you can use it without testing it now.

If your response hasn’t, you will probably have to test it in isolation now so that you save time in implementation and debugging.

Hypothesis: If you take on a problem that is too hard for you but refuse to admit it, then you won’t run enough tests (since they would be a waste of time for someone who actually was that skilled) and thus will take longer to debug your program. (TODO: rewrite this using the above hypothesis.)

If you take on a problem that is too easy for you but don’t realize it, then you will run too many tests and waste your time while others zoom past you.


Test: If someone asked me to “count the lines on which ‘foo’ didn’t occur in file bar.txt”, I would write grep -v foo bar.txt | wc -l. (Somebody else might write grep -vc foo bar.txt, which also works.) When I write that, it usually works. Why don’t I need a test to check if my implementation of the spec is correct? That part of my implicit spec-to-Bash compiler has been tested several times. If I had made any mistakes, I corrected them. So, the response that has survived is the one that passes the tests. That means it would be a waste of time to test that in isolation and not just go ahead with the larger program.

The same holds for “get a histogram of the lines”. I would write sort | uniq -c | sort -k1,1nr and be done. In fact, I created an alias for exactly this purpose: alias -g hgram="sort | uniq -c | sort -k1,1nr". This way I can translate the spec “get a histogram” directly to the command hgram.

Test: However, if someone asked me to “count the lines on which ‘foo’ occurred but ‘bar’ didn’t”, I would have some trouble writing it. That is, there would be an isolated test that my command wouldn’t pass. Since I don’t have a response that has been well-tested in the past, I would have to write an isolated test for this occasion, perhaps “foo barfoo” to see if the answer was 1 and not 2 or 3. If it failed, I would know exactly where the error was. Otherwise, if I accepted my default implementation of that spec and went ahead with the rest of my program, it might fail and I wouldn’t know where the error lay.

Once I got the answer, of course, I wouldn’t have to test it in isolation in the future for a similar spec. I could just use it as such.

Test: I have a decent solution for the frontend-backend connection idea. But I’m hesitating to actually go and implement it.

I don’t know the best plan for testing - should I do just the frontend part and check if the middle part gets the data (and do the same with the middle part to the backend)? Or should I build a complete thing and test whether the real object was created (so that I need to run fewer tests)? Basically, running tests for each part feels like I’m wasting time, like a skilled programmer wouldn’t waste time running that many tests. But not running intermediate tests feels like I’m bound to end up with errors (given that there are so many moving parts) and will find it hard to debug because I don’t have a close positive exemplar.

Huh. That’s a nice Catch-22. And it’s all because I’m not being realistic about which problems I can solve comfortably. If the problem is realistically too hard for me, then it’d be a big mistake to wait until the end to test because I wouldn’t be able to debug my inevitable mistakes. However, if the problem is too easy for me, then it’d be a waste of time to test each part of my program because it’d probably work right most of the time. In short, tests would be highly informative in the former case and uninformative in the latter.

And we know that time taken to implement something increases with the number of tests. (Do we know that?)

When do I not break up a big change into a smaller change? When do I break it up?

Basically, if you have translated that spec feature to code in the past, then you can go ahead and translate it this time without a new test. But if you haven’t translated it before, you should probably test it first.

Change without Narrow Feedback needs Lots of Reverse-Engineering

Question: How many isolated sources of feedback will you need when you’re “perfectionistic” and how many when you accept any combination that works as a whole?

Hypothesis: Changing the spec of a working implementation without narrow sources of feedback - past code for a similar feature or a unit test that checks exactly that first-class data manipulation - means that you will have to reverse-engineer the mapping from spec feature to code from different projects (for which you may not have the specs and which may intertwine many features in the same code) or experiment with the first-class data till you pass the test or just be stuck.

Also, if you get a working product and try to update the spec, you can do it quickly even without narrow feedback. Not so if you try to improve each part without a working product.


Test: [2019-05-13 Mon] “Perfectionism” - I needed to implement a new project. There was already another project that did much the same thing I was trying to do, but it used some obsolete or verbose techniques for some of the parts.

One option for me would have been to use their techniques and get something working really quickly and then improve individual parts as much as deadlines allowed. It wouldn’t matter much whether there was documentation about each technique in isolation because I would be following whatever the earlier project did. In other words, I had one example of a spec and its implementation using humdrum techniques, which I could use to figure out what humdrum code to write for each part of the spec. That way, when it came time to implement my spec using humdrum techniques, I could use the mapping learned above to translate each part of the spec.

But I didn’t do that. I tried to look for the best way to do each part. And when there wasn’t enough documentation to help me implement the snazzy, new, compact technique I heard about somewhere, I got stymied. Needless to say, the whole project took me a long time, and I despaired often. Notice how the task changed. I had the same example of a spec and its implementation using humdrum technique and could learn the same mapping from each part of the spec to humdrum code. But I was no longer trying to implement my similar spec using humdrum techniques. I was trying to implement my spec using awesome techniques. So, I could no longer use the humdrum mapping above to translate each part of the spec. And since I didn’t know how to translate each part to code that used awesome techniques, I got stuck.

Even when I found code examples of the technique being used by some other project, they tended to be different from my own use case. This meant that I had to learn the mapping between their spec, which I sometimes had to guess, and their code, and then check if that mapping was enough for me to translate my spec to code. So, if my spec used some feature that theirs didn’t, I wouldn’t know how to translate it to code. This way, I would have to learn about and mix features from all different kinds of projects to do mine. Contrast that to having a project whose features were almost identical to mine and whose mapping I could use without having to read any other project’s code.

Great documentation - one that told me how to translate every possible spec feature to code - would have helped me translate each part of my spec. But since this kind of documentation was not always available, I got stuck often.

Also note that, in the first approach, once I had a working implementation, I could experiment with the awesome technique using the overall test as feedback. If it works the same even after I switch to the new technique, then I know I’ve used it correctly. I could upgrade even without great documentation. However, in the second approach, I would have no such recourse because I wouldn’t have a working product. I would need good documentation or be lost.

All this because I decided to use a better technique in one part of the code. That one additional constraint meant I needed far better documentation than you are likely to find in infrequently-updated wikis.

Test: [2019-06-08 Sat] My laptop screen went black and didn’t turn back on even when I restarted it. People online suggested removing the battery, holding down the power button for 30 seconds, plugging the AC cord back in, and powering up the laptop. That worked. However, my laptop battery is now outside the laptop. Should I wait till the end of the day, when I shut down my laptop, to replace the battery, or can I somehow replace the battery while the laptop is running?

Needless to say, I have been searching online for the last five minutes to see if the latter is possible. I’m not able to get a clear answer. The first question I saw was about a laptop battery that wouldn’t charge under AC power, which is not my case. So, I’m not going ahead with their suggestion, even though they seem to have swapped the battery back in under AC power.

In short, I want to answer the unanswered what-if question: what if I replace the battery while my laptop is running? Will I get shocked or will things go smoothly? Now, I don’t have sufficient data to answer that and I don’t want to risk getting shocked. So, I won’t be able to answer that question any time soon. I should go ahead with the perfectly reasonable option of waiting till tonight to replace the battery. But that seems sub-optimal. What if, in some hypothetical extenuating circumstance, somebody couldn’t afford to wait till the end of the day and wanted to replace the battery right away? Then I wouldn’t be able to answer whether it would work. The horror!

Notice how I’m rejecting the other perfectly reasonable option of shutting down the laptop, replacing the battery, and resuming my work - a course of action that would take no more than a couple of minutes and that I’m practically 100% sure will work. I want to know the answer to a question whose answer won’t really save me any time and about which I’m extremely unsure.

So, this insistence on answering questions that I can’t reasonably answer and on rejecting perfectly reasonable options is what is delaying my work. Sure, there may be some benefit in answering all these what-if questions. They may help me in the future. But I could answer them after I’m done with my current task. Curiosity is great for gaining skill, but not at the expense of my immediate work. Right now, I’m holding up everything else because I refuse to go ahead unless I have a perfect solution.

Sometimes you just have to come to peace with questions that you cannot answer.

How Perfect do you Want It to Be? (TODO)

Hypothesis: The time taken to find the cleanest design possible is the time needed to go through every existing implementation! That’s why I take a long time to understand a code base. I feel like I need to understand the best design currently used (if not the best design possible). TODO

Corollary: That is the true cost of perfectionism. You have to go through all existing implementations to find the best current design.

Hypothesis: A satisficer would stop after the first positive exemplar. A perfectionist would have to go through all positive exemplars.

Corollary: When there’s more than one way to do it, I become paralyzed because I feel compelled to go through all of the possibilities and figure out the smallest set of sufficient factors.

Corollary: This could also be why I try to decouple a new feature as much as possible so that there is just one way to implement it (since there are no extraneous factors) and I don’t have to waste time experimenting with a lot of the possibilities.


Test: Also, I fear that there is a cleaner way to do it, which I can find by reading even more code, and therefore I don’t go ahead with a half-assed design that would actually work.

Basically, I won’t go ahead until I’ve found the cleanest way to do it. Which would require me to go through all the existing code and check all of their designs. If any of them has a cleaner design (which is my output variable), then I will ignore the other options and go with the better one.

This is why I am such a slow learner.

Acceptance Test: Clear Focus on What Matters (TODO)

Hypothesis: Going from not knowing how to implement a spec feature and not having a narrow test to having a narrow test for the spec-to-code compiler will speed up your time a lot. Ditto for learning how to implement the spec feature.

Hypothesis: Failing acceptance test -> you know exactly what needs to be done; you know it matters; need to remember only your current failing acceptance tests.

No failing acceptance test, write unit tests or just implement code -> you don’t know exactly what needs to be done; you don’t know if it matters; have to remember a lot of TODOs.

Lesson: Writing a TODO - instead, make a failing acceptance test for it.

Lesson: What to do next - make a failing acceptance test pass.

Question: What is the ideal behaviour for programming (on, say, the phone number problem)?


Test: Phone number problem - I knew that 10/783--5: neu o"d 5 wasn’t working yet and I needed to get on it. – No need to waste time on pointless “design” of beautiful types. Get that test passing ASAP. And I knew it mattered because that was what they would test. Just needed to focus on that failing acceptance test.

Test: Scheduler OS HW1 - no failing acceptance test (no real tests at all, in fact) – didn’t know if I was working on the right thing; didn’t know exactly what to do next (?). Had to remember a lot of to-do items in my Org file.

Test: Want to write a to-do item for “handle multiple processes accessing the same virtual memory address” – instead, just add a failing acceptance test and set it aside. Don’t need to keep track of lots of to-do items.

Test: Unsure what to work on right now - take up a failing acceptance test - such as “how to refactor ideas from different essays I’ve written”. – know exactly what to do.

Premature Abstraction

Hypothesis: Trying to abstract beyond your means (i.e., compress more branches into one than you know how to) is a relevant factor for being stymied. You may not know if some of those branches are necessary and, even if so, you may not know how to do it all off the top of your head.

Trying to handle just as many branches as you know how to do easily is a relevant factor for making progress.

TODO: Controlled experiment to show that you would go fastest if you handled the maximum branches you can handle effortlessly and slower if you handled any fewer or any more.


Test: [2019-01-25 Fri] When parsing array assignment a(1) = 3, I’m worrying about edge cases like f(1)(2) = 3. Is that a valid expression? Don’t know. Maybe some function f will return an array to which you will set the value. But do we have arrays as types? No. Still, what if there is some edge case where that happens? – All of these thoughts are getting in the way of implementing a(1) = 3 as quickly and dirtily as possible.

Specifically, I’m hesitating to write the concrete match case App(Ref(arrayName), List(index)) instead of case App(exp, List(index)) because the latter is more general and may be the right way in the future.

In short, I’m prematurely abstracting. I know how to handle a(1) = 3 separately and I know how to handle f(1)(2) = 3 separately. But I don’t know off the top of my head how to handle both of them within the same branch. Plus, I don’t know if the latter expression is even valid.

Just struck me: why not handle them as separate cases? Have one case for the concrete version and one for the abstract version. But that’s code duplication, which my mind tells me is a cardinal sin!

Note that I’m trying to handle the maximum number of possibilities in the minimum amount of code. That could be why I’m dilly-dallying over this code.

To Prevent Regression, Change only within your If-Condition (??)

Hypothesis: If you add code only within the conditional logic for your feature, you can be confident that you haven’t affected any other feature’s behaviour.

If you add code without a conditional check for your feature, then you have to confirm that the other features aren’t affected too much by your change. The more features there are, the more time this takes.

Lesson: Always add new code under an if-condition for your feature.

Corollary: If you keep all your features under separate condition checks, you can switch among them with just one button.

If you don’t, you have to go and manually change the shared code that must be different for each of them.

Corollary: As long as you change code within an if-condition for your feature, you know you won’t regress other features. However, to ensure correctness for your feature, you still have to handle all possible paths that lead to your output. So, if there are two paths, you need to add your code under both of them. (Ideally you would merge them first.)

Corollary: If you want to change common code, first move it under a default if-condition and then add your changed code in a new if-condition for your feature.

Corollary: Code is hard to change when it’s used by multiple features and when you don’t add separate if-conditions (or can’t add them because you don’t know the set of features that use this code).

Corollary: OOP abstracts these if-checks by having the conditional logic for a particular feature at all points along the data flow within one class.

Corollary: If you add debug code only under a DEBUG check, then you can be confident that that code will be run only when debugging is turned on and will not affect the normal logic.


Test: [2018-07-29 Sun] If you want to disable a “Submit” button until the form data is valid, you have to worry not only about disabling it only for your feature but also not accidentally disabling or enabling it for other features that use the same form.

If you add some conditions that apply only to your feature, then you can be confident that you’re not affecting any of the logic for the remaining features (since they will never even enter your code branch).

However, if you add some code without checking specifically for your feature, then it would be executed even for other features and thus you might affect their behaviour.
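
The "Submit" button case can be sketched as a minimal feature check. The flag name and mechanism below are invented; the point is that the new behaviour lives entirely inside its own branch.

```python
# New validation logic gated behind a feature check: callers that don't
# enable the flag can never be affected by the new branch.

FEATURES = {"validate_before_submit": True}

def submit_enabled(form_valid: bool, feature_flags=FEATURES) -> bool:
    if feature_flags.get("validate_before_submit"):
        # New behaviour: disable the button until the form data is valid.
        return form_valid
    # Shared default behaviour for every other feature: button always on.
    return True
```

Switching the feature on or off is now a one-button change to the flag, not a scattered edit to shared code.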

How to Get an MVP?

Hypothesis: Imagine the product without certain features (or with relaxed constraints). If it still satisfies some valuable acceptance tests, then those features are not part of your MVP.

Or maybe it’s about the end-to-end acceptance tests that you choose not to pass.

Hypothesis: Decide how much value you want in your MVP and then pack in as many high-value acceptance tests (or easy acceptance tests) as needed to produce that value.

TODO: Labels: “value you want in your MVP”, “value of acceptance test” - need positive and negative exemplars.

Hypothesis: Ask yourself “what would I do if I had to release right now?”. That will give you a current positive exemplar, even if it is low-quality (as opposed to a nebulous ideal exemplar, which only leads to a negative exemplar right now).


Test: Super Mario - the game “works pretty well” even without the princess. Well, decompose the output (in this case, your gaming experience). You are entertained by the jumping and running.

Test: Low-quality but immediate positive exemplar - what benchmark would I use for my equational calculator? Probably programming challenges.

Make a Change only when a Test Fails

Observation: Wasted a lot of time just staring at the screen hoping for a “perfect” solution to land in my mind.

Observation: I don’t have near-term tests for the changes I’m making to pipgetc. They are all abstract changes.

Observation: Long time since I ran any test (or even compiled the code). No feedback.

Hypothesis: Make a change only when a test fails. Duh!

It’s called red-green-refactor. I’ve been making changes willy-nilly.

Corollary: Otherwise, you will end up making blue-sky changes to your code.

Corollary: Don’t have to think about abstract cases. Don’t have to “contemplate the ifs”. Just make the current test case pass.

Hypothesis: Writing code without a failing test -> don’t know if you’re cutting through.

Corollary: Writing to pass tests -> consequentialism.
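The red-green loop can be sketched in a few lines (`slugify` is a made-up example). First the assertion fails (red); then you write just enough code to make it pass (green), with no speculative "ifs":

```python
def slugify(title):
    # Minimal code to pass the current test case; nothing more.
    return title.strip().lower().replace(" ", "-")

assert slugify("Hello World") == "hello-world"  # the test that was red
```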

Implement the Valuable Test Cases First

Hypothesis: Implementing the happy path of most valuable features instead of agonizing over boundary cases you might have missed is a relevant factor for speed. (TODO)

How many valuable features at a time? For example, just the main conjunctive def f(x: Int){} or including multiple parameters def f(x: Int, y: Boolean){} or including currying as well def f(x: Int, y: Boolean)(z: Unit){}? I did all three in practically one shot. Maybe it depends on how much you can reliably do right on the first try.


Observation: [2019-01-25 Fri] It’s harder to track what I’ve completed and what I haven’t. Earlier, I could mark “BaseParser” as done once I’d handled the different cases, added boundary tests, and refactored. But now, even though I’ve done one branch for function parser, I’ve left other valid cases and boundary tests, so I can’t mark any part as done.

No Feedback for a Lower-Priority Constraint until you satisfy Higher-Priority Constraints

Hypothesis: Passing higher-priority constraints is a relevant factor for getting feedback about a lower-priority constraint.

That means you can’t decide if A is better than B until the higher-priority constraints have been passed.

Corollary: It’s a mistake to ponder about a choice until you have satisfied higher-priority constraints.

Corollary: If you try to get feedback about lower-priority constraint without first passing the higher-priority constraint, you will have to look at a lot of scenarios to find out its average value.

Corollary: When it comes to implementation and quality, the implementation is the higher-priority constraint and you can’t get feedback about the additional impact of some quality improvement until you’re done implementing.

Corollary: When it comes to a feature like being able to configure a setting, the rest of the program is a higher-priority constraint than the ability to configure. Clearly, you can’t get feedback about the impact of being able to configure if the program doesn’t work at all.


Test: [2019-12-26 Thu] Suppose I want to write a command-line Twitter client. My basic spec is: when I run the command, I should see PG’s last 10 tweets.

Questions that pop up in my mind: Should I get the username from the command-line or from a configuration file? How do I choose between the two? What sorts of things should I allow users to configure? Just the username or a list of usernames or the background color or the format of the displayed tweet or the number of tweets or the filters for the tweets?

On a different note, I’m currently writing these notes in some random file. Should I add more notes to this file or to my old programming notes file or to my scientific thinking file or somewhere else entirely? Don’t know.

Problem is that this choice is blocking the rest of my implementation. I don’t really have a feedback mechanism for this variable alone. The best I can do is implement some values for A, B, C, namely A1, B1, C1, make sure that it works, and then check A1, B1, C2. That way I get some feedback about C1 vs C2. But I have no way of knowing about just C1 and C2 in isolation. I get feedback only in the context of a baseline.

When can I ever get feedback about a variable without any other context? Well, suppose I were optimizing for speed and I wanted to compare bubble-sort and quick-sort. I could find out about them alone. That’s possible because total performance is a sum of the performance of individual parts of the code. So, if C1 is better than C2 at performance, then A1, B1, C1 will be better than A1, B1, C2. I think this is called congruence. Not sure. I was lucky that I could compare C1 and C2 in isolation and extrapolate to the overall performance.

But there are wrinkles. What if a variable has two dimensions? For example, adding a new note to file A might be much faster in the short run but may make it hard to find in the long run. Adding the note to file B might be slower in the short run but will make it easier to find in the long run. In other words, we face tradeoffs. There is not just one high-level dimension we’re trying to optimize and thus we can’t extrapolate from a comparison between two alternatives on that one dimension.

No, I don’t think the problem is the multiple dimensions. It’s the fact that other variables depend on your choice. Or not. For example, comparing C1 to C2 in all possible scenarios takes a lot of time. It’s lucky that, when it comes to performance, we don’t need to iterate over all possible scenarios, since the impact of this one variable is the same regardless of the surrounding context. If the performance of C1 > performance of C2, then performance of X, C1 > performance of X, C2 regardless of X. But if you want to compare filing a note in file A vs B, you have to consider the context. Say there are only a few scenarios and you know what they are, such as one where you rarely look up the note and thus don’t care where it is filed and another where you need to look up the note often and may need to write many more things about it and should thus have it in its own file. Then, you can test those cases quickly and come to the expected impact of A vs B. But if there are a lot of possible scenarios and you don’t know what they are or you don’t get feedback about their impact, you may not be able to calculate the expected impact of A vs B.

In short, when you get feedback only for the overall product, you may have to look at a lot of scenarios to get the average impact of choice A vs B. However, if you have a working product with A, you just need one more iteration to get the impact of the product with B.

Test: [2019-12-26 Thu] Suppose you want to collect rows from a real-time database that share the same session id. But you can’t find a COLLECT operator in the docs and are considering just collecting by hand. Doing it by hand would be uglier than using a built-in operator, but would probably get things done more quickly since you can’t find the solution in the docs. So, if you translate the overall goal of writing a clean program quickly into the sub-goal of writing a clean function quickly, you won’t find any function that passes those two immediate constraints. That means you won’t get any program quickly, let alone a clean one.

If, however, you relax the immediate constraint to be just writing a function quickly, even if it is ugly, then you will pass the overall goal of writing a program quickly, albeit a little messily. So, yes, a function that is written quickly and cleanly will lead to an overall program that is written quickly and cleanly. But if you can’t find such a function, then compromising will at least lead to a program that is written quickly.

In this case, it’s not that deferring the search leads to a quicker search. You do have fine-grained feedback for a function. So, in theory, if you absolutely need a program that satisfies A and B, then you will have to spend time searching for a function that satisfies A and B, whether you do it before or after writing a working program. The problem appears when you can’t find a solution. Then, you have to choose between searching further and compromising so that you can move forward.

You may have to compromise not just when some constraints are unsolvable but also when they are not worth solving. For example, it may take you several hours to search for some obscure use of COLLECT or to implement a library function for it yourself. That’s not worth your time when your entire budget for the program is just a few hours.

If you implement a program with the inferior version of the function, you can judge whether you have time to search for the better version of the function. But if you haven’t yet implemented the program, you can’t judge whether you can afford a few more hours on the function. The rest of the program may need more time than you expected and you may end up with an unfinished program.

What does your feedback mechanism look like? It has two levels: implementation and quality. Clearly, implementation trumps quality. An elegant but incomplete program is useless whereas an ugly but working program is still usable. However, an elegant working program is better than an ugly working program. So, while you can measure the quality of any function by itself, you can’t say anything about the overall impact of quality until you have a working program. For example, if every function in a program is written using the state-of-the-art design techniques but you add a simple type error before the program is released, then the program’s value will be zero.

Until a program is implemented, you can’t judge the additional impact of the quality of a function. Thus, you can’t judge whether you can afford to spend a few hours implementing it. In this case, having a working program and being under your deadline were relevant factors for improving the quality of a function. If your program didn’t work or if your deadline had elapsed, there would simply be no value in improving the quality.

Should you go from not working, ugly code to not working, elegant code or to working, ugly code? Yes, you’d like to go to working, elegant code, but you may not always know how to do that. Well, how much more elegant should you try to make your not-working program? Should you stop with one extracted function or should you go for a class hierarchy or an iterator-based solution or one that uses the state-of-the-art library for printing stuff on the terminal? You’d like all of the above, but you have to choose. Because the program doesn’t work, you don’t really get feedback about which one to go for. If the program did work, you could tell, based on the time left in your budget, whether to stop with the iterator-based solution or to go all in for the shiny, new library.

Note, however, that if you have a budget and a test for a function, you can optimize for quality as long as the function passes the test and you’re within your budget. That’s because once you have a working function, you can find out the impact of additional quality. How would you get such a budget?

Rest

The Cost of Induction depends on the Cost of Diffing

Hypothesis: The cost of inducing over several existing implementations to find out the smallest sufficient set of factors is the cost of diffing.

You’re given two implementations spanning several files and their output features. You have to diff the two programs and match the diffs to the diffs of their output features. And humans take a while to diff.

Corollary: Wow. Induction can be costly. This is why people should do it for you by abstracting the common parts and leaving only the specific details to each instance.

Corollary: Good abstractions shorten your code and thus speed up your diffing. This is why succinctness is power.

Corollary: A good grasp of high-level patterns also speeds up diffing because you can get away with shallow diffs.

For example, if you know that the tests folder contains non-critical test code, you can skip it. If you know that the object fields will be defined in a particular place, you can compare those methods with each other only (instead of comparing everything with everything else).

Hypothesis: Structure eases diffing by telling you which parts to compare with which parts.


Test: Paralyzed - had to check the different implementations of features because I didn’t want to miss out on a potentially way easier and way cleaner technique. Spent an hour and a half simply testing that hypothesis because I had to go over every function. – deep comparison to figure out which factors were similar and which different (including the output factors).

For example, I wanted to check if implementation B gave any newer output features than implementation A. So I had to compare the output variables deeply too.

Test: Good abstractions can save you time in comparing:

sum [] = 0
sum (x:xs) = x + sum xs

product [] = 1
product (x:xs) = x * product xs

vs

sum = foldr (+) 0
product = foldr (*) 1

Explanation: The differences are way more readily obvious in the latter.

Amount of Work is Proportional to the Equivalence Classes

Hypothesis: If you cover some equivalence classes with one implementation, even if you don’t try all combinations, you can abstract that code to allow you to do the other variations.

If you try to implement all combinations at once, you’ll think you have to do a lot of work.

For example, if I try to generate longer expressions with the identity laws, I can always abstract it to allow me to use the cascade laws. Similarly, if I implement graph generation for identity laws using just a few laws, I can always abstract it for the other types and the other numbers of laws.

Corollary: If this is true, I don’t need to do a combinatorial amount of work. Cover the equivalence classes with a few implementations and abstract.

Yes, running the code will take time proportional to the number of combinations, but implementing it will take time only proportional to the number of equivalence classes. I’m acting like implementing it will take a combinatorial amount of time.

Corollary: To figure out how much work is left, look at the equivalence classes left (and the amount of abstraction you’ll have to do). For example, to know how much I’ve left undone in a messy essay, collect equivalence classes from it.

Corollary: Abstraction means that you can implement something once and use it for n different cases.
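A small sketch of implement-once-use-n-times, in the spirit of the foldr example earlier. After one abstraction, each new case costs a line, not a loop (the helper names here are illustrative):

```python
from functools import reduce

def fold(op, init, xs):
    # Implemented once; reused for every case below.
    return reduce(op, xs, init)

def total(xs):   return fold(lambda acc, x: acc + x, 0, xs)
def product(xs): return fold(lambda acc, x: acc * x, 1, xs)
def longest(xs): return fold(lambda acc, x: max(acc, len(x)), 0, xs)
```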

What the Really Fast Coders Do: Run the Program in your Mind

Hypothesis: Fast coders -> get feedback sparingly.

Hypothesis: Focus-gamble - try more than one thing at a time.

Hypothesis: Me and people who do TDD - we’re getting dependent on fine-grained feedback. Right now, I’m running tests after every couple of lines or so. I feel like I don’t know what would happen otherwise. But I’m missing out on the feedback that I would get by simply looking at the code without running it. Maybe that’s faster. That’s probably how fast coders do it.

When you run tests every few minutes, you don’t think about how your code works. You don’t have to, and that can save you time and prevent bugs. But you’re probably losing the skill of analyzing code without running it.

You don’t need to test each new library call separately. That’s wasted feedback. Once you know the general interface of that library call, you should be ready to use it.

Hypothesis: Fast coders, 20 lines of code -> can predict the output without running it.

People who have gotten used to running unit tests continuously, 20 lines of code -> can’t predict the output; have to run it.

Corollary: If that is true, then people who run the code frequently like that would get overwhelmed pretty easily by moderately large code. They would go around looking for understandable outputs instead of trying to understand the code itself.

Corollary: Basically, get feedback from your mental model instead of waiting for feedback from the mechanism itself.

Hypothesis: Depend on feedback from others or from tests -> don’t build a full model of the code in your head -> feel that there are some missing magical pieces that only others have.

Don’t depend on feedback from others -> build a full model of the code in your head -> feel that there are no missing magical pieces, that it is mundane and solvable.

Remember, mysterious explanations are formed when you ignore some external factors and start postulating some hidden causes.

Hypothesis: Dependent on feedback from the computer, write “complicated” code -> feel the urge to test each part of it in isolation because anything could go wrong.

Dependent on feedback from other people (like TAs), work on a tough problem -> feel the urge to get feedback about each part of your work because anything could go wrong.

Not dependent on external feedback, write “complicated” code -> don’t feel the urge to test each part of it; wait till you write a reasonable amount of code and then test it.


Test: Saw a guy write a simple JS snake game in 4m30s. He had obviously done this several times before. But the main thing was that he didn’t run the code even once. – He got no feedback (from the computer).

Test: New library - wasn’t able to follow the code at first go - kept looking for an understandable output of some function – kind of lost the habit of understanding code itself.

Test: Wanted to add my multiple-command exploit to the project - delayed because I was afraid I wouldn’t be able to integrate it correctly – All I had to do was read the code and build a half-decent model of it.

Test: Negative exemplar - interview answer - have to judge it by myself. Was pretty surprised at how far I was able to get when I actually walked through each line of my function with a trivial input. – your mental model can give you a lot of feedback.

Test: Wanted to replace each string in a one-liner with its minified counterpart. Wrote this in the Emacs Lisp minibuffer:

(r-evil-map (region-beginning) (region-end) (lambda (line) (s-replace-all '(("before_string" . "bs") ("ebp_minus_4" . "m4") ("ebp_value" . "ev") ("ebp_plus_4" . "p4") ("after_ebp" . "ae") ("main_address" . "ma")) line)))

(I wanted to minify the Python one-liner since it was getting too long to run on the remote computer, if you want to know the purpose. Don’t think you can write a loop in a Python one-liner. Sucks.)

I immediately felt the urge to run r-evil-map on a simple line to make sure it worked the way I thought it did (urge #1). Then, when I added the lambda, I felt the urge to test whether it could handle just one single string (urge #2). When I made myself keep going, I felt like I had to test whether the strings would overlap with each other somehow and mangle the result (urge #3). Finally, I felt like transforming my Python one-liner and running it locally to make sure that it still worked like before. (urge #4. I gave in here and tested it locally.)

Explanation: When I ultimately transformed the one-liner I was using on the remote computer, it worked instantly! I didn’t have to give in to any of the above four urges for feedback.

Note that each of those urges was a question whose answer I wasn’t sure about. If I had stored the answer that, no, s-replace-all wouldn’t overlap the search strings and that r-evil-map worked exactly as I imagined it would and so on, I would have been able to march forward with that long line of code.

Test: Every time I think of asking the TA for help, I feel like there’s some mysterious essence of the problem that I simply cannot grasp and that he can help me with. But the last few times, when I’ve stopped myself from posting a message and instead tried to solve my problem myself, I found that the answer turned out to be pretty mundane (as it always is).

I had to use a different $ebp address on the remote server or brute-force different offsets to figure out where it might be or use Python wrapper for libc to generate random numbers the way C does or use a different shell code. It’s usually a problem that I can solve. But notice how easily I impute a mysterious solution to the problem.

(Yes, sometimes the TA does hold a missing piece that I couldn’t have known by myself, such as the fact that you have to use env - $(pwd)/a.out to clear the environment and get a better idea of the stack addresses or that %n writes to the address pointed to by the location on the stack, not to the location itself (even though I could figure this out in theory). But those are pretty rare, I think.)

Explanation: In short, I ignore certain external factors that I could have easily accounted for and instead postulate some mysterious missing piece that only people with the true vision know.

Test: Positive exemplar - turnaround - did it all by myself. – I know it inside out.

Test: Negative exemplar - bof - still not 100% sure – asked for a lot for help.

Abstraction requires Inducing Input and Output Variables

Hypothesis: Writing code in another scope is a relevant factor for having to induce the relevant input factors from the variables in scope at the caller.

Hypothesis: Writing code in another scope is a relevant factor for having to induce the output values for the caller from the variables in the code.


Test: [2019-02-14 Thu] Wrote a full program of around 45 lines of Python without any intermediate feedback. Granted, it was mostly about integrating a bunch of functions I’d already written, but it’s unusual for me.

Note that I don’t usually write 35-line functions with 21-line for-loops:

def main():
    input_file_name = 'dating.csv'
    output_file_name = 'dating-binned.csv'
    training_file_name = 'trainingSet.csv'
    test_file_name = 'testSet.csv'
    test_frac = 0.2
    training_frac = 1.00
    random_state = 47

    nbins_list = [2, 5, 10, 50, 100, 200]
    training_accuracy_list = []
    test_accuracy_list = []
    for nbins in nbins_list:
        discretize_columns(input_file_name, output_file_name, nbins)
        split(output_file_name, training_file_name, test_file_name,
              test_frac, random_state)

        training_df = read_csv(training_file_name).fillna(0)
        test_df = read_csv(test_file_name).fillna(0)
        for col in training_df.columns:
            training_df[col] = training_df[col].astype('int64')
        for col in test_df.columns:
            test_df[col] = test_df[col].astype('int64')
        model = nbc_module.nbc(training_df, training_frac, random_state, nbins)

        training_accuracy = nbc_module.accuracy(model, training_df)
        test_accuracy = nbc_module.accuracy(model, test_df)
        training_accuracy_list += [training_accuracy]
        test_accuracy_list += [test_accuracy]
        print('Bin size: {}\nTraining Accuracy: {:.2f}\nTesting Accuracy: {:.2f}'.format(
            nbins,
            training_accuracy,
            test_accuracy))
    plot_accuracies(nbins_list, training_accuracy_list, test_accuracy_list)
    return model

I used to write this kind of code a long time ago. Now, I don’t. I used to be able to write code without much feedback (without any unit tests, in fact). Now, I can’t. Maybe there’s a link.

Close negative exemplar: If I try to extract functions, it’s going to take me a lot more time. I’d have to “think”.

def evaluate_accuracy(nbins, input_file_name, output_file_name,
                      training_file_name, test_file_name,
                      test_frac, training_frac, random_state):
    discretize_columns(input_file_name, output_file_name, nbins)
    split(output_file_name, training_file_name, test_file_name,
          test_frac, random_state)

    training_df = read_csv(training_file_name).fillna(0)
    test_df = read_csv(test_file_name).fillna(0)
    for col in training_df.columns:
        training_df[col] = training_df[col].astype('int64')
    for col in test_df.columns:
        test_df[col] = test_df[col].astype('int64')
    model = nbc_module.nbc(training_df, training_frac, random_state, nbins)

    training_accuracy = nbc_module.accuracy(model, training_df)
    test_accuracy = nbc_module.accuracy(model, test_df)
    print('Bin size: {}\nTraining Accuracy: {:.2f}\nTesting Accuracy: {:.2f}'.format(
        nbins,
        training_accuracy,
        test_accuracy))
    return (training_accuracy, test_accuracy)

def main():
    input_file_name = 'dating.csv'
    output_file_name = 'dating-binned.csv'
    training_file_name = 'trainingSet.csv'
    test_file_name = 'testSet.csv'
    test_frac = 0.2
    training_frac = 1.00
    random_state = 47

    nbins_list = [2, 5, 10, 50, 100, 200]
    training_accuracy_list, test_accuracy_list = zip(
        *[evaluate_accuracy(nbins, input_file_name, output_file_name,
                            training_file_name, test_file_name,
                            test_frac, training_frac, random_state)
          for nbins in nbins_list])
    plot_accuracies(nbins_list, training_accuracy_list, test_accuracy_list)

I had to think about how to break it up. Had to ensure that the loop body was truly independent of the rest of the code (like a map and not like a fold) before I could extract it as a function. Had to think about the function’s inputs and outputs. Earlier, it was all implicit. I knew that whatever it needed was available in scope. Now, I had to shrink the set of available variables from everything in scope to the few parameters of the function. Induction, in other words. Similarly, I had to think about the variables the remaining code needed from it (namely, the elements of the accuracy lists and the printed statement), whereas earlier I knew that whatever it added to scope would be available if needed. Induction again.

Test: [2019-02-14 Thu] Some of the fastest coders I know write giant Python functions even today.

Math: Get Answers without Getting Empirical Feedback

Hypothesis: Math is how you get answers without getting empirical feedback.

Hypothesis: Holy crap. Is this our surrogate endpoint? Is this how we get answers without running costly real-world trials?

It’s the classical debate between empiricism and deduction. You can calculate 37 times 14 in two ways: lay out pebbles in 37 rows of 14 each and then count the total or simply use your grade school multiplication algorithm to get 518. The former is empiricism and the latter is deduction (or computation or whatever you want to call it). In empiricism, you assume (or realize) that you can’t compute the final answer by yourself and you want to outsource it to nature, which has vastly more computational power. In computation, you take on the burden of evaluating each step of the function yourself.
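Both routes to 37 times 14 fit in three lines. The "pebbles" version outsources the computing to brute enumeration; the deductive version evaluates the arithmetic directly:

```python
# Empiricism: lay out 37 rows of 14 "pebbles" and count the total.
pebbles = sum(1 for _row in range(37) for _col in range(14))
# Deduction: the grade-school multiplication algorithm.
deduced = 37 * 14
assert pebbles == deduced == 518
```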


Test: Decision tree - reduced error pruning - what if you go top-down and decide whether to split at a node based on the validation set? – either run the experiment or use your logic.

Test: Negative exemplar - empiricists like Newton and others would probably prefer to run the experiments, especially if they are cheap. – depend on feedback.

Test: Is my surrogate endpoints theorem right? Have to judge based on the math proof. – need to use math.

Test: Will the Haskell program work? – can reason equationally.

Test: Will my change to my Python data mining script work? Can’t wait five hours for the script to finish all the processing, reach that step, and crash. Would like to know right now. – have to reason using logic.

Test: “Armchair” economics - produce answers to real-world questions using economic models instead of actual reality. For example, Tim Harford predicted that most of the coffee shop profits would go to the landlords because what was scarce was their corner store location, not the coffee. – reason.

Test: Algorithms - cycle property - the costliest edge in a cycle can never be part of an MST. – logic.

Test: Algorithms - cut property - the cheapest edge in a cutset is part of the MST. – logic.

Efficient Processing: How often do you get Feedback? (??)

Hypothesis: Measure how many times you run commands in the interpreter or run your tests -> learn how efficiently you process information.

DSL: Only What you Need

Label: DSL = Set of features that will be needed in a set of use cases.


Test: R - will be doing a lot of matrix multiplication and vector manipulation – add that feature.

Test: Negative exemplar - Haskell matrix multiplication - cool, but you also have all these other language features that you don’t need for matrix algebra – extraneous features.

Fast, Correct, Duplicated Code vs Slow, Regressive, Succinct Code

Hypothesis: If you handle each type of input separately, you will make progress easily (because you will have few constraints at each point) and you won’t break existing features (because you won’t touch other branches) but you will duplicate code across the branches.

If you try to handle multiple inputs in the same branch, you will have trouble making progress (because you will have a lot of constraints to handle) and you may break existing features (because a change within one branch will affect multiple existing features) but you will have succinct code.
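A hypothetical contrast (the event kinds and helpers here are invented). The branch-per-input version is easy to extend and regression-safe but duplicated; the merged version is succinct but a change to the shared template affects every kind at once:

```python
# Fast, correct, duplicated: each input kind gets its own branch.
def describe_duplicated(event):
    if event["kind"] == "deposit":
        return "deposit of " + str(event["amount"]) + " dollars"
    elif event["kind"] == "withdrawal":
        return "withdrawal of " + str(event["amount"]) + " dollars"

# Slow, regressive, succinct: one shared branch for all kinds;
# editing the template risks regressing every existing feature.
def describe_merged(event):
    return "{} of {} dollars".format(event["kind"], event["amount"])
```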


Observation: Duplication vs regression strikes again! Handling a new case by itself is pretty trivial (if tedious). But handling the new case without affecting the else case too much is very hard. No wonder I’ve been stuck on this.

Observation: When you duplicate and change, you get to start from scratch. You have far fewer constraints. Much more peaceful.

Big Functions are Hard to Reason about too (??)

Observation: Bunch of journal notes and essays. Don’t know exactly what they contain or what they advise for some problem I face. Clearly, a “chaotic mess”.

Observation: Unfamiliar function - can’t tell what it does.

Observation: If I have a variable that I can change to switch the scheduler from a time-share scheduling algorithm to a proportional-share algorithm, then I can kind of tell the difference in behaviour between the two.

Observation: But if I have three hundred lines of code spread over four files that are different between time-share and proportional-share scheduling algorithms, I find it much harder to predict the differences in behaviour.

Hypothesis: Lots of possible input states and thus lots of possible output states -> hard to infer forward or backward.

Why? Not quite sure. But if it’s an arbitrary function, then you have a lot of input-output mappings to remember. You’re liable to give up.

Restricting yourself to One-Button Changes simplifies your Model

Hypothesis: One-button change between alternatives -> simpler type signature (A vs B -> X vs Y).

Corollary: You won’t be able to make low-level predictions about, say, the exact priority of each process at each point of time, but you would be able to predict the high-level properties (such as, I/O-bound processes will go first in a time-share scheduler).

Corollary: One-button change between small set of high-level alternatives -> cannot choose some arbitrary intermediate state.

Corollary: So, the output will always be that corresponding to one of the high-level alternatives. Much easier to infer forward or backward.

Hypothesis: Allow yourself only one-button changes vs can change anything anywhere -> far simpler model of the system vs lots of possible inputs and lots of possible outputs.

You’re restricting your set of possible interventions. Because you know you haven’t changed anything else anywhere, the only possible inputs are those defined by your high-level categories.
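A sketch of the scheduler example, with illustrative (not real XINU) policies: one top-level variable is the button, so the only reachable behaviours are the two high-level alternatives:

```python
def time_share(procs):
    # Favour processes that have used the least CPU so far.
    return sorted(procs, key=lambda p: p["cpu_used"])

def proportional_share(procs):
    # Favour processes holding the most tickets.
    return sorted(procs, key=lambda p: -p["tickets"])

SCHEDULER = time_share  # the one button: flip to proportional_share

def next_process(procs):
    return SCHEDULER(procs)[0]
```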

Hypothesis: Allow yourself only one-button changes -> fewer possible interventions and thus fewer possible input states.

Hypothesis: Number of possible interventions -> number of categories.

Hypothesis: Number of categories -> number of possible input-output mappings -> number of possible causes for any output and number of possible outputs for any cause.

Handle all possible High-level Inputs

Hypothesis: To be confident that you’ve written your function correctly, handle all the high-level categories of your input.

Corollary: Errors crop up when you omit some cases.

The most obvious case is forgetting to check for NULL. Another is failing to handle all possible states of your input argument, like a pipe that is not connected to anything or one that has been freed.

Corollary: Static type-checking helps you by making sure you handle all the possible input values, especially in the case of sum types like in Haskell.
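
Python has no sum types, but an Enum plus an exhaustive branch gets close. A minimal sketch (the PipeState values are invented; XINU's actual pipe states differ):

```python
from enum import Enum

class PipeState(Enum):
    FREE = "free"
    USED = "used"
    CONNECTED = "connected"

def disconnect_writer(state):
    # Handle every high-level category of the input. A missing branch here
    # is exactly the kind of omission that causes bugs.
    if state is PipeState.FREE:
        return "error: pipe is not allocated"
    elif state is PipeState.USED:
        return "already disconnected"
    elif state is PipeState.CONNECTED:
        return "disconnected"
    else:
        # A new enum member added later falls through to here loudly,
        # instead of silently doing nothing.
        raise ValueError(f"unhandled pipe state: {state}")
```

In Haskell, the compiler would warn about the missing case; here the `else` clause is the poor man's version of that check.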

Corollary: You must yourself know what must happen for each input configuration.

Lesson: Write your code one branch at a time. Write a unit test up-front to make sure you’re actually adding a feature.

You should change code only within one of the possible input categories. If you want to do something for input A, you must do something (maybe a dummy default action) for inputs B and C.

Corollary: That way, at each point, you can tell which value was caused by the branch for A and which one by the branch for B.

What is the Next Unit Test?

Hypothesis: If you’re stuck, ask yourself what is the next unit test you want to pass.

Work Needed is proportional to number of Test Cases

Corollary: The total work you need to do depends on the number of branches you need to implement and the time it takes you to code and test each branch.

Observation: I didn’t even know how many branches I had to deal with. I couldn’t see them in my head when I thought about the problem of “implementing pipes for XINU”.

Corollary: Number of branches you have to implement = number of test cases.

So, use the test cases to determine the total work you have to do.

Corollary: Number of test cases is determined by the number of branches you need to implement. Make sure you have that many.

Hypothesis: If we assume that the time needed to make one unit test pass is constant, then the estimated time for the whole project should be proportional to the number of branches in each function (i.e., the number of test cases).

We can use that to estimate time needed.

Planning Fallacy?

Observation: I underestimated the number of test cases I would need to implement. Sometimes, I was so unsure that I needed to have unit tests for smaller things like adding a semaphore to my pipe entry, because I’d not used it before.

Observation: I didn’t even know what output I wanted from some function.

Why you Need to Roll Back Sometimes

Observation: Error when trying to create a semaphore. No clue why.

Experiment: Didn’t change anything in the lower-level functions. Just toggled the high-level tests to see which combination broke it.

For example, I knew that running the 20 tests gave the error. Running 0 tests didn’t give the error. Running the first 10 didn’t either. Turned out running two particular tests together gave the error. Cool. That narrowed it down. (I was able to find out that I was initializing too many semaphores and thus running out.)

Imagine if I’d tried this without the high-level functions and had made changes here and there. I would have been overwhelmed within a few minutes. Guess how I know that? Because that’s what I started out doing. Only after half an hour or so did I wise up.
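
The toggling above can be sketched as a greedy narrowing loop. Here `run_tests` is a fake harness standing in for the real test suite, rigged so that the "bug" fires only when tests 3 and 7 run together (like two tests that jointly exhaust the semaphore pool):

```python
def run_tests(tests):
    # Pretend harness: fails only when tests 3 and 7 run in the same batch.
    # Returns True when everything passed.
    return not ({3, 7} <= set(tests))

def narrow_down(tests):
    """Greedily drop tests that aren't needed to reproduce the failure."""
    assert not run_tests(tests), "need a failing combination to start from"
    culprits = list(tests)
    for t in tests:
        trial = [x for x in culprits if x != t]
        if not run_tests(trial):   # still fails without t, so t isn't needed
            culprits = trial
    return culprits

print(narrow_down(list(range(20))))  # -> [3, 7]
```

The point is that every probe is a high-level toggle - no low-level code changes - so the set of possible causes shrinks at each step instead of exploding.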

Hypothesis: Need to roll back and toggle features when you have errors across functions.

Unit tests catch errors that are within a single function. For errors across functions, you need high-level toggling.

Backward Inference

Hypothesis: This is how you do backward inference. You narrow down some combination of high-level causes and test them till you find one whose effect matches the observed output. After that, you narrow it down at the next level, thus keeping the “complexity” - the number of possible causes - limited at all times.

Observation: The key was that turning all the high-level causes on gave you a positive answer (error) and turning them all off gave you a negative answer (no error). So, you could tell that the error must come from some combination of causes between zero and everything.

Which Input Configuration did I Fail To Handle?

Observation: Notice that I didn’t exactly break any of my older unit tests. What I broke was the underlying code (pipinit), which was exacerbated by the fact that I ran so many tests and ended up creating too many semaphores.

Observation: I added a new property to the program (semaphores), which caused problems.

Observation: I failed to ensure the property that the program used only a limited set of semaphores.

Question: Where could I have checked for that property?

Well, the number of semaphores created would be proportional to the number of times pipinit was called, which I assumed was just once during system initialization. Turned out that pipdelete also called pipinit. As did my test setup function (when trying to reset all pipes).

Observation: It’s simple. I should have handled the case where semcreate returns SYSERR. That’s all. That’s how stupid my mistake was. Cost me a full hour spent scratching my head.

Hypothesis: Bug -> ask “which input configuration did I fail to handle?”.

Observation: This would never have happened in Haskell. It would have given me a Maybe Semaphore, whose Nothing case I would be forced to handle.
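
A sketch of the missing branch, using a hypothetical Python analogue of semcreate that returns None (the moral equivalent of SYSERR, or Haskell's Nothing) when the semaphore table is full:

```python
from typing import Optional

SEM_LIMIT = 2     # pretend the system only has two semaphore slots
_created = []

def semcreate(count: int) -> Optional[int]:
    # Hypothetical analogue of XINU's semcreate: None when the table is full.
    if len(_created) >= SEM_LIMIT:
        return None
    _created.append(count)
    return len(_created) - 1

def pipinit() -> bool:
    sem = semcreate(0)
    if sem is None:        # the branch I forgot to write: semcreate can fail
        return False
    return True
```

With an Optional (or Maybe) return type, the "table is full" configuration is visible in the signature, so forgetting it takes effort rather than being the default.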

Lesson: In particular, look closely at the input type you’re getting.

For example, in pipe_disconnect_writer, instead of worrying about all the possible states that the rest of the system can be in, I should look at just the input pipe. It can be in a free, used, or connected state. If I handle all three cases faithfully, I’ve done my job. Nothing else to worry about.

Observation: The reader process could be in a waiting state when somebody disconnects it. So, I have to handle the different process states too.

Observation: I don’t know what to do in the various cases there.

Look at Properties of your own Input, not of Other Functions

Observation: Have to look at callers of pipe_disconnect_writer because I’m not sure if there can ever be a case where the reader disconnects first and leaves writer calling disconnect later.

Actually, I don’t have to. I can see that pipe_disconnect_reader always leaves writer disconnected. If the pipe to reader was in the PIPE_CONNECTED state, it would disconnect writer. Else, writer would have already been disconnected. Inference!

Hypothesis: I suspect I could avoid writer having to look through reader’s code by checking the invariants on pipe itself.

Pipe doesn’t even have a state for “reader disconnected but writer not yet disconnected”. That’s it. End of story.

Hypothesis: A function shouldn’t need to look at the innards of other functions to decide what to do. It should just look at its inputs, their properties, and its own desired output properties.

Debugging Principles

Hypothesis: You need to have a really small test case.

My mistake with the scheduler was that I was working with six processes all running at the same time doing god-knows-what. How are you supposed to debug that efficiently?

Observation: Not able to interpret the debugging output because there are too many printf statements and I don’t know what came from where.

Observation: Narrowed it down by toggling the high-level function calls. Things look much clearer. I know that the output is coming from a given two lines.

Observation: Narrowed it even further. Check only the semaphore part.

Observation: Right now, things “aren’t straight” in my head. I don’t know which configurations I’ve handled and which ones I haven’t. So, I don’t feel like I can reason about the behaviour of the program.

Hypothesis: Know which input configurations you’ve handled and which not -> feel confident about the behaviour of the program.

Hypothesis: Notice what changes you’ve made. Any change in behaviour is due to a change you’ve made.

So, look at your diff from the previous state.

Lesson: The moment you get an error, toggle back to a working state and then look at the differences.

Duplication Leads to More Input Configurations

Observation: The official code had a state field that could take the value PIPE_CONNECTED, etc. However, that was strongly correlated with the writer and reader semaphore being free or not-free. So, I may have to consider multiple input configurations that actually mean the same thing.

Hypothesis: Code duplication -> more input configurations that give the same output but which you have to handle separately in case they are given inconsistently.

For example, I would have to flag an error in the case where the variable says PIPE_CONNECTED but the semaphores are actually free. Unnecessary headache.

Corollary: What’s more, when you finally merge those redundant input states, you’ll have to do a lot of careful work to remove the separate if-cases.
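
One way to avoid the redundant state field is to derive the state from the semaphores alone, so the inconsistent configurations simply cannot be represented (function and state names invented for illustration):

```python
# Single source of truth: no stored `state` field that can drift out of
# sync with the semaphores. The state is computed on demand.

def pipe_state(writer_sem_free: bool, reader_sem_free: bool) -> str:
    if writer_sem_free and reader_sem_free:
        return "PIPE_FREE"
    if not writer_sem_free and not reader_sem_free:
        return "PIPE_CONNECTED"
    return "PIPE_PARTIAL"   # exactly one side still attached
```

There is no separate variable left to say PIPE_CONNECTED while the semaphores are actually free, so the "given inconsistently" branch never needs to exist.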

Truly Pure Functions

Observation: Right now, some of my unit tests depend on more than one function. For example, testing disconnect depends on the correct behaviour of “connect”.

Hypothesis: Pure functions won’t be like that. You can genuinely test their input configurations alone.

Observation: But disconnect will still call pipe-related functions like reset or set writer or whatever.

Hypothesis: Either your higher unit tests will test multiple functions or you will have to use equational reasoning.

Abstract Types reduce the number of Input Configurations

Question: How can we reduce the time taken to implement some features?

Observation: Time taken seems to depend on the number of test cases (aka input configurations) and the time taken to pass each test case.

Hypothesis: Abstract types reduce the number of input configurations and thus reduce the number of tests.

For example, sortInts :: [Int] -> [Int] has way more configurations than the generic sort :: Ord a => [a] -> [a].

Lesson: If you want to code faster, design your types well.
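
In Python terms, the same idea looks roughly like this: one generic sort that can only compare its elements, so there is nothing Int-specific (or String-specific) left to test separately:

```python
from typing import List, TypeVar

T = TypeVar("T")  # the analogue of `Ord a => a`: any comparable element type

def my_sort(xs: List[T]) -> List[T]:
    # The implementation only compares elements, so the same few structural
    # tests (empty, singleton, reversed) cover every element type at once.
    return sorted(xs)

assert my_sort([]) == []
assert my_sort([3, 1, 2]) == [1, 2, 3]
assert my_sort(["b", "a"]) == ["a", "b"]
```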

Understanding = Knowing the Output for each Input Configuration

Hypothesis: Understanding = knowing the output for each input configuration.

Observation: I keep saying “I don’t understand surrogate endpoints (or some other concept)”. But that doesn’t help me move forward.

Hypothesis: Say “I don’t understand concept X” -> don’t know how to move forward. Say “I don’t know the output for a particular input type X1” -> can look for resources that give you the output.

For example, I don’t know what happens if you predict for the follow-up study in cancer trials using “principal surrogacy” or using “Prentice’s criterion”. I need to figure that out.

Forming a Technique to Solve Problems

Question: How do you come up with a technique to solve, say, the do-calculus problems that you find in exams?

Hypothesis: Come up with a hypothesis for each input type and test it on the problems you’ve seen (and further problems that you invent).

Confidence

Hypothesis: Confidence <- knowing the possible input configurations and then knowing that you’ve handled all of them.

With my journal notes, for example, I don’t know the input configurations for some mechanism I’m trying to figure out (like how we learn) and so I don’t know if I’ve covered everything.

Assorted Ideas (TODO)

TODO: Composition (pipe_is_writer_semaphore_free = pipe_is_semaphore_free . pipe_writer_semaphore). Should you know the properties?

Duplication is bad because it makes you test essentially the same branch multiple times. For example, pipe_is_writer_semaphore_free and pipe_is_reader_semaphore_free both just call pipe_writer_semaphore with different arguments. But I have to run different tests.

Hypothesis: When you have roughly the same test for two different functions, you have duplicated some code.

Hypothesis: If two branches give you the same output, merge them.

Hypothesis: Uncertainty <- number of branches because the answer for each of those branches could be something different. You have to know all of the outcomes.

Hypothesis: Duplication -> takes more work to test and more work to understand.

Hypothesis: It’s a mistake to have duplicate code because you end up doing more work to test your program.


Hypothesis: In deliberate practice (and thinking in general), chunks correspond to the branches of your model.

So, chess grandmasters who have learned 50k chunks have learned 50k different branches of their model of chess.

Hypothesis: There’s a lot of duplication in there. You could probably merge parts of several branches. (Maybe it’s necessary for optimization - because they can recognize formations in a flash.)

TODO: Figure out how you can represent each chunk - aka branch - as a flashcard, for easy learning.

Top-Level Interventions vs Surgical Interventions

Observation: In the scheduler problem, I had to change one part of the system to affect the overall behaviour. In the pipes problem, however, I had full control of my system (since I was coding for the unit tests).

Question: Are they different in kind?

When There are No Readily Available Tests

Observation: Writing a research project proposal for a course - don’t know what is good and what isn’t.

Question: How do I get feedback? How do I go about implementing “code” (aka models) to handle different inputs?

Hypothesis: Causes and effects all the way.

For example, come up with the causes and effects of surrogate endpoints.

Interface = Possible Input Configurations

Hypothesis: Instead of talking about the “interface” to some program, talk about the different input configurations that it may receive. That, together with the expected output, defines its “interface”.

In other words, talk about the test cases it must satisfy.

Corollary: Minimal interface = few input branches.

For example, sort :: Ord a => [a] -> [a] has fewer possible inputs than sortInts :: [Int] -> [Int].

Minimizing the Outputs

Observation: We’ve seen how to minimize the number of inputs to a function by combining those that produce the same output into the same category.

Hypothesis: If you can get only a few possible outputs, forbid the other output values.

For example, a function that tells you whether or not a semaphore is free will return either true or false. Capturing that in an Int makes your interface unnecessarily large because the caller would have to assert that the output is either 0 or 1. So, make it a Bool.

Testing a High-level Function: Shouldn’t be Blind?

Question: Can you tell how many branches (and thus total test cases) a function will have, based on its type signature?

Hypothesis: It depends on the definition.

For example, if I write fourthRoot from scratch, I would have to test several input cases like 0, 1, -1, 0.01, 4, etc. But if I know that fourthRoot = sqrt <=< sqrt, I feel like things become simpler.

Hypothesis: Number of test cases <- amount of code duplication.

For example, sortInts and sortLists would share most of their code, except for the List- vs Int-handling parts. It would be stupid to test the same quicksort algorithm in both cases.

Corollary: Amount of duplication among tests -> amount of duplication among their functions.

Question: So how would you do it ideally? How to not duplicate tests between sortLists and sortInts?

Hypothesis: The answer would seem to be to extract their common code into sort.

Then, you would write sortLists = sort (or just use sort directly where you would have used sortLists).

Question: But say you did fourthRoot = sqrt <=< sqrt. What tests should you now write?

Hypothesis: You probably don’t need to check for invalid inputs like -1, etc. because sqrt already does that and returns Nothing.
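
The composition can be sketched with a Maybe-like Optional in Python (safe_sqrt is a hypothetical stand-in; Haskell's actual sqrt is not monadic, so the note's sqrt <=< sqrt presumes a safe variant like this too):

```python
import math
from typing import Optional

def safe_sqrt(x: float) -> Optional[float]:
    # Returns None for invalid input, like a `Maybe Double` in Haskell.
    return math.sqrt(x) if x >= 0 else None

def fourth_root(x: float) -> Optional[float]:
    # Kleisli-style composition: the second safe_sqrt runs only on a Just.
    s = safe_sqrt(x)
    return None if s is None else safe_sqrt(s)

print(fourth_root(16.0))  # 2.0
print(fourth_root(-1.0))  # None
```

The -1 case needs no dedicated test for fourth_root itself: the inner safe_sqrt already maps every negative input to None, and the composition just propagates it.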

Corollary: A tester who insisted on being blind to the code and coming up with “advance predictions” would end up duplicating a ton of work for fourthRoot. Ditto for sortLists and sortInts if he didn’t understand that they used the same sorting algorithm.

Hypothesis: This is why people use type systems, so that they don’t have to duplicate their test code. They can reason about things from the types themselves.

No Ifs, No Tests

Hypothesis: You need to write unit tests only if you introduce new if-statements.

For example, take f = uniq . sort. Why do you believe that that function will return a sorted list of unique elements?

Positive exemplar: When I wrote pipgetc, I had to test the cases where the function was supposed to return SYSERR. They were separate if-conditions in the code.

Hypothesis: Could it depend on the size of the function too?

Hypothesis: You should know the properties of each function in the chain.

For example, if I gave you doSomeMagic . sort, you wouldn’t really know what that function did.

Observation: Ad hoc if-conditions in each function create an exponential number of branches in a function chain.

In contrast, a return type of Maybe monadically composes to give you just two possible outputs at each point in the chain.

When can you Chain?

Hypothesis: You can chain together functions when each one “handles” all the possible outputs of the previous one.

For example, uniq handles the empty-list and normal-list outputs of sort. In sqrt <=< sqrt, the outer sqrt handles the Just and Nothing outputs of the inner sqrt.

Corollary: Functions not in a chain -> there could be output configurations you failed to handle.

For example, possible NULL values in C.

Observation: If I were doing pipe functions in a chain, I could be certain that they handled all the intermediate configurations. But since I was writing an imperative function, I didn’t know if I had handled the cases where the pipe became free after a call to wait or got disconnected or whatever.

Exams: Input Branch = Concept Needed

Hypothesis: Input branch = question that I have to answer. Which depends on the concepts needed.

Hypothesis: Input branch = concept needed. I need to have the correct output for each concept. Output = concept definition.

For example, I need to know what a “proper causal path” is, in case they ask for it in the exam.

Why Refactor your Proofs and Code?

Hypothesis: Refactor -> find simple solution -> store that as your output for that particular input.

Reuse that solution for other inputs -> learn to recognize that input type based on this technique.

For example, HW2 8(c) took a lot of trial and error for me to find the answer. (Still don’t know if I’ve made a mistake somewhere.) Now, if I cleaned it up and got to the answer in, say, three steps, then I could extract general lessons from that clean solution.

Observation: Basically, I’m unable to “draw lessons” from a jumble of trial-and-error proof steps.

Why not?

Hypothesis: I can’t see which parts of the input led to which parts of the output.

So, my current “proof” (more like one page of arrows and scribbles) feels like it pertains only to problem 8(c). I can’t see what I would reuse for a variation of the problem.

Corollary: This is why you need to do proofs over and over till you get a succinct solution. (Ditto for programming.)

Hypothesis: You need to bring your proof into a form where you can see that if you toggled a high-level variable A, you wouldn’t need to do step 1 anymore. If you toggled B, you wouldn’t need step 2 anymore, and so on.

Why can’t I do that now? The proof is so large and hairy that I can’t imagine the alternative.

Confusion

Confused by the Options

Observation: Still confused (even though I did the proof a few minutes ago).

Hypothesis: Confusion = don’t know the conditions when you should choose option A vs option B vs option C (in order to reach your target).

For example, I can see that two back-door paths from X to Y are potentially unblocked if I condition on W1 or W2.

What options do I have? I could apply rule 2 on X, rule 3 on X, condition on Z1, condition on W2, condition on W1, or condition on some combination of those 3. Which one should I choose?

Next, if I choose to condition on Z, what should I do? Rule 2 on X, rule 3 on X, rule 2 on Z, rule 3 on Z, condition on W1, etc.

There are tons of options!

Observation: Experts somehow slice through this thicket of options and home in on the correct one.

Hypothesis: Given the graph and the desired expression, there is a direct way to come to the correct step without much trial and error.

Observation: There are only three trails from X to Y.

Hypothesis: Maybe there is a simpler way to come to an ID expression without traversing the do-calculus expression tree. Use the trails somehow. (Don’t know how.)

Observation: Still feels like every step of the proof depends on every other step.

Basically, I’m still traversing the tree by trial and error. If I choose rule 2 on X, then the whole rest of the proof changes (it seems).

Question: Why not try all possible techniques?

Observation: Heavy resistance from my mind when I try to do that.

Hypothesis: I believe that trial and error is inefficient, so I have zero motivation for doing it. Correct solutions usually lead to the answers directly without much trial and error. That feels like the signature of a good technique, and conversely, trying things blindly feels like the signature of a bad technique.

That seems to have paid off in general. Looking for direct algorithms seems to have been a good heuristic in the past.

Corollary: That also makes me more confused than most, because I’m unwilling to accept trial and error where I can’t see why I’m using some technique, or use a half-cooked understanding of some technique. I need crystal-clear understanding before I take action.

Confused? Write a Test

Observation: Confused about whether a process waiting on a semaphore will get a return value of SYSERR or OK. Been thinking about it for the last 10 minutes.

Hypothesis: Run a simple test to find out the answer.

Hypothesis: “Confused” = don’t know the output of some branch.

Infer based on the Variables we Focus on

Hypothesis: Which conclusion we come to <- the variables we focus on.

Just focus on one variable, like how awesome romantic life is - feel like a life without romance is worthless. Focus on money and flashy cars and the other stuff rappers rap about - feel like a life without that is a waste.

Focus on more variables - get a more nuanced view. For example, celebrities may get way more compliments than you or me, but they also get a ton more hate. They get judged every single day for things they didn’t or even couldn’t do perfectly.

However, focus on too many variables - try to figure out a building’s architecture in one go (for a novice like me) - fail. Ditto for “complex” do-calculus problems - too much information for me to use.

Hypothesis: Confusion (one cause) <- focusing on too many variables.

Observation: I didn’t even know the type signature of “graphical condition” or “definition for surrogate endpoints” (when starting out on my surrogate endpoints project).

So, I took the wrong thing to be a valid “definition for surrogate endpoints” and found out a week later that it wasn’t actually valid.

Hypothesis: No type signature, lots of variables -> focus on too many variables.

No type signature, some other variables -> accept the wrong answer.

Type signatures -> may not know the answer, but won’t focus on the wrong variables.

Ignore Some Variables

Corollary: If you get convinced that all the variables you see are important, you will try to take all of them into account. And because you don’t have any high-level rule that covers a dozen variables at once, you will get “confused”. You won’t know which technique to use.

You could have done much better if you focused on just a few variables you “understood” (i.e., for whose configuration you knew how to respond).

Lesson: Sometimes, it’s best to reduce the number of variables you focus on. (When?)

Case Study: Edge of Tomorrow

Observation: Cage drops into the battlefield. No clue what to do. There are missiles flying every way, people fighting here and there, aircraft exploding above. What to do for this input configuration? Turn left or right? Run or walk? Or just lie low? Massively confused.

Observation: He doesn’t know which events to pay attention to. He keeps looking around.

Hypothesis: Doesn’t know which inputs are relevant for the techniques he has.

For example, he can’t do anything about the aircraft exploding miles away. So, ignore them. Ditto for the people already dead - no technique will bring them back. So, ignore them too. His gun can shoot only things that are nearby - so, focus only on targets that are nearby.

Observation: Guy spasming in his exo-suit. Cage gawks.

Observation: Hmm… A big part of being confused seems to be about wasting time on useless parts of the input.

Hypothesis: Given the techniques you have, you need to ignore all the parts of the inputs that are irrelevant. Otherwise, you will be overwhelmed because you have nothing stored in your mind for that exact “complex” configuration.

For example, Cage (along with us as the viewers) is overwhelmed by the fact that there are dozens of aircraft hovering above doing crazy shit, missiles flying everywhere, and so on. We have no clear rule for what to do in this situation. But once you realize that all you can do with your weapons is shoot things that are nearby, you will ignore all of these factors and focus on any mimics who are close. You know how to respond to that situation: see mimic, shoot it.

Hypothesis: If you have a bunch of clear high-level techniques but you focus on too many irrelevant variables, you will feel like you don’t know which technique you need for that configuration. You will be “confused”.

Lesson: Ignore irrelevant variables. Basically don’t waste time on useless things, like Cage gawking at the spasming soldier. Take your mind off those things.

Lesson: Don’t pay attention to anything that is irrelevant to your target and weapons.

For example, don’t pay attention to the missile that just took out the guy next to you (apart from remembering to duck and walk in zigzag patterns or whatever). Don’t pay attention to the guy on fire - can’t do anything about it; will only confuse you. Don’t pay attention to the aircraft on fire (past ensuring that it’s not going to fall on you).

Observation: Gun - some error message in Japanese - don’t know what to do for this configuration (in order to get his safety off).

Hypothesis: What options does he have? He could press some combination of the buttons on the gun. One of them must turn the safety off. It is probably not some “complex” procedure because people have to do it quite often. Focusing on the fact that it is in Japanese and that you don’t know the language and that you’re screwed and so on is unhelpful.

Observation: Crashing aircraft. He came up with the correct response - running and jumping into a ditch.

Observation: Swordsmen are famous for having their attention completely on their opponent. They don’t waste attention on worrying about what others think of them (because their weapon can’t help them do anything about that).

Observation: The only weapon he had left was the explosive. What is the only thing he could do with it? Detonate it when the enemy got close. He did that. Every other detail was irrelevant.

Observation: He wakes up after dying. Confused. Doesn’t know what to do for such an input configuration. What can he do? Live life again, become stronger, and hopefully kill some more mimics. The rest is kind of irrelevant. (Though he should figure out how he was awoken.)

Observation: Rita dies after killing the first mimic; the second time around, he’s coming up with a better technique for the same situation. (The theme of the whole movie.)

Question: Given a particular situation and target, how do you improve your technique?

Hypothesis: Use the scientific method with the input as the situation plus your technique, and the output as the outcome of your actions.

Notice how different variations of your technique lead to different outcomes and infer a model of which technique configurations lead to good outputs.

Hypothesis: Basically, use the outputs to categorize your technique configurations.

For example, if trying to convince the Sergeant gets you nowhere, stop doing that. Eliminate that variable from your technique configuration space.

Hypothesis: For each input configuration, choose the high-level technique that gets you the best outcome. Categorize inputs this way.

Observation: He jumped off right away. He knew that other techniques would get bad results. Landed on his feet too. He’s running with a purpose now. He knows what to do for the current input.

Observation: Got hit by the jeep once. Not again.

Relevant Variables Only but Still Confused?

For example, I didn’t know whether rule 2 of do-calculus wanted the graph with X’s incoming edges and Z’s outgoing edges removed, or the one with X’s outgoing edges and Z’s incoming edges removed.

Hypothesis: I knew that I needed to use rule 2 in that situation, but I was confused about the definition.

Hypothesis: This is just confusion at one level down. For the input “use rule 2”, I didn’t know the correct output.

Instead of being confused about the high-level technique to use, I was confused about a lower-level technique within it.

What if You have a Lot of Low-level Techniques?

For example, the gun had some error message in Japanese. How could Cage go about testing the different combinations of buttons on the gun to turn the safety off?

What if You Have No Techniques?

Question: What if you don’t have any high-level techniques?

For example, if you asked me to fix a broken car engine, I would have no clue what to do. There too, I would be “confused”.

“I’m confused” vs “I’m confused about X”

Test: EB - told him I was confused about the linear programming part of ACE. He refused to back down until I specified exactly which line I didn’t understand. And when I did write it out, it was obvious!

Novice: Go Slow

Observation: Felt a lot more in control when I applied each do-calculus rule deliberately. Basically wrote out the entire condition for the rule so that I wasn’t caught by edge cases.

Unit Tests are better than Blind Integration Tests

Observation: For HW2 in SML, I went with a benchmark of tests, where I understood neither the input nor the output data.

Observation: I can understand a single concrete test better.

Hypothesis: Go for well-understood simple unit tests over blind integration tests -> actually feel confident about your code vs don’t know why it works or even what it’s supposed to do.

Classified Information: Everything Should Flow

Hypothesis: Aim of information-storage: Collect all your data into the same flow. A model must either be an alternative to or an extension of another model.

Corollary: You shouldn’t have information that’s just lying around. It should fit within your overall model.

Detect Redundant Models using Concrete Examples

We don’t want to store more than one hypothesis for the same problem type. Then, we would get confused about which one to use. Hypotheses must either be alternatives to each other (with well-defined unique inputs) or extensions (where you use one after the other).

Question: How to detect whether you already have a hypothesis for a given problem?

Hypothesis: Use a concrete example vs use some name for the model -> recall your existing model for that example vs may not remember existing models you have for the same problem.

For example, “confusion” and “being overwhelmed” and “seems chaotic” and so on refer to the same problem - not knowing which output to produce for a given input. Things become clearer when I give the example of a battlefield (like in Edge of Tomorrow). All the above labels would apply, which makes me understand that I have more than one model for the same situation and I must mark them as equivalent (at least for this problem).

Corollary: Using lower-level concepts will do just as well.

For example, “boundaries” and “personal space” and “personal rights” all boil down to the economic concept of “property rights” - stuff that you control.

Static Types help you Abstract

Observation: Writing a Python function to flatten a YAML dictionary that had two redundant nested lists of dictionaries. Was going to write some hopelessly ad hoc function to deal with “innings” and “deliveries” (since those were the keys that had the nested lists).

Then realized that I should just write one generic function to flatten a list of dictionaries, and use that on innings and deliveries separately.

Would have been a no-brainer in Haskell. The type would have made it obvious (I think).
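A sketch of what that generic function might look like (hypothetical shape - the real script's keys were "innings" and "deliveries", but I'm guessing at the record layout): one function whose signature, a list of dicts in and flat rows out, makes the abstraction visible.

```python
def flatten_dicts(rows, key):
    """Flatten the nested list of dicts stored under `key` in each row,
    merging the parent's scalar fields into every child dict.
    (Hypothetical sketch of the generic helper, not the real script.)"""
    flat = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != key}
        for child in row.get(key, []):
            flat.append({**parent, **child})
    return flat
```

You would then call it once with "innings" and once with "deliveries" instead of writing two ad hoc loops.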

Hypothesis: Static types (a la Haskell) help you abstract, whereas dynamic types make it hard to see the patterns.

(Feels overly-general.)

Pass a Test with a Chain of Function Calls

Inference: A test will run only one branch. So, no if-statements needed. Probably just a chain of function calls.

Hypothesis: To pass a well-designed test, you will need only a chain of function calls.

One Test per Function in the Chain?

Question: How many tests for a function with no if-statements but a chain of function calls?

Hypothesis: When you have a chain f . g . h, you will need to show why you need each individual function (and if possible, each combination).

Test: In website-join-disjointed-paragraphs, I have

(my-map-paragraphs-in-buffer
 (-compose (-partial 's-join " ")
           (-partial '-filter 's-present?)
           's-lines))

I was puzzled. Why do I need that function in the middle: (-partial '-filter 's-present?)? I can see that s-lines turns a string with disjointed lines into a list of lines and that s-join joins them into one paragraph. When will you possibly need to filter with s-present??

I looked at my unit test for that function (I had just one) and couldn’t see why you needed that s-present? function. In fact, that test passed even without the middle function. However, another test, one that tested this function on a real-world input failed. Only then did I realize that that function was there to handle a trailing newline, which all my files have.

I tested that theory by adding a smaller, specific test with a trailing newline and saw that toggling s-present? made that test pass or fail.

Basically, I needed to show that my function definition needed that middle function in order to pass some test.

Compose Properties of Functions

It feels like that when we compose functions, we compose their properties to get the overall property.

Test:

(defun s-join-broken-paragraph (s)
  "Join broken lines in paragraph string S."
  (funcall (-compose (-partial 's-join " ")
                     (-partial '-filter 's-present?)
                     's-lines)
           s))

We know that s-lines will break a string into lines, filter s-present? will keep only those lines that are not empty, and that s-join " " will join those lines with a space in between. So, does that sum up to “join broken lines in paragraph string”?

Note that a big assumption is that there is only one paragraph with broken lines. If the input looks like:

This is
the first
paragraph. Phew.

This is actually the second
paragraph. Uh oh.

the output is:

This is the first paragraph. Phew. This is actually the second paragraph. Uh oh.

which is not what I want.

Observation: I’m assuming that a “paragraph” is some “complex” thing with lots of possible cases that I will have to handle.

But if we assume that one broken paragraph is just a bunch of lines with none of them separated by a blank line, then our job becomes a lot easier.

If we assume that a broken paragraph is a bunch of lines, then s-lines will get us all of those lines. Right now, the filter s-present? is just there to handle a trailing newline, so let’s ignore it. The s-join " " takes the lines and creates a string out of them.
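The same chain transcribed into Python (a sketch; note that `"a\nb\n".split("\n")` keeps a trailing empty string, just like s-lines does, which is exactly what the filter step is for):

```python
def join_broken_paragraph(s):
    """Mirror of the Elisp chain: split into lines, drop blank lines
    (such as the one produced by a trailing newline), join with spaces."""
    return " ".join(line for line in s.split("\n") if line)
```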

Hypothesis: To handle all possible configurations of some variable:

unpack it, do something to each part, combine them -> “handle” that variable.

Succinct Properties?

Test:

Prelude Data.Map> let d = fromList $ zip [1..] ['a'..'z']
fromList [(1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),(6,'f'),(7,'g'),(8,'h'),(9,'i'),(10,'j'),(11,'k'),(12,'l'),(13,'m'),(14,'n'),(15,'o'),(16,'p'),(17,'q'),(18,'r'),(19,'s'),(20,'t'),(21,'u'),(22,'v'),(23,'w'),(24,'x'),(25,'y'),(26,'z')]
Prelude Data.Map> let d2 = fromList . map Data.Tuple.swap . toList $ d
fromList [('a',1),('b',2),('c',3),('d',4),('e',5),('f',6),('g',7),('h',8),('i',9),('j',10),('k',11),('l',12),('m',13),('n',14),('o',15),('p',16),('q',17),('r',18),('s',19),('t',20),('u',21),('v',22),('w',23),('x',24),('y',25),('z',26)]

We can see that toList will get us a list of all the key-value pairs, map swap will make them value-key pairs, and fromList will construct a new map from them.

Of course, one edge case is when there is more than one key with the same value. Which one will be lost? The one that comes earlier in the key ordering.
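That duplicate-value behaviour has a direct Python analogue (a sketch): iterate the pairs in ascending key order and let later writes win, which mirrors fromList consuming an ascending toList - so the key that comes earlier in the ordering is the one that gets lost.

```python
def invert_map(d):
    """Invert a dict.  For duplicate values, the key that sorts later
    wins, mirroring Data.Map's fromList over an ascending toList."""
    return {v: k for k, v in sorted(d.items())}
```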

These two function chains were easy enough. When might function chains be hard to reason about?

You could have a really long function chain or you could have lots of possible variations.

For example,

Prelude Data.List> let xs = [1..10]
Prelude Data.List> take 3 . reverse . sort . nub . take 20 . sort . take 30 . cycle $ xs
[7,6,5]

Why is the output [7, 6, 5]? I don’t really know. But I can sort of follow the chain.
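The chain can be followed mechanically. Here is a Python transcription traced for xs = [1..10] (each comment shows the intermediate value):

```python
from itertools import cycle, islice

def mystery_chain(xs):
    """Transcription of:
    take 3 . reverse . sort . nub . take 20 . sort . take 30 . cycle"""
    step = list(islice(cycle(xs), 30))  # three copies of [1..10]
    step = sorted(step)                 # [1,1,1,2,2,2,...,10,10,10]
    step = step[:20]                    # cuts off inside the 7s: ...,6,7,7
    step = sorted(set(step))            # nub then sort -> [1,2,3,4,5,6,7]
    step = list(reversed(step))         # [7,6,5,4,3,2,1]
    return step[:3]                     # [7,6,5]
```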

Aren’t there all kinds of possible configurations? I don’t think so. It’s a list, so there are two cases - the empty list and the non-empty list. If xs is empty, the output at every point along the chain will be empty. So, that’s a trivial case.

Now, you just have to take a non-empty list. Shouldn’t you have to consider a lot of possible lists, maybe those that have just one element or five elements?

Observation: You can’t make any sort of closed-form claim about the output list. You can only say that it will have the output that is dictated by the chain.

Observation: In the key-value swapping example before, I could claim succinctly that the final map would have key-value pairs but in the opposite order.

Hypothesis: Some function chains lend themselves to succinct properties, some don’t. (How falsifiable!)

Observation: The long chain above, if encapsulated as a function, would have a very large interface. You couldn’t do better than just describing the chain itself.

Observation: When you mappend two strings, what you get back is still a string.

Question: Out of all the possible implementations of Monoid or Functor or Monad, why do we choose particular ones?

Test: What does it mean when you compose two Parsers? What properties are you composing?

abParser :: Parser (Char, Char)
abParser = (,) <$> char 'a' <*> char 'b'

Properties of Intermediate Values

Test: Poorly-written abstract function that I found hard to understand (part of a Sudoku solver I wrote):

boxes3x3' :: Eq b => Matrix b -> Matrix [b]
boxes3x3' = fmap concat . overMatrix (concatMap (replicate 3 . concatMap (replicate 3)) . map ( transpose . map byThrees) . byThrees)

What now?

Note, however, that the interface of the function is actually pretty simple. It needs to take a Sudoku matrix where each cell contains a list of its possible values and return a matrix where each cell contains the possibilities for all the cells in its 3x3 box. (A test case would have helped greatly.)

Hypothesis: You need to understand the properties of the data structure at each point in the function chain.

Ok. I understood the function. overMatrix lets you map over the rows in the Matrix. byThrees gets you three rows at a time. Then, consider transpose . map byThrees on the first three rows. map byThrees gets you the three column values at a time for each row. transpose on that for the first three rows gives you the first three 3x3 boxes in the Sudoku matrix. Basically, at this point, you’ve got a 3x3 matrix of boxes. You want to give the same box to all the 9 cells in each box, so concatMap (replicate 3 . concatMap (replicate 3)) turns the 3x3 matrix of boxes into a 9x9 matrix of boxes, where each cell has the 3x3 box that it resides in. fmap concat flattens each 3x3 box to give a 9-element list of its elements.

Observation: We can break this into three parts: the first part where you get a 3x3 matrix of boxes, the second part where you get a 9x9 matrix of boxes, and the final part where you get a 9x9 matrix of lists.

Hypothesis: Chain - get the data type at each point and ask for its properties with respect to your goal.

Question: How many tests would you need to be confident in your implementation of each part? How do you know that your code really will get a 3x3 matrix of the 9 boxes in the Sudoku matrix?

Well, the input numbers don’t really matter. So, that reduces a lot of possible input configurations.

Question: For any given input matrix, should we test for the entire output or can we just test a few cells?

Verify Properties using QuickCheck

Hypothesis: Function - verify its properties using QuickCheck.

Test: boxes3x3’ must return a matrix where each cell holds the contents of its 3x3 box in the input Sudoku matrix – we can use QuickCheck to verify that.
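A sketch of the style (boxes3x3’ needs its Matrix helpers, so a simpler stand-in function is used here): state the claim as a boolean property of an arbitrary input, then let the tool generate inputs. QuickCheck is the Haskell tool; in Python the hypothesis library plays the same role - here the property is only hand-checked on a few inputs.

```python
def prop_sort_keeps_length_and_orders(xs):
    """A QuickCheck-style property: sorting preserves length and yields
    an ordered list.  A property-based tester would generate xs at random."""
    ys = sorted(xs)
    return len(ys) == len(xs) and all(a <= b for a, b in zip(ys, ys[1:]))
```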

How Many Tests for a Chain of Functions?

Question: How many tests do I need for a chain of functions?

Just one? No matter what the functions are? Seriously?

Hypothesis: I want to know the causes of #tests for chain of functions. What are the variables? Number of functions in the chain, whether it’s about the limits of my mind vs the actual uncertainty I have about the code, the number of branches of each function, and the size of their interfaces.

Deliberate Execution of Hypothesis

Test your explicit hypothesis at each step by asking it what to do. If it’s blank, fill it in with your intuitive next step. For example, use your “one-button change” hypothesis to write your course programs.

Hypothesis: Live life “deliberately” -> get an explicit hypothesis about a lot of things you do.

For example, should you watch a YouTube video now? What is the expected outcome? What else could you be doing? Read a book? Ok. Let’s say that’s what you need to do right now. Well, does it work to simply tell yourself “I need to read a book now”? Probably not. What does work? What passes the unit test “have read a book for an hour today”?

Refactoring = Induction

Observation: I got the right answers to most questions in OS and SML. But I’m not confident about it. I felt like I just rote-learned most of the concepts the night before the exam.

Question: What’s the difference between rote-learning the night before the exam and proper learning over a period of time?

Question: What’s the difference between correct code that’s dirty and correct code that’s clean?

Observation: Refactoring gets you from dirty correct code to clean correct code. The final code is more general - easier to understand and change and reuse.

Hypothesis: The difference is induction.

Hypothesis: I’m confused about my course concepts and about dirty code because I haven’t got succinct general hypotheses from them. I’m still at the concrete level that’s hard to reason about because it’s verbose (and I don’t have a clear input-output relationship).

How Much Work is Refactoring?

Hypothesis: Two different concrete functions or hypotheses -> one function that does both [call this refactoring or induction].

Example: merge get_strike_rates and get_averages into a generic get_features.

Question: How much work is it to refactor (inductively generalize) two functions (hypotheses)?

Hypothesis: First, check where they differ.

If they are nearly identical functions -> just abstract the difference.

Test: get_strike_rates and get_averages had the same code except for one function call. I could accept a function argument for that call alone.
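A sketch of that refactoring (hypothetical data shape - assume each player is a list of (runs, balls) pairs; the real script's layout may differ):

```python
def strike_rate(innings):
    runs = sum(r for r, b in innings)
    balls = sum(b for r, b in innings)
    return 100 * runs / balls

def average(innings):
    return sum(r for r, b in innings) / len(innings)

# The two near-identical functions differed in one call,
# so accept that call as an argument.
def get_features(stat_fn, players):
    return [stat_fn(innings) for innings in players]

def get_strike_rates(players):
    return get_features(strike_rate, players)

def get_averages(players):
    return get_features(average, players)
```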

Learning and Induction

Question: Is this what I’m doing when I’m reading a new book that talks about a topic I’ve understood before? Am I trying to merge my current theory along with the new theory and evidence in the book to get a unified theory?

For example, if I read a book about the cognitive psychology of learning, I have to figure out how it fits with my existing model of associative memory and exponential decay. That feels like quite a bit of work.

Don’t Change a Function. Duplicate it.

Hypothesis: Want to change a generate_stats function to return the average instead of strike rate <- make a duplicate function that’s specific for average. You can refactor it later.

Corollary: This way, you can switch between them easily.

Hypothesis: Function <- input and output types, output properties (i.e., the whole interface)

So, if you want to change the interface of the function (return averages instead of strike rates), even if the type is the same (list of numbers), write a separate function instead.

Corollary: To avoid dead code, have an if-condition that lets you switch between the alternative functions.
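A sketch of that switch (hypothetical names): both versions stay live in the code, and one top-level flag selects between them, so neither is dead code and you can flip back at any time.

```python
def generate_stats_strike_rate(innings):
    runs = sum(r for r, b in innings)
    balls = sum(b for r, b in innings)
    return 100 * runs / balls

def generate_stats_average(innings):
    return sum(r for r, b in innings) / len(innings)

def generate_stats(innings, use_average=False):
    # The if-condition that switches between the alternative functions.
    if use_average:
        return generate_stats_average(innings)
    return generate_stats_strike_rate(innings)
```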

Writing Tests

Granularity of Tests

Question: How do you choose the granularity of the function for which you want to write tests? (Excellent question.)

Refactor Before you Commit

Observation: My generate_stats script is ugly and feels hard to refactor because it takes several minutes to run.

Observation: I didn’t refactor it the first time around when I was trying to just get the code to run. Yes, I want to pass the simple test as soon as possible, but I should refactor after that, so that my technical debt doesn’t pile up.

Hypothesis: Refactor before you commit -> easier to code in the future.

Don’t refactor before you commit -> end up with ugly code that’s hard to reason about.

Costly Feedback

Question: How to write code when running tests is costly?

For example, it takes several minutes to test whether my generate_stats script gives the same output as before. So, I hesitate to refactor any part of it, lest I break the code in some way that will cost me a lot to fix (where I’ll have to run the script for several minutes again to make sure I really did fix it).

Hypothesis: If you’re writing I/O code, separate the pure and impure parts. Now, you can quickly test and thus refactor the pure logic to your heart’s content.
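A sketch of the split for a script like generate_stats (parse_row and summarise are hypothetical stand-ins for the real logic): everything pure runs in milliseconds and can be refactored freely; the slow I/O is a thin shell around it.

```python
def parse_row(line):
    """Pure: parse one line into a number."""
    return float(line.strip())

def summarise(values):
    """Pure: the refactorable logic, instantly testable."""
    return sum(values) / max(1, len(values))

def run_script(path):
    """Impure shell: the slow file I/O lives here and only here."""
    with open(path) as f:
        rows = [parse_row(line) for line in f if line.strip()]
    print(summarise(rows))
```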

Consider All Possible Configurations

Observation: Good security professionals seem to consider different possible scenarios.

Hypothesis: Think of contingency plans to most possible scenarios -> high security vs poor security.

Observation: I haven’t considered different possibilities about my career plans. What if I hate my research area after a couple of years? What if I never make a breakthrough in “metacognitive algorithms”? No contingency plans.

Program to an Interface, not an Implementation

Program to an Interface: Hypothesis

Observation: When extending XINU in C, I wrote a few linked-list manipulation functions. They reimplemented several well-known operations, like extracting an element from a linked list. I shouldn’t have had to write them from scratch. But I couldn’t find any existing functions I could reuse (or rather, I didn’t look for them). Plus, it was a “unique” linked-list structure - one where the elements were part of a pre-existing table and were not dynamically generated. Still, there should have been a generic List interface that I could just use without knowing too much about the innards.

Hypothesis: Program to an interface, not an implementation -> code that is easier to understand, reuse, and modify.

For example, if you want to check if an element is a value in a Map, don’t search the specific functions within Data.Map. Look for the interfaces that it satisfies: Foldable, Traversable, Functor, Monoid, etc.

Test: You’re given Maybe (Int, String) and you want to map the string to its length (“foo” to 3). That is, you want to get Maybe (Int, Int). Don’t try to unpack Maybe into the two cases Just and Nothing and then try to convert the inner values. Instead, realize that you want to deal with Maybe as a Functor. – easier to understand, easier to reuse (you could use it with any functor of functor), easier to modify (you could change the definition of Maybe).

*Main Data.Map> let x = Just (3, "foobar")
*Main Data.Map> fmap (fmap length) x
Just (3,6)

Compare that to:

*Main Data.Map> let f (Just (x, s)) = Just (x, length s); f Nothing = Nothing
*Main Data.Map> f x
Just (3,6)

This is a very specific function. You don’t know exactly what it’s doing unless you examine both the branches and realize that neither x nor the Maybe structure is affected, but the string s is turned into its length. It takes a while to realize that the whole point of the function was to turn the string into its length. With fmap (fmap length), that purpose is front and centre. We know that we don’t care about the other value x or the structure of the Maybe value. Similarly, this is hard to reuse with any other type, such as a list of list of strings, because you have to handle a different type of structure, whereas the interface-version transfers easily:

*Main Data.Map> fmap (fmap length) [["yo", "boyz", "I"], ["am", "sing", "song"]]
[[2,4,1],[2,4,4]]

Likewise, if you want to do something other than get the length of the string, you know exactly what to change - there are just three words in the function - “fmap”, “fmap”, and “length”. Whereas in the explicit version, you have to search for the precise thing you want to change and be wary of not affecting any of the surrounding code.

Test: You have an XML-like tree of sentences and you want to get all the verbs used in the tree. (Assume you have a function that will extract verbs from a sentence.) – treat the tree as Foldable and use foldMap getVerbs.

Minimize the Number of your Weapons

Observation: How do you check whether a value is in a Map? You could do elem v (elems m2), but that would require you to know the function elems. Instead, you could just do v `elem` m2! Why? Because Map is in Foldable.

Hypothesis: Have interface for commonly-used functions -> don’t have to remember a lot of specific functions, can reuse code, can make your code more abstract.

Don’t have interface for commonly-used functions -> have to remember specific functions, can’t reuse code, code is specific to that data type.

For example, I would have to know that elems was the function that returned all the values in a Map.

Mock Objects

Test: Test-Driven Development: A Practical Guide has a whole chapter about different types of “Mock” classes - Verifiable, MockObject, ReturnObjectList, Expectation, AbstractExpectation, ExpectationValue, ExpectationSegment, etc. Do we really need all those classes?

Observation: Haskell doesn’t seem to use these kinds of things. I don’t hear advanced programmers like PG or Stevey or others using mocks heavily.

Inference: Mocks -> test behaviour of an interface without using costly I/O or computations.

You may simply have a lot of fixture to put in place, or you may be working in the context of a large, complex system. This could be a database, a workflow system, or some system for which you are developing an extension (e.g., an IDE). Your fixture may involve getting the system into a specific state so that it responds the way your tests require. This may be impossible to do quickly.

– Test-Driven Development: A Practical Guide, Chapter 7

Hypothesis: Separate pure code from impure code -> won’t need to use mocks (much?).

What about my scheduler? Didn’t I need a default scheduler?

Refactoring Hypothesis

Hypothesis

Starting out -> get overall type signatures and goals.

one function definition - abstract it and run tests.

two function definitions with duplicated code and slightly different goals - extract common abstract functions and run tests.

one abstract function, old unrefactored code that you’re trying to subsume - check if it’s a subclass of your abstract function (that it handles a subset of the branches?)

Question: What does this predict about programming to an interface? Will the code be easier to understand, reuse, and modify? Why?

Tests

Test: get_features_strike_rate and get_features_average with roughly the same code except for strike_rate and average – accept a stat_fn and have a common abstract function.

Rewriting: Refactoring Prose

Refactoring Prose

Acceptance test: Refactoring - default out instead of default in.

Given that I have 100k+ words in different essays with almost no acceptance tests

when I start with a fresh, small positive exemplar that I can make sense of and then bring in untested prose only when it passes some acceptance test

then I will have a working product at all times and be able to make sense of things and have everything well-tested.

Disjoint Hypotheses

Corollary: Two hypotheses, different output variables -> disjoint.

Two hypotheses, same output variables -> overlapping.


Test: category hierarchy <- locality; category hierarchy <- causal model – overlapping hypotheses.

Test: direction of car <- steering wheel; speed of car <- pedal – disjoint hypotheses.

Rewriting Mechanism

What is the input and output? The input has old hypotheses with some tests and perhaps new hypotheses with some tests. The output has hypotheses and tests that cover all previous tests and perhaps some new tests as well.

Hypothesis: Get the original output variables, get old tests, come up with your own causes, pass new tests, pass old tests -> get a hypothesis that covers all tests present and past.

Question: What are the tests I run to make sure the new version works the same as the old one?


Test: My summary of old category hierarchy ideas. I got it down from 3800 words to 1100 words, while adding new ideas and tests. Took around 3 hours. Original prose: around 30 “hypotheses”, 30 examples (no tests or even observations). Final prose: 8 hypotheses, 28 tests.

Number of hypotheses straight from the original: 3.

The final prose was strictly in the form of disjoint hypotheses followed by tests of each branch. Around 21 hypothesis branches (including corollaries). So, roughly one test per branch, sometimes more.

The original prose was in the form of repetitive and overlapping hypotheses with no branches, some examples, and no real tests.

What about the variables?

Here are the output variables from the final prose that were in the original prose: grammar; easier to code in; first-class categories for certain patterns; close to the problem domain; category property; mysterious category; can reuse category hierarchy; granularity of categories. (That covered basically all the hypotheses.)

Here are the output variables from the original prose: learning; grammar; category hierarchy; categorize a program; high-level patterns; extend the grammar; which parts are decoupled; behaviour of the whole as a function of the behaviour of its parts; simple language; mysterious properties; can reuse category hierarchy; composing categories; granularity of categories; recomputing categories;

Most of the output variables are the same.

Test: decisiveness - had three sections talking about it, collected them all under my newest explanation (indecisiveness is caused by optimizing for too many variables), was able to cover them all, shrank my prose. – I ignored the hypotheses and collected the tests.

Test: “Stuff not up to standards, insult -> negative reinforcement for better action” – it’s in the form of [situation, action -> result].

Refactoring Algorithm for Prose

TODO: Meta: This is subsumed by backward inference using category hierarchy.

Refactoring Algorithm for Prose: Hypothesis

Hypothesis: Refactoring algorithm for prose

Basically, refactor disjunctive hypotheses of the same type to get an inverse hierarchy of conjunctive hypotheses with criteria to choose among alternatives. Start with the topmost alternatives - the high-level variables like induction or goals or learning experiments. Figure out their effects and invert them to get the inverse lookup. Then recurse.

Compose hypotheses of different types to pass the acceptance tests you care about.

Steps:

Collect sections that talk about a common output variable.

For one output variable, refactor disjunctive hypotheses into a category hierarchy of inverse conjunctive hypotheses with criteria.

Compose hypotheses and test using acceptance tests.

Question: What about isolating and importing? Doesn’t that give you feedback quicker?

Hypothesis: Input - a bunch of sections talking about different types.

Desired output - acceptance tests that show key surprising insights; decoupled prose that explains those acceptance tests; unit tests for each function.

Properties: No “duplication” among the imported functions. Every function should be used in some acceptance test. Every function should have unit tests for its branches.

Don’t those properties hold now? No. Multiple sections talking about conservative focusing. Lots of concrete sections using the scientific method or the Baconian method.

Question: But refactoring usually means that you change the code but not the behaviour. So you need tests. (Or, at a minimum, you check that their type remains the same.)

But can’t you refactor purely on the level of expressions? map f . map g = map (f . g). Also, A, B -> C; A, B' -> C => A -> C.

Question: Do I need to write acceptance tests first? Do I need to bring in one section at a time or a bunch of sections on the same topic? Have to toggle the variables and find out.


Test: Not sure. Let’s just try with one label.

Test: Experiment - Bring in all sections on category hierarchy. No tests. See if you get refactored code anyway.

What does the input look like? What is its structure?

23 section headings. 119 paragraphs. 40 of them are hypotheses or corollaries or continuations of hypotheses. 31 are tests. 10 observations. 5 examples. 27 are random lines. The rest are break lines.

How many hypotheses can you reduce that to?

(Imagine that. I have 30-40 functions lying around in ~2k words of prose.)

You can’t get rid of the tests or observations or examples. They are valuable pieces of evidence.

Test: My “hypotheses” don’t look composable at all. They have arbitrary inputs and arbitrary outputs.

Write function definition -> get closer to the problem domain.

Input to a causal model is in the form of an AST governed by a grammar <- designed by the possible outputs.

Object, output of causal model -> category property.

I suspect reductionism - describing everything in terms of some core building blocks (like, say, functions) and avoiding unnecessary new categories - will clear things up.

Test: Some of it is better predicted by my latest hypotheses.

Hypothesis: Mysterious category = set of inputs for which you can’t predict the output of a model (when given just that category).

Test: Can’t make predictions - “magic” – Harry can’t say what will happen the next time (or what can be done and what can’t).

After reading Skinner, I now believe that a mysterious explanation means treating the cause as some hidden variable (to which you can add as much detail as you please) and ignoring relevant external factors. For example, when Minerva tells Harry it’s “magic”, she fails to mention that you need to have a wand and not be a Muggle or a Squib and know the spells and concentrate and stuff.

Test: Output variables -

size of code; size of grammar; number of design options; number of errors; first-class categories for certain patterns; how close you are to the problem domain; mysterious category; different label; can reuse category hierarchy; can use the same action on lower-level objects; granularity of categories; goal; categorize according to your goal; use extreme cases to test the label; concrete cue for the technique; notice abstractions more.

Refactoring Prose: All I care about is the Algorithm

Hypothesis: Maybe types or acceptance tests make things decoupled. You can change the functions implementing those tests without affecting the outward behaviour.

Hypothesis: The outward behaviour - the metacognitive algorithms - of my prose should remain the same. That is, I should get the same use out of them as before.

So, how do I use my prose? (The answer is I don’t.) How do I want to use my prose?

I want to have algorithms for thinking that I burn into my mind using practice. So, when I learn, I want to learn as quickly as possible, maybe by using a representative exemplar and following the values. When I research, I want to toggle variables as per conservative-focusing and focus-gambling. When I program, I want to do much of the same, along with using equational reasoning and acceptance tests and whatnot.

Corollary: This means the functions and their unit tests can potentially go to hell. All I care about is my algorithm.

Corollary: If I’m going to do the same thing as ever, there’s no point in having billions of little unit-tested functions. They don’t add any value to me.

Corollary: It should be easy to search for what to do in any situation.

Lesson: Judge your prose by the value of its acceptance tests.

Hypothesis: Actually, I think the mysterious explanations section is pretty good (once you add my latest hypothesis). It’s got some great unit tests. Remember that your mental model is a bunch of small functions.

Hypothesis: I have been throwing out hypotheses like confetti in this section. How do I merge all of them?


Test: Tons of text about categorization algorithm. Have I actually changed the way I categorize the books I read or the lectures I attend? – Oh, hell no. Useless.

Test: Want to remember how to figure out a mechanism – look at my induction notes. Hard to remember. Also, lots of scattered sections.

Test: Try to come up with a practical acceptance test where I use my categorization ideas. - .

Test: (Need a positive exemplar for a practical acceptance test. Some situation where I performed better because I knew about categorization.)

People use the trait “he is a problem child” instead of noticing how his tantrums have been shaped by the parents ignoring the current level and then giving in when he shouts louder. – people are ignoring the relevant factors in the environment and going for a mysterious explanation in the form of a hidden variable “problem child”.

Refactoring: Recognize a High-level Pattern and Abstract the Code

Hypothesis: Unique effect -> can search for it by goal (i.e., can categorize it).

Corollary: Come up with unique effects or diagnostic cues for all functions.

Corollary: Want to merge a function with refactored functions, look at the diffs between their outputs, abstract both of them if possible -> get merged refactored code.

Hypothesis: Recognize a higher-level pattern, use the higher-level function -> simplify this definition (minimize interface) -> easy to understand, easy to reuse, cheap to change, easy to recognize in future code.

Can’t recognize a higher-level pattern, minimize inputs -> get a high-level function, use it -> simplify this definition -> ditto.

Corollary: “Refactor” a function = recognize a high-level pattern and abstract.

Corollary: Final code will have high-level functions and various applications.

Hypothesis: Large pre-existing functions, try to recognize them in new function -> take a lot of time; probably won’t remember them.

Small pre-existing functions, try to recognize them in new function -> quick; probably will remember them.

Hypothesis: Goal of refactoring = get a bunch of functions whose patterns are easy to recognize in new code or new problems.


Test: Unique effect - confused about the explanation “problem child” - look at your notes on mysterious explanations.

Test: Merge two functions - positive exemplar.

sum [] = 0
sum (x:xs) = x + sum xs

length [] = 0
length (x:xs) = 1 + length xs

Explanation: Their empty list branches are the same. And their cons list branches are similar. So, a branch by branch comparison is one way to go.

The other way is to abstract sum fully.

sum :: (Num a, Foldable t) => t a -> a
sum = foldr (+) 0

length [] = 0
length (x:xs) = 1 + length xs

Now, you can no longer do a branch by branch comparison because sum doesn’t seem to have branches.

Hmm… You have to abstract length using the pattern of your foldr function. So, you must know how to recognize the pattern of foldr.

foldr :: Foldable t => (a -> b -> b) -> b -> t a -> b
foldr f z [] = z
foldr f z (x:xs) = x `f` foldr f z xs
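With foldr's pattern in hand, length can be abstracted the same way, completing the merge (a sketch; length' is named so as not to clash with the Prelude):

```haskell
-- length's cons branch ignores the element and adds 1,
-- so it instantiates foldr's f with \_ acc -> 1 + acc
length' :: [a] -> Int
length' = foldr (\_ acc -> 1 + acc) 0
```

Now sum and length are both applications of the same higher-level function.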

Test: Man, that is non-trivial work. For example, foo wouldn’t have matched even though it’s almost identical:

foo [] = 0
foo (x:xs) = 1 + foo (xs \\ [x])

You need to notice that the cons branch knows not just about (:) and 1 and (+), but also about (\\) and []. Also, the recursive call to foo knows about three things.

Explanation: Not really. All you’d have to notice is that it no longer satisfies the pattern of foldr.

Test: sortInts - minimize inputs till you realize that it can be written as a polymorphic sort. sortStrings will be easy after that because you can match the pattern of sort. – got a high-level function.

Test: Try to recognize map - effortless. – Any time you want to do something to all elements of a list, use map.

Test: Try to recognize a nested for-loop calculating subsets in a new code, or even sum (x:xs) = x + sum xs – hard to see the pattern.

Refactoring Case Study: Causal Hypothesis

Hypothesis: Algorithm - Bring in one important hypothesis. (Ignore low-value corollaries.)

Refactor it as much as you can by minimizing its interface and trying to reuse older functions. Rinse and repeat.


Test: Variables - bring in ideas one at a time into a fresh essay.

Make sure that the ideas in the essay are small and clean and easy to recognize in new prose.

Let’s see how long it takes to cover 145 sections and 54k words.

Stumbling blocks:

Don’t know what to do with a side corollary that may or may not be important. Don’t want to lose it.

Corollary: ATDD by example (page 143): divide input into equivalence classes and pick test cases “exactly at the boundary, at least one right in the middle, and one value right before the next boundary”. For multiple input variables, consider all combinations of their boundary tests.

Maybe just leave it in the original essay. We’ll dig it up if we need it.
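The boundary-test recipe from the corollary above, sketched on a hypothetical function (the function and its boundaries are made up for illustration):

```haskell
-- hypothetical function with two equivalence classes: scores below 60 and 60-plus
passes :: Int -> Bool
passes score = score >= 60

-- boundary tests: each boundary value, one mid-class value,
-- and the value right before the next boundary
boundaryCases :: [Int]
boundaryCases = [0, 30, 59, 60, 80, 100]
```

Running map passes boundaryCases exercises both classes at their edges and middles.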

Test: I’d been acting like I had to cover all the ideas I’d ever jotted down in my essay. – Nope. Life’s too short for that.

Search = Backward Inference

Hypothesis: Organize prose by topic -> decouple the different sections -> need to search less (? for what?).

Hypothesis: Search = backward inference. Need to eliminate variables that don’t cause the output. Basically, when you’re searching, you want to know which variables cause something. If you’ve organized your information poorly - that is, if you’re uncertain about it - then you will have to consider a lot of possible causes. If you’ve organized well, then you will know the exact causes.

Hypothesis: Lots of possible causes -> may have to get a lot of instances to eliminate unnecessary variables.

Corollary: Collect sections that have similar effects (aka categorize them), induce the common causes -> low uncertainty -> fast search.

Hypothesis: Minimal interface = fewer causes of the output.

TODO: Refactor the essay using your instincts and use the before and after to come up with hypotheses for refactoring. For example, is it the fewer comparisons thanks to the hierarchical structure that let you refactor things quicker or is that you’re merging one unrefactored function at a time with well-refactored functions? (A smaller example would be good too.)


Observation: [2018-01-13 Sat] Haven’t refactored any essay in years. Have just been piling on idea after idea. (Haven’t released them either, which usually forces me to clean them up.)

Test: Collect sections with obvious headings - collect labels first (by going through the headings). Took about 18m.

Test: Unrefactored vs refactored code - sum of numbers in a list of lists. What’s the difference?

for (int i = 0; i < n; i++) {
    for (int j = 0; j < k; j++) {
        sum += a[i][j];
    }
}
sum . map sum

Explanation: In the C code, the output is caused by a lot of different things like the two for-loops, i, j, sum, and a. – Have to get more evidence to eliminate those unnecessary causes (aka refactor the code). In the Haskell code, the output is caused by just the three things sum, (.), and map. These are necessary causes - no more work need be done.
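The refactored version can be checked directly (a minimal sketch):

```haskell
-- nested sum via composition: only sum, (.), and map appear
total :: [[Int]] -> Int
total = sum . map sum
```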

Test: Search causal hypothesis essay for stuff on category formation - have to go through a lot of sections. – Don’t know exactly which ones say something important about it.

Test: Want to “refactor” ideas on programming - again, have to collect all the sections that mention programming (i.e., cause some insight about programming) – will take a lot of time because I don’t know which ones they are.

Hypothesis: Disjunctive hypothesis, want to get some output -> have to look at all the possible causes of that output.

Conjunctive hypothesis, want to get some output -> have to look at one direct cause.

Hypothesis: Unrefactored code = disjunctive hypothesis.

Corollary: Refactoring = inducing a conjunctive hypothesis out of a disjunctive hypothesis.

Corollary: Programming to an interface, not an implementation = extracting a conjunctive hypothesis as an interface so that you have just one cause for the effect.

Hypothesis: Disjunctive hypotheses are the root of all evil (uncertainty).


Test: Binary search - want to find a value - if value == arr[mid], else if value > arr[mid], etc. – Have to look at maybe three if conditions. Compare that to an encapsulated function find (== value) :: Eq a => [a] -> Maybe a - you don’t have to look at any other functions.
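The encapsulated version in miniature, using Data.List's find (the example values are made up):

```haskell
import Data.List (find)

-- one cause for the effect "was the value found?":
-- a single call to find, rather than scattered if-conditions
findValue :: Eq a => a -> [a] -> Maybe a
findValue value = find (== value)
```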

Test: Want to learn how to solve the subset sums problem - could look it up on this website or that one or Wikipedia. Don’t know which ones will actually give you the “aha!” you’re looking for. – Lot of work.

Test: for-loop to get sum of list; for-loop to get length of list. Feels like list + sum -> one program; list + length -> another program; list + map -> another program. – Have to look at all those possible programs. But you can cover all of them using foldr.
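The claim that foldr covers all of those programs can be checked by writing map itself as a fold (a sketch; mapAsFold is named to avoid the Prelude's map):

```haskell
-- map as a foldr: the cons branch applies f and reattaches the element
mapAsFold :: (a -> b) -> [a] -> [b]
mapAsFold f = foldr (\x acc -> f x : acc) []
```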

Test: for-loop in two places to initialize page directory - feels like a page directory can be created in two ways. When you want to create a page directory, you’ll have to look at both of them. – But it’s just one way - so, extract an initialize-page-directory function and make that explicit. Just one way to do it.

Category Hierarchy Design: No Alternatives unless you encode their Criteria

Corollary: Criteria to help you choose among alternatives -> exactly one cause of each effect.

No criteria and free-form alternatives -> lots of causes of each effect; have to test all of them.

Hypothesis: Hierarchy = inverse lookup from effects to causes for a particular type (not an inverse causal structure). It’s just a data structure to help look for the right function matching a contract.

Hypothesis: Hierarchy, want to do X and Y and Z -> go “down” the hierarchy via backward inference.

Hypothesis: Different properties -> may have different “hierarchies”.

Corollary: Want to get something that satisfies different properties -> have to unify output sets from different hierarchies. (Not sure.)


Test: Sort with guaranteed O(nlogn) time + extra space - merge sort; sort quickly on average - quicksort; sort almost sorted list - insertion sort; sort integer list - radix sort. – choices governed by criteria; not free-form.

Test: Drive a car - could be a Ferrari or a Toyota Camry or a Tesla or whatever. – unstructured choices; have to test all of them.

Test: Want to sort - use the sort algorithms hierarchy; need it to be in-place - look at quick sort and other algorithms; need it to work well for almost-sorted lists - look at insertion sort. – backward inference.

Test: Get the right function matching a contract - function to get a sorted list in O(nlogn) worst-case time with O(n) additional space - merge sort. Function to get O(nlogn) average time and in-place - quicksort. – they have the same type signature (Ord a => [a] -> [a]).

Test: Want to get the best restaurant in town - ask certain friend and not others. Want to get the best textbook for botany - ask some other friend. – different hierarchies.

Category Hierarchy: Inverse Lookup

Question: If we do backward inference so often, why not represent our programs so as to make it easier? (Wait. Isn’t that just a type signature?)


Test: Speed of sorting algorithm - O(n) - radix sort, etc.; O(nlogn) - quick sort (average case), merge sort, etc.; O(n^2) - insertion sort, etc. – You need to construct that inverse list explicitly (unless two-way associations already do that for you).
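That inverse list can be constructed explicitly as plain data (a minimal sketch; the algorithm groupings are abbreviated):

```haskell
-- inverse lookup: from complexity class (effect) to algorithms (causes)
byComplexity :: [(String, [String])]
byComplexity =
  [ ("O(n)",       ["radix sort"])
  , ("O(n log n)", ["merge sort", "quicksort (average case)"])
  , ("O(n^2)",     ["insertion sort"])
  ]
```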

Test: House - milky tears and compulsive behaviour and been around an earthquake => spores in the brain – inverse lookup.

Acceptance Tests: Show how Functions Compose to do what you Want

Hypothesis: Acceptance tests -> show how functions compose to do the things you care about.


Test: Maximum segment sum – we can see how to compose the individual functions to do what we want.

mss = maximum . scanr (<+>) 0
  where x <+> y = 0 `max` (x + y)
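A runnable sketch of the maximum-segment-sum composition, with a monomorphic type for concreteness:

```haskell
-- maximum segment sum: scanr computes, right to left, the best sum
-- of a segment starting at each position (clipped at zero)
mss :: [Int] -> Int
mss = maximum . scanr step 0
  where step x y = 0 `max` (x + y)
```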

Acceptance Tests: Get an Algorithm for it

(Notes from Continuous Delivery, chapter 8)

Concept: Acceptance test = “verify that the acceptance criteria have been met”.

Concept: Functional acceptance criteria = input-output behaviour of a program.

Concept: Non-functional acceptance criteria = “capacity, performance, modifiability, availability, security, usability, and so forth”.

Hypothesis: Acceptance tests pass -> you do what the customer wants (and has specified); have not regressed.

Hypothesis: Acceptance tests -> less costly to create and maintain than frequent manual acceptance testing and regression testing and less costly than releasing poor-quality software.

Hypothesis: Acceptance tests -> catch serious problems that unit or component test suites could never catch.

Hypothesis: Want to know when to use acceptance tests or want to design a new method of testing or want to convince others - need to learn all the reasons why you should use acceptance tests.

Simply want to use acceptance tests - need to learn how to use acceptance tests.

Basically, learning all the reasoning behind acceptance tests does nothing for passing your one acceptance test, which is to write good acceptance tests for your projects.

Corollary: What I want is an algorithm, not a causal model.


Test: 25 sentences per page x 30 pages per chapter = 750 sentences per chapter. If we assume that one sentence corresponds to one branch of a causal model, that’s 750 branches.

Test: First, he gave the high-level reason why acceptance tests are worth implementing. Then, he dived into the detailed reasons.

Test: No negative exemplar for acceptance tests.

Test: Manual acceptance testing - if it is not done every time it is released, defects will be released.

Test: Manual acceptance testing - if it is done every time it is released, defects will not be released.

Test: Manual acceptance testing is costly (positive exemplar - $3mn per release).

Test: Release is costly -> can’t release frequently.

Test: Manual acceptance testing - needs to be performed after dev-complete and as a release is approaching.

Test: Time between dev-complete and a release is a time when teams are under extreme pressure to get software out of the door.

Test: Implement your test in the domain language; otherwise it will be brittle. A change in the UI would break many of your tests.

Test: Ideally, we should test using the user interface of the system; otherwise, we’re not using the same code paths as the users.

Acceptance test format: Test your Mental Model

Hypothesis: If it’s a technique or concept, test your mental model on a real-world problem (that tests only one or two more features than before).

If it’s a project, examine it with a reader’s eye.

If it’s a habit, check that your rate of action isn’t regressing.


Test: [2018-03-08 Thu] Currently, my acceptance tests just lie there inertly, all 260 of them, without me doing the slightest bit to run them. – takes too long to run them all.

Test:

Acceptance test: Behaviour - watching YouTube videos when I’m tired.

Given that I don’t “feel” like working (or it has not been reinforced)

when I feel like spending time on YouTube

then I will review flashcards instead.

Explanation: How will I know if I have passed that test? Many hypotheses are hard to falsify; this one is hard to confirm (and yes, getting a hypothesis up to 99% posterior probability is enough to call it “confirmed”). [go by rate of videos instead]

Test:

Acceptance test: Algebra of programming - greedy algorithm.

Given the shortest path problem

when I try to use algebra to get the answer

then I should get Dijkstra’s.

Explanation: This one’s just too hard.

Test:

Acceptance test: TA recitation 1 - content.

When I look at the presentation

then I must have an overall structure.

When I look at each point in my slides

then I must have a real-world example.

When I think of noise in data vs outlier

then I must have an answer.

Explanation: This one takes a really long time to run. I have to go over the whole presentation each time I change it. [go over your final presentation with a reader’s eye]

Test: [2018-03-08 Thu] Currently, my tests are too hard, take too long to run, or are ambiguous. – bad.

Test: My Python tests for softsec, on the other hand, were easy to pass, quick, and unambiguous. – better.

Test: Real-world problem to test my technique on - solve problems from past assignments or the textbook exercises – good test of your mental model.

Test: Only slightly harder (add just one feature) - have learned what min-cut means (but not Dinic’s algorithm) - so test yourself on a min-cut problem – just one new feature.

Separate Tested and Untested Prose

TODO: Refactor above prose on tests to show all of their effects in one place (like no overlapping hypotheses, short cycle time, etc.)

Hypothesis: Have lots of untested prose -> no tests or type signatures; not refactored into small functions -> hard to find existing functions for your use case -> hypotheses may overlap; hard to use or reuse; hard to change.

No tests or type signatures, but refactored into small functions -> TODO.

Hypothesis: Try to change large untested prose in place -> still lots of prose with no tests or type signature; don’t know which parts work and which parts don’t -> may still have the above problems; have not compared functions for duplicated code; have to wait till everything is fixed before you can use it “as a whole” (no overall flow?); not clear how much work is left.

Import only those functions that have been tested and refactored -> all prose has tests or type signatures; is refactored; know exactly which parts work and which parts don’t -> easy to find existing functions for your use case -> hypotheses won’t overlap; confident that every function has been compared against other functions for duplicated code; easy to use or reuse; easy to change; clear how much work is left.

Lesson: Stop treating your untested prose as if it is actually worth anything. Sequester it and bring in only clean tested prose.


Test: Causal hypothesis essay - lots of scattered ideas; haven’t explained the old evidence using my latest hypotheses like induction or categories as equivalence classes; no “overall structure” - when I’m given a new problem, I don’t have a clear algorithm with which to attack it. I don’t know which idea to use. All those ideas lying in there are useless if I can’t get at them when I need to.

145 level 1 headings, 188 level 2 headings, 54k words. That is not a sane organization of my ideas.

Explanation: It’s too large because I haven’t explained all the evidence using my latest hypotheses and rather have some obsolete hypotheses lying around.

It’s hard to use because I don’t have concrete acceptance tests (or type signatures) to guide me through each scenario I may be in.

It’s hard to reuse in my future essays too, because I simply don’t remember what ideas I have had.

It’s hard to change because I don’t know where to insert a new test and add some hypothesis explaining it.

Also, there are wrong, untested hypotheses.

Finally, I don’t know how much work I have left. Nothing I do seems to make a dent.

Refactoring: Merge One at a Time

Hypothesis: Merge n unrefactored functions all at once -> don’t get feedback until you’ve covered them all - slow; don’t know the impact made by each section (by looking at the prose before and after toggling it); hard to debug - which section is getting you stuck?; have to compare each section with every other section for duplication because you don’t have tests or types - laborious.

Merge one unrefactored function at a time with refactored functions -> get feedback after every function - quick; know the impact of each section as you toggle it; easy to debug; have to compare only with a few sections of the same type - quick.

Corollary: When summarizing a book or some other prose sources, merge sections one at a time with your current clean hypothesis -> quick.

Lesson: Algorithms win. Don’t fret about the “complexity” of “managing your code”. Ask instead how long it takes to merge n sections all at once vs one at a time.


Test: If I think of refactoring the essay right now, it feels like I’ll have to compare each section against each of the other sections for duplicated code and thus take O(n^2) time (?). Maybe 15 of the 145 sections will combine to give a new section and, to figure out which they are, I have to read the essay over and over. Plus I won’t know how much progress I’m making or whether I’ve actually covered all the ideas that I should have.

If I try to refactor (or start testing) prose by importing an untested, unrefactored section into an essay full of tested and well-refactored code, where each function is guaranteed to have no code in common with any other, then I can simply split it into parts if needed and merge them with whichever clean functions I need. At each point, I’ll know exactly how much progress I’ve made. And I’ll know that the essay (or program) is clean in the sense of having no duplicated code among its functions.

Plus, if the clean code was written, well, cleanly, then you can merge new unrefactored code pretty quickly.

Explanation: Have to compare each section against every other section - time-consuming; don’t get feedback until it’s all “done”; don’t know the contribution of each section.

Hierarchy gives you Fast Lookup

Hypothesis: Hierarchy by topic -> have to search only through a little (ideally, don’t have to search at all). Roughly O(log n) lookup (?).

No clear hierarchy by topic -> have to search through a lot to get what you want. Basically, O(n) lookup.

Corollary: Man, lookup takes a lot of time. Well, it is a function of your uncertainty about the prose. The more unstructured your prose is, the more uncertain you will be about it, and thus the more you will have to search.

Corollary: Section headings give you a lot of clues about their contents.


Test: Try to get sections that are covered by one test - had to walk through a lot of sections just to look for those that talk about refactoring prose. Just reading the headings of the 145 sections (around 800 words) took me around 15 minutes. Not cool. – Have to narrow it down even further. O(n) lookup.

Test: “Forward and backward inference” - straightforward enough. “mathematical research” - ditto. – Nice.

Test: causal hypothesis essay - which sections talk about goals? I can use imenu to get the four sections that have “goal” in their headings. That doesn’t mean I’ve exhausted all goal-related sections, though.

Hypotheses about Hypotheses: Model of the Domain

Hypothesis: My state-transition model of programming seems to cover all the hypotheses that talk about amount of uncertainty about your program’s correctness or the amount of work required to fix bugs.

Hypothesis: One type of hypothesis: move from concrete inputs to an abstract model, make abstract predictions, and then convert them to concrete predictions.

Test: Toggling one design variable at a time seems to be easier to handle than changing multiple variables – the state-transition model of programming seems to explain this.

Observation: My other “hypotheses” seem to be heavily under-specified. For example: [Multiple changes when trying to go to a desired state -> could leave the system in a state that is “hard to reason about”]. The undefined terms here are “the system”, “state”, “multiple changes”, and “go to a desired state”.

Question: How did you locate that abstract model in the first place? Aren’t you overgeneralizing from a few examples (a1 -> b1 to a -> b)?

Question: How do we write normal abstract programming functions?

Hypothesis: We categorize - we look at all the possible input configurations and group together those that give the same output configuration.

Hypothesis: We basically considered the simplest function definition for each input configuration and then merged them all, leaving aside some differing components.

Skill Acquisition

Skill Acquisition: Hypothesis

Hypothesis: Skill you’re trying to learn, have a “unit test” for it -> know whether you have actually learned it or not

Skill you’re trying to learn, don’t have a “unit test” for it -> don’t know whether you have actually learned it or not

Skill you’re trying to maintain, run your “unit test” for it -> know whether you still have it or not

Skill you’re trying to maintain, don’t run your “unit test” for it -> don’t know whether you still have it or not

Hypothesis: Situations, possible actions, situation-action associations - learn how to apply the actions to anything and then store the right action for each concrete situation -> skill acquisition.

Learning Abstract Ideas

Hypothesis: Procedure you want to learn as skill, refactor it, make flashcards as unit tests for the different configurations -> build the cue-memory associations.

Test: Making a junction tree out of a Markov random field - “first high-level step” -> “variable elimination”, etc.

Fluency and Flashcards

Hypothesis: Test yourself using flashcards repeatedly (overlearning) -> high fluency.

Test: Reviewing flashcards for SML final exam - went from 27 seconds per flashcard to 8 seconds to 5 seconds. For the causality midterm - went from 30 seconds per flashcard to 9 seconds to 4 seconds! – tested again and again - high fluency.

Skill Acquisition: Tests

Test: I want to reinforce good behaviour. Am I doing that? (Too vague, I suspect.)

Test: I learned about inadequacy analysis. Do I think about the amount of work required to get some value when analyzing HP fanfiction (like in Eliezer’s Hero Licensing)? – Is this a good unit test? I don’t know.

Test: The classic graph for front-door criterion - will you apply the front-door criterion? I suspect that I will remember the solution directly.

Question: How to avoid remembering the solution and actually generating it?

Test: Reinforcement - I give away my ideas for free thus reinforcing freeloading.

Question: How to remember to do stuff on specific cues? Can you have unit tests for that?

Test: People come up with some suggestion - I can either reinforce or punish. Choose one of the four options. – just four possible abstract actions. However, there are a lot of concrete situations, like “when something falls short of your standard” or “thought about past action” or “someone tries to manipulate you”. Store the right action for each situation based on careful analysis or just feedback.

Motivation aka Ignition: Close Positive Exemplars

Hypothesis: See someone similar to yourself achieve great things i.e., close positive exemplar -> feel that you could do it too if you made those small changes.

See someone dissimilar to yourself achieve great things i.e., distant positive exemplar -> feel that they must have some special “aura of destiny” and that you couldn’t do it no matter what.


Test: Me and Samsung Patrick. Korean female golfers and the first one. – close positive exemplar.

Test: Einstein. Mad scientists like Rick or whoever. “Geniuses” like the famous startup founders who talk weirdly and wear sweatshirts. Mathematical “geniuses” who look and act completely differently from normal people. “Smart” people using big words and an affected American accent - all to say that restore() simply sets the one bit back to what it was. – distant positive exemplars. Feel like I could never be like them. But I don’t have to be like them, I have to do what they did. And for that, I don’t have to emulate all their other characteristics.

My Experiments with Programs

TODO: Test the Roman Numerals problem again by using an integration test from the problem statement to get the high-level design.

Experiment: Program I know how to write

Test: Roman Numerals.

Failed miserably to write the program quickly (or even correctly).

Question: I’m terribly confused. WHAT THE FUCK takes so much time when writing a program?

Why did it take me 24 minutes to figure out how to concatenate [Parser [a]]? (Ed: [2017-12-25 Mon] You weren’t isolating the mechanism? But you were working in the interpreter. Then, maybe you weren’t being smart about your changes.)

Where was I searching and what could I have tried instead?

Roman Numerals: Observations

Question: Should I have toggled some fixed assumption? Which one?

Let’s look at this as scientific experimentation.

Goal: My job was to find a correct program in the space of all programs - one that would pass my tests.

Observation: An empty program -> will fail tests

Observation: My Seasoned Schemer program (or some other random program) -> will fail tests (how did I know that?)

Observation: My old Roman Numerals program -> passes tests

My tests tell me that my final program should have the type romanToInt :: [Char] -> Int, should type-check, and further accept a valid roman numeral as input and return its decimal value as output.

I suppose that valid roman numeral to int is easy enough. (How do I know that?)


3m to figure out that the program needed just two simple functions intToRoman and romanToInt, with no parsing.

4m to figure out how to use Parser and run it with a simple char 'f' Parser (memory)

10m to create the RomanNumeral type and create a one-character parser

3m to create an alternative parser

8m to append list of RomanNumeral

24m to figure out how to concat [Parser [a]] - writing the type signature would have helped me out

15m to realize that it doesn’t work for “XX”.

Look at everything I tried to concat [Parser [a]]:

Main> parse (numeralParser M) "(unknown)" "MMCMMMoobar"
Main> let foo = parse (numeralParser M) "(unknown)" "MMCMMMoobar"
Main> liftM concat [foo, foo]
Main> let bar = parse (numeralParser CM) "(unknown)" "MMCMMMoobar"
Main> let bar = parse (numeralParser CM) "(unknown)" "CMMMoobar"
Main> bar
Main> liftM concat [foo, bar]
Main> liftM (liftM concat) [foo, bar]
Main> :t liftM concat [foo, bar]
Main> :t [foo, bar]
Main> :t liftM concat
Main> liftM concat [foo, bar]
Main> concat . liftM concat $ [foo, bar]
Main> let bar = parse (numeralParser CM) "(unknown)"
Main> :t bar
Main> let foo = parse (numeralParser M) "(unknown)"
Main> :t foo
Main> bar <> foo
Main> :t fmap . fmap
Main> :t fmap . fmap . fmap
Main> :t (fmap . fmap . fmap) foo bar
Main> :t (fmap . fmap . fmap) (++) foo bar
Main> :t fmap . fmap . fmap
Main> :t (fmap . fmap . fmap) (++) [foo, bar]
Main> :t (fmap . fmap . fmap . fmap) (++) [foo, bar]
Main> :t (fmap . fmap . fmap ) concat [foo, bar]
Main> :t mParser
Main> :t numeralParser M
Main> let x = numeralParser M
Main> let y = numeralParser CM
Main> :t [x, y]
Main> sequence [x, y]
Main> :t sequence [x, y]
Main> :t fmap concat
Main> :t traverse
Main> :t sequence ["yo", "boyz"]
Main> sequence ["yo", "boyz"]
Main> sequence [Just "yo", Just "boyz"]
Main> :t fmap concat . sequence [x, y]
Main> :t fmap concat . sequence $ [x, y]
Main> parse romanNumeral "(unknown)" "CMMMoobar"
Main> parse romanNumeral "(unknown)" "MMMCMMMoobar"
Main> parse romanNumeral "(unknown)" "MMMCMMMoobar"
Main> :type intToRoman'
Main> main
Main> parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
Main> map show $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
Main> map (map show) $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
Main> :t parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
Main> fmap (map show) $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"

I tried 50+ different things before I got the (incorrect) final program.

Observation: Firstly, I went wrong in working with foo and bar in the interpreter, which were of type Either ParseError [RomanNumeral] and not Parser [RomanNumeral] like the parsers in my code.

Observation: I tried

liftM concat
liftM (liftM concat)
concat . liftM concat
bar <> foo
(fmap . fmap . fmap) foo bar
(fmap . fmap . fmap) (++) foo bar
(fmap . fmap . fmap . fmap) (++) [foo, bar]
(fmap . fmap . fmap ) concat [foo, bar]
sequence [x, y]
fmap concat
traverse
sequence ["yo", "boyz"]
sequence [Just "yo", Just "boyz"]
fmap concat . sequence [x, y]
fmap concat . sequence $ [x, y]

Hypothesis: Clearly I was trying stuff at random (fmap . fmap . fmap . fmap).

Hypothesis: Rather, I thought the problem was something, but it turned out to be something else.

Observation: Before the fmap . fmap . fmap attempt, I saw that

bar :: [Char] -> Either ParseError [RomanNumeral]
foo :: [Char] -> Either ParseError [RomanNumeral]

and so thought that I needed to go three levels deep (wrong - it was two) so that I could fmap append (wrong - fmap is for one argument) the roman numerals inside foo and bar (wrong - I cared about the Parsers in my code not the output).

Observation: I was worried that I would get combinations of the inner array if I used sequence.

Observation: I didn’t remember (or didn’t understand) that sequence on [m [a]] gives m [ [a]] without touching the inner array.

In fact, I should have generalized further. What I wanted was [m b] -> m [b].

Hmm… I was thrown off by the fact that I wanted to concatenate the inner lists.

Hypothesis: So, I tried to jump from [m [a]] to the final m [a] in one step. And that is why I spent 24m trying to do a simple concatenation.
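The two-step route, sketched with Maybe standing in for Parser (the generalization works for any monad):

```haskell
-- step 1: sequence    :: [m [a]] -> m [[a]]  (no combinations; inner lists untouched)
-- step 2: fmap concat :: m [[a]] -> m [a]
concatInner :: Monad m => [m [a]] -> m [a]
concatInner = fmap concat . sequence
```

Each step is a known function; the 24-minute search collapses into a composition of two.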

Observation: I had a similar struggle in the dark with trying to convert RomanNumeral to Int.

map show $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
map (map show) $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
:t parse romanNumeralParser "(unknown)" "MMMCMMMoobar"
fmap (map show) $ parse romanNumeralParser "(unknown)" "MMMCMMMoobar"

Again, I didn’t think of the type signature and ended up trying to solve the wrong problem (I didn’t need to do fmap (map show)).

Nor did I generalize enough - I should have seen that the output was Either ParseError [RomanNumeral], thus f [RomanNumeral], and therefore I just needed to do an fmap. (Still, this wasn’t the problem I wanted to solve.)

Observation: Look what I did to append two Parsers. (8m lost.)

chainl mParser (++) cmParser
chainl mParser (liftM2 (++)) cmParser

Again, no clear type signature and not enough abstraction.


Observation: Wow. Look at the range of strategies I tried! Who said humans can’t search?

Question: What led to the range of strategies? Did each failed test give me the idea for the next test?

Solving the Wrong Problem at the Wrong Level of Abstraction

Hypothesis: What I should have done was to look at the type signature for my problem, after hiding the particulars.

For example, I wanted [Parser [a]] -> Parser [ [a]]. Generalizing further, I wanted [m [a]] -> m [ [a]].

Corollary: Having the desired type signature in mind will also keep you focussed on your goal.

You won’t waste time trying to append the lists in Either ParseError [RomanNumeral] when you really care about Parser [RomanNumeral].

Hypothesis: Keep type signature in mind -> use time well. Don’t keep type signature in mind -> may waste time on the wrong problem.

Corollary: Maybe there’s something to David Rock’s idea of writing down your overall goal so that you don’t forget it (and don’t have to keep it in your mind).


Hypothesis: Not generalizing enough -> get confused vs clear.

example: During my sequence confusion above, what I wanted was [m b] -> m [b] but I kept working with the types [m [a]] -> m [a].

Similarly, I tried to go 3-4 levels deep after looking at foo :: [Char] -> Either ParseError [RomanNumeral]. But I should have generalized it as foo :: f1 (f2 [RomanNumeral]).

Hypothesis: Not remembering the exact type signature of functions -> get confused vs clear

example: I didn’t remember (or didn’t understand) that sequence on [m [a]] gives m [ [a]] without touching the inner lists.

Work at the Right Abstraction Level

Hypothesis: You need to generalize your problem in a way that makes it easy to apply your next step.

For example, if I had “simply” generalized the first step of my problem “concatenate Parser [a]” as [m b] -> m [b], I would have got the answer in a jiffy (sequence). I should have realized that Parser was irrelevant to my goal of concatenating the inner lists, and further that getting m [a] came after getting m [ [a]].


Hypothesis: You first come up with a strategy, and then prepare your code to make it easy to apply that strategy.

For example, I decided to concatenate the Parsers above. So, I should have abstracted Parser a as m a.

I decided to convert RomanNumeral into decimal by looking up the value of each token and summing it. So, I should have abstracted Either ParseError [RomanNumeral] as f [RomanNumeral].

How to abstract your code?

Hypothesis: Look at what your strategy needs. Abstract everything else.

concat needs [ [a]] and will give you [a]. So, abstract everything else.

lookup needs a String, which needs a RomanNumeral, so get in and get a RomanNumeral somehow. Abstract everything else.
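A sketch of the lookup strategy, with an illustrative table (names and values hypothetical; the real code isn’t shown here). lookup fixes the key type; traverse abstracts everything else:

```haskell
-- lookup needs a String key; abstract everything else.
tokenValue :: String -> Maybe Int
tokenValue tok =
  lookup tok [("M",1000),("CM",900),("D",500),("C",100),("X",10),("I",1)]

-- traverse handles the Maybe plumbing; sum does the arithmetic.
tokensToInt :: [String] -> Maybe Int
tokensToInt = fmap sum . traverse tokenValue
```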

Scientific Method for Programming

TODO: Create an ideal program for RomanNumeral and compare it to your current one.

What I really want is to contrast test-passing programs with test-failing ones. However, it’s silly to use unrelated programs like SudokuSolver or SExprParser.

I need something that is similar to RomanNumeral but wrong.

Observation: Right now, I can’t think of variables that differ between working programs and non-working programs.

Hypothesis: Maybe we need to compare intermediate versions of my program to get variables. Look at which versions gave an error and which ones didn’t.

Variables

Observation: Look at each diff of RomanNumeral:

data type RomanNumeral
parser for M
parser for CM
alternative between them
many parsers
many M and many CM
Roman numeral to int
parsers for all Roman digits
combine many parsers in sequence
function to create parser for one Roman digit
handle Either output of Parser
tests
QuickCheck property

Coming up with a parser for M involves trial and error. After that, a parser for CM is trivial (since they obviously differ only in the digit string).

Again, after many M, many CM is easy.

Hypothesis: Basically, if you can get the same output with slightly different values, then you’ve got a variable.

For example, parser for M and parser for CM.

Type Signature of the Scientific Method

Observation: I’m not at all clear about the type signature of the scientific method here.

Maybe the program is the output?

Hypothesis: We want to figure out the model: thoughts -> correct program vs wrong program.

One hypothesis so far is that not keeping a type signature in mind leads to the wrong program. Another is that not abstracting the current program correctly leads to the wrong program in the next version.

For example, I needed to know how to append two Parser [a] and I kept trying lots of things.

Hypothesis: My target model there was: problem statement, current version of program, strategy -> program that passes the test of the problem statement (appending two Parser [a]) vs fails vs passes more tests than before.

So, the variables to change are the problem statement, the current version of the program, and your strategy.

Hypothesis: Strategy :: current version of the program -> next version of the program.

And the strategy can be pretty subtle, like abstracting [m [a]] as [m b]. The first one led to a wild goose chase, the second one gave the answer in an instant.

Observation: Still don’t know how to compare or even observe strategies.


Lesson: Remember to include all your information in your model. For example, I design my program using the problem statement and the current version of the program.

How to come up with Strategies?

Question: How to come up with strategies?

How did I decide that I needed to concatenate Parsers?

Observation: I had a cached thought that a Parser would be a good idea for getting Roman numerals.

More Programming Experiments

Observation: I finished most of the little problems from 99 Problems in around a minute each.

Observation: rotate took 5 minutes, though.

Didn’t know for sure how split worked. That took a minute.

Error because I didn’t use n’. Another minute.

Observation: Combinations - lazy solution - ~4m; recursive solution - 3m

Time Taken is Proportional to LOC?

Hypothesis: (Due to Brooks) Time taken to write a program is proportional to the number of lines of code?

Note: I used website-count-body-lines to count LOC for Haskell.

RomanNumeral - 54 LOC - 68m - 1.25926 - 35 =s
99 Problems - 22 LOC - 25m - 1.13636
RPN Solver - 36 LOC - 97m - 2.69444 - 20 =s
RaceSort - 36 LOC - 77m - 2.13889 - 32 =s
CIS194 Monads - 79 LOC - 238m - 3.01266 - 65 =s
Wikibook Applicative Functors - 221 LOC - 257m - 1.1629 - 305 =s
LispJavaPhoneNumber2017 - 86 LOC - 208m - 2.4186 - 62 =s

Calc says that the correlation between [54, 22, 36, 36, 79, 221, 86] and [68, 25, 97, 77, 238, 257, 208] is 0.78.
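The correlation can be replicated without Calc. A small Pearson-correlation sketch (not the original tooling):

```haskell
-- Pearson correlation coefficient of two equal-length samples.
correl :: [Double] -> [Double] -> Double
correl xs ys = cov / (sd xs * sd ys)
  where
    mean zs = sum zs / fromIntegral (length zs)
    dev zs  = map (subtract (mean zs)) zs
    cov     = sum (zipWith (*) (dev xs) (dev ys))
    sd      = sqrt . sum . map (^ 2) . dev

-- correl [54,22,36,36,79,221,86] [68,25,97,77,238,257,208]  -- ≈ 0.79
```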

Why would this be? For one, rotate had two equations, the main one and then the calculation for n. But encode had three, even though they were simple. dupli and split were just a few words long.

How to Make Observations: Follow the Evidence

Hypothesis: How to make observations? Observations require evidence. So, look at the places where you get evidence.

example: A, B -> C; A, ~B -> ~C; you need to see the output

Could you program in the dark, with no type-checking or intermediate tests or checking documentation, but still get it right? Probably not.

So, you need evidence. And what do you do with evidence? You update on it.

Corollary: Your output categories are determined by the feedback mechanism.

Such as “9/15 tests pass” or “type error in the definition of romanToInt”.

Aim: Optimize how you use Evidence?

Hypothesis: Maybe my aim is to optimize my use of evidence. Get to the right answer with the least amount of evidence (and thus hopefully the least time).

Hypothesis: Time taken is decided by the amount of evidence you get and how well you process it.

Observation: When I had trouble with concatenating Parser [a] (I’m never going to live that down), I tried around 50 commands in the interpreter.

encodeModified - 7 times (4 loads, 3 test runs) - 2m
decodeModified - 3 times (2 loads, 1 test run) - 1.5m
dupli - 3 times (2 loads, 1 test run) - 1m
split - 3 times (2 loads, 1 test run) - 1m
rotate - 14 times (6 loads, 5 test runs, 2 type queries, 1 evaluation) - 5m
combinations - 19 times (9 loads, 6 test runs, 1 type query, 3 evaluations) - 8m

Correlation (according to Calc): 0.98 between [7, 3, 3, 3, 14, 19] and [2, 1.5, 1, 1, 5, 8]


As a programmer, you’re basically a scientist. You’re trying to discover the causal model as described by the problem statement and tests.

So, if you want to program really fast, you need to process evidence really fast. Which means making powerful observations, inferences, and experiments.

Evidence from Within

Observation: The interpreter is not your only source of evidence. Your mind can give you evidence too.

For example, using my rule of using type signatures, I could have seen that I just needed to use sequence to help concatenate [Parser [a]].

Hypothesis: Are rules sources of evidence?

Observation: I can tell that fmap foo . sequence [xs, ys] is going to blow up without evaluating it because I have a crude interpreter in my mind. My rule says that a dot-composition needs a $ before the argument.
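That mental rule, as a compilable sketch: composition binds looser than function application, so the version without $ doesn’t even typecheck.

```haskell
xs, ys :: Maybe [Int]
xs = Just [1]
ys = Just [2]

-- fmap concat . sequence [xs, ys]         -- rejected: parses as
--                                         -- fmap concat . (sequence [xs, ys])
good :: Maybe [Int]
good = fmap concat . sequence $ [xs, ys]   -- Just [1,2]
```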

Observation: When a function gets over 50 lines of code, my mind warns me that it will be a headache later on (the code, not my mind).

Observation: Similarly, my mind tells me whether some technique will work (sequence to transpose a matrix) and how far a program is from completion. Useful things to know.

Trying 50 different ways to concatenate Parser [a] is stupid and inefficient. You’re not making full use of the evidence given to you by the interpreter.

You’re just not listening, man. (“Tum sunn nahi rahe ho, yaar.”)

– Music director in Rockstar

Hypothesis: You need to find out as soon as possible if your idea will work.

Experiment: Sujeet RPN Solver

TODO:

Experiment: Sudoku Solver

TODO:

Experiment: Heathrow

Hypothesis: I had to consult the interpreter several times because I was dealing with an unfamiliar library (the Writer monad). Ditto for Parser.

However, Sujeet didn’t need to check any other resources for solve_rpn because he knew everything he needed to write that program.

Experiment: Sudie Heathrow with Twist

TODO:

Ideas for Programming Experiments

Give them instructions.

For example, when they finish writing one block, tell them to test their code so far.

When they make an error, ask them to narrow their scope and work on a toy example.

Tell them to think of the problem’s type signature.

Programming Experiment Design

Hypothesis: Controlled experiment <- toggle some variable that you suspect is a cause.

If you think using an interpreter or even running test cases gives you feedback, then eliminate them to see how that affects programming time.

If you think the “complexity” of the problem statement affects programming time too, then make it simpler or more complex somehow.

Lesson: In short, write your hypothesis as “programming time <- X, Y, Z” and toggle them individually (or maybe all at once to get a sufficient condition).

Fine-tune your variables to the point where you can predict each extra minute of programming time.

What if you provided helper functions, like isOperator or isOperand? How much less time would they take?

Hypothesis: Also, figure out - response <- programming situation.

Why did I not toggle all the variables when tracking down that Markdown mark-paragraph bug? Where did I go wrong?

If you think programming time is caused by past experience, access to interpreter, and other things, then toggle them. Remove the effect of past experience somehow. (How? Maybe make them program in a new language.)

Vague Stuff

Programming: Acceptance Test

Acceptance test: CS527 - CTF problems.

Given a CTF problem

when you try to solve it using your hypothesis explicitly

then you should either solve it really quickly or learn something that helps you solve the rest of the problem really quickly.

Acceptance test: CS527 - security project.

Given a security project

when you try to implement it securely, poke holes in others’ security, and patch your own holes

then you should succeed very quickly.

Acceptance test: Want to make a Beamer presentation - can’t remember exactly how to do it. Look it up in my integration tests [fast index]. Want to add a pause between slides - can’t find it [try to look it up]. Search online and experiment within the demo presentation [add technique to integration test first]. Use the template.

Acceptance test: Writing a C program - I want to know how to get a pointer to an array. Is p = &a[0] good enough? Can we just do p = a? [remember the integration test; look it up].

Acceptance test: ATDD by example: the parking charge calculator - they wrote an end-to-end test for the calculator in the browser [integration test] that initially checked just one branch of valet parking [short cycle time].

Acceptance test: Milo against Avada Kedavra - he has never fought such a curse before. None of his “acceptance tests” (aka past experience or secondhand knowledge) has anything like this [no precedent in current acceptance tests].

He searched his rule book of techniques (or just his memory) [search techniques] and came up with the Skeletal Troll [new acceptance test].

Acceptance test: [PASS] Anagrams

I tested it once with a small “dictionary”

> anagrams ["yo", "boyz"] "yoboyz"
["yoboyz","boyzyo"]

once with a full dictionary but a nonsense word

> anagrams dw' "yoboyz"
["yboozy","boozyy","boozyy","yboozy"]
(0.44 secs, 190,065,168 bytes)

and finally with a full dictionary and a phrase from Steve Yegge:

> anagrams dw' "agilemanifesto"
["ngaliemasoftie","ofeliamangiest","ofeliasteaming","agilemanifesto","agleamnotifies","foamiestgenial","foamiestlinage","foetalimagines","fogieslaminate","foliageinmates","genialfoamiest","goaliemanifest","imaginesfoetal","infamieslegato","inmatesfoliage","laminatefogies","legatoinfamies","linagefoamiest","mangiestofelia","manifestgoalie","manifestoagile","notifiesagleam","semifinaltogae","softiengaliema","steamingofelia","togaesemifinal"]
(61.92 secs, 69,279,505,520 bytes)

That’s all it took for me to be confident in my program. Those were my three acceptance tests. And I didn’t automate them. But it was still good enough for my purposes (which was just to test if I could write a quick one-off anagram generator in the Haskell interpreter). [don’t need to automate acceptance tests]

Acceptance test: Siebel self-experiment on Fischer random chess

http://www.gigamonkeys.com/misc/fischer-random-chess.txt

He wrote down his thoughts after every iteration.

Challenge: Explain his thought process. Predict what he is going to do next.

Acceptance test: Siebel self-experiment on the “Impossible Book” problem

http://www.gigamonkeys.com/misc/impossible-book-tdd.txt

Acceptance test: Designing at speed - OS HW3 Paging.

How to go from the problem statement to the high-level design?

Your attempt: how could you have reached the final version faster? How could you have processed evidence more efficiently?

Acceptance test: Speed coding - OS HW2 Pipes.

How to implement the given high-level functions as quickly as possible (using the problem statement)?

Acceptance test: Code compression - Monads

How much longer is your code if you don’t use monads?

Lisp-Java Phone Number Problem

Source: http://www.flownet.com/ron/papers/lisp-java/

Acceptance test: Lisp as an Alternative to Java - Phone number problem.

Convert phone number to phrases that encode it. Each final phrase should decode to the given phone number and should be made of words from the dictionary (and maybe single digits).

(I understood the problem statement by intuition. Don’t know how to use the quantitative requirements.)

Picking an integration test by most “complex”-seeming:

10/783--5: neu o"d 5
10/783--5: je bo"s 5
10/783--5: je Bo" da

Observation: That wasn’t really an integration test. It was just a bunch of unit tests. Also, I didn’t flesh out the integration test by progressively replacing the stubs.

Property:

(concatMap decode . map stripQuotes $ phrase) == just the digits == stripPunctuation phoneNumber

I’m sure Bird would take the inverse somehow and get the answer in a few steps.

Ok. How to get the “fully-specified integration test”?

10/783–5 -> 107835 [phone number -> digits]

107835 -> 1 07835; 10 7835; 107 835; 1078 35; 10783 5. [digits -> prefixes and suffixes]

10 7835 -> je [recursive call on 7835] -> je bo"s [recursive call on 5]; je Bo" [recursive call on 35] -> je bo"s 5; je Bo" da. [prefix, suffix -> phrase]

Hmm… Ok. That is a fully-specified integration test. Let’s run with it and see how it goes. [Step one done. Step two too.]
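The “digits -> prefixes and suffixes” step of that walk-through can be sketched as (name hypothetical):

```haskell
-- Every way of cutting a digit string into a non-empty prefix
-- and a non-empty suffix.
splits :: [a] -> [([a], [a])]
splits xs = [ splitAt n xs | n <- [1 .. length xs - 1] ]

-- splits "107835"
--   == [("1","07835"),("10","7835"),("107","835"),("1078","35"),("10783","5")]
```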

Hello world works. [Step three done.]

Integration test passes with one stub. One-button deployment.

Follow all the values. Write tests for each function call needed to pass the integration test.

Observation: I’m making sure progress. May be slow (or not). But I’m definitely moving with a purpose.

Observation: 34 minutes into the implementation - done with encode to words and the others. Didn’t expect to finish this easily. (encode is still left, though.)

When debugging encode (which gave [] instead of phrases), I went back to a working version.

My first integration test passed.

Observation: However, stuck on the others. “*** Exception: Prelude.head: empty list”.

Time for my debugging algorithm to step up.

Hypothesis: No debugging -> anybody can coast through the problem.

Debugging -> way better to have unit tests and well-designed code.

Test: What to do when you don’t have a diff? When you fail for a new test? – Narrow down the test first, I guess. (And then, maybe start stepping through the code.)

Turns out I was using head on an empty list.

Huh. It was in checkConsecutiveWords. Once I turned that off, the test passed.

Test: Man, that “severe bug” was handled easily. It took all of two minutes to squint at the error message and realize that head was causing the exception. I overestimated the difficulty of the problem.

Test: Ran everything under one test - had to run tests several times before I could get the generated outcome for each of them. Much easier when I put them in separate tests and got all the generated outcomes in parallel. Way more information! You can easily refactor the test names later. For now, it’s more important to get information quickly.

Test: Hard to get the output of intermediate mechanisms when you use where statements. High cycle time (?). Uncertainty about the cause of the output.

Test: Separated f and thus the value of result. Realized that

filter (not . null) . concatMap (g dict letterMap) $ possibleSplits "112"

itself gives ["1 "]. Narrowed down!

Test: Broke down the input list possibleSplits "112" and checked the value. Turns out that [("1","12")] is what is giving the wrong output.

Test: I’m running manual tests. Have to keep fucking binding dict and letterMap each time.

Test: I don’t seem to have any first-class output for g. It’s just magic to me at this point. Hmm… I don’t know if g should return [""] or [] when the suffix can’t be encoded. I don’t have a contract for g.

Test: Took me 3h19m. Pradeep_2014 did it in 2h17m and he used better types than mine. I used Strings everywhere, for God’s sake. (Maybe his types led to a better design.)

Test: Major problem - I was using functions or variables like encode and g and result without knowing their exact types or contracts. (I wrote those functions without specifying a test for them, even a rudimentary one based on the integration test.)

Hypothesis: I should have inferred g’s contract from its input and output for the integration test.

[2017-12-31 Sun] Want to refactor g. Realize that its type is pretty bad.


Observation: Feeling like shit right now because I didn’t zoom through the problem.

Hypothesis: Don’t overpromise, keep improving -> you will get the most utilons you can and you will be happy with that. No need to worry about what others might say.

Inverse-Oriented Programming

Acceptance test: Clean up your phone number example and publish it.

Acceptance test: Use inverse functions to solve Sudoku using its property.

Acceptance test: Test your algorithm on the number to digits problem.

Acceptance test: Collect examples from Bird. (Implement laws as functions.)

Acceptance Test: Fix Bugs in OS HW3

That project involves legacy code and takes a while to compile and test as well.

Acceptance Test: Write Tests for and Refactor OS HW1

It has legacy code and it is in C. Enjoy. Make sure that you can write acceptance tests and pass them with ease. Remember how much trouble you had earlier.

(Toggle the scheduler behind an interface.)

Acceptance Test: Refactoring

Acceptance test: Refactoring example from Ruby conference.

Given some concrete code

when I try to get minimum interface everywhere

then I will get super-clean code.

Acceptance Test: Debugging

Acceptance test: Debugging an Awk one-liner

Given that I wanted to extract acceptance tests from my essays

when I tried an awk one-liner

then it kind of worked, but missed more than half of the acceptance tests for reasons I couldn’t figure out. [negative exemplar]

when I tried to change a few things

then I got random changes. Didn’t know what worked. [toggling on a negative exemplar]

when I started with one narrow section for which the command works perfectly

then I could extend it from there quickly. [positive exemplar; don’t debug; fall back to a positive exemplar]

Acceptance Test: Type Signatures

Acceptance test: Type signatures - Awk - make them simpler and easier to compose.

Given that I had an Awk script to get acceptance tests but it occasionally printed stuff beyond the current section

when I broke the input Markdown files into sections [type: sections, not free-form text]

then I could handle it effortlessly. [simple type]

Interview Question

Acceptance test: Interview question - subset sums

Given the problem statement “Given a set of candidate numbers (C) (without duplicates) and a target number (T), find all unique combinations in C where the candidate numbers sums to T. The same repeated number may be chosen from C unlimited number of times.”

when I try to figure out the algorithm

then I should get

def combinationSum(candidates, target):
	# xs[i] collects every combination that sums to i.
	xs = [[] for _ in range(0, target+1)]
	xs[0] = [[]]  # one way to make 0: the empty combination
	for x in candidates:
		# Iterate i downwards so xs[i - k * x] never already contains x;
		# every multiple k of x is added explicitly instead.
		for i in range(target, 0, -1):
			k = 1
			while i >= k * x:
				xs[i] += [ys + [x] * k for ys in xs[i - k * x]]
				k += 1
	return xs[target]

More Acceptance Tests

Acceptance test: Grammar for problem statement of programming problem.

Programming Algorithm

Hypothesis:

Get a “representative” integration test from the problem statement with input and output at every step. [fully-specified representative test]

Follow the values step-by-step to get a high-level design along with type signatures (by looking at what values the output depends on). [follow the values]

Start on top of a working hello-world program. [hello-world]

Set up and pass the integration test with stubs. [stubs]

Make it a one-button deployment. [deployment pipeline]

Implement necessary functions in isolation [implement in isolation] using tests from the integration test [initial unit tests from the integration test; red-green]. Run the integration test every time. Replace stub with actual function call. [flesh out the integration test; run the deployment pipeline]

Pass the test in a quick and dirty way. [coding hat]

Add unit tests for each of the functions. [unit tests] Then refactor. [refactoring hat]

If you need to use some technique, look it up in a Haskell integration test. [look it up in integration test]

If you get a failing test or compilation error, narrow down the causes. [narrow down the causes by backward inference on top of the last working version - look at the diff]

Committing Partially-Working Code or Partially-Correct Ideas (??)

Hypothesis: If I commit ideas that I will soon subsume with another idea, then I will have trouble finding them. – Not true. I can always look at the git log and diff.

But then my repo will no longer satisfy the property that, at any point, every function exists for a purpose - that is, it helps pass an acceptance test that wouldn’t be passed otherwise.

Won’t it, though? At the previous state in the repo, the acceptance test would be passed thanks to the old idea. After you’ve subsumed the idea, the acceptance test would be passed thanks to your new idea. At no point is there a function that doesn’t pull its weight.

Maybe the problem is that the code isn’t fully refactored. But that isn’t part of your repo property. It doesn’t say everything will be beautifully written. It just says everything will be necessary for passing the acceptance tests.

Hypothesis: I don’t want there to be wrong ideas in my repo history aka perfectionism. What will people say?

Hypothesis: New idea, no acceptance test, want to refactor old ideas that the new idea subsumes -> have to go through the whole essay (or at least the revision log) to find out ideas that it subsumes.

New idea, no acceptance test but old changes are uncommitted, want to refactor old ideas that the new idea subsumes -> have to go through just the uncommitted changes.

New idea, have acceptance test, want to refactor old ideas that it subsumes -> go through the acceptance tests to see which value transformations you could do by yourself.


Test: I hesitate to commit partially-correct ideas that I’m in the middle of improving, even when they add up to over a thousand words.

For example, I’m writing about reductionism and behaviorists and observable consequences and how you should have definitions that make more than one prediction and so on. I know that my final hypothesis will tie up all these loose ends elegantly pretty soon, so I don’t want to commit the rest before that.

Test: Tried committing partially-correct ideas - felt unclean, somehow. Didn’t like it at all. Felt like the log message would be misleading for future Pradeep. It would say “advance predictions <- testable definitions”, which isn’t the whole story, especially because of what I’m working on next. – It’s extra work for him. If he had the obsolete ideas lying around uncommitted, he would know exactly what to change instead of having to trawl the essay or revision log. (There might still be other old ideas that could be subsumed, but at least this extra work would be avoided for the given ideas.)

Specs and Estimation

Hypothesis: Spec -> estimate how long it will take.

Hypothesis: Spec -> may turn into a to-do list and thus negative reinforcement instead of the positive reinforcement of building something that has never existed before.


Number three giant important reason to have a spec is that without a detailed spec, it’s impossible to make a schedule. Not having a schedule is OK if it’s your PhD and you plan to spend 14 years on the thing, or if you’re a programmer working on the next Duke Nukem and we’ll ship when we’re good and ready. But for almost any kind of real business, you just have to know how long things are going to take, because developing a product costs money. You wouldn’t buy a pair of jeans without knowing what the price is, so how can a responsible business decide whether to build a product without knowing how long it will take and, therefore, how much it will cost? For more on scheduling, read Painless Software Schedules.

– Joel Spolsky, Painless Functional Specifications

Test: CS373 HW1 - can’t seem to estimate how long it will take me. Feels like it will take a very, very long time and that I will surely fail to do a “good job” (even though I’m not clear about what that is).

Other Tests (??)

Test: page table code in HW3 - wrote hopelessly buggy pointer arithmetic code. Zero tests.

Test: Wasn’t sure how to condition on a variable when there were two factors. Did the derivation in a scratch page - got the answer!

Test: Wasn’t sure how to do pointer arithmetic on an array. Ditto for bit shifting. Results were shocking! – yup.

Test: HW3 - no unit tests, just constant rebooting; still discovered all my bugs by reading the diffs. – read diff line by line.

Test: Trying to write org-autoclock - should I create an automated test or do a manual test? Which would lead to faster (quick-and-dirty) code?

Test: org-autoclock - manual test - fast; try to write automated test - lost! – impure function.

Test: function to paste joined text - automated test - fast – pure function.

Testing

Look at the Input Configurations (TODO - flashcards)

Hypothesis: pure function, know all possible input configurations for which the output is different -> know test cases; clear how to proceed.

pure function, don’t know all possible input configurations for which the output is different -> don’t know all test cases; not clear how to proceed.

impure function, know exactly where it will be called and what you expect (thus, know all possible input configurations for which the output is different) -> know test cases; clear how to proceed.

impure function, don’t know all the places where it will be called and what you expect (thus, don’t know all possible input configurations for which output is different) -> don’t know test cases; not clear how to proceed.

two input configurations, don’t know if they are supposed to give the same output or not, plan for both -> can choose the correct one once you get feedback.

two input configurations, don’t know if they are supposed to give the same output or not, plan for neither -> confused even after feedback?

Hypothesis: Concrete walk-through, toggle each variable (come up with alternatives for each concrete value) -> get possible input configurations.


Test: org-autoclock – crystal-clear after I wrote out the configurations.

Three variables: task, clock, file

no clock, task file -> ask for org task
no clock, not in task file -> do nothing
clocked in, task file -> resume if needed
clocked in, not in task file -> clock out if needed
clocked out, task file -> resume
clocked out, not in task file -> do nothing
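The table above, written as an exhaustive case analysis (a Haskell sketch with hypothetical names; the original is Emacs Lisp):

```haskell
-- The three-by-two input configuration space, made explicit as data.
data Clock = NoClock | ClockedIn | ClockedOut
data Place = TaskFile | OtherFile

action :: Clock -> Place -> String
action NoClock    TaskFile  = "ask for org task"
action NoClock    OtherFile = "do nothing"
action ClockedIn  TaskFile  = "resume if needed"
action ClockedIn  OtherFile = "clock out if needed"
action ClockedOut TaskFile  = "resume"
action ClockedOut OtherFile = "do nothing"
```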

Test: unclear about vcreate and what it actually does - well, hsize can be 100, 300, or 1M. What to do for each case? Next, allocate_bs may return SYSERR - what to do then? – write input configurations -> clear!

Test: deallocate_bs - not sure of all the places where it is called. I can see that it is called when a process is killed. But where else? I feel unsure. – impure function, don’t know all possible input configurations.

Test: Haskell - list recursion - either empty or head and tail - clear. – pure function - know all possible configurations.

Test: OS HW3 - open backing store - don’t know whether more than one process can access the same backing store. If so, then I have to handle the case where multiple processes try to access the store and own different pages in it. If not, then I don’t have to worry about any of that. – don’t know all possible input configurations for which output is different. If one process, then no matter what, we know that the behaviour is the same simple thing. If multiple processes, then we know that the behaviour depends on which process is asking for what page and whether somebody else is accessing the store right now.

Test: OS HW3 - don’t know whether hsize can be more than 1600 or not. Every other doubt stems from this uncertainty. – if hsize is only within 1600, then I know what to do. If not, then I don’t know what to do, but I can find out. Clear.

Test: OS HW3 - What worked, I think, was to have a concrete walk-through. That told me where there could be potential hotspots.

“You want to read from some address in your private heap. [might want read from elsewhere] (You’ve already used vgetmem to allocate the memory. [might not have used vgetmem yet]). First, there will be a page fault because you’ve never accessed that page before [might have accessed that page before and it might be present or not]. So, you will allocate one frame for the page table and then one frame for the page itself [may not need to allocate one frame for the page table]. And then copy that page from the backing store [or just don’t]. There will always be a free frame available, so no problem [there may not be a free frame available, in which case, problem].” – Input configurations - want to access memory within your private heap or not; have used vgetmem or not; have accessed page before; is present in memory; need to allocate frame for page table; copy from backing store or don’t; free frame available or not. Also, it’s one process right now. What if you have multiple processes?

Compare with the Past: Our Acceptance Tests are Implicit

Hypothesis: Humans don’t carry around an explicit checklist of features they want to have in their next project. They simply compare it to the set of features they have seen in a successful project (and not seen in failed projects).

Corollary: Types of acceptance tests - compare with a similar past project to make sure you have all its features; compare with a similar failed project to make sure you don’t repeat your mistakes.

Corollary: So the similar projects we have are our acceptance tests.


Test: Designing a homework assignment - given previous assignment and current assignment, when you compare each part of their structure (like the submission instructions, the dataset details, the heading, points adding up to 100, examples of the intended output, specifying whether you want the code or just the answers in the PDF, etc.), then you should have covered everything they did – your assignment should have all the features the previous one did.

Test: Writing a programming project - given that I sucked at the first OS homework and didn’t have any unit tests, when I check my second OS homework, then I should have (something resembling) unit tests for my functions. – don’t repeat your mistakes.

Test: Algos midterm 2 - given that I blew some stupid questions in the first one and ran out of time, when I prepare for the second one, then I must clarify simple concepts (and perhaps add them to my cheat sheet) and hurry up in the exam – don’t repeat your mistakes.

Test: Classmate uses an automation tool that I didn’t know about and that starts, interacts with, and stops the OS on a remote computer without any human interaction - given that I am working on an OS project or some other project that requires manually entering commands, when I look at what I’m doing, then I should be using that automation tool – learn from others’ successes.

Test: Preparing for my midterm - looked at past midterm notes to know how I’d done it – acceptance test.

Test: Forgot that this chilli powder is extra hot and that I should use less of it – forgotten acceptance test.

Acceptance Tests vs Unit Tests: Done when Acceptance Tests Pass

Hypothesis: Can write “working product” without writing or passing all unit tests -> may go faster without them.

Can’t write “working product” without passing all acceptance tests (even if implicitly) -> may go faster by writing them explicitly.

Corollary: Expert programmers write programs with implicit acceptance tests in their minds.

Corollary: You don’t need to automate your acceptance tests.

You may need to run them only once in a while, so the small overhead of doing them by hand can beat the cost of automating them.


Test: Jamie Zawinski

There’s bound to be stuff where this would have gone faster if we’d had unit tests or smaller modules or whatever. That all sounds great in principle. Given a leisurely development pace, that’s certainly the way to go. But when you’re looking at, “We’ve got to go from zero to done in six weeks,” well, I can’t do that unless I cut something out. And what I’m going to cut out is the stuff that’s not absolutely critical. And unit tests are not critical. If there’s no unit test the customer isn’t going to complain about that. That’s an upstream issue.

Explanation: Can write what matters without writing or passing all unit tests.

Acceptance Test: Merge Old Tests for Speed

Hypothesis: Starting out, test one feature at a time -> fine-grained feedback.

Pretty sure your code works, test one feature at a time -> slow.

Pretty sure your code works, merge tests into most valuable paths -> quick.

Pretty sure your code works, error cases, keep them vs don’t -> ??.


Test: Softsec project - lots of single-command scripts like whoami\n, etc. – pretty slow; not of much use.

Write Notes in Acceptance Tests Format (Obsolete)

Hypothesis: Add notes and to-do items in the form of acceptance tests -> collect ideas by scenario and goal; know exactly where an idea is needed and thus how important it is.

Add free-form notes and to-do items -> don’t collect ideas by scenario and goal; don’t know where an idea is needed or how important it is.


Test: When writing acceptance tests for a presentation, I found myself moving material from my free-form notes to the tests. For example, seeds for k-means clustering - I moved in notes about a visualization tool I found online. Similarly, for weaknesses of k-means - I moved in notes from Wikipedia. – could tell the scenario for which each note was useful. It wasn’t lying around without a home.

Test: Negative exemplar - free-form notes - my essays - 200k words lying around just like that – don’t know where each idea is needed or how important it is.

Actually Writing Acceptance Tests

Hypothesis: Use an actual, slightly-comical person in your acceptance test -> make it more lively.

When you’re writing a spec, an easy place to be funny is in the examples. Every time you need to tell a story about how a feature works, instead of saying:

The user types Ctrl+N to create a new Employee table and starts entering the names of the employees.

write something like:

Miss Piggy, poking at the keyboard with an eyeliner stick because her chubby little fingers are too fat to press individual keys, types Ctrl+N to create a new Boyfriend table and types in the single record "Kermit."


Test: Instead of saying

Given that this is just after the command “login foo” and the password for foo is bar

when foo types “pass bar”

then foo must be successfully “authenticated”.

I said:

Given that this is just after the command “login j0hn_5m17h” and the password for j0hn_5m17h is j0hn_5m17h123

when the user types “pass j0hn_5m17h123”

then j0hn_5m17h is successfully “authenticated”.

Explanation: There’s no semantic difference between the two. And yet, the latter feels like it’s talking about an actual person, complete with a super-weak password.

Mistakes despite Testing

Test: [2019-03-10 Sun]

def subgradient(self, df, cols, target_col):
	mistakes_mask = (df[cols].dot(self.w) + self.bias) * df[target_col] < 1
	# I had written:
	# mistakes_mask = df[cols].dot(self.w) * df[target_col] + self.bias < 1

My test didn’t catch the difference.

Red-Green-Refactor: First, Make the Test Fail

Hypothesis: Making the test fail without the change is a relevant factor for knowing that the change caused the test to pass.


Observation: Red-green saves my ass. I keep forgetting to add my new test case to the test list, and so it seems like the test has passed. Only when I make it fail do I realize my mistake.

Red-Green-Refactor vs Focus-Gambling (??)

Test: Implementing the pseudo-code first and then covering branches with tests violates red-green-refactor. Then again, focus-gambling itself violates red-green-refactor. You’re concluding first and running tests later.

Unit Testing vs Regression Testing

Hypothesis: In any kind of testing, you have to provide the input for the (hopefully) representative test. In unit testing, you have to provide the desired output as well. In regression testing, the last good implementation provides the output.

So, if you’re regression-testing a React component, you still have to construct representative test inputs. It’s just that the program will produce the output and Jest will save that snapshot.
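A minimal sketch of that regression-testing loop (the helper and file layout here are invented for illustration, not Jest’s actual mechanics): the first run records whatever the program produced, and later runs only check that the output hasn’t drifted.

```python
import json
import os
import tempfile

def render_greeting(name):
    # The "component" under test; we never hand-write its expected output.
    return {"tag": "h1", "text": f"Hello, {name}!"}

def check_snapshot(path, value):
    """First run records value as the snapshot; later runs compare against it."""
    if not os.path.exists(path):
        with open(path, "w") as f:
            json.dump(value, f)
        return True  # nothing to regress against yet
    with open(path) as f:
        return json.load(f) == value

# We still have to construct the representative input ourselves...
snapshot = os.path.join(tempfile.mkdtemp(), "greeting.json")
assert check_snapshot(snapshot, render_greeting("Ada"))      # records the snapshot
assert check_snapshot(snapshot, render_greeting("Ada"))      # compares: no regression
assert not check_snapshot(snapshot, render_greeting("Bob"))  # a change is caught
```

Note that nobody ever asserted that `{"tag": "h1", ...}` was the *right* output - only that it matches the last good run.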

Shallow Testing vs Deep Testing

Hypothesis: Shallow tests distinguish between completely failing products and reasonably working products (especially if your code doesn’t game the test) and are quick to write.

Deep tests distinguish between completely working products and any non-working products but are slow to write.

Hypothesis: Shallow tests are better than nothing. And since they’re pretty quick to write, if you use them early you will be able to measure partial progress.

If you insist on using only deep tests, you will have to wait a long time before you can pass your test.

Corollary: These are probably the tests that Harry was talking about. Run quick, easy checks to test whether your ideas are right or not.


Test: Shallow test - check if the CSS class for a button is in the DOM – shallow test that distinguishes between a completely failing and a reasonably working product. Quick.

Test: Deep test - check if the React component matches the previous snapshot – deep test that distinguishes between a completely working and any non-working product. Slow.

Regression Tests are Awesome!

Observation: [2018-07-26 Thu] “8 volunteers needed” vs “8 more volunteers needed”. – May never have caught that!

Testing Refactors your Code

Hypothesis: Testing forces you to think about the input and output of your function, at least for a representative example.

Hypothesis: Testing makes you refactor your code by breaking out unnecessary dependencies.

Wow. Testing forces you to decouple your code.

For example, I just realized that the volunteering stuff had nothing to do with the rest of the to-do list. It was limited strictly to the row.

Hypothesis: Snapshot testing turns your code’s behaviour into a first-class output. Easy to test.

Observation: jest --watch is super-sweet! Instantaneous feedback!


Test: [2018-07-10 Tue] Trying to Jest snapshot test a React component made me realize that it wasn’t pure. (It looked up some config stuff from a store.) – break the dependency.

Test: Snapshot is a JSON file – first-class output. Easy to test.

Testing makes you think of the Different Branches

Hypothesis: Testing makes you think of the different branches and thus the different representative tests you will need to understand the individual functions.

(Note that understanding the system as a whole may take more representative exemplars. Or not.)

Corollary: That means that I need to think about the different branches in my own essays to figure out what tests I need. Right now, it’s all just a test-less mess.

Hypothesis: Why I resist making sweeping changes here on the frontend - testing all the different branches again would be a massive pain.


Test: [2018-07-13 Fri] Testing a Flux store - realized that there were just around 10 actions and that they each had simple effects. There were practically no branches at all. So, all I would need would be 10 tests.

Note, though, that there are still edge cases like adding an item at index -1 (which would currently put it at the front of the list) or removing an item at an index that doesn’t exist.

Test: Just discovered a bug (several, in fact) - reset list doesn’t reset the last added index.

Unit Tests help you Eliminate Irrelevant Factors in a Function

Hypothesis: Unit tests are like instances that help you learn a hierarchy of relevant factors. In particular, they help you eliminate the parts of your functions that are irrelevant to your use cases (i.e., your category feedback).

Notes on The Art of Software Testing

White-box Testing: All about Code Coverage

Hypothesis: Testing must falsify other plausible programs and thus provide strong evidence for your program.

Other plausible programs are those that you can get by modifying different features in your program, like changing && to ||, changing the function applied to a value, or changing X > 1 to X > 0. Of course, there are infinitely many such programs you could construct. We want the ones that are most plausible.
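A toy version of this in Python (in the style of mutation testing; the function names are made up): a test at the boundary is what separates the program from its most plausible mutant.

```python
def enough_players(count):
    return count > 1   # the program as written

def mutant(count):
    return count > 0   # a plausible mutant: > 1 changed to > 0

# count = 2 cannot tell the two apart...
assert enough_players(2) == mutant(2) == True
# ...but the boundary value can, so it falsifies the mutant
# and is strong evidence for the program:
assert enough_players(1) == False
assert mutant(1) == True
```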

Hypothesis: Branch coverage (or decision coverage) means that every decision must be set to true and to false at least once. (And if your code has multiple entry points, each must be invoked at least once.)

Hypothesis: Condition coverage means that every condition must be set to true and to false at least once. (And entry points must be invoked at least once.)

Condition coverage may not imply branch coverage: if (A && B) can pass condition coverage with (A = true, B = false) and (A = false, B = true), neither of which exercises the if-true branch.
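A sketch of that gap in Python (the function is hypothetical): both test cases set each condition to true once and to false once, yet the true branch never executes.

```python
def label(a, b):
    if a and b:
        return "both"
    return "not both"

# Condition coverage: a is true once and false once; so is b.
cases = [(True, False), (False, True)]
results = [label(a, b) for a, b in cases]

# Every condition took both truth values, yet the if-true branch never ran:
assert "both" not in results
```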

Hypothesis: “Decision/condition coverage” requires that each decision takes all possible outcomes, each condition takes all possible outcomes, and each entry point gets invoked.

Hypothesis: Multiple condition coverage means that each decision gets all possible combinations of condition outcomes.

This seems to be the gold standard for white-box testing.

I suspect that multiple-condition coverage is about getting a positive and negative exemplar for each condition so that we know it is indeed a relevant factor.

Question: What if we combine all the if-blocks into one long if-block?

You have to be careful to write tests that distinguish between the different branches of that block. For example, I had if (a && b) { if (c || d) {}}. Now, I collapsed that into one block as if (a && b && c) {} else if (a && b && d) {} .... But when I tried to test the second branch, I inadvertently gave an input that passed not just a, b, and d, but also passed c, which meant that the branch that got executed was the first one. Instead, I have to write if (a && b && c) {} else if (a && b && !c && d) {} ..., so that my test is forced to pass a, b, and d but not c.
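The same situation sketched in Python (condition names kept abstract, as above): without the explicit `not c`, an input meant for the second branch gets swallowed by the first.

```python
# The original nested logic was: if (a and b): if (c or d): ...
# Collapsed into one if/elif chain:
def branch(a, b, c, d):
    if a and b and c:
        return 1
    elif a and b and (not c) and d:  # the `not c` forces the test input
        return 2                     # for this branch to actually fail c
    return 0

# A test for branch 2 must now pass a, b, and d but fail c:
assert branch(True, True, False, True) == 2
# An input that also satisfies c is captured by branch 1 instead:
assert branch(True, True, True, True) == 1
```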

Note also that because || is short-circuiting, we get fewer than the full exponential number of configurations. This suggests that the first condition of || is intended to be more important than the second, since the latter is checked only when the former fails.

Hypothesis: What about all the possible configurations of the conditions?

For example, you have to cover all configurations like A > 1 && B == 0 && !(A == 2) && !((X / A) > 1) and predict their value. However, your program treats many configurations the same way. For example, if it had if (A > 1) { return 3; }, then you would return the same value for all combinations of the other three conditions.

However, the number of such configurations is obviously exponential.

What does White-box Testing do?

Hypothesis: White-box testing checks that the target function does not use fewer or more conditions than your function. And when the target function uses the same number of conditions, white-box testing checks that your boolean logic is correct (for instance, you haven’t used (&&) instead of (||), or x < 1 instead of x > 1).

Test: TODO.

Black-box Testing: Check the Spec

Hypothesis: Black-box testing checks whether your program implements the specification.

Hypothesis: Once you’ve got valid and invalid classes for each variable, cover the valid classes with some end-to-end examples and cover the invalid classes by toggling exactly one class from some valid example.

Hypothesis: When you’re given a grammar for some data structure, get valid and invalid equivalence classes for each node.

Hypothesis: Output equivalence classes are a relevant factor for input equivalence classes. So, you need to define the spec anew for each output variable.

Hypothesis: Different input cases are a relevant factor for input equivalence classes. So, you may need to handle each case separately.

Hypothesis: White-box tests are a relevant factor for getting positive exemplars as you implement your function.

Corollary: If your valid test passes, it means that your program handles all the valid classes as they are supposed to. If your valid test fails, then your program doesn’t handle one or more valid classes. To narrow it down, you will have to write tests where you change just one valid class at a time.

Corollary: A combined valid test will tell you whether something works in just a few tests, but will tell you only broadly if something failed. Unit tests will tell you whether all your features work but will take up a lot of tests, and if some things fail, they will tell you exactly which ones. So, combined tests are cheap for checking correctness, whereas unit tests are cheap for diagnosing failures.

TODO: Question: When will a black-box test find a bug that a white-box test won’t? Is it that it will find stuff in the spec that you didn’t implement or stuff outside it that you did implement? I need a controlled experiment to show that.


They suggest having one test case cover as many valid equivalence classes as possible (like “mountain bike with front lights and helmet”, where “mountain bike”, “front lights”, and “helmet” are equivalence classes). I suspect this is so because the test case would pass only if all of them are implemented correctly, which means you need fewer tests to check a lot of features.

They also suggest having one test case cover only one invalid equivalence class at a time (like “flying bike”, but not “flying bike with quantum-powered helmet”). I suspect that is because the latter may fail immediately after “flying bike” and thus you will never know if an invalid input like “quantum-powered helmet” is caught by your system.
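The two suggestions can be sketched together (the validator and the field names are invented for illustration): one test covers many valid classes at once, and each invalid class gets its own test, toggled off the valid base case.

```python
# Hypothetical order validator with three fields, each having
# valid and invalid equivalence classes.
VALID = {"bike": {"mountain bike", "road bike"},
         "light": {"front lights", "no lights"},
         "headgear": {"helmet", "cap"}}

def validate(order):
    return all(order[field] in allowed for field, allowed in VALID.items())

# One test covers as many valid classes as possible:
base = {"bike": "mountain bike", "light": "front lights", "headgear": "helmet"}
assert validate(base)

# Each invalid class gets its own test, toggled off the valid base,
# so that one rejection can't mask another:
for field, bad in [("bike", "flying bike"),
                   ("headgear", "quantum-powered helmet")]:
    case = dict(base, **{field: bad})
    assert not validate(case)
```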

Test: [2018-11-03 Sat] Fortran DIMENSION statement.

External condition     | Valid equivalence classes | Invalid equivalence classes
#array descriptors     | 1, > 1                    | 0
symbolic name length   | 1 to 6                    | 0, > 6
symbolic first letter  | letter                    | digit
symbolic name contents | letter, digit             | non-alphanumeric
#declarators           | 1 to 7                    | 0, > 7
constant               | -65534 to 65535           | < -65534, > 65535
ub                     | constant, int var name    | array elem name, something else
lb                     | constant, int var name    | array elem name, something else
lb specified           | yes, no                   |
ub                     | >= lb                     | < lb
lb                     | -ve, 0, +ve               | non-number
multiple lines         | yes, no                   |

If I didn’t have separate equivalence classes for lower and upper bounds, then I may try (6:FOOINT), but not (FOOINT:6), which may crash the program.

Note, however, that the book itself treats integer variable name as a separate equivalence class. So, it may not test whether ub could handle digits in the integer variable name. All your tests may have ub as an integer variable name with only letters. Even if you were clever enough to use F3035X as your example everywhere for the integer variable name class, you still aren’t testing FX3035. Maybe ub fails when the second character is a letter. Or, worst of all, it fails only for the specific name FAILZZ. How can you guard against that? You can’t.

So, you’re assuming that the programmer won’t create finer equivalence classes than the specification implies.

Test cases to cover valid equivalence classes:

First, we cover all the valid equivalence classes.

Test 1: DIMENSION A393B(-6000:FOOINT)

Test 2: DIMENSION FOO(5, 6:9, BARINT:9), BAR(0:4)

Then, we cover the invalid equivalence classes.

Test 3: DIMENSION
Test 4: DIMENSION (3:4)
Test 5: DIMENSION FOOOOOOOOOOOO(3:4)
Test 6: DIMENSION 3FOO(3:4)
Test 7: DIMENSION FOO!(3:4)
Test 8: DIMENSION FOO()
Test 9: DIMENSION FOO(1,2,3,4,5,6,7,8)
Test 10: DIMENSION FOO(-1000000:3)
Test 11: DIMENSION FOO(10000000)
Test 12: DIMENSION FOO(BAR(3))
Test 13: DIMENSION FOO(BAR(3):9)
Test 14: DIMENSION FOO(1 + 3)
Test 15: DIMENSION FOO(3:2)
Test 16: DIMENSION FOO(1+3:5)

Test: [2019-01-12 Sat]

We define a new language grammar:

<expression> ::= <term> <addop> <term>
<addop> ::= ‘+’ | ‘-’
<term> ::= <digit>

For term, the valid classes are 0-9 and the invalid classes are +, space, letters, etc. For addop, the valid classes are + and -, and the invalid classes are * and /, space, other symbols, and letters. (Fuck! I hadn’t even thought of those until now.) For expression, the valid class is term followed by addop followed by term, and the invalid classes are cases where one of them is missing or invalid and cases where you have trailing garbage.

So, I should come up with one example of expression to cover all the valid classes and then toggle each part to cover all the invalid classes.

If I hadn’t come up with valid and invalid classes for each node, I would have missed out on symbols and letters for addop.
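As a sketch, the recognizer and the class-by-class tests might look like this (expression ::= term addop term, with term a single digit and addop '+' or '-'; a regular expression suffices since the grammar is regular):

```python
import re

def is_expression(s):
    # term addop term: one digit, then + or -, then one digit,
    # with nothing before or after (fullmatch rejects trailing garbage).
    return re.fullmatch(r"[0-9][+-][0-9]", s) is not None

# Valid classes, covered by one example each:
assert is_expression("1+2")
assert is_expression("3-4")

# Invalid classes, toggling one node at a time:
assert not is_expression("+2")    # missing term
assert not is_expression("1*2")   # addop from the invalid class (symbols)
assert not is_expression("1+a")   # term from the invalid class (letters)
assert not is_expression("1+2x")  # trailing garbage
```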

Test: [2019-01-13 Sun] StackASMGenerator - had just given input expressions that covered the different structures of Exp. However, I hadn’t tried to construct expressions that would result in different output values like -1, 0, etc. – Ah. The output variable is different now, hence I may need different input equivalence classes. So, 2 - 1 and 1 - 2 were equivalent inputs for the output variable of Exp but vastly different inputs for the output variable of Int.

Test: [2019-01-13 Sun] StackASMGenerator - needed to handle Minus, Divide, etc. separately since they have different input equivalence classes for the output variable of type Int. For example, -9 * -2 didn’t have any problems. But -9/-2 gave me 5! I was even ready to dismiss that as “rounding”. But then -9 / -3 gave me -1431655762. Dafuq! Realized that I had failed to sign-extend rax into rdx and was simply zeroing out rdx. – Input cases are relevant factors for input equivalence classes. That saved me from making a stupid mistake.

Test: [2019-01-18 Fri] I need smaller tests to check my progress. For example, with a large black-box test, I didn’t implement operator precedence, so I get this long error message

Prim(*/+-,Unary(-,Lit(3)),Prim(/,Lit(0),Prim(*-,Lit(2147483646),Prim(*,Lit(4),Prim(-,Unary(-,Lit(2)),Prim(-,Lit(2147483646),Prim(+,Lit(2),Prim(*,Lit(7),Prim(*,Lit(8),Lit(9))))))))))
did not equal Prim(*/+-,Unary(-,Lit(3)),Prim(*-,Prim(/,Lit(0),Lit(2147483646)),Prim(+,Prim(-,Prim(-,Prim(*,Lit(4),Unary(-,Lit(2))),Lit(2147483646)),Lit(2)),Prim(*,Prim(*,Lit(7),Lit(8)),Lit(9)))))

It’s a precedence problem, alright. But I’ll need to muck around a lot before I get that entire test to pass. (Or not. Maybe it’ll work right away.) In the meantime, I haven’t got even a single positive exemplar for my code so far. Bad.

Maybe this is why we have white-box tests.

What I need is a test that toggles just one variable at a time. So, when I implement let-expressions, I should add to a working expression a new let-declaration.

Man, black-box tests suck for debugging. Didn’t know what had gone wrong when implementing a let parser.

High-Coverage Tests that Match the Spec tell you when Code Deviates

Hypothesis: High-coverage tests that match the spec are relevant factors for failing the tests when the code doesn’t match the spec.

Tests that don’t cover the code as much may miss some cases where the code doesn’t match the spec.

Tests that don’t match the spec definitely miss some cases where the code doesn’t match the spec.

Hypothesis: Changing the outputs of your tests when the spec changes is a relevant factor for noticing code mismatch.

If you don’t change (or check) the outputs of some tests, you may not notice when the code doesn’t match the spec anymore.


Test: [2019-02-23 Sat] Case where I was saved because of branch coverage (i.e., making the white-box test inputs cover the code’s branches) and updating the outputs when the spec changed:

This is some old and ugly code from my Emacs configuration. It adds a Markdown quotation source using the link that you have in your clipboard. If you pass an argument, then it simply adds an empty quotation source.

(defun website-insert-quote-source (&optional arg)
  "Insert yanked link in blockquote format.
With arg, just insert an empty signature."
  (interactive "P")
  (if arg
      (insert ">\n> -- ")
    (let* ((yanked-link (website-get-yanked-link))
	   (link-string (format "[%s](%s)"
				(website-link-attribution yanked-link)
				(website-convert-link yanked-link))))
      (insert (format ">\n> -- %s" link-string))
      (search-backward "]")))

Now, I wanted to add a newline at the end of the quotation source because otherwise I had to do that manually each time. I added it to the second branch and ran my tests and found that three tests failed, as expected. Score one for unit tests. However, I realized that I actually had four unit tests for this function and one of them was failing to fail. Turns out I hadn’t changed the case for the empty source.

This is why you need branch coverage. Without it - specifically, without a test for the empty-source branch - I would have updated my three tests with newlines, changed just one branch, passed those tests, and thought that I had successfully changed my function to match my new requirements. I wouldn’t have realized that I had missed a branch until I kept getting the old annoying behaviour in some mysterious cases.

Test: [2019-02-23 Sat] Let’s say your spec is to join a bunch of strings using some delimiter.

You have tests whose inputs cover the code and whose outputs match the spec:

*Main> let joinWith d [] = []; joinWith d xs = foldr1 (\x y -> x ++ d ++ y) xs

*Main> joinWith "-" [] == []
True
*Main> joinWith "-" ["yo", "boyz", "I", "am"] == "yo-boyz-I-am"
True

The two input cases that cover the code’s branches are the empty list and the non-empty list. We test that their outputs match the spec. (I’m using simple equality checks in the REPL for this example.)

Let’s say the spec changes to ignore any empty strings. What needs to change?

Well, unfortunately, none of the test outputs need to change because none of the test inputs contain an empty string. And since the tests don’t change, the code doesn’t need to change either. But that’s not right.

This is where black-box tests, specifically equivalence classes, come in handy. We don’t have any test input that covers the equivalence class of “empty string” as mentioned in the spec. All of the strings we use in our test inputs are non-empty.

So, the need for equivalence class coverage will make us change our test input:

*Main> joinWith "-" ["yo", "boyz", "", "am"] == "yo-boyz-am"
False
*Main> joinWith "-" ["yo", "boyz", "", "am"]
"yo-boyz--am"

As we can see, it adds a double dash because of the empty string. The test fails, so we change our code:

*Main> let joinWith d [] = []; joinWith d xs = foldr1 (\x y -> if x == "" then y else x ++ d ++ y) $ xs

*Main> joinWith "-" ["yo", "boyz", "", "am"] == "yo-boyz-am"
True

Now, it passes the new test. Are we done? Actually, the tests don’t cover all the branches of the code. foldr1 has two branches: the singleton list and the non-singleton list.

See what happens when you test with a singleton:

*Main> joinWith "-" ["yo"] == "yo"
True

It works. Huh. No problems, then?

What if the singleton element is an empty string?

*Main> joinWith "-" [""] == ""
True

I somehow suspect that there’s a bug, but I’m not able to narrow it down.

*Main> joinWith "-" ["yo", ""]
"yo-"

Aha! Gotcha. How could I have caught that?

So far, the tests are:

main :: IO ()
main = hspec $ do
  describe "joinWith" $ do
    it "should join strings with delimiters" $ do
      joinWith "-" [] `shouldBe` []
      joinWith "-" ["yo"] `shouldBe` "yo"
      joinWith "-" ["yo", "boyz", "I", "am"] `shouldBe` "yo-boyz-I-am"

    it "should handle empty strings" $ do
      joinWith "-" [""] `shouldBe` ""
      joinWith "-" ["yo", "boyz", "", "am"] `shouldBe` "yo-boyz-am"

    it "should handle an empty string at the end" $ do
      joinWith "-" ["yo", ""] `shouldBe` "yo"

How could I have come up with these test cases given my original spec and implementation?

Note that I fell into the trap of the empty string at the end of the list because I modified my existing program instead of coming up with a new program based on the new spec. If you’d told me that you needed to ignore empty strings, I would have just filtered out the empty strings before calling joinWith.

Other Ways to Notice Code Mismatch

Hypothesis: Running tests is a relevant factor for noticing code mismatch.

Hypothesis: Changing the inputs of your white-box tests when the code changes is a relevant factor for noticing code mismatch. (TODO: Need a controlled experiment.)

Question: When you refactor your code, you judge whether your code still works by checking if the tests pass. But don’t you need to change the white-box test inputs when you change the branches in your code? (TODO)

Hypothesis: Changing the inputs of your black-box tests when the spec changes is a relevant factor for noticing code mismatch.

Hypothesis: Changing the outputs of your tests when the inputs change is a relevant factor for noticing code mismatch.

Branch Coverage vs Multiple-Condition Coverage

Hypothesis: If-expression with multiple conditions is a relevant factor for multiple-condition coverage leading to more tests than branch coverage.


Test: [2019-01-23 Wed] In the following case, one case each for Int, Unit, and Boolean, and one case like Literal(3) would satisfy branch coverage. However, you aren’t testing close to the boundary. You want to test something that is also an identifier but not one of the three above. Something like Ident("IntFoo"). I guess multiple-condition coverage would kind of cover that since Ident("Int") is actually isIdent && Ident("Int"), which will force you to try Ident("SomethingElse") to satisfy the first clause but not the second.

def parseType: Type = in.peek match {
  case Ident("Int") =>
	in.next
	IntType
  case Ident("Unit") =>
	in.next
	UnitType
  case Ident("Boolean") =>
	in.next
	BooleanType
  case _ => expected("type")
}

That wouldn’t have been the case if each had been a simple case like:

def parseType: Type = in.peek match {
  case IntIdent =>
	in.next
	IntType
	...
  case _ => expected("type")
}

In this case, branch coverage and multiple-condition coverage would have yielded the same tests.

Stateful Programs: Assert Properties about the State

Hypothesis: If you also mention what a function does to the state, then you can catch any unwanted side-effects. Otherwise, you can’t.

Test: [2018-11-07 Wed] I tested that an empty method on a Queue would do the right thing for an empty, partially-filled, or full queue. I didn’t test, however, that it wouldn’t change the queue.
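A sketch of the fix, with a hypothetical Queue: assert the return value and assert that the query left the state untouched.

```python
class Queue:
    def __init__(self):
        self.items = []

    def empty(self):
        # A buggy implementation might, say, pop while checking;
        # the state assertion below would catch that.
        return len(self.items) == 0

q = Queue()
q.items = ["a", "b"]

before = list(q.items)
assert q.empty() == False   # the usual output assertion...
assert q.items == before    # ...plus: no unwanted side effect on the state
```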

How to Test Whether a Context Variable is Relevant?

Hypothesis: Having a context variable is a relevant factor for trying to find out examples where that variable affects the output. (TODO)


Question: How do I test the following? There are no real branches within the Let-case. However, I was surprised to see that the return type for Let was nbody.tp and not pt. It feels like there should be a test for that, just to smoke out people who went for the wrong thing (even though I don’t know why it’s wrong).

case Let(x, tp, rhs, body) =>
  if (env.isDefined(x))
	warn("reuse of variable name", exp.pos)
  val nrhs = typeCheck(rhs, tp)(env)
  val nbody = typeCheck(body, pt)(env.withVal(x, nrhs.tp))
  Let(x, nrhs.tp, nrhs, nbody).withType(nbody.tp)

I think the black-box test will handle that. The type of the function is Let(x, tp, rhs, body) -> pt -> Let(x', tp', rhs', body').withType(letType). We know that x’ is just x. rhs’ is just rhs with types. But can it be affected by tp? I’m not sure. That’s the problem with typeInfer as it is currently defined. I can’t tell the exact causes of the final types.

Ah. What I need is one example where rhs’ gets affected by tp. Until I do that, I will never be sure if tp matters.

Note that if this function weren’t defined to do both type inference and type unification at the same time, you would have a clean recursive function, where the inferred type of rhs would depend only on its body. Now, it depends on its body and on the expected type. Now, in order to induce the input-output relation for the function, you have to concoct an example where tp changes the output type rhs’.

I would understand perfectly if Let(...) just had the type signature of its body (in the context of the bound variable). But now, it somehow has pt as well.

Maybe I don’t need to find an example of the argument pt affecting the output of typeInfer for every branch. Maybe just one case would be enough. But I’m not able to think of it.

And now I’m thinking of combinations like pt is UnknownType and the body’s type would end up being IntType, or pt is IntType and the body’s type would end up being UnknownType (can that ever happen? don’t know).

Feature Testing vs Regression Testing

Hypothesis: When you change an existing branch, covering it first with a test is a relevant factor for detecting regression. (TODO)

When you don’t change an existing branch (say, when you add a new feature), covering the existing branch with a test is not a relevant factor for detecting regression (since you probably won’t affect it or any of its callees).

Hypothesis: When you add a new feature, writing a test for it is a relevant factor for finding out if you’ve implemented that feature correctly.

It’s all Equivalence Class Testing

Hypothesis: Black-box testing with valid and invalid equivalence partitions treats all inputs to a variable that produce valid outputs to be part of the same equivalence class (and the same for those that produce invalid outputs).

Branch coverage treats all inputs that exercise the same branch to be part of the same equivalence class.

Path coverage does the same for all inputs that exercise the same path.

Hypothesis: One input from an equivalence class working as expected is a relevant factor for all the other inputs from that class working as expected for that aspect of the program, be it a branch or a path.
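A minimal sketch of the idea, using a hypothetical age validator (`validAge` is an invented example, not from any library): you pick one representative input from each partition and let it stand for its whole class.

```haskell
-- Hypothetical validator: ages 0 through 120 count as valid.
validAge :: Int -> Bool
validAge n = n >= 0 && n <= 120

-- One representative per equivalence class: the valid partition [0..120]
-- and the two invalid partitions (below 0, above 120).
main :: IO ()
main = print (validAge 35, validAge (-1), validAge 200)
```

If `validAge 35` works, you assume every input in [0..120] works; if `validAge (-1)` is rejected, you assume every negative input is rejected.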

Observation: [2019-02-23 Sat] I got this hint from the following source:

A partition test is an abstraction of any systematic method that divides the input space into disjoint subdomains, and requires that tests be selected from each subdomain. Path testing is such a method, because the equivalence classes of inputs defined by the relation “execute the same path,” do constitute a partition.

– Random Testing, Hamlet 1994

Note that he talks about partitioning the input space. I’m not going that far. For example, one input could exercise all branches in a function and thus give you 100% branch coverage (like 5 for factorial, since it will eventually call factorial(0)). So, the equivalence classes for the branches may not be disjoint, but they still serve their purpose. If some input going down the zero-branch of factorial works as expected, you will assume that all inputs going down that zero-branch will work (it’s actually just one input - the value zero). If some input going down the non-zero branch of factorial works as expected, you will assume that all inputs going down that branch will work. It doesn’t matter if it is the same input in both cases.

But aren’t there cases where one input going down a branch could work as expected but some other input going down that same branch could fail? Yes, there are cases where path coverage gives more tests than branch coverage (don’t have a concrete example right now). But there are always more tests that you could write. State coverage, I’m told, yields even more tests than path coverage (no concrete example right now). The point is that we can’t afford to write all the possible tests for a function. So, we stop at some practical limit, usually branch coverage or multiple-condition coverage.

Fast Testing vs Fast Debugging

Hypothesis: Choosing a minimum set of inputs that cover all equivalence classes is a relevant factor for fewer tests but higher uncertainty about which equivalence classes failed and higher cost in figuring out how many classes it covers.

Choosing a unique input for each equivalence class is a relevant factor for more tests but lower uncertainty about which equivalence class failed and lower cost in figuring out how many classes it covers.


Test: Consider tests for factorial.

factorial 0 = 1
factorial n = n * factorial (n - 1)

The test case factorial 5 == 120 covers both the branches because it eventually calls factorial 0. Note that you have to think harder to figure out that it does so. There’s an additional cost in writing test cases that cover multiple branches.

However, suppose somebody had written it wrongly as:

factorial 0 = 1
factorial n = (n - 1) * factorial (n - 1)

Now, factorial 5 == 120 still covers both branches (though you’d have to think a bit to figure that out, just like before). This time, the test fails because factorial 5 returns 0.

Now, which branch has the problem? It’s easy to eyeball it when there are just two lines of a very simple program. Things will get much harder when you have hundreds of lines of old code with lots of branches.

If, instead, you’d written two disjoint branch tests, like factorial 0 == 1 and factorial 5 == 120, you would know that it works correctly (as far as you know) for the zero-case. So, you’ve isolated the error to exactly one branch. That’s a major decrease in uncertainty.
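The two disjoint branch tests can be written as plain Boolean assertions (a sketch; in practice you might use HUnit, mentioned elsewhere in these notes):

```haskell
factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * factorial (n - 1)

-- One test per branch: if the non-zero branch regresses,
-- testZeroBranch still passes and isolates the error to one branch.
testZeroBranch, testNonZeroBranch :: Bool
testZeroBranch    = factorial 0 == 1
testNonZeroBranch = factorial 5 == 120

main :: IO ()
main = print (testZeroBranch, testNonZeroBranch)
```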

Assorted Notes on Tests and Uncertainty

Implementing a Spec: Acceptance tests vs Unit tests (??)

Hypothesis: Given high-level design, pseudo-code that you know is correct, implement first, write tests later, put only as much confidence in your program as the number of tests you pass -> get correctness first and good module-level design later -> low uncertainty about each function because you already know the pseudo-code, quick.

Given high-level design, pseudo-code that you know is correct, test-driven design -> get correctness and good module-level design at the same time -> low uncertainty, but slow because you have to code step-by-step.


Test: OS HW3 - I’m given the pseudo-code for page fault handling and free frame obtaining. But I’m still taking a long time to “design” the free-frame obtaining code correctly. – slow because you’re trying to get correctness and good design at the same time.

Test: pipcreate - all possible configurations handled, but it took a long time (~25 hours) before I could pass even one shell integration test. – was going test by test.

When Planning on Paper, You may Unwittingly Double-Allocate Resources (??)

Hypothesis: If you work in the final medium (or in something that correlates to the final medium), you will allocate each resource only once.

However, if you just plan on paper, you may double-allocate resources without knowing it.


Test: [2018-12-19 Wed] When making a plan for the winter vacations, you may say “I want to experiment with X and Y and I want to write an essay about Z and do this and that and the other thing”. But it could be that just the experiments on X and Y or just the essay on Z need the entire vacation, which is something that your plan does not reflect at all. So, you’ve double-allocated your vacation and are guaranteed to be disappointed.

Contrast that to a plan made on a calendar, where you block off days (or just chunks of time) for each activity. Suddenly, you’re forced to allocate each day to only one activity. Your plan may go wrong because your estimates were off, but it won’t go wrong because you allocated the same day to two conflicting tasks. Note that a calendar plan isn’t the same as an actual week spent working on a project, but it is correlated in the sense that a week on the calendar corresponds to a week spent on the project. You can’t fit more than a week’s work on a calendar, but you can fit in as much as you want in a free-form plan.

Quick and Dirty vs Carefully Planned: The Fear of Future Change (??)

Hypothesis: Change is costly, fear that my code or “understanding” of some subject might change -> don’t commit to it (don’t create flashcards, don’t practice the technique, don’t build other ideas on top of it).

Change is costly, feel 100% sure that this fact will remain this way forever -> commit to it (create flashcards, practice the technique, build other ideas on top of it).

Change is cheap, _ -> commit to it.

Question: Why is change so costly? (“Asked in such fashion, the question answers itself.”)

Hypothesis: High coupling between modules -> change is costly.

Low coupling between modules -> change is cheap.

Hypothesis: Not completely clean -> I hesitate to commit it (literally).

Completely clean -> I commit it.

Hypothesis: I need to reinforce the action of getting quick and dirty results for a test and, just as importantly, the action of refactoring that code afterwards.

Corollary: I need to split up my essays into decoupled modules. Instead of saying “don’t couple unrelated things”, make it so that I “can’t couple unrelated things”.


Test: My current flashcards - why use salted hash [permanent information] – willing to commit.

Test: how to run an HUnit unit test [permanent information] – willing to commit.

Test: why I dislike do-blocks [permanent information] – willing to commit. Pretty solid theory.

Test: keybinding for find-tag [permanent information] – willing to commit. Pretty solid keybinding.

Test: Negative exemplar - impermanent information - category hierarchy - still not sure how exactly to build a “category hierarchy” and whether that will “really work”. (Only recently figured out how category formation and concept attainment really work. Still not memorized that like hell.)

I seem to have a major block around using any technique I don’t understand completely. So I keep procrastinating till I have refactored each technique to within an inch of its life and placed it in its proper home in a global, flexible, easy-to-remember category hierarchy. This. Is. Absurd.

Explanation: My main worry: what if I change my idea tomorrow of how to form a “category hierarchy”? Then… Then what? Then I would have to narrow down and update the flashcards I created on this topic. (Easy fix: just delete the old ones.)

What else? Not much.

Minimize the effect of a change in one model on the other models. Dude, this is basic software engineering. Deal with it. Minimize your interfaces and move on. Decouple your models. Don’t write giant-ass essays.

Test: Decoupled models that I fail to separate - category hierarchy and … everything else. But it still feels like it’s intimately connected. Doesn’t your thinking algorithm affect everything you think?

But category hierarchy and, say, minimizing interface are 100% decoupled. No matter how we humans categorize things, minimizing interface is still going to be the ideal way to decouple a function from other functions, make it easier to understand, easier to reuse, and easier to change. That part really is pretty solid. – I act like I can’t proceed with flashcards for minimizing interface and other programming techniques until I get a handle on category hierarchy.

Test: Hesitated to commit this section because I felt like it could be cleaned up even more. – suggests that submitting not-completely-clean code has been punished, somehow.

Carry the Source and Mechanism along with the Output

(h/t The Pragmatic Programmer, chapter 5)

Label: Source plus mechanism for an output = mechanism that produces the desired output when given the source.

Hypothesis: Have the generating mechanism and source for an output, lose the output -> can regenerate it (at some cost).

Have the generating mechanism and source for an output, have the output -> can change the source to get different outputs.

Don’t have the generating mechanism or the source for an output, have just the output -> can’t get a different output.

Don’t have the generating mechanism, have the source, have just the output -> can try to port the source.

(Basically, having the source means that you can set the inputs, and having the mechanism means that you can generate the output. Together, they give you a low cycle time.)

Hypothesis: One-button change -> force you to carry the generating mechanism along with the output. Why? Because it’s easier to change the source material (like LaTeX) than the target material (final presentations).


Test: Source plus mechanism - positive exemplar - LaTeX file plus the pdflatex program – can generate various outputs.

Test: Source plus mechanism - negative exemplar - old homework assignment – have the mechanism pdflatex and even the output, but not the source.

Test: Source plus mechanism - negative exemplar - my old Turbo-C++ Breakout clone - have the source for the game but not the mechanism – can’t recreate the game (without looking up some modern port of graphics.h).

Test: Presentation – pretty useless if you lose the source. Have to work hard to recreate the slides (and muck around with image-editing tools).

Test: Presentation - have the professor’s slide - want to include some of the slides or pictures – have to work hard to recreate them. Need to do a Google image search for the pictures.

Test: LaTeX - lost the source for the original homework assignment – had to copy the text and wrestle with the formatting till I got it looking right again.

Test: Plots generated for the output of k-means or other clustering algorithms – can’t change the input slightly (maybe have five clusters or try a modification of the algorithm).

Test: Gifs generated to visualize the steps of the k-means algorithm – can’t change the input slightly, just like above. Don’t have the mechanism either.

Test: Dendrogram generated from hierarchical clustering – can’t change one variable (the distance measure used) to get another dendrogram because I don’t have the source or the mechanism.

Test: Negative exemplar - have a text file containing the number of points I’ve got. Want to change it – can do it effortlessly because I have the source and the mechanism (which is just a text editor to change the text and view it or a shell command to do the same).

Test: What about my mental model of something? – may not have the source books and lectures and personal experiments that formed my mental model; still have the generating mechanism though (my mental learning algorithm), even if it has changed after experience. It’s not a first-class output either. So, I can’t recreate the output in some other mind and I can’t give them the source material for them to recreate it themselves.

Test: Citations in a book or paper backing some fact – can re-read the paper and get the actual fact.

Test: Citations plus quotation – don’t even have to re-read the cited paper (unless you’re suspicious); you can read the source quote itself.

Isolate the Features you’re Most Uncertain about

Hypothesis: Isolate the features you’re most uncertain about and implement them in reverse order -> minimum time and worry.


Test: For example, I’m super-uncertain about the theoretical questions in my assignment – wait until later.

Test: Also not sure about how to use pandas or how to prune. Not fully sure how to handle missing attributes either. – wait.

Test: Negative exemplar - I know exactly how to implement the vanilla version – do it right now.

Don’t Bite Off More than You can Chew

TODO: Write an essay about releasing early and how you try to optimize constraints that are not worth optimizing. How should you ideally approach the matter? How do you get stronger?

Key to Peace of Mind: Thorough, Fine-grained Testing

Hypothesis: The way to be confident about a system is to test for exactly the properties you want it to have, including past behaviour. That way, if the tests pass, you know you’re golden.

If you don’t test for your desired behaviour or don’t test for regression, you will always be worried.

Why? Because the positive and negative exemplars from your tests would narrow down the causes of things for you and remove your ambient worry about everything else.


Test: [2018-07-10 Tue] Earlier, I didn’t have positive exemplars of how to render a post. Full of worry.

Test: [2018-07-10 Tue] Now, I do. No worries.

Coupled Tests are Harder to Pass (??)

Hypothesis: When you have several tests you must pass but they don’t correspond to different branches in your code, you will find it harder to code because at each point you will have to satisfy multiple constraints.

Hypothesis: When you satisfy multiple constraints with the same branch of code, it means that your solutions don’t compose.

If you can handle the constraints separately, that means you can compose the results.


Test: CSS - a single row in a form should handle short text, long text, hovering, clicking elsewhere, etc. with the same CSS config.

Test: Applicative parser - handle the two sum types separately and compose using <|> – your solutions compose.

Test: Constraint - the meta text has to align with the item text in a list. But the item text has to be centered in line with the circle at the start of the list. – couldn’t do it.

Mocks give you Fine-grained Feedback for your UI

Hypothesis: Use the design mocks to get fine-grained feedback about your UI design.

Change Only Data that this Feature Owns

Question: What about side effects? Can’t you inadvertently affect other features by changing common variables?

Hypothesis: If you change data that belongs to your feature, you can be confident that you won’t affect the behaviour of other features.

If you change common data, then you have to check that other features aren’t affected.

Corollary: OOP separates data for different features by keeping all of a feature’s data within the same class.

Corollary: Pure functions mean that you never change data used by other features.

Global mutable variables mean that you change data used by other features.

To Prevent Overgeneralization, Have a Default Else Case (??)

Hypothesis: Having a default else-case for your code will force you to specify the exact conditions under which your feature is applicable and, importantly, all the other conditions under which it is not known to be applicable. Even if you find out that it generalizes to other areas, you can expand the if-condition while still keeping the default else-case. This way you will prevent yourself from overgeneralizing.

This is the same idea as eliminating potential factors only after seeing them toggled in some positive exemplar.
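A minimal sketch in Haskell, with a hypothetical membership-discount rule (`discountFor` and the status strings are invented for illustration): the guard names the exact condition under which the feature is known to apply, and the default case refuses everything else.

```haskell
-- Hypothetical rule: the discount is known to apply only to verified members.
discountFor :: String -> Maybe Float
discountFor status
  | status == "verified" = Just 0.1
  | otherwise            = Nothing  -- default else-case: not known to apply

main :: IO ()
main = print (discountFor "verified", discountFor "guest")
```

If you later learn the rule generalizes to, say, trial members, you widen the guard while keeping the default case, so the code still states exactly where the feature is known to work.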

Editor Wars

Observation: Atom (Nuclide) seemed highly adequate during my internship. No real complaints.

Linters gives you Fast Feedback

Test: Stopped using a field in a Relay fragment – immediately got a lint warning saying that the field was unused!

Changing a Program and Reusing Code: Minimize the Interface

Implement in a New Branch or Implement in Isolation

Hypothesis: If you want to add a new feature, implement it in a new if-branch.

If you want to extend the output within an if-branch, implement the change in isolation and move it in.

Toggle behind an Interface with One Button!

Hypothesis: Feature you want to toggle, hide the current feature behind an interface and toggle completely with one button -> handle each use of the interface (read or write), don’t mix and match between the features, quick -> no uncertainty, no errors.

Feature you want to toggle, don’t hide the current feature behind an interface or don’t toggle completely or don’t have just one button -> may miss places where it is used, will take longer -> uncertainty, errors.

Hypothesis: Searching for all uses of the thing you’re toggling leads to recursive toggles.
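A sketch of the one-button toggle, using the scheduling example from the preface (the `Proc` record and `pickNext` interface are invented for illustration): callers program only to the interface, so flipping a single flag switches the whole implementation.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- Hypothetical process record.
data Proc = Proc { pid :: Int, prio :: Int }

-- The one button: flip this and every caller switches policy at once.
usePriority :: Bool
usePriority = False

-- The rest of the system calls only pickNext, never a policy directly.
pickNext :: [Proc] -> Proc
pickNext = if usePriority
  then maximumBy (comparing prio)  -- priority scheduling
  else head                        -- round-robin: first in queue

main :: IO ()
main = print (pid (pickNext [Proc 1 5, Proc 2 9]))
```

Because no caller mentions a policy, you can’t mix and match between the two features by accident, and there is no second place to forget to update.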


Test: OS HW3 - massively confused. Don’t know how to implement page-directory switching. Don’t know what could go wrong. Don’t know all the places where page-directory could be used (and thus be messed up by me). - Try getting one integration test. If it breaks, you can always revert back to confusion. – It worked! All I had to do was add an entry to procent, initialize it during create, have pointers to the global page tables, set up procent for nullproc too, update the page directory entry in page_fault_handler, and simply load the directory upon context-switch. Basically, set up process page directory correctly for all processes (initialize, pointers to global page tables, null process); update it when needed; load it when needed.

Ah, fuck. I should have hidden process.page_directory and made them come for it. That way I would have known who wanted to read it and who wanted to write it. What a way to drive in an age-old lesson - program to an interface, not an implementation. I forgot it when it came to OS development in C.

Explanation: I had the integration test of two processes writing to the same virtual address. When it came to the first 4096 pages, both the alternatives would work the same. They would differ only when going beyond the initial pages. The default alternative would overwrite one of the values in the null process’ page directory. My new per-process page directory alternative would have separate page directories for them and thus write separate values. So, pdbr should point to the current process’ page directory. Also, as per the integration test, the page fault handler should also use the current process’ page directory.

Hmm… I should basically have grepped for page.*directory and looked at everything that depended on it.

Look at the interface for everything.

The kernel’s view of the “page directory” - all it knew or cared about was CR3. So, if I wanted it to use a different page directory, I needed to change CR3. Where? At the end of resched.

Similarly, the OS’s view of the page directory - right now, everything depended directly on page_directory_nullproc. That was the only page directory in existence. I wanted to propose an alternative to it - a per-process page directory. So, I should have hidden it behind an interface and then toggled to the per-process page directory without anyone being any wiser.

Actual explanation: Feature I wanted to toggle - global page directory vs per-process page directory; didn’t hide the global page directory behind an interface -> missed places where it was used (like the page fault handler and resched).

Test: Didn’t hide procent behind an interface when toggling the feature “no page_directory field” to “page_directory field” -> missed places where it was used (like create, initialize, and global initialize).

Test: I forgot to update nullproc’s page directory. Why? Because I updated procent without hiding it behind an interface. So, I didn’t catch all its users, in particular initialize. I should have looked for everything that initialized procent (since nothing would read or write to page_directory yet). – Caused errors.

Test: When toggling global page directory with per-process page directory, I have to toggle the definition of procent, which requires me to seek all uses of procent. Similarly, toggling memlist involves toggling meminit, which requires finding out all the places where meminit is called.

OLD - Modularity: Make Changes behind an Interface (??)

Hypothesis: Identify what changes you want to make and turn those hotspots into an interface. Then, implement the new module you want behind that interface. Nothing in the rest of the system has to change.

Hypothesis: List of changes -> properties for the category definition.

For example, we define the categories of “greedysubset” and “forwardfitting” by what they do in their inner loops. (Otherwise, they’re both the same.)

Corollary: Don’t anticipate making any changes -> no need to use an interface.

Question: What about code reuse?

Hypothesis: Coding in a modular program -> you will change stuff only behind the interface. Won’t have to touch the rest of the system.

Hypothesis: Refactoring -> will change the rest of the system to use your new interface.

Observation: OS HW1 - I changed only the scheduler code and looked only at the scheduling output (not the memory usage or pipe behaviour).

Hypothesis: Change behaviour of a big system <- choose output property, narrow down the parts you want to change into an interface, toggle to your heart’s content.

Hypothesis: Focus on just one small property (like being unreactive) and figure out what you need to change to get it -> minimum interface (wrt that property).

Decouple by creating an Explicit Interface

Hypothesis: If you have a bunch of code that is all mixed up, decouple one part of it by hiding it behind an explicit interface, testing that the interface covers all existing use cases, and then moving the code to another module. If you don’t have such an explicit interface, then, for all you know, the rest of the code might depend on any part of this code and thus you will remain uncertain about its actual effects.

Small Interface: Expose Only One Thing at a Time (??)

Hypothesis: Small interface -> easier to use.

Large interface -> harder to use.


Test: Haskell monad transformers - sound scary, but are surprisingly usable.

For example, the implementation of ReaderT may look intimidating:

newtype ReaderT r (m :: k -> *) (a :: k)
  = ReaderT {runReaderT :: r -> m a}

But look at its interface: you can use it as a functor or an applicative or a monad. So, if you want to use applyDiscount :: Float -> ReaderT Config m Float twice, you can just say applyDiscount <=< applyDiscount without worrying about anything else, because ReaderT Config m is just another monad.

Also, when defining applyDiscount, if you have a function discount :: Float -> Float -> Float that accepts an amount and reduces it by the given discount rate, then you can use asks discountRate :: Monad m => ReaderT Config m Float and then fmap (discount givenAmount) over the value to get the discounted amount within the ReaderT monad. You’re programming to the interface of functors and monads.
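Putting the pieces above together as a runnable sketch (the Config type, discountRate field, and the 10% rate are assumptions for illustration):

```haskell
import Control.Monad ((<=<))
import Control.Monad.Trans.Reader (ReaderT, asks, runReaderT)
import Data.Functor.Identity (Identity, runIdentity)

-- Hypothetical config carrying the discount rate.
newtype Config = Config { discountRate :: Float }

-- Reduce an amount by the given rate.
discount :: Float -> Float -> Float
discount amount rate = amount * (1 - rate)

-- Program to the monad interface: fmap over the rate read from the environment.
applyDiscount :: Monad m => Float -> ReaderT Config m Float
applyDiscount amount = fmap (discount amount) (asks discountRate)

-- Composing it with itself needs nothing beyond the monad interface.
applyTwice :: Monad m => Float -> ReaderT Config m Float
applyTwice = applyDiscount <=< applyDiscount

main :: IO ()
main = print (runIdentity (runReaderT (applyTwice 100) (Config 0.1)))
```

Nothing here touches the `r -> m a` representation; the intimidating newtype stays hidden behind the small functor/monad interface.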

Abstract: Don’t Accept all the Information you’re Given

Hypothesis: Desired output, accept all the information given (deal with a concrete type) -> have to deal with more input configurations.

Desired output, abstract the inputs using the possible outputs (deal with an abstract type) -> have to deal with fewer input configurations.


Test: findPerson vs find. – Have to deal with Person being male or female.
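A sketch of the findPerson-vs-find contrast (the `Person` type and `findAdult` are invented for illustration): the search itself never needed the concrete type, so abstracting to a predicate leaves fewer input configurations to reason about.

```haskell
import Data.List (find)

-- Hypothetical concrete type carrying more information than the goal needs.
data Person = Person { name :: String, age :: Int }

-- Concrete version: its author has to think about Person's fields.
findAdult :: [Person] -> Maybe Person
findAdult = find (\p -> age p >= 18)

-- The generic interface underneath never mentions Person at all:
-- find :: (a -> Bool) -> [a] -> Maybe a

main :: IO ()
main = print (fmap name (findAdult [Person "kid" 7, Person "adult" 30]))
```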

Abstract as much as the Goal allows (?? use cases - new hypothesis)

Hypothesis: Goal doesn’t ask about the specifics of the input type, abstract the input type -> simpler, more reusable, easily changeable code; think the answer is pretty simple.

Goal doesn’t ask about the specifics of the input type, keep the input type concrete -> more “complex”, less reusable, hard-to-change code; think the answer must be “complex” and try weird stuff.

Hypothesis: The use cases are determined by the goal. If in no use case you will need part of the input, ignore it. Or if you can apply a function that will cover all of the use cases for that part, use it.


Test: Wanted to concatenate all the Parser [a] in a list [Parser [a]]. Was stuck for 24 minutes doing that because I got confused by the Parser bit and thought that the solution would be something “complex” like (fmap . fmap . fmap) concat pxs (Yes, I actually tried that). – Got unnecessarily intimidated. Once you abstract Parser as Monad m, your job was much easier.

Test: But the moment I thought of the problem above as concatenating all the lists within [m [a]], I got the answer immediately: sequence (plus an fmap concat over its result). – Felt pretty simple.
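The abstraction step as a runnable sketch (`concatAll` is an invented name; demonstrated here with Maybe standing in for any monad, including Parser):

```haskell
-- Once Parser is seen as any Monad m, concatenating the lists inside
-- [m [a]] is just sequence followed by concat.
concatAll :: Monad m => [m [a]] -> m [a]
concatAll = fmap concat . sequence

main :: IO ()
main = print (concatAll [Just [1, 2], Just [3]])
```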

Minimum Interface at Speed: More Hypotheses (??)

want to handle a new input configuration (add a new feature) - get a failing test first -> speed (no abstract thinking).

new feature, pure function, write automated failing test -> speed.

new feature, impure function, manual failing test -> speed.

failing test - pass it by writing a quick and dirty chain of function calls, no branches -> correctness, speed.

test passed - refactor your code -> speed, quality (more understandable, reusable, and modifiable).

manual test passed - write automated test -> speed, quality. (??)

writing code - do you have a failing test for it? -> write useful code.

question about one small part, write a test in a scratch file -> get answer quickly (since you’re only changing a few variables instead of the whole original program).

DRY: Don’t Expose Irrelevant Things (??)

Type: Change in interface -> amount of work needed, amount of uncertainty (and thus possible errors).

Hypothesis: Interface needed for goal, expose interface that’s just as small -> have to change only when the interface changes; guaranteed to be consistent.

Interface needed for goal, expose interface that’s larger -> have to change even when the change in the interface is outside the interface needed; outside changes may make things inconsistent.


Test: Magic constant “1024” in paging - it has the same cause everywhere (it’s the size of a page and a page table and a page directory). – If that changed to 2048, I’d have to change a lot of code manually. (Ideally, I should have a separate variable for the size of a page, a page table, and a page directory and, if necessary, assert that they were all equal.) Here, it’s a value (1024) that has the same cause everywhere. Basically, the code doesn’t care about the actual value of the constant, but it uses it anyway. So, when the constant changes, the code has to as well. Also, I might forget to update the constant everywhere (maintain the correlation between all of the values) and thus introduce bugs.
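A sketch of that fix (names invented for illustration): give each cause its own name, even though the values coincide today, and state the coincidence once, explicitly, where the code genuinely depends on it.

```haskell
-- Separate names for values that merely happen to coincide today.
pageSize, pageTableEntries, pageDirectoryEntries :: Int
pageSize             = 1024
pageTableEntries     = 1024
pageDirectoryEntries = 1024

-- If some code truly relies on the coincidence, assert it in one place.
sizesCoincide :: Bool
sizesCoincide =
  pageSize == pageTableEntries && pageTableEntries == pageDirectoryEntries

main :: IO ()
main = print sizesCoincide
```

Now changing the page-table size to 2048 is one edit, and the assertion flags any code that silently assumed the old coincidence.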

Test: Function spk-split-paragraph-by-semicolon - if I repeat the algorithm in the doc string (split the string by “;” and then join with newlines), then if I have to change the algorithm for speed, then I would have to change the doc string as well. – duplication of effort because of same cause. Here, it’s the documentation and implementation that have the same cause (the algorithm). The algorithm is needed for the function implementation. Ideally, the documentation should not even know what algorithm is under the hood. But if it does, then it will have to change each time the implementation changes.

Test: For-loop to initialize a page table - caused by the structure of the page table. – If the structure changes, I’d have to update in a lot of places. The user code doesn’t need to know about the structure, but it does anyway. May set some values the wrong way - inconsistency and thus bugs.

Test: From The Pragmatic Programmer - line segment class with fields start, end, and length. – When a user changes the start and end of a line segment, he also has to change the length. So, you’re exposing too much information. The user code doesn’t need to know about all three (as fields). It should just have a constructor and a length method (or field - it shouldn’t know the difference). Also, may set length to an inconsistent value - bugs.

Toggle Behind an Interface: No Need for Fine-Grained Feedback

Hypothesis: Change in the expected interface, have tests or type-checker that can validate each change, change all uses -> feedback about every change -> can change one at a time; no unnecessary changes; no inconsistent changes.

Change in the expected interface, no tests or type-checker that can validate each change, change all uses naively -> you’re changing outside the interface you expected -> have to change things unnecessarily; have to change everything before you get the correct output; may make inconsistent changes.

Change in the expected interface, toggle behind an interface -> not changing the interface you expected -> don’t have to change things; consistent; don’t need fine-grained feedback.

Corollary: Change lots of things -> need fine-grained feedback to reduce uncertainty.

Change behind an interface -> don’t need fine-grained feedback.

Corollary: Toggle behind an interface or have some other source of fine-grained feedback (like a type-checker) -> don’t need a huge number of unit tests.

Don’t toggle behind an interface and don’t have some other source of fine-grained feedback (like a type-checker) -> need a huge number of unit tests.

Hypothesis: Basically, you’re not changing the mechanism at all. So, you don’t need feedback to make sure you aren’t breaking it.

Hypothesis: Least number of tests (feedback) needed - well-decoupled code.


Test: Change 1024 to 2048 – have to change things unnecessarily. May miss some spots.

Test: Suppose the name of the function foldr is changed to reduce - how much work would you have to do? – assume foldr is now not even available. So, your type-checker will point out every place where you still use foldr. Easy enough. No inconsistent or unnecessary changes.

Test: Want to change page directory from the global page directory to a per-process page directory. Change the get_page_directory function to return the per-process page directory. – don’t have to change code; guaranteed to cover all the places where it is used.

Test: Want to change all uses of initialize_page_directory once you change its input parameters - can grep for the function – fine-grained feedback about every use.

Test: Ruby guys have tons of unit tests. – no type checker to warn you about inconsistently-used function names or arguments or return values - so changing a function’s type signature is pretty risky. (Can’t they toggle behind an interface? I guess you have to abstract things well for that. Not sure.)

Test: Haskell guys usually don’t have tons of unit tests. – type checker tells you about each use of a function; changing a function’s type signature is not risky. Can abstract well into conventional interfaces - sort wants just an Ord, fmap needs just a Functor, and most common functions deal with just lists. (Not fully sure of this.)

Minimum Interface (Obsolete)

Hypothesis: Get just the relationship between the input and the desired output -> minimum interface.


Test: Lines from BufferedReader – don’t care about anything else it might produce.

Test: List of TransactionOperation from list of strings – don’t care about how to get a permutation of that list or whatever.

Test: Negative exemplar - have the raw BufferedReader itself – large interface.

New Interfaces Considered Evil

TODO: Write an essay on “New Interfaces Considered Evil”. (Reference Gabriel.)

OOP gives Methods Too Much Information

Hypothesis: An object allows each method access to all its fields. That probably leads to large interfaces and too much information that makes the methods hard to reason about.

Lift Plain Functions to deal with Legacy Code?

Hypothesis: To change some legacy code, maybe write plain functions (in isolation) and then insert them where necessary.


Test: [2018-07-30 Mon] Disable the submit button unless the form is valid - if I had plain types, I could say just if(!isValid(form)) disableButton = true; else disableButton = false; or whatever. But I don’t have plain types. I have legacy code.

First, I need to search for the part of the code that enables or disables the button. Then, I need to get into the branch that deals with my feature. And then I need to call the isValid function.

So, the pure part of the feature change is the isValid function. I can write that without reference to any of the legacy code. It then becomes a question of where to insert that function call.
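That split might look like this (the form fields and names are made up; the legacy details are elided):

```python
# The pure part: written and testable in isolation, with no reference to
# the legacy code.
def is_valid(form):
    return all(form.get(field) for field in ("name", "email"))

# The single insertion point into the legacy UI code.
def update_submit_button(form, button):
    button["disabled"] = not is_valid(form)
```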

Reversible Change: Use the Same Composition Operator to Invert

Hypothesis: What makes a change easily reversible: you should be able to reverse it without learning a new interface. Basically, you should be able to revert it using the same composition operator.

The key advantage of having an inverse is that a value within the same type is enough to invert your original value. You don’t need to create a new type. This is what you get when your operation is closed and invertible.


Test: Positive exemplar - 1 + 2 = 3; 3 + -3 = 0. – you use the same composition operator to create or invert a value.

Test: Positive exemplar - reverse . reverse == id – same composition operator to invert.

Test: Early change - wanted to revert it. How? Removed the file alright, but I still had to run the rebuild command. Unnecessary new interface! (But that’s what I did to install it in the first place, so I can’t complain too much) – undecided.

Test: Negative exemplar - push 3 on top of the stack. To remove 3, you have to use pop. – different interface to invert a value.

What if you could use push (inverse top) or something? (But you would have to know the value on top to push its inverse.)
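The contrast can be sketched directly: addition is closed and invertible, so the same operator undoes a change, whereas undoing a push requires learning a second operation, pop.

```python
# Same composition operator to create and to invert a value.
value = 3 + 2        # create: 5
undone = value + -2  # invert with the same operator: back to 3

# Different interface to invert a value.
stack = []
stack.append(3)  # push
stack.pop()      # undoing requires a second operation
```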

Test: comment-dwim . comment-dwim == id – easy to invert.

Test: What I’d like to invert effortlessly - added a bunch of debug statements. Now revert them all.

Similarly, tried out one CSS combination. Now, get back to the original.

Can’t you do that with version control?

Types Narrow your Interfaces

Hypothesis: Types narrow your interfaces.

They turn “don’t do that” into “can’t do that”.

For example, if all you know is that the React component tree is Foldable a, you can just fold over it with different monoids. That’s all. Can’t access just one part of the tree.
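A Python sketch of that narrowing (the tree shape and names are illustrative): if the only access to the tree is a fold, callers can combine the leaves with different “monoids” - an associative operation plus its identity - but can’t reach into just one part.

```python
def fold_tree(op, identity, tree):
    # tree is either a leaf value or a tuple of subtrees.
    if not isinstance(tree, tuple):
        return tree
    acc = identity
    for subtree in tree:
        acc = op(acc, fold_tree(op, identity, subtree))
    return acc

tree = ((1, 2), (3, (4, 5)))
total = fold_tree(lambda a, b: a + b, 0, tree)    # sum monoid
biggest = fold_tree(max, float("-inf"), tree)     # max monoid
```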

Why Novel Interfaces Proliferate

Hypothesis: Because it gives you a competitive advantage over others and prevents you from becoming a commodity. And perhaps also because you can deliver new features.

Cover before you Change Legacy Code

Hypothesis: If you have explicit tests for different features, you can find out which ones conflict with your new feature and you can split any common branches. If not, you may find it harder to know which ones conflict and thus which branches you must split and how. (TODO)


Test: [2019-06-23 Sun] I covered evil-jump-item with three tests: spk-evil-jump-item--start-at-opening-paren, spk-evil-jump-item--start-at-closing-paren, spk-evil-jump-item--start-in-between-parens.

I wanted to change its behaviour when the point (cursor) was between two parens, but maintain its behaviour for the other two cases. When I naively changed the code, I got the former but later discovered that I had broken the latter. That happened three or four times. It’s the classic problem with legacy code: regression of existing features.

So, I wrote explicit unit tests for them and started coding. Figured out how to pass the test I cared about. Narrowed down the new bug it created in the previous test and fixed it by adding a new if-branch. Done in quick time. What helped, I think, was that there was exactly one failing test at each point and I knew precisely what it was. When I changed code without any unit tests, I didn’t know if the legacy features were broken. Sometimes, both features were broken.

In short, I could take on one feature at a time. I implemented the new feature in the simplest possible way. I found that the first feature wasn’t broken; only the second one was. Ok. Time to fix that.

I realized that both those inputs shared the same branch in the code. Time to split up the branch so that they went separate ways. This was probably the key move. So far, I’d been trying and failing to pass the tests by changing the same branch. Maybe the three explicit tests helped me by showing that the first input landed in a different branch while the second and third inputs landed up in the same branch. I could now safely ignore the first feature while looking at what was different between the second and third inputs. Earlier, I was unsure if the conflict was between the first and third feature or the second and third feature.

Debugging

Debugging: Look at the Diffs!

Hypothesis: Don’t know the cause of something, have the source code, and can run interventions (aka debugging) - look at your diff, list all your assumptions, or toggle parts of your diff till you get a “yes” -> figure out which assumption is causing your error.

Debugging - read your diffs line by line -> notice what changes you made that could have caused the error.

Debugging - just try different inputs without looking at the diff -> find it harder to come up with the right inputs.

Hypothesis: If you’re not sure where your program fails, set printfs at different points. That will narrow it down faster than one printf at a time.
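For example, several probes placed in one run bracket the failing stage, instead of moving a single printf around (the stage names here are hypothetical):

```python
def process(data):
    parsed = [int(x) for x in data]
    print(f"after parse: {parsed}")   # probe 1
    scaled = [x * 2 for x in parsed]
    print(f"after scale: {scaled}")   # probe 2
    total = sum(scaled)
    print(f"after sum: {total}")      # probe 3
    return total
```

One run tells you which probe last printed something sane, narrowing the failure to the stage between two probes.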


Test: Tried writing at an address but the CR2 value showed some other address. Turned out that I had interpreted the output as hex when it was actually a long int! – toggling my diff would have helped.

Test: Re-read my buggy page directory setup code - discovered my errors. – read the diff line by line.

Test: Process wasn’t receiving message from parent. Tried lots of variations. Finally realized that I wasn’t even resuming the process. – didn’t look at the diff.

Don’t Debug: Fall back to a Positive Exemplar

Hypothesis: Test passes -> positive exemplar (hopefully with all necessary causes).

Test fails, create a smaller exemplar that is positive and extend it to the larger exemplar -> know what works (necessary causes) and what doesn’t; faster.

Test fails, debug by toggling here and there -> don’t know what works (necessary causes) and what doesn’t; slow.


Test: [2018-09-16 Sun] Spent half an hour debugging why I couldn’t clone a bare Git repository from the server. Turns out I used /home/foo instead of /homes/foo. Dammit! – A simple, detailed comparison with a past positive exemplar would have solved the problem.

Test: [2019-01-21 Mon] Didn’t fall back to a positive exemplar when trying to debug the generated code for nested while loops.

Caught it within minutes by getting a positive exemplar for a simple loop body and toggling the body from x + 1 to x = x + 1. This is after an hour of random “debugging”. I was passing in the wrong stack pointer.

Test: [2019-01-25 Fri] Struggled trying to instantiate an inner class while testing. Finally, replicated the nested classes in a fresh repo and systematically narrowed it down to the fact that I was instantiating the inner class using one object of the outer class and calling a method on another object of the inner class. And because Scala has “path-dependent types”, that doesn’t work. – Should have replicated it the moment I got stuck for more than 5 minutes. Or maybe even right away. Note that it still took 20 to 30 minutes after isolating the class, including time to look up stuff, partly because I didn’t focus on getting a positive exemplar first. In the end, it was just 7 tests to victory.

Test: [2019-02-04 Mon] Spent half an hour trying to get function pointers to work in my generated assembly. Kept getting segfaults. Decided to look at the generated code for a simple C program that used a function pointer. Saw immediately that it used movl $foo, %edi whereas I had been using movq foo, %edi. Turns out that you needed the extra dollar sign. In just four minutes, I was done! – I didn’t have a positive exemplar for function pointer assembly code that actually worked. All I had were exemplars of assembly code that kept segfaulting.

Test: [2019-03-10 Sun] Tried to move in my NBC implementation from CS573 HW2 to compare it against SVM and logistic regression. Failed because it didn’t fit the same interface and expected the data to be discrete whereas the data for hw3 was continuous. After an hour of struggle even after I brought in the training set from hw2, I moved back to hw2 and tried to get it to work there. It failed again. Turns out that the training set was out of date because of the other tasks in hw2. Got it working in hw2 and then got it working in hw3. Piece of cake once I got a positive exemplar.

Debugging: Shrink the Failing Test

Hypothesis: Shrinking the failing test (that you’re debugging) is a relevant factor for quickly discovering the causes of failure.


Test: [2019-02-28 Thu] Spent ages debugging the black-box test testAppOrder without figuring out that it was the anonymous function that was the problem. Once I got it down to a small function, I could easily look at the generated CPS code to see that I needed to extract the function pointer from the closure.

Debugging: Cover all Output Classes

Test: [2019-01-26 Sat] astTypeEquals kept failing even when the types were different. I couldn’t figure it out at all. Turns out that it returned a boolean and I failed to assert that that boolean was true. Discovered that by writing a smaller version of the function and trying to make it fail. Given that it always passed, I should have tried to make that test fail.
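The smaller version of that lesson, with hypothetical names: a test that computes the boolean but never asserts it can only ever pass, so cover both output classes and check that the test is able to fail at all.

```python
def ast_type_equals(a, b):
    return a == b

def bad_test():
    # Bug: computes the boolean but never asserts it, so this "passes"
    # even for different types.
    ast_type_equals("Int", "Bool")

def good_test():
    # Fix: assert the boolean, and cover both output classes - one input
    # that should match and one that should not.
    assert ast_type_equals("Int", "Int")
    assert not ast_type_equals("Int", "Bool")
```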

When Debugging, a Failing Test is a Positive Exemplar

Hypothesis: When debugging, a failing test is a positive exemplar of an input for which the code diverges from the spec. A passing test is a negative exemplar. So, you can shrink a failing test input the way you can shrink any positive exemplar.


Test: [2019-04-02 Tue] testLibFunctions2 failed. But it was a black-box test that depended on a huge library, so I brought in the necessary functions from the library so that I could be sure that the problem existed even without them. But how can you shrink a negative exemplar? Isn’t it that you shrink a positive exemplar by removing irrelevant factors?

Well, a failing test is a positive exemplar of your implementation diverging from the specification. A passing test doesn’t diverge from the specification. So, you were shrinking the failing test to find the smallest version of the test that still failed - i.e., still diverged from the spec.

So, a failing test is a positive exemplar when debugging (i.e., searching for places where your code doesn’t match the spec), but a passing test is a positive exemplar when coding (i.e., searching for some configuration of code that matches the spec). That is what allows you to refactor code by removing unnecessary parts.

Understanding: Uncertainty about the Program Output

Follow the Pure Type Signatures to Understand the Diff

Hypothesis: A pure type signature for a function is a relevant factor for knowing that function’s output variable.

Hypothesis: A pure type signature for a function is a relevant factor for knowing its output variable’s relevant factors.

Hypothesis: A minimized signature for a function is a relevant factor for knowing the narrow parts of the input being used.


Observation: [2019-02-05 Tue] Having trouble understanding what the classes in VTK do. They seem so fat and ugly that I don’t feel like getting deep into them.

But things seem much simpler when I try to think of their type signatures. For example, vtkWarpScalar is nothing but Int -> [((x, y, z), s)] -> [((x, y, z), s)], which you can think of as (z -> z') -> [((x, y, z), s)] -> [((x, y, z'), s)]. No more explanation needed.

Similarly, vtkDelaunay2D :: [Point p] -> [Triangle (p, p, p)]. It’s not fully accurate and there are a ton of fields you can configure, but for my purposes as a novice trying to get through a course, this is all I need to know. Likewise, vtkDataSetMapper :: vtkPolyDataSet -> [GraphicalPrimitive] (we don’t need to know much about GraphicalPrimitive apart from the fact that there exists render :: [GraphicalPrimitive] -> Image).

I suppose these type signatures help because they show you the diff created by the function. For example, “triangulation” can sound daunting, but you realize that it just takes Points and turns them into Triangles (with the property that you get a mesh of triangles and not of quadrilaterals or pentagons). The type signature (z -> z') -> [((x, y, z), s)] -> [((x, y, z'), s)] is enough to show practically the entire diff.

Yes, in some cases, a type signature can’t illuminate the diff too much, like when you have changeStuff :: BigFatClass -> AnotherBigFatClass, which doesn’t narrow down the change or, worse, doStuff :: BigFatClassA -> IO (), which serves to tell you only that the function can do just about anything. But, with a type signature like (z -> z') -> [((x, y, z), s)] -> [((x, y, z'), s)], you know that the only thing being changed is the z coordinate.

Test: [2019-02-05 Tue] foo :: (z -> z') -> [((x, y, z), s)] -> [((x, y, z'), s)] tells you what foo changes. But foo :: (z -> z') -> [((x, y, z), s)] -> IO () does not tell you as much. Yes, the (z -> z') is a hint that only the z coordinate is being changed, but the IO means that other things could change as well. And OOP isn’t all that famous for taking in higher-order functions as arguments. – The pure type signature lays out all the relevant factors for the output, namely the function (z -> z') and the input list [((x, y, z), s)]. The impure type signature doesn’t even tell you the output variable, let alone all of its relevant factors.

Even worse is foo :: Int -> [((x, y, z), s)] -> IO (), which doesn’t tell you how that Int is being used (though the function name vtkWarpScalar and parameter name scalingFactor do help). The problem is that all kinds of other things might be changing.

Worst of all is foo :: Int -> BigFatClass -> IO (). Now, you don’t know which fields within the class get changed. Sadly, most object-oriented methods have this kind of a type signature, especially when they are “commands” and not pure “queries”.

Test: [2019-02-05 Tue] foo :: Int -> Double tells you the input variable. foo :: IO Double doesn’t tell you the input variable. If it’s a method in a class, you have to look at all the possible class fields. And you still have to check if it gets any input from the user or from some database.

Test: What if the type signature were poorly-written? For example, foo :: BigFatClassA -> BigFatClassA isn’t much of a help even though it’s pure. I guess you have to minimize the interface of BigFatClassA as used by foo. For example, maybe it needs just the interface of Show a => [a]. Now, you know that it uses just the fact that each element in the list can be converted to a String. The same holds for other interfaces like Functor f => f (Int, Double).
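The vtkWarpScalar signature above translates into a pure-function sketch like this (the name and the list-of-tuples representation are illustrative, not VTK’s actual API): the signature says the only thing that can change is the z coordinate.

```python
def warp_scalar(warp_z, points):
    # points: list of ((x, y, z), s) tuples; the output differs only in z.
    return [((x, y, warp_z(z)), s) for ((x, y, z), s) in points]

points = [((0, 0, 1.0), 5), ((1, 0, 2.0), 7)]
warped = warp_scalar(lambda z: z * 10, points)
# -> [((0, 0, 10.0), 5), ((1, 0, 20.0), 7)]
```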

How to Tackle a Daunting Problem

Hypothesis: Get as close a positive exemplar as you can and take a diff to do what you want to do. You will be able to solve your problem more easily than if you try to figure it all out from scratch (?).

I suspect you have to know how to change the input based on a change in the desired output.


Test: Need to implement a bunch of classes and methods. Feel like it’ll take a ton of work to do it all. But I suspect that if I can do just one of them right, then the rest of them would be easy to do just by comparison. Also, I can get the first one done by comparison with the sample program mentioned in the wiki (even though I don’t have an executable program for that one). – get as close a positive exemplar as you can.


“Complex” Diffs: Toggling a Lot of Variables

Hypothesis: Diff toggles a lot of variables -> hard to predict what the output will be.

Diff toggles a few variables -> easier to predict what the output will be.


Test: Lots of changes in my Emacs folder - don’t know what leads to what. – lots of variables changed - I took .emacs.d/ off the load-path, moved my personal elisp files into one folder, deleted the old lisp packages I was storing manually and instead got them from the package manager, and used (require 'foo-feature) instead of loading its file manually. Any of them in any combination could cause problems!

Test: Exactly one section changed in my one-button change essay - I know exactly what that does to the essay. – “look at diffs” - ok that affects my model of how to debug something, but not my ideas about refactoring or programming to an interface.

Confusion: Look at the Diffs!

Corollary: Went from not confused to confused, get the specific changes (diffs) that caused the problem -> need to learn about the effects of a very few variables.

Went from not confused to confused, don’t know which specific changes caused the problem -> feel like you need to learn about a lot of variables.

Lesson: Start with something you’re not confused about and take it from there.

Hypothesis: Confused, ask for a positive and negative exemplar -> get clear about the inputs and outputs.


Test: Was following along the causal RL lecture till the q_i’s started coming in. Basically, I went from “not confused” to “confused”. Earlier, my mental model - my program - passed the tests of predicting the correct answer. But now, it couldn’t. So, I needed to look at the diff and figure out which change broke my model. – Got the specific diff that broke my model.

Test: Confused about all causality papers. Don’t think I’ve been un-confused about “all causality papers”. Feel like I will have to read a lot of papers to become un-confused about “causality papers”.

Test: OS HW3 - really, really confused about what to do. No clear idea about how to proceed. – Okay. Narrow down the diff. What worked, I think, was to have a concrete walk-through. That told me where there could be potential hotspots (for example, if there were no free frame available, I’d have to replace a page).

Test: Goals for 2018 - didn’t know what caused my indecisiveness. Tried to think of a case where I wasn’t indecisive. Got it. – Realized that I was changing too many variables.

Differing predictions: Well, Look at the Diffs!

Hypothesis: Don’t know the cause of something, don’t have the source code and can’t run interventions, come up with possible causes - get their differing predictions -> figure out the right cause.

Want to know where I know more than others - get their differing predictions (by taking the diffs of our proposed explanations or just the books we’ve read) -> figure out problems that they just can’t solve.


Test: Why did Goldenfold say “5 more minutes of this and I’ll get mad”. Two possible causes. We can look at their differing predictions to get feedback. – don’t have the source code and can’t run interventions; but can look at differing predictions.

Test: What do I know about “cognitive psychology” that others don’t? How to get differing predictions? They talk about “complex neural networks” and “network of ideas”. I don’t (I know about how it’s cue-memory associations all the way down). They talk about how forgetful they are, I don’t (I know about the exponential decay curve). They think re-reading is good enough. I don’t (I know that you need to test for recall). – Their proposed explanations are different from mine in these ways. From these, I can come up with differing predictions - people who test themselves will actually learn things.

Understanding = Testing it with Different Inputs? (??)

Hypothesis: Some output depends on a lot of other variables or takes a long time to give the output or gives output that is hard to isolate -> have to do a lot of work to test and understand it with different inputs (aka “understand” it).

Some output depends on just a few other variables and takes a short while to produce output and gives isolated output -> have to do less work to test it with different inputs.

Corollary: If that is true, then the explanations that will lead to a lot of understanding will be those that let you change inputs without much effort, get the output quickly, and isolate the cause easily. They would let you answer “what if” questions easily.

Large Diff -> High Uncertainty! (?? need to update based on the question)

Hypothesis: Large diff (change a lot of variables), fail -> think that any of those variables would make you fail; so, lots of possibly-wrong beliefs (high uncertainty).

Small diff (change one variable), fail -> think that that one variable makes you fail, which is true; very few possibly-wrong beliefs (low uncertainty).

Corollary: Keep your diffs small -> have fewer possibly-wrong beliefs.

Corollary: No positive exemplar, only negative exemplars -> super-large diff -> lots of possibly-wrong beliefs about what is possible.

Corollary: Only positive exemplars, no negative exemplars -> consider the opposite viewpoint that you’ve seen only negative exemplars for beating the thing and will thus think that anything untested will still not be able to beat it.

Question: Large diff, correct output -> will you think that all the variables in the diff were necessary (when you haven’t seen any small diff work)? Think about how I got intimidated by big words.


Test: Right now, when trying to test my representative test idea on causal decision theory, I was going for a massive negative exemplar and getting completely confused. – didn’t know which variable (difficulty, subject matter, amount of text, etc.) caused my confusion. Felt like “representative tests” were useless. Like they would fail on any input - no matter how small the text or what the subject matter or how difficult the topic. But I didn’t really have evidence for all of those other beliefs - they were possibly wrong.

Test: Ran on a treadmill when I was young. Completely out of breath. Felt like cardio was not for me. – Jumped to the conclusion that the target variable (me) could not do any cardio at all, even though I had changed other variables like difficulty, duration, etc. That is, [Pradeep + tough cardio -> couldn’t do it] became [Pradeep + any cardio -> can’t do it]. Felt that any type of cardio would be too hard for me.

Test: Now, I can do cardio for hours (and have actually done so). But I can’t quite stretch my legs or jog for long without losing my breath. – Came to the right conclusion that I could indeed do a lot of cardio. That is, [Pradeep + tough cardio -> couldn’t do it; Pradeep + reasonably tough cardio -> could do it] became [Pradeep + reasonably tough cardio -> can do it]. Even now, I think that anything to do with long-distance running is too hard for me. That’s because I haven’t got any positive exemplars between hard cycling and running. Way lower uncertainty!

Basically, I need to look at the difference between the positive and negative exemplars. Earlier, it was [very, very easy cardio -> can do it; very tough cardio -> can’t do it]. But there are a lot of variables that differ between them. For example, the tough cardio involved running, which I had never done. It didn’t consider cycling, which I had done a lot of. Even with just running, I should have taken “running slowly” and “running a bit faster” and so on, till I found the breaking point.

Test: No positive exemplar - before the 20th century - going to Mars. Never even left the planet. Felt impossible. Now, have been to the moon, have sent probes past Pluto. Lot fewer possibly-wrong beliefs about what is possible. – earlier, very large diff; people thought anything extra-terrestrial was impossible (or at least very hard).

Test: Only positive exemplars - saw TA answer all my questions better than I could - felt like he knew way more than me. Started going into research - even there, he seemed to know way more. But when it got deep, I could start to see mistakes. Fallible after all.

Look at it from the opposite angle. These are all negative exemplars for beating him. Slightly difficult question - couldn’t beat him. Much more difficult question - still couldn’t beat him. And so on. It’s like the scale is inverted. Earlier, it went from easy to hard, and all the positive exemplars were on the easy end and all the negative exemplars were on the hard end. Now, the scale goes from hard to easy, with the (potential) positive exemplars for beating him on the hard end and all the negative exemplars on the easy end. The aim is to go harder and harder till you get a positive exemplar for beating him.

So, as per my hypothesis, only negative exemplars means that you have a super-large diff (from the infinitely hard configuration - the “God” configuration) and so you think that a positive exemplar is completely impossible. That is, you can’t beat him.

How to remedy this? Get a small diff. From what? From a positive exemplar. That way, you will go from pass to fail and thus learn something useful.

(But the number of negative exemplars is way smaller than the number of positive exemplars. Shouldn’t you follow the logic of the earlier examples and toggle from the exemplar that is less numerous?)

The main question: why, here, do you think everything untested is possible when, in general, you assume that everything untested is impossible?

Maybe the mechanism you’re trying to test is “what would beat him?”. So, anything untested is considered to be incapable of beating him. (Not exactly sure of this argument.)

Test: Only positive exemplars - undefeated team like Australia in the early 00s - feel like you would need to be a god to defeat them. – only negative exemplars for beating them. Felt like anything untested would also fail to beat them.

Test:

“So you will risk becoming a Dark Lord, because the alternative, to you, is certain failure, and that failure means the loss of everything. You believe that in your heart of hearts. You know all the reasons for doubting this belief, and they have failed to move you.”

Chapter 10: Self Awareness, Part II, HPMOR

Explanation: The diff between himself and Dark Lords seemed so great that he expected that a lot of variables caused you to become a Dark Lord and thus that was unlikely to happen.

(Corollary: We might be making that assumption regarding war. We think there’s a hell of a difference between us and warring nations and leaders from the past and thus a lot would have to go wrong before we went to war. Not so.)

Test: On the other hand, greatness seemed to be within reach. He seemed to be just as smart as the “great” people he admired and just as motivated. All he needed, it seemed, was the right problem, a lot of perseverance, and a lot of hard work (along with a bit of luck). – small diff, low uncertainty.

Allow only Short Diffs using Close Exemplars

Corollary: Compare between a positive and negative exemplar with few changed variables -> short, specific diff -> can narrow down the cause of the change.

Compare between a positive and negative exemplar with a lot of changed variables -> large, vague diff -> can’t narrow down the cause of the change.

Lesson: Allow yourself to use only short diffs. No comparison between different people - there are too many variables that could be different. Comparison between two things that have few changes - his OS setup and mine.


Test: “Linear programming and causal reinforcement learning” -> “I’m confused” vs “I’m confused about X” – don’t know which specific things confused you. Don’t have a close positive exemplar.

Test: Asking smart questions - specify all kinds of details about your problem vs “Emacs keeps crashing. HELP!” – which branch failed? Otherwise you will think your entire system is broken. If you changed your Emacs version in the last week, that’s a strong clue. Need a close positive exemplar.

Test: “I had a blast in Europe” vs “I visited lots of cool ski resorts” - will think “Europe” will cause “great time” vs “going skiing with friends”. Were you happy or sad before Europe? Basically, was it Europe that did the trick or was it being with your friends? Saying that other people who are not in Europe are unhappy isn’t a helpful comparison - too many variable changes. Need a close negative exemplar.

Test: Compare with others - he could understand the lecture, I couldn’t. So, I am the cause for confusion. I suck. – vague high-level diff. Need a close positive exemplar.

Test: Compare with yourself - I could understand five minutes ago, now I can’t. So, something that happened in the last five minutes must be the cause. – specific low-level diff. Close positive exemplar.

Test: Someone wrote great code. Thought I couldn’t do it. But then a friend did it. Goddammit, if he can do it, so can I! – close positive exemplar to my negative exemplar of not making progress on a tough project.

Programming vs Induction (??)

Question: How is programming different from induction?

In induction, you start from a positive exemplar and shave off variables using focus-gambling and accept necessary causes using conservative-focusing.

In programming, you start with a simple acceptance test, one that tests just one feature, and then add bigger tests and code that passes them.

Hypothesis: In induction, you have a mechanism that gives you the output for any input (though not always; sometimes you have to work with limited evidence, like astronomers studying the Big Bang).

You’re writing code in both of them.


Test: Paging - start with just the global page directory, then add page fault handling, then per-process page directory, and so on.

Test: Induction - Semmelweis - initially, any of the hospital location, the kinds of patients who came to that maternity ward, the other patients in the hospital, and so on could have caused “childbed fever”. But he narrowed it down to the doctor’s hands by looking at the doctor who died after getting nicked by the autopsy knife. At the end, he had a hypothesis - a function - that explained when somebody would get “childbed fever”.

Induce Necessary Causes for an Output using Positive and Negative Exemplars (Basically, get some patterns for the Output)

Hypothesis: Look at different kinds of functions that satisfy contract X -> induce necessary causes for X.

Corollary: Have seen a positive exemplar for X that does not need a particular variable Y -> Y is not a narrow cause of X.


Test: Saw people do something to each element of a list using just map f – that became my necessary cause for transforming each element of a list; a for-loop became an unnecessary cause.

Test: Elisp regex - saw (s-match regexp s) – those are the two necessary causes for a regex match.

Test: Python regex - saw self.assertIsNotNone(re.search(regex, s)) – felt like the assertIsNotNone was an unnecessary cause because there should be a function that simply returns true if the fucking regex matches the string.

Test: C regex - r = regcomp(&regex, regex_string.c_str(), REG_NOSUB|REG_EXTENDED) and regexec(&regex, s, 0, NULL, 0) – felt like the whole compiling regex thing and the matching flags and all the extra arguments to regexec were unnecessary. All I want is something that can tell me whether the string matches the regex!
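The wish running through the Python and C regex tests above can be made concrete with a thin wrapper (a sketch; the helper name `matches` is mine, not a standard library function):

```python
import re

def matches(pattern: str, s: str) -> bool:
    """Return True iff `pattern` matches somewhere in `s`.

    Hides re.search's Match-or-None return value, leaving only the
    two necessary causes: the regex and the string.
    """
    return re.search(pattern, s) is not None

# The call site now reads like the intent:
# matches(r"\d+", "order 42")  -- just the regex and the string.
```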

Test: HTML output - have seen people get it without database access – database access is not a necessary cause.

Test: Negative exemplar - haven’t seen people get lines from a C++ socket without using a loop – feel like the loop is a necessary cause.

Log: Note the Configurations for each Output

Hypothesis: Note the configurations when logging your output -> can tell which configuration changes caused what.

Corollary: Make your configuration a first-class variable -> can print it easily, can change it with a single button.

Corollary: Oh, and by the way, have a log file. It’s awesome. You don’t have to rely on just your memory for the past outputs of your program.
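A minimal sketch of both corollaries in Python (the field names and file name are hypothetical, not from any particular project): the configuration is one first-class value, toggled in one place and logged next to every result.

```python
import json
import time

# The run configuration as a single first-class value: printable,
# and changeable with a single edit here.
config = {
    "shuffle": True,        # toggle with one change
    "train_fraction": 0.8,
    "max_depth": 10,
}

def log_run(config: dict, accuracy: float, logfile: str = "runs.log") -> str:
    """Append one line: timestamp, full configuration, and the result,
    so every output can be traced back to the settings that produced it."""
    line = json.dumps({"ts": time.strftime("%Y-%m-%d %H:%M:%S"),
                       "config": config,
                       "accuracy": accuracy})
    with open(logfile, "a") as f:
        f.write(line + "\n")
    return line
```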


Test: ML - decision trees - helped to note down whether I was shuffling the data or not, whether I was using 80% of the training set or 90%, what the depth and size of the resulting tree was, the time taken, the timestamp, the test set accuracy (obviously), the training set accuracy, and the maximum depth allowed for the tree – could effortlessly see how accuracy decreased with increasing size and so on.

Test: Can toggle whether to shuffle the data or not by just toggling the configuration variable – easy.

Test: Negative exemplar - didn’t note down the configuration information – had to remember what had happened with the previous runs; whether accuracy had gone up or down; couldn’t even tell whether my code gave the same results after a refactoring (no unit tests :P).

Test: Log file - decision trees – helped a lot, as mentioned above.

Test: Log file - XINU – helped a lot for me to keep track of what was going on. Had to save separate files, though, because the output was pretty huge. (Could have noted down only the variables I cared about.)

Test: Negative exemplar - log file - didn’t have any for the IPL dataset – didn’t have a good idea of what was going on, especially when trying something new.

Granularity of Tests affects the Granularity of your Model

Hypothesis: Having only a high-level test is a relevant factor for inducing only a relation between the high-level input and output.

Having a unit test is a relevant factor for inducing a relation between an intermediate input and an intermediate output.

Working through code with a concrete example (and getting feedback about the output somehow) is a relevant factor for inducing a relation between an intermediate input and an intermediate output.

Having a Jest-like snapshot test is a relevant factor for inducing a relation between an intermediate input and the penultimate output and a relation between the penultimate output and the final output.

Hypothesis: A unit test as a regression test is not a relevant factor for the difference between visualization programming and compiler programming because you rarely change code more than once and, even when you do, it’s within a few days of the original change, when it’s still fresh in your mind.

Skill smell: Do you have a unit test or concrete example for this line of code?

For example, when it comes to mapper.InterpolateScalarsBeforeMappingOn() or mapper.SetScalarRange(vector_field_reader.GetOutput().GetPointData().GetScalars().GetRange()), I don’t know what they do or what would happen if I commented them out. When it comes to prose, I didn’t have a concrete example for “barycentric coordinate calculation” and fumbled hopelessly on the exam.


Test: I can test for regression at any time when I have a set of unit tests. I can’t do that in visualization. I have to run stuff manually. Perhaps that also makes me not break dependencies, as I would when I write unit tests, and perhaps that makes it harder to extract free variables and run things in a script.

Test: Huh. I haven’t worked through the visualization code with even one concrete example. If the final output looks right, I accept it. If it doesn’t, I tweak the code here and there till it does. Contrast that to unit tests where I usually have to consider the direct input and direct output of each part of my code.

Think about it: I didn’t even have a good “mental model” of what the vector field dataset looked like. I didn’t know how the points were laid out. Still have no idea why we are using a vtkPolyDataMapper. What’s the actual difference between an unstructured grid and a structured grid? I know the definition, but I can’t come up with a small differing example.

For project 3, my model of the program was of the type [(isovalue, color, opacity)] -> multiple isosurfaces. Which is great when you’re trying to optimize the look of the final image, but is useless when you’re trying to get the program to work in the first place. This might be what I used to call the Hello World setup - integrate the components so that you have to use just a high-level model.

Ideally, you want two kinds of models of your program: one at the level of the domain (such as [(isovalue, color, opacity)] -> multiple isosurfaces) and one at the level of the code (such as render_streamlines :: vector field dataset -> actor for streamlines). To build the former, you need end-to-end tests; to build the latter, you need unit tests.

Maybe this is a drawback of judging your code through some ultimate feedback. You exonerate yourself from having to “understand” every line of your code.

Test: What’s so different about visualization programming? For one, the final output isn’t a data structure that you can break into pieces. It’s an image. Neither are the intermediate values first-class. So, you end up inducing a relation between the parameters in your code and the final image. – Basically, you lack any source of feedback apart from the final one.

Test: MATLAB integration test - didn’t force myself to “understand” the innards of the code. Just checked if the whole thing worked correctly. Even when it did, I had no idea how my code worked. – Note that the values at each point were pretty first-class. I could print the weights of the SVM and the inputs and labels of the test set, but I didn’t. So, I prevented myself from getting any source of feedback apart from the final one.

Contrast that to my data mining implementation of SVM - I “understood” precisely what was happening at each point because I wrote unit tests.

Test: Software security project - I wrote acceptance tests for the different API commands, but not unit tests for our code. Again, this high-level source of feedback made me induce a relation between the high-level commands and the final output, but not necessarily the intermediate code. Had no clue about the multi-threaded code, for example.

Isolated Branch Coverage: Set Input and Get Output after Executing Branch Exactly Once

Hypothesis: Normal branch coverage is a sign that your input eventually executed that branch.

Isolated branch coverage is a sign that your input directly went into the start of that branch and directly came out from the end of that branch.

How do they differ in their partition of the input space? Obviously, all inputs of the latter also obey the former. When will an input obey the former but not the latter?

Hypothesis: Maybe Haskell programmers decouple code so much that they don’t need to write unit tests.


Test: Compilers project 6 - Ideally, I’d like each optimization to be a separate phase.

Right now, CSE for LetL checks if the state has another identifier bound to the same literal. DCE for LetL checks if the state marks the current identifier as dead.

That’s the problem right there. The input type is always current node plus State. However, the output type is a tree computed by working on all descendants. That makes it much harder to reason about the immediate output because you have to infer it from the eventual output. You can’t see the first-class output of the function on a LetL node. You have to deal with a recursive call.

Notice how foldr avoids this problem by writing a function that deals with the immediate input and the eventual output of the recursive call. That way, you don’t have to worry about the recursive call. Contrast sum [] = 0; sum (x:xs) = x + sum xs to sum = foldr (+) 0. The function (+) is immediately testable. Give it 5 and 10 and you will get 15. The (partial) function sum (x:xs) = x + sum xs is not immediately testable. If you test it with sum (5:[4, 3, 2, 1]), you will have to compute sum [4, 3, 2, 1], which in turn depends on the definition of sum (x:xs) and on the base-case definition of sum []. Can you see how you can’t check its output using a first-class value?

I guess an implicit requirement of unit tests is that you set the input and get the output by executing exactly one branch of code once. That way, you can set arbitrary inputs to see their outputs and you can try to test arbitrary results of recursive calls by setting those values directly.

The code (+) computes its output of 5 + 10 by executing the (+) branch just once. However, with sum (5:[4, 3, 2, 1]), you get the output by executing the same code multiple times. Specifically, the output 15 comes after you run through the recursive branch sum (x:xs) = x + sum xs five times (and run through the base case branch once).

What if you use sum(1:[]) as your unit test? Does that solve your problems? Not really, because you can’t set the output of the recursive call to different values. For example, if you wanted to test what your function would do with a negative sum from the recursive call, you could just try (3 + -4) in the foldr version. However, to test the same thing in the recursive version, you would have to construct some list that gives you a negative sum. Now, because sum is so familiar to us and because you can just use sum([-5]) to do that, it may seem easy, but it could get very complicated when you have a recursive call in a tree traversal. You will easily be able to send in arbitrary inputs. But you won’t be able to get arbitrary outputs without working hard. For example, I know that CSE on val x = 7; body will substitute 7 for x in body. But I can’t just check if the state after val x = 7 has 7 substituted for x. I have to come up with some example of body and check if all the x’s were replaced with 7s. Instead of checking the immediate output (i.e., the change to the state) directly, I have to look at the distant output and infer what the immediate output would have been.
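The same contrast can be sketched in Python. `functools.reduce` is a left fold rather than a foldr, but the testability point carries over: the combining function is testable with first-class values, while the recursive version forces you to construct a whole input list that happens to produce the intermediate result you want. Function names are mine:

```python
from functools import reduce

def step(acc: int, x: int) -> int:
    """The combining function: immediately testable with first-class values."""
    return acc + x

def sum_folded(xs: list) -> int:
    """The fold version: sum = reduce(step, xs, 0)."""
    return reduce(step, xs, 0)

def sum_recursive(xs: list) -> int:
    """The recursive version: to test the recursive branch you must
    build a list whose tail sums to the value you care about."""
    if not xs:
        return 0
    return xs[0] + sum_recursive(xs[1:])

# Testing "what happens with a negative result from the recursive call":
# with step, just pass it in directly: step(-4, 3).
# With sum_recursive, you must construct a list whose tail sums to -4.
```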

A Test needs to be a True Indicator of some Property (TODO); This makes us Decouple Key Features

Hypothesis: You want a failing test to be a sign that your code lacks some valuable property. You want a passing test to be a sign that your code possesses some valuable property. In other words, the properties you care about should be relevant factors for your tests and properties you don’t care about should be irrelevant.

Hypothesis: Requirement - One test should be a symptom of only one feature. The other features should be irrelevant. For example, you want factorial 5 to return 120. However, you don’t care if factorial 5 is computed by folding over a list or by looping over a range. So, you test for the former property, but not the latter. Taken to the extreme, my test for the recursive branch of factorial shouldn’t depend on the actual value of the base case. Even if I change the base case to be 2, the recursive case should still pass. (So, factorial 5 == 120 won’t cut it as a test of the recursive case, since it assumes that the base case returns 1.) An isolated branch test, for example, is a test of that branch and nothing else. – This is probably a relevant factor for decoupling the code for that feature.
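The factorial example can be sketched by pulling the recursive branch into its own function that receives the result of the recursive call as a first-class argument (a hypothetical decomposition for illustration, not standard practice):

```python
def factorial_step(n: int, rec_result: int) -> int:
    """The recursive branch in isolation: instead of making the
    recursive call itself, it takes that call's result as an argument."""
    return n * rec_result

def factorial(n: int, base: int = 1) -> int:
    """Wire the branch back up; `base` is exposed so the point below holds."""
    if n == 0:
        return base
    return factorial_step(n, factorial(n - 1, base))

# The branch test factorial_step(5, 24) == 120 keeps passing even if
# someone changes the base case to return 2; factorial(5) == 120 does not.
```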

Hypothesis: Requirement - A unit test for a property must fail for its alternatives. Otherwise, it will no longer reliably indicate whether that property exists. For example, if you choose entropy instead of the Gini function and use a unit test with all labels having the same value, both will give the same value - zero. If someone swaps in the Gini function later, your test won’t catch the fact that an important property was violated. (Note that if you have the score function as a high-level variable, you don’t need a unit test. The function call itself will tell you that you’re using Gini gain.)

– Remember the bug where I mixed up the current stack location and the memory block’s stack location. My unit test should have distinguished between the two implementations. Basically, it should consider all the viable alternatives in scope and choose an input that gives different values for each of them. So, the entropy vs Gini test may not distinguish entropy and some other new probabilistic function, but it’s good enough because that other new probabilistic function is not in scope. Extracted functions reduce the number of variables in scope and thus reduce the chances of such errors and obviate such discerning unit tests.

The other extreme is when you use a raw number like 73 - all other raw numbers are in scope. It could have been 74 or 999. Hmm… Concrete types allow a lot of alternatives. You need to write unit tests to pin down the value you care about. ADTs allow only a few, well-known alternatives. It’s either a Nil or a Cons. It can’t be an integer all of a sudden.

Corollary: When you add more code in the same file, you introduce more alternatives in scope. Now, you need to write a test to make sure someone doesn’t use a wrong yet plausible alternative. For example, my long dot-emacs configuration file with lots of custom functions means that I have a lot of options that seem right. Did I use the right one? Remember how many sort algorithms fall out naturally once you choose the high-level recursion scheme. At each point, you have no alternatives in scope, so you’re automatically guided to the right design choice. The same goes for free theorems about abstract function parameters.

So, extracting a function also simplifies your unit testing by reducing the number of viable alternatives.

Hypothesis: Hmm… You don’t need to unit test a function’s input. You know that it won’t break any of the inner code. That is, if your desired property is that the input to square be 5, then you don’t need to write a unit test. It’s right there in the input. The code works for any value of the input. More realistically, if you want Emacs’s theme to be black, you don’t need to write a unit test for the code. You write down just that in your dot-emacs. Of course, if you’re calling the function inside another function, you will need to test the caller.

Hypothesis: An abstract variable states the property that all places where that variable is used need to have the same value. Not lifting a free variable is a relevant factor for missing an instance when you change it. Basically, you have failed to test the property that all of those values need to be the same.

Hypothesis: An abstract variable also states the property that the behaviour of the function doesn’t depend on its value (as long as it satisfies the given type). – TODO

Hypothesis: Indirectly testing some feature you might change is a relevant factor for choosing a test that can be passed by an alternative (i.e., wrong) feature even though it differs in other ways. So, your test will pass even though the code doesn’t have the intended property.

Hypothesis: A change in some unimportant feature is a relevant factor for a change in your tests. In other words, your tests will fail even though the property they test hasn’t.

For example, if I test my decision tree implementation using information gain, then when I switch to Gini gain, I might have to update my tests.

Hypothesis: An indirect test is a relevant factor for potential mistakes in inferring the distant input and output. Then, your test might pass even though its property is violated or fail even though the property is satisfied.

Hypothesis: An indirect test is a relevant factor for more effort in inferring the distant input and output.

Hypothesis: Making your desired feature a function argument is a relevant factor for testing it directly.

Hypothesis: A static type is a relevant factor for preserving some property.

For example, if you want to receive a function that takes and returns an element of the same type, you can mark it as a -> a. People can’t send in Int -> String. If your code doesn’t type-check, either your definition or some caller is probably violating the above property. If your code type-checks, your definition and your callers are satisfying the above property.


Test: [2019-03-31 Sun] I could change the score function from entropy to gini because it was an argument to the best attribute function. Imagine if it weren’t. I’d have to go into the code and hope nothing else broke. For example, maybe I used it in two places and forgot to change the other use. – lift free variable.

Test: Map square vs for loop - change it to cube. Basically, the overall test you have may not catch the difference if you use a list with just 0s and 1s.

Test: Use gini instead of entropy - if you’d not extracted the function and if your change was wrong, your overall test may still pass because your mental calculation of the necessary input and the eventual output was wrong. If you had extracted the function, you can’t infer wrongly.
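A sketch of the lifted score function in Python (`best_attribute` is left as a stub since the tree code isn’t the point; the scorer definitions are the standard ones). The test-input lesson from above: a discriminating unit test must use labels on which the two scorers actually differ.

```python
from collections import Counter
from math import log2

def entropy(labels: list) -> float:
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels: list) -> float:
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_attribute(rows, attributes, score=entropy):
    """The score function is a lifted argument: swap entropy for gini
    with one high-level change, and test each scorer directly."""
    ...

# A discriminating test input: on ["a", "a", "b", "b"], entropy gives
# 1.0 but gini gives 0.5. All-same labels give 0.0 for both and so
# distinguish nothing.
```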

Use Higher-Order Functions instead of Flags?

Hypothesis: Using a higher-order function is a relevant factor for avoiding an if-branch with a flag in your code.

This is exactly what Chris Done mentioned when comparing Haskell and Lisp:

One difference in philosophy of Lisp (e.g. Common Lisp, Emacs Lisp) and Haskell is that the latter makes liberal use of many tiny functions that do one single task. This is known as composability, or the UNIX philosophy. In Lisp a procedure tends to accept many options which configure its behaviour. This is known as monolithism, or to make procedures like a kitchen-sink, or a Swiss-army knife.


Test: [2019-03-31 Sun] Wanted to extend my decision tree implementation into random forests by choosing a random subset of features at each split. Initially thought of using a flag is_random_forest, which would decide whether the columns looked at were a random subset or the full set. This would be hard to test because it would be a random subset at each point.

Then, I decided to use a columns_chooser (good idea, bad name) function as an argument. The default value would be just lambda xs: xs - use all the columns. But, for random forests, I could send in lambda xs: random_sample(xs) without touching the decision tree code at all!

So, I avoided having an if-branch within my decision tree code. That made it simpler to reason about and test. Specifically, it relieved me of knowing about the random forests idea when looking at the decision tree code. It decoupled the two things.
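A minimal sketch of the `columns_chooser` idea described above (the actual tree-building body is elided; `build_tree` and `forest_chooser` are illustrative names):

```python
import random

def build_tree(rows, columns, columns_chooser=lambda cols: cols):
    """Decision-tree sketch: at each split, consult columns_chooser.
    The default is the identity, so this code never mentions
    random forests at all."""
    candidate_columns = columns_chooser(columns)
    # ... choose the best split among candidate_columns ...
    return candidate_columns  # stand-in for the real return value

# Plain decision tree: all columns considered (the default).
# Random forest: inject the subset-chooser from outside,
# without touching the decision tree code.
forest_chooser = lambda cols: random.sample(cols, k=max(1, len(cols) // 2))
```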

ADTs give you an Index into the Input Space

Hypothesis: Looking at a program’s input as a combination of different variables is a relevant factor for getting an index into the input configuration space.

Otherwise, you can feel like you haven’t covered all the possibilities and you can feel like there’s always a lot more work left.


Test: [2019-04-17 Wed] Trying to analyze how many cases there would be for the free list allocation and freeing - was finding it hard. It seemed like anything could go wrong anywhere! It felt like the linked lists and the pointers could be in any arbitrary order.

Looked at it as a combination of an action on the free list and different equivalence classes of the free list state. Suddenly, I could tell exactly how many cases there would be. I could tell that once I had covered one set of cases, there would not be any surprises. And I could traverse the configuration space in a predictable manner.

In short, I had an index. I could tell roughly how many items there were, how much I had covered, how much was left, exactly where I’d left off, and exactly what came next.

These are the observable dependent variables. What are the observable independent variables?

I’m representing the possible actions as allocate and free. I’m also representing the state of the free list as a bunch of linked lists with blocks of particular sizes. It’s a product of the different linked lists. A linked list for a fixed-size block has a few possible states: it can be empty, it can have one block of that size, or many blocks. A linked list for arbitrary-sized blocks can be empty, have one block of some large size, many blocks in ascending order, descending order, or jumbled order. Now, the property they satisfy is that they must together comprise the heap.

Controlled experiments: What if the alternatives weren’t exhaustive? For example… What if the alternatives weren’t mutually exclusive? What if you couldn’t decouple the state into independent variables?

Corollary: ADTs give you the ability to index the configuration space. In Haskell and other languages, we have pattern matching to do so. Note that you can get this indexing without static types.
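A sketch of the index in Python, using the free-list equivalence classes described above (the state names are my paraphrase of the ones listed): every test case is one (action, state, state) tuple, so the case count is just a product, and the traversal order is predictable.

```python
from itertools import product

# Index into the free-list test space: action x fixed-size-list state
# x arbitrary-size-list state.
actions = ["allocate", "free"]
fixed_list_states = ["empty", "one_block", "many_blocks"]
var_list_states = ["empty", "one_block", "ascending", "descending", "jumbled"]

cases = list(product(actions, fixed_list_states, var_list_states))
# 2 * 3 * 5 = 30 cases: you can tell exactly how many there are,
# how many you have covered, where you left off, and what comes next.
```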

High-level Design: Decomposing the Output

Bottom-up Design: Layers of Abstraction

Test: [2019-03-01 Fri] Source: Bottom Up vs Top Down Design in Clojure - Mark Bastian, ClojureTV

He represented the game state as a first-class value. Then, very few functions know about it, such as update player hand. All the other functions compose the lower-level functions to build higher functionality like initializing the game state or playing a round.

In the OOP version, you modeled each class separately.

Test: We deal with lists as a whole, whereas C programmers talk about specific pointers. Java programmers talk about methods on a particular object.

Hmm… When you write a class, your method depends on that class’ instance. It’s good in the sense that it can encapsulate code that deals only with that object, like…?

But what if you want to write a high-level function that updates an entire Sudoku grid or a whole proof tree instead of the individual cells and branches? You don’t care about the cells and branches.

In a class hierarchy design, you’re predicting that the only changes you’ll need to make are to subclasses of the current classes. You may subclass game, player, board, and card for D&D but the top-level classes remain the same.

In a functional design, you’re betting that the only changes you’ll make are to the bottom layer. You may change the game state from a simple list to a btree or whatever, but the layer design remains the same. Even if you have to change a few layers above that, the other layers won’t change.

Test: The bottommost layer - your first-class value - represents your domain completely. For example, the grid in Sudoku or the proof tree in my equational calculator. Your layers of code represent your abstraction of the concrete data.

What set of data does cps Interpreter refer to? What does player refer to? Well, the value within the object.

I guess the main difference is that you can’t combine and abstract stuff across classes. There will always be a player class.

Test: In the functional approach, you have one value and a few top level functions that act on it. In the object oriented approach, you have a bunch of objects and their methods (though you can write a high-level game class).

Test: He says functions compose but objects don’t. What does that mean?

Test: He got a working implementation of the game in one iteration whereas the top down design had just created classes and relations.

Once you have the bottommost data representation of your state, you can abstract as much as you want and still have a working program at every point.

Can’t you abstract iteratively with classes?

Maybe a difference is that you can change one layer without touching the rest.

Test: Will both approaches have the same API? If so, they will differ only in their implementation.

How much do they differ in code reuse? How much in modifiability? How much in understandability?

Test: He says that you’re decoupling data and functions. Classes couple the two (??). Languages decouple the AST and the value it evaluates to.

Test: In Scala, objects are still the glue with which you compose stuff. Here, we used function composition as our glue.

Test: With classes, you can lose sight of your goal.

You can spend a lot of time thinking about the perfect class hierarchy for a domain.

Question: What sort of data and behavior is hard to model using classes but easy using functions?

Need a controlled experiment.

Programming = Induction

Hypothesis: Tests are positive exemplars. So, they allow you to easily eliminate irrelevant factors or figure out the effect of a change.

(h/t Michael Feathers’ Working Effectively with Legacy Code)

Refactoring is nothing but eliminating irrelevant factors. And that is why refactoring needs tests.

I suspect that category theory laws act as de facto tests. For example, I don’t seem to need a unit test to help me understand fmap (+1) someMaybeValue. Gabriel Gonzalez says that he doesn’t usually write tests for his Haskell modules and yet his code works. When do you not need a unit test?

Code with minimum interface is basically code with all the irrelevant factors eliminated. It’s easy to understand because it has only the relevant factors. It’s easy to reuse because a smaller set of conditions means that more inputs will match it. And it’s easy to modify along the predefined axes. For example, it’s easier to change map square to map cube than to go into a for-loop and modify the function call.

Note that a “general” function isn’t one that can solve a ton of problems. It’s just the one with the minimum necessary factors for solving a particular problem. For example, sort :: Ord a => [a] -> [a] is more general than sortInts :: [Int] -> [Int] even though it solves the same problem. The latter was taking in unnecessary information (assuming that the sorting algorithm used just the ordering, not the integer values themselves).

Change: Transforming a Concept

However, it’s harder to change minimized code when your purpose changes. For example, you turned a for-loop into a map because you assumed that you would just change each element of a list individually. Now if you want to also use the index variable in your calculation or, worse, refer to arbitrary elements in the list, you’re going to be screwed. You will have to use zip [1..] on the list first or do something else entirely. That’s the danger of premature optimization. Basically, there’s no possible configuration of the existing variables that will give you what you want. You have to include new variables like the index variable or the other elements of the list.

Hmm… You basically want to shift from one concept to another without having to induce each factor from scratch! Ideally, if you wanted to use a map function that uses the index, you’d just write a new version imap that also provided the index. You would want a separate function for every version that you write. However, we just update our function and throw away the previous version.

I suspect that you have transformed a concept when you have changed its type signature. For example, map square and map cube have the same type signature, whereas map square and imap square don’t.
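The signature change can be sketched in Python, where `enumerate` plays the role of `zip [1..]` (the `imap` helper is hypothetical, not a standard function):

```python
# map-style: the index is out of scope, by design.
squares = list(map(lambda x: x * x, [1, 2, 3]))

# "imap"-style: using the index changes the type signature: the
# function now takes (index, element) instead of just an element.
def imap(f, xs):
    return [f(i, x) for i, x in enumerate(xs)]

weighted = imap(lambda i, x: i * x, [1, 2, 3])
```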

Question: How do we usually transform our concepts and how can we do so efficiently?

For example, I changed Bird’s calculator to return a tree of rewrites instead of a chain. Most of the code, like the expression-parsing or the law-rewriting, remained untouched. All I had to change was the ultimate function simplify.

Clear your Mind and See the Matrix

Hypothesis: We decompose an output in terms of input factors by first inducing away all irrelevant factors and then recognizing patterns we’ve seen before. Basically, we decompose by inducing and understanding.

Corollary: You can’t decompose something you haven’t seen.

Corollary: You can’t decompose something if you haven’t cleared the noise.

Lesson: Think about each output variable in isolation and clear your mind of all irrelevant input factors.

Corollary: Tests help you decompose the output (in terms of the input factors).

Corollary: As do types.

Lesson: See the first-class output of whatever you’re doing so that you can decompose it and thus decompose the input.


Test: Negative exemplar - Ents for tasks - I don’t have an idea of the Ent subgraph I want as output. – can’t decompose the output!

Test: Positive exemplar - I have an idea about the final product for the task planner – can decompose (???).

Test: website-join-paragraphs - had a clear idea of the input and output – could decompose the output in terms of the inputs.

Test: Didn’t realize I could decompose - FB uses Duo Mobile - so that’s a state-of-the-art security system (at least for ordinary tech companies) - basically, I had seen a company that probably employs some of the best security guys out there and I had seen the security measures it uses. But I had failed to connect the two, to realize that this meant that two-factor authentication was probably a pretty strong security measure. – failure to induce. (I should have recognized the general pattern “Advanced tech company - whatever methods it uses are at least state-of-the-art for ordinary people”.)

Decompose the Output and You have Decomposed the Input

Hypothesis: Once you’ve decomposed your output in terms of the input factors, you know that you only have to implement whatever code is necessary to produce those individual pieces (and to put them together). Now you can tell how far behind you are and exactly what stuff you have to work on.

If you haven’t broken your output into pieces, you can’t tell how much work is left and will think you have to do a ton of work. So, you will probably procrastinate on it (because there are obviously too many constraints).

TODO: Labels: “can tell how far behind you are”, “can narrow down the stuff you have to work on”, “think you have to do a ton of work”, “procrastinate” (how much?).


Test: Negative exemplar - Ents for tasks - I don’t have an idea of the Ent subgraph I want as output. – So I can’t tell how far behind I am.

Test: Super Mario - can jump and run but can’t score points – you can tell how much of the “final product” is done.

Test: Negative exemplar - I have an idea about the final product for the task planner and I’m still stuck on the Ent design because I don’t have an idea about its final product (the Ent subgraph along with the privacy policies, etc.). – can’t see my lack of progress.

Test: website-join-paragraphs - had a clear idea of the input and output – easy to move ahead.

Figure out the Exact Part of the Output you want to Change

Hypothesis: Figure out the exact part of the output you want to change, know the corresponding input factor -> can get the change you want with little effort.

Figure out the exact part of the output you want to change (assuming you know the categories), can’t figure out the corresponding input factor (perhaps because it’s not written in terms of your intentions) -> can’t get the result you want with little effort.

Don’t know the exact part of the output you want to change (because you don’t know the categories) -> can’t get the result you want with little effort.


Test: Wanted to add number of failures to my drill stats data. Once you realize that “[2018-05-20 Sun 13:33] 40 items reviewed. Session duration 0:10:53. (16 seconds per item)” consists of different parts and that you simply want to add one more part, you can focus on that part of the program. – less work.

Test: Negative exemplar - “add Ent schema for task planner” - don’t know which parts of the output I want to change and thus which parts of the input I’ll have to modify. – stuck. Lots of “work”.

To Know what to Change, get Fine-grained Tests (aka Controlled Experiments)

Corollary: If you can run a fine-grained test (toggle just a few variables) -> small diff -> know exactly which branch of my code to change -> less work.

If you can run only a coarse-grained test -> large diff -> don’t know exactly which branches to change -> more work; have to write multi-variable branches and later refactor.


Test: Tried to do 5 things at the same time once my winter vacations started. Failed. Felt like none of those 5 things would work out. – coarse-grained test; have to do all 5 things correctly before I will tell myself “my plan succeeded”. Don’t get feedback about intermediate moves. Don’t know which one is actually causing the overall plan to fail - just see that I’m not making any progress. Don’t know what to change.

Test: Representative test idea - tried on a “complex” causal decision theory chapter - felt like I had to handle a lot of different variables at the same time (difficulty, equations, prose, background knowledge) before my idea would work. – coarse-grained test; don’t know exactly what to change about my idea to make the test pass; have to try lots of possibilities.

Test: pipgetc - worked for the non-blocking case; now added blocking - I felt like everything else worked and I just needed to support blocking on the pipe – fine-grained test; small diff (non-blocking vs blocking; everything else is the same); know exactly which branch to change.

Minimum Interface gives you a Small Diff

Hypothesis: Don’t think of the interface of each part -> feel that you have to do a lot of things at once (large diff) -> overwhelmed.

Think of the interface of each part -> few things at a time (small diff) -> can do it.

Corollary: Have a detailed end-to-end acceptance test -> can induce the interface of the different parts from it.


Test: Struggled to write the get and put functions. Why? Got intimidated because I didn’t think of the interface for each part. In short, I didn’t use locality of causality. – I thought it had to be some big feature to be implemented all at once. It seemed like I had to implement a large diff before I could get any feedback.

Test: Acceptance test for the get command (wow, I didn’t think even once about the actual flow of data) - they enter “get foo.txt” - you send that to the server - it checks the file, creates a new port, and sends it to you “get port: 13355 size: 205” - you then connect to that port and receive the contents and test whether you’ve got 205 bytes. You close the port (as far as this test is concerned). The end. – how fucking clear was that?!

Decompose the Input (??) (The Output, actually)

Hypothesis: Input, treat it like some monolithic data structure -> have to deal with a lot of input configurations.

Input, treat it like some decomposable data structure (where the parts are decoupled) -> fewer input configurations.

Hypothesis: Sum-product type, decompose -> easily done.

Some other type, decompose -> pretty hard.

Corollary: Decompose reality.


Test: Guy talking about TDD - parseQuery in Java took several red-green-refactor cycles. I did it in one line of Haskell and it was right too. Why? Why didn’t I need all the extra unit tests? I suspect that most of them were already covered by the functions I was composing. (The empty string one wasn’t.) – The guy was treating QueryString like some monolithic data structure that could have any number of possible configurations. I was treating it like a bunch of strings separated by “&” and then separated by “=” within those.

import qualified Data.Map as M
import Data.List.Split (splitOn)  -- from the “split” package

m = M.fromList . map ((\[x, y] -> (x, y)) . splitOn "=") . splitOn "&" $ "name1=value1&name2=value2&name3=value3"

Maybe these were well-worn paths that many programmers had already trodden. It would be a mistake to write fresh unit tests for them, as if you were completely uncertain as to their behaviour.

Observation: parseQuery is way too specific. What you really have is a bunch of a’s with delimiters. Writing a specific function for that is just stupid.
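A self-contained sketch of that idea (the splitOn' helper and the empty-string clause are my additions; the empty string was the one case the composed functions didn’t already cover):

```haskell
import qualified Data.Map as M

-- Hand-rolled splitOn so this sketch needs no external packages.
splitOn' :: Char -> String -> [String]
splitOn' c s = case break (== c) s of
  (a, [])       -> [a]
  (a, _ : rest) -> a : splitOn' c rest

-- Same shape as the one-liner: split on "&", then on "=" within each part.
parseQuery :: String -> M.Map String String
parseQuery "" = M.empty
parseQuery q  = M.fromList . map pair . splitOn' '&' $ q
  where pair kv = let (k, v) = break (== '=') kv in (k, drop 1 v)
```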

Chain = Decompose, Transform, Compose (??)

Hypothesis: Chain of functions -> each function either decomposes the previous data structure or composes it into something bigger or transforms each part without composing it.


Test:

import Data.List (isInfixOf)

grep p = filter (p `isInfixOf`)  -- grep wasn’t defined here; a minimal version

callGrepOnFile p = unlines . grep p . lines

-- Read the files, grep each of them, and combine the results.
callGrep p fs = fmap (concat . map (callGrepOnFile p)) . mapM readFile $ fs

lines breaks down the string into a bunch of lines (each of which is simpler than the whole file and is independent of the rest). grep p transforms the lines into some other set of lines (just a subset). unlines composes the bunch of lines into a string.

mapM :: (Monad m, Traversable t) => (a -> m b) -> t a -> m (t b)

Next, mapM readFile transforms a list of files into a list of monads and then composes those monads to get an overall monad. Basically, you can see it as composing a bunch of files. fmap transforms part of a structure without touching the rest. map (callGrepOnFile p) transforms each element of that part-structure using callGrepOnFile. Finally, concat composes those elements into a valid part-structure.

Integration Test: Follow the Values to get the High-level Design

[2017-12-25 Mon] Go through Bird section 6.1 taking for granted whatever he claims.

Hypothesis: Integration test, accept all the deductions (function calls) based on their given outputs, actually follow the values across steps from start to end -> get the type signature of the high-level components, maybe even induce how it works based on the concrete values.

For example, if step 1 takes X and outputs Y, then you must ensure that step 2 takes Y and outputs Z.

Integration test, don’t have outputs for some deductions (function calls), follow the values -> may get stuck on the type signature and on how it works.

Question: Why would following values give the “high-level design”?

Hypothesis: Follow the values -> know what values each component of the program takes in and gives out -> narrow down the causes of each value, can decouple the components.

Don’t follow the values -> don’t know what values each component of the program takes in and gives out -> don’t narrow down the causes of each value, can’t decouple the components.

Hypothesis: Integration test, get inputs and outputs for all the function calls -> decent representative tests for those functions, maybe even induce how they work.

Question: What if you gave somebody a fully-representative test of paging or some other large problem? Would they be able to design it in a quarter of the time you took?

For example, consider a test that involved multiple processes accessing virtual memory as well as non-virtual memory and having page faults and needing to replace pages in FIFO order while using semaphores and so on. Assume that all output values are specified at each point.


Test: I saw

= [wrap, snoc] . (id + val^c x id) . [embed, op]^c
-- coproduct
= [wrap, snoc . (val^c x id)] . [embed, op]^c

and induced that “coproduct” simply meant taking [a, b] . (c + d) and returning [a . c, b . d]. That’s all there is to it, at least as it is used in this proof! – just followed the values; got how it worked!
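In Haskell terms (my translation of the categorical notation, not the book’s), Either plays the sum, `either a b` plays [a, b], and `bimap c d` plays (c + d), so the law says `either a b . bimap c d = either (a . c) (b . d)`:

```haskell
import Data.Bifunctor (bimap)

-- Left-hand side: apply c or d first, then dispatch to a or b.
lhs :: Either Int Int -> Int
lhs = either (+ 1) (* 2) . bimap (* 10) (+ 5)

-- Right-hand side: fuse each pair of arrows.
rhs :: Either Int Int -> Int
rhs = either ((+ 1) . (* 10)) ((* 2) . (+ 5))
```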

Test: Not following values - I saw the long chain of equational reasoning. Kinda “followed” each step by assuming that its proof step actually did what they said it did. But couldn’t see the link between that whole chain and the final argument that digits = val^c and thus digits would convert a number to digits. I hadn’t followed the values from the chain to the final paragraph – didn’t get the type signature of the model; didn’t induce how it worked.

Test: Follow values closely - already realized that val takes a decimal representation of a number to the number itself, and that thus val^c is just trying to get at the inverse! Crystal by the end of it! (Took a bit of time, though. Still, I was guaranteed to understand it by the end.) – got the type signature of each component (such as val, val^c, and function refinement) and even induced how it works (want foo; can see that there is a bar that goes in the opposite direction; get a “functional refinement” of bar^c and define foo as that function.);

Test: MATLAB Cross-validation page – Understood it in depth! (Took 24 minutes, though. Had 2.5k words. So, that’s 6.25k words per hour = 104 words per minute. Need to speed that up.) But it worked. Also learnt how to visualize things during feature selection (have a plot of MCE vs number of features, etc.). Got type signatures for the different parts of it (like holdout vs k-fold cross-validation, feature selection itself, ttest2, MCE, MCE with resubstitution - even though I’m not sure how it works, MCE vs number of features, how to judge whether you’re at a local minimum, etc.)

Test: Knaster-Tarski theorem 6.1 – took around 15 minutes. Not as crystal clear as the previous ones. But I did get the gist. However, I don’t understand the final point (equations 6.2 and 6.3). (5 minutes later, I still don’t get the last paragraph.) So, this technique isn’t infallible. But it’s better than nothing. Got the type signature. Got stuck because I didn’t understand the line “Each of the equations … has a least solution and these least solutions coincide.” and I needed it to understand the last paragraph. Don’t have the contract since I didn’t see any concrete values and didn’t understand the theorem statement.

Explanation: Didn’t have the output value for the unfamiliar function call “least solutions coincide” and couldn’t see a concrete example showing me the output value.

Test: Follow the values to decouple - paging - when switching processes, you need to have different page directories and page tables; when accessing a virtual memory address, you need to have a page fault handler that can get a new frame for that page (and possibly its page table); when replacing a page, you need to have a free frame list with FIFO information and a backing store that has the data for the incoming page and can store data for the evicted page. – know the causes of paging during process-switching, page-fault handling, and page replacement.

Test: How to create a one-to-one mapping (and more importantly, how to get P(y = 0 | z = 0) from y = fy(z, uy)) - test on Shpitser’s paper - got it. – technique: one-to-one mapping; known working example: Shpitser’s paper; clear feedback - what his proof said (as opposed to my vague incomplete attempts).

Minimum Interface: Try Alternative Values and Abstract (??)

Hypothesis: Follow the values, try alternative values (toggle it), no change -> abstract that value; reuse known functions.


Test: Want to convert roman numerals to decimals. Use Parser. – I broke down the concrete problem of reading a roman number to the known technique of parsing.

Test: Paging - broke it down to get_new_frame, page_fault_handler, read_bs, etc. – Well, get new frame needed to care only about the frame list and the replacement policy.

Test: ATM example - suppose you were given lots of extraneous details like the make of the card reader, the keyboard model, etc., how could you strip it down to the core problem of requesting money and getting it? – toggle the values and realize that the outcome would be the same [may not always be so].

Values are Easier to Understand than Unfamiliar Types

Hypothesis: Unfamiliar type or category -> unfamiliar function calls; have to look them up.


Test: Phone number problem - could follow the input values from start to finish because I saw them simply as strings. You received an input string like “10/783–5” and got a digit-only string like “107835” and then you broke it up into prefixes and suffixes and tried to match them with dictionary words and so on. – It was all simple and understandable; I know how Strings behave.
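That first string-to-string step is one line once you see the values as plain strings (a sketch; I’ve written the input with an ASCII hyphen, since the dash in the problem statement is typographic):

```haskell
import Data.Char (isDigit)

-- Turn a phone number like "10/783-5" into its digit-only form.
digitsOnly :: String -> String
digitsOnly = filter isDigit
```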

Test: Phone number problem - contrast that to my first attempt - look at the types: DigitString, Dict, Token, NonDeterministicEncoding, Parser, PState, DictWord, PhoneNumber. – You have to work pretty hard to understand all of them. And here’s the rub: you can’t fully understand the code unless you’ve understood them. I’d have to keep looking up the definition.

Test: Look at this code (from 2014 - I don’t think I would have written the same code now):

parse' :: Dict -> PState -> [PState]
parse' _ ps@([], _) = [ps]
parse' dict ps = flip nextPStates ps . mapMaybe (lookupToken dict) . prefixSplits . fst $ ps

You can’t really understand that code if you don’t know how PState behaves. – I can’t get started until I look up PState or even Dict. Slow.

Test: Here’s another sample:

nextPStates :: [(DigitString, Token)] -> PState -> [PState]
nextPStates [] ps = nextPStatesWhenNoNewTokens ps
nextPStates ts' (_, ts) = map (fmap (:ts)) ts'

Explanation: Can’t move forward until I know DigitString and PState. Slow.

Test: Contrast that to my current code (which isn’t beautiful or anything) TODO

Test: Reading Applied Behavioural Analysis - constantly got stopped in my tracks by the definitions they kept laying out - “respondent conditioning”, “modeling”, and other things. I didn’t know what those labels referred to and I had to struggle to figure it out. – can’t move forward without knowing those terms; hard to follow. Have to keep looking up the definitions.

Test: Category theory book or Bird book - full of unknown types like “initial object”, “group”, “product”, “coproduct”, “functional refinement”, “catamorphism”, etc. – can’t move forward unless you know; have to keep looking up the definition.

Types prevent you from Following Values (??)

Hypothesis: Types -> stop you from following values and thus stop you from inducing the types and mechanisms of different parts.


Test: Trying to apply the “fusion law” for folds to a practical problem. Not able to get through because of the maddeningly-similar fold and listFold. They have the types fold :: (a -> b -> b, b) -> [a] -> b and listFold :: LAlg a b -> ListF a b -> b.

Note that I first have to unpack LAlg, which is type LAlg a b = (a -> b -> b, b). So, the first arguments are literally the same type. Then, I have to unpack ListF, which is data ListF a r = NilF | ConsF a r. That just seems like the List type with extra steps.

I can’t see any substantial difference between listFold and fold. But the condition and result of the law include both, so there must be some difference.

Explanation: I can’t see any values passing around. Hard to induce the work being done.

I mean, take the fusion law itself:

h . listFold f = listFold g . listMap h
==>
h . fold f = fold g

What is the type of h? What are the types of f and g? Are they of the same type? Not at all sure.

What I need is an actual concrete exemplar.

Test: Let’s look at a concrete practical example.

Immediately, I got some ideas.

(Paraphrased to name f, g, and h as per standard usage.)

If we have the following properties:

  1. h is a strict function.
  2. h (f x y) = g x (h y), for all x and y in the appropriate ranges

Put b = h(a).

Then: h . foldr f a = foldr g b.

Not yet a concrete example, though.

Let’s try it with the ascending lists example.

Explanation: Could see what was happening to each element of the list being folded.
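To pin the law down with an instance I can actually check (my own picks of h, f, and g, not Bird’s ascending-lists example): take h = (*2), f = (+), a = 0. Condition 2 forces h (f x y) = 2*(x + y) = g x (h y), so g x z = 2*x + z, and b = h 0 = 0.

```haskell
-- Fusion law instance: (*2) . foldr (+) 0  ==  foldr g 0
g :: Int -> Int -> Int
g x z = 2 * x + z

lhsFusion, rhsFusion :: [Int] -> Int
lhsFusion = (* 2) . foldr (+) 0  -- sum, then double
rhsFusion = foldr g 0           -- double each contribution as you fold
```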

Follow the Values in the Final Program: Induce Program Design Principles (??)

Hypothesis: Follow the values in the representative test, follow the values in the final program, induce -> model for representative test to high-level design.

Corollary: Get model for representative test to high-level design -> store that technique on that concrete cue.


Test: Paging - could see that page faults needed to go through the page fault handler, look up stuff in the page directory and page table, evict a page, and so on. Those concepts remained in the final program too. – So, I could have designed the system like that right from the beginning.

Abstract from Concrete Representative Exemplars

Hypothesis: Start with a concrete example -> know it will work; can abstract quickly.

Start with an abstract function -> don’t know if it will work; slow.


Test: Elisp function to compile and run this C/C++ file. – Took me 10 minutes. Writing the function itself took only 5 minutes or so. Binding the key took up the rest of the time. Helped to abstract my way to the ideal function. First, I simply ran “g++ popen-try.cpp && ./a.out” by hand. That worked. Then, I wrote a simple function that did that. Worked. Then, I abstracted that function to work with arbitrary file names (and use the buffer file name if there was no argument). Worked like a charm. When the keybinding was giving me some trouble, I reverted to a simple non-Evil keybinding that worked, then tried an Evil keybinding that worked, and finally used my desired key.

Fast Index for your Techniques: Integration Tests (Categorize the Output)

Corollary: Real-world scenario, can’t remember which technique to use, look only at your integration tests for a similar scenario -> have to remember just the integration tests and will automatically have concrete discriminating cues; quickly get the single function you want; guaranteed up-to-date since you will replace old techniques with better ones; DRY because you can’t have multiple hypotheses for the same scenario.

Real-world scenario, can’t remember which technique to use, no integration tests -> have to remember a lot of context-less techniques for which I don’t have discriminating cues; slow; may not be up-to-date; not DRY because I may see several techniques for the same (poorly-categorized) scenario.

New scenario that your past acceptance tests don’t cover, search your techniques (aka weapons) -> hopefully respond correctly; add as an acceptance test.

Corollary: Scenario not in integration test, go online and add the technique to an integration test first -> every technique you needed to look up will be in an integration test for easy lookup later; quicker way to test the new technique.

Corollary: We can easily remember integration tests (even if we don’t remember the answers). We struggle to remember generic text.

Hypothesis: Maybe all you need are tests for real-world scenarios, even if they don’t make an end-to-end acceptance test. That way you can still look up what to do in any given situation. (But an end-to-end representative test lets you induce the whole function. Still useful.)


Test: Want to use a pre-commit hook to run lint - look at how the demo repo does it with Ruby. – done. quick - no need to search 30k words of my programming notes. got one single function - exactly what I wanted. up-to-date in terms of versions and quality; DRY - didn’t have to decide between two different techniques.

Test: No integration test - wanted to write OS HW3 paging faster. Didn’t know what to do. There were around 15k words of programming notes. Kind of stuck. – Could see that I would have to read through a lot to get back to my idea of toggling behind an interface and minimizing interface by coding everything in separate files and stuff. Saw lots of techniques like one-button change and minimum interface and one branch at a time and program to an interface and so on.

Test: Self-help books with a bunch of techniques (such as The Little Book of Talent: 52 Tips by Daniel Coyle, Writing Tools: 50 Essential Strategies for Every Writer by Roy Peter Clark, or Your Brain at Work by David Rock, which has 14 tips) - don’t know when to use what. I usually forget to use them. – have to remember a lot of context-less techniques (“Tip 45: For every hour of competition, spend five hours practicing.”); can’t find what I want; abstract scenario - may hit upon several seemingly-applicable techniques.

Test: Have to make a Beamer presentation using Org mode - look it up in my demo Org Beamer presentation, not online. – have a default Org Beamer template I can use anywhere; can experiment on a simple template.

Test: C pointer details - hard to remember the abstract knowledge about p + i and (void *) p and (struct foo *) p and so on. Easier to remember a test I ran on all these examples, from where I can easily get the answers I want. – done.

Test: New scenario - Milo against Avada Kedavra - none of his usual spells worked; use the Skeletal Troll. – he probably searched the rule book for something like this. He could then use it against anyone who cast the killing spell against him.

Test: YBAW - overwhelmed by a project plan - use a story board to represent a few big ideas. – it’s not a full story, just a sketch.

High-level Design needs an Integration Test (maybe with Mocking)

Corollary: Have integration test with expected output (maybe mocking out each lower-level function for the particular test input), get high-level design -> low uncertainty.

Don’t have integration test, get high-level design -> high uncertainty.


Test: OS HW3 - change design of my paging code - don’t know which one will be better (and still pass all tests). Don’t know how to implement free-frame list in the best way or how to decouple the heap and backing store from the rest of the program, etc. – didn’t have any integration test, lots of uncertainty.

Test: OS HW2 - pipes - knew how the high-level pieces were going to fit together. Basically, didn’t have to do any high-level design changes. Just implement pipgetc, pipputc, etc. and the test code would simply use a pipe like a normal device. – didn’t need to change the high-level design.

Test: OS HW3 - tried one example of virtual memory page access - got the general idea of how control flowed. Got the high-level design (heap was used just to allocate memory, backing store was really used only when there were insufficient frames). – integration test helped me get the answer quickly.

Deployment Pipeline: Flesh out the Integration Test (?? duplicate)

Hypothesis: Deployment pipeline for hypotheses or programs = add a high-level integration test with function calls mocked out so that it all flows; fill out mocked-out functions using inputs and outputs from the integration test; release by running integration tests + unit tests; you’re done passing that integration test when you no longer have any mocks in the integration test -> decoupled high-level design by following the values; isolated implementation for the inner functions by using real-world tests; working product at every point as certified by the integration test; frequent releases by implementing inner functions quickly and running the integration test + unit tests;

don’t have high-level integration test; add functions without an overarching test -> may commit to poor design; isolated implementation for functions; won’t have a working product for a while since there are no integration tests passed; infrequent high-level releases.

Corollary: Deploy using integration test -> no duplicate or overlapping hypotheses because you can only implement something that has been mocked out in the integration test.

Don’t deploy using integration tests -> duplicate or overlapping hypotheses.

Question: How does this make for small diffs?


Test: Learning about myosin filaments and stuff - first took myosin filaments as black boxes just to make the high-level example flow, and then filled out my model of them using the next section. – high-level flow; inner definition; run integration test by checking if it fits in the overall flow (need to do it mentally); done when you can explain the whole integration test in detail -> decoupled design (myosin filaments are at a lower level); isolated implementation (yup - you get feedback from the questions at the end of the myosin section); working product at any point because you know how the overall thing works; frequent releases (?).

Test: Bird - number to digits or Knaster-Tarski theorem - followed the high-level flow without knowing how the inner functions worked. Later, could fill them out - such as coproduct or functional refinement. – high-level flow; fill out inner functions, sometimes just by inducing using the high-level test (for example, coproduct); run tests by checking the overall flow -> decoupled design because you learn each concept in isolation and then piece it together with the rest; working product at any point because I choose to skip some esoteric function definitions; frequent releases because you can run the integration tests.

Test: No high-level integration test - current refactoring of causal hypothesis essay. Just refactoring pieces here and there. – No idea how it all fits together and may have a poor design that isn’t related to any real-world problem. Definitely won’t have a working product for like 50 hours or so! Can’t test the overall picture.

Test: [2017-12-27 Wed] My one-button change essay. No real idea how it all fits together - the old random notes, the newer well-tested little ideas that still run up to an 11k-word tangle, and then some ideas from my journal and phone. – no high-level integration test; add new ideas like “follow the values -> narrow the causes” without an overarching test -> poor design for sure; won’t have a working product for a while since I’m not running any integration tests; infrequent release of integration tests (maybe when I write a 2-hour program like TicTacToe or something - which I approach intuitively anyway).

Test: Duplication - causal hypothesis - a bazillion explanations for how we understand things - hierarchical categories, stories, grammar, etc. – no integration test.

Acceptance Test: Talk only about Domain Variables

Hypothesis: Domain variables -> necessary for all use cases; so start with this.

Fancy stuff like the type of messaging protocol or the database type -> not necessary for all use cases; do it later.


Test: Auction sniper example - “Given auction, when I join and don’t bid, then I will lose” - nothing about XMPP or Swing or anything else. – simple test; can change the protocol and such. Start with the domain stuff.

Acceptance Test: Go from End to End (?? no negative exemplar)

Hypothesis: Acceptance test - must go from end to end, not merely test some intermediate state -> fewer overall tests, more likely to cover highly-likely money-making paths, no unused paths.


Test: “Journey test” - Jez Humble - shopping website - login, add 3 items to cart, checkout, add delivery address, pay, logout – highly likely path; one that actually makes the money; not unused.

Test: Could Gambling Save Science?

NEUTRINO MASS Betting markets could also function in the absence of overt controversy, as in the following (hypothetical) story.

Once upon a time the Great Science Foundation decided it would be a “good thing” to know the mass of the electron neutrino. Instead of trying to figure out who would be a good person to work on this, or what a good research strategy would be, they decided to just subsidize betting markets on the neutrino mass. They spent millions.

Soon the market odds were about 5% that the mass was above 0.1eV, and Gung Ho Labs became intrigued by the profits to be made. They estimated that for about $300K spent on two researchers over 3 years, they could make a high confidence measurement of whether the mass was above 0.1eV. So they went ahead with the project, and later got their result, which they kept very secret. While the market now estimated the chance of a mass over 0.1eV at 4%, their experiment said the chance was at most 0.1%.

So they quietly bought bets against a high mass, moving the price down to 2.5% in the process. They then revealed their results to the world, and tried their best to convince people that their experiment was solid. After a few months they mostly succeeded, and when the price had dropped to 0.7% they began to sell the bets they had made. They made $400K off of the information they had created, which more than covered their expenses to get that information.

Of course if Gung Ho Labs had failed to convince the world of their results, they would have faced the difficult choice of quitting at a loss, or holding out for the long-term. No doubt a careful internal review would be conducted before making such a decision.

Gung Ho would be free to use peer review, tenure, and fixed salaries internally, if they are effective ways to organize workers. The two researchers need not risk their life savings to be paid for their efforts. But the discipline of the external market should keep these internal institutions from degenerating into mere popularity contests.

Explanation: He actually walked through a concrete, made-up, end-to-end scenario, just like Joel Spolsky recommends:

When you’re writing a spec, an easy place to be funny is in the examples. Every time you need to tell a story about how a feature works, instead of saying:

The user types Ctrl+N to create a new Employee table and starts entering the names of the employees.

write something like:

Miss Piggy, poking at the keyboard with an eyeliner stick because her chubby little fingers are too fat to press individual keys, types Ctrl+N to create a new Boyfriend table and types in the single record "Kermit."

Acceptance Test: High-level Decoupling (?? refile to the domain variables section)

Corollary: Acceptance test -> you put the domain variables first -> high-level decoupling; everything unmentioned gives you degrees of freedom as per your design choices.


Test: ATM example - doesn’t say which model of the keyboard or monitor or card reader or even bank you’re using. You can vary that as you please. All that matters is that when I have $100 in my account and ask for $20, I should get $20 and have $80 left in my account. – your high-level design shouldn’t worry about the low-level model types. You have to assume that the user will have those high-level actions. For example, you can assume that the user can enter “$20” on any keyboard - whether it’s a touchscreen, a numeric keypad, or a QWERTY keyboard (or even that he can enter it via a voice-recognition machine).

Integration Test followed by Unit Tests (?? write unit tests as demanded by your new features)

Corollary: Write one failing integration test that doesn’t have any branches anywhere, come up with high-level design, write unit tests for each module, put it together -> high-level design at speed, module design at speed, correctness.

Write just the failing integration test, no unit tests -> may not handle certain cases - errors, high uncertainty about modules - slow, hesitate to change code.

No integration test, write just unit tests -> uncertainty about design - slow, don’t know how to proceed.


Test: OS HW3 - obtaining a free frame - if the page was dirty, write it to the backing store. – got the high-level design, but don’t know if my get_new_frame function handles all possible cases well. Hesitant to make global changes like adding a new field to invpt_entry.

Explaining to Others makes your Input and Output Explicit

Hypothesis: Explain to others -> force you to categorize your situation, i.e., label the input and output variables; know exactly which variables you’re uncertain about (by looking at the positive and negative exemplars); can toggle those variables in an experiment or learn about their behaviour online.


Test: Was going to post a question to the TA online. Was going to say “the program crashes in gdb after the strcpy call”. But then I thought, come on, maybe that’s not the case. Try stepping to the next instruction (didn’t work) or try setting a breakpoint after the call. And that worked! – that simple experiment fixed a lot of things. All because I made my hypothesis explicit - “strcpy makes my program crash”. Immediately, I could think, “what if it isn’t strcpy?” and test it.

Test: Another question to the TA - wrote down a “minimum, verifiable, complete” description of my problem. – made it easy to see that the main reason why my program worked in gdb (positive exemplar) but not in the terminal (negative exemplar) was because of the environment (the diff between the two).

Test: Another example - wanted to ask a question about how to get two addresses from a program. Realized that I just needed one address. – all because I wrote it out.

Test: Another question to the TA - how to get the same random number in Python as from C’s rand(). Searched online several times; didn’t get it. Typed my question into the forum and decided to search one last time. Got it! – making my question explicit seemed to help. My vague half-hour’s worth of struggles boiled down to one simple query - “how to get the output of rand() using Python?”. Googled for it and got the answer. (Narrowed down my uncertainty to one variable.)

Representative Test: Cannot be Passed by Any Existing Function

Hypothesis: Representative test = something that cannot be passed by any existing function.

(What if it could be passed by some other function in the future? I suspect that function would be longer and thus have its prior probability penalized.)

Hypothesis: Mechanism, representative test -> good experiment that gives you strong evidence about your mechanism?


Test: dropEnd :: Int -> Seq a -> Seq a – Testing it with a list is not good enough. You could have just used drop. You have to use something for which no existing function would work. example: a pipe (or something).
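The same idea in a Python sketch with plain lists (names invented; the point above is stronger - use a structure like Seq or a pipe for which no existing function fits at all):

```python
# dropEnd n xs should remove the last n elements.
def drop_end(xs, n):
    return list(xs[:len(xs) - n]) if n > 0 else list(xs)

def first_three(xs, n):
    # an unrelated existing function that happens to pass the naive test
    return list(xs[:3])

# Not representative: the wrong-shaped function also passes it.
assert drop_end([1, 2, 3, 4, 5], 2) == [1, 2, 3]
assert first_three([1, 2, 3, 4, 5], 2) == [1, 2, 3]

# Representative: only the intended behaviour passes.
assert drop_end([1, 2], 1) == [1]
assert first_three([1, 2], 1) != [1]
```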

Test: Applicative laws proof for AT (which is just a binary tree with a type variable):

-- To prove:
Law 1: fmap snd $ unit *&* x == x

where (*&*) = liftA2 (,)
unit = pure ()

Super-simple example of AT = (B (L 4) (L 7))
Representative example of AT = B (B (L 3) (B (L 4) (L 7))) (L 8)

unit = L ()

LHS = fmap snd $ unit *&* x
= fmap snd $ L () *&* (B (L 4) (L 7))
= fmap snd $ liftA2 (,) (L ()) (B (L 4) (L 7))
= fmap snd $ (pure (,) <*> (L ())) <*> (B (L 4) (L 7))
= fmap snd $ L ((),) <*> (B (L 4) (L 7))
= fmap snd $ (B (L ((), 4)) (L ((), 7)))
= (B (L 4) (L 7))
= x
Hence, proved.

Law 2: fmap fst $ x *&* unit = x (Right identity)
Super-simple example of AT = (B (L 4) (L 7))

LHS = fmap fst $ x *&* unit
= fmap fst $ (L (,) <*> x) <*> L ()
= fmap fst $ (L (,) <*> (B (L 4) (L 7))) <*> L ()
= fmap fst $ (B (L (4,)) (L (7,))) <*> L ()
= fmap fst $ (B (L (4,) <*> L ()) (L (7,) <*> L ()))
= fmap fst $ (B (L (4, ())) (L (7, ())))
= (B (L 4) (L 7))
= x

Law 3: fmap assocTuple $ x *&* (y *&* z) = (x *&* y) *&* z (Associativity)
where assocTuple (x, (y, z)) = ((x, y), z)

Super-simple example of AT = (B (L 4) (L 7))

x = (B (L 4) (L 7))
y = (B (L 1) (L 2))
z = (B (L 8) (L 9))

LHS = fmap assocTuple $ x *&* (y *&* z)
= fmap assocTuple $ x *&* ((L (,) <*> y) <*> z)
= fmap assocTuple $ x *&* ((L (,) <*> (B (L 1) (L 2))) <*> z)
= fmap assocTuple $ x *&* (B (L (1,)) (L (2,)) <*> z)
= fmap assocTuple $ x *&* (B (L (1,)) (L (2,)) <*> B (L 8) (L 9))
= fmap assocTuple $ x *&* (B (L (1,) <*> B (L 8) (L 9)) (L (2,) <*> B (L 8) (L 9)))
= fmap assocTuple $ x *&* (B (B (L (1,8)) (L (1,9))) (B (L (2,8)) (L (2,9))))
= fmap assocTuple $ (B (L 4) (L 7)) *&* (B (B (L (1,8)) (L (1,9))) (B (L (2,8)) (L (2,9))))
= fmap assocTuple $ (B (L (4,)) (L (7,))) <*> (B (B (L (1,8)) (L (1,9))) (B (L (2,8)) (L (2,9))))
= fmap assocTuple $ (B (L (4, ) <*> (B (B (L (1,8)) (L (1,9))) (B (L (2,8)) (L (2,9))))) (L (7,) <*> (B (B (L (1,8)) (L (1,9))) (B (L (2,8)) (L (2,9))))))
-- Wow. And this is for the simplest B-values for x, y, and z!
= fmap assocTuple $ (B (B (B (L (4, (1,8))) (L (4, (1,9)))) (B (L (4, (2,8))) (L (4, (2,9))))) (B (B (L (7, (1,8))) (L (7, (1,9)))) (B (L (7, (2,8))) (L (7, (2,9))))))

RHS = (x *&* y) *&* z
= ((B (L 4) (L 7)) *&* (B (L 1) (L 2))) *&* z
= ((B (L (4,)) (L (7,))) <*> (B (L 1) (L 2))) *&* z
= (B (L (4,) <*> (B (L 1) (L 2))) (L (7,) <*> (B (L 1) (L 2)))) *&* z
= (B ((B (L (4, 1)) (L (4, 2)))) ((B (L (7, 1)) (L (7, 2))))) *&* z
= (B ((B (L (4, 1)) (L (4, 2)))) ((B (L (7, 1)) (L (7, 2))))) *&* (B (L 8) (L 9))
= (B (B (L ((4, 1), )) (L ((4, 2), ))) (B (L ((7, 1), )) (L ((7, 2),)))) <*> (B (L 8) (L 9))
= (B (B (L ((4, 1), ) <*> (B (L 8) (L 9))) (L ((4, 2), ) <*> (B (L 8) (L 9)))) (B (L ((7, 1), ) <*> (B (L 8) (L 9))) (L ((7, 2),) <*> (B (L 8) (L 9)))))
= B (B (B (L ((4, 1), 8)) (L ((4, 1), 9))) (B (L ((4, 2), 8)) (L ((4, 2), 9)))) (B (B (L ((7, 1), 8)) (L ((7, 1), 9))) (B (L ((7, 2), 8)) (L ((7, 2), 9))))

Damn, they're equal. I thought my definition of AT was wrong (because it seemed to nest the right-hand B in B <*> B.).

Observation: I didn’t even use the case where it’s B <*> B. But that isn’t part of the Applicative definition. It’s covered under the case B fx fy <*> x.
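The calculations above can be machine-checked against a concrete model of AT (a Python sketch; L, B, fmap, ap, and pair_with are stand-ins for the presumed Haskell instance L f <*> t = fmap f t and B l r <*> t = B (l <*> t) (r <*> t)):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class L:
    val: object

@dataclass(frozen=True)
class B:
    left: object
    right: object

def fmap(f, t):
    if isinstance(t, L):
        return L(f(t.val))
    return B(fmap(f, t.left), fmap(f, t.right))

def ap(ft, t):
    # L f <*> t = fmap f t;  B l r <*> t = B (l <*> t) (r <*> t)
    if isinstance(ft, L):
        return fmap(ft.val, t)
    return B(ap(ft.left, t), ap(ft.right, t))

def pair_with(a, b):
    # (*&*) = liftA2 (,)
    return ap(fmap(lambda x: lambda y: (x, y), a), b)

unit = L(())
x = B(B(L(3), B(L(4), L(7))), L(8))  # the representative example above

law1 = fmap(lambda p: p[1], pair_with(unit, x)) == x  # Law 1
law2 = fmap(lambda p: p[0], pair_with(x, unit)) == x  # Law 2
```

Both laws hold on the representative example, matching the hand calculation.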

Fully-Specified Representative Test (??)

Hypothesis: Test that sounds simple enough, but the implementation is pretty “complex” <- lots of other conditions not specified in the test.


Test: Pipes for XINU - sounds simple enough - just send the output of the writer as the input of the reader. – Problem: conditions like writer may be killed at any time, as can reader, the buffer may become full, one of them may starve due to low priority, etc. Your pipes implementation must handle all of these cases.

Extend Representative Test with every Diff (??)

Hypothesis: Extend your representative test with every diff. Either the diff will extend a chain or it will add a branch (or do some other thing). Your representative test should capture that.

Test: Super Mario - v1.0 - just jumping around. Next, moving forward too. – initial representative test is whether you can jump around; next test checks if you can move forward too (which is a different branch).

Acceptance Tests must transform Input to Output for a Positive Exemplar (??)

Hypothesis: Start with a minimal positive exemplar where the input uses only one or two features -> can get feedback quickly; know exactly what to change in your code, therefore quick.

Start with a positive exemplar where the input uses lots of features -> can’t get feedback until you have all features working; don’t know what to change when debugging, therefore slow.

Positive exemplar where you can see how the input is transformed into the output -> can use it as a fully-specified acceptance test that helps you design your code; takes long to get feedback; hard to debug; slow. However, this might give you function-wise tests that produce quick feedback.

Don’t know the output for some input -> can’t use it as a test.

Lesson: Don’t talk about a hypothesis until you have at least one positive exemplar (ideally a minimal positive exemplar).


Test: Tried to get an integration test for my learning mechanism - “Digits of a Number section 6.1, Bird.” – I have the input. But I don’t have the output! Can’t use it as a test (and get a high-level design).

Test: Phone number problem - just 10/783--5 to neu o"d 5 isn’t enough. – You need to show how you remove the punctuation, split the string, and lookup words in the dictionary. (fully-specified acceptance test)

Test: Minimal positive exemplar - just enable paging with a global page directory and five global page tables and no per-process page directory - simple program; useful feature. Hmm… The output of the paging program could be pretty “complicated”. It may have involved page-fault handling, per-process page tables, page replacement, FIFO, backing store, etc. But we chose to target a small subset of that.

So, the challenge was to come up with an input that didn’t invoke page-fault handling or per-process page tables or page replacement or backing store accesses. The input was a string of memory references.

I could see that accessing an address beyond the first 4096 pages would cause a page fault because it wouldn’t have been in the global page tables and would need to be handled by the page fault handler, which I didn’t have. So, that was out.

Similarly, switching between processes and writing to the same virtual memory address would need to lead to different values. So, that was out too.

Ditto for reading so many different pages that I had to replace one of them or for reading from some page that had been evicted (and presumably saved to the backing store).

Explanation: Program + simple input should give simple output. However, if it gives some other output, then you know that you have to change something about the program. And because you’re changing just one feature in the program, you know exactly what to change and so can quickly implement the feature. Also, you can tell what functionality is added by that feature alone.

Test: Mario MVP - just jumping and moving. No princess or turtles or spiky pits. – can get feedback about that feature alone; quick because you know that any problems with the output must be due to the one feature you changed.

Test: Lots of features - scheduler - tried to change lots of things and was mired in debugging output. – slow to get final feedback; hard to debug. Basically had to change all parts of the default scheduler, instead of changing one thing at a time.

Unify Interfaces for Ease of Use

Hypothesis: Common interface and common syntax means that you don’t have to learn different interfaces.

Idiosyncratic interfaces mean that you have to keep learning superficially different interfaces for the same kind of thing.

Corollary: Why text editors and shells rock - they provide the same interface to a ton of programs!

Corollary: Why books rock - unified interface to a ton of different fields.


Test: Bunnylol - id <car-id> | <person-id> | <product-id> is super easy to use.

This wouldn’t have been:

car-id <car-id> person-id <person-id> product-id <product-id>

And somebody would have been tempted to add specialized flags for each one. So, you would have to learn a whole new interface for each object you wanted to inspect.

You want the same thing from each of them - to observe their fields and edges - and you shouldn’t have to do anything different for each one.

Test: Ditto for t Foo.php and t Foo and t foo/bar/ and t foo/* – I don’t have to remember different interfaces. (And t for running Hack tests and Jest tests and Python tests and whatnot.)

They all provide the interface of a set of tests. Just run them!

Test: Note the meta interface of accessing all these tools from the same Chrome location bar! After all, why not?! – They all share the interface of taking some text and returning a web response.

Test: Emacs deals with ASCII key strokes as the unified interface to all kinds of editing tasks. Want to write code? Rebase your commit? Delete files? It can all be done with a few keystrokes! – unified interface.

Test: The shell deals with ASCII commands as the unified interface to all sorts of programs.

Test: Negative exemplar - GUI - have to learn a new interface for each task you want to do! – lots of different interfaces.

Test: JFC! Books provide a unified interface to all kinds of knowledge! Want to learn about economics? Psychology? Programming? You can do all of that from the interface of a simple book. – unified interface.

Test: Tasks for everything - coding projects, bugs, bike repair, facilities requests, bike loan. – Talk about a unified interface!

Test: Diffs for everything - want to edit a gatekeeper? Have to get that diff reviewed! Cool! – unified interface for getting approval for things and for making changes first-class.

Test: [2018-07-31 Tue] Google Cloud printer - simple interface for uploading a file - just upload to the cloud (not to a particular printer near you, whose name you have to find out); simple interface for printing a file - just go to any printer and pick the file from the cloud (instead of going to the particular printer to which you sent the file).

Printing with a cloud = upload to the cloud (easy), go to any printer you want (easy), and download from the cloud (pretty easy).

Printing with a particular printer = search for that printer and download drivers if needed (pretty hard), go to that particular printer (ok if you’ve gone and found the name of the printer near you), and print whatever file you sent (easy).

Basically, with specific printers, you have to learn a new interface each time you come across a new printer. With a cloud, you can reuse the same interface every time.

Test: The same screen-sharing and video-conferencing software in every meeting room. Don’t need to learn new interface.

Test: The same laptops and phones for everybody.

Test: The same chat software throughout the organization.

Refactoring: Imposing Structure on your Code

Refactoring: Write Direct Tests before you Refactor (- lower cycle time)

Hypothesis: Don’t have direct tests, refactor -> may break the code; hard to debug.

Have direct tests, refactor -> reasonably confident; easy to debug (narrow causes).


Test: Phone number - trying to refactor g (yeah). The overall tests for encode keep failing. That’s because I don’t know the contract for g - I don’t have direct tests for it. So, I should see what behaviour it currently produces before I try to clean it up. – broke the code; hard to debug.

Refactoring: Make the Types Compose

Hypothesis: Type signatures -> can see which functions you can compose (and which you can’t and must rewrite); doesn’t take a genius.

No explicit type signatures -> hard to see which functions you can compose; feels like it takes a genius.


Test: Type of g:

g :: [String] -> M.Map Char Char -> (String, String) -> [String]

It’s not composable at all. It’s just a jumble of types. Specifically, it’s not closed. But I’d like to use it over and over, so I want it to be closed (either in the category of functions a -> a or in the category of monads a -> m a). – You can’t compose it with the external type of any other function.

Test: Hmm… Haskell programs generally express one unit of computation with their function types.

Representing the whole computation of phrases as (PrefixDigits, SuffixDigits) -> [Phrase] hides the inner computations. We don’t know what’s happening underneath.

What are we actually doing? We want (a) PhoneNumber -> [Phrase]. So, we are (b) breaking it into (PrefixDigits, SuffixDigits) pairs and then (c) encoding PrefixDigits as a Token (which could be a WordToken or a DigitToken) and then (d) recursively encoding SuffixDigits as [Phrase] and then (e) combining the Token and the [Phrase] such that there are no consecutive digits.

-- (a) and (c)
encode :: PhoneNumber -> [Phrase]
-- (b)
allSplits :: PhoneNumber -> [(PrefixDigits, SuffixDigits)]
-- (c)
encodeToken :: PrefixDigits -> Token
-- (e)
combineToken :: Token -> [Phrase] -> [Phrase]

See how much cleaner that is! And, of course, I would make the types abstract (for example, allSplits doesn’t need to know if the argument is a PhoneNumber or anything; it can work on a simple list).

You can follow the computation from the types. The only way to get a [Phrase] is through combineToken, which needs a Token. We get that from encodeToken, which in turn takes inputs from allSplits. – They compose well with each other.

To Refactor an Impure Function, Make it Pure

Inference: Refactoring needs tests.

Hypothesis: Make it a pure function, observe the inputs and outputs for a sample execution -> acceptance test -> can induce easily.


Test: my-org-drill-insert-last-session-data – took half an hour but it was much cleaner once I made it a pure function.
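A minimal sketch of the refactoring pattern (Python; all names are hypothetical, not the actual org-drill code):

```python
# An impure recorder refactored so its core is a pure function.
session_log = []

def next_entry(log, score):
    # pure core: (old log, score) -> new entry; one observed
    # input/output pair from a sample execution is an acceptance test
    return {"score": score, "count": len(log) + 1}

def record_session(score):
    # thin impure shell around the pure core
    session_log.append(next_entry(session_log, score))
```

With next_entry pure, the observed pair (log=[], score=5) -> {"score": 5, "count": 1} can be asserted directly, with no environment setup.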

Writing Essays for Others forces you get Feedback

Hypothesis: Write essay meant to be published -> actually run over each line with a reader’s eye, have to explain everything in detail -> actually specify your model, actually get feedback.

Write stuff just for “internal use” -> just go with what you feel is the output of the mechanism, can hand-wave over details -> don’t specify your model, don’t get the feedback you could have got.

So, writing for yourself is like writing a 200-line function that has never been tested. Writing for others is like writing a function that you will test pretty soon.

Lesson: Consider your casual judgment to be a poor source of feedback. Instead, prefer the feedback of a reader (even if it’s an imaginary one that pipes up as you’re writing). So, write as close to the final medium as possible. That’s how you’ll force yourself to think like a reader.

Hypothesis: Publishing online differentially reinforces clean writing.


Test: Surrogate endpoints - stuck for nearly six hours on the one-to-one mapping proof for a simple n=1 inducing path. Wrote about it in the report. Found out that I was not going into an infinite recursion as I had thought. My approach was actually right. Why did this work when I was writing it out? – When I was writing the report, I was looking at it from the viewpoint of my professor, who’d be reviewing it. So, I wanted to make everything understandable.

Remember to Refactor

Hypothesis: After passing the test, reminder to refactor -> refactor.

After passing the test, no reminder to refactor -> don’t refactor.


Test: initialize_nullproc_paging remained ugly. I didn’t refactor it. – no reminder.

Don’t Rewrite from Scratch

Hypothesis: Old code that kind of works but is messy, rewrite from scratch -> lose hard-won information (fixes for unintuitive and hard-to-find bugs) stored in the ugly code.

Old code that kind of works but is messy, refactor or at least use as a testing model for your snazzy new function -> keep the hard-won information.

Lesson: Feel like throwing away old code or prose -> refactor or use as a model instead.


Test: Joel Spolsky on rewriting (working but ugly) code from scratch -

The idea that new code is better than old is patently absurd. Old code has been used. It has been tested. Lots of bugs have been found, and they’ve been fixed. There’s nothing wrong with it. It doesn’t acquire bugs just by sitting around on your hard drive. Au contraire, baby! Is software supposed to be like an old Dodge Dart, that rusts just sitting in the garage? Is software like a teddy bear that’s kind of gross if it’s not made out of all new material?

Back to that two page function. Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn’t have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.

Explanation: Would lose bug fixes.

Test: Essay on understanding - 42k words as of December 23, 2017. Feel like just throwing it away and building a new model from scratch using my latest format for unit tests and such. Experimental method! - But no, that would make me forget or have to re-generate hard-won evidence from over the last 8 months. (Roughly 100 hours of work - 155 words in commit 5116ad9 took 22 minutes @ 422 words per hour - where I was exposed to a variety of information such as causality math, biology text, category theory proofs, programming experiments on myself and others, and my Purdue course work - most of which would be hard to reproduce.) You do not want to lose that processed evidence. – Better to use it as a reference model and assimilate its evidence into a new model.

Valuable Use Cases

Scenarios: Make Something Users Want

Hypothesis: Come up with scenarios (not input-output specifications) -> make something users want.

Come up with input-output specifications -> make something that works as per your specification but which nobody may want.


Test: Joel Spolsky - talking about scenarios where a businessman would want to use his website. If he makes it correctly (and if the businessman really does want such a feature), he would have made something users want.

Scenarios. When you’re designing a product, you need to have some real live scenarios in mind for how people are going to use it. Otherwise you end up designing a product that doesn’t correspond to any real-world usage (like the CueCat). Pick your product’s audiences and imagine a fictitious, totally imaginary but totally stereotypical user from each audience who uses the product in a totally typical way. Chapter 9 of my UI design book (available online for free) talks about creating fictional users and scenarios. This is where you put them. The more vivid and realistic the scenario, the better a job you will do designing a product for your real or imagined users, which is why I tend to put in lots of made-up details.

Refactoring: Decoupling the Problem

Case Study: Input for Format String Exploit

Hypothesis: Extract a function with explicit free arguments -> encapsulate its contents, i.e., make it clear that its contents are decoupled from other things.

Hypothesis: Local variables clutter things up (h/t Martin Fowler). Not exactly sure why.

Hypothesis: Without local variables, a function is easy to parse. You know that you put all the inputs together using a final function call.

With local variables, however, a function gets harder to parse because there is no clear, final flourish.

(I suspect this also makes the code harder to invert.)
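A hypothetical illustration of the "final flourish" point (Python; the example and names are invented):

```python
# With local variables, the data flow is buried in intermediate names;
# there is no single final call that assembles the inputs.
def total_price_cluttered(items, tax_rate):
    subtotal_amount = sum(i["price"] for i in items)
    tax = subtotal_amount * tax_rate
    total = subtotal_amount + tax
    return total

# Extracted functions with explicit free arguments: the inputs are put
# together in one final call, so the shape of the computation is visible.
def subtotal(items):
    return sum(i["price"] for i in items)

def with_tax(amount, tax_rate):
    return amount * (1 + tax_rate)

def total_price(items, tax_rate):
    return with_tax(subtotal(items), tax_rate)
```

Each extracted function is also trivially testable on its own, which the cluttered version's locals were not.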


Test:

format_writes = ''.join([p[0] for p in pairs] + [p[1] for p in pairs])

Need a better way to do that. (Can’t find any. operator.add(*zip(*xs)) is too obscure.)

In Haskell, it’s uncurry (++) . unzip.

*Main> let xs = [("yo", "boyz"), ("I", "am"), ("sing", "song")]
*Main> unzip xs
(["yo","I","sing"],["boyz","am","song"])
*Main> uncurry (++) . unzip $ xs
["yo","I","sing","boyz","am","song"]

Test: Need to abstract every concrete value and every free variable in the expression.

For example, in

('%s%s' % (pack('<L', address), pack('<L', address + 2)), '%%%dc%%%d$n%%%dc%%%d$n' % (0x00000 + least_significant_half(value) - 25 - 12, 344, 0x10000 + most_significant_half(value) - least_significant_half(value), 345))

we have the value 0x00000, the offset -25 - 12, the printf index 344, and 0x10000.

Test: I’m naming each part of the program the way I think about it in my mind.

Test: Type-checking errors or run-time errors tell you what dependencies you haven’t handled yet.

  File "input-for-fsa2.py", line 26, in get_format_string_setters
    setter_pairs, ([], initial_state)
NameError: global name 'setter_pairs' is not defined

So I forgot to pass in the argument setter_pairs to get_format_string_setters.

Test: Overall, there are too many variables to pass around. Do I just create a first-class object containing all of them? They are uncorrelated. – it was tough. Still don’t know the best answer.

Test: Assumptions - least significant half is (hexval % (1 << 16)); that %n is how you write to different locations; that 0x10000 + least_significant_half(value) - num_chars_so_far will write the least significant bytes of the value to the given address; etc. (Wow. It’s been a long time since I’ve had the space to reflect on my code.)

Test: How would I test my code? What properties am I testing for? I’m testing that the final format string will write to the 12 bytes starting at RIP and that it won’t overwrite the stack cookie.

Algebra of Programming

Acceptance test: Algebra of programming - maximum segment sum (from Bird, Thinking Functionally with Haskell).

Given the problem statement “get the segment with maximum sum”

when I calculate the program

then I should get

mss :: [Int] -> Int
mss = maximum . map sum . segments
-- This is O(n^3).
= maximum . map sum . concat . map inits . tails
-- map f . concat = concat . map (map f) [fusion law?]
= maximum . concat . map (map sum) . map inits . tails
-- functor law
= maximum . concat . map (map sum . inits) . tails
-- maximum . concat = maximum . map maximum [??; where the input is a non-empty list of non-empty lists]
= maximum . map maximum . map (map sum . inits) . tails
-- functor law [this is O(n^3) because of map (map sum)]
= maximum . map (maximum . map sum . inits) . tails
-- map sum . inits = scanl (+) 0 [scanl law]
= maximum . map (maximum . scanl (+) 0) . tails
-- scanl (@) e = foldr f [e] where f x xs = e:map (x@) xs [NOT SURE]
-- So, maximum . scanl (+) 0 = foldr1 (<>) . foldr f [e] where f x xs = e:map (x@) xs
-- fusion law calculation
-- maximum . scanl (+) 0 = foldr (@) 0
= maximum . map (foldr (@) 0) . tails, where x @ y = 0 `max` (x + y)
-- map (foldr f e) . tails = scanr f e
-- O(n)
= maximum . scanr (@) 0, where x @ y = 0 `max` (x + y)
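The final O(n) program translates directly to one right-to-left pass (a Python sketch; note the seed 0 of the scan means the empty segment counts, so mss of an all-negative list is 0):

```python
# maximum . scanr (@) 0, where x @ y = 0 `max` (x + y)
def mss(xs):
    acc, best = 0, 0
    for x in reversed(xs):
        acc = max(0, x + acc)   # one scanr step
        best = max(best, acc)   # running maximum over the scan
    return best
```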

Acceptance test: Algebra of programming - greedy algorithm.

Given the shortest path problem

when I try to use algebra to get the answer

then I should get Dijkstra’s.

Inverse

Hypothesis: unfoldr f' . foldr f z = id when f' can recover x and xs from the given value, replacing f with (,). Also, f' z must be Nothing, so that the unfold stops at the fold's seed.

Test: Basic - get inverse for foldr (:) [] aka the identity function.

*Main> let g [] = Nothing; g (x:xs) = Just (x,xs)
*Main> unfoldr g [1..5]
[1,2,3,4,5]

Test: Get inverse for foldr (+) 0 $ [1..5]. The answer is 15. We need to recover the original list using that value. Can’t be done.

Explanation: Can’t replace (+) with (,) in 15 because it could be (3 + 12) or (8 + 7) or whatever.

Test: Get inverse for foldr raiseAndAdd 0 $ [3, 8, 4] (which converts the decimal representation of 483 ([3, 8, 4]) into the natural number 483.)

raiseAndAdd n d = 10 * d + n

When it comes to raiseAndAdd n d = v, we need to replace raiseAndAdd with (,) using just v. Can we recover n and d from v? We know that n < 10. And 10 * d is a multiple of 10, so we can get n using v `mod` 10. And d = v `div` 10.

*Main> let g 0 = Nothing; g v = Just (v `mod` 10, v `div` 10)
*Main> unfoldr g 384
[4,8,3]
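The round trip can be checked concretely (a Python sketch; foldr and unfoldr are defined the obvious way, with None playing the role of Nothing):

```python
from functools import reduce

def foldr(f, z, xs):
    return reduce(lambda acc, x: f(x, acc), reversed(xs), z)

def unfoldr(g, seed):
    out = []
    while True:
        step = g(seed)          # None plays the role of Nothing
        if step is None:
            return out
        x, seed = step
        out.append(x)

def raise_and_add(n, d):
    # raiseAndAdd n d = 10 * d + n
    return 10 * d + n

def g(v):
    # recover (n, d) from v; g 0 = Nothing ends the unfold
    return None if v == 0 else (v % 10, v // 10)
```

Here foldr(raise_and_add, 0, [3, 8, 4]) gives 483, and unfoldr(g, 483) recovers [3, 8, 4].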

Test: exponent - No idea.

Test: Sudoku

checkSudokuSolution mx = and $ liftA2 notElem mx (peers mx)
(checkSudokuSolution . solveSudoku $ inputMatrix) == True

We can infer that for every index x of the matrix, notElem (at x mx) (at x (peers mx)) = True.

Now, we need to get an equation for mx (which is the output of solveSudoku inputMatrix).

notElem x xs = foldr ((&&) . (/= x)) True xs

We can tell that x must belong to [1..9] \\ xs.

That just gives us the brute-force combinatorial solution where we generate all possible Sudoku matrices.

No, wait. We get to choose which x gets assigned a value first. We can choose cells that have the least number of possible values (hopefully, just one) and go in increasing order of choices.

The above equation doesn’t actually capture the case where each cell has multiple choices filled in. (Basically, that would be the output of your Sudoku solver when there aren’t enough initial numbers.)

The real equation should consider your choices and your peers’ choices. For example, if you could have 2 or 3 but a peer has just one value 3, then you know your value is 2.

Not sure.

Hypothesis: The property equation for f has both the input and the output -> can get an inverse for f.

Test: (checkSudokuSolution . solveSudoku $ inputMatrix) == True - you don’t have an equation relating the input and output of checkSudokuSolution. – How are you supposed to get its inverse?

When it came to the phone number problem, I had (foo . bar) xs = ys. Then, I could say xs = (foo . bar)' ys.

Here I have only True on the RHS, which may not be enough.

Test: RPN calculator

Couldn’t get a property that the final solution must satisfy.

Hypothesis: Maybe use f (x ++ y ++ [operator]) = apply operator (f x) (f y), where x and y are guaranteed to be in RPN. Solve for f.

One solution is to remove the operator from the end and try different splits for x and y such that they are well-formed and satisfy the equation. How to narrow that down to one solution?

f xs = head . map (\[x, y, operator] -> apply operator (f x) (f y)) . filter (\[x, y, operator] -> isOperator operator && isRPN x && isRPN y) . allSplits $ xs

This will work. However, we want something faster. Can we split such that the x, y, and operator are guaranteed to be in the desired format?
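The brute-force equation above does run (a Python sketch; eval_rpn, is_rpn, and the operator table are invented names):

```python
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def is_rpn(ts):
    # well-formed RPN iff the stack depth never drops below what an
    # operator needs and ends at exactly 1
    depth = 0
    for t in ts:
        if t in OPS:
            if depth < 2:
                return False
            depth -= 1
        else:
            depth += 1
    return depth == 1

def eval_rpn(ts):
    # f (x ++ y ++ [operator]) = apply operator (f x) (f y)
    if len(ts) == 1:
        return int(ts[0])
    *body, op = ts
    for i in range(1, len(body)):
        x, y = body[:i], body[i:]
        if is_rpn(x) and is_rpn(y):
            return OPS[op](eval_rpn(x), eval_rpn(y))
```

Trying every split of the body into two well-formed halves is what makes this slow; the open question above is how to pick the split directly.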

Inverse: Reduce Duplication of Work

Corollary: If inverse-oriented programming works out, then it would reduce a lot of our work.

Right now, we seem to be duplicating work instead of reusing old code.


Test: We can see that toNat = foldr raiseAndAdd 0. Now, given that toDec is the inverse of toNat, why should we search anew for a function that exactly satisfies the equation toNat . toDec = id? – We are disregarding the valuable information we have in terms of toNat’s definition.

Inverse: Algebra - Solve for the Unknown Variables

Hypothesis: f . g . foo . bar $ x = y - you want to solve for foo and bar. But since you have only one equation, you can’t solve them uniquely. You need one more property.

Corollary: Programming becomes like algebra. Holy crap! Algebra of Programming!


Test: Paging - pageReplacement . pageFaultHandling $ addressStream = outputStream - you have two unknowns. – One equation won’t suffice.

Inverse: Implement Laws as Functions

Hypothesis: Implement laws as functions (like Bazerman) -> can understand them faster.


Test: Saw functor algebra as ((+), 0) – really easy to understand. Do one thing for each case of the ADT.

Inverse: Resources

https://courses.cs.ut.ee/2006/algebraofprog/Main/HomePage

Inverse: Interview Questions

Test: Subset sums problem.

Property:

all (\xs -> ((== target) . sum) xs && all (elem target) (nub xs)) $ subsetsWithSum target candidates = True

Fold Fusion

My Attempts

Hypothesis: Fold fusion law:

If we have the following properties:

  1. h is a strict function.
  2. h (f x y) = g x (h y), for all x and y in the appropriate ranges

Put b = h a.

Then: h . foldr f a = foldr g b.

Don’t know why h has to be a strict function.

Hypothesis: g x (h y) == h (f x y) simply says that the outputs of (A) h . foldr f a and (B) foldr g b are the same for any suffix of the list.

Base case: When it’s just the empty list, A returns h a and B returns b, which we set as h a. So, the claim holds.

Induction step: Assume the claim holds for the suffix y, so that h (foldr f a y) == foldr g b y. When foldr moves to the first element x, A returns h (f x (foldr f a y)) and B returns g x (foldr g b y). The latter is g x (h (foldr f a y)) by the induction hypothesis. By the second assumption, it becomes h (f x (foldr f a y)). Hence, proved.

So, the two assumptions are enough to prove the claim for any list.

Question: How to derive g?

Hypothesis: You can simply set g x y = h (f x y).

Fusion Law: Bird

Hypothesis: Fusion law: f . foldr g a = foldr h b where f is a strict function, b = f a, and f (g x y) = h x (f y) for all x and y.

Has to be strict because we need f undefined = undefined.

b = f a is for the empty list case.

When the list is (x:xs),

LHS = f (g x (foldr g a xs))
RHS = h x (foldr h b xs)
Assume the induction hypothesis holds for xs
So, f (foldr g a xs) = foldr h b xs.
Therefore, if h x (f y) = f (g x y) for all x and y, the law will hold.

Corollary: foldr f a . map g = foldr (f . g) a.


Test:

foldr f a . map g
= foldr f a . foldr ((:) . g) []
-- We want the output of the form `foldr h b`.
-- (i) b = foldr f a [] = a
-- (ii) foldr f a is a strict function (?).
-- (iii) h x (f y) = f (g x y) for all x and y.

-- We have handled the cases where the input is undefined or an empty list.
-- Now for (x:xs).
-- Condition (iii) means f (g x xs) = h x (f xs)
-- foldr f a (((:) . g) x xs) = h x (foldr f a xs) for all x and xs.
-- foldr f a (g x : xs) = h x (foldr f a xs) for all x and xs.
-- (f (g x) (foldr f a xs)) = h x (foldr f a xs) for all x and xs.

-- h x y = f (g x) y [Note: This f is different from the f in the equation.]
-- h = f . g

Test: double . sum = foldr ((+) . double) 0

double . sum
-- somehow (I'm guessing by a natural transformation, even though it's Num a, not a)
= sum . map double
= sum . foldr ((:) . double) []
-- fusion law for foldr . map
-- (i) sum is strict
-- (ii) z = sum []
-- (iii) sum (((:) . double) x xs) == g x (sum xs)
-- WRONG => g x xs = double x : xs
-- sum (double x : xs) == g x (sum xs)
-- foldr (+) 0 (double x : xs) == g x (sum xs)
-- double x + foldr (+) 0 xs == g x (sum xs)
-- double x + sum xs == g x (sum xs)
-- double x + ys == g x ys
= foldr ((+) . double) 0
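The derived equation runs directly (assuming double x = 2 * x, which the text doesn't define explicitly):

```haskell
-- double . sum == foldr ((+) . double) 0, with an assumed double x = 2 * x.
double :: Int -> Int
double x = 2 * x

main :: IO ()
main = do
  let xs = [1, 2, 3, 4] :: [Int]
  print (double (sum xs))            -- 20
  print (foldr ((+) . double) 0 xs)  -- 20
```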

Test: length . concat = foldr ((+) . length) 0

concat :: [[a]] -> [a]
length :: [a] -> Int

length . concat
-- HOW?! (the length of a concatenation is the sum of the lengths of the pieces)
= sum . map length
= foldr (+) 0 . map length
= foldr ((+) . length) 0
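The whole chain checks out on a concrete nested list (my own example):

```haskell
main :: IO ()
main = do
  let xss = [[1, 2], [3], [], [4, 5, 6]] :: [[Int]]
  print (length (concat xss))         -- 6
  print (sum (map length xss))        -- 6
  print (foldr ((+) . length) 0 xss)  -- 6
```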

Test: length . cp

length . cp
= length . foldr (liftA2 (:)) [[]]
-- fusion law
-- (i) length is strict
-- (ii) z = length [[]] = 1
-- (iii) length (liftA2 (:) x xs) = g x (length xs)
-- length x * length xs = g x (length xs)
-- length x * ys = g x ys
-- g = (*) . length
= foldr ((*) . length) 1
= foldr (*) 1 . map length
= product . map length
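With cp written as the foldr from the line above (using liftA2 from Control.Applicative), the derivation can be checked end to end:

```haskell
import Control.Applicative (liftA2)

-- Cartesian product as a foldr, as in the text.
cp :: [[a]] -> [[a]]
cp = foldr (liftA2 (:)) [[]]

main :: IO ()
main = do
  let xss = [[1, 2], [3, 4, 5], [6]] :: [[Int]]
  print (length (cp xss))             -- 6
  print (foldr ((*) . length) 1 xss)  -- 6
  print (product (map length xss))    -- 6
```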

Test: maximum . map sum . subseqs

Naively, subseqs produces 2^n subsequences, and map sum over them takes O(2^n * n/2) time, since the average subsequence has length n/2.

maximum . map sum . subseqs
= maximum . map sum . foldr f [[]] where f x xss = xss ++ map (x:) xss
-- fusion law
-- (i) map sum is strict
-- (ii) z = map sum [[]] = [0]
-- (iii) map sum (f x xss) == g x (map sum xss)
-- = map sum (xss ++ map (x:) xss) == g x (map sum xss)
-- Naturality of map!
-- = map sum xss ++ map sum (map (x:) xss) = g x (map sum xss)
-- = map sum xss ++ map (\yss -> sum (x:yss)) xss = g x (map sum xss)
-- = map sum xss ++ map (\yss -> sum ([x] ++ yss)) xss = g x (map sum xss)
-- = map sum xss ++ map (\yss -> sum [x] + sum yss) xss = g x (map sum xss)
-- = map sum xss ++ map ((sum [x] +) . sum) xss = g x (map sum xss)
-- = map sum xss ++ map (sum [x] +) (map sum xss) = g x (map sum xss)
-- = zss ++ map (sum [x] +) zss = g x zss
-- g x xs = xs ++ map (sum [x] +) xs
-- g x xs = xs ++ map (x +) xs
= maximum . foldr (\x xs -> xs ++ map (x +) xs) [0]

This will take O(1) time for the base case; each step does xs ++ map (x +) xs on a list of length 2^n, so T(n+1) = T(n) + O(2^n), which gives T(n) = O(2^n).

Bird to the fucking rescue:

= maximum . foldr (\x xs -> xs ++ map (x +) xs) [0]
-- fusion law
-- (i) maximum is strict
-- (ii) z = maximum [0] = 0
-- (iii) maximum (f x xs) == g x (maximum xs)
-- maximum (xs ++ map (x+) xs) == g x (maximum xs)
-- maximum xs `max` maximum (map (x+) xs) == g x (maximum xs)
-- maximum xs `max` (x + maximum xs) == g x (maximum xs)
-- ys `max` (x + ys) == g x ys
-- g x y = y `max` (x + y)
= foldr g 0

An O(fucking N) algorithm!

(And, as he notes, you can simply drop negative numbers :P - sum . filter (> 0)).
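The whole derivation can be tested end to end on a small list (subseqs as the foldr above; the names spec/fused/shortcut are mine):

```haskell
-- Maximum subsequence sum: naive O(2^n) spec vs. the derived O(n) fold,
-- plus the sum . filter (> 0) shortcut.
subseqs :: [a] -> [[a]]
subseqs = foldr (\x xss -> xss ++ map (x :) xss) [[]]

spec, fused, shortcut :: [Int] -> Int
spec     = maximum . map sum . subseqs
fused    = foldr (\x y -> y `max` (x + y)) 0
shortcut = sum . filter (> 0)

main :: IO ()
main = do
  let xs = [2, -3, 4, -1, 5]
  print (spec xs)      -- 11
  print (fused xs)     -- 11
  print (shortcut xs)  -- 11
```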

Universal Property of Fold

Hypothesis: foldr (@) e (xs ++ ys) = foldr (@) e xs @ foldr (@) e ys

It holds when (@) is associative with identity e, i.e., when (@) and e form a monoid.

Corollary: When you have map f (xs ++ ys), you can simply distribute it: map f xs ++ map f ys.


Test: map sum (xss ++ map (x:) xss) = map sum xss ++ map sum (map (x:) xss) – yeah.
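Checked concretely for (+) with identity 0, together with the map corollary:

```haskell
main :: IO ()
main = do
  let xs = [1, 2, 3] :: [Int]
      ys = [4, 5]
  -- foldr (+) 0 distributes over (++) because (+) is associative with identity 0.
  print (foldr (+) 0 (xs ++ ys))           -- 15
  print (foldr (+) 0 xs + foldr (+) 0 ys)  -- 15
  -- Corollary: map distributes over (++).
  print (map (* 2) (xs ++ ys) == map (* 2) xs ++ map (* 2) ys)  -- True
```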

Natural Transformation

Hypothesis: When a function's type is fully polymorphic, mapping over the elements before applying it gives the same result as mapping after - the free theorem for that type.


Types:

head :: [a] -> a
tail :: [a] -> [a]
concat :: [[a]] -> [a]
reverse :: [a] -> [a]

Test: f . head = head . map f – yeah.

Test: map f . tail = tail . map f – yeah.

Test: map f . concat = concat . map (map f) – yeah.

Test: map f . reverse = reverse . map f – yeah.

Test: concat . map concat = concat . concat – yeah.

Test: filter p . map f = map f . filter (p . f) - don’t know how you came to that.

Test: map (map f) . cp = cp . map (map f) – ok. Understandable.

where cp :: [[a]] -> [[a]].

Test: filter (all p) . cp = cp . map (filter p) – did it by intuition.
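All of these naturality equations can be machine-checked on sample inputs (cp as before; the harness is my own):

```haskell
import Control.Applicative (liftA2)

cp :: [[a]] -> [[a]]
cp = foldr (liftA2 (:)) [[]]

main :: IO ()
main = do
  let f   = (* 2) :: Int -> Int
      p   = even
      xs  = [1, 2, 3, 4] :: [Int]
      xss = [[1, 2], [3, 4]] :: [[Int]]
  print (f (head xs) == head (map f xs))
  print (map f (tail xs) == tail (map f xs))
  print (map f (concat xss) == concat (map (map f) xss))
  print (map f (reverse xs) == reverse (map f xs))
  print (filter p (map f xs) == map f (filter (p . f) xs))
  print (map (map f) (cp xss) == cp (map (map f) xss))
  print (filter (all p) (cp xss) == cp (map (filter p) xss))
```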

Bird’s Approach in Functional Pearls

I was interested in the specific task of taking a clear but inefficient functional program, a program that acted as a specification of the problem in hand, and using equational reasoning to calculate a more efficient one.

– Bird, Preface, Pearls of Functional Algorithm Design

I’d like to know how to derive a program using just the contract properties and not necessarily a working inefficient model.

For example, given (checkSudoku . solveSudoku) grid == True and the definition of checkSudoku, write solveSudoku.

And I know this can be done because we did it for the toDecimal problem using the definition of toNaturalNumber and the property toNaturalNumber . toDecimal = id.
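A minimal sketch of that toDecimal example, assuming toNaturalNumber is the usual base-10 fold (the text doesn't restate its definition here, so this is my reconstruction):

```haskell
-- Assumed spec: toNaturalNumber . toDecimal = id.
toNaturalNumber :: [Int] -> Int
toNaturalNumber = foldl (\n d -> 10 * n + d) 0

-- One toDecimal satisfying the property, peeling digits from the right.
toDecimal :: Int -> [Int]
toDecimal n
  | n < 10    = [n]
  | otherwise = toDecimal (n `div` 10) ++ [n `mod` 10]

main :: IO ()
main = print (toNaturalNumber (toDecimal 2021))  -- 2021
```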

Created: October 20, 2017
Last modified: December 24, 2021
Status: in-progress notes
Tags: notes, one-button
