Program Design

Central Question: Why is Java code usually bloated?

I’m assuming that code in Java is usually longer than in other languages like Haskell or Python. Why is that?

Law of Demeter

I think most Java programs violate the Law of Demeter: use only one dot in a method call. For example, just printing “Hello World” requires that you use two dots:

System.out.println("Hello World");

This seems to be the case with most programs.

Hypothesis: The problem here is that you’re programming to an implementation, not an interface. You’re depending on the methods of a class instead of an interface. Once you have an interface, you’re forced to generalize your code. Otherwise, you use class-specific methods that you cannot reuse.

Too Specific

The key problem is that you end up writing functions and classes that are not general enough. This means you can’t reuse them elsewhere.

Here’s some n00b Java code I wrote four years ago:

/**
 * Read a list of transactionOperations from inputReader.
 */
public static List<TransactionOperation> getTransactionOperationListFromReader(
    BufferedReader inputReader) throws IOException{
    List<TransactionOperation> transactionOperationList =
            new ArrayList<TransactionOperation> ();
    String currLine;
    while ((currLine = inputReader.readLine ()) != null){
        transactionOperationList.add(new TransactionOperation (currLine));
    }
    return transactionOperationList;
}

What’s wrong with it?

Everything. Let’s try to write the type signature for the function. I want to get a list of lines from BufferedReader and then turn that into a list of TransactionOperation. Basically, I need these fictional (and soon unnecessary) functions:

linesFromBufferedReader :: BufferedReader -> [String]
makeTransactionOperation :: [String] -> [TransactionOperation]

Here’s the key point: the two functions have nothing to do with each other, given their interface. Once makeTransactionOperation is given a list of strings, it doesn’t care how they were obtained, whether from a BufferedReader that reads from IO, a file, or from a previous part of the program. All it cares about is the list of strings (technically without newlines).

Similarly, the pseudo-function linesFromBufferedReader doesn’t care about where the list of strings will be used. All it knows is how to get a bunch of lines from a BufferedReader.

Now that we’ve separated these concerns, suddenly, our original function becomes much simpler.

// Java 8 later added exactly this as BufferedReader::lines (returning a Stream<String>).
public static List<String> lines(BufferedReader inputReader) throws IOException {
    String currLine;
    List<String> list = new ArrayList<String>();
    while ((currLine = inputReader.readLine()) != null) {
        list.add(currLine);
    }
    return list;
}

// My old code was in Java 6, so I'm not using the far simpler Stream::map.
public static List<TransactionOperation> getTransactionOperationList(List<String> xs) {
    List<TransactionOperation> operations = new ArrayList<TransactionOperation>();
    for (String s : xs) {
        operations.add(new TransactionOperation(s));
    }
    return operations;
}

In Haskell, this would just be:

map TransactionOperation . lines
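As it happens, Java 8 later made almost the same composition possible: BufferedReader::lines is a real method, and the rest is a map with the constructor. A sketch, with a stub TransactionOperation standing in for the real class:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

public class ReaderDemo {
    // Stub stand-in for the real TransactionOperation.
    static class TransactionOperation {
        final String line;
        TransactionOperation(String line) { this.line = line; }
    }

    // BufferedReader::lines (Java 8) plays the role of the Haskell `lines`;
    // mapping the constructor plays the role of `map TransactionOperation`.
    static List<TransactionOperation> fromReader(BufferedReader r) {
        return r.lines()
                .map(TransactionOperation::new)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        BufferedReader r = new BufferedReader(new StringReader("a\nb\nc"));
        System.out.println(fromReader(r).size()); // 3
    }
}
```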

General Functions are Simpler, yet more Reusable

Most importantly, the code is now far easier to understand. Earlier, you would be put off by the fact that the function dealt in the specific, scary-sounding TransactionOperation. But now, it’s broken into comprehensible pieces. You want to get lines from some input and then you want to turn those lines into this data structure called TransactionOperation.

And the cool part is that you need not know anything about TransactionOperation. All that matters is that it is a function that takes a String and returns a TransactionOperation. And you know how to provide a String, so you’re all set.

The reason the original function was hard to understand and modify was because it was unnecessarily specific. I wrote one big function that took a BufferedReader and returned a List<TransactionOperation>, instead of two small functions that I could compose.

The more specific the function, the more complex it will be. There is freedom in generality. The less you know about your inputs, the less you can do, and therefore the less you can do wrong. Take sort. In Haskell, its type signature is

sort :: Ord a => [a] -> [a]

All you can do with the contents of the input list is compare two elements to get LT, EQ, or GT. That is it. Compare that to an integer sort function:

sortInts :: [Int] -> [Int]

The more general a function, the simpler it is. This is because the space of all possible functions of that type is smaller and you only need a little information to identify the function you want. The number of possible functions of type a -> b is B^A, where A and B are the number of values of type a and b respectively. General functions are those that have small possible input and output types, like sort above.

The counter-intuitive thing here is that sort is simpler than sortInts despite being far more reusable. So, general functions are more usable and reusable. Of course, this is only in one specific sense of “usable”. If you wanted some integer-specific calculation like prime factorization, then a general function may not be of much use.
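A Java sketch of the same constraint: a bounded type parameter plays the role of Ord, so the method below can do nothing with the elements except compare them, and it works unchanged for integers, strings, or anything Comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GenericSort {
    // All this method can do with the elements is compare them,
    // exactly like sort :: Ord a => [a] -> [a].
    static <T extends Comparable<T>> List<T> sorted(List<T> xs) {
        List<T> copy = new ArrayList<>(xs);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(List.of(3, 1, 2)));       // [1, 2, 3]
        System.out.println(sorted(List.of("b", "a", "c"))); // [a, b, c]
    }
}
```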

Why are simpler functions more reusable? Because simpler functions come from a smaller function space. The type signature DayOfWeek -> Bool allows only 2^7 = 128 possible functions. The type signature Int -> Bool, however, allows 2^65536 possible functions even for a mere 16-bit integer (and astronomically more for the usual 64-bit Int). Of course, we humans won’t (and can’t) write functions large enough to cover that full space (functions longer than 100 lines are frowned upon). But you still have more possible functions with the latter type than the former. So, if in the future you need a function of type DayOfWeek -> Bool, there’s a high chance it will be the one you wrote before (since there are only 128 possible functions). However, if you want a function of type Int -> Bool, good luck finding an existing one that fits your need. In short, simpler functions are more reusable because a smaller function space means you have a greater chance of hitting the same function again.

As General As Possible

When you don’t really care about the specific details of the input, it’s a mistake to accept them. This is what they call the principle of least knowledge. The more you know, the larger your function space becomes, and the more information you have to provide to pick out one function in that large space. This means more work to design the function and more work to understand it too.

Lesson: Make your function as general as possible. Don’t accept input information that you don’t need.

This is just another way of saying “make your function as simple as possible”.

The same idea sustains the maxim “design to an interface, not an implementation”. When you design to an implementation, you get too much information, and when that information changes, you will have to change too. More to the point, your function will get more complicated, less reusable, and less readable.

This is what Refactoring is all about

I think I’ve finally hit on what it means to write better essays. This is what people mean by “rewriting” an essay or “refactoring” code - make it more simple, general, and reusable.

Other Notes from Steve Yegge’s Scheming is Believing

(Notes on his essay: “Scheming is Believing”)

The hard part is knowing when to stop. If you’re struggling over whether to call a method getCustomersWithRedHairAndBlueNoses or getCustomersByHairAndNoseType(HairType.RED, NoseColor.BLUE), then you’ve officially entered barkingUpWrongTreeTerritory. Over-zealous typing is a dead giveaway that someone junior is doing the design.

What is the type signature of those two functions?

getCustomersWithRedHairAndBlueNoses :: [Customer] -> [Customer]

getCustomersByHairAndNoseType :: HairType -> NoseColor -> [Customer] -> [Customer]

Well, you only need HairType and NoseColor to tell whether a Customer is to be picked or not. The list (or database or whatever) is irrelevant. So, I would split it into two decoupled functions:

selectByHairNose :: HairType -> NoseColor -> Customer -> Bool
filterCustomer :: (Customer -> Bool) -> [Customer] -> [Customer]

In fact, filterCustomer can be generalized even further. Note that it doesn’t care whether the input type is a Customer or something else.

filter :: (a -> Bool) -> [a] -> [a]

Now we can reuse a well-understood existing function and just focus on the selecting function. So, the problem wasn’t the over-zealous typing, it was the failure to generalize.
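The same split carries over to Java 8: Predicate plays the role of Customer -> Bool and Stream::filter plays the role of the general filter. A sketch, with hypothetical Customer, HairType, and NoseColor types:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class FilterDemo {
    enum HairType { RED, BLACK }
    enum NoseColor { BLUE, PINK }

    static class Customer {
        final HairType hair;
        final NoseColor nose;
        Customer(HairType hair, NoseColor nose) { this.hair = hair; this.nose = nose; }
    }

    // The selector only needs one Customer; it knows nothing about lists.
    static Predicate<Customer> selectByHairNose(HairType h, NoseColor n) {
        return c -> c.hair == h && c.nose == n;
    }

    public static void main(String[] args) {
        List<Customer> customers = List.of(
            new Customer(HairType.RED, NoseColor.BLUE),
            new Customer(HairType.BLACK, NoseColor.PINK));
        // Stream::filter is the well-understood, fully general part.
        List<Customer> picked = customers.stream()
            .filter(selectByHairNose(HairType.RED, NoseColor.BLUE))
            .collect(Collectors.toList());
        System.out.println(picked.size()); // 1
    }
}
```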

When to Create a New Type?

At the other extreme, not putting in enough static type information is certain to create problems for you, as you muddle through your nests of untyped lists and hashes, trying to discern the overall structure.

It all boils down to one question: when should you create a new type?

You should create a new type when an existing type is too big and will lead to unnecessary checks. For example, DayOfWeek is preferable to Int if you use days a lot, because you would otherwise need to keep checking if the Int is a valid day.
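A Java sketch of that tradeoff (java.time has shipped a real DayOfWeek enum since Java 8; the one below is a stand-in): the enum makes invalid days unrepresentable, so the validity check simply disappears.

```java
public class DayDemo {
    enum DayOfWeek { MON, TUE, WED, THU, FRI, SAT, SUN }

    // No "is 0 <= d <= 6?" check needed: every DayOfWeek value is valid.
    static boolean isWeekend(DayOfWeek d) {
        return d == DayOfWeek.SAT || d == DayOfWeek.SUN;
    }

    // The int version has to defend against all the other ints.
    static boolean isWeekendInt(int d) {
        if (d < 0 || d > 6) throw new IllegalArgumentException("not a day: " + d);
        return d == 5 || d == 6;
    }

    public static void main(String[] args) {
        System.out.println(isWeekend(DayOfWeek.SAT)); // true
        System.out.println(isWeekendInt(2));          // false
    }
}
```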

You should also create a new type when you have a bunch of values that you will use together, like name, billing address, and credit card details for a customer.

Aim: Minimum Information at all Times

Our aim is to have minimum information along every step of our program. Otherwise, we will be making our code unnecessarily complex.

That is what pure functional languages do! They give you nothing more than what you need.

Say you have three steps in your program: (1) A to B, (2) B to C, and (3) C to D. If the third step requires only C and doesn’t depend on A or B, then it’s a mistake to give it information about A and B.

For example, in Martin Fowler’s movie renting example, the steps are calculating movie charges, calculating frequent rental points, and printing those two as ASCII or HTML. The last step doesn’t need to know the details of the database, just the dollar charges and total rental points.
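As a sketch (the method name and parameters here are hypothetical, not Fowler’s actual code): the printer’s signature should admit only the two values it needs, so the database can never leak into it.

```java
public class StatementDemo {
    // The signature is the whole contract: no database, no Customer,
    // no Movie -- just the dollar charge and the rental points.
    static String htmlStatement(double totalCharge, int rentalPoints) {
        return "<p>Amount owed: " + totalCharge + "</p>"
             + "<p>Frequent renter points: " + rentalPoints + "</p>";
    }

    public static void main(String[] args) {
        System.out.println(htmlStatement(9.5, 3));
    }
}
```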

How to get the Causal Structure?

One clean way to ensure that some step doesn’t get unnecessary information is to enclose it in its own function. That way it will only have access to the values passed to it as arguments. Note that having global state hinders this by giving free access to all (global) variables.

TODO: What is the tradeoff between keeping a function inline and extracting it into its own function? Maybe you pollute the global function namespace.

Maybe one way to build a function is to keep all steps in their own functions, and only send extra information when absolutely needed.

TODO: How do you know the steps of a function in the first place? Why isn’t it a blank mystery?

Hypothesis: Maybe you know the direct causes of any given variable.

For example, you know that you get the total movie rental charges by adding the rental charges for each movie. Similarly, you know that you get HTML output by using the dollar amount of the rental charges and the number of frequent rental points.

This way, maybe you can divide the function into smaller, manageable steps, even if those steps aren’t fully clear to you right away. Then, you can use test cases to flesh out each of those steps, perhaps dividing them into further steps too.

In other words, you somehow know the causal structure of the domain, but not necessarily the causal parameters (i.e., how to calculate frequent rental points for a movie).

How did you get to know the causal structure of the domain? I would wager that it comes about due to locality of causality. Things that are far away in time or space are not direct causes. But how do you know that for this abstract rental calculation?

Maybe you’ve seen similar causes elsewhere. For example, I’ve seen lots of places where you get an HTML output from some string inputs, like Django’s templates.

Familiar Causes

In short, the question is: how do you know that strings can be direct causes of HTML output, but not database values?

Hmmm… I’ve seen counter-examples! Assume that database values were necessary for getting HTML output. Then, you would never be able to get HTML output without database values. But I’ve seen functions where you got HTML output from just a string. So, that falsifies the hypothesis that you need database values for HTML.

(Note: I’m assuming that we have simple causal models, where A, B, C cause X means that X happens when all of them have a particular value, and doesn’t happen anytime else.)

In fact, more generally, I’ve seen that you can get every possible type of HTML output given some string input. So, you don’t need database input at all, once you have a string.

Hypothesis: We build our causal structure by looking at all possible known causes of an output variable.

We know we can get HTML output from a string. Similarly, we know that we can get a dollar amount only from other dollar amounts (unless you conjure dollar amounts from thin air in a function). So, there’s no point in trying to get dollar amounts from someone’s name or address. You need some field that has dollar amounts, like their account balance or outstanding rental charges for each movie.

So, maybe we know the direct causes for each part of the output. We know that you can’t magically get a string from an integer (apart from simply converting it to a string), especially if you don’t have any string constants inside the function. So, you would need to call function(s) that convert integer to string somehow.

Similarly, for a list, we know that we humans use only a few limited ways of making lists. Either we use a map to transform one list to another, a fold (reduce), or do some funky recursion. That limits our uncertainty about the function. We know people probably won’t write “weird” functions like f 1 = [2, 3]; f 3 = [1, 1, 1]; ... that don’t follow one of the above patterns.

In fact, map and fold boil down to recursion anyway. So, we essentially know that we only get lists using some recursion. Meaning that f (x:xs) depends on f ss, where ss is any subset of (x:xs) and, crucially, each smaller call to f works the same way.

To go out on a limb, there are only 2^k - 1 smaller subsets for a list of k elements. So, f list-of-k depends on g (subset of list-of-k). We need a way to get the particular subset for each list of k, which will take roughly k bits of information.

(Not fully clear about this.)


Circling back, maybe the way to discover the causal structure is to take each part of the output, and ask for the minimum information you would need to produce it. So, for the rental dollar figure, ask if you could do without the name and address of the customer. Then, if you find that you can’t do it without some variable (using your current knowledge), add it to the causal structure.


Prediction: If known causes lead to a clear causal structure, what causes a confusing and unclear causal structure? Is it that you have no clue by which someone could have got the output from the input?

If so, then increasing skill should be caused by increasing knowledge of real-world causal mechanisms. As you know more causes, more ways of making things happen, you should be less confused by things that happen.

Corollaries and Questions

If general functions are also easier to code, then writing general, reusable code should be easier than writing specific, one-off code. This feels counter-intuitive. How can “better” code be easier to write?

Maybe it’s not really better, or maybe it’s not really easier to write. Perhaps writing general code takes more work because you have to abstract every single part.

Example: no work is needed to extend sort to lists of strings or lists of lists; they are automatically sortable as long as their elements satisfy Ord.

Here’s an example of abstraction getting in the way. What if calculating movie charges now depended on the frequent rental points? So, you would need to change the structure. Therefore, if you assumed that you wouldn’t need a value at some point but found out later that you did, then you would need to do a lot of work. This might be a type of premature optimization.


Corollary of minimum information: Everything in a function is there because it needs to know all the details of the function. If it needed less information, it would be in a different function.

example: in calculating rental charges, print HTML doesn’t depend on the database once given the rental amount. So, it is in its own function.

How to use this?

I think this makes functions very small, because very few things depend on all the information in a function, due to locality of causality.

When Polymorphism Fails by Steve Yegge

How would I handle that problem in Haskell?

Take a simple case. You have two types of monsters: Orcs and Elves. You initially want one type of action from them: printRaceName (“Orc” and “Elf”, respectively). Later, you want to add one more action: doesOpinionatedElfLikeYou (“no” and “yes”, respectively).

Same Interface, Different Classes

You want two properties: the ability to have a collection of many objects satisfying the same interface (like a list of monsters - a mixture of Orcs and Elves), and the ability to add one more action to their common interface (like doesOpinionatedElfLikeYou).

In OOP (specifically Java), subclassing lets you do the first. You can just have a list of type Monster and it will be able to hold both Orcs and Elves. However, you lose information about their specific race and you cannot safely cast them back. Similarly, subclassing allows you to add a new method to the superclass so that all the subclasses have to implement it.
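A minimal Java sketch of those two properties: the mixed collection comes for free, but adding a second method to Monster would force an edit in every implementing class.

```java
import java.util.List;

public class MonsterDemo {
    interface Monster { String printRaceName(); }
    static class Orc implements Monster { public String printRaceName() { return "Orc"; } }
    static class Elf implements Monster { public String printRaceName() { return "Elf"; } }

    public static void main(String[] args) {
        // The heterogeneous collection is easy...
        List<Monster> horde = List.of(new Orc(), new Elf());
        for (Monster m : horde) System.out.println(m.printRaceName());
        // ...but adding doesOpinionatedElfLikeYou() to Monster would
        // force changes to Orc, Elf, and every other implementation.
    }
}
```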

In Haskell, you can collect objects satisfying the same interface only by making them part of the same algebraic data type. Otherwise, you would need dynamic dispatch to decide which function to use, which Haskell doesn’t allow. Adding a new action, however, is easy enough. You just make that data type an instance of the new type class (or you extend the old type class).

Add a New Action without Modifying Classes

In Haskell, you can create a new type class and instantiate it for existing data types. In Java, you can’t. You have to go and modify each of the subclasses.

Add a New Class without Modifying Actions

In Java, you can add a new subclass and implement all of the superclass methods without touching any of the other classes. However, in Haskell, if you add a new type to the original data type, you will have to change the pattern-matching code in every single function that uses that type.

Tradeoffs: New Actions vs New Data Types

I guess Haskell takes the approach that your data types are fixed, but your actions keep changing. You may have the same game tree type, but you can evaluate it, pretty print it, transform it, and so on.

Java takes the approach that your actions are fixed but your data types keep changing. You can add new types of Monsters later in the game without touching the existing Monsters.

Stand-Alone Table

Why not have a table of Monster vs Action and simply add entries to it as necessary? If you want to add a new Monster (Troll), then add it and implement the existing Actions. If you want to add a new Action (attack), then add it and implement it for the existing Monsters.

Though it looks tempting, the problem here is that you’re not guaranteed about the behaviour of the collection anymore. Earlier, you could say for sure that all the Monsters would implement all the same actions. What if someone just implemented only a few of the actions for Troll? Maybe we can have type checks for that.
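A sketch of such a table in Java, with a hand-rolled completeness check standing in for the guarantee the type system used to give us (all names here are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

public class ActionTable {
    // A (monster, action) -> behaviour table. Nothing forces it to be
    // complete, so completeness becomes a runtime check.
    static final Map<String, Map<String, Supplier<String>>> table = new HashMap<>();

    static void put(String monster, String action, Supplier<String> behaviour) {
        table.computeIfAbsent(monster, k -> new HashMap<>()).put(action, behaviour);
    }

    static boolean isComplete(Set<String> monsters, Set<String> actions) {
        for (String m : monsters)
            for (String a : actions)
                if (!table.getOrDefault(m, Map.of()).containsKey(a)) return false;
        return true;
    }

    public static void main(String[] args) {
        put("Orc", "printRaceName", () -> "Orc");
        put("Elf", "printRaceName", () -> "Elf");
        put("Elf", "doesOpinionatedElfLikeYou", () -> "yes");
        // Orc never implemented doesOpinionatedElfLikeYou:
        System.out.println(isComplete(Set.of("Orc", "Elf"),
                Set.of("printRaceName", "doesOpinionatedElfLikeYou"))); // false
    }
}
```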

Code Smell: Concrete Types instead of Abstract Types

If most of your functions have concrete types, you’re probably doing something wrong. Surely your code isn’t so specific that you need all the details of the data type for every function?

When all of your parameters are concrete, your code can’t handle new data types.

Top-down vs Bottom-up Programming

I think the difference between the two is that, in the latter, your functions end up being more general. Instead of just trying to implement each branch of the top-down tree in isolation, you find common patterns and then write abstract functions to handle them.

Type Abstraction vs Unbound Variables

For type abstraction, all you need is the Hindley-Milner algorithm. It will tell you that, in show x, the x can have the general type x :: Show a => a instead of something concrete like x :: Int.

But for function abstraction, you need to know unbound variables. H-M will work for (sum (+ 4 5)) :: Int -> a but not for (sum (+ 1 x)). That value still has the type Int -> a, but x is unbound, so it’s actually Int -> Int -> a.

Hypothesis: Non-functional programmers only need to check for unbound variables when refactoring. Functional programmers have to check for type abstraction also, because they’ve got better types.

How far can you get with function extraction alone? Well, you can make provably-correct extractions.

Why not just use lambda calculus, then? Hmmm… our aim is to minimize our uncertainty about the code, not to have one-parameter functions.

What exactly is your aim? Here’s a declarative version: (a) for each term t, variables in scope = variables unbound in t; (b) each term should have the most general type possible; show x ++ "yo" should be Show a => a -> String, not Int -> String.

(a) reduces uncertainty and makes your code more reusable. (b) reduces uncertainty (?) and makes code more reusable too. The reason (a) works is because it turns P(a, b)P(c|a,b) into P(a, b)P(c|a), about which we have lower uncertainty.

Why does (b) reduce uncertainty?

What about ease of change? I don’t yet have a mathematical way to measure “ease of change”. Is it just the “size” of interfaces? Hmmm… it’s the number of other modules you have to change when you change one module.

Uncertainty about a Function

Hypothesis: A function’s type doesn’t fully capture your uncertainty about it.

What about the functions that it depends on?

For example, in sum = foldr (+) 0, sum depends on foldr, (+), and 0, but the external type is just sum :: Num a => [a] -> a.

We care about our uncertainty about the function, its reusability, and its modifiability.

Now, obviously, a function f is more likely to change if it depends on another function g than if it doesn’t. In the former case, when g changes, then we have to change f, which increases the cost of changing g. Whereas in the latter case, when g changes, we don’t have to change f at all.

Similarly, it’s harder to reuse a function when it depends on other functions, because we have to make sure those are available too.

How uncertain are we about the function, though?

Well, when do you care about the function’s details? When you want to use it directly, like in (a) sum . map getScore, but not in (b) when you just take it as a parameter. In the latter case, all you need to know is its type.

For example, foldr :: (a -> b -> b) -> b -> [a] -> b. It doesn’t care about the specifics of the folding function. All it wants is something that it can use to fold the list.
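A Java sketch of the same point: the fold below knows nothing about its function parameter beyond its BiFunction type, so our uncertainty about that parameter reduces to its type.

```java
import java.util.List;
import java.util.function.BiFunction;

public class FoldDemo {
    // A right fold: foldr f z [x1, ..., xn] = f(x1, f(x2, ... f(xn, z))).
    // All it knows about f is the type (A, B) -> B.
    static <A, B> B foldr(BiFunction<A, B, B> f, B z, List<A> xs) {
        B acc = z;
        for (int i = xs.size() - 1; i >= 0; i--)
            acc = f.apply(xs.get(i), acc);
        return acc;
    }

    public static void main(String[] args) {
        // sum = foldr (+) 0
        System.out.println(foldr((a, b) -> a + b, 0, List.of(1, 2, 3))); // 6
    }
}
```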

So, your uncertainty about the function as a value depends only on its most general type (like a -> b -> b above).

To rephrase, your uncertainty about the behaviour of a function g that takes a function f as a parameter is independent of the implementation of the function f given its type.

So, the “behaviour” of a value is independent of its definition given its type. Look at foldr which needs to know only a -> b -> b about its parameter function.


However, your uncertainty about the function as something to use in a definition depends on all the other functions it uses.

This could be why I’ve had so much trouble translating textbook passages. They may depend on very little by way of parameters (example?) but they depend on a lot by way of prerequisites.

I’m not sure about this. But there’s more to our uncertainty about a function than just its type.

Request For Experiments

Run experiments on small programs and passages to test my above ideas. See if you can maximally refactor something by trying to minimize interfaces (via the H-M algorithm and function abstraction). Also, check if learning resources really are like unrefactored code.

Resources

Interfaces and multiple inheritance - http://blog.cleancoder.com/uncle-bob/2015/01/08/InterfaceConsideredHarmful.html

Books on Design

Some books that sound interesting:

Created: July 13, 2016
Last modified: September 5, 2022
Status: in-progress notes
Tags: notes, design
