Merge Pipelines using Monoids
Note: These are my rough notes from Gabriel Gonzalez’s equational reasoning talk on How to Prove Large Software Projects Correct.
A Simple Task Pipeline
Say you have a task pipeline, where each task (aka job) goes through five stages:
generate -> package -> upload -> deploy -> log
Let’s say generate generates a bunch of files, package puts them together into one package, upload uploads that package to a server, deploy runs that package on the server, and log writes some log information.

Now, you want to get some output from each stage, like “Packaging job #4” from job #4 when it goes through the packaging stage. So, it would make sense to have a function package of type Int -> IO (). However, the function for each stage might depend on the outputs from previous stages, not just the overall input. So then, we could have something like generate :: Int -> IO a and package :: a -> IO b, and so on.
import Data.Monoid ((<>))
import Control.Monad

-- Combine two IO actions by running both, in order, and
-- mappend-ing their results. (This is an orphan instance; see Notes.)
instance Monoid b => Monoid (IO b) where
    mempty  = return mempty
    mappend = liftM2 mappend
generate1 :: Int -> IO Int
generate1 n = do
    putStrLn ("Generating job #" <> show n)
    return n

package1 :: Int -> IO Int
package1 n = do
    putStrLn ("Packaging job #" <> show n)
    return n

upload1 :: Int -> IO Int
upload1 n = do
    putStrLn ("Uploading job #" <> show n)
    return n

deploy1 :: Int -> IO Int
deploy1 n = do
    putStrLn ("Deploying job #" <> show n)
    return n

log1 :: Int -> IO Int
log1 n = do
    putStrLn ("Logging job #" <> show n)
    return n
job1 :: Int -> IO Int
job1 = generate1 >=> package1 >=> upload1 >=> deploy1 >=> log1
main = job1 1
To run the whole job (as in job1), you thread each stage’s output into the next stage’s input using the fish operator >=>.
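For reference, >=> comes from Control.Monad and is just left-to-right Kleisli composition; its definition is:

(>=>) :: Monad m => (a -> m b) -> (b -> m c) -> (a -> m c)
f >=> g = \x -> f x >>= g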
Here’s the output:
Generating job #1
Packaging job #1
Uploading job #1
Deploying job #1
Logging job #1
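Note that the Monoid (IO b) instance above is already doing real work here: mappend-ing two IO actions runs both of them, in order, and mappends their results. For example:

hello :: IO ()
hello = putStrLn "hello" <> putStrLn "world"
-- prints "hello" then "world"; the combined result is () <> () = ()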
How to Run Two Jobs Together?
Now, if you had another batch job, you’d write its version of the pipeline, with different functions for generate, package, etc.:
job2 :: Int -> IO Int
job2 = generate2 >=> package2 >=> upload2 >=> deploy2 >=> log2
How to run them together? Simple, just do them one after the other in main:
main = do job1 1; job2 2
You will get the output:
Generating job #1
Packaging job #1
Uploading job #1
Deploying job #1
Logging job #1
Generating job #2
Packaging job #2
Uploading job #2
Deploying job #2
Logging job #2
They are done separately and the messages they print are collected separately. But you want the similar stages of the pipeline to be done together. You want all the generate stages of the jobs to be done together, and then the package stages, and so on. How to do that?
One way is to have a messy Job data structure which stores the functions representing each stage of this job’s pipeline (generate2, etc.) along with this job’s input data.

So, you will get n Job objects for all the different jobs. You will now have to map over the objects, run all of their different generate stages, and store the outputs. Then, you have to map over them again, run their different package stages with the stored outputs as inputs, and store those outputs, and so on through all the stages. Messy.

You have to do quite a bit of book-keeping to thread each job’s information to its next stage. This is nuts. You already specified the pipeline perfectly well once. All you want to do extra is group different pipelines so that they move through the stages together. But you have to repeat the information all over again, as the sketch below illustrates.
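For concreteness, here is a hypothetical sketch of that bookkeeping-heavy approach (the Job record, its field names, and runStaged are mine, not from the talk):

import Control.Monad (zipWithM)

-- Hypothetical: one record per job, holding every stage plus the input.
data Job = Job
    { jobGenerate :: Int -> IO Int
    , jobPackage  :: Int -> IO Int
    , jobUpload   :: Int -> IO Int
    , jobDeploy   :: Int -> IO Int
    , jobLog      :: Int -> IO Int
    , jobInput    :: Int
    }

-- Manually thread each stage's outputs to the next stage's inputs.
runStaged :: [Job] -> IO ()
runStaged jobs = do
    as <- mapM (\j -> jobGenerate j (jobInput j)) jobs
    bs <- zipWithM jobPackage jobs as
    cs <- zipWithM jobUpload  jobs bs
    ds <- zipWithM jobDeploy  jobs cs
    _  <- zipWithM jobLog     jobs ds
    return ()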
Merging Two Jobs
What if your pipeline knew how to add itself to other pipelines?
Enter the Monoid.
Use the same functions for each job’s pipeline (generate2, etc.) but put them together in a different way.
job2 :: Int -> IO (IO (IO (IO (IO ()))))
job2 a = do
    b <- generate2 a
    return (do c <- package2 b
               return (do d <- upload2 c
                          return (do e <- deploy2 d
                                     return (log2 e))))

(For this to be a monoid all the way down, the final stage should return unit, i.e. log2 :: Int -> IO (), so that the innermost type is the monoid ().)
Now all you need is a deployment function that takes that nested IO subroutine and turns it into a flat IO subroutine.
deploy' :: IO (IO (IO (IO (IO ())))) -> IO ()
deploy' = join . join . join . join
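Each join collapses exactly one layer of nesting, by running the outer action and then the inner action it returns:

join :: Monad m => m (m a) -> m a

So four joins flatten the five nested IO layers down to a single IO ().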
Finally, rewrite job1 and job3 in the same nested style, add all the job pipelines together to create a unified job pipeline, and deploy it.

job = job1 1 <> job2 2 <> job3 3
main = deploy' job
Now all the jobs run their stages together:
Generating job #1
Generating job #2
Generating job #3
Packaging job #1
Packaging job #2
Packaging job #3
Uploading job #1
Uploading job #2
Uploading job #3
Deploying job #1
Deploying job #2
Deploying job #3
Logging job #1
Logging job #2
Logging job #3
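Putting it all together, here is a minimal self-contained sketch of the merged pipeline (hypothetical names; it uses a single generic stage helper for all three jobs, and it assumes a recent base where IO’s Semigroup/Monoid instances are built in, so the orphan instance is not needed):

import Control.Monad (join, void)
import Data.Monoid ((<>))

-- A generic stage: print a message and pass the job number along.
stage :: String -> Int -> IO Int
stage name n = do
    putStrLn (name <> " job #" <> show n)
    return n

-- One job in the nested style; the final log stage returns (),
-- so the innermost type is the monoid ().
job :: Int -> IO (IO (IO (IO (IO ()))))
job a = do
    b <- stage "Generating" a
    return (do c <- stage "Packaging" b
               return (do d <- stage "Uploading" c
                          return (do e <- stage "Deploying" d
                                     return (void (stage "Logging" e)))))

deploy' :: IO (IO (IO (IO (IO ())))) -> IO ()
deploy' = join . join . join . join

main :: IO ()
main = deploy' (job 1 <> job 2 <> job 3)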
Monoid Proofs Scale
Why was that? How come we didn’t have to struggle to merge all their respective stages?
Well, that funky nested type IO (IO (IO (IO (IO ())))) is a monoid. Why? Because its inner funky nested type IO (IO (IO (IO ()))) is a monoid, and so on all the way down to the unit type ().

And we had made the IO monad a monoid by getting the results of the two subroutines and then calling mappend on those two.
So, the way execution will proceed is by computing the first stage of the pipeline (generate1, etc.) for all the jobs. That will give us the inner subroutines. Then, by way of doing mappend on all those subroutines, we will compute the next stage of the pipeline (package1, etc.), and so on till we reach the end of the pipeline for all the jobs.
Thus, by making our job pipeline a Monoid, we were able to merge different job pipelines so that they went through each stage together.
As Gabriel says, you get a staged deploy system in two lines of code (just the deploy' function and the job concatenation) without doing any other work. Plus, you can prove the correctness of interleaving the jobs by looking at their type IO (IO (IO (IO (IO ())))), which is a monoid.
The key insight here is that IO a is a monoid whenever a is, such that mappend on two IO a values will perform their two computations together (respecting the order in which they were mappend-ed) and will then handle their nested values together. So, naturally, mappend on two IO (IO a) values will do the outer computations first, in order, and then the inner computations.
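To see this concretely, here is a small sketch (relying on the same IO monoid instance):

outer :: String -> IO (IO ())
outer s = do
    putStrLn ("outer " <> s)
    return (putStrLn ("inner " <> s))

demo :: IO ()
demo = join (outer "A" <> outer "B")
-- prints: outer A, outer B, inner A, inner B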
All this is possible because of the automatic typeclass instances such as:
instance Monoid a => Monoid (Maybe a)
instance Monoid b => Monoid (a -> b)
instance (Monoid a, Monoid b) => Monoid (a, b)
This is what lets you scale composition.
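These instances compose, so a type assembled out of monoid pieces is automatically a monoid itself, with no extra code. A small example (using Sum from Data.Monoid):

import Data.Monoid (Sum(..))

combined :: Maybe (String, Sum Int)
combined = Just ("a", Sum 1) <> Nothing <> Just ("b", Sum 2)
-- Just ("ab", Sum {getSum = 3})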
(Note that you don’t have this neat property for monads. Monad m and Monad m' do not imply that the composed type m' (m a) is a monad. This is why we need the whole idea of monad transformers.)
Monoids let you Compose Complex Types
Basically, the main advantage of monoids shows up when you have complex data structures (or even functions and other mixed-up stuff). Then you don’t have to worry about how they go together or what operators they use.
For a function monoid, it’s simple: you can just get the outputs and append them yourself. Big deal. For a unit-test monoid, instead of mappend-ing fifteen tests, you can just put them in a list and do a fold using the correct function.

But what if you have a tree of tests (five tests that go together here, seven there, etc.)? How will you fold them? You have to define the tree, make it Foldable, and then use foldMap and such to collapse it into one test. With a monoid, though, your job is simplified. And the test runner only has to know how to run one test, that’s all. It doesn’t have to know anything more about the tree structure or anything.
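For example, a hypothetical sketch of that tree of tests, collapsed with foldMap into a single IO () test (relying on the IO monoid):

data Tree a = Leaf a | Node [Tree a]

instance Foldable Tree where
    foldMap f (Leaf a)  = f a
    foldMap f (Node ts) = foldMap (foldMap f) ts

-- The runner only ever sees one test: the whole tree, mappend-ed together.
runAll :: Tree (IO ()) -> IO ()
runAll = foldMap id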
Basically, instead of dealing with complex structures by hand (yes, you can do it, but why?!), you just deal with one simple value at each point: one monoid. If you get one more, you mappend that with this sucker and end up with… still one monoid. The complexity of your code thus never gets out of hand.
Sometimes it can be hard to get at the “inside” of a monad to extract the value and then do stuff with it. For example, say you have three Maybe (String, String) values and want to concatenate the strings (if any). Normally, you’d have to use catMaybes :: [Maybe a] -> [a] to get the pairs that exist, then split the pairs and concatenate the results.
import Data.Maybe (catMaybes)

foo = [Just ("foo", "boyz"), Nothing, Just ("ounht", "boeut")]
bar = catMaybes foo
result = (concat (map fst bar), concat (map snd bar))
-- ("fooounht","boyzboeut")
But if you realize that all you want to do with the inner values is mappend them (like any monoid), then you can see that Maybe m is a monoid when m is a monoid. Now your answer simply becomes:
mconcat foo
-- Just ("fooounht","boyzboeut")
The same goes for functions that have a common input and a monoid output. No need to extract their values and do complicated stuff, because you know that all you want to do is mappend their results. So, just treat the function like a monoid.
greeting s = "Hello, Mr. " ++ s ++ ", welcome to our store.\n"
offer s = "Here are some great offers we have for you, Mr. " ++ s ++ ".\n"
sendoff s = "It was a pleasure meeting you, Mr. " ++ s
putStr $ greeting <> offer <> sendoff $ "Yoboyz"
-- Hello, Mr. Yoboyz, welcome to our store.
-- Here are some great offers we have for you, Mr. Yoboyz.
-- It was a pleasure meeting you, Mr. Yoboyz
Notes
Aside: instance Monoid b => Monoid (IO b) doesn’t seem to be part of Haskell by default, and I don’t know if it’s good style to add an instance here (I believe it’s called an orphan instance). But I think it is fine for the current purpose. (If I understand correctly, newer versions of base do ship Monoid and Semigroup instances for IO, in which case you should drop the orphan instance above.)