Monday, October 7, 2013

5.4.3: Explicit Stacks

So far the strategy has been to identify the state that the call stack is managing for us and to build a loop around that state to eliminate the function calls. But sometimes the state is complex enough that we need to bring in the heavy machinery.


func fib1(n int) int {
    if n < 2 {
        return n
    }
    return fib1(n-2) + fib1(n-1)
}

The Fibonacci function is tricky when it comes to tail-call elimination. First we have to account for the final computation in the function, the adding of the two values. But we also have to account for the fact that the function calls itself recursively twice every time, with two different values. So this is not a linear recursive calculation but an execution tree, and there is some branching logic we have to account for if we wish to eliminate the recursion.

A sample call of fib(6) looks like this:


fib(6)
    fib(4)
        fib(2)
            fib(0)
            fib(1)
        fib(3)
            fib(1)
            fib(2)
                fib(0)
                fib(1)
    fib(5)
        fib(3)
            fib(1)
            fib(2)
                fib(0)
                fib(1)
        fib(4)
            fib(2)
                fib(0)
                fib(1)
            fib(3)
                fib(1)
                fib(2)
                    fib(0)
                    fib(1)

Flattening this to something that can be run in a for loop will require us to do branch management along with stack management.


func fib2(n int) int {
    for {
        if n < 2 {
            return n
        }
        s1 := fib2(n - 2)
        s2 := fib2(n - 1)
        return s1 + s2
    }
}

First we make some cosmetic changes to frame the problem a bit. The for loop is what will replace the recursion. And breaking up the calculation points to a state consisting of three variables: n, s1 and s2. We will use a stack to track these states along with the branch control.


type state3 struct {
    BRANCH, s1, s2, n int
}

func fib3(n int) int {
    var s1, s2, retval int
    BRANCH := 0
    var STACK list.List
    for {
        if n < 2 {
            retval = n
        } else {
            if BRANCH == 0 {
                STACK.PushBack(&state3{BRANCH, s1, s2, n})
                n -= 2
                BRANCH = 0
                continue
            } else if BRANCH == 1 {
                s1 = retval
                STACK.PushBack(&state3{BRANCH, s1, s2, n})
                n -= 1
                BRANCH = 0
                continue
            } else if BRANCH == 2 {
                s2 = retval
                retval = s1 + s2
            }
        }
        if STACK.Len() == 0 {
            return retval
        }
        s := STACK.Back().Value.(*state3)
        STACK.Remove(STACK.Back())
        BRANCH, s1, s2, n = s.BRANCH, s.s1, s.s2, s.n
        BRANCH++
    }
}

The key part here is the value of BRANCH: 0 corresponds to fib(n-2), 1 to fib(n-1) and 2 to s1+s2. If we compare this to the fib(6) tree illustrated above, for the first three iterations BRANCH will be 0 and n will go 6, 4, 2 as the loop continues from the BRANCH == 0 block.

On the next iteration n will be 0 and we drop to the block below the branching logic. There the stack will be popped bringing back the values for n == 2 (restoring n to 2) and the BRANCH will be increased to 1.

So on the next iteration we hit the BRANCH == 1 block and n becomes 1. As we continue we hit n < 2 again and pop the state from the BRANCH == 1 block, bringing BRANCH to 2. As we loop back we hit the BRANCH == 2 block and perform the addition, storing the result in retval. Now we pop n == 4 from the stack. This again brings us to the BRANCH == 1 block and brings n down to 3.

This sequence corresponds to fib(6) -> fib(4) -> fib(2) -> fib(0) -> fib(1) -> fib(3). The tree branching logic has been flattened in such a way that the left child is always traversed first and then we go down the right branch once we pop back; this is called pre-order tree traversal. The rest of the tree will be traversed in the order fib(1) -> fib(2) -> fib(0) -> fib(1) -> fib(5) -> fib(3) -> fib(1) -> fib(2) -> fib(0) -> fib(1) -> fib(4) -> fib(2) -> fib(0) -> fib(1) -> fib(3) -> fib(1) -> fib(2) -> fib(0) -> fib(1), completing the same calculation as the recursive version.
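
A quick way to double-check that order (my addition, not from the original post) is a recursive pre-order walk that prints each call:

func visit(n int) {
    // Visit the node itself first, then the n-2 subtree, then the
    // n-1 subtree: the same order in which fib3 descends and pops.
    fmt.Printf("fib(%d) ", n)
    if n < 2 {
        return
    }
    visit(n - 2)
    visit(n - 1)
}

Calling visit(6) prints all 25 calls in exactly the sequence traced above.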


type state4 struct {
    BRANCH, s1, n int
}
func fib4(n int) int {
    var s1, retval int
    BRANCH := 0
    var STACK list.List
    for {
        if n < 2 {
            retval = n
        } else {
            if BRANCH == 0 {
                for n >= 2 {
                    STACK.PushBack(&state4{BRANCH, 0, n})
                    n -= 2
                }
                retval = n
            } else if BRANCH == 1 {
                s1 = retval
                STACK.PushBack(&state4{BRANCH, retval, n})
                n -= 1
                BRANCH = 0
                continue
            } else if BRANCH == 2 {
                retval += s1
            }
        }
        if STACK.Len() == 0 {
            return retval
        }
        s := STACK.Back().Value.(*state4)
        STACK.Remove(STACK.Back())
        BRANCH, s1, n = s.BRANCH, s.s1, s.n
        BRANCH++
    }
}

The hard part is over and all that is left is some code cleanup and optimization. The BRANCH == 0 case can be optimized to include its own for loop that walks the entire branch down from n without involving the outer loop. So once we hit it with e.g. n == 6, the states for n == 6, n == 4 and n == 2 get pushed on the stack straight away before we continue with the main loop.

Furthermore, s2 is not really needed. Its value is always either 0 (when BRANCH == 0) or the same as retval (when BRANCH == 1).


func fib5(n int) int {
    var s1, retval int
    BRANCH := 0
    var STACK list.List
    for {
        if n < 2 {
            retval = n
        } else {
            if BRANCH == 0 {
                for n >= 2 {
                    STACK.PushBack(&state4{1, 0, n})
                    n -= 1
                }
                retval = n
            } else if BRANCH == 1 {
                s1 = retval
                STACK.PushBack(&state4{2, retval, n})
                n -= 2
                BRANCH = 0
                continue
            } else if BRANCH == 2 {
                retval += s1
            }
        }
        if STACK.Len() == 0 {
            return retval
        }
        s := STACK.Back().Value.(*state4)
        STACK.Remove(STACK.Back())
        BRANCH, s1, n = s.BRANCH, s.s1, s.n
    }
}

As a final optimization we swap the logic for BRANCH == 0 and BRANCH == 1. It doesn't really matter whether we descend down the execution tree left child first or right child first, the final result is the same. After all, a+b is equal to b+a. But the inner loop for the left path is executed n/2 times while the inner loop for the right path is executed n times. If the optimization is beneficial for n/2, it should be even better for n.

Furthermore, rather than pushing a branch value on the stack and then incrementing it by one when we get it back, we can simply push the incremented value on the stack and get rid of the trailing BRANCH++ statement.

After all this hard work, the results are a bit disappointing. The optimized version is roughly 70 times slower than the recursive version.


Fib1: fib(20) = 6765 at 83.407us
Fib5: fib(20) = 6765 at 6.380718ms
Fib1: fib(20) = 6765 at 83.304us
Fib6: fib(20) = 6765 at 5.68734ms
Fib1: fib(20) = 6765 at 83.497us
Fib6: fib(20) = 6765 at 5.682758ms

So apparently Go is doing a good job here, at least better than we can accomplish.

Get the source at GitHub.

Sunday, October 6, 2013

5.4.2: Creating Tail Calls

If a recursive function doesn't have a tail call to eliminate, we can often create it. And then promptly eliminate it. A truly creative destruction.

The binary function accepts an integer and returns a string that is the binary representation of the integer. E.g. binary(10) returns the string "1010".

func binary1(n int) string {
    if n <= 1 {
        return strconv.Itoa(n)
    }
    k := int(n / 2)
    b := int(math.Mod(float64(n), 2))
    return binary1(k) + strconv.Itoa(b)
}

Even though the call is at the tail of the function, this is not a tail call, since the last computation performed in the function is the concatenation of the two strings: the return value of the recursive call and the conversion of b.

func binary2(n int, retval string) string {
    if n < 1 {
        return retval
    }
    k := int(n / 2)
    b := int(math.Mod(float64(n), 2))
    retval = strconv.Itoa(b) + retval
    return binary2(k, retval)
}

To eliminate the final computation we introduce a new state variable, retval, which contains the state of the computation at all times and is passed along the recursive calls. The final computation is now the recursive call, a tail call that we can eliminate. The cost is that the computation takes one extra recursive call. But that is acceptable since that one extra call will be turned into one extra iteration.

func binary3(n int) string {
    var retval string
    for n >= 1 {
        b := int(math.Mod(float64(n), 2))
        retval = strconv.Itoa(b) + retval
        n = int(n / 2)
    }
    return retval
}

The recursive function now becomes a simple function with a for loop, just like the gcd example. n >= 1 guards the loop that allows us to build up the string in retval, which we return once the loop exits.

Binary1: binary(10) ="1010" at 13.448us
Binary3: binary(10) ="1010" at 2.461us
Binary1: binary(10) ="1010" at 8.896us
Binary3: binary(10) ="1010" at 1.581us
Binary1: binary(10) ="1010" at 12.516us
Binary3: binary(10) ="1010" at 2.02us

Again, the loop version of the computation destroys the recursive version. And this is even though we are using a somewhat inefficient (and not recommended by the Go authors) method of string concatenation.

The strategy can be summarized as follows: identify the tail computation that prevents tail-call elimination, store that computation in a variable that can be passed along with the recursion, isolate the tail call, and finally eliminate it.

The source is at GitHub. Also, another example rewriting a factorial function in the same way is available. The strategy is exactly the same so I won't go further into it.
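
The factorial rewrite mentioned above follows the same steps; here is a sketch of how it might look (my reconstruction, the version in the repository may differ):

// Step 1: plain recursion. The multiplication happens after the
// recursive call returns, so there is no tail call yet.
func fact1(n int) int {
    if n <= 1 {
        return 1
    }
    return n * fact1(n-1)
}

// Step 2: move the pending multiplication into an accumulator,
// turning the recursive call into a true tail call.
func fact2(n, retval int) int {
    if n <= 1 {
        return retval
    }
    return fact2(n-1, n*retval)
}

// Step 3: eliminate the tail call with a loop.
func fact3(n int) int {
    retval := 1
    for n > 1 {
        retval *= n
        n--
    }
    return retval
}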

Saturday, October 5, 2013

5.4.1: Someone Else's Problem

A power set P of a set S is a collection of sets constructed in such a way that each set Sn contains a different combination of the elements of S and P contains all such distinct (here, non-empty) sets Sn.

P(S={s1, s2, s3}) = [
    S1={s1, s2, s3},
    S2={s1, s2},
    S3={s1, s3},
    S4={s1},
    S5={s2, s3},
    S6={s2},
    S7={s3}
]

The Perl code in HOP is taken from the book Mastering Algorithms with Perl and the power set is constructed as a hash of hashes. It is full of weird perlisms and I even had to seek the help of MJD (thanks!) to understand it. Still I can't say I fully understand it and why it is implemented the way it is implemented. If anyone is interested, the original code is here.

type set map[string]string
type powerset []set

For me the most natural way to construct this is to represent the powerset as a slice of maps. Perl's magic hash foo is hard to grok and I'm not sure what the benefit is of doing it that way. Feel free to tell me how my version is way worse Big Oh wise!

func keysAndValues(s set) ([]string, []string) {
    var ks []string
    var vs []string
    for k := range s {
        ks = append(ks, k)
        vs = append(vs, s[k])
    }
    return ks, vs
}

Perl has keys and values to retrieve the keys and values from hashes. We need something similar to retrieve them from maps. Doing it in a single function is beneficial since then we can do it in a single pass.

The signature of the powerset function is powerset_recurse(s set, p powerset, keys, values []string, n, i int) powerset. s is our set S that we wish to use as a seed for the powerset. All of the other parameters are the state that needs to be passed from each invocation to the next. This is something that the function maintains and the caller does not need to know about. In Perl we simply call the function without these parameters:

powerset_recurse($set)

Go has no default parameters which forces us to provide nil values for all the state parameters:

powerset_recurse(s, nil, nil, nil, 0, 0)

This is clumsy and ugly. To avoid this, we can wrap the recursive function in another function that provides the nicer interface to the caller:

func powerset_recurse(s set) powerset {
    var f func(s set, p powerset, keys, values []string, n, i int) powerset
    f = func(s set, p powerset, keys, values []string, n, i int) powerset {
        // function body
        return f(s, p, keys, values, n, i+1)
    }
    return f(s, nil, nil, nil, 0, 0)
}

Now we can call the function and start the recursive computation in the same way as in Perl.

    if p == nil {
        keys, values = keysAndValues(s)
        n = len(keys)
        p = make(powerset, int(math.Pow(2, float64(n)))-1)
        i = 1
    }
    if i-1 == n {
        return p
    }

The first part of the body simply sets up the state variables for the recursion if the power set that we wish to create is nil. We retrieve a slice of keys and values from the source set and make the powerset with room for 2^n - 1 sets, n being the number of keys in the source set. i is our counter; it represents the number of recursions we've executed. Once i-1 is equal to n the powerset is ready and we stop. Naturally, this initialization code could simply live outside the recursive function, in the wrapping function that we created to improve the interface. But it is here in the Perl version so I'll do the same.

    c := int(math.Pow(2, float64(i-1)))
    for j := 0; j < c; j++ {
        ss := make(set, i)
        for k := 0; k < i; k++ {
            flag := 1 << uint(k)
            if (c+j)&flag == flag {
                ss[keys[k]] = values[k]
            }
        }
        p[c-1+j] = ss
    }

This is where we generate each Sn in the powerset and where the Perl code contains much of its black magic. My strategy is to use bitmasks to generate the sets.

c is the number of sets to generate in this invocation: 1 for the first, 2 for the second, and then 4, 8, 16, etc., as needed. The invocation counter i controls the size of each set: 1 the first time, then 2, 3, 4, 5, etc.

In the inner loop, we copy keys and values from the source set to the set Sn. We generate a flag for each value such that value k is the flag ...0001..., that is, it contains a 1 at bit position k and 0 elsewhere. c+j is the mask for set Sn: if bit k of the mask is set, key and value k+1 are included. So a mask such as 1101, read from the least significant bit, means the set contains keys and values 1, 3 and 4, while a mask such as 001011 means the set contains keys and values 1, 2 and 4.

After this loop is done we assign the set Sn to the correct location in the powerset. The complete function is as follows:

func powerset_recurse(s set) powerset {
    var f func(s set, p powerset, keys, values []string, n, i int) powerset
    f = func(s set, p powerset, keys, values []string, n, i int) powerset {
        if p == nil {
            keys, values = keysAndValues(s)
            n = len(keys)
            p = make(powerset, int(math.Pow(2, float64(n)))-1)
            i = 1
        }
        if i-1 == n {
            return p
        }
        c := int(math.Pow(2, float64(i-1)))
        for j := 0; j < c; j++ {
            ss := make(set, i)
            for k := 0; k < i; k++ {
                flag := 1 << uint(k)
                if (c+j)&flag == flag {
                    ss[keys[k]] = values[k]
                }
            }
            p[c-1+j] = ss
        }
        return f(s, p, keys, values, n, i+1)
    }
    return f(s, nil, nil, nil, 0, 0)
}
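
To see the bit-mask selection in isolation, here is a tiny, self-contained sketch (my illustration, separate from the HOP translation) that enumerates the subsets of a three-element set with the same trick:

// subsets prints every non-empty subset of elems, using each number
// from 1 to 2^n-1 as a mask; bit k of the mask decides whether
// elems[k] is included.
func subsets(elems []string) {
    n := len(elems)
    for mask := 1; mask < 1<<uint(n); mask++ {
        var ss []string
        for k := 0; k < n; k++ {
            if mask&(1<<uint(k)) != 0 {
                ss = append(ss, elems[k])
            }
        }
        fmt.Println(ss)
    }
}

Calling subsets([]string{"s1", "s2", "s3"}) prints the same seven sets as the example at the top of this post, just in mask order.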

To remove the recursion we simply add a for loop around the latter part of the body. And then we can remove the inner function wrapper as well, making the whole thing a lot simpler:

func powerset_loop(s set) powerset {
    keys, values := keysAndValues(s)
    n := len(keys)
    p := make(powerset, int(math.Pow(2, float64(n)))-1)
    for i := 1; i <= n; i++ {
        c := int(math.Pow(2, float64(i-1)))
        for j := 0; j < c; j++ {
            ss := make(set, i)
            for k := 0; k < i; k++ {
                flag := 1 << uint(k)
                if (c+j)&flag == flag {
                    ss[keys[k]] = values[k]
                }
            }
            p[c-1+j] = ss
        }
    }
    return p
}

The benchmarks point to an improvement of somewhere between 15% and 40% in execution speed over the recursive version:

Powerset recursive at 14074
Powerset looping at 8858
Powerset recursive at 15000
Powerset looping at 10000
Powerset recursive at 14138
Powerset looping at 11909

Get the source at GitHub.

Friday, October 4, 2013

5.4.1: Tail-Call Elimination

One of the reasons recursive solutions often look so nice is the automatic stack management the computer performs for us. Every time we call a function, the parent function's parameters get saved on the stack and then popped back when the function returns and the parent regains control. Furthermore, any return value of the function gets pushed on the stack at the tail end of the function and then popped back in the parent function. As nice as such automatic stack management is, it is not free. Pushing and popping takes time.

A tail call occurs when a function ends by calling another function (such as itself) and then immediately returns the value of that last call without any additional computation. A recursive function R that nests three invocations R1, R2 and R3, each ending in such a call, will push and pop the same return value three times as the call stack unwinds.
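
A minimal illustration of the difference (my example; g stands in for any function):

func g(n int) int { return n * 2 }

// Not a tail call: the +1 still has to run after g returns, so the
// stack frame of f must survive the call.
func f(n int) int { return g(n) + 1 }

// A tail call: the result of f2 is exactly the result of g, so g
// could hand its value straight back to f2's caller.
func f2(n int) int { return g(n) }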

Tail-call elimination is the process of eliminating this unnecessary pushing and popping, simply returning the value of R3 straight to the caller of R1 without going back through R2 and R1.

Many compilers and runtimes provide this optimization automatically. As I write this, Go does not. But we can usually refactor a recursive tail-call function into a simple for loop without any recursion, thereby removing the tail call.

func gcd(m, n int) int {
    if n == 0 {
        return m
    }
    m = int(math.Mod(float64(m), float64(n)))
    return gcd(n, m)
}

The gcd function computes the Greatest Common Divisor of m and n using Euclid's algorithm. At each call we make m equal to n and n equal to the modulus of m and n. Once n hits 0 we are done and return m as the final result. Notice that no additional computation is performed on this value of m; it is simply passed back up the entire call stack as the result of the original computation.

func gcd_notail(m, n int) int {
    for n != 0 {
        m, n = n, int(math.Mod(float64(m), float64(n)))
    }
    return m
}

To eliminate the tail call we rewrite the function as a for loop. n is the condition of the loop, as it was the condition of the recursion: once it hits zero we are done. And at each iteration m becomes n and n becomes the modulus of m and n.
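
As an aside, since m and n are ints, Go's native % operator computes the same thing here without the float64 round trip; a minimal variant:

// Same loop, using the modulus operator instead of math.Mod.
func gcd_mod(m, n int) int {
    for n != 0 {
        m, n = n, m%n
    }
    return m
}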

GCD Recursive of 48 and 20: 4 at 3.298us
GCD NoTail of 48 and 20: 4 at 431ns
GCD Recursive of 48 and 20: 4 at 3.55us
GCD NoTail of 48 and 20: 4 at 624ns
GCD Recursive of 48 and 20: 4 at 4.669us
GCD NoTail of 48 and 20: 4 at 1.05us

As we can see, the non-recursive version of GCD is always faster than the recursive one, by a factor of 4 to 8 in these three samples.

Get the source at GitHub.

Thursday, October 3, 2013

5.3: A Generic Search Iterator

We can encapsulate the core strategy of the previous example in a code structure that we can then reuse to create iterators such as the IntPartition iterator. The idea is to extract the code that revolves around the agenda and the iteration. To use this very generic iteration, we need to supply a function that will generate new items for the agenda from the current item.


type ChildrenFunc func(interface{}) []interface{}

type depthfirst struct {
    children ChildrenFunc
    agenda   []interface{}
    node     interface{}
    err      error
}

The iterator consists of the agenda, the current value in the node field, and a function to generate new items.


func Dfs(root interface{}, children ChildrenFunc) i.Forward {
    dfs := depthfirst{children: children}
    dfs.agenda = append(dfs.agenda, root)
    dfs.Next()
    return &dfs
}

To create the iterator we set the agenda to the root item, call Next() to generate the first value and return.


func (dfs *depthfirst) AtEnd() bool {
    return len(dfs.agenda) == 0
}

func (dfs *depthfirst) Value() interface{} {
    if dfs.AtEnd() {
        dfs.err = fmt.Errorf("Value: Beyond end")
        return nil
    }
    return dfs.node
}

node is the current value of the iteration and the iteration is over once the agenda is empty.


func (dfs *depthfirst) Next() error {
    if dfs.AtEnd() {
        dfs.err = fmt.Errorf("Next: Beyond end")
        return dfs.err
    }
    dfs.node, dfs.agenda = dfs.agenda[0], dfs.agenda[1:len(dfs.agenda)]
    dfs.agenda = append(dfs.agenda, dfs.children(dfs.node)...)
    return nil
}

The iteration should be familiar: pop the head of the agenda, call the function we received to generate the next batch of items from the current item, and append them to the agenda.

The IntPartition iterator is now a short piece of code that focuses on solving only the problem of partitioning an integer; all the iteration bookkeeping is gone. The only new thing here is that, rather than adding items to the agenda for further iteration, we create a temporary slice and add our generated nodes to it. This slice is the list of children we return to the Dfs iterator.


func IntPartition(n int) i.Forward {
    return Dfs([]int{n}, func(node interface{}) []interface{} {
        items, _ := node.([]int)
        largest, rest := items[0], items[1:len(items)]
        min, max := 1, largest/2
        if len(rest) > 0 {
            min = rest[0]
        }
        next := make([]interface{}, 0)
        for r := igen.Range(min, max+1); !r.AtEnd(); r.Next() {
            next = append(next, (append([]int{largest - r.Int(), r.Int()}, rest...)))
        }
        return next
    })
}

Now it's 15 lines of code rather than 50. Yay for iterator building blocks.
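
To show the reuse, here is a toy example of my own (not from the post): the same Dfs building block streams a generated binary tree in which node n has the children 2n and 2n+1:

// Iterate over the tree rooted at 1, cutting it off below 8; the
// ChildrenFunc computes the child nodes on the fly.
itr := Dfs(1, func(node interface{}) []interface{} {
    n, _ := node.(int)
    if 2*n >= 8 {
        return nil
    }
    return []interface{}{2 * n, 2*n + 1}
})
for ; !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}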

Get the source at GitHub. The Dfs iterator is also available in the iterator library.

Wednesday, October 2, 2013

5.2: How to Convert a Recursive Function to an Iterator

The trick to turning a recursive function into an iterator is to maintain an agenda of todo items, where each item is the state of one invocation of the recursive version. The agenda replaces the stack management we get for free when we use recursion.

The integer partition problem is to break an integer into all possible sets of integer components that sum to the original integer. E.g. 6 breaks into:

6
5 1
4 1 1
3 1 1 1
2 1 1 1 1
1 1 1 1 1 1
2 2 1 1
3 2 1
4 2
2 2 2
3 3

A recursive solution to this problem looks a bit like this:

func IntPartitionRecursive(item []int) {
    fmt.Println(item)
    largest, rest := item[0], item[1:len(item)]
    min, max := 1, largest/2
    if len(rest) > 0 {
        min = rest[0]
    }
    for r := igen.Range(min, max+1); !r.AtEnd(); r.Next() {
        IntPartitionRecursive(append([]int{largest - r.Int(), r.Int()}, rest...))
    }
}

For every slice, starting in this case with [6], we print out the slice. We then split it into a first element (6) and the rest ([]) and set the min (1) and max (3) values. We then loop through the range [1,4) and call the function again 3 times with the values [5,1], [4,2] and [3,3]. The call with [5,1] will in turn result in calls with [4,1,1] and [3,2,1]. This strategy of looping only to largest/2 prunes the solution tree so that we don't return solutions such as [2,4], since that's the same as [4,2]. All of our solutions are in decreasing-or-equal order.

To turn this into an iterative solution, we need to identify the essential state of the iteration. In this case, it is the slice of integers that we are generating new solutions from (e.g. the state [5,1] results in new states [4,1,1] and [3,2,1]). One change from the recursive solution is that the iterative solution will return the list in sorted order, such as this:

6
5 1
4 2
4 1 1
3 3
3 2 1
3 1 1 1
2 2 2
2 2 1 1
2 1 1 1 1

To produce the sorted version, we leverage the sort package provided by Go. We need to define three methods on our to-be-sorted collection, Len(), Swap() and Less().

type items [][]int

func (it items) Len() int { 
    return len(it) 
}

func (it items) Swap(i, j int) {
    it[i], it[j] = it[j], it[i]
}

func (it items) Less(i, j int) bool {
    a, b := it[i], it[j]
    for k := 0; k < len(a); k++ {
        if a[k] < b[k] {
            return false
        } else if a[k] > b[k] {
            return true
        }
    }
    // we will never get here since we've already pruned the
    // other half of the solution tree
    return true
}

Go's sort package is geared toward sorting a container from the smallest element to the largest; we want to sort in reverse order, so we flip the comparison: return true if item i is larger than item j and vice versa.
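
As an aside (my note, not from the original post): had we written Less in the usual ascending direction, the standard library could flip it for us, since sort.Reverse wraps a sort.Interface and inverts its Less:

// With an ascending Less, this call in Next() (further down) would
// sort the agenda in the descending order the iterator wants.
sort.Sort(sort.Reverse(p.agenda))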

type intpartition struct {
    agenda items
    item   []int
    err    error
}

agenda is the state of the iteration. item is the current solution that Value() will return.

func IntPartition(n int) i.Forward {
    var p intpartition
    item := make([]int, 1)
    item[0] = n
    p.agenda = append(p.agenda, item)
    p.Next()
    return &p
}

To start the iteration we simply construct a slice from the supplied starting value and call the Next() method.

func (p *intpartition) AtEnd() bool {
    return len(p.agenda) == 0
}
func (p *intpartition) Value() interface{} {
    if p.AtEnd() {
        p.err = fmt.Errorf("Value: Beyond end")
        return nil
    }
    return p.item
}

An empty agenda means the iteration is over.

func (p *intpartition) Next() error {
    if p.AtEnd() {
        p.err = fmt.Errorf("Calling next AtEnd")
        return p.err
    }
    p.item, p.agenda = p.agenda[0], p.agenda[1:len(p.agenda)]
    largest, rest := p.item[0], p.item[1:len(p.item)]
    min, max := 1, largest/2
    if len(rest) > 0 {
        min = rest[0]
    }
    for r := igen.Range(min, max+1); !r.AtEnd(); r.Next() {
        p.agenda = append(p.agenda, (append([]int{largest - r.Int(), r.Int()}, rest...)))
    }
    sort.Sort(p.agenda)
    return nil
}

The Next() method starts by popping the first item from the agenda and assigning it to item. Then we break the item into two parts, largest and rest, and proceed exactly like the recursive version, except that we push new solutions onto the agenda where the recursive version would call the function again. Lastly, we sort the agenda.

    itr := IntPartition(n)
    for ; !itr.AtEnd(); itr.Next() {
        fmt.Println(itr.Value())
    }

Looping through the solution space is trivial.

A version that only prints out sums with distinct components, i.e. where no number repeats itself, is easy: simply leverage the i.Filter function and write a filter for the series of integers.

func distinct(itr i.Iterator) bool {
    item, _ := itr.Value().([]int)
    for i := 1; i < len(item); i++ {
        if item[i-1] == item[i] {
            return false
        }
    }
    return true
}

Since the list is in a descending order we simply need to check if the current component is equal to the previous component; if so we return false. If we can run through all components without this happening, we return true.

    itr = hoi.Filter(distinct, IntPartition(n))
    for ; !itr.AtEnd(); itr.Next() {
        fmt.Println(itr.Value())
    }

Get the source at GitHub.

Tuesday, October 1, 2013

5.1: The Partition Problem Revisited

The recursive partition algorithm from 1.8 solved for a single solution when searching a list of treasures for a share of a certain value. The iterator version will return a single solution on each iteration; successive iterations will return new solutions until all possible solutions have been found and we reach AtEnd().

type entry struct {
    target      int
    pool, share []int
}

type partition struct {
    todo  *list.List
    err   error
    share []int
}

The entry item contains the current target, the pool of items to consider, and the share that makes up a possible solution. The todo list holds the entries to inspect.

func Partition(target int, treasures []int) i.Forward {
    p := partition{}
    p.todo = list.New()
    if target == 0 {
        p.todo.PushBack(&entry{target: target, pool: make([]int, 0)})
    } else {
        p.todo.PushBack(&entry{target: target, pool: treasures})
    }
    p.Next()
    return &p
}

To construct the iterator we create the first entry item for the todo list from the target and treasures parameters. If target is zero we put an empty list in the pool so the iterator will return right after the first iteration. Then we execute the first iteration.

func (p *partition) AtEnd() bool {
    return p.todo.Len() == 0
}

func (p *partition) Value() interface{} {
    if p.AtEnd() {
        p.err = fmt.Errorf("Next: Beyond end")
        return nil
    }
    return p.share
}

An empty todo list indicates that we are at the end of the iteration. The share is the current value.

func (p *partition) Next() error {
    if p.AtEnd() {
        p.err = fmt.Errorf("Next: Beyond end")
        return p.err
    }
    for !p.AtEnd() {
        e, _ := p.todo.Front().Value.(*entry)
        p.todo.Remove(p.todo.Front())
        first, rest := e.pool[0], e.pool[1:len(e.pool)]
        if len(rest) > 0 {
            p.todo.PushBack(&entry{e.target, rest, e.share})
        }
        if e.target == first {
            p.share = append(e.share, first)
            return nil
        } else if e.target > first && len(rest) > 0 {
            p.todo.PushBack(&entry{e.target - first, rest, append(e.share, first)})
        }
    }
    return nil
}

We start by popping the first entry off the todo list and splitting the pool into a first item and the rest. If the rest contains items there are more possible solutions to consider, so we put it on the todo list for future inspection.
If the target is equal to the first element of the pool we've found a solution and return. Otherwise, if the target is larger than the first element and the rest contains more elements, there could be a solution or two still in the rest list, so we put it on the todo list to inspect.
If we reach this point we did not find a solution, so we pop the next item off the todo list and start again, or return if the list is empty.

itr := Partition(target, treasures)
for ; !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Construct the iterator and loop through it to find all solutions.
Get the source at GitHub.

Interlude

Chapter 5 is a short one and it can be split into two logical parts. The first is about turning recursive functions into iterators and ends with a general strategy to create generic search iterators.

The second part covers some alternative strategies to turn recursive functions into loops. Here we focus on tail-call elimination and variations thereof.

Thursday, September 19, 2013

4.7: An Extended Example: Web Spiders - Part 4

The Spider is the crawler. It constructs the iterator chain we use to parse the web pages and starts the Fetcher process. Its interface to the world is an i.Forward iterator that streams out urls and some information about those urls. It operates by sending each url to the Fetcher, putting it on the Fetcher queue. It then waits for the Fetcher to return the first page. The Spider returns the result to the caller and hands the page over to the parsing routine that runs concurrently. The parser runs through the page, queueing any urls that come out of the iterator chain. The user only blocks if there are no pages waiting on the channel from the Fetcher.

type Entry struct {
    Url, Referer string
    StatusCode   int
}

type Spider struct {
    starturl string
    entries  list.List
    queued   map[string]bool
    err      error
    fetcher  *Fetcher
    robot    Robot
    pages    chan *page
}

The Entry struct is what we return to the user. It contains the url, the page that referred it, and the status code that resulted from attempting to fetch it.
The Spider struct contains the state of the operation: starturl is the starting url, entries contains the data we send to the user, queued is used as a log of urls we've already worked on, fetcher and robot are the two main components of the system, and pages is the channel we use to send pages to the parsing function.

func NewSpider(url string) *Spider {
    url = strings.ToLower(url)
    var s Spider
    s.starturl = url
    s.robot, _ = NewRobot(url)
    s.queued = make(map[string]bool)
    s.queued[url] = true
    s.fetcher = NewFetcher()
    s.fetcher.Queue(url, url)
    s.fetcher.Run()
    s.pages = make(chan *page)
    s.parse()
    p := <-s.fetcher.Pages()
    s.err = s.processEntry(p)
    return &s
}

The constructor creates the initial state of the spider and queues and fetches the starting url. It then waits for the first page and sends it to the parser before returning.

func (s *Spider) Value() interface{} {
    return s.entries.Front().Value
}

func (s *Spider) Error() error {
    return s.err
}
func (s *Spider) AtEnd() bool {
    return s.entries.Len() == 0
}
func (s *Spider) Close() {
    close(s.pages)
    s.fetcher.Stop()
}

The value we return to the user is the head of the entries queue. Once it's empty, we are done. To clean up resources, the user has to call Close() on the spider; it closes the channel to the parser and shuts down the fetcher.

func (s *Spider) Next() error {
    s.entries.Remove(s.entries.Front())
    if s.AtEnd() {
        for {
            if p, ok := <-s.fetcher.Pages(); ok {
                s.err = s.processEntry(p)
                if !s.AtEnd() {
                    break
                }
            }
        }
    }
    return s.err
}

The Next() function pops the head from the queue. If the queue is then empty, it fetches the next page from the Fetcher's channel and processes it.

func (s *Spider) processEntry(p *page) error {
    if p.err != nil {
        return p.err
    }
    s.entries.PushBack(&Entry{Url: p.url, Referer: p.ref, StatusCode: p.response.StatusCode})
    s.pages <- p
    return nil
}

The processing routine simply creates an Entry to return to the user and then sends the page off to the parsing function.

func (s *Spider) parse() {
    go func() {
        for p := range s.pages {
            if p.response.Body != nil {
                itr :=
                    HostMapper(p.url,
                        NormalizeItr(
                            UrlItr(
                                LinkItr(
                                    NodeItr(p.response.Body, DepthFirst)))))
                if s.robot != nil {
                    itr = RobotItr(s.robot, itr)
                }
                itr = BindByRef(s.starturl, Referer(p.url, itr))
                for ; !itr.AtEnd(); itr.Next() {
                    urlpair, _ := itr.Value().([]string)
                    url := urlpair[0]
                    if _, ok := s.queued[url]; !ok {
                        s.queued[url] = true
                        s.fetcher.Queue(url, urlpair[1])
                    }
                }
                p.response.Body.Close()
            }
        }
    }()
}

The parsing function loops on the channel of pages, blocking until a new one is available. When it comes through it starts a chain of iterators to filter the links on the page. Every link that comes through the chain is queued to be processed by the Fetcher.


The iterator chain built in the parsing routine is made of eight components that transform the html nodes into a pair of strings representing the url and its referer. Those processes are for the most part specializations of i.Map and i.Filter.

The complete process of the spider is laid out in the post's diagram (not reproduced here); the two red sections represent two independent threads of execution, the Fetcher and the parse() method.

Using the spider is a simple matter of iterating through the urls that the spider returns. In this case, we want to filter them further by printing only the urls that return an error code. This is accomplished with a hoi.FilterFunc function that checks the StatusCode. Now we have an url checker that checks sites for urls that are not available.

func find4xx(itr i.Iterator) bool {
    e, _ := itr.Value().(*spider.Entry)
    return e.StatusCode >= 400 && e.StatusCode < 500
}

func main() {
    s := spider.NewSpider(os.Args[1])
    itr := hoi.Filter(find4xx, s)
    count := 0
    for ; !itr.AtEnd(); itr.Next() {
        e, _ := itr.Value().(*spider.Entry)
        count++
        fmt.Printf("%d: Url: %s, Code: %d, Referer: %s\n", count, e.Url, e.StatusCode, e.Referer)
    }
    if itr.Error() != nil {
        fmt.Println(itr.Error())
    }
    s.Close()
}

Both of these source files are on GitHub at hog/spider/spider and hog/checkurls.

Wednesday, September 18, 2013

4.7: An Extended Example: Web Spiders - Part 3

The robots package is very simple since github.com/temoto/robotstxt-go does the heavy lifting of parsing and querying the robots.txt file.

type Robot interface{}

func NewRobot(url string) (Robot, error) {
    src := hostFromBase(url) + "/robots.txt"
    resp, err := http.Get(src)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    robots, err := robot.FromResponse(resp)
    if err != nil {
        return nil, err
    }
    return robots.FindGroup("GoGoSpider"), nil
}
func RobotItr(rules Robot, itr i.Forward) i.Forward {
    rrules, _ := rules.(*robot.Group)
    filter := func(itr i.Iterator) bool {
        url, _ := itr.Value().(string)
        return rrules.Test(url)
    }
    return hoi.Filter(filter, itr)
}

The NewRobot() constructor uses the host name of our root url to fetch the robots.txt file and returns the rules for the GoGoSpider group (that's us!). The iterator uses these rules to filter the url stream, rejecting any urls that are not allowed according to the rules in the robots.txt file.
The Fetcher is the first and only object in this system that has nothing at all to do with iterators. It doesn't consume them, doesn't produce them. What it does do is fetch and produce web pages, concurrently.

type urlpair struct {
    url, ref string
}

type page struct {
    url, ref string
    err      error
    response *http.Response
}
type Fetcher struct {
    pages  chan *page
    done   chan bool
    queue  list.List
    lockq  sync.Mutex
    client http.Client
}

The urlpair is used to transport an url and its referer from the user of the Fetcher to the engine that fetches the pages. The page contains those url pairs along with the http response received. The Fetcher maintains the pages and done channels, a queue, and a mutex to guard access to the queue.

func NewFetcher() *Fetcher {
    f := Fetcher{}
    f.pages = make(chan *page, 5)
    f.done = make(chan bool)
    return &f
}

func (f *Fetcher) Stop() {
    f.done <- true
    <-f.done
}

The constructor creates the channels used in the fetching operation. The pages channel is buffered: the Fetcher can get 5 pages ahead before it blocks and has to wait.

func (f *Fetcher) Pages() <-chan *page {
    return f.pages
}
func (f *Fetcher) Queue(url, ref string) {
    f.lockq.Lock()
    defer f.lockq.Unlock()
    f.queue.PushBack(&urlpair{url: url, ref: ref})
}

The Queue() method is where links enter the fetching system. They get stored on a queue so the caller won't block (for long; only if the mutex is locked). The Pages() method gives the user access to the page channel; it's where the results of the queueing operation come back out to the user.

func (f *Fetcher) Run() {
    go func() {
        for {
            f.lockq.Lock()
            if f.queue.Len() > 0 {
                e := f.queue.Front()
                urlpair, _ := e.Value.(*urlpair)
                f.queue.Remove(e)
                f.lockq.Unlock()
                headResp, err := f.client.Head(urlpair.url)
                var p *page
                if err == nil {
                    content := headResp.Header.Get("Content-Type")
                    if strings.HasPrefix(content, "text/html") || strings.HasPrefix(content, "text/xhtml") {
                        headResp.Body.Close()
                        getResp, err := f.client.Get(urlpair.url)
                        if err == nil {
                            p = &page{url: urlpair.url, response: getResp, ref: urlpair.ref}
                        } else {
                            p = &page{url: urlpair.url, ref: urlpair.ref, err: err}
                        }
                    } else {
                        p = &page{url: urlpair.url, ref: urlpair.ref, response: headResp}
                    }
                } else {
                    p = &page{url: urlpair.url, ref: urlpair.ref, err: err}
                }
                select {
                case f.pages <- p:
                case <-f.done:
                    if p.response != nil {
                        p.response.Body.Close()
                    }
                    for {
                        select {
                        case sentpage := <-f.pages:
                            if sentpage.response != nil {
                                sentpage.response.Body.Close()
                            }
                        default:
                            close(f.pages)
                            close(f.done)
                            return
                        }
                    }
                }
            } else {
                f.lockq.Unlock()
                time.Sleep(1)
            }
        }
    }()
}

The Run() method is the engine of the Fetcher. It starts a goroutine that fetches links off the queue. It performs a HEAD request on each link, checking the response and the content type of the response. If the content is an html document it performs a GET request, fetching the document. The response of the request that succeeded (or the fact that it failed) is then packaged with the url and the referer variables and sent over the pages channel.
If a value arrives on the done channel, the fetcher retrieves any documents from the pages channel to close them and free any resources. Once that is done, it quits the goroutine.


In the post's diagram (not reproduced here), the red line represents the boundary between the two threads of execution; the queue and the channel cross that boundary.
Both of these source files are on GitHub at hog/spider/robots and hog/spider/fetcher.

Tuesday, September 17, 2013

4.7: An Extended Example: Web Spiders - Part 2

The job of the linkitr.go package is to operate on a stream of html nodes from the NodeItr iterator and transform it into a stream of strings.

The first iterator is the LinkItr.

func links(itr i.Iterator) bool {
    n, _ := itr.Value().(*html.Node)
    if n.Type != html.ElementNode {
        return false
    }
    if n.Data != "a" && n.Data != "img" && n.Data != "link" && n.Data != "style" && n.Data != "script" {
        return false
    }
    if n.Data == "style" || n.Data == "script" {
        src := attr("src", n.Attr)
        return src != ""
    }
    return true
}
func LinkItr(itr i.Forward) i.Forward {
    return hoi.Filter(links, itr)
}

The LinkItr is an i.Filter iterator that runs through the node stream and removes any nodes that are not a, img, link, style or script nodes.

func geturl(itr i.Iterator) interface{} {
    n, _ := itr.Value().(*html.Node)
    var url string
    if n.Data == "a" {
        url = attr("href", n.Attr)
    } else if n.Data == "img" {
        url = attr("src", n.Attr)
    } else if n.Data == "link" {
        url = attr("href", n.Attr)
    } else if n.Data == "style" {
        url = attr("srr", n.Attr)
    } else if n.Data == "script" {
        url = attr("src", n.Attr)
    }
    return url
}

func UrlItr(itr i.Forward) i.Forward {
    return hoi.Map(geturl, itr)
}

The UrlItr uses hoi.Map to transform the node stream into a string stream; from now on we are operating on a stream of urls.

func attr(name string, attrs []html.Attribute) string {
    for _, a := range attrs {
        if a.Key == name {
            return a.Val
        }
    }
    return ""
}

The attr() function is a helper that gets the value of the attribute name from the node. After this we hand the stream over to the iterators in the urlitr.go package; it contains five iterators: NormalizeItr, HostMapper, Referer, BindByRef and BindByHost. One improvement that could be made to the code above would be to implement both the links function and the geturl function with a DispatchTable, as discussed in chapter 2 of HOP.
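
For instance, a sketch of that dispatch table (my take, not code from the repository): a map from the tag name to the attribute holding its url replaces both if/else chains:

// urlAttr maps a node's tag name to the attribute that carries its
// url; tags that are absent from the map carry no url at all.
var urlAttr = map[string]string{
    "a":      "href",
    "img":    "src",
    "link":   "href",
    "style":  "src",
    "script": "src",
}

func geturl2(itr i.Iterator) interface{} {
    n, _ := itr.Value().(*html.Node)
    return attr(urlAttr[n.Data], n.Attr)
}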

func removeUrl(itr i.Iterator) bool {
    url, _ := itr.Value().(string)
    url = strings.TrimSpace(url)
    if url == "" {
        return false
    }
    if strings.HasPrefix(url, "tel:") || strings.HasPrefix(url, "mailto:") {
        return false
    }
    return true
}

func remHash(u string) string {
    idx := strings.Index(u, "#")
    if idx == -1 {
        return u
    }
    if idx == 0 {
        return ""
    }
    return u[0:idx]
}

func normalizeUrl(itr i.Iterator) interface{} {
    url, _ := itr.Value().(string)
    url = strings.TrimSuffix(remHash(url), "/")
    return strings.ToLower(url)
}
func NormalizeItr(itr i.Forward) i.Forward {
    return hoi.Filter(removeUrl, hoi.Map(normalizeUrl, itr))
}

The job of NormalizeItr is to transform the url into a canonical version: the hash portion gets cut off, the trailing slash chomped, the rest lowercased, and empty strings are excluded along with mailto and tel links. The iterator uses both hoi.Map and hoi.Filter to accomplish this.

func hostFromBase(base string) string {
    slashslashidx := strings.Index(base, "//")
    idx := strings.Index(base[slashslashidx+2:], "/")
    if idx > 0 {
        return base[0 : slashslashidx+2+idx]
    }
    return base
}

func hostMapper(base string) i.MapFunc {
    host := hostFromBase(base)
    if !strings.HasSuffix(base, "/") {
        base = base + "/"
    }
    return func(itr i.Iterator) interface{} {
        url, _ := itr.Value().(string)
        if strings.Contains(url, "://") {
            return url
        }
        if strings.HasPrefix(url, "/") {
            return host + url
        }
        return base + url
    }
}

func HostMapper(base string, itr i.Forward) i.Forward {
    return hoi.Map(hostMapper(base), itr)
}

The HostMapper runs over the url stream, turning any relative urls into absolute urls.

func referer(ref string) hoi.MapFunc {
    return func(itr i.Iterator) interface{} {
        url, _ := itr.Value().(string)
        return []string{url, ref}
    }
}

func Referer(ref string, itr i.Forward) i.Forward {
    return hoi.Map(referer(ref), itr)
}

The Referer transforms the url stream into a stream of pairs: the url and the page it was found on.

func removeIfNotReferedBy(ref string) hoi.FilterFunc {
    host := hostFromBase(ref)
    return func(itr i.Iterator) bool {
        ref, _ := itr.Value().([]string)
        return strings.HasPrefix(ref[1], host)
    }
}

func BindByRef(ref string, itr i.Forward) i.Forward {
    return hoi.Filter(removeIfNotReferedBy(ref), itr)
}

The BindByRef removes any urls from the stream that are not referred by the reference url. This is to stop the crawler from running away into the distance: we only check the urls that are referred by the host of the starting url. Those urls might take us to another website but we won't go any further.

func removeIfNotOnHost(url string) hoi.FilterFunc {
    host := hostFromBase(url)
    return func(itr i.Iterator) bool {
        u, _ := itr.Value().([]string)
        return strings.HasPrefix(u[0], host)
    }
}

func BindByHost(host string, itr i.Forward) i.Forward {
    return hoi.Filter(removeIfNotOnHost(host), itr)
}

The BindByHost is an iterator that removes any urls from the stream that are not on the same host as the referencing url. If you use this iterator on the stream, you are only crawling urls that are on the same domain as the starting url. Both of these source files are on GitHub at hog/spider/linkitr and hog/spider/urlitr.

Monday, September 16, 2013

4.7: An Extended Example: Web Spiders - Part 1

The spider example is, like the Flat File Database, long enough to need more than one post. And as with Ffdb I will attempt to explain the code from the bottom up.

But first, a bird's-eye view of the system. The Node iterator gives us access to the html document. We then use various iterators to transform that stream of nodes into a stream of links. The fetcher's job is to retrieve documents from the web, and the robot's job is to parse the robots.txt file so we can act as good citizens when hitting some poor server. The spider runs the whole show and maintains the state of the crawling. One crucial change I made from the example in HOP: the fetching operation is now concurrent. This is a blog about Go and its characteristics, after all.




The Node iterator's job is to take a document tree of html nodes and turn it into a one-dimensional stream of nodes. The iterator offers two strategies to do this: Depth-First and Breadth-First.


type SearchTactic int

const (
    DepthFirst SearchTactic = iota
    BreadthFirst
)

type nodeitr struct {
    err   error
    cur   *html.Node
    next  func() error
    queue list.List
}

func NodeItr(in io.Reader, tactic SearchTactic) i.Forward {
    n := nodeitr{}
    n.cur, n.err = html.Parse(in)
    if tactic == DepthFirst {
        n.next = n.depthFirst
    } else {
        n.queue.PushBack(n.cur)
        n.next = n.breadthFirst
        n.next()
    }
    return &n
}

The nodeitr maintains a reference to the current node, a queue for the Breadth-First search, and a function pointer to the search algorithm. The constructor retrieves the root node and assigns the correct search function to next according to the SearchTactic value.

func (i *nodeitr) Value() interface{} {
    return i.cur
}

func (i *nodeitr) Error() error {
    return i.err
}

func (i *nodeitr) AtEnd() bool {
    return i.cur == nil
}

func (i *nodeitr) Next() error {
    return i.next()
}

The only thing to note here is that Next() simply forwards the operation to whatever function is stored in the next variable, be that the Depth-First or the Breadth-First search.

func (i *nodeitr) depthFirst() error {
    if i.err != nil {
        return i.err
    }
    if i.cur.FirstChild != nil {
        i.cur = i.cur.FirstChild
    } else if i.cur.NextSibling != nil {
        i.cur = i.cur.NextSibling
    } else if i.cur.Parent != nil {
        for i.cur.Parent != nil {
            i.cur = i.cur.Parent
            if i.cur.NextSibling != nil {
                i.cur = i.cur.NextSibling
                return i.err
            }
        }
        i.cur = nil
    }
    return i.err
}

The depthFirst() method simply uses the FirstChild, NextSibling and Parent references that every node has to traverse the tree in a Depth-First fashion. If the current node has a child we go there. Else, if it has a sibling, we go there. If both fail we start looping up the parent references, stopping at the first one that has a sibling. If that also fails we are back at the root node and the iteration is finished.

func (i *nodeitr) breadthFirst() error {
    if i.err != nil {
        return i.err
    }
    i.cur = nil
    if i.queue.Len() > 0 {
        i.cur = i.queue.Front().Value.(*html.Node)
        i.queue.Remove(i.queue.Front())
        if i.cur.FirstChild != nil {
            for c := i.cur.FirstChild; c != nil; c = c.NextSibling {
                i.queue.PushBack(c)
            }
        }
    }
    return i.err
}

Our breadthFirst() function uses a queue to maintain a list of nodes that we have yet to visit. If the queue is not empty we pop the first node off the queue; that is our current position. Then we check if this node has any child nodes; if so, they get appended to the queue. If the queue is empty we are finished.

Get the source at GitHub.

Sunday, September 15, 2013

4.6.2: An Iterator with an each-Like Interface

The concept behind each-Like is to run a Map operation transforming a value and returning it along with the original to the caller.
This is a Map iterator in almost every way.

type eachlike struct {
    itr i.Forward
    f   i.MapFunc
}

func EachLike(f i.MapFunc, itr i.Forward) *eachlike {
    return &eachlike{f: f, itr: itr}
}

The constructor takes an i.MapFunc function and an i.Forward iterator to work on.

func (i *eachlike) Value() interface{} {
    ret := make([]interface{}, 2)
    ret[0] = i.itr.Value()
    ret[1] = i.f(i.itr)
    return ret
}

func (i *eachlike) ValuePair() (interface{}, interface{}) {
    res := i.Value().([]interface{})
    return res[0], res[1]
}

The Value() method returns an interface{} value, a slice that contains both the original value and the transformed value. The ValuePair() method unpacks that slice and returns the two values separately.

func (i *eachlike) Error() error {
    return i.itr.Error()
}

func (i *eachlike) Next() error {
    return i.itr.Next()
}
func (i *eachlike) AtEnd() bool {
    return i.itr.AtEnd()
}

These functions simply forward to the wrapped iterator.

func multiply(n int) i.MapFunc {
    return func(itr i.Iterator) interface{} {
        return n * itr.Value().(int)
    }
}

This is a simple closure that creates an i.MapFunc that multiplies the value from the iterator by n.

itr := EachLike(multiply(10), icon.List(1, 2, 3, 4, 5, 6, 7, 8, 9))
for ; !itr.AtEnd(); itr.Next() {
    val, res := itr.ValuePair()
    fmt.Printf("Value: %v, result: %v\n", val, res)
}

To use it we construct the iterator from the i.MapFunc and the i.Forward iterator. We then loop over it and print out every value, result pair.
The rationale behind this is to exploit a feature in Perl where the caller can indicate to a function whether it expects a single value or a list to be returned. EachLike would return the transformed value in the scalar context, and the (transformed value, original value) pair in the list context. This is not possible in Go. And the core functionality of returning the transformed value with the original can easily be achieved with the regular i.Map iterator and an i.MapFunc function that simply returns a slice with both values.
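
For instance, a sketch of that alternative (my code, using the i and hoi packages as in the examples above):

// withOriginal wraps an i.MapFunc so that each element maps to a
// two-element slice holding the original and the transformed value.
func withOriginal(f i.MapFunc) i.MapFunc {
    return func(itr i.Iterator) interface{} {
        return []interface{}{itr.Value(), f(itr)}
    }
}

// itr := hoi.Map(withOriginal(multiply(10)), icon.List(1, 2, 3))
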
Get the source at GitHub.

Concerning the i library

During the last few days, the spinoff library that is the i library has seen many changes. I've refactored it into subpackages (i, hoi, icon, iadapt, ityped, itk), added tests and examples, and a first draft of the documentation is up on godoc.org. Over the next few days I will add to the documentation and examples.

The hoi package is the Higher Order Iterator package; as such it contains many useful higher order functions to use with iterators. Documentation for those functions is up on GoDoc.

As a result of the refactoring, I've had to go back over some of the examples in chapter 4 and change the code in the hog repository. I've also gone through the past blogposts and updated the links to files in the i repository.

Saturday, September 14, 2013

4.6.1: Using foreach to Loop Over More Than One Array

Zipping lists is to take some lists A, B, C, ... and produce a new list containing the elements (a1, b1, c1, ...), (a2, b2, c2, ...), and so on. A key decision here is whether the new list stops producing items once the shortest list is exhausted, or continues until the longest list is finished, filling in nil values for the exhausted lists.

type stop int

const (
    StopAtMin stop = iota
    StopAtMax
)

type eacharray struct {
    itrs   []i.Forward
    err    error
    stopAt stop
}

func EachArray(stopAt stop, itrs ...i.Forward) i.Forward {
    return &eacharray{itrs: itrs, stopAt: stopAt, err: nil}
}

The constructor takes an enum to control the behaviour vis-a-vis the length of individual streams, and a list of iterators.

func (i *eacharray) Value() interface{} {
    ret := make([]interface{}, len(i.itrs))
    for idx, itr := range i.itrs {
        if !itr.AtEnd() {
            ret[idx] = itr.Value()
        } else {
            ret[idx] = nil
        }
    }
    return ret
}

func (i *eacharray) Error() error {
    return i.err
}

func (i *eacharray) SetError(err error) {
    i.err = err
}

The Value() function loops over all iterators, collecting a value from each of them. If the iterator is at end, we use a nil value for that iterator.

func (i *eacharray) Next() error {
    for _, itr := range i.itrs {
        if !itr.AtEnd() {
            err := itr.Next()
            if err != nil {
                i.err = err
                break
            }
        }
    }
    return i.err
}

func (i *eacharray) AtEnd() bool {
    atEndCount := 0
    for _, itr := range i.itrs {
        if itr.AtEnd() {
            if i.stopAt == StopAtMin {
                return true
            }
            atEndCount++
        }
    }
    return atEndCount == len(i.itrs)
}

The Next() method loops over all iterators, calling Next() on each iterator that is not yet at its end. The AtEnd() method loops over all the iterators and checks each one for EOS. If stopAt is StopAtMin, we return true on the first EOS we encounter; otherwise we return false until all iterators return EOS.

itr := EachArray(
    StopAtMax,
    i.List(1, 2, 3, 4, 5, 6),
    i.List(6.4, 7.1, 8.2, 9.9),
    i.List("A", "B", "C", "D", "E"))

for ; !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Usage is simple and there is no need for all the lists to be of the same type.
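With StopAtMax, the exhausted lists are padded with nil, so the loop above should print:

[1 6.4 A]
[2 7.1 B]
[3 8.2 C]
[4 9.9 D]
[5 <nil> E]
[6 <nil> <nil>]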

Get the source at GitHub. It is also available in the iterator library as Zip.

Friday, September 13, 2013

4.4.4: append()

Go has a built-in append function; it simply appends one slice to another and returns the resulting slice. An append iterator is similar, the major difference being that you can start to work on the joined list without actually going through the process of copying the lists together. If the lists are long, this is a good strategy.
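For contrast, here is the eager, copying version using the built-in append; every element of b is copied up front before any work can begin:

a := []int{1, 2, 3}
b := []int{4, 5, 6}
c := append(a, b...) // c is [1 2 3 4 5 6]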

type iappend struct {
    itrs []i.Forward
    pos  int
    err  error
}

func Append(itrs ...i.Forward) i.Forward {
    a := &iappend{itrs: itrs}
    // Skip over any iterators that are empty to begin with, so that
    // Value() always points at a valid element.
    for a.pos < len(a.itrs) && a.itrs[a.pos].AtEnd() {
        a.pos++
    }
    return a
}

func (i *iappend) AtEnd() bool {
    if i.pos < len(i.itrs)-1 {
        return false
    } else if i.pos >= len(i.itrs) {
        return true
    }
    return i.itrs[i.pos].AtEnd()
}

func (i *iappend) Next() error {
    if i.pos >= len(i.itrs) {
        i.err = fmt.Errorf("Appending beyond last iterator")
        return i.err
    }
    // Propagate any error from the underlying iterator.
    if err := i.itrs[i.pos].Next(); err != nil {
        i.err = err
        return i.err
    }
    // When the current iterator is exhausted, move on to the next
    // non-empty one.
    for i.itrs[i.pos].AtEnd() {
        i.pos++
        if i.pos >= len(i.itrs) {
            return nil
        }
    }
    return nil
}

func (i *iappend) Value() interface{} {
    return i.itrs[i.pos].Value()
}

func (i *iappend) Error() error {
    if i.err != nil {
        return i.err
    }
    for _, v := range i.itrs {
        err := v.Error()
        if err != nil {
            return err
        }
    }
    return nil
}

func (i *iappend) SetError(err error) {
    i.err = err
}

Append simply runs through each iterator until it hits the end, then it switches to the next one. Once the last iterator is exhausted, Append quits.

The usage is very simple:

for itr := Append(i.List(1, 2, 3), igen.Range(10, 20)); !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Get the source at GitHub. It is also available in the iterator library.

Thursday, September 12, 2013

4.4.3: list_iterator()

In this section MJD introduces the imap and igrep iterators. We've already used them; imap was introduced in 4.3.3: Example - A Flat-File Database - part 2 (available in the iterator library as i.Map) while igrep was introduced in 4.3: Examples (available in the iterator library as i.Filter). I won't spend any more time on them here.

A list_iterator() transforms a Perl array into an iterator. Go's native container is the slice, so we will build an adaptor for the same purpose here.

One problem we run into is that a slice of some type, say []int, cannot be converted to a []interface{}; we would need to create a []interface{} of equal length and then copy the values over. This is not efficient and not something I want to do. I have two solutions to this problem. The first one is to write an adapter specific to the type you want to use.
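For illustration, the conversion in question is an explicit allocate-and-copy, something like:

ints := []int{1, 2, 3}
vals := make([]interface{}, len(ints))
for idx, v := range ints {
    vals[idx] = v // every element copied just to change the element type
}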

type intslice struct {
    slice []int
    pos   int
}

func IntSlice(vals []int) i.Forward {
    return &intslice{slice: vals}
}

func (i *intslice) AtEnd() bool {
    return i.pos >= len(i.slice)
}

func (i *intslice) Next() error {
    i.pos++
    return nil
}

func (i *intslice) Value() interface{} {
    return i.slice[i.pos]
}

func (i *intslice) Error() error {
    return nil
}

func (i *intslice) SetError(err error) {
}

This is a typical i.Forward iterator that simply wraps around the int slice you provide.

itr := IntSlice([]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11})
itr = i.Filter(
    func(itr i.Iterator) bool {
        n, _ := itr.Value().(int)
        return math.Mod(float64(n), 3) == 0
    }, itr)

for !itr.AtEnd() {
    fmt.Println(itr.Value())
    itr.Next()
}

I've created a script that generates iterators such as this for all the base types in Go. They are all available in the iterator library.


A second strategy is not to wrap a slice but to use Go's support for variadic arguments. Granted, if you are forced to wrap an existing slice this will not help you, but sometimes all we need is to wrap a list of values that we've written out in the source file.

type list struct {
    slice []interface{}
    pos   int
}

func List(vals ...interface{}) i.Forward {
    return &list{slice: vals}
}

func (i *list) AtEnd() bool {
    return i.pos >= len(i.slice)
}

func (i *list) Next() error {
    i.pos++
    return nil
}

func (i *list) Value() interface{} {
    return i.slice[i.pos]
}

func (i *list) Error() error {
    return nil
}

func (i *list) SetError(err error) {
}

This will work for any type, even a mix of different types.
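For instance, a list mixing ints, strings, floats and bools iterates just as well:

for itr := List(1, "two", 3.0, true); !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}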


itr = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
itr = i.Filter(func(itr i.Iterator) bool {
    n, _ := itr.Value().(int)
    return math.Mod(float64(n), 2) == 0
}, itr)

for !itr.AtEnd() {
    fmt.Println(itr.Value())
    itr.Next()
}

Get the source at GitHub. The variable argument version is also available in the iterator library.

Wednesday, September 11, 2013

4.3.6: Example - Random Number Generator

Our final example of an iterator is short and simple: a random number generator.

type Rnd struct {
    seed int64
    cur  float64
    gen  *rand.Rand
}

func Rand() *Rnd {
    var r Rnd
    r.seed = time.Now().UnixNano()
    r.First()
    return &r
}

The structure contains the seed used to generate the sequence, the current value (a float64 in the [0, 1) range) and a reference to Go's random number generator.
The constructor simply seeds a random sequence from the current time and generates the first value.

func (r *Rnd) First() error {
    r.gen = rand.New(rand.NewSource(r.seed))
    r.cur = r.gen.Float64()
    return nil
}

func (r *Rnd) Next() error {
    r.cur = r.gen.Float64()
    return nil
}

The iterator is a BoundedAtStart iterator, i.e. it has a beginning and can easily be restored to it through the First() method. Next() generates the next random value.
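Since First() re-seeds the generator from the stored seed, it rewinds the stream to its beginning; a quick sketch:

r := Rand()
first := r.Float64()
r.Next()
r.Next()
r.First() // re-seed: back to the start of the same sequence
fmt.Println(first == r.Float64()) // true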

func (r *Rnd) AtEnd() bool {
    return false
}

func (r *Rnd) Error() error {
    return nil
}

This iterator represents an infinite stream of values of the form [rval1, rval2, ...), therefore the only job of the AtEnd() method is to return false.

func (r *Rnd) Value() interface{} {
    return r.cur
}

func (r *Rnd) Float64() float64 {
    return r.cur
}

The iterator has a special type-casting method so we can read float64 values from it without having to do the type assertion ourselves.
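In other words, the helper saves us the assertion on Value():

v1, _ := r.Value().(float64) // manual type assertion
v2 := r.Float64()            // the helper does it for us
fmt.Println(v1 == v2)        // true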

func (r *Rnd) SetSeed(seed int64) {
    r.seed = seed
    r.First()
}

func (r *Rnd) Seed() int64 {
    return r.seed
}

Lastly, the iterator provides methods to retrieve the seed used to generate the sequence, as well as to reset the iterator to a previously saved seed value.
A sample usage is as follows:

fmt.Print("A quarted of random pairs: ")
ritr1, ritr2 := Rand(), Rand()
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

seed1, seed2 := ritr1.Seed(), ritr2.Seed()
fmt.Println("")
fmt.Print("A quarted of another random pairs: ")
ritr1, ritr2 = Rand(), Rand()
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

fmt.Println("")
fmt.Print("The first quarted of random pairs: ")
ritr1.SetSeed(seed1)
ritr2.SetSeed(seed2)
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

We generate three quartets of random pairs, where the first sequence and the last sequence are the same.
Get the source at GitHub. The Random iterator is also available in the iterator library as a RandomAccess iterator.

Tuesday, September 10, 2013

4.3.3: Example - A Flat-File Database - part 3

We conclude this example by going over the Query package.

type (
    FieldQueryFunc func(string) bool
)

A FieldQueryFunc is similar to an i.FilterFunc with the exception that it is specialized to work on string values rather than i.Iterator values.

func (db *Ffdb) Query(f i.FilterFunc, d Direction) i.Forward {
    return i.Filter(f, RecordItr(db, d))
}

func (db *Ffdb) QueryField(fieldname string, f FieldQueryFunc, d Direction) i.Forward {
    fieldname = strings.ToUpper(fieldname)
    if _, ok := db.fieldnames[fieldname]; !ok {
        panic(fmt.Errorf("Unknown field: %q\n", fieldname))
    }

    return db.Query(func(itr i.Iterator) bool {
        r, _ := itr.Value().(*Record)
        return f(r.Value(fieldname))
    }, d)
}

The Query package contains two fundamental methods that serve as building blocks for all other queries. The first, Query(), uses the i.Filter iterator to construct an i.Forward iterator that filters a Record stream in the direction of Direction with the filter f.
QueryField() constructs an i.FilterFunc and binds it to a field name and a FieldQueryFunc. We can now iterate over the stream of Records and ask specific questions about individual field values. Two examples are provided in the Query package.

func (db *Ffdb) QueryFieldRx(fieldname, rxs string, d Direction) i.Forward {
    rx := regexp.MustCompile(rxs)
    return db.QueryField(fieldname, func(val string) bool {
        return rx.MatchString(val)
    }, d)
}

func (db *Ffdb) QueryGreater(fieldname string, check float64, d Direction) i.Forward {
    return db.QueryField(fieldname, func(fieldvalue string) bool {
        value, _ := strconv.ParseFloat(fieldvalue, 64)
        return value > check
    }, d)
}

QueryFieldRx is a query that checks the field named fieldname against a regular expression rxs. QueryGreater is a query that returns records where the float64 value of the field fieldname is greater than some value supplied to the query.

An entire pipeline of iterators, from query to text file using e.g. the QueryFieldRx, would look something like this:

[Diagram: FieldName/Rx/Direction -> QueryFieldRx -> QueryField -> Query -> Record iterator -> Reader]

The blue lines represent how data moves through the construction of the pipeline. FieldName, Rx and Direction are sent to QueryFieldRx, which constructs a FieldQueryFunc and sends it along with the FieldName and Direction to the QueryField object. That constructs an i.FilterFunc and sends it with the Direction to the Query object. The Query constructs the Record iterator, which in turn constructs the Reader from the Direction.
Now that the pipeline is ready, we can read bytes from the file. The Reader turns those into a string for each line in the file and hands those over to the Record iterator. The Record iterator, being an i.Map iterator, maps the lines to Records, which the Query then filters and returns to the user.
Following is an example of how to use this package.

    db, _ := ffdb.NewFfdbHeader(os.Args[1], "[:]")
    defer db.Close()

    fmt.Println("From state: MA & NY")
    itrNY := db.QueryFieldRx("state", "NY", ffdb.Reverse)
    itrMA := db.QueryFieldRx("state", "MA", ffdb.Forward)
    for !itrNY.AtEnd() || !itrMA.AtEnd() {
        if !itrNY.AtEnd() {
            fmt.Println(itrNY.Value())
            itrNY.Next()
        }
        if !itrMA.AtEnd() {
            fmt.Println(itrMA.Value())
            itrMA.Next()
        }
    }
    fmt.Println("\nOwes more than 100")
    for itr := db.QueryGreater("owes", 100, ffdb.Forward); !itr.AtEnd(); itr.Next() {
        fmt.Println(itr.Value())
    }

We open the database, build some query iterators and run through them printing out the result.
The two source files discussed here are available at GitHub: Query and ffdb