Thursday, July 18, 2013

1.7: HTML

This section took more time than I expected simply because I was hit by the double whammy of not knowing Perl well enough and not knowing Go well enough at the same time.

What I failed to appreciate in the previous section (1.5: Applications and Variations of Directory Walking) was that Perl's push function, when given an array to push onto another array, flattens the second array, so I end up with a single array rather than an array of arrays. Missing this would blow up the second use case (promoting elements, see below), since I would end up with a tree of slices rather than a single flat slice of results.

My second problem was that once I had figured out the push problem, I could not work out the proper syntax for appending a slice to a slice in Go. In the end Effective Go led me to the correct solution: add a trailing "..." to the second slice argument, as in append(a, b...).
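The difference between the two forms of append can be sketched in a few lines. The helper names flatten and nest are made up for this illustration and are not part of the original source.

```go
package main

import "fmt"

// flatten appends every element of extra to dst, much like Perl's push
// flattens a list argument.
func flatten(dst []interface{}, extra []interface{}) []interface{} {
	return append(dst, extra...)
}

// nest appends extra as one single element, so the result holds a slice
// inside a slice.
func nest(dst []interface{}, extra []interface{}) []interface{} {
	return append(dst, extra)
}

func main() {
	base := []interface{}{"a"}
	extra := []interface{}{"b", "c"}

	fmt.Println(len(flatten(base, extra))) // 3 elements: a, b, c
	fmt.Println(len(nest(base, extra)))    // 2 elements: a and the slice [b c]
}
```

Note that nest only compiles because the element type is interface{}. With a []string destination the compiler would reject it, which is a useful hint that the "..." is missing.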

We start by declaring some types and functions for the generic walking function. These are analogous to the ones used by the previous dirwalk function.
type (
 ResultType  interface{}
 TextFunc    func(n *html.Node) ResultType
 ElementFunc func(n *html.Node, results []ResultType) ResultType
)
htmlwalk is our html walker. It accepts an html.Node and calls the TextFunc on text nodes and the ElementFunc on all other nodes. It calls itself recursively on each child: if a child's result is a slice of ResultType, its elements are appended one by one to the result list, otherwise the single result is appended as is.

func htmlwalk(n *html.Node, textf TextFunc, elementf ElementFunc) ResultType {
 if n.Type == html.TextNode {
  return textf(n)
 }
 results := make([]ResultType, 0)
 // Visit every child in document order.
 for child := n.FirstChild; child != nil; child = child.NextSibling {
  result := htmlwalk(child, textf, elementf)
  if multiresults, ok := result.([]ResultType); ok {
   // A child returned a list: flatten it into our own list.
   results = append(results, multiresults...)
  } else {
   results = append(results, result)
  }
 }
 return elementf(n, results)
}
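To see the flattening step in isolation, here is a minimal sketch on a toy tree, so that it runs without the html package. The node type and the walk function are hypothetical stand-ins for html.Node and htmlwalk, not part of the original source.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node: a leaf carries text,
// an inner node carries children.
type node struct {
	text     string
	children []*node
}

// walk mirrors the shape of htmlwalk: leaves return a single value, inner
// nodes return whatever elementf makes of the collected results.
func walk(n *node, elementf func(results []interface{}) interface{}) interface{} {
	if len(n.children) == 0 {
		return n.text
	}
	var results []interface{}
	for _, child := range n.children {
		result := walk(child, elementf)
		if multi, ok := result.([]interface{}); ok {
			results = append(results, multi...) // flatten a returned list
		} else {
			results = append(results, result) // keep a single value as is
		}
	}
	return elementf(results)
}

func main() {
	tree := &node{children: []*node{
		{text: "a"},
		{children: []*node{{text: "b"}, {text: "c"}}},
	}}
	// An element function that passes its results through unchanged.
	identity := func(results []interface{}) interface{} { return results }
	fmt.Println(walk(tree, identity)) // prints [a b c]
}
```

Even though "b" and "c" sit one level deeper than "a", the caller sees one flat list, which is the behaviour Perl's push gives for free.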
We apply this function to two use cases. The first is to simply walk through the html file and return all the text stripped of tags.

A simple and very permissive type assertion helper. If the value is not a string it simply returns an empty string.
func stringValue(v interface{}) string {
 if s, ok := v.(string); ok {
  return s
 }
 return ""
}
This is called on text nodes and it simply returns the text from the html node.

func untagText(n *html.Node) ResultType {
 return n.Data
}
The element function. Run through the result list and concatenate all the strings. 

func untagElement(n *html.Node, results []ResultType) ResultType {
 var b bytes.Buffer
 for _, v := range results {
  s := stringValue(v)
  _, err := b.WriteString(s)
  if err != nil {
   panic(err)
  }
 }
 return b.String()
}
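On Go 1.10 or later, strings.Builder is an alternative to bytes.Buffer for this kind of concatenation. The sketch below assumes a hypothetical concat function that plays the role of untagElement without the html.Node argument. The WriteString methods of both types are documented to always return a nil error, so the error check can be dropped.

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins every string in results and skips values of any other type.
// It is a hypothetical stand-in for untagElement.
func concat(results []interface{}) string {
	var b strings.Builder
	for _, v := range results {
		if s, ok := v.(string); ok {
			b.WriteString(s) // documented to always return a nil error
		}
	}
	return b.String()
}

func main() {
	fmt.Println(concat([]interface{}{"Hello, ", 42, "world"})) // prints Hello, world
}
```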
The second use case for the html walker is to construct code that will allow us to dig into an html document and print out only the text contained in the tag we specify. In this example we ask for the h1 tag.

First we define our tag structure (which htmlwalk will pass around as ResultType). Strings that we want to be promoted will be tagged with Keep, other strings will be tagged with Maybe.
type TagType int16

const (
 Maybe TagType = iota
 Keep
)

type tag struct {
 Type  TagType
 Value string
}
Extract the value from a ResultType. It can be either a *tag or a slice of ResultType; anything else is a programming error and panics.

func tagValue(v ResultType) (*tag, []ResultType) {
 if t, ok := v.(*tag); ok {
  return t, nil
 } else if t, ok := v.([]ResultType); ok {
  return nil, t
 }
 panic(fmt.Sprintf("Unknown value %v", v))
}
Any text node is a Maybe.

func promoteText(n *html.Node) ResultType {
 return &tag{Maybe, n.Data}
}
This function constructs the ElementFunc. If the node's tag name matches, we concatenate all the strings in the result list and promote the result to Keep. If not, we simply return the result list, which htmlwalk will then flatten into the final result list.

func promoteElementIf(tagname string) func(n *html.Node, results []ResultType) ResultType {
 return func(n *html.Node, results []ResultType) ResultType {
  if n.Data == tagname {
   var b bytes.Buffer
   for _, v := range results {
    t, _ := tagValue(v)
    _, err := b.WriteString(t.Value)
    if err != nil {
     panic(err)
    }
   }
   return &tag{Keep, b.String()}
  }
  return results
 }
}
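The pattern at work here is a function that returns a closure which remembers tagname. A minimal sketch of just that pattern, using plain strings instead of html nodes (promoteIf and its signature are made up for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// promoteIf is a hypothetical, simplified cousin of promoteElementIf.
// The returned function remembers want and reports whether the parts
// were promoted.
func promoteIf(want string) func(name string, parts []string) (string, bool) {
	return func(name string, parts []string) (string, bool) {
		if name == want {
			return strings.Join(parts, ""), true
		}
		return "", false
	}
}

func main() {
	h1 := promoteIf("h1")

	text, kept := h1("h1", []string{"Hello", " world"})
	fmt.Println(text, kept) // prints Hello world true

	_, kept = h1("p", []string{"ignored"})
	fmt.Println(kept) // prints false
}
```

Building the function this way means the tag to look for is chosen once, at the call site, while the walker itself stays unaware of it.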
The htmlwalk function will return a list of ResultType elements. We run through it, find all elements that are tagged with Keep and concatenate them into one string that we then return.

func extractPromoted(n *html.Node) string {
 _, results := tagValue(htmlwalk(n, promoteText, promoteElementIf("h1")))

 if results != nil {
  var b bytes.Buffer
  for _, r := range results {
   tag, _ := tagValue(r)
   if tag.Type == Keep {
    _, err := b.WriteString(tag.Value + " ")
    if err != nil {
     panic(err)
    }
   }
  }
  return b.String()
 }
 return ""
}
The final main function simply tests both of these use cases. You can call it either on a local html file or supply a URL using the -u flag.

var url = flag.Bool("u", false, "Use -u for url")

func main() {
 flag.Parse()

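 // flag.NArg counts only the arguments left after the flags, so a lone
 // -u without a NAME is caught here instead of crashing below.
 if flag.NArg() < 1 {
  fmt.Printf("Usage %s [-u] NAME\n", os.Args[0])
  os.Exit(1)
 }
 name := flag.Arg(0)

 var in io.Reader
 if *url {
  resp, err := http.Get(name)
  if err != nil {
   panic(err)
  }
  defer resp.Body.Close()
  in = resp.Body
 } else {
  fi, err := os.Open(name)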
  if err != nil {
   panic(err)
  }
  defer fi.Close()
  in = bufio.NewReader(fi)
 }

 n, err := html.Parse(in)
 if err != nil {
  panic(err)
 }

 fmt.Println(stringValue(htmlwalk(n, untagText, untagElement)))
 fmt.Println(extractPromoted(n))
}
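One detail worth knowing when mixing flags and positional arguments: os.Args still contains the flags themselves, while the flag package exposes only the remaining arguments through NArg and Arg. The sketch below uses a separate FlagSet so that it can be fed an explicit argument list; parseArgs is a hypothetical helper, not part of the original source.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// parseArgs is a hypothetical helper: it parses an explicit argument list
// and returns the positional NAME plus whether -u was given.
func parseArgs(args []string) (name string, isURL bool, err error) {
	fs := flag.NewFlagSet("htmlwalk", flag.ContinueOnError)
	u := fs.Bool("u", false, "Use -u for url")
	if err := fs.Parse(args); err != nil {
		return "", false, err
	}
	if fs.NArg() < 1 {
		return "", false, errors.New("missing NAME")
	}
	// Arg(0) is the first argument after the flags, with or without -u.
	return fs.Arg(0), *u, nil
}

func main() {
	name, isURL, err := parseArgs([]string{"-u", "http://example.com"})
	fmt.Println(name, isURL, err) // prints http://example.com true <nil>

	_, _, err = parseArgs([]string{"-u"})
	fmt.Println(err) // prints missing NAME
}
```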
Get the source at GitHub.
