JSP Parser
>([^<>]+)<
This solution is simple and gives reasonable results (an error rate of about 10%). Regular expressions are a powerful and handy tool for string manipulation. They are popular in scripting, yet not many Java programmers know them or take the time to learn them.
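To make the pattern above concrete, here is a minimal sketch of how it could be applied with java.util.regex. The class name TextBetweenTags, the method extract and the sample markup are made up for illustration; this is not the code I actually ran.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text found between a '>' and the next '<'.
class TextBetweenTags {
    // The simple pattern: one or more characters that are neither '<' nor '>',
    // enclosed by the end of one tag and the start of the next.
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");

    static List<String> extract(String markup) {
        List<String> found = new ArrayList<>();
        Matcher m = TEXT.matcher(markup);
        while (m.find()) {
            String text = m.group(1).trim();
            if (!text.isEmpty()) {          // skip whitespace-only gaps between tags
                found.add(text);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Ordinary markup: only the visible text is picked up.
        System.out.println(extract("<tr><td>Name</td><td>Address</td></tr>"));
        // Script content is also "text between tags", so it is picked up as well.
        System.out.println(extract("<script>var x = 1;</script>"));
    }
}
```

The second call shows one kind of false hit: the pattern has no idea what a tag means, so anything sitting between a '>' and a '<' counts as text.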
So I decided to refine the pattern into a more complicated one, in order to reduce the error rate.
Then I found this pattern, <([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>, at http://www.regular-expressions.info/examples.html. This was the first time I had come across backreferences (the \1 in the pattern). I thought it had really solved my problem, since it can scan for matched tags. While this solution works most of the time, it does have a minor drawback: it cannot handle nested tags (such as <td>...<td> ...</td> ...</td>).
Although this is acceptable, it is not a perfect solution. Tedious human verification is still required.
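The nested-tag drawback is easy to reproduce. The sketch below assumes the complete form of the pattern, ending in the closing tag </\1>, and compiles it as case-insensitive with dot matching newlines. The class name PairedTags, the method contents and the sample strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the content of each opening/closing tag pair.
class PairedTags {
    // \1 refers back to whatever tag name group 1 captured,
    // so <td> can only be closed by </td>.
    private static final Pattern PAIRED = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> contents(String markup) {
        List<String> found = new ArrayList<>();
        Matcher m = PAIRED.matcher(markup);
        while (m.find()) {
            found.add(m.group(2));   // group 2 is the text between the tag pair
        }
        return found;
    }

    public static void main(String[] args) {
        // Flat tags: each pair is matched correctly.
        System.out.println(contents("<b>bold</b><i>italic</i>"));
        // Nested tags: the outer <td> is paired with the inner </td>.
        System.out.println(contents("<td>outer <td>inner</td> tail</td>"));
    }
}
```

In the nested case the lazy (.*?) stops at the first </td> it sees, so the content comes back as "outer <td>inner" and the rest of the outer cell is lost. A regular expression of this kind cannot count nesting depth, which is why the results still had to be checked by hand.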
...
...
...
After trying it, one final issue remained to be solved...
HTML Parser cannot properly handle a </tag that appears between <SCRIPT> and </SCRIPT>. It assumes that the first </tag marks the end of <SCRIPT>. Although this is hardly an issue in normal HTML, in many JSPs, JSP tags such as <logic:present ... can appear anywhere, including between script tags.
Since HTML Parser is licensed under the GNU Lesser General Public License, it is legal to modify it ;P
After trying to understand its code, I modified org.htmlparser.lexer.Lexer.parseCDATA(boolean) so that it looks for the ending tag that matches the opening tag. It works!
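The idea behind the change can be sketched in plain Java without the library. This is not HTML Parser's code and not my actual patch, only an illustration of the two strategies. The class ScriptEndFinder and both methods are hypothetical names.

```java
// Hypothetical illustration of the two ways to find where script content ends.
class ScriptEndFinder {
    // Behaviour described above: the first "</" is taken as the end of the script.
    static int firstClosingTag(String html, int from) {
        return html.indexOf("</", from);
    }

    // Idea of the fix: only a closing tag with the same name as the opening tag
    // ends the content. The comparison ignores case.
    static int matchingClosingTag(String html, int from, String tagName) {
        String needle = "</" + tagName;
        for (int i = from; i + needle.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;   // no matching closing tag found
    }

    public static void main(String[] args) {
        String open = "<SCRIPT>";
        String html = open
                + "init(); <logic:present name=\"user\">greet();</logic:present> done();"
                + "</SCRIPT>";
        int start = open.length();

        // Stops at </logic:present>, so the script body is cut short.
        System.out.println(html.substring(start, firstClosingTag(html, start)));
        // Stops at </SCRIPT>, so the JSP tag stays inside the script body.
        System.out.println(html.substring(start, matchingClosingTag(html, start, "script")));
    }
}
```

A real lexer also has to deal with quoted strings and comments inside the script, which this sketch ignores.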
Lesson learnt: good software is easy to understand, modify or fix (without breaking the code) and test. Related behaviours are kept in a single place (i.e. it is cohesive).
HTML Parser is a good example of quality software that undergoes continuous unit testing, enhancement and refactoring.
