HtmlCleaner - An HTML parser in Java

I wanted to parse the pages of a website. At first I looked at the pages' design and assumed that the HTML was well-formed. However, a DOM parser couldn't parse the pages and reported that they were not well-formed. A closer look revealed that some of the tags were not closed.
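
To illustrate the failure described above, here is a minimal sketch that uses only the JDK's built-in DOM parser. The class name WellFormedCheck and the sample markup are made up for illustration; the real pages were of course larger. An unclosed tag such as <br> is enough to make the parser reject the input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class WellFormedCheck {

    // Returns true if the JDK's XML DOM parser accepts the markup.
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(
                    markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Thrown when the input is not well-formed, e.g. a tag is left open.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // <br> and <p> are never closed, as on many real-world pages.
        String sloppy = "<html><body><p>Hello<br></body></html>";
        String strict = "<html><body><p>Hello<br/></p></body></html>";
        System.out.println("sloppy is well-formed: " + isWellFormed(sloppy));
        System.out.println("strict is well-formed: " + isWellFormed(strict));
    }
}
```

The default parser may also print the parse error to standard error before the exception is caught.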

My next step was to search for tools that facilitate parsing HTML pages in Java. I found that a number of HTML parsers are available. Among them I chose HtmlCleaner, a tool that can clean HTML pages and give us a DOM document. Since the pages contain Nepali characters, I had to use UTF-8 encoding. Fortunately, HtmlCleaner supports that.
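
As a rough sketch of that workflow (not the sample program mentioned below), the following assumes the HtmlCleaner 2.x jar is on the classpath. The class name CleanToDom and the sample markup are hypothetical. It reads the bytes as UTF-8, lets HtmlCleaner repair the markup, and then converts the result to a standard org.w3c.dom.Document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

// Hypothetical helper; assumes HtmlCleaner 2.x is available.
class CleanToDom {

    // Cleans possibly malformed HTML and returns it as a W3C DOM document.
    static Document toDom(String html) {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        try (InputStream in = new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8))) {
            // Tell HtmlCleaner explicitly that the bytes are UTF-8.
            TagNode root = cleaner.clean(in, "UTF-8");
            return new DomSerializer(props).createDOM(root);
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <p> and <br>, plus Nepali text that needs UTF-8.
        Document doc = toDom("<html><body><p>नमस्ते<br></body></html>");
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

For a real page, the ByteArrayInputStream would be replaced by the stream of the downloaded page; the rest stays the same.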

The HtmlCleaner website doesn't provide a complete example. However, a user has posted a sample program (given in this URL) that really helped me get started with HTML parsing.
