Comparison of HTML parsers
Appearance
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
|
HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:
- HTML traversal: offer an interface for programmers to easily access and modify the "HTML string code". Canonical example: DOM parsers.
- HTML clean: to fix invalid HTML and to improve the layout and indent style of the resulting markup. Canonical example: HTML Tidy.
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
HTML Tidy | W3C license | ANSI C | 2021-07-17[2] | Yes[3] | Yes | Yes[3] | Yes |
HtmlUnit | Apache License 2.0 | Java | 2023-10-31[4] | Yes | ? | No | No |
Beautiful Soup | MIT License | Python | 2023-04-07[5] | Yes | Yes | ? | No |
jsoup | MIT License | Java | 2024-07-10[6] | Yes | Yes | Yes | Yes |
Parser | License | Implementation language(s) | Latest date* | HTML Parsing | HTML5-compliant Parsing | Clean HTML** | Update HTML*** |
- * Latest release (of significant changes) date.
- ** sanitize (generating standard-compatible web-page, reduce spam, etc.) and clean (strip out surplus presentational tags, remove XSS code, etc.) HTML code.
- *** Updates HTML4.X to XHTML or to HTML5, converting deprecated tags (ex. CENTER) to valid ones (ex. DIV with
style="text-align:center;"
).