When working with Python, Beautiful Soup is my go-to library for parsing HTML and XML. For JavaScript (Node.js) I was missing an equivalent, so I decided to search for and try some alternatives.
The Hacker News website already has a very simple design, but sometimes you want the bare minimum, so I decided to practice with it and build a "top HN news" CLI tool that simply prints the top 30 stories (the first page) at the time the script runs. Perfect for a morning coffee.
The packages I settled on to accomplish this task are the following:
- node-fetch: A fetch polyfill to use until Node 18, which ships a built-in fetch, becomes stable.
- htmlparser2: One of the two halves of the scraping/parsing. In this case, the one in charge of generating a DOM tree from the parsed HTML/XML.
- css-select: The other half of the parsing. A selector engine to easily get DOM fragments (using htmlparser2 alone is too hardcore).
- dom-serializer: In case you need it, the opposite of htmlparser2: given a DOM fragment, it renders the equivalent markup as a string.
- BONUS: htmlparser2-without-node-native: A fork without Node native modules, in case you want to use it client-side.
The Node.js code is small and not complex, but because the documentation of these libraries is quite poor (or almost non-existent), the following lines showcase how to fetch and parse a webpage, loop through a selectAll result set, use a selectOne selector, work with child nodes, and get the text data of an element and the value of an attribute (the href property):
import fetch from "node-fetch";
import * as htmlParser2 from "htmlparser2";
import * as cssSelect from "css-select";

const parseHtml = async (url) => {
  // Download the page and build a DOM tree from its markup
  const dom = htmlParser2.parseDocument(await fetch(url).then((res) => res.text()));
  // Loop through every table row matching the tr.athing selector
  for (const newsRow of cssSelect.selectAll("table tr.athing", dom)) {
    // The rank is the text node inside span.rank
    const rank = cssSelect.selectOne("span.rank", newsRow).firstChild.data;
    // The first child of the second td.title cell is used as the story link
    const titleBlock = cssSelect.selectAll("td.title", newsRow)[1].firstChild;
    console.log(`${rank} ${titleBlock.firstChild.data} ${titleBlock.attribs.href}`);
  }
};

parseHtml("https://news.ycombinator.com/");
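The script above assumes every selector matches and every node has the expected child, so it throws as soon as the page markup changes (selectOne returns null when nothing matches). As a minimal sketch of a more forgiving way to read text and attributes, here are two small helpers. Both textOf and attrOf are hypothetical names of my own, not part of htmlparser2 or css-select, and the example runs on a hand-built object that only mimics the node shape used above (type, data, children, attribs):

```javascript
// Hypothetical helper (not a library function): collects the text under a
// node and returns "" for null/undefined instead of throwing.
// Assumes text nodes have type "text" and a `data` string, and that
// elements expose their child nodes in a `children` array.
const textOf = (node) => {
  if (!node) return "";
  if (node.type === "text") return node.data;
  return (node.children ?? []).map(textOf).join("");
};

// Hypothetical helper: reads an attribute, or a fallback when it is missing.
const attrOf = (node, name, fallback = "") => node?.attribs?.[name] ?? fallback;

// Hand-built stand-in for <a href="https://example.com/"><b>Hello</b> world</a>.
// A real node would come from selectOne/selectAll instead.
const link = {
  type: "tag",
  name: "a",
  attribs: { href: "https://example.com/" },
  children: [
    { type: "tag", name: "b", attribs: {}, children: [{ type: "text", data: "Hello" }] },
    { type: "text", data: " world" },
  ],
};

console.log(textOf(link)); // Hello world
console.log(attrOf(link, "href")); // https://example.com/
console.log(attrOf(link, "title", "n/a")); // n/a
```

With helpers like these, a row with a missing rank or link prints an empty string instead of crashing the whole loop.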
And if you want to get HTML back, it's a single line of code:
import render from "dom-serializer";
// ...
console.log(render(dom));
And that's pretty much it.