Parsing HTML and XML in JavaScript

When working with Python, Beautiful Soup is my go-to library for parsing HTML and XML. For JavaScript (Node.js) I was missing an equivalent, so I decided to search for and try some alternatives.

The Hacker News website already has a very simple design, but sometimes you want the bare minimum. So I decided to practice with it and build a "top HN news" CLI tool that simply prints the top 30 stories (the first page) at the time the script runs. Perfect for a morning coffee.

The packages I settled on to accomplish this task are the following:

node-fetch: A fetch polyfill to use until Node 18, which ships a built-in fetch, becomes stable.
htmlparser2: One of the two halves of the scraping/parsing. In this case, the one in charge of generating a DOM tree from the parsed HTML/XML.
css-select: The other half of the parsing. A selector engine to easily get DOM fragments (using htmlparser2 alone is too hardcore).
dom-serializer: In case you need it, the opposite of htmlparser2: given a DOM fragment, it renders the equivalent markup as a string.
BONUS: htmlparser2-without-node-native: A fork without Node native modules, in case you want to use it client-side.
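node-fetch is only needed while the runtime lacks a built-in fetch. A minimal sketch of picking whichever is available, assuming node-fetch v3 (ESM-only, fetch as default export) is installed for the fallback; resolveFetch is a hypothetical helper name, not part of either API:

```javascript
// Hypothetical helper: prefer the built-in fetch (global in Node 18+),
// and only load node-fetch when the runtime does not provide one.
const resolveFetch = async () => {
  if (typeof globalThis.fetch === "function") {
    return globalThis.fetch;
  }
  // Assumption: node-fetch v3 is installed and exposes fetch as its default export.
  const { default: nodeFetch } = await import("node-fetch");
  return nodeFetch;
};
```

Inside an async function, `const fetch = await resolveFetch();` gives you something you can call the usual way.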

The Node.js code is small and simple, but the documentation of these libraries is quite poor (or almost non-existent). The following lines showcase how to fetch and parse a webpage, loop through a selectAll result set, use a selectOne selector, work with child nodes, and read an element's text (data) and an attribute (href):

import fetch from "node-fetch";
import * as htmlParser2 from "htmlparser2";
import * as cssSelect from "css-select";

const parseHtml = async (url) => {
  // Download the page and build a DOM tree from its markup
  const html = await fetch(url).then((res) => res.text());
  const dom = htmlParser2.parseDocument(html);
  // Each story on the page is a <tr class="athing"> row
  for (const newsRow of cssSelect.selectAll("table tr.athing", dom)) {
    // Text nodes expose their content through the data property
    const rank = cssSelect.selectOne("span.rank", newsRow).firstChild.data;
    // The second td.title cell is assumed to hold the story link as its first child
    const titleBlock = cssSelect.selectAll("td.title", newsRow)[1].firstChild;
    console.log(`${rank} ${titleBlock.firstChild.data} ${titleBlock.attribs.href}`);
  }
};

parseHtml("https://news.ycombinator.com/");
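Reading firstChild.data only works when the first child is a text node, and it throws if a selector finds nothing. Below is a minimal sketch of a more defensive way to read text. It assumes the node shape htmlparser2 produces: text nodes carry data, elements carry children. textOf is a hypothetical helper, and the nodes are hand-written stand-ins so the snippet runs without any package installed:

```javascript
// Hypothetical helper: collect the text of a node and all its descendants.
// Returns an empty string for null/undefined, so a failed selectOne does not throw.
const textOf = (node) => {
  if (!node) return "";
  if (node.type === "text") return node.data;
  return (node.children ?? []).map(textOf).join("");
};

// Hand-written nodes standing in for parser output:
// <a href="/x">Hello <b>world</b></a>
const link = {
  type: "tag",
  name: "a",
  attribs: { href: "/x" },
  children: [
    { type: "text", data: "Hello " },
    { type: "tag", name: "b", attribs: {}, children: [{ type: "text", data: "world" }] },
  ],
};

console.log(textOf(link)); // Hello world
console.log(link.attribs.href); // /x
```

The domutils package, which htmlparser2 re-exports as DomUtils, ships a textContent function that serves the same purpose, so in a real script you would probably use that instead of rolling your own.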

And if you want to get HTML back, it's a single line of code:

import render from "dom-serializer";

// ...

console.log(render(dom));

And that's pretty much it.

Tags: Development

Parsing HTML and XML in JavaScript article, written by Kartones. Published @ 2022-07-03