Parsing is the process of breaking code down into individual chunks, verifying that those chunks are complete and follow the rules of the language, and arranging them into a structure that the computer can act on.
When website files are received by a browser, each file is parsed individually, and parsing takes place in two steps:
Lexical analysis: During lexical analysis the code is analyzed and broken down into individual tokens, or bits of code, that the parser can work with to create a hierarchical model of the contents of the document. Lexical analysis of HTML is also sometimes referred to as tokenization.
Syntax analysis: After the lexical analysis has broken the code into workable chunks, the syntax analysis determines how these chunks relate to each other and builds a model of how the rendering engine should process the code based on this analysis. Syntax analysis of HTML is also sometimes referred to as tree construction since it is the process of arranging the tokens into something called a DOM tree that will define the overall structure of the web page.
Here’s a simplified example of how a browser’s HTML parser would handle a short bit of HTML:
<p>Random paragraph text.</p>
First, lexical analysis would tokenize this HTML into the following chunks: a paragraph start tag, the text, and a paragraph end tag. Second, syntax analysis would arrange these tokens into a tree, adding the HTML and body elements that the markup leaves implied, so that the result looks something like this:
HTML --> Body --> Paragraph --> Text
In this simplified example, the tree has only a single branch. In virtually every real-life example there would be many branches.
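The two steps can be sketched in a few lines of Python. This is only an illustration, not how any browser is actually implemented: it uses the HTMLParser class from Python’s standard library to produce the tokens, and the names TokenCollector, tokenize, and build_tree are hypothetical helpers invented for this example. The tree builder assumes every page has an implied HTML and body element and handles only simple, well-nested markup.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Lexical analysis: record each token the parser reports."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        self.tokens.append(("text", data))


def tokenize(html):
    """Return the list of (kind, value) tokens found in the HTML string."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return collector.tokens


def build_tree(tokens):
    """Syntax analysis: nest tokens under an assumed html/body root."""
    body = {"name": "body", "children": []}
    root = {"name": "html", "children": [body]}
    stack = [body]  # the innermost open element is always last
    for kind, value in tokens:
        if kind == "start":
            node = {"name": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            # Close the element only if it matches the innermost open one.
            if len(stack) > 1 and stack[-1]["name"] == value:
                stack.pop()
        else:
            stack[-1]["children"].append({"name": "text", "value": value})
    return root


tokens = tokenize("<p>Random paragraph text.</p>")
tree = build_tree(tokens)
```

Here tokens holds a start tag, a text token, and an end tag, and tree nests the paragraph and its text under body, which in turn sits under html, matching the single-branch tree described above.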
Frequently Asked Questions
Are languages other than HTML also parsed?
Every programming language is parsed as a part of the process of executing the code. How parsing happens varies from one programming language to the next, but the basic premise remains the same with all programming languages.
It’s also worth noting that our discussion of parsing has been limited to the parsing of web pages in a browser. However, parsing happens on all types of computers and in all types of applications.
Do all browsers parse HTML exactly the same way?
A language’s standard specification defines how the language should be parsed. In the case of HTML, the HTML standard, published as HTML5 by the World Wide Web Consortium and now maintained as a living standard by the WHATWG, defines how HTML should be parsed. However, HTML is a very forgiving language, and browsers include a variety of fixes that allow a wide range of HTML mistakes to parse acceptably. As a result, the HTML parser of one browser may or may not render HTML exactly the same way as another browser’s parser.
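To give a feel for what a forgiving parser does, here is a minimal sketch in Python, again built on the standard library’s HTMLParser. It is not a browser’s algorithm. The class ForgivingParagraphs and the function extract_paragraphs are hypothetical names made up for this example, and the only fix it applies is one simple rule: a new paragraph start tag implicitly ends any paragraph that is still open.

```python
from html.parser import HTMLParser


class ForgivingParagraphs(HTMLParser):
    """Collect paragraph text, tolerating missing </p> end tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> implicitly closes any paragraph still open.
            self.paragraphs.append("")
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.paragraphs[-1] += data


def extract_paragraphs(html):
    """Return the text of each paragraph in the HTML string."""
    parser = ForgivingParagraphs()
    parser.feed(html)
    parser.close()  # flush any trailing text
    return parser.paragraphs


well_formed = extract_paragraphs("<p>First</p><p>Second</p>")
sloppy = extract_paragraphs("<p>First<p>Second")
```

Both calls produce the same two paragraphs even though the second input never closes its paragraph tags. A parser with a different set of fixes, or none at all, could interpret the sloppy input differently, which is why results can vary between parsers.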