Parsing
Parsing is the process of breaking code down into individual chunks, checking that those chunks fit together according to the rules of the language, and building a structured representation of the code that a program can then act on.
In the context of the Web, parsing most commonly happens when a web browser receives the files that make up a website. Every web browser is equipped with a rendering engine that converts those files into the web page you see in your browser. The rendering engine contains several parsers, the components that parse code before the web page is rendered. There is a different parser for each language the browser supports. At a minimum, any modern browser can parse HTML, CSS, and JavaScript.
When a browser receives website files, each file is parsed individually, and parsing takes place in two steps:
Lexical analysis: During lexical analysis the code is analyzed and broken down into individual tokens, or bits of code, that the parser can work with to create a hierarchical model of the contents of the document. Lexical analysis of HTML is also sometimes referred to as tokenization.
Syntax analysis: After the lexical analysis has broken the code into workable chunks, the syntax analysis determines how these chunks relate to each other and builds a model of how the rendering engine should process the code based on this analysis. Syntax analysis of HTML is also sometimes referred to as tree construction since it is the process of arranging the tokens into something called a DOM tree that will define the overall structure of the web page.
Here’s a simplified example of how a browser’s HTML parser would handle a short bit of HTML:
<html>
<body>
<p>Random paragraph text.</p>
</body>
</html>
First, lexical analysis would tokenize this HTML into a sequence of tokens: start tags for the html, body, and p elements, the paragraph's text, and the matching end tags. Second, syntax analysis would arrange these tokens into a tree looking something like this:
HTML --> Body --> Paragraph --> Text
In this simplified example, the tree has only a single branch. In virtually every real-life example, there would be many branches.
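The two steps can be sketched in Python with the standard library's html.parser module. This is only an illustration, not how a browser's parser is implemented: the Node and TreeBuilder classes and the outline helper are hypothetical names made up for this example, and the sketch assumes well-formed input like the snippet above.

```python
from html.parser import HTMLParser

# The example document from above, written on one line.
SAMPLE = "<html><body><p>Random paragraph text.</p></body></html>"


class Node:
    """Hypothetical, minimal stand-in for a DOM node."""

    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Records each token, then arranges the tokens into a tree."""

    def __init__(self):
        super().__init__()
        self.tokens = []                      # result of lexical analysis
        self.root = Node("#document")         # result of syntax analysis
        self.open_elements = [self.root]      # elements not yet closed

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag))
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        self.open_elements.append(node)

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))
        # Assumption: input is well formed, so the end tag closes the
        # most recently opened element.
        if len(self.open_elements) > 1 and self.open_elements[-1].name == tag:
            self.open_elements.pop()

    def handle_data(self, data):
        if data.strip():                      # ignore whitespace-only text
            self.tokens.append(("text", data))
            self.open_elements[-1].children.append(Node("#text", data))


def outline(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    label = node.name if node.text is None else f'{node.name} "{node.text}"'
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()

print(builder.tokens)
print("\n".join(outline(builder.root)))
```

The token list corresponds to the lexical analysis step, and the indented outline corresponds to the single-branch tree described above, with each level nested inside its parent.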
Also See: HTML, Browser, World Wide Web
Frequently Asked Questions
Are languages other than HTML also parsed?
Every programming language is parsed as a part of the process of executing the code. How parsing happens varies from one programming language to the next, but the basic premise remains the same with all programming languages.
In order for a program, such as a web browser, to be able to parse a certain type of code, it must be equipped with a parser suited to that type of code. This is why a browser will have separate parsers for HTML, CSS, and JavaScript. CSS code has to be understood in a fashion completely unlike HTML, and the same is true of JavaScript. As a result, while the basic ideas behind parsing are relatively constant from one language to the next, the actual mechanics of parsing code vary drastically.
It’s also worth noting that our discussion of parsing has been limited to the parsing of web pages in a browser. However, parsing happens on all types of computers and in all types of applications.
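As a small illustration outside the browser, Python's standard library ships separate parsers for Python source code (the ast module) and for JSON (the json module). The sketch below feeds each parser its own language and then hands Python source to the JSON parser to show that a parser only understands the language it was built for. The variable names are chosen for this example only.

```python
import ast
import json

python_source = "total = 1 + 2"
json_source = '{"total": 3}'

# The Python parser turns source text into a syntax tree.
tree = ast.parse(python_source)
statement = tree.body[0]                 # the assignment statement
print(type(statement).__name__)          # kind of statement found
print(type(statement.value).__name__)    # kind of expression on the right

# The JSON parser turns JSON text into Python data structures.
data = json.loads(json_source)
print(data)

# Handing Python source to the JSON parser fails: the text does not
# follow JSON's grammar, so the parser rejects it.
try:
    json.loads(python_source)
    json_accepts_python = True
except json.JSONDecodeError:
    json_accepts_python = False

print(json_accepts_python)
```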
Do all browsers parse HTML exactly the same way?
A language's specification will define how the language should be parsed. In the case of HTML, the HTML standard (published as HTML5 by the World Wide Web Consortium and now maintained as the HTML Living Standard by WHATWG) defines how HTML should be parsed. However, HTML is a very forgiving language, and many browsers include a variety of fixes that allow a wide range of HTML mistakes to parse acceptably. As a result, the HTML parser of one browser may or may not handle a given piece of HTML in exactly the same way as another browser's parser.
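The forgiving nature of HTML parsing can be seen, in a limited way, with Python's html.parser module. It is not a browser parser and does not implement the error recovery rules of the HTML standard, so treat this only as a sketch: the EventLogger class is a hypothetical helper, and the markup below is deliberately broken (unclosed paragraphs and a stray end tag).

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper that records every token the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


# Deliberately invalid: neither <p> is closed, and </b> was never opened.
BROKEN = "<p>First<p>Second</b>"

logger = EventLogger()
logger.feed(BROKEN)   # no exception is raised for the malformed markup
logger.close()

for event in logger.events:
    print(event)
```

The parser reports the tokens it finds rather than rejecting the document. What a consumer then does with unclosed or unmatched tags is exactly where different parsers can diverge.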