Markup Languages

What is a Markup Language

This page introduces what is called "markup" and markup languages. Markup is anything added to a text document that conveys extra information about it. For example, if we want a word to be displayed in italic or in bold, we apply the corresponding markup tags to that word. In this sense, markup describes how the document should appear on the screen or on the printed page. Markup is used extensively not only in the computing world at large but also in everyday electronic documents, such as word processor files and LaTeX files, so that marked-up words are displayed on the screen and on the printed page in the intended format.

Markup languages are a product of the information age. They formalize the codes used to mark up the content of electronic documents; that is, each one is a set of conventions defining things such as a) what marks or elements are allowed, b) where elements may occur, and c) whether any or all of the elements must occur somewhere in a document to which the language has been applied.

There are two types of markup and hence two types of markup languages: procedural and generalized.

Procedural Markup Languages - Procedural markup is typified by its use in typesetting and publishing systems, including word processors. The elements are placed right in the flow of text, and the markup languages that define them have the following characteristics:

1. Documents marked up with procedural markup languages contain explicit instructions for the document-rendering program, so that it produces output of the original content in a particular format and style.

2. The formatting instructions are likely to be specific to the output medium, so the document containing the original content interspersed with markup is not portable across different output media.

A common procedural markup language is the Rich Text Format (RTF), which has markup elements such as \par for paragraph, \b for bold, \i for italic and so on. This kind of markup language is well suited to formatting when the documents are always destined for the printed page or some other single medium. PostScript and TeX are other widely used procedural markup languages.
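
To make the procedural idea concrete, here is a minimal Python sketch that interprets a tiny, simplified subset of RTF-style control words (\b, \b0, \i, \i0 and \par). It is an illustration only, not a real RTF parser: the function name render_runs and the sample string are assumptions made for this example, and real RTF also involves groups, headers and many more control words.

```python
import re

# Simplified RTF-like tokens: either a control word such as \b, \b0, \i or
# \par (optionally followed by one delimiting space), or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(\d*) ?|([^\\]+)")

def render_runs(source):
    """Return a list of (text, bold, italic) runs in document order."""
    bold = italic = False
    runs = []
    for word, arg, text in TOKEN.findall(source):
        if text:
            runs.append((text, bold, italic))
        elif word == "b":
            bold = arg != "0"      # \b switches bold on, \b0 switches it off
        elif word == "i":
            italic = arg != "0"    # \i switches italic on, \i0 switches it off
        elif word == "par":
            runs.append(("\n", False, False))  # end of paragraph
    return runs

sample = r"A \b bold\b0  word\par"
for run in render_runs(sample):
    print(run)
```

Notice that the markup only tells the renderer what to switch on and off at each point in the text flow. Nothing in it says what the bold word means, which is why extracting information from such a document is difficult.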

Generalized Markup Languages (GML) - Procedural markup languages have a major shortcoming: if we intend to extract information from the documents, procedural markup is found wanting. To meet this challenge, a generalized markup language marks up documents in a different way. The characteristics of these languages are:

1. The elements have logical names, rather than expressing detailed formatting instructions. For example, an element H1 is used to mark up text that is intended to be a first-level header.

2. Software applications that read documents marked up using a GML are free to present them as they see fit, using formatting rules for particular elements that are either defined internally or specified elsewhere. When displayed on the screen, for example, the H1 element could be associated with a particular combination of font size and weight.

GML elements usually involve both start and end tags so that the original content is fully contained inside an element. The markup itself gives no hint about how the document should be presented; a web browser, for example, is free to decide how to reflect the meaning of each element. The most commonly used generalized markup languages are HTML for web documents and WML for mobile content.
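
Because the elements carry logical names, software can extract information without knowing anything about presentation. The following Python sketch shows the idea under some assumptions: the sample fragment is invented for illustration and is written so that it is well-formed XML, which lets the standard library module xml.etree.ElementTree read it. The function name first_level_headers is hypothetical.

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment that uses logical element names (assumed sample).
document = """
<body>
  <h1>Markup Languages</h1>
  <p>Markup adds <em>extra</em> information to text.</p>
  <h1>Procedural Markup</h1>
</body>
"""

def first_level_headers(source):
    """Collect the text of every h1 element, ignoring presentation entirely."""
    root = ET.fromstring(source)
    return [h1.text for h1 in root.iter("h1")]

print(first_level_headers(document))
```

The same request would be much harder to answer for a procedurally marked-up document, where a header is recognizable only as text that happens to be large and bold.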

Generalized Markup Rule-sets - Each GML has its own elements, its own rules, and its own particular area of application, and for the language to function properly as a language, these must all be defined somehow. GMLs are themselves written using Generalized Markup Rule-sets (GMRS), also called meta-languages. The two best-known meta-languages are the Standard Generalized Markup Language (SGML) and the Extensible Markup Language (XML). XML is a simplified subset of SGML, and WML is in turn defined using XML.

SGML is an international standard designed to integrate documents in different proprietary formats, and to enable sharing of documents among text editing, formatting, and retrieval subsystems. SGML has been approved by the International Organization for Standardization (ISO) as ISO 8879. SGML is not itself used to mark up a document; rather, it is a meta-language used to create markup languages that suit different application domains.

The basic design principles of SGML emphasize the importance of separating formatting instructions from content. When a browser such as Internet Explorer displays an HTML document that contains h1 and em elements, it does so by using an internal stylesheet that specifies how these elements are to be represented on the screen. In general, it is possible to create several stylesheets that contain instructions for outputting the content of a marked-up document to a variety of output devices. SGML intends the names of elements to describe what their content represents in the application domain.
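
The idea of several stylesheets for one document can be sketched in Python as follows. This is a toy illustration, not how any browser actually works: the two style tables, the render function and the sample document are all assumptions made for this example. Each table maps a logical element name to the text emitted before and after that element's content for one output medium.

```python
import xml.etree.ElementTree as ET

# Two hypothetical "stylesheets": element name -> (prefix, suffix) per medium.
TERMINAL_STYLE = {
    "h1": ("\033[1m", "\033[0m\n"),   # ANSI bold on, then reset
    "em": ("\033[3m", "\033[0m"),     # ANSI italic on, then reset
    "p": ("", "\n"),
}
PLAIN_STYLE = {
    "h1": ("== ", " ==\n"),
    "em": ("*", "*"),
    "p": ("", "\n"),
}

def render(element, style):
    """Recursively render an element tree using the given style table."""
    prefix, suffix = style.get(element.tag, ("", ""))
    parts = [prefix, element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")   # text that follows the child element
    parts.append(suffix)
    return "".join(parts)

source = "<body><h1>Title</h1><p>Some <em>important</em> text.</p></body>"
root = ET.fromstring(source)
print(render(root, PLAIN_STYLE))
print(render(root, TERMINAL_STYLE))
```

The document itself is never changed; only the style table passed to render differs between the two outputs.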

A Document Type Definition (DTD) is used to define the set of valid elements for a particular GML as well as the content model of each element. To facilitate electronic processing, documents marked up using GMLs are highly structured, and the notion of elements being able to contain other elements is a powerful one in SGML. For instance, an element of one type may always contain one or more elements of a second type, which in turn contain many elements of a third type.

There are a few rules for using SGML. Element names in SGML-based languages are not case sensitive, unless specified explicitly in the DTD. SGML allows tag omission, cross-element nesting and mixed-case element names in its applications. Finally, every SGML document requires a DTD against which its validity is checked. This means that a document cannot use any element that is not declared in the DTD.
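
The role a DTD plays in checking validity can be approximated with a short Python sketch. Note the assumptions: the CONTENT_MODEL table below is a hypothetical stand-in for a DTD, the element names (book, chapter, title, para, em) are invented for illustration, and the check covers only which elements are declared and where they may appear. A real DTD also constrains order and number of occurrences, and Python's standard xml.etree.ElementTree module does not validate documents against DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical content model standing in for a DTD: each declared element
# name maps to the set of child element names it may contain.
CONTENT_MODEL = {
    "book": {"chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": {"em"},
    "em": set(),
}

def validation_errors(element, model=CONTENT_MODEL):
    """Return a list of messages for undeclared or misplaced elements."""
    if element.tag not in model:
        return [f"undeclared element <{element.tag}>"]
    errors = []
    for child in element:
        if child.tag in model and child.tag not in model[element.tag]:
            errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        errors.extend(validation_errors(child, model))
    return errors

valid = ET.fromstring(
    "<book><chapter><title>One</title><para>Text</para></chapter></book>"
)
invalid = ET.fromstring(
    "<book><chapter><blink>Hi</blink></chapter><para/></book>"
)
print(validation_errors(valid))
print(validation_errors(invalid))
```

The first document produces no messages, while the second is reported for using an element that the model never declares and for placing a declared element where the model does not permit it.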

SGML, as a meta-language, is used to derive markup languages for different communities such as publishers, academic institutions, and government organizations. These derived languages are called applications of SGML. The Hypertext Markup Language (HTML) is one such application, defined by a DTD and used to describe content in web documents.

HTML has been an excellent standard for web publishing. It has developed into a powerful tool that provides a wealth of well-defined elements for marking up web documents. However, we cannot use elements that are specific to another application domain to structure our document content; we are confined to the set of HTML elements in every application domain. The solution to this issue came in the form of the Extensible Markup Language (XML), a subset of SGML.

XML is intended to retain SGML's ability to define new sets of elements; that is, XML is a meta-language for creating other markup languages. XML documents also contain markup that describes precisely what the marked-up content is. As a result, an XML document can use plain text to store data, making delivery of information over the Internet easy, fast, and independent of any particular platform.
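
As a rough illustration of storing data as self-describing text, the Python sketch below writes a few records to an XML string and reads them back. The vocabulary (inventory, item, quantity), the function names to_xml and from_xml, and the sample records are all assumptions invented for this example; any element names that suit the application domain could be used instead.

```python
import xml.etree.ElementTree as ET

def to_xml(records):
    """Serialize (name, quantity) pairs using a made-up XML vocabulary."""
    root = ET.Element("inventory")
    for name, quantity in records:
        item = ET.SubElement(root, "item", {"name": name})
        ET.SubElement(item, "quantity").text = str(quantity)
    return ET.tostring(root, encoding="unicode")

def from_xml(text):
    """Recover the (name, quantity) pairs from the XML text."""
    root = ET.fromstring(text)
    return [
        (item.get("name"), int(item.findtext("quantity")))
        for item in root.iter("item")
    ]

records = [("bolt", 40), ("nut", 125)]
text = to_xml(records)
print(text)
print(from_xml(text))
```

The intermediate value is ordinary text, so it can be sent over a network or saved to a file and then read by any program that has an XML parser, whatever platform it runs on.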

XML is meant to support document publishing as strongly as HTML does, while upholding the goal of separating presentation from data; stylesheets take care of the presentation. XML also enforces stricter rules on its applications than SGML does. This helps to reduce the complexity of software that processes XML documents, such as XML parsers, which ultimately paves the way for better performance.
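
The stricter rules are easy to observe with an XML parser. In the Python sketch below, the helper name is_well_formed is an assumption made for this example; it relies on the standard xml.etree.ElementTree module, which raises ParseError when the input is not well-formed XML.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<p>Some <em>text</em></p>"))   # properly nested
print(is_well_formed("<p>Some <em>text</p></em>"))   # overlapping elements
print(is_well_formed("<p>Missing end tag"))          # omitted end tag
print(is_well_formed("<P>Mixed case</p>"))           # names are case sensitive
```

Only the first input is accepted. Because the parser can reject everything else outright, it does not need the recovery logic that more permissive markup requires.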

Another markup language, Extensible HTML (XHTML), was later introduced for web publishing. XHTML is a reformulation of HTML 4 in XML, designed to avoid some undesirable features of HTML.

HTML defines a fixed set of elements that any HTML document can use. It provides no flexible mechanism to extend that set with newer, more informative tags as the needs of an application domain arise. XHTML, being based on XML, allows the element set to be expanded. HTML documents are also poorly suited to data processing by other software, and rendering information in exactly the intended format is harder to achieve using HTML.

Although all these features can be accomplished easily using XML itself, XHTML was formulated to preserve the large investment already made in HTML.