The value of XML and HTML in society today: a short analysis and reflection
XML was originally developed by the World Wide Web Consortium (W3C) to overcome the limitations of HTML, the markup language for web page content. XML owes its name as an ‘extensible markup language’ to the fact that, unlike HTML, it does not fix a set of tags in advance: authors are free to design markup terms suited to their own purposes. Its uses are therefore many. It has been described as a basis for the ‘semantic web’, or the creation of a web of ‘linked data’ through the development of varied schemas and customised markup vocabularies (or ‘languages’).(1) It has been applied frequently for purposes of data exchange and information management. It is so well suited to the former that XML has become the basis for ‘most electronic commerce applications’, and this has long been its ‘most popular’ usage.(2) However, XML has also been of value in the field of information management and, to some extent, publishing.
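The freedom to design one's own markup terms can be illustrated with a minimal sketch in Python, using the standard library's xml.etree.ElementTree module. The vocabulary used here (the element names order, customer and item, and their attributes) is hypothetical and invented purely for illustration; it does not correspond to any real electronic commerce schema.

```python
import xml.etree.ElementTree as ET


def build_order(order_id, customer, items):
    """Build a small XML document using an invented 'order' vocabulary."""
    order = ET.Element("order", attrib={"id": order_id})
    ET.SubElement(order, "customer").text = customer
    for sku, quantity in items:
        # Each line item becomes an <item> element; sku and quantity are attributes.
        ET.SubElement(order, "item", attrib={"sku": sku, "quantity": str(quantity)})
    return ET.tostring(order, encoding="unicode")


def read_order(xml_text):
    """Parse the same invented vocabulary back into plain Python values."""
    order = ET.fromstring(xml_text)
    items = [(i.get("sku"), int(i.get("quantity"))) for i in order.findall("item")]
    return order.get("id"), order.findtext("customer"), items


if __name__ == "__main__":
    text = build_order("A-1", "Example Books Ltd", [("isbn-001", 2), ("isbn-002", 1)])
    print(text)
    print(read_order(text))
```

Because both parties agree on the meaning of the tags, the receiving side can recover the data without reference to the software that produced it, which is the sense in which XML lends itself to data exchange.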
In the field of library and information studies, one good example of its use has been the creation of the Encoded Archival Description (EAD) standard to better link the contents of various marked-up archival catalogues, reflecting the intrinsic value of XML to the development of useful metadata standards.(3) A key feature of XML, as of all good standards, is that it is non-proprietary: its use does not depend on particular commercial software. Indeed, an XML document can be created, read and shared offline as well as online. This trait might be said to reflect the fact that ‘a formative influence’ on the creation of XML was the pre-existing Text Encoding Initiative (TEI), which has been called ‘the de facto standard for literary computing’ for the past few decades.(4) The creation of the TEI was motivated by the desire to create digital scholarly editions of texts that can be preserved perpetually.
A common kind of schema in XML and TEI is the Document Type Definition (DTD), which declares the elements and attributes permitted in a particular type of document. The DTD originated with SGML. A document that obeys the basic syntax rules of XML is said to be ‘well formed’; one that also conforms to its schema is said to be ‘valid’. Well-formed XML and a valid TEI document are therefore distinct things. The schema adopted by the TEI programme is closer to RELAX NG, an International Organisation for Standardisation (ISO) standard, than to the XML Schema language defined by the W3C, and while the TEI guidelines for the creation of suitable elements in a text encoding are remarkably extensive they are also very specific.(5) In total, 503 elements and 210 attributes, organised into 21 modules, are defined in the TEI Guidelines. However, in 1995 this was simplified into a ‘TEI Lite’ edition of the full TEI encoding scheme, which consists of only 145 elements. TEI Lite has been judged to ‘meet the needs of 90% of the TEI community 90% of the time’.(6)
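The notion of well-formedness can be made concrete with a short sketch, again in Python and using only the standard library. It checks well-formedness in the strict XML sense (correct syntax, such as properly nested and closed tags) and nothing more. The two sample strings are invented for illustration and are not complete TEI documents.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Every opening tag has a matching, correctly nested closing tag.
GOOD = "<text><body><p>Hello</p></body></text>"
# The <p> element is never closed, so the parser reports an error.
BAD = "<text><body><p>Hello</body></text>"

if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
```

Checking a document against a DTD or a RELAX NG schema is a separate step that requires a validating parser. The standard library's ElementTree does not validate, so that step is not shown here.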
Although ‘XML exists because HTML was successful’,(7) the two serve different ends. XML is usually used for the back-end process of data management, which makes it particularly useful for the maintenance of very large websites, such as online archives and commercial ventures. HTML, in contrast, might be described as the markup language used for the ‘front page’ presentation of information online. Indeed, the very existence of HTML is fundamentally tied to the development of the Internet as a medium or communications tool,(8) which has made an impact on society comparable to that made by the development of mass print journalism in the mid-nineteenth century or the development of television in the mid-twentieth century.(9)
The usage of HTML has transcended two limitations of traditional print media, namely a fixed physical format and the associated costs of production, precisely because a HTML file, or ‘document’, is essentially a computer file that can be viewed remotely using a web browser. The very meaning of the acronym HTML (Hypertext Markup Language) reflects the fact that hypertext is the technology that allows for the creation of links on the web, which could be said to be the most important feature of HTML. It is this linking that enables HTML-based projects to present multimedia content (such as audio and visual content in addition to text) at one location, or to connect computer files held on different servers, by means of Uniform Resource Locators (URLs).(10)
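How a link ties one document to another by means of a URL can be sketched as follows. The example uses Python's standard html.parser and urllib.parse modules to gather the targets of the anchor (a) elements in a fragment of HTML and to resolve relative addresses against the address of the page itself. The class name LinkCollector, the function collect_links and the example.com addresses are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href targets of <a> elements as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the address of the page.
                    self.links.append(urljoin(self.base_url, value))


def collect_links(html_text, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<p><a href="/about.html">About</a> <a href="https://example.org/x">X</a></p>'
    print(collect_links(page, "https://example.com/index.html"))
```

The first link in the sample points to a file on the same server and the second to a file on a different one, which is the distinction the paragraph above draws.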
Like many computer files, a HTML file, or ‘document’, is alterable and versatile. It can be combined with other technologies (including Cascading Style Sheets or ‘CSS’) to enhance its text formatting, or presentation, options. Its functionality can be extended by the use of PHP (‘PHP: Hypertext Preprocessor’) code, which can turn HTML files, or web pages, into ‘dynamic pages’ whose content is drawn from Relational Database Management Systems (RDBMS), facilitating the creation of ‘big data’ from website content.(11) The possibility of altering HTML files, or ‘web pages’, is what has created the idea of interactive, as opposed to static, websites (a development first nicknamed ‘Web 2.0’). The options that exist in defining their functionality are also what enable them to be ‘responsive’: they can be designed to be displayed differently depending on the device on which they are viewed. They can also be made searchable online by embedding ‘meta[data] tags’, or associated keywords, in the documents.
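The idea of a ‘dynamic page’ can be sketched in a few lines. The example below is written in Python rather than PHP, and uses the standard library's sqlite3 module as a stand-in for a full relational database system, so it illustrates the general technique rather than any particular website's method. The table name books, its columns, the sample rows and the keywords in the meta tag are all invented for illustration.

```python
import sqlite3
from html import escape


def demo_connection():
    """Create an in-memory database holding a few invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO books VALUES (?, ?)",
        [("Maps & Plans", 1940), ("A Short History", 1922)],
    )
    return conn


def render_catalogue(connection):
    """Build an HTML page whose list is generated from the database rows."""
    rows = connection.execute(
        "SELECT title, year FROM books ORDER BY year"
    ).fetchall()
    # escape() protects characters such as & and < that have meaning in HTML.
    items = "\n".join(
        f"      <li>{escape(title)} ({year})</li>" for title, year in rows
    )
    return (
        "<html>\n  <head>\n"
        '    <meta name="keywords" content="books, catalogue">\n'
        "  </head>\n  <body>\n    <ul>\n"
        + items
        + "\n    </ul>\n  </body>\n</html>"
    )


if __name__ == "__main__":
    print(render_catalogue(demo_connection()))
```

The page is assembled afresh each time the function is called, so a change to the database changes the page without anyone editing the HTML by hand. The meta element in the head shows where keywords of the kind mentioned above would be embedded.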
The centrality of XML to commercial transactions in the business world is undoubtedly the most valuable example of the use of text encoding within society today. Citing specific examples of this is not practicable because the schemas used in various commercial transaction programmes are necessarily confidential in order to protect their integrity. If this is the reality of the world of data exchange, what can we say in conclusion about text encoding within the world of publishing?
Like any language, a markup language is only as valuable as the uses to which it is put. Markup has been defined as ‘any means of making explicit an interpretation of a text’ while a knowledge of markup techniques has been described as ‘a core competence of digital humanities’, so much so that text encoding (including TEI) ‘should be a central plank’ of digital humanities curricula. This is because text encoding creates ‘the foundation for almost any use of computers in the humanities’.(12) Effective practice is dependent on the existence of effective standards. This is why the creation of schemas and standards for the presentation, processing and preservation of literary documents in a digital format through the TEI is undoubtedly important. However, an ability to use HTML for web design and XML for information management is also valuable. As a practice, digital humanities is related to information and library studies and archival science. However, the ‘digital humanities’ is also a scholarly discipline in the sense that it exists to encourage all students of the humanities not only to become literate in the use of text-encoding techniques but also to realise their value in both pursuing research questions and presenting research answers. In so far as this technological reorientation takes place, scholars within the humanities may be said to be following what has already occurred within the world of government and business in terms of the effective management and presentation of information (that is, data) so that it can be more readily processed with a specific purpose in mind.
(2)Benoit Marchal, XML by example (Indianapolis, 2000), 2 (quote), 6-7
(4)Julianne Nyhan, ‘Text encoding and scholarly digital editions’, in C. Warwick, M. Terras, J. Nyhan (eds) Digital Humanities in practice (London, 2012), 117 (quote)
(7)Benoit Marchal, XML by example (Indianapolis, 2000), 7 (quote)
(8)Lee M. Cottrell, HTML and XHTML demystified (New York, 2011), chapter 1
(10)Lee M. Cottrell, HTML and XHTML demystified (New York, 2011), 4
(12)C. Warwick, M. Terras, J. Nyhan (eds) Digital Humanities in practice (London, 2012), 121