XML stands for eXtensible Markup Language. XML is a meta-markup language developed by the World Wide Web Consortium (W3C) to deal with a number of the shortcomings of HTML. As more and more functionality was added to HTML to account for the diverse needs of users of the Web, the language grew increasingly complex and unwieldy. A way to create domain-specific markup languages that did not contain all the cruft of HTML became increasingly necessary, and XML was born.
The main difference between HTML and XML is that whereas in HTML the semantics and syntax of tags are fixed, in XML the author of the document is free to create tags whose syntax and semantics are specific to the target application. The semantics of a tag are not tied down but instead depend on the context of the application that processes the document. The other significant difference between HTML and XML is that an XML document must be well-formed.
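The well-formedness requirement can be illustrated with a minimal Python sketch. It assumes only the standard library's xml.etree.ElementTree parser; the helper name is_well_formed and the two sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup: the <br> tag is never closed, so an XML parser rejects it.
html_style = "<p>first line<br>second line</p>"
# XML-style markup: every element is closed and properly nested.
xml_style = "<p>first line<br/>second line</p>"

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

A check like this only establishes well-formedness; it says nothing about whether the document is valid against a DTD or schema.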
Although XML was originally intended as a way to mark up content, it became clear that XML also provided a way to describe structured data, making it important as a data storage and interchange format. XML provides many advantages as a data format over others, including:
Since XML is a way to describe structured data, there should be a means to specify the structure of an XML document. Document Type Definitions (DTDs) and XML Schemas are different mechanisms used to specify the valid elements that can occur in a document and the order in which they can occur, and to constrain certain aspects of these elements. An XML document that conforms to a DTD or schema is considered to be valid. Below is a listing of the different means of constraining the contents of an XML document.
SAMPLE XML FRAGMENT
<gatech_student gtnum="gt000x">
<name>George Burdell</name>
<age>21</age>
</gatech_student>
DTD FOR SAMPLE XML FRAGMENT
<!ELEMENT gatech_student (name, age)>
<!ATTLIST gatech_student gtnum CDATA #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
The DTD specifies that the gatech_student element has two child
elements, name and age, that contain character data as well as
a gtnum attribute that contains character data.
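Python's standard library parser does not validate against DTDs, so the following sketch checks the same constraints by hand instead. It is an illustration of what the DTD expresses, not a general validator; the function name matches_sample_dtd is hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<gatech_student gtnum="gt000x">
  <name>George Burdell</name>
  <age>21</age>
</gatech_student>"""

def matches_sample_dtd(text: str) -> bool:
    """Hand-written check of the content model (name, age) declared above."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not even well-formed
    if root.tag != "gatech_student":
        return False
    # (name, age): exactly these two children, in this order.
    if [child.tag for child in root] != ["name", "age"]:
        return False
    # #PCDATA: name and age hold only character data, no child elements.
    if any(len(child) > 0 for child in root):
        return False
    # gtnum is the only attribute the DTD declares for this element.
    return set(root.attrib) <= {"gtnum"}

print(matches_sample_dtd(SAMPLE))
```

A validating parser would perform these checks automatically from the DTD; the point here is only to make the declared constraints concrete.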
XDR FOR SAMPLE XML FRAGMENT
<Schema name="myschema" xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="age" dt:type="ui1" />
<ElementType name="name" dt:type="string" />
<AttributeType name="gtnum" dt:type="string" />
<ElementType name="gatech_student" order="seq">
<element type="name" minOccurs="1" maxOccurs="1"/>
<element type="age" minOccurs="1" maxOccurs="1"/>
<attribute type="gtnum" />
</ElementType>
</Schema>
The above schema specifies types for a name element that
contains a string as its content, an age element that contains
an unsigned integer value of size one byte (i.e. between 0 and
255), and a gtnum attribute that is a string value. It also
specifies a gatech_student element that has one occurrence each
of a name and an age element in sequence as well as a gtnum
attribute.
XSD FOR SAMPLE XML FRAGMENT
<schema xmlns="http://www.w3.org/2001/XMLSchema" >
<element name="gatech_student">
<complexType>
<sequence>
<element name="name" type="string"/>
<element name="age" type="unsignedInt"/>
</sequence>
<attribute name="gtnum">
<simpleType>
<restriction base="string">
<pattern value="gt\d{3}[A-Za-z]{1}"/>
</restriction>
</simpleType>
</attribute>
</complexType>
</element>
</schema>
The above schema specifies a gatech_student complex type
(meaning it can have elements as children) that contains a name
and an age element in sequence as well as a gtnum attribute.
The name element has to have a string as content, the age
element has an unsigned integer value, while the gtnum attribute
has to be matched by a regular expression that matches the
letters "gt" followed by 3 digits and a letter.
//emp[name="Fred"]/salary * 12
document("zoo.xml")//chapter[2 TO 5]//figure
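As a rough illustration of the first path expression above (the salary of the employee named Fred, multiplied by 12), here is a sketch using the subset of XPath that Python's xml.etree.ElementTree supports. The staff document and its values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical data; element names follow the path expression above.
DOC = """<staff>
  <emp><name>Fred</name><salary>3000</salary></emp>
  <emp><name>Wilma</name><salary>4000</salary></emp>
</staff>"""

root = ET.fromstring(DOC)
# ElementTree supports a subset of XPath, including [child='text'] predicates.
monthly = root.findall(".//emp[name='Fred']/salary")
# The arithmetic (* 12) is done in Python because ElementTree paths
# select elements and cannot compute values.
annual = [int(s.text) * 12 for s in monthly]
print(annual)
```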
<emp empid = {$id}>
{$name}
{$job}
</emp>
Generate an <emp> element that has an
"empid" attribute. The value of the attribute
and the content of the element are specified by
variables that are bound in other parts of the
query.
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann"
AND $b/year = "1998"
RETURN $b/title
List the titles of books published by Morgan Kaufmann
in 1998.
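For readers more familiar with general-purpose languages, the FOR/WHERE/RETURN query above corresponds roughly to a filtered loop. The sketch below assumes a small made-up bibliography in place of "bib.xml"; the function name titles_by is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for bib.xml.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>1999</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>"""

def titles_by(root, publisher, year):
    # FOR $b IN //book
    # WHERE $b/publisher = publisher AND $b/year = year
    # RETURN $b/title
    return [
        book.findtext("title")
        for book in root.iter("book")
        if book.findtext("publisher") == publisher
        and book.findtext("year") == year
    ]

print(titles_by(ET.fromstring(BIB), "Morgan Kaufmann", "1998"))
```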
<big_publishers>
{
FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p
}
</big_publishers>
List the publishers who have published more than 100
books.
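The grouping query above (LET binds all books of one publisher, WHERE filters on their count) can be sketched in Python as a count per publisher followed by a filter. The sample data is made up, and the threshold is a parameter so that the small sample produces a result; it counts only publisher elements that are direct children of a book, which is an assumption about the shape of "bib.xml".

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up stand-in for bib.xml.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher></book>
  <book><title>Book C</title><publisher>Other Press</publisher></book>
</bib>"""

def big_publishers(root, threshold):
    # Counting per publisher plays the role of distinct() plus count($b).
    counts = Counter(
        book.findtext("publisher")
        for book in root.iter("book")
        if book.find("publisher") is not None
    )
    # Element constructor: wrap the qualifying publishers in <big_publishers>.
    result = ET.Element("big_publishers")
    for name, n in counts.items():
        if n > threshold:
            ET.SubElement(result, "publisher").text = name
    return result

result = big_publishers(ET.fromstring(BIB), threshold=1)
print(ET.tostring(result, encoding="unicode"))
```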
FOR $h IN //holding
RETURN
<holding>
{$h/title,
IF ($h/@type = "Journal")
THEN $h/editor
ELSE $h/author
}
</holding>
SORTBY (title)
Make a list of holdings, ordered by title. For
journals, include the editor, and for all other
holdings, include the author.
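The conditional and sorting behaviour of this query can be sketched as follows. The holdings data is invented, and the sketch assumes every holding has a title child; the function name list_holdings is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Invented sample data.
HOLDINGS = """<library>
  <holding type="Journal"><title>XML Quarterly</title><editor>E. Ditor</editor><author>A. Uthor</author></holding>
  <holding type="Book"><title>Databases Today</title><editor>R. Eviser</editor><author>W. Riter</author></holding>
</library>"""

def list_holdings(root):
    out = []
    for h in root.iter("holding"):
        item = ET.Element("holding")
        # Assumes a title is always present.
        item.append(copy.deepcopy(h.find("title")))
        # IF ($h/@type = "Journal") THEN $h/editor ELSE $h/author
        wanted = "editor" if h.get("type") == "Journal" else "author"
        for e in h.findall(wanted):
            item.append(copy.deepcopy(e))
        out.append(item)
    # SORTBY (title)
    out.sort(key=lambda item: item.findtext("title"))
    return out

for item in list_holdings(ET.fromstring(HOLDINGS)):
    print([child.tag for child in item], item.findtext("title"))
```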
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
(contains($p, "sailing") AND contains($p, "windsurfing"))
RETURN $b/title
Find titles of books in which both sailing and
windsurfing are mentioned in the same paragraph.
FOR $b IN //book
WHERE EVERY $p IN $b//para SATISFIES
contains($p, "sailing")
RETURN $b/title
Find titles of books in which sailing is mentioned in
every paragraph.
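The SOME and EVERY quantifiers in the two queries above behave much like Python's any() and all(). The following sketch uses invented book data; contains() is approximated with a plain substring test, which is an assumption about how the matching should work.

```python
import xml.etree.ElementTree as ET

# Invented sample data.
BOOKS = """<books>
  <book><title>Wind and Water</title>
    <para>We went sailing and windsurfing.</para>
    <para>More sailing followed.</para></book>
  <book><title>Two Sports</title>
    <para>A chapter on sailing.</para>
    <para>A chapter on windsurfing.</para></book>
  <book><title>Empty</title></book>
</books>"""

def para_texts(book):
    # Gather each paragraph's text, including text inside nested markup.
    return ["".join(p.itertext()) for p in book.iter("para")]

def titles_some(root, words):
    # SOME $p ... SATISFIES: at least one paragraph contains every word.
    return [b.findtext("title") for b in root.iter("book")
            if any(all(w in t for w in words) for t in para_texts(b))]

def titles_every(root, word):
    # EVERY $p ... SATISFIES: all paragraphs contain the word.
    # Like all() on an empty list, this holds for a book with no paragraphs.
    return [b.findtext("title") for b in root.iter("book")
            if all(word in t for t in para_texts(b))]

root = ET.fromstring(BOOKS)
print(titles_some(root, ["sailing", "windsurfing"]))
print(titles_every(root, "sailing"))
```

Note that the book without paragraphs is returned by the EVERY variant but not by the SOME variant, which is the usual behaviour of universal and existential quantifiers over an empty collection.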
NAMESPACE xsd = "http://www.w3.org/2001/XMLSchema"
DEFINE FUNCTION depth($e) RETURNS xsd:integer
{
# An empty element has depth 1
# Otherwise, add 1 to max depth of children
IF (empty($e/*)) THEN 1
ELSE max(depth($e/*)) + 1
}
depth(document("partlist.xml"))
Find the maximum depth of the document named
"partlist.xml."
As was mentioned in the introduction, there is a dichotomy in how XML is used in industry. On one hand there is the document-centric model of XML, where XML is typically used as a means of creating semi-structured documents with irregular content that are meant for human consumption. An example of document-centric usage of XML is XHTML, which is the XML-based successor to HTML.
SAMPLE XHTML DOCUMENT
<html xmlns ="http://www.w3.org/1999/xhtml">
<head>
<title>Sample Web Page</title>
</head>
<body>
<h1>My Sample Web Page</h1>
<p> All XHTML documents must be well-formed and valid. </p>
<img src="http://www.example.com/sample.jpg" alt="Sample image" height="50" width="25"/>
<br />
<br />
</body>
</html>
The other primary usage of XML is in a data-centric model. In a data-centric model, XML is used as a storage or interchange format for data that is structured, appears in a regular order and is most likely to be machine processed instead of read by a human. The fact that the data is stored or transferred as XML is typically incidental, since it could be stored or transferred in a number of other formats which may or may not be better suited for the task depending on the data and how it is used. An example of a data-centric usage of XML is SOAP. SOAP is an XML-based protocol used for exchanging information in a decentralized, distributed environment. A SOAP message consists of three parts: an envelope that defines a framework for describing what is in a message and how to process it, a set of encoding rules for expressing instances of application-defined datatypes, and a convention for representing remote procedure calls and responses.
SAMPLE SOAP MESSAGE TAKEN FROM W3C SOAP 1.1 NOTE
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
<SOAP-ENV:Body>
<m:GetLastTradePrice xmlns:m="Some-URI">
<symbol>DIS</symbol>
</m:GetLastTradePrice>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
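Because such a message is meant to be processed by a machine, a receiver would typically parse it rather than read it. The following sketch shows one way to pull the procedure name and its parameter out of the sample message using Python's standard library. The prefix mapping NS is local to the sketch; only the namespace URIs come from the message itself.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

# Prefixes here are arbitrary; matching is done on the namespace URIs.
NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}

envelope = ET.fromstring(MESSAGE)
body = envelope.find("soap:Body", NS)
call = body[0]  # the first child of Body is the remote procedure call
# ElementTree stores qualified names as "{namespace-uri}local-name".
namespace, method_name = call.tag[1:].split("}")
# <symbol> carries no prefix and no default namespace is declared,
# so it is looked up without a namespace.
symbol = call.findtext("symbol")

print(namespace, method_name, symbol)
```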
In both models where XML is used, it is sometimes necessary to store the XML in some sort of repository or database that allows for more sophisticated storage and retrieval of the data especially if the XML is to be accessed by multiple users. Below is a description of storage options based on what model of XML usage is required.
SAMPLE DB2 XML EXTENDER TABLE AND QUERY
TABLE mail_user
user_name VARCHAR(20) NOT NULL PRIMARY KEY
passwd VARCHAR(10)
mailbox XMLVARCHAR
SELECT user_name FROM mail_user WHERE extractVarchar(mailbox, '/Mailbox/Inbox/Email/Subject') LIKE '%XML%'
The above query returns the names of all the users that have any email in their inbox that
contains the string "XML" in its subject. To improve the performance of the XPath query it is
necessary to index the mailbox XMLVARCHAR.
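The general idea of storing XML in a column and filtering rows with a path-based extraction function can be tried out without DB2. The sketch below uses SQLite from Python's standard library as a stand-in; extract_text is a hypothetical user-defined function written for this example, not part of DB2 XML Extender or SQLite, and the user data is invented.

```python
import sqlite3
import xml.etree.ElementTree as ET

def extract_text(xml_text, path):
    """Hypothetical UDF: text of all nodes matching path, one per line."""
    root = ET.fromstring(xml_text)
    return "\n".join(node.text or "" for node in root.findall(path))

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL.
conn.create_function("extract_text", 2, extract_text)
conn.execute(
    "CREATE TABLE mail_user ("
    "user_name TEXT NOT NULL PRIMARY KEY, passwd TEXT, mailbox TEXT)"
)
conn.executemany("INSERT INTO mail_user VALUES (?, ?, ?)", [
    ("alice", "secret",
     "<Mailbox><Inbox><Email><Subject>XML and databases</Subject></Email></Inbox></Mailbox>"),
    ("bob", "hunter2",
     "<Mailbox><Inbox><Email><Subject>Lunch?</Subject></Email></Inbox></Mailbox>"),
])
# ElementTree paths are relative to the root element (Mailbox),
# so the leading /Mailbox step of the original query is dropped.
rows = conn.execute(
    "SELECT user_name FROM mail_user "
    "WHERE extract_text(mailbox, 'Inbox/Email/Subject') LIKE '%XML%'"
).fetchall()
print(rows)
```

Since the function parses each mailbox on every call, a query like this scans and parses every row, which is the performance concern that indexing the XML column is meant to address.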
Oracle has completely integrated XML into its Oracle 9i database as well as the rest of its family of products. XML documents can be stored as whole documents in user-defined columns [of type XMLType or CLOB/BLOB], where they can be extracted using XMLType functions such as Extract(), or they can be stored as decomposed XML documents in object relational form, which can be reconstituted using the XML SQL Utility (XSU) or SQL functions and packages. For searching XML, Oracle provides Oracle Text, which can be used to index and search XML stored in VARCHAR2 or BLOB variables within a table via the CONTAINS and WITHIN operators used in conjunction with SQL SELECT queries. XMLType columns can be queried by selecting them through a programming interface (e.g. SQL, PL/SQL, C, or Java), by querying them directly using extract() and/or existsNode(), or by using Oracle Text operators to query the XML content. The extract() and existsNode() functions use XPath expressions for querying XML data. Oracle 9i also allows one to create relational views on XML documents stored in XMLType columns, which can then be queried using SQL. The columns in the view are mapped to XPath expressions that query the document in the XMLType column.
SAMPLE ORACLE 9i TABLE AND QUERY
CREATE TABLE mail_user(
user_name VARCHAR2(20),
passwd VARCHAR2(10),
mailbox SYS.XMLTYPE );
SELECT user_name FROM mail_user m WHERE m.mailbox.extract('/Mailbox/Inbox/Email/Subject/text()').getStringVal() like '%XML%'
The above query returns the names of all the users that have any email in their inbox that
contains the string "XML" in its subject. To improve the performance of the XPath query it is
necessary to index the mailbox XMLType.
Microsoft's
SQL Server 2000 also supports performing XML operations
on relational data. XML data can be
retrieved from relational tables using the FOR XML clause.
The FOR XML clause has three modes: RAW, AUTO and EXPLICIT.
RAW mode sends each row of data in the resultset back as an
XML element named "row", with each column
being an attribute of the "row" element. AUTO
mode returns query results in a nested XML tree where each
element returned is named after the table it was extracted
from and each column is an attribute of the returned
elements. The hierarchy is determined based on the order of
the tables identified by the columns of the SELECT
statement. With EXPLICIT mode the hierarchy of the XML
returned is completely controlled by the query which can be
rather complex. SQL Server also provides the OPENXML clause,
which provides a relational view on XML data. OPENXML
allows XML documents placed in memory to be used as
parameters to SQL statements or stored procedures. Thus
OPENXML is used to query data from XML, join XML data with
existing relational tables, and insert XML data into the
database by "shredding" it into tables. W3C
XML Schema can also be used to provide mappings between XML
and relational structures. These mappings are called XML
views and allow relational data in tables to be viewed as
XML which can be queried using XPath.
The following people helped in reviewing and proofreading this paper: Dr. Sham Navathe, Kimbro Staken, Dmitri Alperovitch, Sam Collins, Omri Gazitt and Dennis Lu.