Web Query Languages, Intelligent Information Integration
CSC-671
Department of Mathematical Sciences
Abstract
The following is a review of two documents dealing with information integration on the World-Wide Web. The Web is made up of numerous sources of information that users would like to explore, but the layman in the computer world does not have the expertise to know how to access this data. It is up to the database integrator to bring this information together for the user. One paper deals with the more abstract problems facing the web searcher and data integrator; the other presents a practical language for querying XML data, a real-life application on the web.
Overview
The two papers were written with different purposes in mind. The authors of "XML-QL: A Query Language for XML" [DFFLS98] presented XML-QL to the World-Wide Web community as a hands-on working standard for querying XML data on web sites. This is in contrast to the purpose of the paper "Database Techniques for the World-Wide Web: A Survey" [FLM98], which was written to provide insight into what is being done in the area of web data integration and querying. The following sections give a detailed summary of the obstacles the database integrator faces in the effort to present World-Wide Web information to the end user. The authors of Database Techniques [FLM98] tried not to cover too broad a subject area in their short survey of database techniques on the web. The article includes a good deal of information, but does not give enough examples to convey a good flavor of what is being done in the database area on the World-Wide Web. Some areas of work are only mentioned in a short paragraph, with a reference to a related article to point the reader in the right direction for further research.
Information Management of Data on the World-Wide Web
D. Florescu, A. Levy, and A. O. Mendelzon presented three areas of discussion for querying on the World-Wide Web.
How to represent the data structure of the Web for Web/DB?
The search engines of today do not take the structure of the Internet into account, but instead throw vast amounts of data at you. The user must sort through the data to find what they want. The first aspect of querying is that web pages and links must be modeled. Different data models have been used to model the data found on the web. The majority of the data on the web is semistructured and therefore has no fixed schema. In addition, these pages, data sources, etc. change from day to day, making modeling of the data more difficult. The following models have been used to represent Web/DB structures.
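As one illustration of modeling pages and links, here is a minimal sketch of a labeled graph in Python. It is not taken from the survey: the class name `WebGraph`, its methods, and the example.org URLs are all hypothetical. Pages are nodes that carry whatever attributes happen to be known (no fixed schema), and each hyperlink is an edge labeled with its anchor text.

```python
from collections import defaultdict


class WebGraph:
    """Hypothetical model: pages are nodes, hyperlinks are labeled edges."""

    def __init__(self):
        self.attrs = {}                  # url -> dict of attributes (no fixed schema)
        self.links = defaultdict(list)   # url -> list of (label, target_url)

    def add_page(self, url, **attrs):
        # Pages may carry different attributes, as semistructured data does.
        self.attrs.setdefault(url, {}).update(attrs)

    def add_link(self, source, label, target):
        # Adding a link also registers both endpoints as pages.
        self.add_page(source)
        self.add_page(target)
        self.links[source].append((label, target))

    def targets(self, source, label=None):
        # Follow links from a page, optionally only those with a given label.
        return [t for (l, t) in self.links[source] if label is None or l == label]


graph = WebGraph()
graph.add_page("http://example.org/", title="Home")
graph.add_page("http://example.org/papers", title="Papers", year=1998)
graph.add_link("http://example.org/", "papers", "http://example.org/papers")
graph.add_link("http://example.org/", "contact", "http://example.org/contact")
```

Because attributes are stored per page rather than in a shared schema, a page that changes from day to day only changes its own dictionary.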
What about interactive sites?
All of the above models represent structures of the web that do not change interactively. A great deal of the web is becoming more and more interactive, and a page is presented to the user based on how they respond to prompts. This is a vast area of the web that was not covered in the survey, Database Techniques for the World-Wide Web. The authors did acknowledge this shortcoming and pointed out that this is an area that needs to be researched.
Querying of the data structures modeled.
As stated above, the content-only search of current Internet search engines ignores the structure of pages. Once we have created a model that takes this structure into account, we must develop a method of querying this model. Query languages must be developed that a person familiar with query languages can use. Structural searches look at the structure of the web site for patterns that match what is being asked for. This allows the search to return a list of pages or data that is more closely related to the search string. Instead of returning all 8,000 pages containing a certain string, the search can return a list that points to a group of pages that give broader coverage of the subject of the search string. A prototype next-generation web search engine named Google was mentioned in the paper.
Theory of web queries.
Original web query theory was based on the fact that the only possible way to access the web is to navigate links from known starting points. Under that assumption, the query "list all web documents that no other document points to" cannot be answered, since such a document cannot be discovered by following links.
Related Query paradigms
The authors touch on other related query paradigms that were developed, but not specifically for querying the Web. These query languages are similar to the web query languages.
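The point about navigation from known starting points can be sketched in Python. This is only an illustration under assumptions: the link table and page names below are invented, and a real crawler would fetch pages rather than read a dictionary. A breadth-first traversal finds every page reachable from the start set, and a page that nothing links to never appears in the result.

```python
from collections import deque


def reachable(links, start_pages):
    """Return every page that can be reached by following links from start_pages."""
    seen = set(start_pages)
    queue = deque(start_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link structure: "orphan" exists, but no other page points to it.
links = {
    "home": ["papers", "contact"],
    "papers": ["home"],
    "contact": [],
    "orphan": ["home"],
}

found = reachable(links, ["home"])
```

Here `found` contains `home`, `papers`, and `contact`. The page `orphan` is exactly the kind of document the unanswerable query asks for, and navigation alone has no way to learn that it exists.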
First generation: Web Query Languages.
Second generation: Web Data Manipulation Languages.
The second-generation languages are much more powerful and model the internal structure of web pages in addition to the links between the pages. These languages can create new structures based on the queries of the web pages.
Table 1. Comparison of query systems.
Summary of web query languages.
All of the above languages are too complex to be used directly by interactive users. Work is being done in the area of interactive query interfaces suitable for the casual user. This is the area of data integration that has the most potential for making information on the web available to the public.
Information Integration.
The web can be thought of as containers of sets of tuples, embedded in HTML or hidden behind forms interfaces. The method of accessing all of this data is to create a wrapper that gives the illusion that the web site is serving sets of tuples. This pairing of a web site with its wrapper is a web source. These sources can then be combined to answer queries that use data from various web sources. Developing the wrappers needed to integrate the data is not a simple task.
Problems to deal with in web integration.
Two approaches to dealing with these problems of the vast amount of web data are proposed.
The virtual data integration approach is the one the authors focus on. Two major differences from traditional database systems are pointed out.
Specifications of mediated schema and reformulation:
The end user is given an intermediate schema to query that is designed to make querying easier. This mediated schema is the set of collection and attribute names used in the queries. Queries over the mediated schema are translated to the data source schemas by the mediator program. Two mediator approaches are used to present the schema: Global as View and Local as View.
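The wrapper and mediator roles described above can be sketched in Python. This is a hedged illustration, not the survey's system: the two sources, their field names, and the mediated relation `book(title, author)` are all hypothetical, and the sample rows are made up. The sketch follows the Global as View idea, in which the mediated relation is defined as a view over the sources; Local as View goes the other way and describes each source as a view over the mediated schema.

```python
# Each wrapper hides how a source is accessed and exposes it as a set of tuples.
def wrap_bookstore():
    """Hypothetical source with fields (title, writer, price)."""
    raw = [("Book A", "Author X", 20.0), ("Book B", "Author Y", 35.0)]
    return [{"title": t, "writer": w, "price": p} for t, w, p in raw]


def wrap_library():
    """Hypothetical source that serves 'name|author' text lines."""
    raw = ["Book C|Author X", "Book A|Author X"]
    return [dict(zip(("name", "author"), line.split("|"))) for line in raw]


# Global as View: the mediated relation book(title, author) is defined
# as a view (here, a union) over the relations served by the wrappers.
def mediated_book():
    rows = set()
    for row in wrap_bookstore():
        rows.add((row["title"], row["writer"]))
    for row in wrap_library():
        rows.add((row["name"], row["author"]))
    return rows


# A user query is written against the mediated schema only.
def titles_by(author):
    return sorted(title for title, a in mediated_book() if a == author)
```

Calling `titles_by("Author X")` combines both sources without the user knowing that one calls the field `writer` and the other `author`.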
Obstacles in creating query structures on the web.
Overall view of a data integration system
The above diagram was from Database Techniques for the World-Wide Web [FLM98].
Web site construction and restructuring.
The final section of the article discusses building a web site to support data integration. Creation of web sites is normally broken into the following tasks.
The tasks of updating a site, restructuring a site, or enforcing integrity constraints on a site's structure are tedious to perform. The rewards of designing the site correctly justify the effort. If the web site is defined declaratively as a query, and not procedurally by a program, it is easy to change the query to create multiple views of the information for different classes of users.
How is the web site presented to the user?
Normally a web site is created a page at a time, and the person building the pages must keep track of all of them and how they are related, and work to present a seamless graphical presentation to the user. The figure below shows how the web site would be designed using query structures. Wrappers translate the data from the different sources on the site, and a mediator program presents these various sources of data to the declarative web site structure. The declarative structure is defined as views over the data presented by the wrapper interfaces. Since these views are defined by a query of the data, it is possible to present different logical views of the web site to different classes of users. Finally, a consistent graphical presentation specification ensures that the site looks and feels the same from page to page.
Example: Users on the Internet only see what is defined for external viewers, but users within the organization see an Intranet view of the web site, which presents much more information to them. Different levels of security are also possible using login/password verification when a person accesses the web site.
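The idea of declaring a site as queries over data can be sketched in Python. Everything here is an assumption made for illustration: the `ARTICLES` records, the `audience` field, and the two view names stand in for whatever the wrappers and mediator would actually deliver. Each class of user gets its own view, defined by a predicate, while a single rendering function keeps the presentation consistent.

```python
# Hypothetical data as it might arrive from the wrappers and mediator.
ARTICLES = [
    {"title": "Product overview", "audience": "public"},
    {"title": "Press release", "audience": "public"},
    {"title": "Internal roadmap", "audience": "internal"},
]

# The site structure is declared as one query (a predicate) per class of user.
SITE_VIEWS = {
    "external": lambda article: article["audience"] == "public",
    "intranet": lambda article: True,
}


def render_page(title):
    """A single presentation rule shared by every page keeps the look consistent."""
    return f"<html><body><h1>{title}</h1></body></html>"


def build_site(user_class):
    """Generate the pages that the given class of user is allowed to see."""
    selects = SITE_VIEWS[user_class]
    return {a["title"]: render_page(a["title"]) for a in ARTICLES if selects(a)}
```

Changing what external visitors see means editing one entry in `SITE_VIEWS`, not rewriting individual HTML pages.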
Architecture for Web Site Management Systems
Summary of web construction and restructuring
A big advantage of using a query view of the structure of the site is the ability to easily redefine a site just by changing the site definition. Normally a great deal of time is required to recreate new HTML pages to present the site, and it is difficult to integrate the changes smoothly. As the underlying data of a site changes, the queries can be changed to facilitate the upgrade of the site.
XML-QL: a query language for XML.
The document on XML-QL, authored by Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu, is a practical discussion of a possible standard for the web querying community. The suggested standard is based on XML, the new standard for data exchange on the Web. The authors proposed the language to fill the void of extracting and integrating data from XML sources into a web client. The language could also be used to send a query to the XML host requesting certain data. The language follows the SELECT-WHERE structure of SQL and draws on query languages for semistructured data. The XML format that it queries has a simple structure and is very flexible. It only requires that the tags match and be properly nested.
Example:
<person><name> Alan </name><phone> 775-5555 </phone><address> P.O. Box 56, Kernersville, NC </address></person>
The actual schema of the XML data is contained in the data itself. This allows the XML structure to vary as the web site changes. The creator of the web site is not limited in the complex structures they can use in the web pages.
Questions about XML data.
How will data be extracted? How will data be exchanged: as raw data, or by just sending the query to the host? How will data be translated between different user domains? How do you integrate data from multiple XML sources? These are some of the questions that remain to be solved.
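The requirement that tags match and be properly nested can be checked with Python's standard `xml.etree.ElementTree` parser. The sketch below is only an illustration: the helper name `is_well_formed` and the deliberately broken string are invented for this example. It also shows how the schema travels with the data, since the field names are read from the tags themselves.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


good = ("<person><name> Alan </name><phone> 775-5555 </phone>"
        "<address> P.O. Box 56, Kernersville, NC </address></person>")
bad = "<person><name> Alan </person></name>"  # tags overlap instead of nesting

# The tags act as the schema: read the field names straight from the data.
person = ET.fromstring(good)
fields = {child.tag: child.text.strip() for child in person}
```

The parser accepts `good` and rejects `bad`, and `fields` maps `name`, `phone`, and `address` to their values without any schema being declared in advance.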
XML-QL is a new language and will undoubtedly change and mature as more Internet users start to use the XML standard for structuring their web sites.
Modeling XML-QL
A variation of the semistructured data model is used as the model in the proposed XML-QL query language. Research in the area of semistructured data was used to design the language.
Definition. An XML Graph consists of:
Examples from XML-QL: A Query Language for XML
<title><CDATA> A Trip to </CDATA><titlepart><CDATA> the Moon </CDATA></titlepart></title>
The authors also mention XSL, a standard intended for specifying the style and layout of XML documents. This is not the same idea as the XML-QL language; XML-QL is capable of much more data-intensive operations and transformations of data. XML has already facilitated the exchange of data over the Web. It does so by not limiting the tags in a document: the user is able to create all the tags that they wish, and the tags themselves define the schema of the data in the pages. Industry initiatives and current applications are growing rapidly (see [COVER98]).
The XML-QL language is presented by the authors through example queries and a language syntax. The following is an example from the paper. First, a simple query that extracts data from an XML document is presented. The DTD (Document Type Definition) is as follows:
<!ELEMENT book (author+, title, publisher)>
The query statement is made up of a WHERE clause and a CONSTRUCT clause:
WHERE
<book>
This query is then applied to a small data structure:
<bib>
and the following result block is produced:
<result>
This is just a simple example, but structures as complex as you can imagine can be produced using the tags that you define.
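The WHERE and CONSTRUCT idea can be approximated in Python with the standard `xml.etree.ElementTree` module. This is not XML-QL syntax and not the paper's example: the sample `<bib>` data, the publisher name, and the function `authors_and_titles` are hypothetical, chosen only to match the `book (author+, title, publisher)` shape of the DTD. The loop plays the role of WHERE by matching a pattern and binding values, and the element building plays the role of CONSTRUCT.

```python
import xml.etree.ElementTree as ET

# Hypothetical data shaped like the DTD: book (author+, title, publisher).
BIB = """
<bib>
  <book>
    <author>Author X</author><author>Author Y</author>
    <title>Book A</title>
    <publisher>Publisher P</publisher>
  </book>
  <book>
    <author>Author Z</author>
    <title>Book B</title>
    <publisher>Publisher Q</publisher>
  </book>
</bib>
"""


def authors_and_titles(bib_xml, publisher):
    """Return one <result> element per (author, title) pair for a publisher."""
    root = ET.fromstring(bib_xml)
    results = []
    # WHERE part: match each <book> with the given <publisher> and bind
    # the title once and each author in turn.
    for book in root.findall("book"):
        if book.findtext("publisher") != publisher:
            continue
        title = book.findtext("title")
        for author in book.findall("author"):
            # CONSTRUCT part: build a new element from the bound values.
            result = ET.Element("result")
            ET.SubElement(result, "author").text = author.text
            ET.SubElement(result, "title").text = title
            results.append(result)
    return results


for r in authors_and_titles(BIB, "Publisher P"):
    print(ET.tostring(r, encoding="unicode"))
```

Because `author+` allows several authors per book, the sketch emits one result per author, so the first book yields two `<result>` elements.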
Conclusions
The authors of "Database Techniques for the World-Wide Web: A Survey" presented a good selection of ideas and examples of database query languages for the web. The purpose of the paper was to get more of the web community interested in developing and applying methods of querying the Internet. The paper was geared not toward a person writing applications on the Internet, but more toward the theoretical group. I think a more hands-on approach could have been taken to spur more excitement in the subject of web querying. Querying of the Internet is a tool that can be very useful. I am often discouraged by the information that I get from today's search engines and look forward to the day when we will be able to ask the web a question and get back an authoritative list of information.
The authors of "XML-QL: A Query Language for XML" did a very good job of presenting a working representation of a standard. The purpose of the paper was to present the semantics and syntax of the language, and I think this was accomplished through the examples. The idea of using XML as a database standard along with XML-QL is good. The use of a tag-based language is well known, and such a language can easily represent very complex structures through the use of tags. The next step after a language is defined is to determine how it is to be implemented. This is the part that will take a lot of work and cooperation within the Internet community. The Internet is a vast source of data, and finding ways to make it more accessible will benefit us all.
Appendix: Grammar for XML-QL
Grammar for XML-QL from
XML-QL Grammar
Condition
A query block consists of one or more queries and zero or more subblocks. Each subblock applies to the query it is following. Otherwise, the order of the queries is irrelevant. The queries are executed unconditionally; the subblocks are executed only if and when, in addition to the query's conditions, their conditions hold as well.
Select Clause
A select-clause constructs a piece of the query's result. It consists of one or more of the following: a variable, an element, a literal, or another query block. Elements may have associated semantic oids, also called Skolem functions.
Where Clause
A Where-clause consists of a series of conditions. Each condition binds some variable(s) with a tag pattern, or imposes more restrictions on previously bound variables in a predicate.
Rest
Grammar from "XML-QL: A Query Language for XML"
[DFFLS98] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu, XML-QL: A Query Language for XML, Submission to the World Wide Web Consortium, 19-August-1998
[FLM98] D. Florescu, A. Levy, and A. O. Mendelzon, Database Techniques for the World Wide Web: A Survey, SIGMOD Record, Vol. 27, No. 3, 1998, Pages 59-74
[BPS98] Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen, Extensible Markup Language (XML) 1.0, W3C Recommendation, 10-February-1998
[COVER98] Robin Cover, The SGML/XML Web Page, Extensible Markup Language (XML), November 11, 1998
[FLORID98] The FLORID Project, http://www.informatik.uni-freiburg.de/~dbis/florid/
[LOUVRE98] The Louvre Palace and Museum, http://www.lourve.fr/
[ARANEUS98] Database Group of Università di Roma Tre and Database Group of Università della Basilicata, The ARANEUS Project, ongoing web site.
http://www.cs.indiana.edu/~adippel/csc671/web_query_lang.htm