Advanced use-cases for Kweelt: Web integration
(contributed by Arnaud Sahuguet)

The use-cases presented so far have dealt with standard queries, either from a pure database perspective or from a pure document perspective. In these use-cases, we want to present some applications of the Kweelt query language that are more focused towards information integration.

The examples we will present will make Kweelt look like an Excel-like language, but there is nothing to afraid of.

It is important to note that all the queries presented above are valid Kweelt queries that can be evaluated on the Kweelt engine.
By lack of interesting XML sources on the Web, we use wrappers that will translate on-the-fly HTML pages into XML documents, using the World Wide Web Wrapper Factory (W4F).

We present 3 use-cases:

All the use-cases have been tested on a Kweelt evaluator. Check here for the interactive demo.

Portfolio Management

In this example, we want to use Kweelt to compute the current value of our portfolio of stocks. To do so, given a stock, we need to fetch from the Web its current price. The value of the entire portfolio is simply the sum of price * quantity for each stock.

The stock information can be extracted from the Yahoo! Finance web site. The information is available in HTML but can be mapped to XML. We have written a wrapper to do that. The wrapper is used as a Web service using the following syntax /cgi-bin/YFS.pl?LU+NOK&d=t when one wants to get back the price for Lucent (LU) and Nokia (NOK).
You can try for yourself with http://db.cis.upenn.edu/cgi-bin/YFS.pl?LU+LU&d=t. (For some technical reasons, the last item of the list is always discarded. Simply add an extra entry to make it work: LU+LU to get Lucent's price or LU+NOK+IBM+NOK to get Lucent, Nokia and IBM's prices.).

The mapped information is described by the following DTD:

<!ELEMENT Portfolio (Stock*)>
<!ELEMENT Stock (Name,LastPrice,Volume,Change,Day_Range,Year_Range)>
<!ATTLIST Stock
        Market CDATA #IMPLIED
        Ticker CDATA #IMPLIED>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT LastPrice (#PCDATA)>
<!ELEMENT Volume (#PCDATA)>
<!ELEMENT Change (#PCDATA)>
<!ELEMENT Day_Range (DMin,DMax)>
<!ELEMENT DMin (#PCDATA)>
<!ELEMENT DMax (#PCDATA)>
<!ELEMENT Year_Range (YMin,YMax)>
<!ELEMENT YMin (#PCDATA)>
<!ELEMENT YMax (#PCDATA)>
DTD for Yahoo! Finance

Given the description of a portfolio, the query will return a PORTFOLIO element with a sequence of ASSET with attributes NAME, QTY, PRICE and VALUE. It will also return a TOTAL element that will sum the values of the assets.
Variable $url represents the way to call the Web service that will return the current price of the stock. $yahooFinance points to the corresponding XML document, generated on-the-fly by the web service.

The query is presented below.

FUNCTION YAHOO_FINANCE( $portfolio )
{
	document( CONCAT( "http://db.cis.upenn.edu/cgi-bin/YFS.pl?",
		          JOIN("+", $portfolio/ASSET/@NAME, "NOK"),
                          "&d=t" ) )
}

LET $portfolio :=
      [[
      <PORTFOLIO>
        <ASSET NAME="LU"   QTY="35"/>
        <ASSET NAME="IBM"  QTY="20"/>
        <ASSET NAME="NOK"  QTY="120"/>
        <ASSET NAME="CSCO" QTY="65"/>
        <ASSET NAME="TXN"  QTY="80"/>
      </PORTFOLIO>
      ]],
    $yahooFinance := YAHOO_FINANCE($portfolio)

RETURN

  <PORTFOLIO>

    (
    FOR $asset IN $portfolio/ASSET
    LET $price := $yahooFinance/Stock[ ./@Ticker = $asset/@NAME ]/LastPrice
    RETURN
      <ASSET NAME  = $asset/@NAME
             QTY   = $asset/@QTY
             PRICE = CONCAT("$", TRIM($price))
             VALUE = CONCAT("$", $price * $asset/@QTY) />
     SORTBY ( ./@NAME )
    ),

    <TOTAL>
    CONCAT("$",
      SUM( FOR $asset IN $portfolio/ASSET
           LET $price := $yahooFinance/Stock[ ./@Ticker = $asset/@NAME ]/LastPrice
           RETURN $asset/@QTY * $price ) )
    </TOTAL>

  </PORTFOLIO>     

Movie Mining

The second use-case explores integration between various relations offered by the same data-source (the Internet Movie Database).
We have built three different wrappers for movie entries, people (actors/actresses/directors) and for the list of the best 250 movies.

The corresponding DTDs are given below:

<!ELEMENT MOVIE (Title,DirectedBy,Categories,Cast)>
<!ATTLIST MOVIE
        Year CDATA #IMPLIED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT DirectedBy (Director*)>
<!ELEMENT Director EMPTY>
<!ATTLIST Director
        URL CDATA #IMPLIED
        Name CDATA #IMPLIED>
<!ELEMENT Categories (Genre*)>
<!ELEMENT Genre (#PCDATA)>
<!ELEMENT Cast (Actor*)>
<!ELEMENT Actor EMPTY>
<!ATTLIST Actor
        URL CDATA #IMPLIED
        Name CDATA #IMPLIED>
DTD for a movie entry, generated from a W4F wrapper.

<!ELEMENT MoviePerson (ActingFilmography,DirectingFilmography)>
<!ATTLIST MoviePerson
        Name CDATA #IMPLIED>
<!ELEMENT ActingFilmography (Movie*)>
<!ELEMENT Movie (Title)>
<!ATTLIST Movie
        URL CDATA #IMPLIED
        Year CDATA #IMPLIED>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT DirectingFilmography (Movie*)>
DTD for a movie person, generated from a W4F wrapper.

<!ELEMENT BestMovies (Movie*)>
<!ELEMENT Movie EMPTY>
<!ATTLIST Movie
	Rank CDATA #IMPLIED
	URL CDATA #IMPLIED
	Title CDATA #IMPLIED
	Year CDATA #IMPLIED>
DTD for the list of the best 250 movies, generated from a W4F wrapper.

The first query returns the movies released after 1999 where Keanu Reeves played, with the name of the director. When there is more than one director the names are concatenated with "and".
The query is given below (note: the query can be rewritten in a more concise way.):

FUNCTION IMDB_Person( $name )
{
  document( CONCAT( "http://db.cis.upenn.edu/cgi-bin/MDB_P.pl?", $name ) )
}

FUNCTION IMDB_Movie( $title )
{
  document( CONCAT( "http://db.cis.upenn.edu/cgi-bin/MDB_M.pl?", $title ) )
}	
      
FUNCTION Director( $directors )
{
  IF COUNT($directors) = 1 THEN TRIM($directors/@Name)
                           ELSE JOIN( " and ", $directors/@Name )
}


LET $actor := IMDB_Person("Reeves,+Keanu")
  FOR $movie in $actor/ActingFilmography/Movie[@Year >= 1999]
    LET $movie :=  IMDB_Movie( $movie/@URL )
    LET $directors := $movie//Director
        RETURN
	  <Result>
	    <Movie title=TRIM($movie/Title) by=Director($directors) />
	  </Result>

The second query computes the ranking of a director, based on the number of movies that appear in the Top 250 movies from the Internet Movie Database. The query iterates through all the movies, gets its director(s) and then returns the list with a value Count that represents this count. The query can be run on the full 250 movies or only a subset (like below: [ position() <= 10 ]).
The query needs to remove duplicates from the set of all directors. Since XPath defines node equality in terms of text nodes (ignoring attributes), the query has to artificially return a list of directors where the name of the director is now PCDATA.
The query is given below:

FUNCTION IMDB_Movie( $title )
{
  document( CONCAT( "http://db.cis.upenn.edu/cgi-bin/MDB_M.pl?", $title ) )
}	

LET $list_of_directors :=
<RESULTS>
 FOR $movie in document("Best250.xml")/Movie[ position() .<=. 10 ]
  LET $movie := IMDB_Movie( $movie/@URL )
  FOR $director in $movie//Director 
   RETURN <Director>$director/@Name</Director>
</RESULTS>
FOR $director in DISTINCT $list_of_directors/Director
RETURN
	<Director Name=$director
		  Count=COUNT($list_of_directors/Director[ . = $director ]) />
	SORTBY (@Count DESCENDING, @Name)

TV Agent

The last use-case proposes to integrate information from two distinct sources: the Internet Movie Database (presented above), the Yahoo! TV Listings.
The TV listings consist of a big table where a row for each channel and columns for time slots. TV listings are accessed for a given starting time. We use -- once again -- a W4F wrapper to extract the HTML information and map it to XML. The corresponding DTD is given below:

<!ELEMENT TVGuide (Listing)>
<!ATTLIST TVGuide
        StartAt CDATA #IMPLIED>
<!ELEMENT Listing (Channel*)>
<!ELEMENT Channel (Entries)>
<!ATTLIST Channel
        Name CDATA #IMPLIED>
<!ELEMENT Entries (Entry*)>
<!ELEMENT Entry (Movie)>
<!ATTLIST Entry
        StartAt CDATA #IMPLIED
        Duration CDATA #IMPLIED
        Type CDATA #IMPLIED>
<!ELEMENT Movie EMPTY>
<!ATTLIST Movie
        title CDATA #IMPLIED
        year CDATA #IMPLIED
        score CDATA #IMPLIED
        rating CDATA #IMPLIED>       
DTD for Yahoo! TV Listings

The query returns for a given time of the day (right now the XML file that describes the TV programs is fixed and has been wrapped using a W4F wrapper) the list of movies released between years minYear and maxYear (specified by the user), with some given actors/actressesor directors, and available from some given channels. Given the structure of the TV listing, the Kweelt query needs to compute the actual starting time of the show, based on the starting time the TV listing and the number of timeslots before. This is done by CONCAT( $start_time, " + ", SUM($prog/preceding-sibling::Entry/@Duration), " x 30min")

FUNCTION IMDB_Movie( $title, $year )
{
  LET $url := CONCAT( "http://db.cis.upenn.edu/cgi-bin/MovieMatcher.pl?",
                      "TITLE=", SUBST(" ", "+", $title), "&",
                      "YEAR=", $year )
  RETURN document($url)
}

LET $prefs := [[
  <Prefs>
    <Channels>
      <Channel name="HBO"/>
      <Channel name="SHO2 25"/>
      <Channel name="MAX 16"/>
    </Channels>
    <Movies maxYear="1999" minYear="1994">
      <FavoriteActors>
        <Actor name="Kate Beckinsale"/>
        <Actor name="Annette Bening"/>
      </FavoriteActors>
      <FavoriteDirectors>
        <Director name="Stanley Kubrick"/>
      </FavoriteDirectors>
    </Movies>   
  </Prefs>
              ]]
LET $start_time := document("TV.xml")/@StartAt
FOR $prog IN document("TV.xml")//Channel[ @Name = $prefs//Channel/@name]//Entry[@Type="#b0e0e6"]
                                      [//Movie/@year .>=. $prefs//Movies/@minYear
                                   AND //Movie/@year .<=. $prefs//Movies/@maxYear ]
LET $movie := $prog//Movie
LET $imdb_movie := IMDB_Movie( $movie/@title, $movie/@year )
WHERE $imdb_movie//Actor/@Name = $prefs//Actor/@name
   OR $imdb_movie//Director/@Name = $prefs//Director/@name
RETURN
<Movie title=$movie/@title
       year=$movie/@year
       rating=$movie/@rating
       score=$movie/@score
       at=$prog/ancestor::TVGuide/@StartAt
       on=$prog/ancestor::Channel/@Name
       offset=CONCAT( $start_time, " + ", SUM($prog/preceding-sibling::Entry/@Duration), " x 30min") >
$imdb_movie/DirectedBy,
$imdb_movie//Actor[ position() .<=. 3]
</Movie>        
SORTBY (@score DESCENDING, @title)


That's all folks.

Last Update: 15-Aug-2000. Page created and maintained by Arnaud Sahuguet.
All Rights Reserved, 2000.