Making Headlines with RSS
Using Rich Site Summaries To Draw New Visitors
By Jonathan Eisenzopf
In the early years of the Web, most sites were not
concerned about sharing data with other sites. Today, the
trend is that sites are increasingly interdependent and many
rely upon integrating content that originates somewhere else.
Such content might include news feeds, events listings, a set
of project updates, and even interchange of corporate
information. Effective integration usually requires a good
deal of effort on the part of the information provider, as
well as the recipient of each unique data source.
Sharing content among sites is most often called
syndication, a term we associate with licensed content such as
TV reruns and newspaper columns. Providing content from one
source for distribution in many different channels is what a
syndicate does, and it usually requires an established
business relationship. Companies like iSyndicate.com and
specifications such as Internet Content Exchange (ICE) are
examples of attempts to apply the traditional syndication
model to the Web. (For more information on ICE, see "Self-Service
Syndication with ICE," Web Techniques, November 1999.)
However, the Web also offers a new open-ended syndication
model that's hardly traditional.
The basis for this new model is an XML-based format known
as Rich Site Summary. RSS was first developed by Netscape to
drive channels for Netscape Netcenter. Netscape no longer
seems to be leading the RSS effort, but others, such as Dave
Winer of Userland Software, have picked it up. More
importantly, content providers like Slashdot, the Motley Fool,
Wired News, and Linux Today have been adopting RSS as a means
of circulating headlines and links to new stories on their
sites. RSS is becoming a vital "What's New" mechanism that
serves a variety of purposes while helping to attract traffic
from many different locations on the Web. RSS seems to be
succeeding because it's a simple way to solve a common problem
that extends far beyond the idea of syndication. RSS is a
better way to share data than more common approaches, such as
fetching and parsing HTML, or using proprietary APIs, database
dumps, and cobranding.
Grabbing and parsing the HTML from a provider's Web site is
the most common way to share data. The problem with this
cut-and-paste method is that an application must be developed
and maintained for each data source. These applications will
most likely have to change each time the provider changes the
HTML presentation. This can quickly become cumbersome and cost
prohibitive when gathering information from multiple sources.
APIs that let partners access data are an improvement, but
they also can create problems. First, APIs are usually
language dependent, and hence may require core competencies
unavailable in-house. Second, APIs are not extensible: You are
constrained to the data and functionality that the API
provides. Third, each API will be implemented differently
based on the habits and needs of the programmers that
developed it. You'll have to maintain in-house expertise for
each API you use.
Web sites also exchange data via database dumps. But the
data must be converted on both ends and you don't necessarily
eliminate the problem of dealing with multiple data formats.
This option would actually work if all content providers used
the same data model for delivering information, an improbable
scenario.
Cobranding is a method in which the information provider
hosts custom versions of the application for each customer.
This works out nicely for subscribers that don't have any
programming resources. The problem is that the data is either
presented in a generic format that doesn't fit the customer's
interface, or it requires that the content provider maintain a
cobranding template for each customer. While this is a good
solution, the functionality is limited to what the Web
application can provide. It also requires a large amount of
planning and development on the provider's part. However, this
technique has worked out nicely for companies like Amazon.com
that allow users to sign up and sell books from their own Web
sites.
Under the RSS model, each site publishes a file describing
the contents of its "channel." Other sites can subscribe to
that channel and grab its contents. The RSS file could be
converted to HTML and displayed directly on a subscriber site,
or it might be edited first to select only those items that
are appropriate for the site's audience. The nice thing about
RSS, of course, is that once you've built the system to
subscribe to one RSS channel, you can subscribe to thousands
of them.
RSS Syntax
RSS is an XML grammar for sharing data. That means that an
RSS file contains placeholders for data, which are identified
by a starting and ending tag. The first task required to RSS-enable
your site is to create such a file on your Web server. This
RSS file contains the title and description of items that you
want to promote on your site. As you'll see, an RSS file is
usually generated by a simple program but it can also be
created by hand.
Like any XML document, the first line of an RSS file
contains an XML declaration:
<?xml version="1.0"?>
While the XML declaration isn't required, it is recommended
for backwards compatibility.
The next item in an RSS file is the document type declaration,
which identifies the file as an RSS document and points to the
RSS DTD. A validating parser needs it to test the file against
the rules of the RSS DTD:
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
"http://my.netscape.com/publish/formats/rss-0.91.dtd">
The rss element is the root or top-level
element of an RSS file. The rss element must
specify the version attribute. (The current
version is 0.91). It may also contain an
encoding attribute (the default is UTF-8):
<rss version="0.91" encoding="ISO_8859-1">
The root element is the top-level element that contains the
rest of an XML document.
An rss element may contain one and only one
channel element. This element will contain the individual
items. Each channel must contain the following elements:
title - the name of the channel
description - a short description of the channel
link - the URL of the channel Web site
language - the language code of the channel. A list of values
is available from my.netscape.com. The code for U.S. English
is en-us
item - one or more item elements
A channel may also contain the following optional elements:
rating - the PICS rating for the channel Web
site. PICS ratings are assigned by an independent agency. A
list is available at
www.w3.org/PICS/raters
copyright - content copyright
pubDate - date the channel was published
lastBuildDate - date the RSS was last
updated
docs - additional information about the
channel
managingEditor - channel's managing editor
webMaster - channel Webmaster
image - channel image
textinput - allows a user to send an HTML
form text input string to a URL
skipHours - the hours that an aggregator
should not collect the RSS file
skipDays - the weekdays that an aggregator
should not collect the RSS file
See
Listing One for a complete example of an RSS channel for
XML.com. [Editor's Note: All listings referenced in this
article are available online at www.webtechniques.com/sourcecode.]
A channel may contain an image or logo. The image
element must contain the image title, commonly used as the
ALT attribute when converted to an HTML image
element, and the URL of the image itself.
The image element may also include the
following optional elements:
link - a URL that the image should be
linked to
width - the image width
height - the image height
description - an area for additional text
The textinput element lets users submit data through an HTML
text field. It contains the following subelements:
title - label of the submit button
description - text input description
name - text input name
link - URL to which to send the input
For example, the Freshmeat channel in
Listing Two (available online) contains a textinput
element that lets users search the application database.
Each channel can contain up to 15 items. Actually, you can
include more, but if you do, Netscape Netcenter won't accept
the file. Each item contains a title, link, and description.
The item elements are the real meat of the RSS
file. They provide the headlines and summaries of the content
you want to share with other sites.
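To make the structure concrete, here is a minimal hand-written channel emitted from a short Perl script. It is a sketch rather than one of the article's listings: the example.com URLs, the titles, and the variable name are placeholders assumed for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A minimal RSS 0.91 document written by hand. All titles and URLs
# are hypothetical placeholders, not taken from a real channel.
my $rss_text = <<'END_RSS';
<?xml version="1.0"?>
<!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
  "http://my.netscape.com/publish/formats/rss-0.91.dtd">
<rss version="0.91">
  <channel>
    <title>Example Channel</title>
    <description>Headlines from an example site</description>
    <link>http://www.example.com/</link>
    <language>en-us</language>
    <image>
      <title>Example Channel</title>
      <url>http://www.example.com/logo.gif</url>
      <link>http://www.example.com/</link>
    </image>
    <item>
      <title>First headline</title>
      <link>http://www.example.com/stories/1.html</link>
      <description>A one-sentence teaser for the story.</description>
    </item>
  </channel>
</rss>
END_RSS

# Send the document to standard output; redirect it to a file on
# the Web server to publish it.
print $rss_text;
```

The required channel subelements come first, followed by the optional image and a single item.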
The RSS specification includes all HTML entities for
convenience; however, you can't include any HTML elements,
such as <p>. For the RSS file to remain valid you
should use only those elements that have been defined in the
specification. Additionally, you must follow a few basic XML
guidelines for the file to be well formed. An XML parser can't
properly parse an XML file unless it obeys the following
well-formedness rules:
1) Each starting tag must have an ending tag.
2) Reserved characters such as &, ", <, and > must be encoded
as the entities &amp;, &quot;, &lt;, and &gt;.
3) XML elements must be properly nested; that is, the end tag
should be at the same level in the tree as the start tag.
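The second rule is the one most often broken when headlines are pasted into a file by hand. A small helper along the following lines can encode the reserved characters before the text is written out. This is a sketch; the name escape_xml is hypothetical and not part of any module discussed here.

```perl
use strict;
use warnings;

# Replace the characters that XML reserves with their predefined
# entities. The ampersand is handled first; otherwise the ampersands
# introduced by the later replacements would be escaped a second time.
sub escape_xml {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

print escape_xml('AT&T says "x < y"'), "\n";
# prints: AT&amp;T says &quot;x &lt; y&quot;
```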
Creating an RSS File
I've written a Perl module, XML::RSS, that makes it easy to
maintain and parse RSS files. XML::RSS requires the XML::Parser
module maintained by Clark Cooper. Both are available through
CPAN. In addition to the tools described in this article, I've
also developed a number of other freely available RSS tools
for gathering, editing, and displaying RSS files, most of
which are available at motherofperl.com. Instructions on
installing the XML::RSS module are also available from the
site.
To use the module in a Perl program, you must first load
the module into memory and create a new instance of the class:
use XML::RSS;
my $rss = new XML::RSS;
Optionally, you can pass the RSS version and the character
encoding into the new method when creating a new instance:
my $rss = new XML::RSS (version=> '0.91',
encoding=>'ISO_8859-1');
XML::RSS simplifies several common tasks related to
maintaining an RSS file. First, the module abstracts the XML
syntax into a number of class methods. For each RSS element,
there is a related method. Each element method operates in a
similar fashion. For example, to set values for the channel
element, we would call the channel method and pass it an
associative array, which contains the names and values of each
channel subelement (see
Example 1).
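A call of that kind might look like the following sketch. It is not the article's Example 1, and the channel values are hypothetical placeholders.

```perl
use strict;
use warnings;
use XML::RSS;

my $rss = XML::RSS->new(version => '0.91');

# Set several channel subelements in one call by passing
# name/value pairs. The values are placeholders.
$rss->channel(
    title       => 'Example Channel',
    link        => 'http://www.example.com/',
    description => 'Headlines from an example site',
    language    => 'en-us',
);

# Calling the same method with a single name reads a value back.
print $rss->channel('title'), "\n";
```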
You can also use these methods to modify values of the RSS.
For example, to change the URL of the RSS image, you might use
the following:
$rss->image(url => 'http://freshmeat.net/images/fm.mini.jpg');
Because there are multiple items in an RSS file, the
add_item method is used to add a new RSS item, as shown
in
Example 2.
By default, the add_item method appends the
item to the end of the list, but you can also force the item
to be inserted at the front by setting the mode to insert, as
shown in
Example 3.
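The difference between the two modes can be seen in a short sketch such as the one below. It stands in for neither Example 2 nor Example 3, and the story titles and URLs are hypothetical.

```perl
use strict;
use warnings;
use XML::RSS;

my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    title       => 'Example Channel',
    link        => 'http://www.example.com/',
    description => 'Headlines from an example site',
);

# Default mode: the item is appended to the end of the list.
$rss->add_item(
    title       => 'Older story',
    link        => 'http://www.example.com/stories/1.html',
    description => 'Teaser for the older story.',
);

# mode => 'insert' places the item at the front of the list instead.
$rss->add_item(
    title => 'Newest story',
    link  => 'http://www.example.com/stories/2.html',
    mode  => 'insert',
);

# The items are kept in an array reference, newest first here.
print "$_->{title}\n" for @{ $rss->{items} };
```

Inserting at the front suits a "What's New" channel, where the newest headline should lead.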
Retrieving the values of an RSS file is also simple, but
first, you probably want to parse an RSS file. The
parsefile method takes the RSS filename as its only
parameter and transforms it into a multidimensional hash:
$rss->parsefile("fm.rss");
To access the value of a subelement, simply pass the name
of the subelement into the method. For example, to retrieve
the value of the textinput description:
my $ti_desc = $rss->textinput("description");
The element method will return the value of
the subelement. Once you've created and/or modified the RSS
file, you can save it with the save method:
$rss->save("fm.rss");
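The save, parse, and retrieve steps fit together as in the following sketch. It writes a small channel to a temporary file so that it can run on its own; the file name, the channel values, and the textinput fields are all hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use XML::RSS;

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/example.rss";

# Build a small channel with a textinput element and save it.
my $out = XML::RSS->new(version => '0.91');
$out->channel(
    title       => 'Example Channel',
    link        => 'http://www.example.com/',
    description => 'Headlines from an example site',
    language    => 'en-us',
);
$out->textinput(
    title       => 'Search',
    description => 'Search the example site',
    name        => 'query',
    link        => 'http://www.example.com/search',
);
$out->add_item(
    title => 'First headline',
    link  => 'http://www.example.com/stories/1.html',
);
$out->save($file);

# Parse the saved file into a fresh object and read a value back.
my $in = XML::RSS->new;
$in->parsefile($file);
my $ti_desc = $in->textinput('description');
print "$ti_desc\n";
```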
Before you can begin syndicating content, you'll need to
set up a process to keep the RSS file up-to-date. Optimally,
when a new item is posted to your Web site, it will also show
up in the RSS file. You can maintain this file by hand, but I
suspect most will prefer to automate the process with a
script.
Listing Two is a Perl script that uses the XML::RSS module
to create a channel for Freshmeat and save it to fm.rdf.
The output of the script is contained in
Listing Three.
The XML::RSS module also makes it easy to update an RSS
file.
Listing Four is a short script that inserts a new item in
our Freshmeat RSS file. Notice the order of the script. First,
we load the module into memory with the use
statement. Then we create a new instance of the class with the
new method, setting the RSS version to 0.91.
Next we parse an RSS file with the parsefile
method, insert a new item with the add_item
method, and then save the RSS to a file with the save
method.
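The same sequence can be wrapped in a small helper, sketched below. It is not Listing Four: the helper name add_headline is hypothetical, the sample data is invented, and the trimming step is an added assumption that keeps the channel within the 15-item limit mentioned earlier.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use XML::RSS;

my $MAX_ITEMS = 15;    # the Netcenter limit on items per channel

# Hypothetical helper: insert a headline at the top of an existing
# RSS file, dropping the oldest items so the limit is not exceeded.
sub add_headline {
    my ($file, %item) = @_;
    my $rss = XML::RSS->new(version => '0.91');
    $rss->parsefile($file);
    pop @{ $rss->{items} } while @{ $rss->{items} } >= $MAX_ITEMS;
    $rss->add_item(%item, mode => 'insert');
    $rss->save($file);
    return scalar @{ $rss->{items} };
}

# Demonstration: create a starter file with one item, then update it.
my $file = tempdir(CLEANUP => 1) . '/example.rss';
my $seed = XML::RSS->new(version => '0.91');
$seed->channel(
    title       => 'Example Channel',
    link        => 'http://www.example.com/',
    description => 'Headlines from an example site',
    language    => 'en-us',
);
$seed->add_item(
    title => 'Older story',
    link  => 'http://www.example.com/stories/1.html',
);
$seed->save($file);

my $count = add_headline(
    $file,
    title => 'Newest story',
    link  => 'http://www.example.com/stories/2.html',
);
print "$count items in channel\n";
```

A script like this can be called from whatever process posts new stories to the site, so the RSS file never lags behind the front page.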
Converting an RSS File to HTML
The previous two examples demonstrated how to maintain an
RSS file, but what if you're on the receiving end? The easiest
method of displaying an RSS file on a Web site is to convert
it to HTML and use an SSI to bring the content into a
template.
Listing Five does just that. It's a command-line script
that takes a filename or URL as a parameter, iterates through
the XML::RSS internal structure, and prints the HTML
equivalent. If the command-line parameter is an HTTP URL, the
RSS file is fetched from the remote Web server via the
LWP::Simple module.
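The choice between a local file and a remote URL might be handled as in the sketch below. It is not taken from Listing Five; the helper name load_channel is hypothetical, and LWP::Simple is loaded only when a URL is actually given.

```perl
use strict;
use warnings;
use XML::RSS;

# Hypothetical helper: load a channel from a local file or from
# a URL, returning the populated XML::RSS object.
sub load_channel {
    my ($source) = @_;
    my $rss = XML::RSS->new;
    if ($source =~ m{^https?://}i) {
        # Fetch the remote file; get() returns undef on failure.
        require LWP::Simple;
        my $content = LWP::Simple::get($source);
        die "Could not retrieve $source\n" unless defined $content;
        $rss->parse($content);
    }
    else {
        $rss->parsefile($source);
    }
    return $rss;
}

# Usage: pass a filename or a URL on the command line.
if (@ARGV) {
    my $rss = load_channel($ARGV[0]);
    print $rss->channel('title'), "\n";
}
```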
In
Listing Five, we iterate through items inside a
foreach loop, printing the corresponding title and
link. The last part of the subroutine prints the HTML form
using the textinput subelements. The result is an
HTML form field that lets a user search for applications on
Freshmeat.
Listing Six is the output of the script when using
Listing One as the input file.
Now that we have the XML.com channel in an HTML format, we
can include it on our Web site. The majority of the script is
contained in the print_html subroutine, which
handles the RSS-to-HTML conversion. Most of the subroutine is
actually HTML code.
The first few lines of the subroutine print a table header
that contains the channel title, link, and image. As I
mentioned previously, the XML::RSS module builds a
multidimensional hash that represents the RSS file. The hash
can be accessed directly instead of using the class methods.
For example, the channel title and link are contained in
$rss->{'channel'}->{'title'}
and in
$rss->{'channel'}->{'link'},
respectively. The image URL and link are contained in
$rss->{'image'}->{'url'}
and
$rss->{'image'}->{'link'}.
$rss->{'items'}
is a reference to the array of RSS items.
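Walking that structure directly could look like the following sketch. It is a simplified stand-in for the print_html subroutine, not a copy of it: it emits a heading and a list rather than a table, the sub names are hypothetical, and the sample hash only mimics the keys described above.

```perl
use strict;
use warnings;

# Minimal HTML escaping for text taken from a feed.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Build an HTML fragment from the channel and items. $rss may be
# an XML::RSS object after parsing, or any hash reference with
# the same 'channel' and 'items' keys.
sub channel_to_html {
    my ($rss) = @_;
    my $html = sprintf qq{<h3><a href="%s">%s</a></h3>\n<ul>\n},
        escape_html($rss->{channel}{link}),
        escape_html($rss->{channel}{title});
    foreach my $item (@{ $rss->{items} }) {
        $html .= sprintf qq{  <li><a href="%s">%s</a></li>\n},
            escape_html($item->{link}),
            escape_html($item->{title});
    }
    return $html . "</ul>\n";
}

# Hypothetical sample data shaped like the parsed structure.
my %sample = (
    channel => {
        title => 'Example Channel',
        link  => 'http://www.example.com/',
    },
    items => [
        { title => 'Tom & Jerry',
          link  => 'http://www.example.com/stories/1.html' },
    ],
);

print channel_to_html(\%sample);
```

Returning a string instead of printing inside the loop makes it easy to write the fragment to a file that an SSI can pull into a template.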
Syndication
Once an RSS file exists, any other site can grab it
regularly. RSS standardizes a format for the delivery of
content. This makes it easier for a content provider to
distribute content broadly, and for an affiliate to receive
and process content from multiple sources. However, in most
cases, the actual content is not really distributed, only the
headlines are, which means that users will come back to your
site if they're interested in the story. This matters because
many content providers use ad banners as a primary source of
revenue, a model that depends on a large volume of users
reading their content on a regular basis. The RSS format is a
natural fit for extending readership, which explains why most
early adopters have been news providers.
Here's how it works. First, start generating one or more
RSS files for your Web site. Drop the headline into the
title element and give a teaser or summary in the
description element. Drop the content URL into the
link element. Second, make the RSS file available
on your Web server and register it on as many aggregators as
you can (see "Online"). Traffic to your Web site will increase
as users
add your RSS channel to their Web sites and news readers.
Once you're publishing an RSS file you can begin to flow
content into new venues such as mailing lists, PDAs, cell phones,
and set-top boxes. For example, you may decide to offer
headlines in a PDA-friendly format, or create a weekly email
newsletter comprising what's new on your Web site. More
importantly, you can now flow data between partner or
affiliate Web sites.
Let's pretend for a moment that your site is part of a Web
development affiliate network. Each site focuses on a
particular specialty or area of interest. You would like to
cross-promote headlines among sites to maximize readership.
Also, there are times when one affiliate may carry content
that directly relates to another site's readership.
Cross-posting this information is in the interest of all
affiliates. If a Web site makes its RSS files available, its
affiliates can easily integrate the provider's headlines. When
users read the headline and click on the link to read the
story, both sites get their page views.
Aggregation
The practice of gathering multiple RSS channels into one
central location is called aggregation. While most aggregator
Web sites share a common goal -- gathering content -- they
serve different purposes. For example, my.netscape.com offers
its feeds as channels to Netcenter users, whereas
iSyndicate.com offers news feeds primarily for use on other
Web sites. Another implementation of aggregation is Dave
Winer's my.userland.com, which offers a service similar to
my.netscape.com. However, the aggregator also offers aggregate
feeds, which send new content to partners via XML-RPC function
calls. The benefit of using aggregators is that they make many
feeds available from one place. Furthermore, an aggregator may
offer tools or solutions that allow partners to customize
feeds and minimize the integration effort. In addition, an
aggregator site might provide tools and services that make it
easier for content providers to syndicate their information.
Weblogs
One of the more interesting trends the Web has seen in the
past few months is the advent of the Weblog. A Weblog is a portal
to the life of an individual or group. The ideas posted on a
Weblog often include personal, political, technical, or
editorial comments that are significant to the author. The Web
site that popularized the Weblog is probably Slashdot.org, a
site that posts interesting technology tidbits for computer
geeks. Scripting.com, an earlier example of a Weblog, is a
site at which readers get a personal insight into the mind of
Dave Winer. Dave often combines his opinions of technical
innovations with politics, philosophy, and history, which
makes for an interesting daily read.
It turns out that RSS is a good foundation for creating a
Weblog. An example is PerlXML.com, a site containing Perl/XML
resources and news. A simple CGI script that uses the XML::RSS
Perl module is used to add new headlines. The script updates
both the front-page HTML file and the RSS headlines, which are
then picked up by several aggregators including
my.netscape.com and my.userland.com. This dual-purpose method
relieves the Weblog editor of updating multiple files.
Instead, the editor can focus on his or her job and let an
application on the Web server do the work behind the scenes.
The Future
RSS can be used easily as a generic format for exchanging
content on the Web. More Web sites are using XML and RSS as
they discover that the technologies help promote traffic to a
site. RSS is a good starting point for many Webmasters who
aren't ready to immerse themselves in XML yet.
It's important to note that while RSS is capable of
syndicating content headlines, there are other XML formats
like XMLNews and ICE that are better suited for handling
larger syndication systems.
Jonathan is vice president of technology at Whirlwind
Interactive (www.wwind.com)
and can be reached via email at
eisen@wwind.com