UpdatingGoodRelationsData

From Wiki of the E-Business and Web Science Research Group
GoodRelations is a standardized vocabulary for product, price, and company data that can be embedded into existing static and dynamic Web pages.


Author: Martin Hepp, mheppATcomputerDOTorg


Much more than any other type of rich meta-data on the Web, GoodRelations-related data is subject to change and updates, e.g.

  • new prices,
  • changes in availability, or
  • new features.

Traditional Web search engines typically crawl a given page only once every couple of weeks. Usually, PageRank or other popularity metrics are used to decide which pages to crawl more frequently than others. So BestBuy pages may be crawled every day, while Peter Miller's Hardware Shop site in rural Kentucky may be checked only once every two months. Only a small fraction of Web resources that can be expected to change very often, such as Twitter and some blogs, are re-crawled much more frequently.

The simple reason is that crawling consumes a lot of resources on both ends: both the search engine and the servers hosting the site are subject to CPU and IP traffic load. Since those resources are costly and limited, a good search engine will employ sophisticated algorithms to decide when to visit your page again.

It is therefore important that you help search engines decide when to crawl your page again.

Validity of GoodRelations Data

The first important thing is to be clear about the expected validity of the statements in your GoodRelations data. For example,

  • for how long will the general statement that you sell pianos be valid?
  • for how long will a certain price specification be valid?
  • for how long will a particular opening hours specification be valid? (This is a feature that will be added in the next service update only.)

In many cases, it is hard to determine exact dates for each statement in advance, so we need to apply heuristics (rules of thumb). You should find a good balance between the two extremes, i.e.

  1. validity intervals that are too short: Your data is marked as expired, but you have not updated it in the meantime. For example, if you publish a static company profile on your Web page and state that it is valid for 30 days but never update the file to renew the statement, then search engines will assume that your data is no longer valid and may ignore it.
  2. validity intervals that are too long: For example, you use default validity intervals of a couple of months while your prices change on a daily basis. In that case, someone may falsely assume that your data is still valid when it is already outdated.

A good heuristic is 48 to 72 hours of validity for typical shop pages that are generated dynamically from a database, and one year for static company profiles, both counted from the creation of the respective RDF/XML or HTML+RDFa resource.


So for a database-driven shop application, you could estimate the validity as

  • valid from: current date and time
  • valid through: valid from + 48 hours
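
The estimate above can be sketched in a few lines of Python. This is only an illustration: the function name validity_interval and its hours parameter are assumptions made for this example and are not part of GoodRelations. It returns both timestamps as strings in the lexical form expected for xsd:dateTime.

```python
from datetime import datetime, timedelta, timezone


def validity_interval(now=None, hours=48):
    """Return (valid_from, valid_through) as xsd:dateTime strings.

    Hypothetical helper: the name and the 48-hour default only
    mirror the heuristic described in the text.
    """
    if now is None:
        # Use an explicit time zone so that the offset is part of the value.
        now = datetime.now(timezone.utc)
    start = now.replace(microsecond=0)
    end = start + timedelta(hours=hours)
    return start.isoformat(), end.isoformat()


# Example with a fixed point in time (UTC+1).
fixed = datetime(2010, 3, 4, 0, 0, 0, tzinfo=timezone(timedelta(hours=1)))
valid_from, valid_through = validity_interval(fixed)
print(valid_from)
print(valid_through)
```

In a shop application, the two strings would be written into the gr:validFrom and gr:validThrough values each time a page is generated.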


Techniques for Indicating Validity

Now, there are several techniques that can be used to communicate the validity of your GoodRelations data, and which may be used by search engines and indexing services to decide upon the intervals for checking back with your site:

gr:validFrom and gr:validThrough

The GoodRelations vocabulary itself defines two datatype properties that can be attached to instances of gr:Offering, gr:OpeningHoursSpecification, or gr:PriceSpecification.

They should be used at least for gr:Offering nodes.

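@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# The foo: namespace is only a placeholder for your own namespace.
@prefix foo: <http://www.example.com/xyz#> .

foo:myOffering
	a gr:Offering ;
	foaf:page <http://www.example.com/xyz> ;
	rdfs:comment "We sell and repair computers and motorbikes"@en ;
	gr:includes foo:myProducts ;
	gr:hasBusinessFunction gr:Sell, gr:Repair ;
	gr:validFrom "2010-03-04T00:00:00+01:00"^^xsd:dateTime ;
	gr:validThrough "2010-03-06T00:00:00+01:00"^^xsd:dateTime .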

HTTP Response Headers

The HTTP protocol specifies several means by which a server can indicate for how long it is safe to use a cached copy of a resource (more precisely: of its representation). This is explained in more detail in section 13.2.1, Server-Specified Expiration, of RFC 2616, the authoritative specification of the HTTP 1.1 protocol.

Basically, you should configure your HTTP server so that it specifies a cache expiration time that is earlier than or identical to the end of the validity of the GoodRelations data contained in the representation. If there are multiple validity specifications in the same resource (e.g. one for the offering and one for the price), then the one that expires first should be used.

Servers specify explicit expiration times using either the Expires header, or the max-age directive of the Cache-Control header.
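
As a rough sketch of how a dynamically generated page could derive these headers from its GoodRelations data, the following Python function takes all gr:validThrough values found in a resource and uses the earliest one. The function name cache_headers and the way the values are passed in are assumptions for this example; how the headers are actually attached to the response depends on your server or framework.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(valid_through_values, now):
    """Build Expires and Cache-Control headers from validity end dates.

    Hypothetical helper: valid_through_values is a list of
    timezone-aware datetimes, one per gr:validThrough in the page.
    """
    # The statement that expires first limits the whole representation.
    earliest = min(valid_through_values)
    # max-age is given in seconds and must not be negative.
    max_age = max(0, int((earliest - now).total_seconds()))
    return {
        "Cache-Control": f"max-age={max_age}",
        # HTTP dates are always expressed in GMT.
        "Expires": format_datetime(earliest.astimezone(timezone.utc), usegmt=True),
    }


cet = timezone(timedelta(hours=1))
now = datetime(2010, 3, 4, 0, 0, tzinfo=cet)
offering_end = datetime(2010, 3, 6, 0, 0, tzinfo=cet)
price_end = datetime(2010, 3, 5, 12, 0, tzinfo=cet)
headers = cache_headers([offering_end, price_end], now)
print(headers)
```

Here the price specification expires before the offering, so both headers are derived from the price's end of validity.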


Sitemaps

Sitemap documents following the Sitemap Protocol are an important technique to help search engines and other crawlers find all relevant resources in a given Web site. You should always provide a sitemap file for your shop and list the URIs of all individual pages that contain GoodRelations markup in RDF/XML or RDFa.

Besides helping a crawler to initially discover all pages contained in your site, you can also use the sitemap document to indicate how frequently the content is expected to change, and in particular, which pages will change more frequently than others.

Here is a minimal example of a sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
     <loc>http://www.example.com/</loc>
     <lastmod>2010-03-04</lastmod>
     <changefreq>monthly</changefreq>
     <priority>0.8</priority>
  </url>
</urlset>

The elements of interest here are lastmod, changefreq, and priority.

The following definitions are slightly adapted excerpts from the sitemap protocol specification:

  • lastmod: This field should contain the date or date and time on which the respective file was created or modified. The date should be in W3C Datetime format. This format allows you to omit the time portion, if desired, and use YYYY-MM-DD.
  • changefreq: This field can be used to indicate how frequently the page is likely to change. This value provides general information to search engines and may not correlate exactly to how often they crawl the page. Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers may consider this information when making decisions, they may crawl pages marked "hourly" less frequently than that, and they may crawl pages marked "yearly" more frequently than that. Crawlers may periodically crawl pages marked "never" so that they can handle unexpected changes to those pages. Valid values are
    • always
      • The value "always" should be used to describe documents that change each time they are accessed.
    • hourly
    • daily
    • weekly
    • monthly
    • yearly
    • never
      • The value "never" should be used to describe archived URIs.
  • priority: The priority of this URI relative to other URIs on your site. Valid values range from 0.0 to 1.0. This value does not affect how your pages are compared to pages on other sites. It only lets the search engines know which pages you deem most important for the crawlers.
    • The default priority of a page is 0.5.
    • Please note that the priority you assign to a page is not likely to influence the position of your URIs in a search engine's result pages. Search engines may use this information when selecting between URIs on the same site, so you can use this tag to increase the likelihood that your most important pages are present in a search index.
    • Also, please note that assigning a high priority to all of the URIs on your site is not likely to help you. Since the priority is relative, it is only used to select between URIs on your site.

Semantic Sitemap Extension (tbd)

You can use the Semantic Sitemap Extension for helping crawlers find and fetch data dumps and SPARQL endpoints related to your data.

How to Combine the Techniques?

All three techniques should be used in parallel and should ideally indicate the same validity period, at least approximately.

As a minimal solution,

  • attach gr:validFrom and gr:validThrough to each gr:Offering and each gr:UnitPriceSpecification. When in doubt, take the current date for the beginning and the current date plus 48 hours for the end of the interval.
  • indicate the current lastmod date for each resource in the sitemap file. That means you have to update that file each time you update a single page. For a simple database-driven Web shop, you could regenerate the file every day and insert the current date for each and every URI listed in the sitemap. The changefreq element should be set to always or daily.
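
The daily regeneration of the sitemap could look roughly like the following Python sketch, which uses only the standard library. The function name build_sitemap and the example URIs are assumptions for illustration; in a real shop, the list of URIs would come from your database.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod, changefreq="daily"):
    """Return a sitemap document as a string (hypothetical helper).

    Every URI gets the same lastmod date and changefreq value,
    as suggested for a simple database-driven shop.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # date.isoformat() yields YYYY-MM-DD, a valid W3C Datetime value.
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
        ET.SubElement(url, "changefreq").text = changefreq
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Example URIs are placeholders.
pages = ["http://www.example.com/", "http://www.example.com/xyz"]
sitemap_xml = build_sitemap(pages, date(2010, 3, 4))
print(sitemap_xml)
```

In production you would pass date.today() as lastmod and write the result to the sitemap file of your site, for example from a daily scheduled job.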

Also, for any single page, your server should not specify a cache lifetime that extends beyond the point at which the first gr:Offering or gr:UnitPriceSpecification contained in that resource expires.

If you want to minimize crawler / spider traffic and maximize the freshness of your data in search engines and indexing services, you should try to use priority to point the crawler to the subset of products or data that changes more frequently than the rest. For example, memory chips and hard disk drives are usually subject to more substantial variations in price than books or cables.
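
One possible way to derive priority values is to scale them by how often each page has changed recently. The following Python sketch is only one such heuristic and is not prescribed by the Sitemap Protocol; the function name, the input format (a mapping from URI to the number of recent price changes), and the linear scaling between 0.1 and 1.0 are all assumptions made for this example.

```python
def priorities_from_change_counts(change_counts):
    """Map each URI to a sitemap priority between 0.1 and 1.0.

    Hypothetical heuristic: the page with the most recent changes
    gets 1.0, pages without any change get 0.1.
    """
    highest = max(change_counts.values(), default=0)
    if highest == 0:
        # Nothing changed anywhere: fall back to the protocol default.
        return {url: 0.5 for url in change_counts}
    return {
        url: round(0.1 + 0.9 * count / highest, 1)
        for url, count in change_counts.items()
    }


# Placeholder URIs and change counts for illustration.
recent_changes = {
    "http://www.example.com/memory-chip": 30,
    "http://www.example.com/cable": 3,
    "http://www.example.com/book": 0,
}
priorities = priorities_from_change_counts(recent_changes)
print(priorities)
```

Because priority is relative to other URIs on the same site, only the ordering and spread of the values matter, not their absolute level.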

Acknowledgements

Thanks to Giovanni Tumarello from DERI for raising this topic and to Kingsley Idehen from OpenLink Software for initially pointing me to the HTTP caching mechanisms.

References

  1. RFC 2616
  2. Sitemap Protocol
  3. GoodRelations Language Reference: validFrom and validThrough
  4. GoodRelations Primer