Hi all,
It seems there is a quick and easy way to get a full RDF/XML
representation of all 20 Million Amazon offers.
Here is how it will likely work:
1. Take the Amazon sitemap index files, as given by
http://www.amazon.com/robots.txt

# Sitemap files
Sitemap: http://www.amazon.de/sitemap_index_0.xml
Sitemap: http://www.amazon.de/sitemap_index_1.xml
Sitemap: http://www.amazon.de/sitemap_index_2.xml
Sitemap: http://www.amazon.de/sitemap_index_3.xml
Sitemap: http://www.amazon.de/sitemap-manual-index.xml
Sitemap: http://www.amazon.de/sitemap_wishlist_index.xml

2. Take the individual sitemap files from all of those, e.g.
http://www.amazon.de/sitemap_page_0.xml.gz from
http://www.amazon.de/sitemap_index_0.xml

<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
<sitemap>
<loc>http://www.amazon.de/sitemap_page_0.xml.gz</loc>
<lastmod>2006-10-16</lastmod>
</sitemap>

3. Now, for each of those ca. 20 million entries given as <loc>
elements, e.g.
http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/

<url>
<loc>http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/</loc>
</url>

use the URIburner service (http://uriburner.com/sparql/) to extract the
complete commercial meta-data in GoodRelations.
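
Steps 1 and 2 boil down to pulling the <loc> values out of sitemap XML. Here is a minimal Python sketch of that part, assuming the files follow the structure shown above. The helper names extract_locs and extract_locs_from_gzip are made up for illustration, and the downloading itself is left out:

```python
import gzip
import xml.etree.ElementTree as ET


def extract_locs(xml_bytes):
    """Return the text of every <loc> element, ignoring the XML namespace."""
    root = ET.fromstring(xml_bytes)
    locs = []
    for element in root.iter():
        # Tags look like "{namespace}loc"; compare only the local name.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


def extract_locs_from_gzip(gz_bytes):
    """Decompress a .xml.gz sitemap file and return its <loc> entries."""
    return extract_locs(gzip.decompress(gz_bytes))


# Sample taken from the sitemap index excerpt above, with the root
# element closed so that it parses on its own.
SAMPLE_INDEX = b"""<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
<sitemap>
<loc>
http://www.amazon.de/sitemap_page_0.xml.gz</loc>
<lastmod>2006-10-16</lastmod>
</sitemap>
</sitemapindex>"""

print(extract_locs(SAMPLE_INDEX))
```

Applied to a sitemap index, this yields the URLs of the individual sitemap files; applied to each downloaded .xml.gz file, it yields the page URLs needed for step 3.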

Note that not all URIs are current and that URIburner cannot produce
GoodRelations data for every page, but it can for the majority of the
ca. 20 million pages.
You will get, on average, 200 GoodRelations triples per Amazon page, so
the total will be on the order of 4 billion triples!
If you want to check it for yourself, try

select COUNT (*) WHERE
{?s ?p ?o.
FILTER (regex(?o, "^http://purl.org/goodrelations/v1#", "i") or
regex(?p, "^http://purl.org/goodrelations/v1#", "i"))
}

against the URI
http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/

Important: Using URIburner on the full set of Amazon URIs will likely
impose a great load on the underlying server, operated by OpenLink
Software. If you want to use this option, in particular for commercial
purposes, please contact Kingsley Idehen before you start. His e-mail is
<kidehen at openlinksw.com>.
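
The count query above could be scripted roughly as follows. This is only a sketch under assumptions not stated in this message: that the endpoint accepts the standard SPARQL protocol parameters query and default-graph-uri, and that it also understands Virtuoso-style format and should-sponge options. build_count_request is a made-up helper name. The code only builds the request URL and sends nothing:

```python
from urllib.parse import urlencode

ENDPOINT = "http://uriburner.com/sparql/"

# The query text as given in the message above.
COUNT_QUERY = """select COUNT (*) WHERE
{?s ?p ?o.
FILTER (regex(?o, "^http://purl.org/goodrelations/v1#", "i") or
regex(?p, "^http://purl.org/goodrelations/v1#", "i"))
}"""


def build_count_request(page_url, endpoint=ENDPOINT):
    """Build a GET URL that runs COUNT_QUERY against one Amazon page."""
    params = {
        # Standard SPARQL protocol parameters.
        "default-graph-uri": page_url,
        "query": COUNT_QUERY,
        # Assumption: Virtuoso-style options; check the endpoint's own
        # documentation before relying on them.
        "format": "application/sparql-results+json",
        "should-sponge": "soft",
    }
    return endpoint + "?" + urlencode(params)


url = build_count_request(
    "http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/"
)
print(url)
```

Keeping the URL construction separate from the actual HTTP call makes it easy to add throttling later, which matters given the load warning above.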
Best wishes
Martin Hepp
--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen
e-mail:  hepp at ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp
Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================
Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Webcasts:
Overview - http://www.heppnetz.de/projects/goodrelations/webcast/
How-to - http://vimeo.com/7583816

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Tutorial materials:
ISWC 2009 Tutorial: The Web of Data for E-Commerce in Brief: A Hands-on Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_ISWC2009