This is the archive of the goodrelations discussion list

Hi Martin,

While I am all for using the sitemaps to find the individual pages,

the step of going through URIburner is perilous. When the data
is explicitly there in RDFa, it's clear the producer wants it to be
reused. When it's not, you're in the scraping business (OK, URIburner
might use the direct APIs sometimes, but still).

I think that with BestBuy having RDFa, and with the support from Google,
it shouldn't be long before Amazon puts actual RDFa on their pages.
(Does anyone know someone at Amazon? :-) )

cheers
Giovanni

On Fri, Jan 15, 2010 at 3:08 PM, Martin Hepp (UniBW)
<martin.hepp at ebusiness-unibw.org> wrote:
> Hi all,
>
> It seems there is a quick and easy way to get a full RDF/XML
> representation of all 20 Million Amazon offers.
>
> Here is how it will likely work:
>
> 1. Take the Amazon sitemap index files, as given by
> http://www.amazon.com/robots.txt
>
> # Sitemap files
> Sitemap: http://www.amazon.de/sitemap_index_0.xml
> Sitemap: http://www.amazon.de/sitemap_index_1.xml
> Sitemap: http://www.amazon.de/sitemap_index_2.xml
> Sitemap: http://www.amazon.de/sitemap_index_3.xml
> Sitemap: http://www.amazon.de/sitemap-manual-index.xml
> Sitemap: http://www.amazon.de/sitemap_wishlist_index.xml
>
>
> 2. Take the individual sitemap files from all of those, e.g.
>
> http://www.amazon.de/sitemap_page_0.xml.gz from
> http://www.amazon.de/sitemap_index_0.xml
>
> <sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
> <sitemap>
> <loc>http://www.amazon.de/sitemap_page_0.xml.gz</loc>
> <lastmod>2006-10-16</lastmod>
> </sitemap>
>
> 3. Now, for each of those ca. 20 Million entries given as <loc>
> elements, e.g.
>
> http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/
>
> <url>
> <loc>http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/</loc>
> </url>
>
> use the URIburner service (http://uriburner.com/sparql/) to extract the
> complete commercial meta-data in GoodRelations.
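
Steps 1 and 2 above (reading the Sitemap lines from robots.txt, then pulling the <loc> entries out of each sitemap file) can be sketched roughly as follows. This is a minimal illustration, not Martin's actual tooling: the function names are hypothetical, fetching the files over HTTP is left out, and it assumes the sitemap files are well-formed XML, optionally gzip-compressed.

```python
import gzip
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the "http:" in the URL survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def locs_from_sitemap(xml_bytes: bytes) -> list[str]:
    """Return every <loc> value from a sitemap index or a URL sitemap.

    Gzip-compressed input (such as sitemap_page_0.xml.gz) is decompressed
    first. XML namespaces are ignored, so the 0.84 schema shown in the
    message and other schema versions are handled the same way.
    """
    if xml_bytes[:2] == b"\x1f\x8b":  # gzip magic number
        xml_bytes = gzip.decompress(xml_bytes)
    root = ET.fromstring(xml_bytes)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]
```

Calling locs_from_sitemap on an index file yields the per-page sitemap files; calling it again on each of those yields the product page URLs for step 3.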
>
> Note that not all URIs are current and that URIburner cannot produce
> GoodRelations data for all pages, but it can for the majority of the
> ca. 20 Million pages.
>
> You will get, on average, 200 GoodRelations triples per Amazon page, so
> the total will be on the order of 4 billion triples (20 million pages
> x 200 triples each)!
>
> If you want to check it for yourself, try
>
> select COUNT (*) WHERE
> {?s ?p ?o.
> FILTER (regex(?o, "^http://purl.org/goodrelations/v1#", "i") or
> regex(?p, "^http://purl.org/goodrelations/v1#", "i"))
> }
>
> against the URI
>
> http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/
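
As a rough sketch, the check above could be turned into a request URL for the endpoint named in the message. Several things here are assumptions rather than anything stated in the thread: that "against the URI" can be expressed as a FROM clause naming the page, that the endpoint accepts a standard SPARQL Protocol GET request with a "query" parameter, and the function names themselves. The query is also written in standard SPARQL 1.1 form, with str() around ?o and ?p because regex() is defined on string literals rather than IRIs.

```python
from urllib.parse import urlencode

GR_NS = "http://purl.org/goodrelations/v1#"
ENDPOINT = "http://uriburner.com/sparql/"  # endpoint named in the message


def count_query(page_url: str) -> str:
    """Build a COUNT query for GoodRelations triples about one page.

    Scoping the query with FROM <page_url> is an assumption about how
    the service is meant to be used; it is not confirmed by the thread.
    """
    if any(ch in page_url for ch in '<>" \n\t'):
        raise ValueError("page_url contains characters not allowed in an IRI")
    return (
        "SELECT (COUNT(*) AS ?n)\n"
        f"FROM <{page_url}>\n"
        "WHERE {\n"
        "  ?s ?p ?o .\n"
        f'  FILTER (regex(str(?o), "^{GR_NS}", "i") ||\n'
        f'          regex(str(?p), "^{GR_NS}", "i"))\n'
        "}"
    )


def request_url(page_url: str) -> str:
    """URL-encode the query as a SPARQL Protocol GET request."""
    return ENDPOINT + "?" + urlencode({"query": count_query(page_url)})
```

Given the load warning in the message, a URL like this is best tried on a single page first, not on the full list of sitemap entries.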
>
> Important: Using URIburner on the full set of Amazon URIs will likely
> impose a great load on the underlying server, operated by OpenLink
> Software. If you want to use this option, in particular for commercial
> purposes, please contact Kingsley Idehen before you start. His e-mail is
> <kidehen at openlinksw.com>.
>
> Best wishes
>
> Martin Hepp
>
>
> --
> --------------------------------------------------------------
> martin hepp
> e-business & web science research group
> universitaet der bundeswehr muenchen
>
> e-mail:  hepp at ebusiness-unibw.org
> phone:   +49-(0)89-6004-4217
> fax:     +49-(0)89-6004-4620
> www:     http://www.unibw.de/ebusiness/ (group)
>         http://www.heppnetz.de/ (personal)
> skype:   mfhepp
> twitter: mfhepp
>
>

[goodrelations] Get 4 Billion Triples of Current GoodRelations RDF/XML Data for 20 Million Amazon Pages