COLD 2015 Datasets

This Web page presents the datasets used in the paper "Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce" by Alex Stolz and Martin Hepp, published at the Sixth International Workshop on Consuming Linked Data (COLD 2015).

Contents

  1. Datasets
    1. Initial Sample of Web Sites
    2. Reduced Dataset
  2. Results
  3. References

Datasets

Initial Sample of Web Sites

We started with a random sample of 100 Web sites drawn from the Web Data Commons (WDC) dataset, restricted to the www namespace (see the paper for details). The sampled sites are listed below:

www.abloomaboveflorist.com
www.acuderm.com
www.antik-zentrum-alling.de
www.askariel.com
www.beburn.com
www.bellyarmor.com
www.bestpet.co.uk
www.bettymills.com
www.bezlimitno.ru
www.biascosmeticos.com.br
www.brasilpostos.com.br
www.buybuybaby.com
www.buynzuri.com
www.centralepneus.fr
www.chapmanpartnership.org
www.costumehut.com.au
www.ctcshop.org.uk
www.da-x.de
www.dairydoo.com
www.deelishables.com
www.descargargratis.com
www.district17.com
www.dogabisiklet.com
www.doorsandspecialties.com
www.dresses.ie
www.edwardarms.com
www.eimass.co.uk
www.epocaimobiliaria.com.br
www.eyepiecesetc.com
www.fdmbenzinpriser.dk
www.flightcentre.co.za
www.flowershopcarmichael.com
www.fmeextensions.com
www.gadgetbox.gr
www.gameof3halves.com
www.gymgrossisten.com
www.hamiltonparkhotel.com
www.handhmusic.co.uk
www.heryerehali.com
www.hypnoticworld.com
www.imoveissaojose.com
www.kahri.com
www.karben.com
www.kidssocks.com
www.kitdecomotoshop.fr
www.lautsprechertest.de
www.lease-termination.com
www.lindsborgcity.org
www.merin.pl
www.mgc.fr
www.migolondrina.com
www.millerrecyclingmanteca.com
www.missgirliegirl.com
www.missionmitsubishi.com
www.missmary.se
www.motorsport-event-company.com
www.musiccenter.com.pl
www.nanninember.it
www.nded.fr
www.newcenturyhobbies.com
www.nieuweruitenwissers.nl
www.northridgeflowers.net
www.nudampen.nl
www.onebrooklynfurniture.com
www.pennypackflowers.com
www.philips.co.in
www.polar.com
www.prokerala.com
www.promoprizewheels.com
www.radioassistance.com
www.recambiosparatractores.com
www.royalpizza.ru
www.rsberkeley.com
www.rue-des-puzzles.com
www.saltandpepperpotshop.co.uk
www.security-wizard.net
www.shapewearaustralia.com
www.shpuntik.com.ua
www.solo24.pl
www.sonomaharvestfoods.com
www.spectergear.com
www.speedousa.com
www.suplments.com
www.tao-distribution.com
www.tauchshoponline.ch
www.tenspros.com
www.testeri-bg.com
www.thecraft-studio.com
www.therealthingonline.co.za
www.tunelab-pro.software.informer.com
www.tutumommies.com
www.vansantdistributing.com
www.vgomes.com.br
www.vitamingrocer.com
www.waterbedbargains.com
www.wedoboxes.co.uk
www.windberflorist.com
www.xpradlo.sk
www.zapatosparatodos.es
www.zbrane-bazar.cz

Reduced Dataset

After excluding all domains without a sitemap.xml file, 68 domains remained for our analysis; they are listed in the following table. For every domain, we provide links to the metadata collected during the sitemap crawl, the raw sources of the sitemap.xml files, the Web pages with structured product offers detected for this domain, and the respective data from the Common Crawl (CC) / Web Data Commons (WDC) corpus. Domains whose sitemaps contain more than 10000 URLs were not crawled and are marked accordingly.

Domain | Sitemap | Common Crawl
Site Metadata | Sitemap.xml | Sitemap Crawl | CC Pages | WDC Pages with Structured Product Offer Data
1 www.abloomaboveflorist.com link link link link
2 www.acuderm.com link link link link
3 www.antik-zentrum-alling.de link link link link
4 www.askariel.com link link link link
5 www.bellyarmor.com link link link link
6 www.bestpet.co.uk link link link link
7 www.bettymills.com link not crawled (>10000 URLs) link link
8 www.bezlimitno.ru link link link link
9 www.buynzuri.com link link link link
10 www.chapmanpartnership.org link link link link
11 www.da-x.de link link link link
12 www.dairydoo.com link link link link
13 www.district17.com link not crawled (>10000 URLs) link link
14 www.doorsandspecialties.com link link link link
15 www.dresses.ie link link link link
16 www.eimass.co.uk link link link link
17 www.eyepiecesetc.com link link link link
18 www.fdmbenzinpriser.dk link link link link
19 www.flightcentre.co.za link link link link
20 www.flowershopcarmichael.com link link link link
21 www.fmeextensions.com link link link link
22 www.gadgetbox.gr link link link link
23 www.hamiltonparkhotel.com link link link link
24 www.handhmusic.co.uk link link link link
25 www.hypnoticworld.com link link link link
26 www.kahri.com link link link link
27 www.karben.com link link link link
28 www.kidssocks.com link link link link
29 www.kitdecomotoshop.fr link link link link
30 www.lautsprechertest.de link link link link
31 www.lease-termination.com link not crawled (>10000 URLs) link link
32 www.merin.pl link link link link
33 www.mgc.fr link link link link
34 www.migolondrina.com link link link link
35 www.missgirliegirl.com link link link link
36 www.missionmitsubishi.com link link link link
37 www.motorsport-event-company.com link link link link
38 www.musiccenter.com.pl link not crawled (>10000 URLs) link link
39 www.nded.fr link link link link
40 www.newcenturyhobbies.com link link link link
41 www.nieuweruitenwissers.nl link link link link
42 www.northridgeflowers.net link link link link
43 www.nudampen.nl link link link link
44 www.pennypackflowers.com link link link link
45 www.philips.co.in link not crawled (>10000 URLs) link link
46 www.polar.com link link link link
47 www.promoprizewheels.com link link link link
48 www.radioassistance.com link link link link
49 www.royalpizza.ru link link link link
50 www.rsberkeley.com link link link link
51 www.rue-des-puzzles.com link link link link
52 www.saltandpepperpotshop.co.uk link link link link
53 www.security-wizard.net link link link link
54 www.shapewearaustralia.com link link link link
55 www.shpuntik.com.ua link link link link
56 www.solo24.pl link link link link
57 www.spectergear.com link link link link
58 www.suplments.com link not crawled (>10000 URLs) link link
59 www.tao-distribution.com link link link link
60 www.tauchshoponline.ch link link link link
61 www.tenspros.com link link link link
62 www.thecraft-studio.com link link link link
63 www.therealthingonline.co.za link link link link
64 www.vitamingrocer.com link not crawled (>10000 URLs) link link
65 www.waterbedbargains.com link link link link
66 www.wedoboxes.co.uk link link link link
67 www.windberflorist.com link link link link
68 www.xpradlo.sk link link link link
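
The selection rule described in this section (a domain needs a sitemap.xml file, and sitemaps with more than 10000 URLs were not crawled) can be illustrated with a minimal Python sketch. This is not the code used for the paper: the function names, the sample document, and the choice to read only `<url>/<loc>` entries of a plain `urlset` (ignoring sitemap index files) are assumptions made for illustration. Only the namespace URI is taken from the public sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Threshold taken from the "not crawled (>10000 URLs)" note in the table.
MAX_URLS = 10000


def sitemap_urls(xml_text):
    """Return the <loc> values of all <url> entries in a urlset document.

    Hypothetical helper: sitemap index files (<sitemapindex>) are not
    followed here and simply yield an empty list.
    """
    root = ET.fromstring(xml_text)
    locs = root.findall(SITEMAP_NS + "url/" + SITEMAP_NS + "loc")
    return [loc.text.strip() for loc in locs if loc.text]


def should_crawl(urls, max_urls=MAX_URLS):
    """Assumed rule: crawl only if the sitemap lists at most max_urls URLs."""
    return len(urls) <= max_urls


# Made-up sample sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/product/1</loc></url>
</urlset>"""

if __name__ == "__main__":
    urls = sitemap_urls(SAMPLE)
    print(len(urls), "URLs; crawl:", should_crawl(urls))
```

Applied to a real domain, `xml_text` would be the downloaded content of its sitemap.xml file; the download step is left out so that the sketch stays self-contained.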

Results

The main findings of our analysis are reported in the paper. Due to space restrictions, however, Table 1 in the paper reports only a subset of the full comparison between sitemaps and Common Crawl. The complete overview is given here; it covers the 61 domains of the reduced dataset whose sitemaps were crawled:

Domain | Common Crawl: URLs | Common Crawl: Data | Sitemaps: URLs | Sitemaps: Data
1 www.abloomaboveflorist.com 1 1 424 409
2 www.acuderm.com 24 22 794 661
3 www.antik-zentrum-alling.de 1 1 1218 1163
4 www.askariel.com 53 14 107 43
5 www.bellyarmor.com 12 1 40 8
6 www.bestpet.co.uk 10 5 3363 2640
7 www.bezlimitno.ru 4 1 442 19
8 www.buynzuri.com 7 4 99 53
9 www.chapmanpartnership.org 6 1 166 11
10 www.da-x.de 19 13 4895 3547
11 www.dairydoo.com 2 1 71 25
12 www.doorsandspecialties.com 1 1 637 609
13 www.dresses.ie 17 14 156 142
14 www.eimass.co.uk 2 1 372 321
15 www.eyepiecesetc.com 33 17 367 333
16 www.fdmbenzinpriser.dk 106 103 2244 2084
17 www.flightcentre.co.za 101 1 798 24
18 www.flowershopcarmichael.com 1 1 391 377
19 www.fmeextensions.com 214 139 157 95
20 www.gadgetbox.gr 10 1 1328 1228
21 www.hamiltonparkhotel.com 9996 7 63 21
22 www.handhmusic.co.uk 4 2 117 87
23 www.hypnoticworld.com 761 165 2022 560
24 www.kahri.com 2 1 351 246
25 www.karben.com 68 21 465 412
26 www.kidssocks.com 72 12 295 117
27 www.kitdecomotoshop.fr 4 2 1375 1165
28 www.lautsprechertest.de 6 5 426 222
29 www.merin.pl 6 3 2936 2324
30 www.mgc.fr 9 1 3692 3200
31 www.migolondrina.com 2 1 120 0
32 www.missgirliegirl.com 2 1 495 377
33 www.missionmitsubishi.com 6 2 1007 0
34 www.motorsport-event-company.com 3 1 415 10
35 www.nded.fr 33 3 1072 931
36 www.newcenturyhobbies.com 2 1 75 45
37 www.nieuweruitenwissers.nl 9 6 5476 0
38 www.northridgeflowers.net 1 1 428 412
39 www.nudampen.nl 1 1 456 435
40 www.pennypackflowers.com 1 1 397 382
41 www.polar.com 6891 470 7285 3091
42 www.promoprizewheels.com 16 16 81 60
43 www.radioassistance.com 3 1 102 74
44 www.royalpizza.ru 3 1 418 0
45 www.rsberkeley.com 5 2 397 324
46 www.rue-des-puzzles.com 57 42 7265 7265
47 www.saltandpepperpotshop.co.uk 3 2 467 451
48 www.security-wizard.net 2227 126 1807 672
49 www.shapewearaustralia.com 1 1 22 5
50 www.shpuntik.com.ua 4 2 2998 663
51 www.solo24.pl 2 1 1241 1217
52 www.spectergear.com 19 15 259 219
53 www.tao-distribution.com 5 4 790 743
54 www.tauchshoponline.ch 11 10 3268 3168
55 www.tenspros.com 18 11 366 311
56 www.thecraft-studio.com 5 1 4647 1270
57 www.therealthingonline.co.za 2 2 47 24
58 www.waterbedbargains.com 27 21 315 299
59 www.wedoboxes.co.uk 2 1 66 50
60 www.windberflorist.com 1 1 431 416
61 www.xpradlo.sk 1 1 1433 0
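
To make the comparison in the table above easier to read, the following Python sketch computes, for a few rows copied from the table, the ratio between the Common Crawl "Data" count and the sitemap "Data" count. It rests on assumptions: the "Data" columns are read as counts of pages with structured product offer data, and the ratio is only a crude indicator, since the Common Crawl URLs of a domain need not be a subset of its sitemap URLs. The name `data_ratio` is hypothetical and does not come from the paper.

```python
# (domain, cc_urls, cc_data, sitemap_urls, sitemap_data), copied from the table.
ROWS = [
    ("www.acuderm.com", 24, 22, 794, 661),
    ("www.bestpet.co.uk", 10, 5, 3363, 2640),
    ("www.polar.com", 6891, 470, 7285, 3091),
    ("www.xpradlo.sk", 1, 1, 1433, 0),
]


def data_ratio(cc_data, sitemap_data):
    """Return cc_data / sitemap_data, or None when the sitemap count is 0."""
    if sitemap_data == 0:
        return None
    return cc_data / sitemap_data


if __name__ == "__main__":
    for domain, _cc_urls, cc_data, _sm_urls, sm_data in ROWS:
        ratio = data_ratio(cc_data, sm_data)
        shown = "n/a" if ratio is None else f"{ratio:.1%}"
        print(f"{domain}: {shown}")
```

Rows with a sitemap "Data" count of 0 are reported as "n/a" rather than forced into a ratio.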

References

Stolz, Alex; Hepp, Martin: Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce, in: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015), CEUR Workshop Proceedings Vol. 1426, ISSN 1613-0073, October 12, 2015, Bethlehem, PA, USA.

@inproceedings{StolzHepp:COLD2015,
address = {Bethlehem, Pennsylvania, USA},
title = {Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce},
booktitle = {Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015)},
year = 2015,
author = {Stolz, Alex and Hepp, Martin},
url = {http://www.heppnetz.de/files/commoncrawl-cold2015.pdf},
crossref = {COLD2015},
}