This Web page presents the datasets used in the scientific paper "Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce" (Alex Stolz and Martin Hepp), published at the Sixth International Workshop on Consuming Linked Data (COLD 2015).
We started with a random sample of 100 Web sites drawn from the Web Data Commons (WDC) dataset and restricted to the www namespace (see the paper for details). The sampled sites are listed below:
www.abloomaboveflorist.com www.acuderm.com www.antik-zentrum-alling.de www.askariel.com www.beburn.com www.bellyarmor.com www.bestpet.co.uk www.bettymills.com www.bezlimitno.ru www.biascosmeticos.com.br www.brasilpostos.com.br www.buybuybaby.com www.buynzuri.com www.centralepneus.fr www.chapmanpartnership.org www.costumehut.com.au www.ctcshop.org.uk www.da-x.de www.dairydoo.com www.deelishables.com www.descargargratis.com www.district17.com www.dogabisiklet.com www.doorsandspecialties.com www.dresses.ie www.edwardarms.com www.eimass.co.uk www.epocaimobiliaria.com.br www.eyepiecesetc.com www.fdmbenzinpriser.dk www.flightcentre.co.za www.flowershopcarmichael.com www.fmeextensions.com www.gadgetbox.gr www.gameof3halves.com www.gymgrossisten.com www.hamiltonparkhotel.com www.handhmusic.co.uk www.heryerehali.com www.hypnoticworld.com www.imoveissaojose.com www.kahri.com www.karben.com www.kidssocks.com www.kitdecomotoshop.fr www.lautsprechertest.de www.lease-termination.com www.lindsborgcity.org www.merin.pl www.mgc.fr www.migolondrina.com www.millerrecyclingmanteca.com www.missgirliegirl.com www.missionmitsubishi.com www.missmary.se www.motorsport-event-company.com www.musiccenter.com.pl www.nanninember.it www.nded.fr www.newcenturyhobbies.com www.nieuweruitenwissers.nl www.northridgeflowers.net www.nudampen.nl www.onebrooklynfurniture.com www.pennypackflowers.com www.philips.co.in www.polar.com www.prokerala.com www.promoprizewheels.com www.radioassistance.com www.recambiosparatractores.com www.royalpizza.ru www.rsberkeley.com www.rue-des-puzzles.com www.saltandpepperpotshop.co.uk www.security-wizard.net www.shapewearaustralia.com www.shpuntik.com.ua www.solo24.pl www.sonomaharvestfoods.com www.spectergear.com www.speedousa.com www.suplments.com www.tao-distribution.com www.tauchshoponline.ch www.tenspros.com www.testeri-bg.com www.thecraft-studio.com www.therealthingonline.co.za www.tunelab-pro.software.informer.com www.tutumommies.com www.vansantdistributing.com 
www.vgomes.com.br www.vitamingrocer.com www.waterbedbargains.com www.wedoboxes.co.uk www.windberflorist.com www.xpradlo.sk www.zapatosparatodos.es www.zbrane-bazar.cz
After excluding all domains without a sitemap.xml file, 68 domains remained for our analysis; they are outlined in the following table. For every domain in the list, we provide links to the metadata collected during the sitemap crawl, the raw sources of the sitemap.xml files, the Web pages with structured product offers detected for this domain, and the corresponding data from the Common Crawl (CC) / Web Data Commons (WDC) corpus.
The main findings of our analysis are reported in the paper. Due to space restrictions, however, Table 1 in the paper reports only a subset of the full comparison between sitemaps and Common Crawl. Here, we provide the complete overview:
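As an illustration of what the raw sitemap.xml sources contain, the following minimal sketch extracts the URLs listed in a sitemap document. It is not the crawler used for the paper: the function name `extract_sitemap_urls` and the sample document are hypothetical, and only the XML namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> List[str]:
    """Return all <loc> values found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Hypothetical sample; not taken from any of the domains listed on this page.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/product/1</loc></url>
  <url><loc>http://www.example.com/product/2</loc></url>
</urlset>"""

urls = extract_sitemap_urls(SAMPLE_SITEMAP)
print(len(urls), "URLs listed in the sitemap")
```

For a sitemap index, the same function returns the locations of the nested sitemap files, which would then have to be fetched and parsed in turn.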
# | Domain | Common Crawl URLs | Common Crawl Data | Sitemaps URLs | Sitemaps Data |
---|---|---|---|---|---|
1 | www.abloomaboveflorist.com | 1 | 1 | 424 | 409 |
2 | www.acuderm.com | 24 | 22 | 794 | 661 |
3 | www.antik-zentrum-alling.de | 1 | 1 | 1218 | 1163 |
4 | www.askariel.com | 53 | 14 | 107 | 43 |
5 | www.bellyarmor.com | 12 | 1 | 40 | 8 |
6 | www.bestpet.co.uk | 10 | 5 | 3363 | 2640 |
7 | www.bezlimitno.ru | 4 | 1 | 442 | 19 |
8 | www.buynzuri.com | 7 | 4 | 99 | 53 |
9 | www.chapmanpartnership.org | 6 | 1 | 166 | 11 |
10 | www.da-x.de | 19 | 13 | 4895 | 3547 |
11 | www.dairydoo.com | 2 | 1 | 71 | 25 |
12 | www.doorsandspecialties.com | 1 | 1 | 637 | 609 |
13 | www.dresses.ie | 17 | 14 | 156 | 142 |
14 | www.eimass.co.uk | 2 | 1 | 372 | 321 |
15 | www.eyepiecesetc.com | 33 | 17 | 367 | 333 |
16 | www.fdmbenzinpriser.dk | 106 | 103 | 2244 | 2084 |
17 | www.flightcentre.co.za | 101 | 1 | 798 | 24 |
18 | www.flowershopcarmichael.com | 1 | 1 | 391 | 377 |
19 | www.fmeextensions.com | 214 | 139 | 157 | 95 |
20 | www.gadgetbox.gr | 10 | 1 | 1328 | 1228 |
21 | www.hamiltonparkhotel.com | 9996 | 7 | 63 | 21 |
22 | www.handhmusic.co.uk | 4 | 2 | 117 | 87 |
23 | www.hypnoticworld.com | 761 | 165 | 2022 | 560 |
24 | www.kahri.com | 2 | 1 | 351 | 246 |
25 | www.karben.com | 68 | 21 | 465 | 412 |
26 | www.kidssocks.com | 72 | 12 | 295 | 117 |
27 | www.kitdecomotoshop.fr | 4 | 2 | 1375 | 1165 |
28 | www.lautsprechertest.de | 6 | 5 | 426 | 222 |
29 | www.merin.pl | 6 | 3 | 2936 | 2324 |
30 | www.mgc.fr | 9 | 1 | 3692 | 3200 |
31 | www.migolondrina.com | 2 | 1 | 120 | 0 |
32 | www.missgirliegirl.com | 2 | 1 | 495 | 377 |
33 | www.missionmitsubishi.com | 6 | 2 | 1007 | 0 |
34 | www.motorsport-event-company.com | 3 | 1 | 415 | 10 |
35 | www.nded.fr | 33 | 3 | 1072 | 931 |
36 | www.newcenturyhobbies.com | 2 | 1 | 75 | 45 |
37 | www.nieuweruitenwissers.nl | 9 | 6 | 5476 | 0 |
38 | www.northridgeflowers.net | 1 | 1 | 428 | 412 |
39 | www.nudampen.nl | 1 | 1 | 456 | 435 |
40 | www.pennypackflowers.com | 1 | 1 | 397 | 382 |
41 | www.polar.com | 6891 | 470 | 7285 | 3091 |
42 | www.promoprizewheels.com | 16 | 16 | 81 | 60 |
43 | www.radioassistance.com | 3 | 1 | 102 | 74 |
44 | www.royalpizza.ru | 3 | 1 | 418 | 0 |
45 | www.rsberkeley.com | 5 | 2 | 397 | 324 |
46 | www.rue-des-puzzles.com | 57 | 42 | 7265 | 7265 |
47 | www.saltandpepperpotshop.co.uk | 3 | 2 | 467 | 451 |
48 | www.security-wizard.net | 2227 | 126 | 1807 | 672 |
49 | www.shapewearaustralia.com | 1 | 1 | 22 | 5 |
50 | www.shpuntik.com.ua | 4 | 2 | 2998 | 663 |
51 | www.solo24.pl | 2 | 1 | 1241 | 1217 |
52 | www.spectergear.com | 19 | 15 | 259 | 219 |
53 | www.tao-distribution.com | 5 | 4 | 790 | 743 |
54 | www.tauchshoponline.ch | 11 | 10 | 3268 | 3168 |
55 | www.tenspros.com | 18 | 11 | 366 | 311 |
56 | www.thecraft-studio.com | 5 | 1 | 4647 | 1270 |
57 | www.therealthingonline.co.za | 2 | 2 | 47 | 24 |
58 | www.waterbedbargains.com | 27 | 21 | 315 | 299 |
59 | www.wedoboxes.co.uk | 2 | 1 | 66 | 50 |
60 | www.windberflorist.com | 1 | 1 | 431 | 416 |
61 | www.xpradlo.sk | 1 | 1 | 1433 | 0 |
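
The figures in the table can be compared per domain with a few lines of code. The sketch below is only an illustration, assuming that each "Data" column counts the URLs on which structured data was found. The helper `data_share` is hypothetical, and the four rows are copied from the table above.

```python
# (domain, CC URLs, CC URLs with data, sitemap URLs, sitemap URLs with data)
# Rows copied from the table above.
ROWS = [
    ("www.bestpet.co.uk", 10, 5, 3363, 2640),
    ("www.fmeextensions.com", 214, 139, 157, 95),
    ("www.hamiltonparkhotel.com", 9996, 7, 63, 21),
    ("www.rue-des-puzzles.com", 57, 42, 7265, 7265),
]


def data_share(data_urls: int, urls: int) -> float:
    """Fraction of URLs that carry data; 0.0 if no URLs were collected."""
    return data_urls / urls if urls else 0.0


for domain, cc_urls, cc_data, sm_urls, sm_data in ROWS:
    print(
        f"{domain}: "
        f"Common Crawl {data_share(cc_data, cc_urls):.1%} of {cc_urls} URLs, "
        f"sitemap {data_share(sm_data, sm_urls):.1%} of {sm_urls} URLs"
    )
```

The shares describe each source on its own; they do not show whether the URLs found in Common Crawl also appear in the sitemap.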
Stolz, Alex; Hepp, Martin: Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce, in: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015), CEUR Workshop Proceedings Vol. 1426, ISSN 1613-0073, October 12, 2015, Bethlehem, PA, USA.
@inproceedings{StolzHepp:COLD2015,
  address = {Bethlehem, Pennsylvania, USA},
  title = {Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce},
  booktitle = {Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015)},
  year = 2015,
  author = {Stolz, Alex and Hepp, Martin},
  url = {http://www.heppnetz.de/files/commoncrawl-cold2015.pdf},
  crossref = {COLD2015},
}