PCS2OWL Evaluation Website

This page was set up as part of the evaluation of a scientific paper entitled "PCS2OWL: An Approach for Deriving Web Ontologies from Product Classification Systems". It presents the source code and queries used to show the conceptual correctness of the ontology conversion from the Google product taxonomy, one of the product classification systems currently supported by the PCS2OWL tool. To learn more about the project and the other product classification systems that have been converted so far, please refer to the project landing page at

http://www.ebusiness-unibw.org/ontologies/pcs2owl/

Reverse-Engineering the Google Product Taxonomy

One step in the evaluation of our paper was to reverse-engineer the original Google product taxonomy file from the OWL ontology file generated by the PCS2OWL converter.

The product ontology has been made available via a SPARQL endpoint.
The Google product taxonomy is a plain-text file of the form shown below. It is available in full from Google or from here.

Animals & Pet Supplies
Animals & Pet Supplies > Live Animals
Animals & Pet Supplies > Pet Supplies
Animals & Pet Supplies > Pet Supplies > Bird Supplies
Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Cages & Stands
Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Food
Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Ladders & Perches
Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Toys
Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Treats
Animals & Pet Supplies > Pet Supplies > Cat Supplies
...
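
As a small illustration of this file format (a sketch in Python 3 that is not part of the evaluation code; the helper name split_taxonomy_line is our own), each line can be split on the " > " delimiter into the path that leads from a root category down to the category itself:

```python
def split_taxonomy_line(line):
    """Split one taxonomy line into its list of category labels."""
    return [label.strip() for label in line.split(" > ")]


sample = "Animals & Pet Supplies > Pet Supplies > Bird Supplies > Bird Food"
path = split_taxonomy_line(sample)

print(path)       # ['Animals & Pet Supplies', 'Pet Supplies', 'Bird Supplies', 'Bird Food']
print(len(path))  # 4, i.e. the depth of the category in the taxonomy
```

The last element of the path is the category's own label; all preceding elements are its ancestors.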

The live SPARQL query examples given below are executed against a SPARQL endpoint that contains the product ontology derived from the Google product taxonomy, stored under the graph name "urn:google". The queries use the owl: and rdfs: prefixes without declaring them, so they rely on the endpoint predefining these prefixes. We first give a step-by-step example that presents the necessary source code snippets together with the corresponding SPARQL queries; afterwards we show the complete source code that we used for our evaluation.

This document is split into three sections:

Step-by-step example
Results
Full example

Step-by-Step Example

In this section we describe the details of our approach to reverse-engineering the Google product taxonomy. As you will see shortly, the individual lines are built up from the hierarchy in the RDF graph using SPARQL queries, and the results are concatenated using the same "right angle bracket" delimiter (">") that is used in the source file (see the file contents outlined above).

Step 1

First, we define constants for the endpoint URI, the product classification standard (google), the language of the source file (en), and the graph identifier (urn:google).

# Step 1: Prepare endpoint and set up variables

ep_uri = "<endpoint_uri>" # placeholder for endpoint URI
pcs = "google"
language = "en"

graph = "urn:" + pcs

Step 2

Query 1 fetches all taxonomic classes from the Google taxonomy. Without the LIMIT clause, the query would return 5,508 such classes in total. On the live page, the text area for the SPARQL query is editable.

# Query 1: Select -tax classes (limited to the first 10 results)

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/google/>

SELECT DISTINCT ?c
WHERE {
    GRAPH <urn:google> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?code . # only -tax classes have a hierarchy code
    }
}
LIMIT 10
OFFSET 0

The source code for retrieving all classes is shown in the following snippet. It first prepares the SPARQL query, then sets up the endpoint, executes the query on that endpoint, and finally processes the results, which are returned in JSON format.

# Step 2: Query all taxonomic class URIs using query 1

query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT DISTINCT ?c
WHERE {
    GRAPH <%(graph)s> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?code .
    }
}
""" % {"graph": graph, "pcs": pcs}

ep = endpoint.Endpoint(ep_uri)
results = ep.query(query_string)
header = results["head"]["vars"]
result = results["results"]["bindings"]
result_list = []

uris = [uri["c"]["value"] for uri in result]
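
The result handling above follows the layout of the W3C SPARQL Query Results JSON Format ("head"/"vars" and "results"/"bindings"). The following Python 3 sketch shows that layout on a hand-written sample; the two example.org URIs and the helper name column_values are illustrative assumptions, not actual endpoint output or part of the evaluation code.

```python
# Hand-written sample in the SPARQL JSON results layout (illustrative URIs).
results = {
    "head": {"vars": ["c"]},
    "results": {
        "bindings": [
            {"c": {"type": "uri", "value": "http://example.org/C_A-tax"}},
            {"c": {"type": "uri", "value": "http://example.org/C_B-tax"}},
        ]
    },
}


def column_values(results, var):
    """Return the values bound to `var`, skipping rows where it is unbound."""
    return [
        binding[var]["value"]
        for binding in results["results"]["bindings"]
        if var in binding
    ]


uris = column_values(results, "c")
print(uris)  # ['http://example.org/C_A-tax', 'http://example.org/C_B-tax']
```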

Step 3

In step 3 we read in the source file of the Google product taxonomy and split it into a list of lines. These lines are used later for comparison, to assess the quality of the conversion. In addition, a print statement outputs the number of URIs returned by the SPARQL query and the number of lines in the file. The two numbers must be equal; otherwise the conversion is incorrect.

# Step 3: Read in contents of Google product taxonomy file and split by lines

lines = []
f = codecs.open("taxonomy.en-US.txt", mode="r", encoding="utf-8")
lines = f.read().splitlines()
f.close()

print len(uris), len(lines)
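
As an alternative way to write this step, the sketch below (Python 3 syntax; the helper name read_lines and the temporary sample file are our own assumptions) reads the file inside a with block, so that the file is closed even if reading fails:

```python
import io
import os
import tempfile


def read_lines(path):
    """Read a UTF-8 text file and return its lines without line endings."""
    with io.open(path, mode="r", encoding="utf-8") as f:
        return f.read().splitlines()


# Write a two-line sample file so that the example is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "taxonomy-sample.txt")
with io.open(sample_path, mode="w", encoding="utf-8") as f:
    f.write(u"Animals & Pet Supplies\nAnimals & Pet Supplies > Live Animals\n")

lines = read_lines(sample_path)
print(len(lines))  # 2
```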

Step 4

Query 2 selects all root nodes in the RDF graph, i.e. classes that have no parent class in the taxonomy.

# Query 2: Select all root nodes in the RDF graph

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/google/>

SELECT ?c_code ?c_label
WHERE {
    GRAPH <urn:google> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?c_code .
        ?c rdfs:label ?c_label .
        FILTER NOT EXISTS {
            ?c rdfs:subClassOf ?sc .
            ?ont a owl:Ontology .
            ?sc rdfs:isDefinedBy ?ont .
        }
    }
}

The execution of the above SPARQL query compiles a list of all root nodes in the RDF graph. The results are populated into the result list (which was empty so far), as outlined in the next code listing.

# Step 4: Select all classes that do not have any parent classes / superclasses, i.e. root nodes in the RDF graph
    
query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT ?c_code ?c_label
WHERE {
    GRAPH <%(graph)s> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?c_code .
        ?c rdfs:label ?c_label .
        FILTER NOT EXISTS {
            ?c rdfs:subClassOf ?sc .
            ?ont a owl:Ontology .
            ?sc rdfs:isDefinedBy ?ont .
        }
    }
}
""" % {"graph": graph, "pcs": pcs}

results = ep.query(query_string)
header = results["head"]["vars"]
result = results["results"]["bindings"]
result_list = result
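
The condition behind query 2 can be illustrated without an endpoint. The Python 3 sketch below uses a toy child-to-parent mapping (hand-written for illustration, with labels taken from the taxonomy) and selects the entries that have no parent, which is what the FILTER NOT EXISTS clause expresses for classes in the RDF graph:

```python
# Toy hierarchy: label -> parent label (None means "no parent in the taxonomy").
parent_of = {
    "Hardware": None,
    "Tools": "Hardware",
    "Masonry Tools": "Tools",
    "Floats": "Masonry Tools",
    "Furniture": None,
    "Tables": "Furniture",
}

# Root nodes are exactly the entries without a parent.
roots = sorted(label for label, parent in parent_of.items() if parent is None)
print(roots)  # ['Furniture', 'Hardware']
```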

Step 5

The third SPARQL query retrieves the label and hierarchy code of a given class and of its parent class.

# Query 3: Select label and hierarchy code of -tax class and its superclass in the taxonomy
# Class URI = "http://www.ebusiness-unibw.org/ontologies/pcs2owl/google/C_Soy_Milk-tax"

PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/google/>

SELECT DISTINCT ?c_code ?c_label ?sc_code ?sc_label
WHERE {
    GRAPH <urn:google> {
        ?c rdfs:subClassOf ?sc.
        ?c pcs:hierarchyCode ?c_code.
        ?sc pcs:hierarchyCode ?sc_code.
        ?c rdfs:label ?c_label. FILTER (?c = <http://www.ebusiness-unibw.org/ontologies/pcs2owl/google/C_Soy_Milk-tax> && lang(?c_label) = "en")
        ?sc rdfs:label ?sc_label. FILTER(lang(?sc_label) = "en")
    }
}

Query 3 is applied in a loop over all URIs obtained from query 1. For each URI, it is repeated for increasing hierarchical depths using the SPARQL 1.1 property path feature: rdfs:subClassOf{1} (= rdfs:subClassOf), rdfs:subClassOf{2}, etc., until no further superclass is found. This builds up the complete path from every class to its root node and thus covers the whole taxonomy. At the same time, a list of keys keeps track of the maximum depth of the taxonomy; it is needed later to iterate over the nodes of each path.

# Step 5: Loop over every class URI and get its superclasses at each level by executing query 3, store result dictionaries in result list and maintain a list of keys

keys = ["c_label"]

sample_num = 1
for uri in uris:
    result = {}
    
    level = 1
    while True:
        query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT DISTINCT ?c_code ?c_label ?sc%(level)s_code ?sc%(level)s_label
WHERE {
    GRAPH <%(graph)s> {
        ?c rdfs:subClassOf{%(level)s} ?sc%(level)s.
        ?c pcs:hierarchyCode ?c_code.
        ?sc%(level)s pcs:hierarchyCode ?sc%(level)s_code.
        ?c rdfs:label ?c_label. FILTER (?c = <%(uri)s> && lang(?c_label) = "%(lang)s")
        ?sc%(level)s rdfs:label ?sc%(level)s_label. FILTER(lang(?sc%(level)s_label) = "%(lang)s")
    }
}
""" % {"graph": graph, "pcs": pcs, "uri": uri, "level": level, "lang": language}

        results_json = ep.query(query_string)
        header_json = results_json["head"]["vars"]
        result_json = results_json["results"]["bindings"]
        if len(result_json) == 0:
            break
        result = dict(result.items() + result_json[0].items())
        if "sc%d_label" % level not in keys:
            keys.append("sc%d_label" % level)
        level += 1
        
    if sample_num % 100 == 0:
        print sample_num
    
    if result:
        result_list.append(result)
    sample_num += 1

If we printed the aggregated result of executing query 3 for each URI returned by query 1, as shown above, we would obtain something similar to the following table (a more extensive, yet incomplete table based on a random sample is available from here).

	sc4_label	sc3_label	sc2_label	sc1_label	c_label
1.		Apparel & Accessories	Clothing	Activewear	Boxing Shorts
2.		Hardware	Tools	Masonry Tools	Floats
3.			Hardware	Hardware Accessories	Lubricants
4.		Food, Beverages & Tobacco	Food Items	Prepared Foods	Sushi
5.		Electronics	Circuit Components	Semiconductors	Transistors
6.	Home & Garden	Kitchen & Dining	Kitchen Tools & Utensils	Scoops	Ice Cream Scoops
7.			Hardware	Electrical Supplies	Wall Plates
8.	Vehicles & Parts	Vehicle Parts & Accessories	Motor Vehicle Care	Vehicle Fluids	Brake Fluid
9.			Sporting Goods	Exercise & Fitness	Weightlifting Belts
10.	Vehicles & Parts	Vehicle Parts & Accessories	Motor Vehicle Parts	Motor Vehicle Exhaust	Catalytic Converters
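
How one row of such a table comes about can be sketched without an endpoint. The Python 3 example below replaces the repeated SPARQL queries with lookups in a toy child-to-parent mapping (an assumption made for illustration); the function name ancestor_row is ours. It walks up the hierarchy level by level and stores each ancestor under a key named like the query variables (sc1_label, sc2_label, ...):

```python
# Toy hierarchy: label -> parent label (None means "no parent in the taxonomy").
parent_of = {
    "Hardware": None,
    "Tools": "Hardware",
    "Masonry Tools": "Tools",
    "Floats": "Masonry Tools",
}


def ancestor_row(label, parent_of):
    """Collect a class label and its ancestors, keyed by hierarchy level."""
    row = {"c_label": label}
    level = 1
    current = parent_of.get(label)
    while current is not None:  # stop as soon as no ancestor exists at this level
        row["sc%d_label" % level] = current
        current = parent_of.get(current)
        level += 1
    return row


row = ancestor_row("Floats", parent_of)
print(row)
# {'c_label': 'Floats', 'sc1_label': 'Masonry Tools', 'sc2_label': 'Tools', 'sc3_label': 'Hardware'}
```

Unlike the evaluation code, which skips classes without any superclass because the root nodes were already collected in step 4, this sketch also returns a row for a root node.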

Step 6

Next, we iterate row- and column-wise over a table like the one above, concatenating the labels of the class nodes and separating them by a "right angle bracket" (">"), as in the original Google product taxonomy file. Before concatenation, the "Taxonomy Concept" suffix found in the labels of the ontology classes is removed. In addition, a check step looks up every constructed line in the list of lines obtained from the original file. One counter is incremented for every matching line and another for every non-matching line.

# Step 6: Iterate over result list. Iterate over each item using the keys list, concatenate using ">" delimiter and check if resulting string exists as a line in the original file

keys.reverse()
    
output = ""
first_key = None
yes = 0
no = 0
for result_item in result_list:
    sparql_line = ""
    for key in keys:
        if key in result_item:
            if not first_key:
                first_key = key
            sparql_line += "%s" % result_item[key]["value"].replace(" [Taxonomy Concept: Anything that may be an instance of this category in any context]", "").replace(" (Taxonomy Concept: Anything that may be an instance of this category in any context)", "")
            if key != 'c_label':
                sparql_line += " > "
                
    if sparql_line in lines:
        output += "[YES]\t"+sparql_line+"\n"
        yes += 1
    else:
        output += "[NO]\t"+sparql_line+"\n"
        no += 1

Each check step is logged. A "[YES]" in front of a line indicates that a matching line was found in the list of lines read from the original file. Similarly, "[NO]" indicates that no matching line was found.
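
The core of this step can be condensed into two small functions. The following Python 3 sketch is a restatement under our own naming (clean_label, build_line, original_lines), not the code used in the evaluation; the sample row is hand-written in the shape of a SPARQL JSON binding.

```python
TAXONOMY_SUFFIXES = (
    " [Taxonomy Concept: Anything that may be an instance of this category in any context]",
    " (Taxonomy Concept: Anything that may be an instance of this category in any context)",
)


def clean_label(label):
    """Remove the taxonomy-concept suffix from an ontology label, if present."""
    for suffix in TAXONOMY_SUFFIXES:
        label = label.replace(suffix, "")
    return label


def build_line(row, keys):
    """Join the labels present in `row`, from the top-most ancestor down to the class."""
    return " > ".join(clean_label(row[key]["value"]) for key in keys if key in row)


# Keys ordered from the deepest ancestor level to the class itself.
keys = ["sc3_label", "sc2_label", "sc1_label", "c_label"]
row = {
    "sc2_label": {"value": "Hardware" + TAXONOMY_SUFFIXES[0]},
    "sc1_label": {"value": "Countertops" + TAXONOMY_SUFFIXES[0]},
    "c_label": {"value": "Stone Countertops" + TAXONOMY_SUFFIXES[0]},
}
original_lines = {"Hardware > Countertops > Stone Countertops"}

line = build_line(row, keys)
print(line)                    # Hardware > Countertops > Stone Countertops
print(line in original_lines)  # True
```

Holding the original lines in a set rather than a list keeps each membership test at roughly constant cost, which can matter when several thousand lines are checked.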

Step 7

The final step prepends a short summary to the logged output and writes the result to a file.

# Step 7: Write a summary and the log to a file

output = """
==============
Short Summary:
==============

No. items match?
--------------
%s
--------------

Consistent?
--------------
YES:\t%6.d
--------------
NO:\t%6.d
--------------

""" % (len(uris) == len(lines), yes, no) + output

f = codecs.open("gpt-consistency-check.txt", mode="w", encoding="utf-8")
f.write(output)
f.close()

Results

The snippet below shows the output written to the result file. The first part is a short summary stating that the number of URIs equals the number of lines in the original file. It also reports the values of the YES/NO counters: 5,508 checks succeeded and none failed (the NO counter is zero, which the output shows as an empty field). The detailed list of check steps follows the summary.

==============
Short Summary:
==============

No. items match?
--------------
True
--------------

Consistent?
--------------
YES:	  5508
--------------
NO:	      
--------------

[YES]	Home & Garden
[YES]	Baby & Toddler
[YES]	Toys & Games
[YES]	Religious & Ceremonial
[YES]	Vehicles & Parts
[YES]	Cameras & Optics
[YES]	Apparel & Accessories
[YES]	Mature
[YES]	Food, Beverages & Tobacco
[YES]	Health & Beauty
[YES]	Furniture
[YES]	Hardware
[YES]	Office Supplies
[YES]	Electronics
[YES]	Luggage & Bags
[YES]	Animals & Pet Supplies
[YES]	Media
[YES]	Arts & Entertainment
[YES]	Software
[YES]	Business & Industrial
[YES]	Sporting Goods
[YES]	Electronics > Audio > Audio Accessories > Satellite Radio Accessories
[YES]	Hardware > Countertops > Stone Countertops
[YES]	Sporting Goods > Water Sports > Boating > Rowing > Rowing Seat Pads
[YES]	Sporting Goods > Exercise & Fitness > Exercise Balls
[YES]	Hardware > Tools > Measuring Tools & Sensors > Distance Meters
[YES]	Furniture > Tables > Sewing Machine Tables
[YES]	Sporting Goods > Combat Sports > Fencing > Fencing Protective Gear
...

The complete file with all the results can be downloaded from here.
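
To double-check the counters in such a log independently of the script that produced it, the entries can simply be recounted. The Python 3 sketch below does this for a short hand-written sample; the "[NO]" entry in the sample is invented for illustration and does not occur in the actual results, and the function name count_tags is our own.

```python
from collections import Counter

# Hand-written sample log; the [NO] entry is made up for illustration.
sample_log = (
    "[YES]\tHome & Garden\n"
    "[YES]\tFurniture > Tables > Sewing Machine Tables\n"
    "[NO]\tMade-up Category > Made-up Subcategory\n"
)


def count_tags(text):
    """Count the lines of a consistency-check log that start with [YES] or [NO]."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("[YES]\t"):
            counts["yes"] += 1
        elif line.startswith("[NO]\t"):
            counts["no"] += 1
    return counts


counts = count_tags(sample_log)
print(counts["yes"], counts["no"])  # 2 1
```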

Full Example

Below is the full code example that combines the individual steps from above. The code is written for Python 2 and relies on some Python helper modules that, for the sake of simplicity, we do not discuss here:

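import codecs

import endpoint  # assumed import of the helper module that provides Endpoint (not discussed here)

# Step 1: Prepare endpoint and set up variables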

ep_uri = "<endpoint_uri>"
pcs = "google"
language = "en"

graph = "urn:" + pcs


# Step 2: Query all taxonomic class URIs using query 1

query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT DISTINCT ?c
WHERE {
    GRAPH <%(graph)s> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?code .
    }
}
""" % {"graph": graph, "pcs": pcs}

ep = endpoint.Endpoint(ep_uri)
results = ep.query(query_string)
header = results["head"]["vars"]
result = results["results"]["bindings"]
result_list = []

uris = [uri["c"]["value"] for uri in result]


# Step 3: Read in contents of Google product taxonomy file and split by lines

lines = []
f = codecs.open("taxonomy.en-US.txt", mode="r", encoding="utf-8")
lines = f.read().splitlines()
f.close()

print len(uris), len(lines)


# Step 4: Select all classes that do not have any parent classes / superclasses, i.e. root nodes in the RDF graph
    
query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT ?c_code ?c_label
WHERE {
    GRAPH <%(graph)s> {
        ?c a owl:Class .
        ?c pcs:hierarchyCode ?c_code .
        ?c rdfs:label ?c_label .
        FILTER NOT EXISTS {
            ?c rdfs:subClassOf ?sc .
            ?ont a owl:Ontology .
            ?sc rdfs:isDefinedBy ?ont .
        }
    }
}
""" % {"graph": graph, "pcs": pcs}

results = ep.query(query_string)
header = results["head"]["vars"]
result = results["results"]["bindings"]
result_list = result


# Step 5: Loop over every class URI and get its superclasses at each level by executing query 3, store result dictionaries in result list and maintain a list of keys

keys = ["c_label"]

sample_num = 1
for uri in uris:
    result = {}
    
    level = 1
    while True:
        query_string = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX pcs: <http://www.ebusiness-unibw.org/ontologies/pcs2owl/%(pcs)s/>

SELECT DISTINCT ?c_code ?c_label ?sc%(level)s_code ?sc%(level)s_label
WHERE {
    GRAPH <%(graph)s> {
        ?c rdfs:subClassOf{%(level)s} ?sc%(level)s.
        ?c pcs:hierarchyCode ?c_code.
        ?sc%(level)s pcs:hierarchyCode ?sc%(level)s_code.
        ?c rdfs:label ?c_label. FILTER (?c = <%(uri)s> && lang(?c_label) = "%(lang)s")
        ?sc%(level)s rdfs:label ?sc%(level)s_label. FILTER(lang(?sc%(level)s_label) = "%(lang)s")
    }
}
""" % {"graph": graph, "pcs": pcs, "uri": uri, "level": level, "lang": language}

        results_json = ep.query(query_string)
        header_json = results_json["head"]["vars"]
        result_json = results_json["results"]["bindings"]
        if len(result_json) == 0:
            break
        result = dict(result.items() + result_json[0].items())
        if "sc%d_label" % level not in keys:
            keys.append("sc%d_label" % level)
        level += 1
        
    if sample_num % 100 == 0:
        print sample_num

    if result:
        result_list.append(result)
    sample_num += 1
    

# Step 6: Iterate over result list. Iterate over each item using the keys list, concatenate using ">" delimiter and check if resulting string exists as a line in the original file

keys.reverse()
    
output = ""
first_key = None
yes = 0
no = 0
for result_item in result_list:
    sparql_line = ""
    for key in keys:
        if key in result_item:
            if not first_key:
                first_key = key
            sparql_line += "%s" % result_item[key]["value"].replace(" [Taxonomy Concept: Anything that may be an instance of this category in any context]", "").replace(" (Taxonomy Concept: Anything that may be an instance of this category in any context)", "")
            if key != 'c_label':
                sparql_line += " > "
                
    if sparql_line in lines:
        output += "[YES]\t"+sparql_line+"\n"
        yes += 1
    else:
        output += "[NO]\t"+sparql_line+"\n"
        no += 1


# Step 7: Write a summary and the log to a file

output = """
==============
Short Summary:
==============

No. items match?
--------------
%s
--------------

Consistent?
--------------
YES:\t%6.d
--------------
NO:\t%6.d
--------------

""" % (len(uris) == len(lines), yes, no) + output

f = codecs.open("gpt-consistency-check.txt", mode="w", encoding="utf-8")
f.write(output)
f.close()