Tuesday, April 18, 2017

XML Parsing failed with LPX-00261: invalid URL


As promised in my last post, I'll address the parsing error you get when you attempt to convert the text returned by the NCBI efetch utilities into an XMLType. The Oracle XML parser within the database treats the document type declaration <!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN" "https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd"> as an instruction to validate the XML against the DTD located at https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd. Note that this is a DOCTYPE declaration, not an XML comment, which is why the parser acts on it instead of ignoring it.

In many, maybe even most, cases this is fine. If you are generating an XML payload, you may want to validate it before sending it out to avoid the embarrassment of sending an invalid message. If you are retrieving it from the source, though, odds are good it will be valid, and validating it against the DTD may just be extra overhead. It's also possible that the XML parser doesn't understand the DTD. At least one version of the database has a bug related to exactly how that declaration is formatted. In my case there are at least two issues related to validating against the DTD, and I have no need for it. Below are the query and the associated error.

select xmltype(
'<?xml version="1.0" ?>
<!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN" "https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
<TaxaSet><Taxon>
    <TaxId>33208</TaxId>
    <ScientificName>Metazoa</ScientificName>
    <OtherNames>
        <GenbankCommonName>metazoans</GenbankCommonName>
        <BlastName>animals; animals</BlastName>
        <Synonym>Animalia</Synonym>
        <CommonName>multicellular animals</CommonName>
    </OtherNames>
    <ParentTaxId>33154</ParentTaxId>
    <Rank>kingdom</Rank>
    <Division>Invertebrates</Division>
    <GeneticCode>
        <GCId>1</GCId>
        <GCName>Standard</GCName>
    </GeneticCode>
    <MitoGeneticCode>
        <MGCId>1</MGCId>
        <MGCName>Standard</MGCName>
    </MitoGeneticCode>
    <Lineage>cellular organisms; Eukaryota; Opisthokonta</Lineage>
    <LineageEx>
        <Taxon>
            <TaxId>131567</TaxId>
            <ScientificName>cellular organisms</ScientificName>
            <Rank>no rank</Rank>
        </Taxon>
        <Taxon>
            <TaxId>2759</TaxId>
            <ScientificName>Eukaryota</ScientificName>
            <Rank>superkingdom</Rank>
        </Taxon>
        <Taxon>
            <TaxId>33154</TaxId>
            <ScientificName>Opisthokonta</ScientificName>
            <Rank>no rank</Rank>
        </Taxon>
    </LineageEx>
    <CreateDate>1995/02/27 09:24:00</CreateDate>
    <UpdateDate>2017/02/16 16:52:33</UpdateDate>
    <PubDate>1992/05/26 01:00:00</PubDate>
</Taxon>
</TaxaSet>') the_xml
  from dual;


ORA-31011: XML parsing failed 
ORA-19213: error occurred in XML processing at lines 2 
LPX-00261: invalid URL https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd

Fortunately, you can instruct the database not to validate. The statement

alter session set events='31156 trace name context forever, level 2'

will instruct the database to skip the validation. You can then use xmlTable, xmlQuery, etc. without receiving an error.
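
As a rough sketch of what that looks like in practice, the statements below set the event and then pull a few values out of a trimmed-down copy of the sample document with xmlTable. The column names (tax_id, scientific_name, taxon_rank) and their sizes are my own choices for illustration, not anything defined by NCBI or Oracle, and the XML is shortened from the example above.

```sql
-- Set the event for this session first (statement taken from the post).
alter session set events = '31156 trace name context forever, level 2';

-- Shred the top-level Taxon element into relational columns.
-- Column names and sizes are illustrative assumptions.
select x.tax_id,
       x.scientific_name,
       x.taxon_rank
  from xmltable('/TaxaSet/Taxon'
         passing xmltype(
'<?xml version="1.0" ?>
<!DOCTYPE TaxaSet PUBLIC "-//NLM//DTD Taxon, 14th January 2002//EN" "https://www.ncbi.nlm.nih.gov/entrez/query/DTD/taxon.dtd">
<TaxaSet><Taxon>
    <TaxId>33208</TaxId>
    <ScientificName>Metazoa</ScientificName>
    <ParentTaxId>33154</ParentTaxId>
    <Rank>kingdom</Rank>
</Taxon>
</TaxaSet>')
         columns tax_id          number        path 'TaxId',
                 scientific_name varchar2(200) path 'ScientificName',
                 taxon_rank      varchar2(50)  path 'Rank') x;
```

The event is set at the session level, so it has to be issued again in each new session before parsing XML that carries the DOCTYPE declaration.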

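Note: you may need to grant the ALTER SESSION privilege directly to the user in order for this to work, particularly if you issue the alter session command through execute immediate within a package. Privileges granted only through a role are not in effect inside definer's-rights PL/SQL, so a role-based grant is not enough there.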
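
Here is a minimal sketch of the execute immediate approach mentioned in the note. The user name ncbi_user and the procedure name skip_dtd_validation are hypothetical placeholders; substitute your own. The grant has to be run by a suitably privileged account such as a DBA.

```sql
-- Run as a privileged user. ncbi_user is a hypothetical schema name.
grant alter session to ncbi_user;

-- Run as ncbi_user. The procedure name is a hypothetical example.
create or replace procedure skip_dtd_validation as
begin
  -- Single quotes inside the statement are doubled to escape them.
  execute immediate
    'alter session set events ''31156 trace name context forever, level 2''';
end skip_dtd_validation;
/

-- Call it once per session, before any XMLType conversion.
begin
  skip_dtd_validation;
end;
/
```

Wrapping the statement in a procedure keeps the event string in one place, so code that parses the efetch output only needs to call the procedure first.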



