Parsing XML with lxml in Django – Multiple Namespaces and XPath

A few days ago, I was trying to figure out how to parse XML with multiple namespaces and extract information from it using XPath in Django. I came across lxml, which handles both well. The view does not need csrf_exempt: it only serves GET requests, which Django's CSRF protection does not check. I apply it anyway for consistency with the rest of my code.
I am using the Primo Webservices basic search as an example, but you may not be able to open the URL, as it is protected. This may not be the best way to do it, so if you can think of improvements, please let me know.
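
To make the namespace handling concrete, here is a minimal sketch run against a made-up document. The namespace URIs, the sample XML, and the `extract_records` helper are assumptions for illustration and not the real Primo response; only the element names are borrowed from the XPath expressions used in the view below.

```python
from lxml import etree

# Hypothetical namespace URIs, chosen only for this sketch.
SEARCH_NS = "http://example.com/search"
RECORD_NS = "http://example.com/record"

# Made-up sample: prefixed "sear" elements wrapping records that live
# in a default (unprefixed) namespace.
SAMPLE = b"""<sear:SEGMENTS xmlns:sear="http://example.com/search">
  <sear:JAGROOT>
    <sear:RESULT>
      <sear:DOCSET TOTALHITS="2">
        <sear:DOC>
          <PrimoNMBib xmlns="http://example.com/record">
            <record>
              <control><recordid>rec-1</recordid></control>
              <display><title>First title</title></display>
            </record>
          </PrimoNMBib>
        </sear:DOC>
        <sear:DOC>
          <PrimoNMBib xmlns="http://example.com/record">
            <record>
              <control><recordid>rec-2</recordid></control>
              <display><title>Second title</title></display>
            </record>
          </PrimoNMBib>
        </sear:DOC>
      </sear:DOCSET>
    </sear:RESULT>
  </sear:JAGROOT>
</sear:SEGMENTS>"""

# The default namespace gets an explicit prefix ("def") for use in XPath.
NAMESPACES = {"sear": SEARCH_NS, "def": RECORD_NS}


def extract_records(xml_bytes):
    """Return one dict per record found in the sample structure."""
    root = etree.fromstring(xml_bytes)
    docset = root.xpath("//sear:DOCSET", namespaces=NAMESPACES)[0]
    total_hits = docset.get("TOTALHITS")
    records = docset.xpath("sear:DOC/def:PrimoNMBib/def:record",
                           namespaces=NAMESPACES)
    results = []
    for record in records:
        results.append({
            "totalhits": total_hits,
            # Prefix form, resolved through the namespaces mapping
            "record_id": record.findtext("def:control/def:recordid",
                                         namespaces=NAMESPACES),
            # Equivalent {uri}tag form, as used in the view
            "title": record.findtext(
                "{%s}display/{%s}title" % (RECORD_NS, RECORD_NS)),
        })
    return results
```

Mapping the default namespace to a prefix such as `def` is needed because XPath 1.0 treats unprefixed names as belonging to no namespace, so `//record` alone would match nothing here.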

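import json
from urllib.parse import quote
from urllib.request import urlopen

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from lxml import etree


@csrf_exempt  # not required for a GET-only view; kept for consistency
def brief_search(request):
    errors = []
    list_data = []  # defined up front so the final return always has a value
    if request.method == 'GET':
        searchTerms = request.GET.get('query')
        bulkSize = request.GET.get('pageSize')
        indx = request.GET.get('start')
        if indx:
            # Convert the start offset to the index sent to the search service
            indx = int(indx) + 1
            DEFAULT_NS = ''
            # Percent-encode the search terms so spaces and punctuation survive in the URL
            query = 'any,contains,' + quote(searchTerms)
            url = '' + query + '&indx=' + str(indx) + '&bulkSize=' + bulkSize
            # urlopen returns a file-like object that etree.parse can read directly
            content = urlopen(url)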
            xml = etree.parse(content)
            # 'def' is an explicit prefix for the records' default namespace;
            # XPath cannot match a default namespace without one
            namespaces = {'sear': '', 'def': DEFAULT_NS}
            docset = xml.getroot().xpath('//sear:SEGMENTS/sear:JAGROOT/sear:RESULT/sear:DOCSET', namespaces=namespaces)
            totalhits = docset[0].get("TOTALHITS")
            docs = xml.getroot().xpath('//sear:SEGMENTS/sear:JAGROOT/sear:RESULT/sear:DOCSET/sear:DOC/def:PrimoNMBib/def:record', namespaces=namespaces)
            list_data = []
            for doc in docs:
                data = {}
                data['totalhits'] = totalhits
                data['record_id'] = doc.findtext('{%s}control/{%s}recordid' %(DEFAULT_NS, DEFAULT_NS))
                data['frbr_type'] = doc.findtext('{%s}facets/{%s}frbrtype' %(DEFAULT_NS, DEFAULT_NS))
                data['frbr_groupid'] = doc.findtext('{%s}facets/{%s}frbrgroupid' %(DEFAULT_NS, DEFAULT_NS))
                data['isbn'] = doc.findtext('{%s}addata/{%s}isbn' %(DEFAULT_NS, DEFAULT_NS))
                data['title'] = doc.findtext('{%s}display/{%s}title' %(DEFAULT_NS, DEFAULT_NS))
                creator = doc.findtext('{%s}display/{%s}creator' %(DEFAULT_NS, DEFAULT_NS))
                contributor = doc.findtext('{%s}display/{%s}contributor' %(DEFAULT_NS, DEFAULT_NS))
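                if creator:
                    data['author'] = creator
                elif contributor:
                    # Fall back to the contributor when no creator is given
                    data['author'] = contributor
                else:
                    data['author'] = 'No author specified'
                data['type'] = doc.findtext('{%s}display/{%s}type' %(DEFAULT_NS, DEFAULT_NS))
                list_data.append(data)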
    return HttpResponse(json.dumps(list_data), content_type="application/json")
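
The request URL in the view is assembled by string concatenation. As an alternative, here is a minimal sketch that builds it with `urlencode`, so the search terms are escaped automatically. `BASE_URL` and the `build_search_url` helper are hypothetical, and the `query` parameter name is an assumption, since the real Primo URL is not shown above; `indx` and `bulkSize` come from the view.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real Primo URL is not given in this post.
BASE_URL = "https://primo.example.org/search"


def build_search_url(search_terms, start, page_size, base_url=BASE_URL):
    """Build the search URL from the view's three GET parameters."""
    params = {
        # Assumed parameter name; the value format matches the view
        "query": "any,contains," + search_terms,
        # Same +1 shift the view applies to the start offset
        "indx": int(start) + 1,
        "bulkSize": page_size,
    }
    # safe="," keeps the commas literal, as the concatenated URL does
    return base_url + "?" + urlencode(params, safe=",")
```

Keeping the URL construction in one small function also makes it easy to check separately from the network call.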
