Parsing XML with lxml in Django – Multiple Namespaces and XPath

A few days ago, I was trying to figure out how to parse XML with multiple namespaces and extract information from it with XPath in Django. I came across lxml, which I think is really good. You don't have to mark this view as csrf_exempt: Django's CSRF protection only checks unsafe methods such as POST, and this view only handles GET. I am doing it for consistency with the rest of my code.

I am using the Primo Webservices basic search as an example here, but you may not be able to open the URL yourself, as it is protected. This may also not be the best way to do it, so if you can think of improvements, please let me know.
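Since the service cannot be queried without access, the namespace handling can be tried on a small inline document first. The sketch below is only an illustration: the namespace URIs (`http://example.org/search`, `http://example.org/record`) and the sample values are made up, and the element names simply mirror the ones used in this post. It falls back to the standard library's ElementTree when lxml is not installed, which is why it sticks to `find`, `findall` and `findtext`; lxml's `xpath()` accepts the same kind of `namespaces` mapping.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls below

# Made-up namespace URIs, for illustration only.
SEAR_NS = 'http://example.org/search'
DEFAULT_NS = 'http://example.org/record'

SAMPLE = (
    '<sear:SEGMENTS xmlns:sear="%s"><sear:JAGROOT><sear:RESULT>'
    '<sear:DOCSET TOTALHITS="2">'
    '<sear:DOC><PrimoNMBib xmlns="%s"><record>'
    '<display><title>First title</title></display>'
    '</record></PrimoNMBib></sear:DOC>'
    '<sear:DOC><PrimoNMBib xmlns="%s"><record>'
    '<display><title>Second title</title></display>'
    '</record></PrimoNMBib></sear:DOC>'
    '</sear:DOCSET></sear:RESULT></sear:JAGROOT></sear:SEGMENTS>'
) % (SEAR_NS, DEFAULT_NS, DEFAULT_NS)

# The default namespace has no prefix in the document, so the query
# has to give it one ('def') in the prefix map.
NSMAP = {'sear': SEAR_NS, 'def': DEFAULT_NS}

root = etree.fromstring(SAMPLE)  # root is the sear:SEGMENTS element
docset = root.find('sear:JAGROOT/sear:RESULT/sear:DOCSET', namespaces=NSMAP)
records = docset.findall('sear:DOC/def:PrimoNMBib/def:record', namespaces=NSMAP)
titles = [r.findtext('def:display/def:title', namespaces=NSMAP) for r in records]

print(docset.get('TOTALHITS'), titles)
```

The point to take away is that the prefixes in the query only need to match the keys of the mapping, not the prefixes used in the document itself.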

import json
from urllib.parse import quote
from urllib.request import urlopen

from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt
from lxml import etree


@csrf_exempt  # not needed for a GET-only view; kept for consistency
def brief_search(request):
    list_data = []  # defined up front so the final return always has a value
    if request.method == 'GET':
        searchTerms = request.GET.get('query', '')
        bulkSize = request.GET.get('pageSize', '')
        indx = request.GET.get('start')
        if indx:
            indx = int(indx) + 1
            DEFAULT_NS = ''
            # Prefix map shared by both XPath queries; 'def' stands in for the default namespace
            namespaces = {'sear': '', 'def': DEFAULT_NS}
            # URL-encode the user-supplied search terms before building the request
            query = 'any,contains,' + quote(searchTerms)
            url = '' + query + '&indx=' + str(indx) + '&bulkSize=' + bulkSize
            content = urlopen(url)
            xml = etree.parse(content)
            docset = xml.getroot().xpath('//sear:SEGMENTS/sear:JAGROOT/sear:RESULT/sear:DOCSET', namespaces=namespaces)
            totalhits = docset[0].get("TOTALHITS")
            docs = xml.getroot().xpath('//sear:SEGMENTS/sear:JAGROOT/sear:RESULT/sear:DOCSET/sear:DOC/def:PrimoNMBib/def:record', namespaces=namespaces)
            for doc in docs:
                data = {}
                data['totalhits'] = totalhits
                # findtext takes Clark notation: {namespace-uri}tag for every path step
                data['record_id'] = doc.findtext('{%s}control/{%s}recordid' % (DEFAULT_NS, DEFAULT_NS))
                data['frbr_type'] = doc.findtext('{%s}facets/{%s}frbrtype' % (DEFAULT_NS, DEFAULT_NS))
                data['frbr_groupid'] = doc.findtext('{%s}facets/{%s}frbrgroupid' % (DEFAULT_NS, DEFAULT_NS))
                data['isbn'] = doc.findtext('{%s}addata/{%s}isbn' % (DEFAULT_NS, DEFAULT_NS))
                data['title'] = doc.findtext('{%s}display/{%s}title' % (DEFAULT_NS, DEFAULT_NS))
                creator = doc.findtext('{%s}display/{%s}creator' % (DEFAULT_NS, DEFAULT_NS))
                contributor = doc.findtext('{%s}display/{%s}contributor' % (DEFAULT_NS, DEFAULT_NS))
                if creator:
                    data['author'] = creator
                elif contributor:
                    # fall back to the contributor when no creator is present
                    data['author'] = contributor
                else:
                    data['author'] = 'No author specified'
                data['type'] = doc.findtext('{%s}display/{%s}type' % (DEFAULT_NS, DEFAULT_NS))
                list_data.append(data)
    return HttpResponse(json.dumps(list_data), content_type="application/json")
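
The repeated `'{%s}a/{%s}b' % (DEFAULT_NS, DEFAULT_NS)` patterns in the view could be folded into a small helper. The following is one possible sketch, not part of the view itself: `ns_path` and `pick_author` are hypothetical names, the namespace URI and the sample record are made up, and the contributor fallback is an assumption rather than something the service requires. As a convenience it falls back to the standard library's ElementTree if lxml is not installed.

```python
try:
    from lxml import etree  # the library used in this post
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls below

DEFAULT_NS = 'http://example.org/record'  # made-up URI, for illustration only


def ns_path(ns, *tags):
    """Build a path such as '{ns}display/{ns}title' from plain tag names."""
    return '/'.join('{%s}%s' % (ns, tag) for tag in tags)


def pick_author(record, ns=DEFAULT_NS):
    """Return the creator, else the contributor, else a placeholder string."""
    creator = record.findtext(ns_path(ns, 'display', 'creator'))
    contributor = record.findtext(ns_path(ns, 'display', 'contributor'))
    return creator or contributor or 'No author specified'


# A made-up record with a contributor but no creator.
record = etree.fromstring(
    '<record xmlns="%s"><display>'
    '<title>Sample title</title><contributor>A. Editor</contributor>'
    '</display></record>' % DEFAULT_NS
)

print(ns_path(DEFAULT_NS, 'display', 'title'))
print(record.findtext(ns_path(DEFAULT_NS, 'display', 'title')))
print(pick_author(record))
```

With a helper like this, each field in the loop becomes a single short call, and the namespace only has to be changed in one place.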

Blog of Masud Khokhar
