Indivo Transforms

Introduction

For the pipeline to be functional, data must be transformed from its original format into processed medical facts ready to be stored in the database. Each schema in Indivo therefore defines a transform that can be applied to any document that validates against the schema.

Transform Outputs

The ultimate output of the transformation step in the data pipeline is a set of Fact objects ready for storage in the database. However, technologies like XSLT are incapable of producing python objects as output. We looked around for a simple, standard way of modeling data that would meet our needs, and came up empty (though we’re open to suggestions if you think you have the silver bullet). As a result, we’ve created our own language, Indivo Simple Data Modeling Language (SDML), to both define our data models and represent documents (in XML or JSON) that match them.

Thus, transforms may output data in any of the following formats:

Outputs are validated on a per-datamodel-basis. For data model definitions and example outputs, see Indivo Data Models.

Types of Transforms

Indivo currently accepts Transforms in two formats:

  • XSLT documents
  • Python classes

This may change as Indivo begins to accept data in more, varied formats.

XSLTs

We won’t cover XSLTs in any detail here, as their format and use is clearly outlined in the specification. Since XSLT is traditionally used to transform XML to XML, the most natural output format for XSLTs is SDMX.

Python

For those unskilled in the arts of XSLT, we also allow transforms to be defined using python. To define a transform, simply subclass indivo.document_processing.BaseTransform and define a valid transformation method. Valid methods are:

BaseTransform.to_facts(doc_etree)

Transform an etree into a list of Indivo Fact objects.

Subclasses should implement this method, which takes an lxml.etree._ElementTree (the result of calling etree.parse()), and returns a list of indivo.models.Fact subclasses.

As an example, here’s how you might implement this for a transform from our Simple Clinical Note Schema to our Simple Clinical Note Data Model:

from indivo.document_processing import BaseTransform
from indivo.models import SimpleClinicalNote
from lxml import etree

NS = "http://indivo.org/vocab/xml/documents#"

class Transform(BaseTransform):

    def to_facts(self, doc_etree):
        args = self._get_data(doc_etree)

        # Create the fact and return it
        # Note: This method must return a list
        return [SimpleClinicalNote(**args)]

    def _tag(self, tagname):
        return "{%s}%s"%(NS, tagname)

    def _get_data(self, doc_etree):
        """ Parse the etree and return a dict of key-value pairs for object construction. """
        ret = {}
        _t = self._tag

        # Get the date_of_visit
        ret['date_of_visit'] = doc_etree.findtext(_t('dateOfVisit'))

        # Get the date finalized
        ret['finalized_at'] = doc_etree.findtext(_t('finalizedAt'))

        # Get the visit_type
        visit_type_node = doc_etree.find(_t('visitType'))
        ret['visit_type'] = visit_type_node.text
        ret['visit_type_type'] = visit_type_node.get('type')
        ret['visit_type_value'] = visit_type_node.get('value')
        ret['visit_type_abbrev'] = visit_type_node.get('abbrev')

        # Get the visit location
        ret['visit_location'] = doc_etree.findtext(_t('visitLocation'))

        # Get the specialty of the clinician
        specialty_node = doc_etree.find(_t('specialty'))
        ret['specialty'] = specialty_node.text
        ret['specialty_type'] = specialty_node.get('type')
        ret['specialty_value'] = specialty_node.get('value')
        ret['specialty_abbrev'] = specialty_node.get('abbrev')

        # get the signature
        signature_node = doc_etree.find(_t('signature'))
        provider_node = signature_node.find(_t('provider'))
        ret['signed_at'] = signature_node.findtext(_t('at'))
        ret['provider_name'] = provider_node.findtext(_t('name'))
        ret['provider_institution'] = provider_node.findtext(_t('institution'))

        # get the chief complaint
        ret['chief_complaint'] = doc_etree.findtext(_t('chiefComplaint'))

        # get the content of the note
        ret['content'] = doc_etree.findtext(_t('content'))

        return ret
BaseTransform.to_sdmj(doc_etree)

Transform an etree into a string of valid Simple Data Model JSON.

Subclasses should implement this method, which takes an lxml.etree._ElementTree (the result of calling etree.parse()), and returns a string in valid SDMJ format.

As an example, here’s how you might implement this for a transform from our Simple Clinical Note Schema to our Simple Clinical Note Data Model. Note the reuse of the _get_data() function from above:

from indivo.document_processing import BaseTransform

SDMJ_TEMPLATE = '''
{
    "__modelname__": "SimpleClinicalNote",
    "date_of_visit": "%(date_of_visit)s",
    "finalized_at": "%(finalized_at)s",
    "visit_type": "%(visit_type)s",
    "visit_type_type": "%(visit_type_type)s",
    "visit_type_value": "%(visit_type_value)s",
    "visit_type_abbrev": "%(visit_type_abbrev)s",
    "visit_location": "%(visit_location)s",
    "specialty": "%(specialty)s",
    "specialty_type": "%(specialty_type)s",
    "specialty_value": "%(specialty_value)s",
    "specialty_abbrev": "%(specialty_abbrev)s",
    "signed_at": "%(signed_at)s",
    "provider_name": "%(provider_name)s",
    "provider_institution": "%(provider_institution)s
    "chief_complaint": "%(chief_complaint)s",
    "content": "%(content)s"
}
'''

class Transform(BaseTransform):

    def to_sdmj(self, doc_etree):
        args = self._get_data(doc_etree)
        return SDMJ_TEMPLATE%args
BaseTransform.to_sdmx(doc_etree)

Transform an etree into a string of valid Simple Data Model XML.

Subclasses should implement this method, which takes an lxml.etree._ElementTree (the result of calling etree.parse()), and returns another lxml.etree._ElementTree instance representing an XML document in valid SDMX format.

As an example, here’s how you might implement this for a transform from our Simple Clinical Note Schema to our Simple Clinical Note Data Model. Note the reuse of the _get_data() function from above:

from indivo.document_processing import BaseTransform
from lxml import etree
from StringIO import StringIO

SDMX_TEMPLATE = '''
<Models>
  <Model name="SimpleClinicalNote">
    <Field name="date_of_visit">%(date_of_visit)s</Field>
    <Field name="finalized_at">%(finalized_at)s</Field>
    <Field name="visit_type">%(visit_type)s</Field>
    <Field name="visit_type_type">%(visit_type_type)s</Field>
    <Field name="visit_type_value">%(visit_type_value)s</Field>
    <Field name="visit_type_abbrev">%(visit_type_abbrev)s</Field>
    <Field name="visit_location">%(visit_location)s</Field>
    <Field name="specialty">%(specialty)s</Field>
    <Field name="specialty_type">%(specialty_type)s</Field>
    <Field name="specialty_value">%(specialty_value)s</Field>
    <Field name="specialty_abbrev">%(specialty_abbrev)s</Field>
    <Field name="signed_at">%(signed_at)s</Field>
    <Field name="provider_name">%(provider_name)s</Field>
    <Field name="provider_institution">%(provider_institution)s</Field>
    <Field name="chief_complaint">%(chief_complaint)s</Field>
    <Field name="content">%(content)s</Field>
  </Model>
</Models>
'''

class Transform(BaseTransform):

    def to_sdmx(self, doc_etree):
        args = self._get_data(doc_etree)
        return etree.parse(StringIO(SDMX_TEMPLATE%args))

Adding Custom Transforms to Indivo

Associating a new transform with an Indivo-supported schema is simple:

  • Write your transform, as either an XSLT or a Python module, as described above.
  • Drop the file containing your transform (transform.xslt or transform.py: make sure to name the file ‘transform’) into the directory containing the schema. See Adding Custom Schemas to Indivo for more details.
  • Make sure to restart Indivo after moving transform files around, or the changes won’t take effect.