XML 2008 liveblog: Automating Content Analysis with Trang and Simple XSLT Scripts

Bob DuCharme, Innodata Isogen

Content analysis: why? You’ve “inherited” content. Need to save time or effort.

Handy tool 1: “sort”. As in the Unix command-line tool. (Even Windows has one.)

Handy tool 2: “uniq -c” (flag -c means include counts)
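
Chained together, these two tools give a quick frequency table. A minimal sketch, not something shown in the talk: the helper name tally and the sample values are made up for illustration.

```shell
# tally is a hypothetical helper name, not a standard command.
# sort groups identical lines together, uniq -c collapses each group and
# prefixes it with its count, and sort -rn lists the most frequent first.
tally() {
  sort "$@" | uniq -c | sort -rn
}

# Made-up sample input, one value per line (for example, element names).
printf '%s\n' para title para note para title | tally
```

Here para is counted three times, title twice, and note once. The width of the count column differs between uniq implementations.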

Elsevier contest: interface for reading journals. Download a bunch of articles, and see what’s all in there.

Handy tool 3: Trang. A schema language converter, but it can also infer a schema from one or more input documents. Concatenate all sample documents under one root and infer; this gives a list of all doctypes in use.

trang article.dtd article.rng
trang issueContents.xml issueContents.rng
saxon article.rng compareElsRNG.xsl | sort > compareElsRNG.out

compareElsRNG.xsl has text mode output, ignores input text nodes, and checks whether the RNG has references to each element, outputting “Yes: elementname” or “No: elementname”. (which gets sorted in step 3)

Helps ferret out places where the schema says 40 different child elements are possible but in practice only 4 are used.
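
The stylesheet itself was not shown in these notes. As a rough stand-in for the same check, here is a sketch in plain shell, assuming both .rng files use the RELAX NG XML syntax with declarations written as <element name="...">. The function names are hypothetical, and the matching is textual rather than a real XML parse.

```shell
# List the element names declared in a RELAX NG file (XML syntax).
# Assumption: declarations look like <element name="foo">.
element_names() {
  grep -o '<element name="[^"]*"' "$1" | sed 's/.*name="//; s/"$//' | sort -u
}

# Hypothetical stand-in for the compareElsRNG.xsl step: for every element
# the schema ($1) declares, report whether the schema inferred from real
# content ($2) declares it too.
compare_elements() {
  used_names=$(mktemp)
  element_names "$2" > "$used_names"
  element_names "$1" | while read -r name; do
    # -x matches the whole line, -F treats the name as a literal string.
    if grep -qxF -- "$name" "$used_names"; then
      echo "Yes: $name"
    else
      echo "No: $name"
    fi
  done | sort
  rm -f "$used_names"
}
```

Usage would be along the lines of: compare_elements article.rng issueContents.rng > compareElsRNG.out. The "No:" lines sort first, so the unused elements end up at the top of the report.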

Handy tool 4: James Clark’s sx, converts SGML to XML.

Another stylesheet counts elements, producing a histogram. [Ed. I would do this in XQuery in CQ.] Again, this can help prioritize which parts of the XML to use first. Similar logic works for parent/child counts, for where @id gets used, and for finding all values of a particular attribute.
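
For a quick approximation of the element histogram without a stylesheet, the sort and uniq -c tools from earlier are enough. This is a sketch, not the speaker's method: the function name is hypothetical, and because it matches start tags as text it will miscount anything that looks like a tag inside comments or CDATA sections.

```shell
# element_histogram is a hypothetical helper name.
# grep -o prints every start tag or empty-element tag opening ("<name");
# end tags, comments, and processing instructions do not match because
# the character after "<" must be a letter or underscore.
element_histogram() {
  cat "$@" \
    | grep -o '<[A-Za-z_][^[:space:]/>]*' \
    | tr -d '<' \
    | sort | uniq -c | sort -rn
}
```

Running element_histogram over a directory of sample articles (for example, element_histogram *.xml) lists the most heavily used elements first.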

Another stylesheet goes through multiple converted-to-RNG schemas, looking for common substructure. Lists generated this way can be pulled into a stylesheet.

Analyze an SGML DTD? dtd2html -> tidy -> XSLT. Clients like reports (especially spreadsheets). This is more like Lego bricks.

-m
