XML 2008 liveblog: Automating Content Analysis with Trang and Simple XSLT Scripts
Bob DuCharme, Innodata Isogen
Content analysis: why? You’ve “inherited” content. Need to save time or effort.
Handy tool 1: “sort”, the Unix command-line tool. (Even Windows has one.)
Handy tool 2: “uniq -c” (flag -c means include counts)
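A minimal sketch of how the two tools combine, assuming one item per line on standard input. The helper name `tally` and the sample element names are made up for illustration, not taken from the talk.

```shell
# tally: hypothetical helper that counts identical input lines.
# uniq -c only merges adjacent duplicates, so the input is sorted first;
# the final sort -rn puts the most frequent line at the top.
tally() {
  sort | uniq -c | sort -rn
}

# Sample data: six made-up element names, one per line.
printf '%s\n' para title para emph para title | tally
# "para" comes out first with a count of 3, then "title" (2), then "emph" (1).
```

The width of the count column differs between implementations, so anything that parses the output should split on whitespace rather than on fixed columns.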
Elsevier contest: interface for reading journals. Download a bunch of articles, and see what’s all in there.
Handy tool 3: Trang. Schema language converter, but it can also infer a schema from one or more input documents. Concatenate all sample documents under one root and infer; this gives a list of all doctypes in use.
trang article.dtd article.rng
trang issueContents.xml issueContents.rng
saxon article.rng compareElsRNG.xsl | sort > compareElsRNG.out
compareElsRNG.xsl produces text output, ignores input text nodes, and checks whether the RNG has references to each element, outputting “Yes: elementname” or “No: elementname” (which gets sorted in step 3).
Helps ferret out places where the schema says 40 different child elements are possible but in practice only 4 are used.
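The talk did this comparison in XSLT. As a rough stand-in using only command-line tools, the sketch below lists elements that a schema declares but the sample documents never use. It assumes the RNG is in XML syntax with unprefixed `<element name="...">` declarations. The function names are hypothetical, and the tag matching is a crude text match, not a real parse.

```shell
# Element names declared in a RELAX NG (XML syntax) schema, one per line.
# Assumes unprefixed <element name="..."> with name as the first attribute.
declared_elements() {
  grep -o '<element name="[^"]*"' "$1" |
    sed 's/.*name="//; s/"$//' | LC_ALL=C sort -u
}

# Element names that occur as start tags in the given sample documents.
# Rough text match: tag-like strings in comments or CDATA are counted too.
used_elements() {
  cat "$@" | grep -o '<[A-Za-z_][A-Za-z0-9_.:-]*' |
    tr -d '<' | LC_ALL=C sort -u
}

# Declared but never used: names that appear only in the first list.
unused_elements() {
  schema=$1; shift
  d=$(mktemp); u=$(mktemp)
  declared_elements "$schema" > "$d"
  used_elements "$@" > "$u"
  LC_ALL=C comm -23 "$d" "$u"
  rm -f "$d" "$u"
}
```

Hypothetical usage with the file names from the steps above: `unused_elements article.rng issueContents.xml`.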
Handy tool 4: James Clark’s sx, which converts SGML to XML.
Another stylesheet counts elements, producing a histogram. [Ed. I would do this in XQuery in CQ.] Again, this can help prioritize which parts of the XML to use first. Similar logic works for parent/child counts, for where @id gets used, and for finding all values of a particular attribute.
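The stylesheets themselves were not shown in these notes. As a rough command-line approximation of the same counts, here is a sketch with hypothetical function names. It matches tags and attributes as text rather than parsing the XML, so it assumes double-quoted attribute values and will miscount tag-like text inside comments or CDATA.

```shell
# Histogram of element names across the given XML files, most frequent first.
# The pattern skips end tags, comments, and processing instructions because
# it requires a name character right after "<".
element_histogram() {
  cat "$@" |
    grep -o '<[A-Za-z_][A-Za-z0-9_.:-]*' |
    tr -d '<' |
    sort | uniq -c | sort -rn
}

# Histogram of the values of one attribute, e.g. attribute_values type *.xml
# Only sees attributes preceded by whitespace on the same line.
attribute_values() {
  attr=$1; shift
  cat "$@" |
    grep -o "[[:space:]]$attr=\"[^\"]*\"" |
    sed 's/^[^"]*"//; s/"$//' |
    sort | uniq -c | sort -rn
}
```

For anything beyond a quick look, a real XML parser (XSLT, as in the talk, or XQuery) is the safer choice.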
Another stylesheet goes through multiple converted-to-rng schemas, looking for common substructure. Lists generated this way can be pulled into a stylesheet.
Analyze an SGML DTD? dtd2html -> tidy -> XSLT. Clients like reports (especially spreadsheets). This is more like Lego bricks.
-m