

Re: [xsl] Reflecting on: csv data to xml

2013-07-01 08:25:06
This may be of some interest to this thread: at The National Archives we
do a lot of CSV to XML processing using a minimally modified version
of Andrew Welch's XSLT. However, once we have the data in XML we
need to extract and process it further, so we need to be certain of
the original CSV format (which in turn lets us be certain
of the resultant XML format, amongst other concerns). To achieve this
we have built an open source CSV Validation tool.

The CSV Validation tool consists of a specification for a simple text
grammar that describes the format of a CSV file and the rules that are
asserted against that file. It also includes an implementation for
the JVM (in Scala; we also provide a Java API) which takes such a
grammar and a CSV file and performs the validation, reporting either a
pass or every validation failure. The tool is available here. It should
be considered beta, i.e. we are using it internally but until now it has
not been publicised. In addition, documentation is missing, but the EBNF
file in the source repo describes the grammar, and running the tool
without arguments gives you the simple command line usage. I hope
documentation will follow shortly; in the meantime, issues etc. should be
aimed at the Github
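
To illustrate the general idea of asserting rules against a CSV file and
reporting either a pass or every failure, here is a minimal Python sketch.
It is not the grammar or the API of the CSV Validation tool: the RULES
table, the column names "id" and "title", and the validate() function are
all hypothetical and exist only for this example.

```python
import csv
import io
import re

# Hypothetical rules: one predicate per expected column, in column order.
# This is NOT the grammar of the CSV Validation tool, only an illustration.
RULES = {
    "id": lambda v: re.fullmatch(r"[0-9]+", v) is not None,
    "title": lambda v: v != "",
}

def validate(text, rules):
    """Return a list of error messages; an empty list means the file passes."""
    errors = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["empty file"]
    if header != list(rules):
        # Without a matching header the per-column rules cannot be applied.
        return [f"line 1: expected header {list(rules)}, got {header}"]
    for row in reader:
        line = reader.line_num  # physical lines read so far
        if len(row) != len(header):
            errors.append(
                f"line {line}: expected {len(header)} cells, got {len(row)}")
            continue
        for name, value in zip(header, row):
            if not rules[name](value):
                errors.append(
                    f"line {line}: column {name!r} fails its rule: {value!r}")
    return errors
```

Calling validate("id,title\n1,Foo\n", RULES) returns an empty list, while a
row such as "x," produces one message per failing cell rather than stopping
at the first problem.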

On 30 June 2013 10:49, Wolfgang Laun <wolfgang(_dot_)laun(_at_)gmail(_dot_)com> wrote:
The thread "csv data to xml" was triggered by a relatively simple
problem: converting CSV data to XML. There were one or two voices
advocating the use of Perl (or similar) "for this kind of problem" in
preference to XSLT, and there were claims that it would be a simple
matter to use XSLT's analyze-string... Now I'm not going to vote
either way - I'd just like to post some observations I made while
investigating this. If you are impatient, skip down to "conclusion".

I decided to implement this in Perl and was hoping to be able to
compare this with an equivalent implementation in XSLT, concentrating
on ease of development and maintainability. Ken's implementation
<> filled the XSLT slot.

I had a quick Perl 5 filter solution up and running in 30 minutes, no
program parameters, hard-coded names for document and row elements,
but using the first CSV line for obtaining the names for the cells.
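
For readers who want to see the shape of such a filter, here is a minimal
sketch in Python rather than Perl. It is not the script described above:
the function name csv_to_xml and the default element names "doc" and "row"
are assumptions made for this example. It relies only on the standard
csv and xml.etree.ElementTree modules.

```python
import csv
import io
import xml.etree.ElementTree as ET

def csv_to_xml(text, doc_name="doc", row_name="row"):
    """Convert CSV text to an XML string; the first line names the cells."""
    reader = csv.reader(io.StringIO(text))
    names = next(reader)  # header line supplies the cell element names
    root = ET.Element(doc_name)
    for cells in reader:
        row = ET.SubElement(root, row_name)
        # zip() silently drops cells beyond the header length; this sketch
        # does no checking of names or row lengths.
        for name, value in zip(names, cells):
            ET.SubElement(row, name).text = value
    return ET.tostring(root, encoding="unicode")
```

For the input 'a,b\n1,"x&y"\n' this yields
<doc><row><a>1</a><b>x&amp;y</b></row></doc>, with quoting handled by the
CSV parser and escaping handled by the XML writer.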

Ten minutes of that time were spent on getting a couple of Perl
packages from CPAN, one for parsing CSV and another one for writing XML,
which reduced the code I actually had to write to 23 lines.

Considering this to be too sloppy, I spent some more time, adding
a *nix-style CLI (for file names, element names,...), data checking
(invalid element names, excess cells in a row), default element names
for cells (using "A", "B",...), CLI documentation etc.
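
The checks and defaults listed above could look roughly like this in
Python (again a sketch, not the Perl code). The helper names are
hypothetical, the element-name test is a simplified ASCII-only
approximation of XML's name rules, and continuing with "AA", "AB" after
"Z" is an assumption borrowed from spreadsheet column naming.

```python
import re
from string import ascii_uppercase

# Simplified ASCII-only approximation of an XML element name (assumption;
# the real XML Name production also permits many non-ASCII characters).
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")

def default_names(count):
    """Spreadsheet-style default cell names: A, B, ..., Z, AA, AB, ..."""
    names = []
    for i in range(count):
        name = ""
        n = i
        while True:
            name = ascii_uppercase[n % 26] + name
            n = n // 26 - 1
            if n < 0:
                break
        names.append(name)
    return names

def check_names(names):
    """Return the names that are not usable as element names."""
    return [n for n in names if not NAME_RE.fullmatch(n)]

def check_row(names, cells, line):
    """Return an error message if the row has more cells than names."""
    if len(cells) > len(names):
        return f"line {line}: {len(cells)} cells, but only {len(names)} names"
    return None
```

Here default_names(3) gives ["A", "B", "C"], and check_names flags header
values such as "2bad" or "has space" before any XML is written.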

Ken's solution falls short on a few points I was able to add easily. I can't
say how difficult they would be to add to Ken's existing solution - it might
not be a matter of minutes for some of those add-ons.


Perl's CPAN is a great asset. Certainly, the quality of its offerings varies,
but the packages are tested and users report on their experience. (Why
doesn't XSLT have anything like it?)

Ken used a proprietary (?) solution for embedding documentation that can
be extracted into HTML. Now that's great, but it is a solitary answer to the
problem. Perl's pod is a somewhat clunky solution but it is supported with
a rich toolset, along with the Perl distribution. I consider the
existence of a documentation format that is defined along with the
language as "state of the art" and essential for sustainable SW
development.

XSLT is "special purpose" for XML handling and consequently easy to use,
but it isn't better than the average language for string processing.


XSL-List info and archive:
To unsubscribe, go to:
or e-mail: 

Adam Retter

skype: adam.retter
tweet: adamretter

