Addons/xml/sax
xml/sax - XML parser based on Expat library
SAX (Simple API for XML) parser addon.
It provides both a flat API and an object-oriented, SAX-like interface.
Binaries for Windows, Linux x86 and Darwin PPC are included.
Based on Expat 2.0.0, see http://expat.sourceforge.net/
See also: the examples in the test folder in SVN, and the change history.
Installation
Use JAL/Package Manager or download the xml_sax archive from JAL:j602/addons and extract it into the ~addons/xml/sax folder (or ~addons/xml for j504).
Listings and results of some of the examples can be found in the test folder.
SAX Usage
SAX (Simple API for XML) originated as a Java API, coordinated by David Megginson, and shares its event-driven processing model with Expat. Because the document is streamed rather than held in memory as a tree, SAX has a very small memory footprint and is typically faster than DOM processing. See http://www.saxproject.org/.
SAX parsing follows the push model, i.e. the parser calls you. You provide the callback functions by overriding those of the base class (see the saxclass definition); the parser calls them as the corresponding XML node events occur.
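To make the push model concrete, here is a minimal sketch in Python, using its standard `xml.sax` module rather than the J addon. The class name `ElementCounter` and the sample document are made up for illustration; the point is only that the parser, not your code, drives the calls.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts how often each element name starts."""

    def startDocument(self):
        # Called once by the parser before any element event.
        self.counts = {}

    def startElement(self, name, attrs):
        # Called by the parser for every opening tag.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = ElementCounter()
xml.sax.parseString(b"<a><b/><b/><c>text</c></a>", handler)
print(handler.counts)
```

Events that the handler does not override (here, end tags and text) fall through to the base class, which ignores them.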
A higher-level visitor design pattern can be obtained if you define verbs named after the elements of interest, with a prefix, and call them from startElement and endElement. This is similar to wd calling event verbs.
In your class you maintain the state and selectively process the events. The event for text between tags is called characters. It is demoed in the table and rss examples.
In the "rss" example, a simple stack of nested elements is maintained in the S list. The characters event then processes the text according to the current context.
You can return the result of process as the output of endDocument, which is the last event called.
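The stack-and-context idea can be sketched as follows, again in Python's standard `xml.sax` and not in the J addon. The class name `TitleCollector`, the feed content and the `result` attribute are assumptions for illustration. Python's parser does not return the value of `endDocument`, so the sketch stores the result on the handler instead.

```python
import xml.sax

# Made-up sample feed; only titles directly inside <item> are wanted.
FEED = b"""<rss><channel><title>Feed</title>
<item><title>First</title></item>
<item><title>Second</title></item>
</channel></rss>"""


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects item titles using an element stack."""

    def startDocument(self):
        self.stack = []    # names of the currently open elements
        self.titles = []
        self.result = None

    def in_item_title(self):
        # The context test: the two innermost open elements.
        return self.stack[-2:] == ["item", "title"]

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.in_item_title():
            self.titles.append("")

    def endElement(self, name):
        self.stack.pop()

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_item_title():
            self.titles[-1] += content

    def endDocument(self):
        # Last event: publish the final result.
        self.result = self.titles


handler = TitleCollector()
xml.sax.parseString(FEED, handler)
print(handler.result)
```

The channel-level title is skipped because its enclosing element is channel, not item.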
x2j - XML to J
x2j is an extension in the 'xml/sax' addon that lets you describe XML parsing compactly, in a declarative way.
Hopefully, this will lower the entry barrier to using XML with J.
Simple things are really simple; complex ones are, well, possible. To give you an idea, say you want to convert an HTML table to a J table.
<table>
  <tr> <td>0 0</td><td>0 1</td><td>0 2</td><td>0 3</td> </tr>
  <tr> <td>1 0</td><td>1 1</td><td>1 2</td><td>1 3</td> </tr>
  <tr> <td>2 0</td><td>2 1</td><td>2 2</td><td>2 3</td> </tr>
</table>
Here's all the code with definition and execution:
require 'xml/sax/x2j'          NB. Preamble
x2jclass 'px2j3'

'Items' x2jDefn                NB. Definitions
  /  := table             : table=: 0 0$0
  tr := table=: table,row : row=: ''
  td := row=: row,<y
)

   process_px2j3_ TESTX2J3     NB. Processing
+---+---+---+---+
|0 0|0 1|0 2|0 3|
+---+---+---+---+
|1 0|1 1|1 2|1 3|
+---+---+---+---+
|2 0|2 1|2 2|2 3|
+---+---+---+---+
See the link above for more examples, such as nested structures and attribute handling.
As illustrated above, the usage pattern consists of three parts:
- Preamble loads the x2j library and inherits from the px2j class
- Definitions in full or compact notation (see below)
- Processing starts parsing, which will call respective definitions
Each definition consists of
- Path part, which is matched against the path of the current node. Currently, only the tail part of the path is matched; future versions may support more XPath expressions. The special root path '/' matches the top-level document.
- Explicit Definition (J code) which is invoked when the current parsed node matches the path.
Ambivalent definitions (monadic and dyadic) are used for elements, where the
- dyad is called at the start of an element, having
- y with element name and
- x with attributes, accessed with the verb atr, e.g. test=: atr'test' or 'one two'=:atr'one two'
- monad is called at the end of an element, having
- y with element name.
Monadic definitions are used for text nodes, where the
- y contains the string value of the text
In each definition category (element or text), if the same path is repeated, only the first occurrence is used.
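As a rough illustration of tail matching and the first-occurrence rule, here is a small Python sketch. It is not the addon's implementation: the function `match_definition`, the list-of-pairs table and the string actions are all hypothetical. The text does not say how two different paths that both match are prioritized, so the sketch simply follows table order.

```python
def match_definition(definitions, open_elements):
    """Return the action of the first definition matching the current node.

    definitions   -- list of (path, action) pairs, in definition order
    open_elements -- names of the currently open elements, outermost first;
                     an empty list stands for the top-level document
    """
    for path, action in definitions:
        if path == "/":
            # Special root path: matches only the document level.
            if not open_elements:
                return action
            continue
        parts = path.split("/")
        # Tail match: the path must equal the innermost open elements.
        if open_elements[-len(parts):] == parts:
            return action
    return None


definitions = [
    ("/", "root action"),
    ("elm1/elm2", "first elm1/elm2 action"),
    ("elm1/elm2", "repeated path, never reached"),
    ("elm3", "elm3 action"),
]

print(match_definition(definitions, ["doc", "elm1", "elm2"]))
print(match_definition(definitions, ["doc", "elm3"]))
print(match_definition(definitions, []))
print(match_definition(definitions, ["doc", "elm4"]))
```

In the addon, element and text definitions are matched in separate categories; the sketch models a single category.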
Below is the summary of the full and compact definition notations.
Full Definitions
Element definitions are of the form
'path' x2jElm (3 : 0)
end of element code
:
start of element code
)
Text definitions are of the form
'path' x2jChar (3 : 0)
text code
)
The same path can be used in an element and a text definition.
The definitions are regular J explicit definitions, and thus can be
contracted to one-liners or be tacit, e.g.
'path' x2jChar (3 : 'code')
or even reference named verbs, e.g.
'path' x2jElm verb or
'path' x2jElm (monad : dyad) .
x2jElm and x2jChar are conjunctions that build corresponding matching tables for processing, so no copula (=:) is used.
Compact Definitions
Compact definition is a convenience notation, which is translated into a series of full definitions internally.
Each line consists of a path, the definition token := (not a copula), and a one-line explicit definition in which the monad and dyad parts are separated by a spaced colon ( : ). Note that the arguments x or y must be indicated explicitly.
'Note' x2jDefn
elm1 := monad code : dyad code NB. ambivalent - element
elm1/elm2 := : dyad code NB. dyad-only - element
elm1/elm3 := monad code NB. monad-only - text
)
A number of compact definitions can be freely mixed with full definitions.
Processing
x2j processing is similar to the SAX visitor pattern. As the XML is parsed, each node is visited in a depth-first manner. When the current node path matches one of the definitions, the corresponding code is invoked.
The passed parameters or attribute values can be assigned to locale variables with =: (not to local variables with =.), and then used in other definitions to construct higher-level items.
The result of the whole processing is the return value of the end-of-element (monad) definition for the document root (path '/'). All other results are ignored.
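For readers who know SAX from other languages, the same flow can be sketched in Python's standard `xml.sax`, not in x2j itself. State is shared between handlers through instance attributes, much as locale variables are shared between definitions, and the final value is produced at the end of the document. The class name `TableHandler`, the `result` attribute and the sample document are assumptions for illustration.

```python
import xml.sax

DOC = (b"<table><tr><td>0 0</td><td>0 1</td></tr>"
       b"<tr><td>1 0</td><td>1 1</td></tr></table>")


class TableHandler(xml.sax.ContentHandler):
    """Hypothetical handler: builds a list of rows from tr/td elements."""

    def startDocument(self):
        self.table = []     # start of the document: empty table
        self.row = []
        self.cell = None    # None means "not inside a td"
        self.result = None

    def startElement(self, name, attrs):
        if name == "tr":
            self.row = []   # start of a row: empty row
        elif name == "td":
            self.cell = ""

    def characters(self, content):
        if self.cell is not None:
            self.cell += content

    def endElement(self, name):
        if name == "td":
            self.row.append(self.cell)    # end of a cell: extend the row
            self.cell = None
        elif name == "tr":
            self.table.append(self.row)   # end of a row: extend the table

    def endDocument(self):
        # End of the document: this is the overall result.
        self.result = self.table


handler = TableHandler()
xml.sax.parseString(DOC, handler)
print(handler.result)
```

Each handler only updates shared state; nothing it returns is used, which mirrors the rule that only the root's end result counts.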