Extremely XML

XML is one of the more exciting file formats of the past few decades. Rather than just being a convenient way to store information, it opens that information up to more software than any other format in history.

Tuesday, June 27, 2006

Visualizing Social Networks by Harnessing SIOC and FOAF Documents and RDF Tools - and Producing SVG Diagrams

Fred Giasson wrote a very interesting blog entry entitled Implementing and visualizing relationships between Talk Digger's SIOC and FOAF documents.

Researchers are now publishing a steady stream of papers about analyzing social networks using semantic web technology.

The reasons for that are pretty obvious:
  1. There are networks of bad people doing bad things in the world, including the US, right now - so social networks are topical.
  2. Social networks are already documented in semantic web syntax whenever FOAF documents are involved.
  3. Page scraping software can get information published on the Web as HTML text into semantic web compatible XML very quickly.
  4. Semantic web analysis tools are already very powerful and widely available for free.
  5. SVG is widely supported (including by Firefox 1.5) and, being another XML file format, it is easy to convert semantic web data (RDF- and OWL-based XML) into SVG using tools like IsaViz.
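Regarding point 2: FOAF files describe people, and who they know, using a small RDF vocabulary. A minimal FOAF fragment looks something like this (the names and mailbox here are invented for illustration):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Alice Example</foaf:name>
    <foaf:mbox rdf:resource="mailto:alice@example.org"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Bob Example</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
```

Each foaf:knows link is one edge in the social graph - which is why these files are such ready-made input for network analysis.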


So in other words, analyzing social networks provides low-hanging fruit for semantic web researchers and commercial companies and other organizations.

The toolchain he describes in his post is simple, completely based on free software, and broadly useful to information analysis projects involving data mining and graphical visualization of the results.

PC Pro: News: paper at w3 conference uses semantic web to turn social web into information goldmine

Data mining seeks to bring out valuable nuggets of knowledge buried deep in a morass of data. Not only can it do that, it often does.

Really poor data mining jumps to conclusions about people based on false premises. An example would be assuming that matching first and last names proves matching identity. It does not, and everyone knows it does not.

Good information is qualified by carefully matching up multiple pieces of information and taking context into account.

Semantic web researchers recently used their skills to piece together a puzzle involving a huge number of people.

They analyzed a bunch of FOAF files, figured out who-knew-who - and compared that with a list of C.S. researchers.
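The who-knew-who step can be sketched in Java. This is not how the researchers did it (they used full RDF tooling, which handles the complete graph model), and it ignores RDF subtleties like cross-file references - but for simply-structured FOAF documents, a plain namespace-aware DOM walk finds the foaf:knows links. All names in the sample are invented:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class FoafKnows {
    public static final String FOAF = "http://xmlns.com/foaf/0.1/";

    // A tiny, invented FOAF document: Alice knows Bob and Carol.
    public static final String SAMPLE =
        "<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'"
      + "         xmlns:foaf='http://xmlns.com/foaf/0.1/'>"
      + "<foaf:Person><foaf:name>Alice</foaf:name>"
      + "<foaf:knows><foaf:Person><foaf:name>Bob</foaf:name></foaf:Person></foaf:knows>"
      + "<foaf:knows><foaf:Person><foaf:name>Carol</foaf:name></foaf:Person></foaf:knows>"
      + "</foaf:Person></rdf:RDF>";

    // Extract the names of the people the document's subject claims to know.
    public static List<String> knownNames(String foafXml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true); // essential: FOAF lives in its own namespace
        Document doc = f.newDocumentBuilder()
                .parse(new ByteArrayInputStream(foafXml.getBytes("UTF-8")));
        List<String> names = new ArrayList<String>();
        NodeList knows = doc.getElementsByTagNameNS(FOAF, "knows");
        for (int i = 0; i < knows.getLength(); i++) {
            // Each foaf:knows wraps a foaf:Person with a foaf:name inside.
            NodeList ns = ((Element) knows.item(i))
                    .getElementsByTagNameNS(FOAF, "name");
            for (int j = 0; j < ns.getLength(); j++) {
                names.add(ns.item(j).getTextContent());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(knownNames(SAMPLE)); // prints [Bob, Carol]
    }
}
```

Run the same extraction over many FOAF files and you have the edge list of a social network, ready to compare against an author database.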

They took that a step further, and tried to determine how prevalent the possibility of conflict of interest (COI) issues were.

PC Pro:
The plan was to map a simple social network, Friend of a Friend, where individuals listed their immediate friends, against a commercial bibliographic database of computer science paper authors.

The latter was semantically tagged, whereby records carry additional data describing each record - for example subject, date, author and so on. This means that online information can be meaningful not just to people viewing it, but also to computers accessing that data.

The goal of the research project was to discover whether there were any conflicts of interest between those authors putting forward papers and those chosen to review them. The researchers claimed the project brought out inferences that a simple topographical view would have missed.


The result is that, while preparing their paper, Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection, the researchers gleaned some interesting facts - facts they would not have stumbled over or inferred any other way.

What makes it interesting is how many people in the US have dumped their information into MySpace and other social websites. Even more interesting is that they identify, on those same sites, who their friends are. Sadly, I doubt most of the friends listed there really are friends in any conventional sense of the word.

That is where context actually comes in.

However, the mechanism is still valid. The source input data just has to be of adequate quality. The FOAF data the researchers culled was closer to that than MySpace data would be. So their conclusions, assuming identities were matched by more than first and last name, are probably interesting.

That... makes the Semantic Web a whole lot more interesting.

Wednesday, June 21, 2006

Java Platform, Standard Edition (Java SE) 6 Beta 2

Java SE 6 (JDK 1.6) is on beta 2 now.

There are some interesting improvements being made that will make Java nicer for people working with XML - and persisting information with it.

Java Platform, Standard Edition (Java SE) 6 Beta 2:
New Client and Core Java API for XML Web Services (JAX-WS) 2.0 APIs
New support for Java Architecture for XML Binding (JAXB) 2.0


Seeing how they have hit beta 2 at mid-year, it seems very likely this JDK will see its official release before the end of the year. That is just going by past history, especially recent history.

Sounds like Java will get even better for XML-oriented developers very soon.

Monday, June 19, 2006

DocBook 5 beta 6 has added support for SVG+MathML

Great news this month for those doing software documentation using DocBook!

Not only does the latest beta of DocBook 5.0 support Schematron schema rules and a RELAX NG schema grammar for validating DocBook documents - it also adds support to DocBook itself for MathML and SVG.

It does not take a genius to figure out why DocBook upgraded its validation scheme when it upgraded itself to support these 2 new XML modules.

The XHTML 2.0 standard is another XML standard that is eschewing the older DTD/XSD schema technology in favor of the flexibility - not to mention the ease of writing, reading, and understanding - that RELAX NG offers.

My guess about the motivation for adding support for SVG+MathML is that Firefox 1.5 supports them, as well as XHTML. And not just individually: Firefox supports having all 3 of them combined in a single document.

My guess is that the world of software documentation authoring/reading/distribution is about to take a turn into W3C nirvana. It is nice to see that after existing for so long as separate - and often ignored - technologies, they are all finally getting integrated.

Cafe con Leche XML News and Resources:
Norm Walsh has published the sixth beta of DocBook 5.0.

DocBook 5 is "a significant redesign that attempts to remain true to the spirit of DocBook."

The schema is written in RELAX NG. A DTD and W3C XML Schema generated from the RELAX NG schema are also available.

There's also a Schematron schema "that validates some extra-grammatical DocBook constraints. These patterns are also present directly in the RELAX NG Grammar and some validators, for example MSV, can perform both kinds of validation at the same time."

This beta allows MathML and SVG in imagedata and improves support for aspect-oriented programming source code in DocBook documents.
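As a sketch of what "MathML and SVG in imagedata" allows (the exact markup details may differ between betas - treat this as illustrative, not normative), a DocBook 5 mediaobject could carry a drawing inline as SVG instead of pointing at an external image file:

```xml
<mediaobject>
  <imageobject>
    <imagedata>
      <svg:svg xmlns:svg="http://www.w3.org/2000/svg"
               width="100" height="100">
        <svg:circle cx="50" cy="50" r="40"/>
      </svg:svg>
    </imagedata>
  </imageobject>
  <textobject>
    <phrase>A circle, rendered inline</phrase>
  </textobject>
</mediaobject>
```

A DTD cannot cleanly validate a compound document like this - which is exactly why the move to RELAX NG matters.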

Schematron, You Are On - paper, that is

At long last, the ISO Schematron standard spec is available as printed hardcopy from the ISO.

Not only that, it is available from ANSI too! Most Americans will probably get it from ANSI, since its price is in dollars - not Swiss francs. Not to mention most Americans are better at reading English than French and do not live particularly close to central Europe.

O'Reilly XML Blog:
The paper and online versions of the ISO Schematron standard are now available from ISO for CHF120 and from ANSI for US$98.


This is one of the simplest ways there is to validate an XML document. It is based on rules, not a grammar. So it works better for XML document formats that are, as should be obvious, specified according to rules - not a grammar per se.

Combined with the RELAX NG schema format, which is a grammar based schema language (and a simple one at that) - Schematron rules can provide a lot of flexibility in determining if (or why not!) an XML document is valid.
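To make the rules-versus-grammar distinction concrete, here is a small ISO Schematron rule (the element and attribute names are invented for illustration). A grammar can say that an order element has start and end attributes; only a rule can say how their values must relate:

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="order">
      <assert test="@start &lt;= @end">
        An order's start date must not be later than its end date.
      </assert>
    </rule>
  </pattern>
</schema>
```

Each assert is an XPath test evaluated in its rule's context; a failed test produces the human-readable message, which doubles as documentation of the constraint.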


Bio of an XML 'Borg

Back in 1997 or 1998, I flicked a URL to a friend and former coworker of mine by email.

That link was to a new World Wide Web Consortium standard called XML.


I did not think much of it at the time. I remember I just thought it was interesting as a better way to represent data files than tab-separated values and a lot better than the crufty comma separated value (CSV) file format.

I also remember thinking that it was rather verbose, and not all that inspired really - to simply take the HTML format and essentially say, "There, you can make up your own tag names now, and use it for data!"

I mean that was okay as ideas went, but hardly revolutionary.

I was wrong.


My friend went silent for days after I sent that URL. And I think days turned into a couple weeks.

I thought he was mad at me over some slight I was unaware of but nevertheless responsible for, so I did not bother him. I figured if he wanted to get back to me, he would.

It turns out he was not mad at me at all. He was very, very deeply absorbed in learning everything he could about how XML was used.

That was something I had not bothered to do. I saw it as a data file format, and that was that.

Two weeks after I sent him my message, my friend broke the utter silence that had suddenly opened up between us.


He informed me that there was rapid progress being made in XML parsers, which was being spurred by the fact that there was a standardized parser API called SAX. So applications were written to use parsers via that API, and parsers basically functioned as plugins to them.
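That plugin arrangement is easy to see in code. Here is a minimal SAX content handler - written against the javax.xml.parsers facade Java later standardized for exactly this pluggability, not the raw 1998-era SAX bootstrap - that just listens to events from whichever parser implementation is installed:

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;

public class ElementCounter extends DefaultHandler {
    private int count = 0;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        count++; // the parser pushes events at us; we just listen
    }

    public static int countElements(String xml) throws Exception {
        ElementCounter handler = new ElementCounter();
        // The factory picks whatever parser implementation is installed -
        // the application code never names a concrete parser class.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), handler);
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c><d/></c></a>")); // prints 4
    }
}
```

Swap in a different SAX parser and this code does not change - which is the whole point of a standardized parser API.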

He went on to tell me that there was work underway on a new standard formatting tool called XSLT. He detailed some of the arguments between one of its inventors and a community that wanted it to be not merely something to style documents with - but a programming language for manipulating them as data.
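XSLT ended up being both: a styling tool and a language for manipulating XML as data. A minimal sketch using Java's standard TrAX API - the stylesheet and element names here are invented for illustration - shows the data-manipulation side, rewriting one vocabulary into another:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class TinyXslt {
    // An XSLT stylesheet is itself an XML document. This one rewrites
    // <name> elements into <greeting> elements.
    static final String XSL =
        "<xsl:stylesheet version='1.0'"
      + "  xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "  <xsl:template match='/names'>"
      + "    <greetings><xsl:apply-templates/></greetings>"
      + "  </xsl:template>"
      + "  <xsl:template match='name'>"
      + "    <greeting>Hello, <xsl:value-of select='.'/>!</greeting>"
      + "  </xsl:template>"
      + "</xsl:stylesheet>";

    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<names><name>Ada</name></names>"));
    }
}
```

The same engine, pointed at a different stylesheet, could just as easily have produced styled HTML - which is why the "style versus programming language" argument could never be settled by the tool itself.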

He told me about the API wars being waged over the choice of functions that would be available in a standard programming library for manipulating XML documents from applications.


He and I subsequently did a presentation about XML to a company in Beltsville, Maryland in mid-1998, introducing XML technology to an R&D office of a blue chip company in the suburbs of Washington D.C.

In late 1998 or early 1999, right around the turn of the year, I wrote my first Java code to generate XML. It was fun and it was easy. It generated a dump of the in-memory data model of an application I was debugging.
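A dump like that is straightforward with the standard build-a-DOM-then-serialize approach. This is a generic sketch, not the original application's code - the model here is just a key/value map, and the element names are invented - but it shows the useful part: the serializer escapes reserved characters for you, so the output is always well-formed:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class ModelDump {
    // Dump a simple in-memory key/value model as XML.
    public static String dump(Map<String, String> model) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("model");
        doc.appendChild(root);
        for (Map.Entry<String, String> e : model.entrySet()) {
            Element entry = doc.createElement("entry");
            entry.setAttribute("key", e.getKey());
            entry.setTextContent(e.getValue()); // text is escaped for us
            root.appendChild(entry);
        }
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> m = new LinkedHashMap<String, String>();
        m.put("status", "debugging");
        m.put("note", "a < b"); // the '<' comes out as &lt;
        System.out.println(dump(m));
    }
}
```

For a debugging dump, that escaping guarantee is exactly what makes XML easier than hand-rolled text formats.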

In 2000, I wrote some Java code that exported the definitions of the tapes in a tape library system for which I was one of the developers. I also wrote the Java code that would read that same document, verify that it was well-formed, and import the data if it was.

That pair of commands proved a very handy way to let our system operators do some basic maintenance of the system - without our team having to write a whole data entry/editing module for those entities.
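The verify-before-import half can be sketched with a throwaway SAX parse: if the parse completes without a fatal error, the document is well-formed. This is a generic sketch, not the tape library system's actual code, and the sample element names are invented:

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

public class WellFormed {
    // True if the document parses cleanly; any fatal parse error
    // (mismatched tags, bad characters, etc.) means not well-formed.
    public static boolean check(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new DefaultHandler());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<tapes><tape id='T01'/></tapes>")); // true
        System.out.println(check("<tapes><tape></tapes>"));           // false
    }
}
```

Running this gate before the real import meant a hand-edited export file could be rejected up front, instead of half-loading bad data.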


After that, I got into using XML for documents. Generating web pages, word processor files, spreadsheets, PDF reports - that sort of stuff. I did a lot of that, and I enjoyed it.

If anything was particularly tedious about that, I would have to say it was the incomplete support for certain things in the standards.

While support for reading/writing XML itself has never really been lacking, what has been kind of incomplete is the software support for the most complex XML-related (or XML-compatible) technologies. In my experience, these things include:
  1. CSS
  2. XSL-FO

The things that generally did not seem to fall short of implementing the standards were: XML parsers, DTD validators, and XSLT transformers. Those things worked pretty well. If you wrote your programs for them based on what the spec said, your programs would generally work.

Things have gotten more complicated, but the tools someone can bring to bear when working with data in XML format today are awesome.

That is why I named this blog Extremely XML. That, and because XML is not just a data file format. It is all the processing that can be brought to bear on information when it is stored in that format. That is the whole point of using the XML format to house it.