User:DSosnoski
Please don't touch anything VTD-XML related, it is quite rude... If you still don't get the benefit of VTD-XML, there are other people who appreciates it... Thanks!
The above text is what was created for this page associated with my user name, as a comment from the person who restored the inappropriate references to VTD-XML which I'd edited on Wikipedia. VTD-XML is promoted by spamming on Java users lists worldwide, as well as on Wikipedia. In actuality, there would be a useful role for VTD-XML if the developers would try to make use of the strengths of the approach while also recognizing the benefits of other approaches. Instead, the tendency I've seen has been to promote VTD-XML for anything and everything XML-related, whether it's a good approach or not.
Here's an email I wrote to the Seattle Java Users List in response to a VTD-XML spam email:
Hi Jimmy,

Your benchmark code only does a parse step, which effectively builds an index to the document. Here's the actual timing code as you supply it:

    for (int j = 0; j < 10; j++) {
        l = System.currentTimeMillis();
        for (int i = 0; i < total; i++) {
            vg.setDoc(ba);
            vg.parse(true);
        }
        long l2 = System.currentTimeMillis();
        lt = lt + (l2 - l);
    }
    //System.out.println("latency "+ ((double)lt/10)+ " ms");
    System.out.println(" average parsing time ==> "+ ((double)(lt)/total/10) + " ms");
    System.out.println(" performance ==> "+ ( ((double)fl *1000 * total)/((lt/10)*(1<<20))));

You are not actually retrieving the data or making use of it in any way in your test code. This hardly represents a realistic performance test for the vast majority of applications, which will at least want to look at the text content. As a minimal realistic test, I'd suggest counting the number of elements and attributes and processing each character of attribute data and character data content (even something as simple as just summing together all the character values for the attribute data and character data content).

As for XMLBeans, this uses an underlying data store which is at least somewhat similar to what you're building in VTD. The data binding and such that XMLBeans implements is really just a facade over the data store. Because of this XMLBeans also does better than DOM in terms of memory usage, though I think XMLBeans stores data as characters rather than raw bytes.

Since you replied, I am curious about the level of XML well-formedness checking you do in VTD. Do you check that all characters are legal according to the XML productions? How about element and attribute names? Do you enforce namespace rules? How about checking for duplicate attribute names on an element? These are all checks that are required by XML and add to the overhead of most parsers.
I don't know if you based your XML parsing on an existing parser or wrote your own, but if it's the latter these are all issues which are easily overlooked. You might want to consider adding your answers to these questions to the FAQ, since the questions are probably going to occur to most developers who are familiar with XML.

Calling VTD a vanity project was overly harsh, and I do apologize for that. I think it has very interesting potential for some types of applications, especially if the benchmarks are made more relevant and answers are provided for these types of questions.

- Dennis

Dennis M. Sosnoski
SOA, Web Services, and XML Training and Consulting
http://www.sosnoski.com - http://www.sosnoski.co.nz
Seattle, WA +1-425-296-6194 - Wellington, NZ +64-4-298-6117
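The minimal realistic test suggested in the email above can be sketched as follows. This is a hypothetical illustration using the standard SAX API, not code from either party's benchmark: it counts elements and attributes and touches every character of attribute and character data content, so the parser cannot be credited with "parsing" data it never delivered.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Minimal "realistic" benchmark body: count elements and attributes and
// sum the character values of all attribute and character data content.
public class ContentSumTest {
    static long elements, attributes, charSum;

    public static void main(String[] args) throws Exception {
        byte[] doc = "<root a=\"1\"><child>text</child></root>".getBytes("UTF-8");
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(doc), new DefaultHandler() {
                @Override public void startElement(String uri, String local,
                        String name, Attributes atts) {
                    elements++;
                    attributes += atts.getLength();
                    // Process every character of attribute data.
                    for (int i = 0; i < atts.getLength(); i++) {
                        String value = atts.getValue(i);
                        for (int j = 0; j < value.length(); j++) {
                            charSum += value.charAt(j);
                        }
                    }
                }
                @Override public void characters(char[] ch, int start, int len) {
                    // Process every character of character data content.
                    for (int i = start; i < start + len; i++) {
                        charSum += ch[i];
                    }
                }
            });
        System.out.println(elements + " elements, " + attributes
            + " attributes, char sum " + charSum);
    }
}
```

The point of the summing step is that it forces the timed code to actually deliver content to the application, which is what most real XML processing requires.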
And the follow-up:
Jimmy Zhang wrote:
> The reason the benchmark doesn't retrieve any data and use it is because
>
> (1) It is difficult to benchmark that because there are a lot of different ways data
> can be retrieved and used, how do you benchmark all use cases to make it fair
> and unbiased?? how?

Then don't do benchmark comparisons with parsers that actually supply the data to the application in a usable form. The DOM benchmark comparisons are at least more reasonable from this standpoint, though if you're going to benchmark at all I'd love to see some tests that actually compare usage rather than just the time taken to input the document. So how fast is your XPath support compared to the DOM XPath support for a few sample queries, for instance? And how fast is basic data retrieval compared to DOM? These are all issues that are important to real users, but your benchmark consists solely of comparing your time to index a document with that of building a complete DOM or parsing a document with the full content delivered to the application as text.

> (2) Just like DOM, VTD-XML is intended to be offer random access, and
> the overhead of retrieving data and making use of it ranges from zero to negligible,

"from zero to negligible" - gotta love that phrase. If that's really true, then adding a retrieval test to the benchmark shouldn't affect the results.

> The basic idea of VTD is that string creation is not only slow, but also avoidable.
> I recommend to read the FAQ section it, for example, VTD-XML contains the
> functions to convert a VTD record into an integer, bypassing the string creation step,
> saving the object creation cost and garbage collection cost... that is why for the mininal
> realistic test you mentioned simply won't make any difference

I understand and appreciate that you're able to avoid some overhead from string conversions - it's something I also try to avoid when I'm writing code for performance.
Unfortunately, many components of an XML document need to be retrieved as text to be useful to an application.

> We didn't overlook the wellformedness issue, we have been explicit about the limitation
> of VTD-XML such as not supporting external reference, etc, we of course check for
> legality of characters and duplicated attributes...

I'm glad to hear that, though as I said before this is something I think you should address in your FAQ.

> we have no intention to deceive anyone, the reason we did all the posting is because we
> are very certain VTD-XML is going to help people solve a lot of problems and make
> previously impossible tasks possible...

I have a hard time seeing how you've made previously impossible tasks possible. I think your approach does have some nice potential, especially if you can extend it to work with documents larger than memory size without major performance hits. But does it even support document creation or modification? I didn't see any obvious way to handle this in the API. If it's read-only, that's a major limitation for most uses of XML.

> And we have got a lot of excited people who deeply resented SAX, tried VTD-XML,
> and can hardly believing what they have seen.
>
> 2x as fast as SAX with NULL content handler, 1/4 the memory usage of DOM,
> and still XPath capable, I just don't see how VTD-XML is deceptive...

I'm glad you've got the excited users. That's always the ultimate test of success for any open source project. From my perspective the SAX speed comparison isn't valid, since SAX delivers the content to the application as the document is parsed. The lower memory usage and speed advantages over DOM are nice, but they're only a linear improvement - you're not able to handle vastly larger documents than DOM. In fact, DOM running on a 64-bit version of Java would actually support handling larger documents than VTD (though I wouldn't recommend DOM for this purpose).
XPath support just matches what you get with DOM (if that - here again, I haven't been able to find anything on your site that tells how much of XPath is supported, so it's difficult to compare with other libraries).

The fixed maximum sizes on documents, names, prefixes, etc. are also potential issues with VTD. I know of organizations that are already processing documents larger than 2 GB, for instance (you see them trying to exchange these documents as SOAP messages on the Axis list from time to time), and complex documents can get beyond 255 levels of nesting (though thankfully this type of structure is uncommon).

- Dennis
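The kind of usage comparison asked for in the email above - XPath query speed rather than raw indexing speed - can be illustrated with the standard JAXP XPath API against DOM. This is a hypothetical sketch, not code from either benchmark; the document and expression are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Evaluate an XPath expression against a DOM tree. Timing repeated calls
// of a query like this, against the equivalent VTD-XML XPath call, would
// be the sort of comparison that matters to real users.
public class DomXPathDemo {
    static int countMatches(byte[] doc, String expr) throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new ByteArrayInputStream(doc));
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expr, dom, XPathConstants.NODESET);
        return hits.getLength();
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<orders><order id=\"1\"/><order id=\"2\"/></orders>"
            .getBytes("UTF-8");
        // Selects the single order element with id="2".
        System.out.println("matches: "
            + countMatches(doc, "/orders/order[@id='2']"));
    }
}
```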
And a final comment:
This doesn't appear to be a productive discussion, so I won't continue it beyond this email. I will point out a couple more weaknesses in your benchmarks, though, in case you ever want to improve the quality of the results.

In your VTD timing test program, you read the test documents completely into a byte array in memory. This is standard for timing tests working with XML documents, since it eliminates the I/O overhead, which is highly system-dependent and not part of the actual processing. What's not standard is that you then supply the byte array directly to your code for indexing, while all other forms of processing use a ByteArrayInputStream backed by the byte array. By supplying the byte array directly for indexing you're creating a very artificial basis for comparison: real applications are not generally going to have the documents to be processed conveniently located in byte arrays already read into memory. They're instead going to have to read the documents into memory, creating a byte array to be passed to VTD, since VTD will not work with documents in any other form. For a more realistic test you should use a ByteArrayInputStream as the data source, including the time required to read the document from the stream and create a new byte array as part of the VTD timing.

In your DOM memory test program, you do not allow any garbage collection to take place after repeatedly parsing the input document to create the DOM representation. This means that your DOM memory usage figures include a large amount of memory that was used for temporary objects which would return to the available memory pool after a garbage collection. Based on past experience with DOM, this can cause the memory usage to be overstated by as much as a factor of 2.
You use the same approach for measuring VTD memory usage, but the VTD indexing approach does not create any significant number of temporary objects (part of why it's certainly faster than the DOM build), so the memory usage in that case is probably much more accurate. The technique I've used in the past to encourage garbage collections, when I wanted to see how much memory was actually being used by data structures, was to loop 5-10 times on a System.gc() call, with a 1 second delay between calls.

- Dennis

Dennis M. Sosnoski
SOA, Web Services, and XML Training and Consulting
http://www.sosnoski.com - http://www.sosnoski.co.nz
Seattle, WA +1-425-296-6194 - Wellington, NZ +64-4-298-6117
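The fairer timing setup described in the email above - charging the stream read and byte-array creation to the VTD side of the comparison - might look like the following sketch. The `readFully` helper and the commented-out VTD calls are hypothetical stand-ins, not code from the actual benchmark:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the more realistic test setup: every parser under test reads
// its input from a ByteArrayInputStream inside the timed region, so the
// byte-array copy VTD requires is counted as part of VTD's time.
public class StreamTiming {
    // Read the full stream into a new byte array - the step real
    // applications must perform before handing a document to VTD.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        while ((count = in.read(buffer)) > 0) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<root><child/></root>".getBytes("UTF-8");
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            // Stream read and array creation counted inside the timed loop.
            byte[] copy = readFully(new ByteArrayInputStream(doc));
            // vg.setDoc(copy); vg.parse(true);  // the VTD calls being timed
        }
        System.out.println("elapsed: "
            + (System.currentTimeMillis() - start) + " ms");
    }
}
```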
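The garbage-collection technique mentioned at the end of the email can be sketched as below. The `settledMemoryUse` helper name is made up for illustration; the repeated `System.gc()` calls with a delay are needed because a single call is only a hint to the JVM and may not collect everything:

```java
// Sketch of the memory-measurement technique described above: encourage
// collection of temporary objects before reading heap usage, so the figure
// reflects live data structures rather than collectible garbage.
public class MemoryCheck {
    static long settledMemoryUse() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        // Loop several System.gc() calls with a delay between them
        // (the email suggests 5-10 iterations with a 1 second delay).
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(1000);
        }
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("settled memory use: "
            + settledMemoryUse() + " bytes");
    }
}
```

Measuring before and after building the data structure, and subtracting, gives a rough figure for the structure's own footprint.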