
Real programmers manage petabytes with Java and Python

Spotted on FoRK:

"I'll just mention that Fermilab uses a home-grown python package called Enstore to manage their data store of 3 Petabytes of physics data, growing at 1PB/year. The transfers of ~25TB/day to and from that system is what keeps me busy.
There's also a Java-based front-end called Dcache for caching and grid access. Part of that system just got a pile of raid units. 42 of them. They each hold 42 disk drives. Of 400GB. That's ~705 TB." - Wayne Baisley
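
The ~705 TB figure in the quote is easy to check. Here is a minimal sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB) and raw capacity with no RAID overhead subtracted; the function and constant names are made up for illustration and have nothing to do with Enstore or Dcache.

```python
# Raw capacity of the new RAID units, using the figures from the quote.
# Assumption: decimal units (1 TB = 1000 GB) and no RAID/parity overhead.
RAID_UNITS = 42
DRIVES_PER_UNIT = 42
GB_PER_DRIVE = 400


def raw_capacity_tb(units, drives_per_unit, gb_per_drive):
    """Return total raw capacity in decimal terabytes."""
    return units * drives_per_unit * gb_per_drive / 1000


total_tb = raw_capacity_tb(RAID_UNITS, DRIVES_PER_UNIT, GB_PER_DRIVE)
print(f"{total_tb:.1f} TB")  # 705.6 TB
```

Usable space would be lower once parity and hot spares are accounted for, but the quote does not say how the units are configured.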

A PDF about it here

October 3, 2005 01:27 PM


Brian Miller
(October 3, 2005 08:54 PM #)

Virtual machinery has data grids conquered, as you note. But computational grids are still obsessed with native execution speed, and so C++ and Fortran rule.

(October 4, 2005 06:57 AM #)

Thankfully, they are obsessed with native code because of their inability to see the benefits Java would bring them. Not to mention that Java is every bit as fast as any native app and, in many cases, faster, thanks to runtime optimizations that straight native code can't do. Plenty of tests posted on the net compare computationally intensive tasks in native C and C++ against the same tasks in Java, and the results swing both ways. Many tests show Java being faster even while running in its JVM.

(October 5, 2005 12:43 PM #)

Indeed. We had a clustering algorithm implemented in Fortran and hand-optimized in C. We converted the C code to Java and ran it with the JRockit JVM. It was quite a bit faster than the C version, so we went back to the C version and looked again at the hand optimizations and compiler optimizations to speed it up a bit. In the end the Java / JRockit version was 18-22% faster than the best C version we could get. The variance in the improvement had to do with the size of the dataset, where larger was better. That said, the Fortran version was still faster. However, after this experience, my preference in most numerically intensive systems with large data sets is as follows: Fortran, Java, C. C is just too slow by comparison. I never thought I would say that, but it is true. And yes, I tried this on both Windows and Linux and tried a number of different C compilers.