
Ripping PDF

In "Imitation is the saddest form of flattery", Dave Thomas writes:

"For this reason, I honestly don’t mind other publishers blatantly ripping us off. But I’d rather they didn’t. Instead, I’d rather they found their own ways of innovating, and build their own ideas that others found useful. The publishing industry is in transition. It needs all the good ideas it can get. All publishers should contribute in their own way to the reshaping of the industry. Simply aping someone else’s success won’t help the community as a whole."

I felt Dave Thomas came across as whiny (ed: too opinionated) disappointed, as well as unintentionally hinting that the Pragmatics' innovations don't pose high barriers to entry. Then I read Derrick Story's entry "New 'Rough Cuts' Provides Early Access to O'Reilly Books" and had a bit more sympathy: it makes no mention of the beta books model. Story links to another post where Tim O'Reilly says:

"At O'Reilly, we've always said that a key part of our business is watching the alpha geeks, and then building products to bring their knowledge and insights to a wider audience."

Seems like the alpha geeks have figured out how to rationalise book publication. In fairness, O'Reilly mentions the beta book scheme as an inspiration, and also mentions that he figured it out back in 2000 but didn't implement it.

Still, I wonder if either house gets why the beta book is valuable. It's not because one gets involved in shaping the book, it's because one gets information now. The lesson here is that for some purposes, people don't need the level of quality associated with a book if the information is timely. Weblogs and online articles have likely played a big part in this lowering of expected standards. The other lesson is that the traditional publishing cycle for a tech book is dysfunctional for open source projects, SaaS, and "release early and often" software. That cycle makes books instant legacy - by the time the book ships the project has moved on. I've seen this with Eclipse, WebWork, Spring, Rails, Subversion, Hibernate, you name it. One answer to this is something like a beta book programme, or for post-ship, the Sourcebeat model of getting maintenance releases to the book. [I can see the Sourcebeat model being adapted into the IT sector for documentation and operations manuals, which are notorious for getting out of sync with deployed systems.]

Here's the thing - distribution is still broken. Now that publishers have figured out how to function in a market redefined by open source projects and online services, you still can't get a book or chapter via a feed or as markup. It's all licenced PDF: from the Prags, from Safari Online, from Sourcebeat. It would be interesting to see the current moralizing around music/video/software copyright and distribution played out in the tech book sector if programmers ever start ripping PDFs to XML.

January 24, 2006 01:30 AM



Dominic Mitchell
(January 24, 2006 06:32 PM #)

Having recently had to do text extraction with PDFs at work, I'd say that it's a pretty secure form of copy protection. At least in terms of getting semantic markup out of the bloody things...

Dave Thomas
(February 14, 2006 06:17 AM #)


Sorry to come across as whiny. To be honest, I was more disappointed than anything else.

FWIW, I'd love to distribute the books as markup, and if I was clever enough to come up with a business model that let me pay authors for their work and Bank of America for my mortgage, I'd do it in a flash.

Cheers, Dave

Bill de hOra
(February 14, 2006 12:23 PM #)

Whatever about my (now edited) opinions on your post, your production model seems to kick everyone else's into touch (i.e. it's 100% disruptive). You've become the Toyota of tech books. And I've not much sympathy for any house that sat on the same model for years and didn't pass on the potential savings to me.

The only criticism I've heard is that it might pressure authors into tighter release cycles.

And - is it a race to the bottom? I don't think you'll start a publishing price war a la low cost airlines (but I think you *could* start shipping PDFs at 10 bucks for volume). Instead you're maintaining price parity but passing on other features to readers like early releases.

Also there's the free stuff. There's *tons* of good content on weblogs, better than many books. As a result, unless it's from you or Apress, I rarely buy "jobbing" books anymore - I'm tending to CS texts and classics instead; things that have longevity.

Maybe aggregating the online stuff, providing editorial oversight and repackaging it is the way to go, i.e. offering an alternative stream to ad revenue. It seems to have worked for Spolsky, as a one-off at least. For another example keep an eye on Redmonk (the analyst firm); I think they will perform this editorial "meet and greet" service for enterprise IT analysis, connecting the best online authors with enterprises, undercutting the vendor/analyst offering. Everyone's deploying OSS, who needs traditional analysis for that?

Joe Clark
(February 14, 2006 11:22 PM #)

Tagged PDFs extract reasonably well, though considerable cleanup is still necessary. The problem is that only my friends and I know about tagged PDFs and even “leading” publishers like O'Reilly don't care enough to use them. (Hint: Tick “Use eBook tags” in InDesign to get them automatically.)