35 internet years

zeldman

2001: "Thousands of new sites premiere every day. Most of them are built to support bad browsers instead of standards. It’s an epidemic. Enough already. We finally have good browsers. Let’s use them."

 

Zeldman

2008: "One accommodates Microsoft as one’s ancestors accommodated Imperial Rome. As a wiser man than me said, 'Render unto Caesar.'"

Installing buildr trunk (1.3.4 pre) on Ubuntu 8.10

Update 2009/04/11: Assaf has a better way:

"There's a snapshot of 1.3.4 you can gem install from apache.org without all the excessive dev dependencies.

sudo gem source --add http://people.apache.org/~assaf/build...
sudo gem install buildr"

WFM.

 

Buildr documentation:

To install Buildr from the source directory:

$ cd buildr

$ rake setup install

I got some errors doing that. This worked for me on Ubuntu 8.10

# cd /tmp
# wget http://rubyforge.org/frs/download.php/45905/rubygems-1.3.1.tgz
# tar xzf rubygems-1.3.1.tgz
# cd rubygems-1.3.1
# sudo ruby setup.rb
# sudo apt-get install python-setuptools
# sudo gem install echoe
# sudo gem install cucumber

# git clone git://github.com/buildr/buildr.git
# cd buildr
# rake setup install
# buildr --version
Buildr 1.3.4

This was to get to a post-1.3.3 Buildr to set up a Scala/Java project structure, as Buildr supports Scala compilation, plus I gather there's lots of good stuff on trunk. I still had to add require "buildr/scala" to the buildfile. As much as I prefer Buildr/Ivy for bootstrapping a project over Maven2, I wonder about needing a cross-language dependency chain (or gems) like this for doing Java/JVM stuff (such as having to install easy_install to get a gem set). Having never used it in a production/industrial setting, it's hard to say. Otherwise, I do like Buildr.

Naked CSS Day

It's Naked CSS Day, at least for the web pages here that are not HTML on the filesystem. Part of me thinks this matters less and less each year, for me at least, since I consume most weblog information through feedreaders.

A reasoned response to Scala/Ruby at Twitter...

Alex Payne: "Make things, measure them, have reasonable and respectful conversations about them, improve them, and teach others how to do the same." - Mending The Bitter Absence of Reasoned Technical Discussion.

As far as the current Ruby/Scala "debate" goes - I would say always bet on protocols and formats, the web being the prime example. As someone who likes Twitter immensely, I like that I don't have to care too much what Twitter is written in or what it runs on. I like that behind the server, the entire stack can be swapped out or rewritten from the ground up as the service owner sees fit, as seems to happen with many popular Web services as they grow. That the Twitter API can persist across such internal upgrades is a wonderful thing. This is possible because on the Web, programming languages are an implementation detail. That includes JavaScript/ActionScript code on demand.

The Format Of The Long Now

Mark: "HTML is not an output format. HTML is The Format. Not The Eternal Format, but damn if it isn’t The Format Of The Now."

If that doesn't jibe with you, follow the link and view source on the markup around those statements.

Related. Now, view source on that link. Savor the irony.

Feature Creep

Joe: "The ultimate destination of programming language evolution is lisp-without-parentheses"

...with optionally typed function arguments.

Backwards compatibility is commitment

Marc Andreessen: "That's a big deal, that's a very big deal. It's a very serious commitment for a company. Apple's had this commitment, Microsoft's had this commitment. It's what's called a commitment to backwards compatibility. So you have to commit to never break anything. So you load up Windows Vista and it runs the original Visicalc from thirty years ago, which was the original killer app on the PC, the original spreadsheet. So that is a long term institutional commitment, that takes a very serious company to be able to do."

Partial update is the problem?

Mike Amundsen: "...once you start introducing partial updates, you open yourself for caching problems. doing partial updates means all cached copies of the original resource are now invalid. "

"Just" use POST

Tim Bray: "But maybe Joe needs a bigger club, because I have to admit that limiting myself to GET and POST just doesn’t cause me that much heartburn."

I get asked a lot about PUT v POST, as do other people associated with REST based design. The question comes up online frequently as well (eg it's a regular topic on the rest-discuss and atompub lists). Usually it's in the context of updates via forms posting or how to change just a few data fields. "How do I change the title of an entry?" is a very common and valid use case. Forms posting is easy to code to and highly portable - almost all deployed client and server libraries support (and are often optimised for) forms posting.

The pro-REST answer is to use PUT. PUT means update the resource with this entity, which tends to mean "overwrite". Let's think for a moment about how that works for things like tags in a blog post - if I leave the tag out, am I saying remove it or ignore it? On the server side, a PUT to a resource involving embedded lists (eg tags in Atom/RSS entries) tends to result in ugly code when either the backing system is an RDBMS or the representation is any "joined" structure in the persistence layer - they'll have to diff what's persisted against what's sent, which for 99% of people means a "select for update" pattern (a double for loop cross-referencing the posted tag list with the database tags is a sure sign you've hit this problem). Yes, you can store the entity straight to disk or use a non-relational architecture - but now you have N indexing problems, something a relational database "just" solves for the 99.9% of developers who don't have a megadata problem.
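To make the diffing step concrete, here is a minimal sketch in Python of what the server ends up doing if it reads PUT as "overwrite". The function name and the sample tags are made up for illustration; a real system would run this against rows in a database rather than in-memory lists.

```python
def diff_tags(persisted, posted):
    """Return (to_add, to_remove) needed to make the stored tags match the PUT body.

    Assumes PUT means "overwrite": a tag missing from the posted list
    is treated as a request to remove it.
    """
    persisted_set = set(persisted)
    posted_set = set(posted)
    to_add = sorted(posted_set - persisted_set)      # new tags to insert
    to_remove = sorted(persisted_set - posted_set)   # stored tags to delete
    return to_add, to_remove


# Hypothetical data: what the database holds versus what the client sent.
stored = ["rest", "atom", "http"]
sent = ["rest", "atompub"]

to_add, to_remove = diff_tags(stored, sent)
print("insert:", to_add)
print("delete:", to_remove)
```

Using sets avoids the double for loop, but the ambiguity is still there: the code has to decide that "left out" means "remove", which is exactly the question the client may not have meant to answer.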

So PUT often feels wrong or contorted to developers who literally want to mod a couple of fields. Hence PUT is much less popular in the wild than forms posting (all aside from the fact that PUT is excluded from HTML4 forms).  In other words, people tend to see PUT as a heavyweight, sucking, POST. In turn they "just" use POST+forms.

Are we done? Unfortunately, no.

When does PUT v POST actually *matter*? It matters, as far as I can tell, when your resource stands for a collection, which is very common - folders, albums, feeds, collections, blogs, tagclouds, orders, a shopping cart - any list based structure.

Let's take AtomPub as an example - to add something to a collection using AtomPub, you use POST:

POST /collection
Host: example.org
Content-Type: image/png

...binary...



Easy, and you can update that uploaded object later via PUT. Updates to the collections themselves are undefined in AtomPub. But let's ask, how would we do that? We could PUT the Atom feed (sans the contained Entries) back to the collection URI. So imagine we want to change only the title - isn't an entire PUT of an Atom feed (sans the contained Entries) verbose, inefficient and stupid for that simple use case? We could "just" use a form post instead:

POST /collection
Host: example.org
Content-Type: application/x-www-form-urlencoded

&title=foo


Ahh. Boom. Updating the collection in this way uses the same verb as adding to the collection. How to tell the difference in client intent? The answer here, for most people, will be to use the fact that forms posting has a specific media type - so the media type "qualifies" the operation. This definitely isn't REST style, as the verb is no longer uniform; at the same time it's not an abstract concern - there'll be a big switch in the code somewhere that looks for the media type - exactly the kind of thing good programmers hate. Let's remember that AtomPub servers aren't limited to blog posting - they can accept any media type they declare support for, and thus can act as generic upload systems (if you have a stable network, more on that another time).
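For illustration, the "big switch" looks roughly like this in Python. This is a sketch, not AtomPub-defined behaviour; the function name and the two intent labels are invented here.

```python
FORM_TYPE = "application/x-www-form-urlencoded"


def classify_post(content_type):
    """Guess client intent for a POST to a collection URI from its media type.

    Hypothetical rule: a form post means "update the collection itself",
    anything else means "add a new member to the collection".
    """
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == FORM_TYPE:
        return "update-collection"
    return "add-member"


print(classify_post("image/png"))
print(classify_post("application/x-www-form-urlencoded; charset=utf-8"))
```

The verb alone no longer tells you what the request does; the branch on media type carries the meaning, and every new kind of operation adds another case to it.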

One workaround could be that if the client sent a corresponding "ID", like this:


POST /collection
Host: example.org
Content-Type: application/x-www-form-urlencoded

&title=foo&id=http%3A%2F%2Fexample.org%2Fid%2Fefgfeacbe

the server could detect that the ID is present. It feels funky though, aside from having to map the fields/keys in your precious snowflake format into forms parameters.
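A sketch of that detection using Python's standard library, with the form body from the request above. The function name and the convention that "id" is reserved are assumptions for the example, not anything AtomPub specifies.

```python
from urllib.parse import parse_qs


def parse_form_update(body):
    """Split a form-encoded body into (target_id, changed_fields).

    Assumes a hypothetical convention where an "id" field names the thing
    being updated and every other field is a value to change.
    """
    fields = parse_qs(body)  # percent-decodes values; skips the empty leading pair
    target_id = fields.get("id", [None])[0]
    changes = {name: values[0] for name, values in fields.items() if name != "id"}
    return target_id, changes


body = "&title=foo&id=http%3A%2F%2Fexample.org%2Fid%2Fefgfeacbe"
target_id, changes = parse_form_update(body)
print(target_id)
print(changes)
```

If target_id comes back as None the server would treat the post as something other than an update, which is the funky part: intent is inferred from the presence of a field.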

Speaking as a member of the IETF WG, perhaps we shouldn't have skipped collection updates in AtomPub as it would have made the overall constraint clearer - POST can't be used in the general case for updates to collections, ergo PUT is the only uniform approach to updating their content. On the other hand lots and lots and lots of people don't, won't (and sometimes can't) care about REST/HTTP/AtomPub arcana. So some part of me thinks we need patterns and practices to help developers jfdi.

Fwiw, like Tim, I can live with the forms POST option, to either update a collection or perform a partial write. But think about it for a bit - switch on type is a fairly ugly workaround. Not quite RPC, but problematic. Blog entries in turn are often collections (containing media), as are the folders you find in WebDAV and so on - it's not a problem specific to AtomPub.

So when you ask a pro-REST person about why not "just use forms" for partial updates instead of having to write out the entire data to send to the server via PUT, and they go "uhm, uhm,...", this is the kind of design kludge they're thinking about. Maybe you could PUT a form as a workaround for partials - I think that could work better than POST or having special "edit" URIs for anything collection-like. But as far as it goes, I'm not sure we in the pro-REST community have a good general answer or design pattern for partially updating a resource. Until we do, I predict people will tend to drop down to using forms posting as it's the easiest and most portable approach for deployed client libraries and web frameworks. That or define some other specialised media type for partial updates.

Containerization

Dan Diephouse on Deployment : "I'm continually amazed at how hard of a problem deployment actually is. If you're going to be deploying any reasonably sized application you have an endless list of things to worry about:

    * Taking the cluster up and down so there is no downtime
    * Managing the configuration of individual nodes
    * Operating system setup
    * Installation of required libraries/3rd party tools
    * Managing dev, QA, staging and production deployments
    * Schema migration/database updates
    * How to do rollbacks

[...]

There are a few other interesting tools out there."

I think one reason that there are only a few tools for deployment is that it's a general end to end problem, technically and organisationally. When you understand the enormity and complexity of bringing up even middling size systems, never mind big ones where components are constantly failing, it can be an overwhelming thing to bite off. Very possibly it means altering existing build systems, or even how the organisation itself is arranged (since deployment cross cuts standard boundaries such as development, qa and operations). Which could seem like ocean-boiling.

Tools like Puppet and SmartFrog take the problem space head-on and look to be general purpose solutions, so I agree with Dan's pointers to them. As an example Dan links to Steve Loughran's deck on deploying a Hadoop cluster. They're well ahead of other FOSS tools that I know of and it's remarkable how few people know they exist. But to use those means skilling up and investing in their configuration language, which might seem arduous. Xen images, version control, language and distro packing all add more flavour to the mix - is your deployment unit a tarball, a warfile, a gem, a deb, an image, a git checkout? All of them? Knowing what the container unit is matters.



Hence you see people starting out with point solutions and dealing either with problem subsets or specific pain points (code rollout but not configuration or health checks), app/framework specific tools like capistrano (deprecated afaik, thanks for the correction Bob), tail-ending sftp/tomcat tasks onto your Antfile, or in-house shell scripts. None of these scale up as systems get bigger or more layered.

If you won't adopt an external framework, probably the most important thing to do is get past shell scripts to a declarative configuration language so deployment configurations can be managed in their own right. Getting the data structures and component models that represent the state of your running system right is very important (both puppet and smartfrog have ways to describe and compose systems). Otherwise you're going to be rewriting those scripts forever. This will make your shell scripts more like command line tools than one-offs.
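As a toy illustration of "declarative" here, consider the Python sketch below. It is not Puppet or SmartFrog syntax; the node names, fields and action labels are all invented. The point is only that the desired state is data, and the steps are derived from it rather than hand-written per release.

```python
# Desired state of the system, declared as plain data (hypothetical nodes).
desired = {
    "web1": {"role": "web", "package": "myapp-1.2"},
    "web2": {"role": "web", "package": "myapp-1.2"},
    "db1": {"role": "db", "package": "postgres-8.3"},
}

# What is actually running right now (also hypothetical).
running = {
    "web1": {"role": "web", "package": "myapp-1.1"},
    "db1": {"role": "db", "package": "postgres-8.3"},
    "old1": {"role": "web", "package": "myapp-1.0"},
}


def plan(desired, running):
    """Derive the list of actions that moves `running` to `desired`."""
    actions = []
    for node in sorted(desired):
        if node not in running:
            actions.append(("provision", node))
        elif running[node] != desired[node]:
            actions.append(("update", node))
    for node in sorted(running):
        if node not in desired:
            actions.append(("retire", node))
    return actions


for action, node in plan(desired, running):
    print(action, node)
```

Each action can then be handed to a small command line tool, which is where the existing shell scripts end up living.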

"I hope that we start to see more core infrastructure managed by the infamous cloud people. Just write your app, upload, and tell it where to deploy. Then we can focus on building applications, which is what we really want to do anyway"

This offloading reminds me of the early promise of J2EE containers, but it turned into a vendor specific hell. I'd hope the hosted world can do better :) In any case, while good tools matter, deployment automation is as much about improved process quality.

Format mappings and transitivity

Dare Obasanjo has responded to my post Format Debt: what you can't say by asking "Can RDF really save us from data format proliferation?". Quoting him, quoting me*:

"Bill de hÓra has a blog post entitled Format Debt: what you can't say where he writes

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

I've always found this particular argument by RDF proponents to be suspect. When I complained about the lack of standards for representing rich media in Atom feeds, the thrust of the complaint is that you can't just plug in a feed from Picassa into a service that understands how to process feeds from Zooomr without making changes to the service or the input feed."

Being a proponent is relative. I'm not sure I'm considered an RDF proponent in the RDF community, having been critical in the past ;) But generally, I can't agree with the argument. Under the hood, it's just mapping and there's no magic here - technically the language (RDF in this case, there are others) will either be able to express the mappings or it won't. For example, RDF can't map Celsius to Fahrenheit, but I know it can map foo:title to atom:title.

"The issue I'm pointing out is that either way a developer has to create a mapping."

Right; the questions really are how many mappings, where they are declared and to what extent you can stand over them as being sound. We've been doing this in code for years for syndication formats by mapping them into internal object models - every library then having its own mappings that might or might not be consistent. Dare mentioned MediaRSS and without an external configuration for extension formats, we'll have to do for MediaRSS as it appears in the wild today what we do for the 9+ RSS/Atom formats that are out there. The double whammy of the Format Debt is that MediaRSS appears to need mapping to itself in Dare's examples, because parsing the syntax can result in different dict/tree data structures.

"The problem with this argument is that there is a declarative approach to mapping between XML data formats without having to boil the ocean by convincing everyone to switch to RDF; XSL Transformations (XSLT)."

Not quite the same thing (I'll explain why in a minute). XSLT is actually computationally more powerful than RDF - afaict XSLT could do the Celsius to Fahrenheit mapping. It can do a knight's tour.

"In my experience I've seen that creating a software system where you can drop in an XSLT, OWL or other declarative mapping document to deal with new data formats is cheaper and likely to be less error prone than having to alter parsing code written in C#, Python, Ruby or whatever. However we don't need RDF or other Semantic Web technologies to build such solution today. XSLT works just fine as a tool for solving exactly that problem. "

But XSLT is code. All we're saying by this is that XSLT code is cheaper and less likely to be error prone than Python et al. Which I can buy - an XSLT sheet done well can be an executable specification. All an RDF (or "interlingua") proponent will say is that RDF can be even cheaper and less error prone, and much of the reason not to adopt it is down to developer preferences, lack of familiarity, tooling and so on - i.e., much the same reason developers don't adopt XSLT, summarising the issue as "XSLT sucking".

Finally, I think you can easily argue that RDF/OWL gives more leverage for this kind of problem than XSLT, even though RDF is computationally less powerful, because it allows you to state relationships using formal semantics. For example if I write down that:

atom:title owl:sameAs foo:title

foo:title owl:sameAs bar:title

I can infer

bar:title owl:sameAs atom:title

without writing a line of code and I can use that on seeing new data. The predicate "owl:sameAs" is what the formalists call transitive (and symmetric, which is why the inferred statement can be written in either direction), and this reasoning at a distance is the kind of thing RDF proponents are on about when they talk about "semantic webs". OWL in particular has a boatload of such predicates; sameAs is probably the best known.
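To show what that inference amounts to mechanically, here is a small Python sketch that treats sameAs statements as pairs and groups names into equivalence classes with a union-find. It is a toy, not an OWL reasoner, and it only handles this one predicate; the function names are invented.

```python
def same_as_groups(statements):
    """Group names connected by sameAs pairs (symmetric, transitive)."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    for left, right in statements:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = {}
    for name in parent:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


def entails_same_as(statements, a, b):
    """True if `a sameAs b` follows from the given statements."""
    return any(a in group and b in group for group in same_as_groups(statements))


stated = [("atom:title", "foo:title"), ("foo:title", "bar:title")]
print(entails_same_as(stated, "bar:title", "atom:title"))
```

Adding a new statement such as ("bar:title", "baz:title") extends every earlier mapping without touching any code, which is the leverage being claimed for the declarative approach.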

That kind of inference is not a remotely straightforward thing to do in XSLT. Rather than emulate Greenspun's 10th Rule by writing a half-baked, incomplete, buggy predicate reasoner in XSLT, you'll end up writing multiple XSLT sheets instead, and possibly trying to chain them together. This is the real problem with using XSLT in anger for this kind of work - it doesn't scale as the number of elements to map grows. In that scenario, people fall back to regular programming languages where you can use data structures like dicts and lists to manage the element names and their associations. That's why things like the feedparser don't (and won't) tend to get written in XSLT, and it's why the mappings will have to stay as private details of implementations for now.
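The dict-based fallback looks something like the following Python sketch. The alias table and field names are made up for the example (this is not how feedparser is implemented); the mapping lives in ordinary data inside the program, which is precisely why it stays private to each library.

```python
# Hypothetical alias table: source element name -> internal field name.
ALIASES = {
    "atom:title": "title",
    "foo:title": "title",
    "bar:title": "title",
    "atom:updated": "updated",
    "foo:modified": "updated",
}


def normalise(entry):
    """Map a parsed entry's element names onto the internal model.

    Unknown names are dropped; if two aliases collide, the first one seen wins.
    """
    result = {}
    for name, value in entry.items():
        canonical = ALIASES.get(name)
        if canonical is not None:
            result.setdefault(canonical, value)
    return result


parsed = {"foo:title": "Hello", "foo:modified": "2009-04-11", "x:unknown": 1}
print(normalise(parsed))
```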


* on reflection, I blame Abba Singstar for that particular turn of phrase.

Format Debt: what you can't say

Aristotle: "In passing, though, I have to note that it would be nice if we could do a better job of what media types tried to do with their type/subtype separation, ie. have a standardised way to specify a layering of specificity of formats, including multiple formats, so that it would be possible to say that a document is text, and specifically HTML, and specifically a combination of hCard+hTag+ hEXIF+image-link, and specifically a Flickr photo, so as to allow clients to know what the representation means without having to parse it, at whatever their level of understanding of the specified format.

I don't know if this would work in practice, after all the type/subtype thing in media types is mostly a failure. Maybe that was just because it tried to constrain types to just two layers. It would also be necessary to do a better job of what media types tried to accommodate with the '+xml' suffix contortion, ie. make sure that types reliant on possibly multiple lower-level formats are expressible in a sensible fashion."

There are diminishing returns on patching around media types and formats. This suggests doing a better job becomes increasingly harder. Let's call this "Format Debt". I think the media types construct is entirely inadequate for expressing mashed up formats in the way Aristotle wants and we will be limited to patching around it - the media type is deeply embedded into web architecture. I take a polarised position on this, because I think it's less important to be right than to push the debate along.

The syntax first, and liberally, approach is good for adoption but has limits, such as inconsistent placement (eg with MediaRSS in feeds), field duplication (eg Activity Streams in Atom) and structural hacks (eg RDFa's Qnames in content), weakly-defined qualifiers (html/atom rel, HTML5 data-*). Or parsing at all costs.

We say we want layered formats, because that's what the combination of IETF IDs, W3C Recommendations and deployed browsers and servers allow us to say. It's the Web version of the Blub paradox. What we want is layered data. What we want is not just to qualify a media type, but to describe the ingredients in the entity whose "shell" is the media type.

I think the argument that identifying and extracting mashed data from entities should happen at a higher layer than transfer is a good one. But an interim approach for dealing with Aristotle's wish might be media type extensions to well known formats that flag that mashed up data is contained within. These types won't be so specific as to say what exactly is contained (this is neapolitan, this is raspberry ripple), but it's enough information for a code switch.

Such an interim approach won't scale well - for example trying to articulate the specific media type for a HTML document containing RDFa with a slew of vocabularies and divs with a slew of microformats is not viable. There is higher order data the way there is higher order programming and this is too difficult to capture in general with media type declarations. Roughly - microformats are to HTML as closures are to functions, and RDFa is to microformats as a macro is to closures. Another limitation is that people doing basic publishing are not going to be speccing the served media type - most people don't know what a media type is. The frameworks will need to support that kind of specificity, which means the editing tools need to signal to the server that what's being published is mashed up.

A better interim approach might be to "just use" a new HTTP header.

Another approach might be to ignore the syntactic structure altogether post-parse for extensions in code APIs. When the chunk of syndicated XML or rdfa/microformatted HTML is turned into code, that code can have a method that returns the list of the found extensions as data structures instead of asking developers to hit and miss through the code. The found extensions can in turn be iterated over. I've written code like this against Atom that allowed you to get all the links matching a rel value argument without caring about their placement. The HTML5 DOM does something similar for data-* attributes and it seems doable for syndication/html extensions in general in other libraries. The Universal Feed Parser and Beautiful Soup indicate syntax reality is messy but can be dealt with.
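Here is a rough Python sketch of that idea using the standard library's ElementTree: walk the whole parsed tree and return every element that is not in the Atom namespace, wherever it was placed. The function name, the returned dict shape and the sample feed are all invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed: the same MediaRSS element placed two different ways.
SAMPLE = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/">
  <title>Photos</title>
  <entry>
    <title>One</title>
    <media:group>
      <media:thumbnail url="http://example.org/t1.png"/>
    </media:group>
  </entry>
  <entry>
    <title>Two</title>
    <media:thumbnail url="http://example.org/t2.png"/>
  </entry>
</feed>
""".strip()


def find_extensions(xml_text, core_ns=ATOM_NS):
    """Return every element outside `core_ns` as a plain dict, in document order."""
    found = []
    for elem in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as "{namespace}local".
        if elem.tag.startswith("{"):
            namespace, local = elem.tag[1:].split("}", 1)
        else:
            namespace, local = "", elem.tag
        if namespace != core_ns:
            found.append({"namespace": namespace, "name": local,
                          "attrs": dict(elem.attrib)})
    return found


for ext in find_extensions(SAMPLE):
    print(ext["name"], ext["attrs"])
```

Both thumbnails come back in the same flat list even though one sits inside a media:group and the other directly under the entry, so the caller can iterate without knowing where a given publisher chose to put them.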

Colophon: RDF

The closest thing to a deployable web technology that might improve describing these kind of data mashups without parsing at any cost or patching is RDF. Once RDF is parsed it becomes a well defined graph structure - albeit not a structure most web programmers will be used to, it is however the same structure regardless of the source syntax or the code and the graph structure is closed under all allowed operations.

If we take the example of MediaRSS, which is not consistently used or placed in syndication and API formats, that class of problem more or less evaporates via RDF. Likewise if we take the current Zoo of contact formats and our seeming inability to commit to one, RDF/OWL can enable a declarative mapping between them. Mapping can reduce the number of man years it takes to define a "standard" format by not having to bother unifying "standards" or getting away with a few thousand less test cases.

RDF has after a decade seen limited deployment, developers and publishers preferring instead to incrementally patch syntax. Atom has XML extensions and rel attributes. HTML has RDFa and microformats. A few years ago RDF tended to be heavily criticised by syntax proponents. You are free to search through the xml-dev and rest-discuss archives, or search for "RDF Tax" or "RDF syndication war" to see what I mean. I hope a year or two out, people will be less dismissive and at least willing to learn from RDF as the nuisance factor of format and media type limits increases.

Snowflake APIs

Speculation: for Data APIs in 2009 there will be two developments and one debate. All are centered around an important technical principle in web design - uniform interfaces (an idea that goes back quite a bit in distributed systems).

You are not a beautiful or unique snowflake.

First idea: putting links into API data. The REST community call this 'Hypermedia as the engine of application state' or HATEOAS. Yes. Worst. Abbreviation. Ever*. I tend to call it "links in content". Nonetheless the idea is simple - put links in your format data. Heavy Atom and HTML users do this already, almost subconsciously, but a lot of (most) proprietary data APIs fall down here and in doing so miss out on a number of things. First are the network effects of being able to pass along URLs. The very essence of a "web" is linking, almost to the point where operationally a "good" webapp is one that uses plenty of links. Second is the decoupling of clients from your servers - if you describe in your format both where links can be found and what their "type" is via a metadata qualifier, you are free to refactor on the server side, relocate the server, introduce a CDN, whatever. For example, Atom qualifies links using the "rel" and "type" attributes in a way that will work for every web site on the planet. Clients extracting links out of the data and constructing an absolute minimum of URLs are loosely coupled to server structures - URL parsing and generation being an important coupling point in API design. Third is simplification of client code - just pull out the links, render them in the UI or call them using HTTP methods. You don't even need to design the link elements - steal Atom's link element or lace your current XML with "src" attributes that contain URLs.
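As a minimal sketch of the client side, in Python with only the standard library: pull links out of an Atom document by their "rel" value and use the URL the server supplied, rather than building one from knowledge of the server's layout. The function name, the sample entry and the base URL are made up; the default of "alternate" for a missing rel follows the Atom format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical entry as served from http://example.org/feed
ENTRY = """
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Hello</title>
  <link href="http://example.org/2009/hello" type="text/html"/>
  <link rel="edit" href="/entries/1"/>
</entry>
""".strip()


def links_by_rel(xml_text, rel, base_url=""):
    """Return absolute URLs of all Atom links with the given rel, anywhere in the document."""
    hrefs = []
    for link in ET.fromstring(xml_text).iter(ATOM + "link"):
        href = link.get("href")
        # In Atom, a link without a rel attribute is treated as rel="alternate".
        if href and link.get("rel", "alternate") == rel:
            hrefs.append(urljoin(base_url, href))
    return hrefs


print(links_by_rel(ENTRY, "edit", base_url="http://example.org/feed"))
print(links_by_rel(ENTRY, "alternate"))
```

If the server later moves editing to another host or path, only the href in the data changes; this client code stays the same.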

For more detail on how links in content can work I recommend reading Mark Baker's "Hypermedia in RESTful applications" and Subbu Allamaraju's  "Describing RESTful Applications", both on Infoq.

Second idea: "standardisation" of feed metadata. Recently Dare Obasanjo and Dave Winer have blogged about inconsistencies in MediaRSS that complicate data consumption. Dave Winer on adding a photo site into friendfeed:

"I always assumed you should just add the feed under "Blog" but then your readers will start asking why your pictures don't do all the neat things that happen automatically with Flickr, Picasa, SmugMug or Zooomr sites. I have such a site, and I don't want them to do anything special for it, I just want to tell FF that it's a photo site and have all the cool special goodies they have for Flickr kick in automatically."

Dare Obasanjo:

"We have a similar problem when importing arbitrary RSS/Atom feeds onto a user's profile in Windows Live. For now, we treat each imported RSS feed as a blog entry and assume it has a title and a body that can be used as a summary. This breaks down if you are someone like Kevin Radcliffe who would like to import his Picasa Web albums. At this point we run smack-dab into the fact that there aren't actually consistent standards around how to represent photo albums from photo sharing sites in Atom/RSS feeds."

I went a few rounds last year with MediaRSS and have to agree I'd much rather have something as well specced as Atom is for syndication, or RFC5005 is for pagination. And it's not limited to media. Same goes for Geo Data (main criteria being, does it support WGS84?), contacts, activity/events, representing arbitrary site metadata, even Exif. Making things up or having to choose between competing formats is a real pain. There are two problems - where to place the data, because all the popular formats have sufficiently arbitrary structure that something like MediaRSS can appear in multiple places (the difference between Picasa and Zooomr as Dare outlined in his post) and how to notate it (the difference between MediaRSS and SmugMug, again as Dare outlined).

If you are a semwebber around long enough to remember the syndication wars, you will be having a good old chortle as this problem is arguably solved better by RDF/XML than any syndication or markup format. It's an interesting turnaround, since one of the arguments against RDF adoption for syndication back then was that clients and servers had common internal object models for syndication data and thus a formal model on the wire didn't matter that much - the parse/lex layers could switch. Extension metadata, it seems, is a bit different - variability has a cost. Whether you agree or not re RDF, if this impacts you, having a look at how RDF or even RSS1.0 modules get described and how they are supposed to be parsed into a data structure is no harm at all.

RDF is worth learning for a different reason — the profound enlightenment experience you will have when you finally get it. That experience will make you a better format and data API designer for the rest of your days, even if you never actually use RDF itself a lot. (You can get some beginning experience with RDF fairly easily by writing and modifying simple files like FOAF and DOAP for social networks and software projects, or RDFa extensions for XHTML.)

The debate: should there be that many custom formats? Via Kevin Marks and Aristotle Pagaltzis, I came across the "precious snowflake" analogy for APIs which to me describes the situation perfectly both across hundreds of websites and within content domains such as geo/contacts/media. Here's Aristotle:

" There are a lot of good existing choices once you get over the idea that your domain is a unique and precious snowflake."

There are probably hundreds of publicly available APIs today, all different, each their own "SiteML", and you have to be able to mash them all. The big but smart companies, such as GOOG and MSFT, that have application suites and not just individual web silos have adopted common syntax, posting and extension models that allow for consistency and evolvability over time - individual API offerings might seem suboptimal and indirect, even obtuse, but the overall product portfolio makes a ton of sense - as well as lowering consumer costs it allows them to ship client APIs with less hassle. This is basic platform and product architecture - reduced variability at one layer allows for increased offerings with lower costs at higher layers. Standalone web properties just don't do this today; each individual API is like a precious snowflake, but being in the snowball business is expensive, and so is keeping that snowflake preserved (when you designed that API did you think about encoding, escaping, empty v not-present, namespaces, timestamps, bidi, versioning, extensions, content-negotiation, cacheability, required v optional, new formats, input sanitation? Didn't think so ;). This creates a new market for web integration providers such as Friendfeed and Gnip ("making data portability suck less") or silo publishing providers such as Mashery. We call them aggregators in the web consumer space but when you get to scores of providers it effectively requires the "EAIfication" of mashups, or if you prefer, the introduction of Value Added Networks (VANs) for consumer data. Others like eBay and sf.com seem to have become subject to X.Y.Z versioning issues which are a maintenance nightmare* (I find these tend to be associated with SOAP style processing models - YMMV). So how API families like DISO and OpenSocial, or specific formats like Portable Contacts, Activity Streams and Atom Media Extensions develop will be important this year.
That or we start taking microformats and RSS/Atom/JSON extensibility a lot more seriously than we do today, or the number of APIs will soon be in their thousands.


 
*  X.Y.Z for software binary compatibility, sure, but X.Y.Z in data formats is arguably missing the entire concept of web data APIs - when clients are out of your administrative control, lockstepped upgrades are, practically speaking, impossible.

Managing large stories on agile projects

Obie Fernandez: "Where have you experienced limitations of a story-driven process and how did you deal with it?"

Insightful post from Obie on two fronts. First the common functional themes in Web 2.0 apps:

"Hashrocket does a lot of Web 2.0 apps, and most of them have a similar assortment of basic functionality and social networking features: user accounts, event notification, photo uploading, commenting, etc. Naturally, there is a desire to take our sum knowledge of building these types of features and codify it in the form of a base application and suite of plugins."

Very true - typically these Web 2.0 features revolve around socialising and sharing rather than content (someday I'll post on why "websites" are an anachronism). Second was the issue of breaking down big ticket items into measurable stories, which is what I wanted to highlight. Here's Obie's card:

"Story: Adding Photos

In order to provide photo collections for my client,
As a developer,
I want to be able to add photos to a domain model

Acceptance:

    * Installation instructions
    * Declarative macro makes model act like a photo collection
    * Generator-created: Photos controller that includes photo resource module
    * Generator-created: Views (new, form partial, javascript, FancyUploads)
    * Generator-created: migration
    * API documentation"

and here's the issue:

"One of my guys called it a "12-pointer" to denote how much it needed to be broken down in order to fit in with our typical process (limited to 1, 2 and 4 point stories).

The problem is that as the stakeholder, I don't want this story broken down! This is the level at which I want to do acceptance, for at least a couple of reasons: 1) I already went through a long acceptance process for many of these stories the first time they were implemented, in the language of the end-user. 2) In order to properly define acceptance criteria for this story, I would have to know exactly how it's going to be implemented, and that information is not available until they actually sit down to figure out how to do the extraction into gems and plugins. In fact, one of the reasons that the storycarding session was feeling so painful was exactly that we were spending so much time arguing about implementation -- out of place in a typical storycarding process where you leave implementation details for later."

I've seen this granularity problem on every project, product or program I've worked on. Often in non-agile methods it comes up in the form of traceability requirements on top of the actual requirements. As someone who has the word architect in their job description, I come across a similar problem with systems, operations and non-functional stories that have less to do with the features and more to do with the features being used by a lot of people (I know, you can argue they're not really stories in the pure XP sense, but good luck explaining *after* the iteration why the service failed ;). I find that business owners often want to think about the software at this level. Having them decompose the work into stories that are meaningful to developers, but which then need to be reconstituted, isn't always the right thing; we need different views of the work depending on context. Interestingly (to me) story cards are not required by either XP or Agile, but we tend to fetishize index cards (iirc Extreme Programming 2nd ed mentions index cards just once, and as an option).

Mike Cohn has a nice decomposition for this, a notion of epics, themes and stories from Agile Estimating and Planning:

"Although in general, we want to estimate user stories whose sizes are within one
order of magnitude, this cannot always be the case. If we are to estimate everything within one order of magnitude, it would mean writing all stories at a fairly fine-grained level. For features that we’re not sure we want (a preliminary cost estimate is desired before too much investment is put into them) or for features that may not happen in the near future, it is often desirable to write one much larger user story. A large user story is sometimes called an epic.

Additionally, a set of related user stories may be combined (usually by a paper clip if working with note cards) and treated as a single entity for either estimating or release planning. Such a set of user stories is referred to as a theme. An epic, by its very size alone, is often a theme on its own.

By aggregating some stories into themes and writing some stories as epics, a team is able to reduce the effort they’ll spend on estimating. However, it’s important that they realize that estimates of themes and epics will be more uncertain than estimates of the more specific, smaller user stories.

User stories that will be worked on in the near future (the next few iterations) need to be small enough that they can be completed in a single iteration"



In that light I would call Obie's card a "theme". Alternatively your shop might call it a feature or capability. It rightly should be broken into measurable stories, but equally, for the owners it's important the theme is not lost or splattered across a set of tickets, cards or ATs so it can't be easily tracked. It should remain a whole thing to the owners. So the idea is simple - treat the software as an Epic (part of the infinite game) and Themes are large and meaningful things the software should do, that can be broken down into Stories, chunks of work that the team can reason about. 


That still leaves two issues; the tracking mechanism and acceptance.

On the first, the tracking mechanism, I like how some open source projects manage these big ticket items simply as tickets. Here are two examples I know of from Hadoop:

HADOOP-2510: Map-Reduce 2.0, re-work Hadoop Map-Reduce to make it suitable for a large, static cluster.

HADOOP-3719: contribution of Chukwa by Yahoo!, a data collection and analysis framework.

There's a lot of separately scheduled work implied by those, but there is one place around which the work can be discussed, scheduled and closed (Hadoop uses Jira, which has features to relate tickets, but they're incidental here - being able to have links to tickets is way more important for the document of record). In no way are they the kind of things you put on an index card and blutack to the wall.

On the second, acceptance, this can be tricky and probably has more business impact than how to notate large stories. A common problem in Agile projects is having a theme-scale card open across iterations. This has a number of negative knock-on effects - the worst is probably the reintroduction of the 80% done antipattern that agile all but eliminates. It can also impact the scheduling and organisation of upstream acceptance work such as systems or beta testing, even marketing plans. It can allow for subtle scope creep where "splitting" of the card is in fact new work.

How to represent work like this? One simple approach is to enhance your story or kanban board from a set of vertical swimlanes into a matrix where Themes are each given a horizontal slice on the board and each Theme has its own set of Stories. The entire board represents the Epic. This provides visibility on how software is progressing overall at a level above story cards, which is suitable for people who will care about different themes at different times, which features are bottlenecking and where. It will work well for integrations and visualising dependency issues day to day - there's nothing to say a theme can't be written from the point of view of another system or team - "As the Billing System, I would like ...". It also fits in with other non-team toolchains such as rollups required by management or program teams, mostly because it scales to a portfolio view by creating a master portfolio board where each aforementioned Epic is in turn given a horizontal slice. Another approach described by Mike Cohn is a treemap, but this requires more sophisticated tooling.

Design considerations for fine grained data access via the Web

Julian Hyde: "You would think that something called a 'feed' would push content is pushed to subscribers as soon as it arrives, but in fact RSS and the other feed types in the prototype use a pull protocol. With a pull protocol, the subscriber needs to continually poll the feed to get the content (typically an XML document a few kilobytes long), parse the content, and figure out what, if anything, is new since the last time we polled.

This process soaks up a lot of network bandwidth and resources for both the provider and the subscriber, and the cost goes up the more regularly we poll. Typically the provider has to throttle the feed to prevent their servers from being overwhelmed. For example, Twitter updates its feed only once per minute and limits the number of tweets on the page. At times of high volume, only a small percentage of tweets make it into the feed.

This may not sound that serious if the content is a twitter conversation between friends, or a blog with one or two posts a week. But web feed protocols are becoming part of the IT infrastructure, and business users require lower latency, higher throughput and higher availability. (The existence of services like Gnip is evidence of the need to control the web content chaos.)"

I would like to know how to scale this so that the origin server does not melt down under query load. Let me explain, assuming the origin server is backed by a relational database.

Most people that want real time efficient feeds are concerned about bandwidth overhead or the apparent technical stupidity of polling the same data over and over. They would just like what has changed since the last time they asked. It's clearly more efficient and better. Let's call this a "bespoke" feed model.

What tends to get forgotten about with bespoke feeds is that each client request forces a subselect on the database. This model is not likely to scale nearly as well on the server as resending redundant information and letting the client sort it out locally, however dumb that approach might seem. The Atom format for example is designed so that the client can sort it out locally by virtue of the atom:id and atom:updated values.
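As a rough illustration of "letting the client sort it out locally", here is a minimal Python sketch of a client that re-fetches the whole feed on every poll and uses atom:id and atom:updated to work out what is new. The function names and the `seen` dictionary are my own for illustration, not part of any feed library, and the sketch assumes every entry carries both elements.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_updated(text):
    # Atom timestamps are RFC 3339 and may end in "Z"; normalise that
    # so datetime.fromisoformat accepts it on older Python versions.
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def new_or_changed(feed_xml, seen):
    """Return the atom:id of every entry that is new or updated since the last poll.

    `seen` maps atom:id -> last atom:updated value and is mutated in place.
    """
    changed = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = parse_updated(entry.findtext(ATOM + "updated"))
        if entry_id not in seen or updated > seen[entry_id]:
            seen[entry_id] = updated
            changed.append(entry_id)
    return changed
```

The server does no per-client work here; the cost of the redundancy is paid in bandwidth and a little client CPU.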

The alternative polling option people arrive at is to not support bespoke queries but to serve the same redundant data to all clients. Let's call this the "one size fits all" (osfa) feed model. It is the standard approach on the Web for scalable, high availability feed serving. The osfa approach "works" insofar as it assumes a lot of clients are accessing data and makes a tradeoff preferring bandwidth overhead to database load. This tradeoff makes a lot of sense as the number of clients goes up - anyone who builds database backed websites quickly learns to reduce the number of calls on the database, be it through query caches, L2 object caches, caching proxies, and so on. An osfa approach allows the data to be served off disk directly, making it a pure file serving problem, which is far easier to scale than hitting a relational database.
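A minimal sketch of the osfa idea in Python, under the assumption that a static web server serves the file at `path`: render the feed once and swap the file into place, so the database is touched once per publish rather than once per client request. `publish_snapshot` and `render_feed` are hypothetical names for illustration.

```python
import os
import tempfile


def publish_snapshot(render_feed, path):
    """Render the feed once and atomically replace the file the web server serves."""
    body = render_feed()  # the only point at which the database would be queried
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(body)
        # Rename within one directory: readers see the old file or the new one,
        # never a partially written feed.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return path
```

However many clients poll between publishes, the database load stays the same; what you give up is freshness, bounded by how often you publish.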

So, where does that leave us? Well I think if you must allow per client querying for a lot of clients, you need to be sure the server can handle the database load at scale. If you are really worried about bandwidth then compression is the first obvious thing to do. Another is caching, but that leads to data latency and if you are asking for "just" the changed data there is a chance you want that data "right now" as well (more on that in a minute). You might also think that sending down less data will be a win - but this really depends on your use case. Replacing one coarse grained fetch with 4 fine grained queries isn't necessarily going to lead to a better user experience or sane usage of the data server, though a client developer might find it convenient to not have to om nom through a larger dataset. If you are familiar in enterprise development with the .NET/JEE antipattern of data access that leads to the use of DTOs, well, fine grained feeds present similar issues.
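On the compression point, feeds are repetitive markup and tend to shrink a lot. A small Python sketch for checking this against your own feed (the helper name is mine, and the actual ratio depends entirely on your content):

```python
import gzip


def gzip_ratio(body: bytes) -> float:
    """Return compressed size divided by original size for a feed document."""
    return len(gzip.compress(body)) / len(body)
```

If the ratio is small for your osfa feed, a good part of the bandwidth argument for bespoke queries goes away before you have touched the database.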

Julian has a suggestion:

"I would like to see the emergence of a genuine 'push' protocol for web-based content. It doesn't have to be particularly complicated. To illustrate what I have in mind, here is an example of a simple, stateless protocol, built using XML over HTTP, like the current feed formats. A subscriber sends a request

<readRequest>
  <minimumRowtime>2008-12-04 18:00:46.000</minimumRowtime>
  <maximumCount>1000</maximumCount>
  <maximumWait>10s</maximumWait>
</readRequest>

over HTTP"

I would like to see such a thing as well. But.

"According to the protocol, the provider sends the results after 10 seconds, or when there are 1000 records to return, whichever occurs sooner. After it has received a result, the subscriber will typically ask for the next set of rows with a higher rowtime threshold.

Even though it is simple, the protocol ensures that data flows efficiently for feeds of all data rates. For a high volume feed, the 1000 record limit will be reached before the 10 second timeout, so latency naturally decreases. For a low volume feed, many requests may time out and return an empty result; but the 10 second wait limits the number of requests per minute that the server has to handle."

It is simple, but only by virtue of assuming the data server can handle the load of pushing out the data and managing subscription state; the protocol does nothing to manage that part of the architecture. Good client-server protocol designs (where good means scale to large numbers of both) try to avoid or mitigate these kinds of asymmetries.
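To make the server side of that concrete, here is a rough Python sketch of what a provider has to do for each readRequest. It is my reading of the quoted description, not Julian's implementation: the `Feed` class is an in-memory stand-in for the real row store, rowtimes are plain numbers, and I'm assuming "minimumRowtime" means strictly newer than the threshold.

```python
import threading
import time


class Feed:
    """In-memory stand-in for the provider's row store (hypothetical)."""

    def __init__(self):
        self._rows = []  # (rowtime, payload) tuples, appended in rowtime order
        self._cond = threading.Condition()

    def append(self, rowtime, payload):
        with self._cond:
            self._rows.append((rowtime, payload))
            self._cond.notify_all()  # wake every request that is currently held open

    def read(self, minimum_rowtime, maximum_count, maximum_wait):
        """Return once maximum_count newer rows exist or maximum_wait seconds pass."""
        deadline = time.monotonic() + maximum_wait
        with self._cond:
            while True:
                # The per-request "subselect": every waiting client repeats this scan.
                rows = [r for r in self._rows if r[0] > minimum_rowtime]
                remaining = deadline - time.monotonic()
                if len(rows) >= maximum_count or remaining <= 0:
                    return rows[:maximum_count]
                self._cond.wait(remaining)
```

Each call to `read` holds a thread (and in a real deployment, a connection) for up to `maximum_wait` seconds and re-runs its own filter on every wake-up, which is exactly the load the protocol leaves the server to absorb.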

Back to latency. Many web sites scale on the basis of the data being latent - even a few minutes can make a huge engineering and operational difference, especially as your application grows beyond a single cluster (or geographic location). IMO the mapreduce pattern scales not just on parallelisation but on the data latency the results are allowed to have (which is why it gets used a lot for log/warehouse analytics and post-hoc querying). So if you demand real time precision in the data, be aware that this can put stress on your server.

"Real time" requirements in turn might lead you towards a push model, but I think it's reasonable to say that we don't know how to do internet scale push yet, at least not without creating asymmetries - it's hard to have a lot of clients to send data to, and the problem gets harder as you add things in like filtering and long held connections by clients that will have you ripping out those loadbalancers.

For push, I think XEP-0060 is worth looking at, even though we (imo) still have work to do to learn how to manage mass subscriptions, and, if you are interested in systems architecture, Rohit Khare's ARRESTED model.