Friday, 3 February 2012

The Problem With Counting Browser Features

css3test.com is doing the rounds. I get almost exactly the same results in Firefox trunk and Chrome Nightly (64% vs 63%). This gives me a chance to explain why this sort of testing tends to be bad for the Web, without sounding whiny and bitter :-).

The root of the problem is very simple, and explained right at the top of the page:

Caution: This test checks which CSS3 features the browser recognizes, not whether they are implemented correctly.

Thanks for the disclaimer, but it doesn't eliminate the problem. The problem is that whenever someone counts the number of features supported by a browser and reports that as a nice easy-to-read score, without doing much testing of how well the features work, they encourage browser developers to increase their score by shipping some kind of support for each tested feature. Because we have limited resources, that effectively discourages fixing bugs in existing features and making sure that new features are thoroughly specced, implemented and tested. I think this is bad for Web developers and bad for the Web itself.

If Web authors reward Web browsers for superficial but broken support for features, that's what they'll get.

Instead, I would like to see broad and deep test suites that really test the functionality of features, and people comparing the results of those test suites across browsers. Microsoft does this, but of course they tend to only publish tests that IE passes. We need test suites with lots of tests from multiple vendors, and Web authors too. (Why haven't cross-browser test results for the official W3C CSS 2.1 test suite been widely published?)

I don't want to come across as harsh on css3test.com. The site is lovely, it has that disclaimer, and it goes a lot further than, say, html5test.com in terms of testing the depth of support for a feature. So it actually represents good progress :-).

18 comments:

  1. There is something to be said about raising the collective awareness of the community of these features. Not everyone has time to read the specification. Sites like this can encourage folks to check out what's possible.

    ReplyDelete
  2. Anonymous: people are more aware of things like HTML5 and CSS3 than they ever have been in the past. They're literally buzzwords. The awareness is there; what we need is more awareness doesn't capture the complete picture (and is often outright misleading).

    ReplyDelete
  3. Making a big deal about official testsuite scores can be just as problematic; the testsuites are typically not fairly weighted and making the results PR fodder creates pressure on vendors to only release tests that they pass, or to release many many almost-identical tests for things that the competition fails. A nice example of something that is overrepresented in a testsuite – though I am not suggesting this case represents malicious intent – is the "outline" property in the CSS 2.1 testsuite which has 175 tests with many that differ only trivially.

    Interpreting the results of testsuites is hard, very hard, and they are only really good for browser vendors to improve the quality of their implementations. Authors would be better served by sites that make experience-based judgement calls about whether particular feature implementations are basically usable per spec, so buggy or non-standard that they will be difficult to code against, or simply missing.

    Also, FWIW it's not just Microsoft that release tests; at Opera we have been making a big effort to release as many of the testsuites we write internally as possible. These are the same ones that we use for our regression testing so if the tests suck – e.g. ignore corner cases – it's pretty likely that our implementation does too.

    ReplyDelete
    Replies
    1. I should have written:

      "These are the same ones that we use for our regression testing so if the tests suck – e.g. ignore corner cases – it's pretty likely that our implementation will too. Since we aren't interested in making bad implementations we have to write, and therefore release, high quality tests."

      Delete
    2. Official testsuites can be problematic too, but I think they're better than any of the alternatives. If we had enough tests and a broad enough range of contributions, I think accidental or deliberate bias could be smoothed out.

      Or we could assign a lower weight to test results for browser B for the tests that were contributed by browser B, somehow...

      Delete
  4. Comparing test suite results across browsers is rather pointless. If not all browser (claim to) support all features in the suite, it's pointless, since you have to start by trying to figure out how much a feature is "worth" (should <video> support be worth more, less, or the same as <input type="date">?).

    Considering that a lot of test suites amount to regression tests, they're probably biased towards features more commonly used (I'm sure I'd find a lot more tests for float support than inline-table, e.g.) and features that are more complex.

    With that in mind, if all browsers support all the features in the test, you might be able to say something if there is a significant difference in the pass rates; it also only really works if you compare the results of the current versions at the time of the test suite release.

    In any case, releasing test suites for features that haven't even stabilized in specification and comparing support for them is completely and totally misguided.

    ReplyDelete
  5. Hi Robert,

    I really admire your work so I'm kinda flattered you wrote a whole blog post about my project :)

    You are very right in your points, and I've been one of the people talking about how bad WebKit's implementations sometimes are. I've always said that WebKit implements stuff fast, but in buggy ways. I've even tried to add the very bad cases in the sidebar (under "Cheaters")

    However, sometimes you can't get multiple birds with one stone. The purpose of this test was to raise awareness about CSS3 features that don't get the PR of transforms or shadows and to debunk the myth that WebKit is unbelievably ahead of other engines. I wanted to add as many tests as possible, which unfortunately limits the scope of the tests to being very shallow, if you're just one person :)

    Also, the means we have today for deeply testing features are kinda limited if we can't accept human interaction (which would be cumbersome with such a test). I would love a screen capture API for example, where we could get a test's results on a canvas and analyze what actually happened visually.

    ReplyDelete
    Replies
    1. Those are all good points.

      It would be nice to be able to run reftests in a browser. Unfortunately to run Mozilla or W3C reftests you have to be able to access privileged APIs (we don't want just anyone to be able to get perfect screenshots of arbitrary pages --- see http://robert.ocallahan.org/2011/11/drawing-dom-content-to-canvas.html). Maybe someone could create an extension for each browser that lets certain whitelisted sites run reftests.

      Or, at least in Firefox, we could add an about:config flag that gives any site unrestricted drawWindow access, which people could use to run tests.

      Delete
    2. Both the extension and the about:config flag would be a step to the right direction. However, that's cumbersome, which also reduces the number of results. I would do it, the average developer probably wouldn't care that much. The whole point of eliminating human interaction is making it easier to run, thus collecting more data.

      I'm aware of the security risks involved, I learned that from a discussion with Chrome developers a few months ago. However, those can be solved (not trivially), it's just that anything will be buggy at first and those kinds of bugs are bad, as they introduce security holes.

      I'm also aware of foreignObject as a workaround for that. However, the origin issues make it impossible to use it in such a context currently, and when I played with it, I noticed differences in the way CSS3 is rendered in it (of course I might have been doing something wrong). Also, I'm not sure what's the point of some of the restrictions, such as not allowing anything external, even if it's same origin.

      Delete
    3. Trunk Firefox will let you get the pixel data from SVG images rendered to canvas, so the technique mentioned here works:
      http://robert.ocallahan.org/2011/11/drawing-dom-content-to-canvas.html
      Other browsers could do the same.

      That same blog post elaborates on some of the security issues. Unrestricted capturing of Web page content isn't going to happen.

      I don't think we actually need to capture data from lots of developers here. It's cool to be able to run the tests in your own browser, but it doesn't really matter if 10 people test Firefox 10 vs 10,000 people, especially for CSS features that almost all work the same on every platform and every kind of hardware.

      Delete
    4. Sure, but lots of people = greater browser diversity.

      Delete
    5. But there's only a small number of browsers worth caring about, e.g. release+dev versions of Firefox, Chrome, Opera, Safari and IE on desktop, and a similar set on mobile.

      Delete
    6. "I've always said that WebKit implements stuff fast, but in buggy ways"

      This has been true from my observations as well. Here is an example: http://code.google.com/p/chromium/issues/detail?id=105227. Chrome brought CSS 3D transforms to stable release earlier than others, but it was buggy - rendering the feature useless for a web app that would do much more than just transform a simple div.

      On a related note, it is widely recommended by many to not lookup browser user-agent string to detect features. But with bugs and incorrect implementation of new features, I dont see a way around it.

      Thanks Robert for this post.

      Delete
  6. Hi,

    I agree with your overall point, but:

    "The problem is that whenever someone counts the number of features supported by a browser and reports that as a nice easy-to-read score, without doing much testing of how well the features work, they encourage browser developers to increase their score by shipping some kind of support for each tested feature. Because we have limited resources, that effectively discourages fixing bugs in existing features and making sure that new features are thoroughly specced, implemented and tested."
    => In essence, what you're saying here is that browsers developers are led by these kinds of websites.
    That's where the problem lies. Mozilla and any browser could have an official expression to say exactly what you said in your blog post. The problem is that no browser takes position to say "we don't care about our score in such or such benchmark, rather we deliver a spec-compliant product, backed by such number of tests"
    You have such an expression in your blog, but it seems to be a personal opinion rather than an official one.
    Different browsers could gather up to agree on which "testing website" they give credit to and which they do not.

    "If Web authors reward Web browsers for superficial but broken support for features, that's what they'll get. "
    => I'm a web author and supports what you've said, not the CSS3test. I just haven't setup a website to express it. Are you going to only listen those who set up a website grading superficial support of features?

    As a browser developer, you can decide of who you listen to, but also who you choose not to listen.

    ReplyDelete
    Replies
    1. Have an "official position" doesn't really help. At the end of the day, if some people are running around saying "browser X rules, browser Y sucks", it hurts browser Y and helps browser X no matter what their official positions are.

      Yes, actually, if you set up a popular site explaining why you like the browsers you like, that does help them. However, one problem is that sites that report a percentage or a number are more popular than sites which simply express an opinion --- because numbers sound objective, they're easy to compare, they can be checked in your own browser, and so on (no matter how bogus that number might be!). It's the old trope about how you can make people believe any old lie if you dress it up with statistics :-).

      Delete
  7. If there are too many features to implement correctly, then there are too many features. Web standards ought to be getting simpler, not more complicated. More features means more bugs, and there will be no end to it.

    VanillaMozilla

    ReplyDelete
  8. I think there are two problems with disseminating information about e.g. CSS2.1 test scores. The first is that they're not presented in an interesting way. The data is there, but it's pretty much a raw dump. Someone has to figure out how to make it interesting. The second problem is that the data isn't up to date with the latest releases (which is at least partly because almost all of the tests are manual). Ideally, we'd have full test results for every release of every browser so you could compare.

    ReplyDelete
  9. Over a year later and still they're all scoring around 63%!

    What I would like to see is a system where developers/designers can vote on which features are most important and have their votes add weight to the score value of certain features.

    I have a feeling flexbox would be voted up near the top of the list and Firefox's lack of support for wrapping flexboxes would knock their % score down a lot.

    ReplyDelete

Note: only a member of this blog may post a comment.