I'm at linux.conf.au at the moment (until Wednesday) and yesterday I attended the browsers miniconf. It went well, better than I expected. I had a slot to talk about the MediaStreams Processing API proposal to enable advanced audio effects (and much more!) in browsers, which has been my main project for the last several months (see my earlier post here). I worked frantically up to the last minute to create demos of some of the most interesting features of the API, and to get my implementation into a state where it can run the demos. By the grace of God I was successful :-). Even more graciously, the audio in the conference room worked and even played my stereo effects properly!
I have made available experimental Windows and Mac Firefox builds with most of the MediaStreams Processing API supported. (But the Mac builds are completely untested!) The demos are here. Please try them out! I hope people view the source, modify the demos and play with the API to see what can be done. Comments on the API should go to me or to the W3C Audio Working Group.
I must apologise for the uninspired visual design and extraordinarily naive audio processing algorithms. Audio professionals who view the source of my worker code will just laugh --- and hopefully be inspired to write better replacements :-). Making that easy for anyone to do is one of my goals.
Some of the things I like about this API:
- First-class support for JS-based processing. In particular, JS processing off the main thread, using Workers. This lets people build whatever effects they want and get reasonable performance. Soon we'll have something like Intel's River Trail in browsers and then JS users will be able to get incredible performance.
- Leverages MediaStreams. Ongoing work on WebRTC and elsewhere is introducing MediaStreams as an abstraction of real-time media, and linking them to sources and sinks to form a media graph. I don't think we need another real-time media graph in the Web platform.
- Allows processing of various media types. MediaStreams currently carry both audio and video tracks. At the moment the API only supports processing of the audio because we don't have graphics APIs available in Workers to enable effective video processing, but that will change. Applications will definitely want to process video in real time (e.g. QR code recognizer, motion detection and other "augmented reality" applications). Soon we'll want Kinect depth data and other kinds of real-time sensor data.
- First-class synchronization. Some sources and effects have unbounded latency. We want to make sure we maintain A/V sync in the face of latency or dynamic graph changes. This should be automatic so authors don't have to worry about it.
- Support for streams with different audio sample rates and channel configurations in the same graph. This is important for efficient processing when you have a mix of rates and some of them are low. (All inputs to a ProcessedMediaStream are automatically resampled to the same rate and number of channels to simplify effect implementations.)
- No explicit graph or context object. It's not needed.
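To make the "same rate and number of channels" point concrete, here is a minimal sketch of what normalizing an input could look like. It is not the code in my builds and not part of the proposed API: the function names (resampleLinear, matchChannels, normalizeInput) and the { rate, channels } input shape are invented for illustration, and linear interpolation is used only because it is the simplest thing that works.

```javascript
// Illustrative only: hypothetical helpers, not the proposed API or Firefox code.

// Resample one channel from `fromRate` to `toRate` by linear interpolation.
function resampleLinear(input, fromRate, toRate) {
  const outLength = Math.round(input.length * toRate / fromRate);
  const output = new Float32Array(outLength);
  if (input.length === 0) return output;
  const step = fromRate / toRate; // input samples advanced per output sample
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    // Clamp indices so the last output samples hold the final input value.
    const i0 = Math.min(Math.floor(pos), input.length - 1);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    output[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return output;
}

// Convert to `channelCount` channels. Extra output channels reuse the last
// input channel (so mono becomes dual-mono stereo); extra input channels
// are dropped. Duplicated channels share the same underlying array.
function matchChannels(channels, channelCount) {
  const out = [];
  for (let c = 0; c < channelCount; c++) {
    out.push(channels[Math.min(c, channels.length - 1)]);
  }
  return out;
}

// Bring one input ({ rate, channels: Float32Array[] }) to the rate and
// channel count that the effect code expects.
function normalizeInput(input, targetRate, targetChannels) {
  const resampled = input.channels.map(
    (ch) => resampleLinear(ch, input.rate, targetRate));
  return { rate: targetRate, channels: matchChannels(resampled, targetChannels) };
}
```

With something like this applied to every input, an effect only ever sees buffers of one rate and one channel layout, which is the simplification the API is aiming for. A real resampler would need proper filtering to avoid aliasing.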
Most of the features in the proposed spec are implemented. Notable limitations:
- "blockInput" and "blockOutput" are not implemented; there is no way for streams to opt out of being synchronized. For example, it would be nice to be able to pipe a media resource into a processing node so that if the resource pauses (e.g. due to a network delay), the processing node doesn't block but just treats the paused input as silence. This is probably the trickiest feature not yet implemented.
- No support for "live" streams. Similarly to the above, if a stream feeds into an output node that is blocked, we sometimes don't want to buffer the input stream. E.g. if the input is a live webcam you often (but not always) want to throw away buffered data so that when the output unblocks it immediately gets the latest video frames.
- There has been very little tuning to optimize throughput and latency, especially across a range of devices. This will be a lot of work.
- In general the API is very lightly tested. I'm sure there are lots of bugs.
- Video elements don't play in sync with streams captured from them. In my demos I worked around this by hiding the source video elements and creating new video elements to play the video via the stream. Fixing this bug would simplify the demos a bit.
- Canvas video sources are not implemented.
- The built-in audio resampler is stupendously naive and needs to be replaced.
- Multiple audio and video tracks and the MediaStream track API are not supported yet.
- ProcessedMediaStreams using JS workers need to add checks ensuring that all upstream media sources are same-origin.
- The biggest limitation is that it's not shipping in Firefox yet. My giant patch is messy and a lot of cleanup needs to be done. I have a plan to split the patch up, clean up the pieces and land them piecemeal. In particular I need to get some of the infrastructure landed ASAP to help the WebRTC team make progress. (When we ship it, much or all of the API will probably be disabled by default, behind a hidden pref, until the standards situation is resolved.)
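The buffering behaviours described in the first two limitations are easier to see in code. The following is only a sketch of the three policies, using a plain array of samples as a stand-in for a stream's buffered data. None of these helper names (makeInputBuffer, pullOrBlock, pullOrSilence, dropStale) exist in the proposed API or in my implementation, and the real thing has to deal with time and multiple tracks, not just sample counts.

```javascript
// Illustrative only: hypothetical helpers, not the proposed API or Firefox code.

// Stand-in for the data buffered from one upstream source.
function makeInputBuffer(samples) {
  return { samples: Array.from(samples || []) };
}

// Current behaviour: if the input cannot supply a full block, the consumer
// blocks. Returns null to signal "blocked" and leaves the buffer untouched.
function pullOrBlock(buffer, count) {
  if (buffer.samples.length < count) return null;
  return Float32Array.from(buffer.samples.splice(0, count));
}

// Opting out of blocking: take what is available and pad the rest of the
// block with zeros, i.e. treat the paused input as silence.
function pullOrSilence(buffer, count) {
  const out = new Float32Array(count); // zero-filled by default
  const available = Math.min(count, buffer.samples.length);
  for (let i = 0; i < available; i++) out[i] = buffer.samples[i];
  buffer.samples.splice(0, available);
  return out;
}

// "Live" policy: while the output is blocked, keep only the newest
// `maxBuffered` samples. Returns how many stale samples were discarded.
function dropStale(buffer, maxBuffered) {
  const excess = buffer.samples.length - maxBuffered;
  if (excess <= 0) return 0;
  buffer.samples.splice(0, excess);
  return excess;
}
```

For example, with three samples buffered, pullOrBlock(buffer, 4) reports a block, while pullOrSilence(buffer, 4) returns those three samples followed by one sample of silence.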
Update: The build links now point to new builds with improved performance (faster JS execution in workers due to type inference being turned on; fewer control loop wakeups due to more intelligent buffering decisions for ProcessedMediaStreams).