Saturday 25 October 2014

Hap attack (and Quicktime fun)

A little while ago I was asked to add Hap support to vvvv.

This is a rather simple format: the idea is that each frame is stored as BC1/BC3 data (optionally with lightweight Snappy compression on top), so you can do a fast GPU upload.

It's more or less the scheme used by many "media servers"; one difference is that everything is packed into a single file instead of a bunch of DDS files.

It's a pretty useful format since frame loading is very fast, and can even be done within a single render frame, so you can have perfect synchronisation between videos on one machine (or across multiple machines).

So the first step is simply to decode a frame. As a test rig I just used a Media Foundation source reader, which happily gives me a sample (aka: a frame) in compressed form.

Once you have this, everything is reasonably straightforward:

The first 4 bytes are [Length] (3 bytes) + [Flag] (1 byte).

The flag gives you compression + format, like this (C#):

Code Snippet
public enum hapFormat
{
    RGB_DXT1_None = 0xAB,
    RGB_DXT1_Snappy = 0xBB,
    RGBA_DXT5_None = 0xAE,
    RGBA_DXT5_Snappy = 0xBE,
    YCoCg_DXT5_None = 0xAF,
    YCoCg_DXT5_Snappy = 0xBF
}
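
For illustration, here is a minimal sketch of reading that 4-byte header (the 3-byte length is little-endian per the Hap spec; ReadHapHeader and frameData are my own names, not from the original code):

Code Snippet
// Sketch: parse the Hap section header
// (3-byte little-endian length + 1-byte format/compression flag).
static void ReadHapHeader(byte[] frameData, out int length, out hapFormat format)
{
    length = frameData[0] | (frameData[1] << 8) | (frameData[2] << 16);
    format = (hapFormat)frameData[3];
}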

Once you have this, you need to call Snappy to decompress (if relevant):

Code Snippet
// Ask Snappy for the decompressed length, then decompress into a
// preallocated buffer (only needed for the *_Snappy formats)
int uncomp = 0;
SnappyStatus st = SnappyCodec.GetUncompressedLength(bptrData, frameLength, ref uncomp);
st = SnappyCodec.Uncompress(bptrData, frameLength, (byte*)snappyTempData, ref uncomp);
initialData = snappyTempData;


I just used an existing P/Invoke wrapper, no need to waste time reinventing the wheel:

http://snappy4net.codeplex.com/

Now that your frame is ready, you just have to upload it to your GPU:

Code Snippet
Texture2DDescription textureDesc = new Texture2DDescription()
{
    ArraySize = 1,
    BindFlags = BindFlags.ShaderResource,
    CpuAccessFlags = CpuAccessFlags.None,
    Format = format.GetTextureFormat(),
    Height = this.frameSize.Height,
    Width = this.frameSize.Width,
    MipLevels = 1,
    OptionFlags = ResourceOptionFlags.None,
    SampleDescription = new SharpDX.DXGI.SampleDescription(1, 0),
    Usage = ResourceUsage.Immutable
};

DataRectangle dataRectangle = new DataRectangle(initialData, format.GetPitch(this.frameSize.Width));
Texture2D videoTexture = new Texture2D(this.device, textureDesc, dataRectangle);
ShaderResourceView videoView = new ShaderResourceView(this.device, videoTexture);

format.GetTextureFormat() takes care of properly converting the hap format to the relevant BC texture format.
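
Those two helpers aren't shown here, so as a plausible sketch of what they do (BC1 stores 8 bytes per 4x4 block, BC3 stores 16, so the row pitch is the number of blocks across times the block size; the bodies are my assumption):

Code Snippet
// Sketch of the two extension methods used above.
public static class HapFormatExtensions
{
    public static SharpDX.DXGI.Format GetTextureFormat(this hapFormat format)
    {
        switch (format)
        {
            case hapFormat.RGB_DXT1_None:
            case hapFormat.RGB_DXT1_Snappy:
                return SharpDX.DXGI.Format.BC1_UNorm;
            default:
                return SharpDX.DXGI.Format.BC3_UNorm; // all DXT5 variants
        }
    }

    public static int GetPitch(this hapFormat format, int width)
    {
        // 8 bytes per 4x4 block for BC1, 16 for BC3
        int blockSize = format == hapFormat.RGB_DXT1_None
            || format == hapFormat.RGB_DXT1_Snappy ? 8 : 16;
        return ((width + 3) / 4) * blockSize;
    }
}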

That's about all it takes to decode a frame, hardcore work ;)

Now, as usual, this part is only the tip of the iceberg: you have to think about how to handle playback.

My initial thought was to continue using the Media Foundation source reader, so I wrote a little player and checked decode time: around 1.5 ms per full HD frame (on a laptop with no SSD).

So it's all pretty promising, really fast decode access. But then you reach the point where you want to loop your video (which involves calling SetCurrentPosition on your source reader).

Surprisingly, this is extremely slow: grabbing a frame after a seek suddenly takes 60 ms (which is obviously far too much). That rules out random play (seeking every frame) as well.

So back to the good old Windows AVI API, which just parses the file and allows you to load a random frame into memory.

First we load the file:

Code Snippet
Avi.AVIFileInit();
fileHandle = 0;
int r = Avi.AVIFileOpen(ref fileHandle, @"E:\repositories\cartfile\other\EncodingTest\sample-1080p30-Hap.avi", Avi.OF_READWRITE, 0);

Avi.AVIFileGetStream(fileHandle, out videoStream, Avi.streamtypeVIDEO, 0);

Avi.AVISTREAMINFO streamInfo = new Avi.AVISTREAMINFO();
Avi.AVIStreamInfo(videoStream, ref streamInfo, Marshal.SizeOf(streamInfo));

Avi.BITMAPINFO bi = new Avi.BITMAPINFO();

int biSize = Marshal.SizeOf(bi);

Avi.AVIStreamReadFormat(videoStream, 0, ref bi, ref biSize);

SharpDX.Multimedia.FourCC fcc = new SharpDX.Multimedia.FourCC(bi.bmiHeader.biCompression);

Please note that we get the Hap FourCC from the bitmap compression header (so of course we add a check to verify our AVI is encoded using Hap).
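
Something along these lines does the job; the Hap FourCCs are "Hap1" (DXT1), "Hap5" (DXT5 with alpha) and "HapY" (YCoCg DXT5):

Code Snippet
// Sketch: reject anything that isn't one of the three Hap FourCCs.
if (fcc != new SharpDX.Multimedia.FourCC("Hap1")
    && fcc != new SharpDX.Multimedia.FourCC("Hap5")
    && fcc != new SharpDX.Multimedia.FourCC("HapY"))
{
    throw new InvalidOperationException("Not a Hap encoded AVI");
}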

Now to get a frame, we simply call:

Code Snippet
Avi.AVIStreamRead(this.videoStream, frameIndex, 1, this.aviTempData, 1920 * 1080, 0, 0);


With our frame index. Since Hap uses one keyframe per frame, this is extremely fast.
Once done, we upload to the GPU as before.

Now a bit of code to integrate into vvvv: just wrap all that lot into some plugins:



That's pretty much it. Please note that upload is so fast that I haven't bothered to do any buffering yet. I quite like the concept of "ask for this frame and get it" :)

Now, one thing is that Hap can come in 2 containers: AVI (from the DirectShow codec) or MOV (from the QuickTime codec).

Of course, most Hap files from people using this thing called a Mac will be encoded with the second codec. It's actually easy to change the container, but well, it would be much better to read QuickTime files directly.

That causes an initial problem: you need QuickTime installed on Windows (which sucks), and it implies using the QuickTime SDK for Windows (which was abandoned more than 5 years ago). So that also means forgetting about 64-bit support.

As a side note, I can limit my use case: I only want to read Hap; if a video uses another codec, my player just won't accept it.

So let's see if we can't just parse that MOV file ourselves and extract the raw data like we do with AVI.

The only difference here: I did not find any wrapper (like vfw.h provides for AVI), so time to go read specifications and open a hex editor ;)

For people interested, I'll leave you to read the whole spec:
https://developer.apple.com/library/mac/documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html#//apple_ref/doc/uid/TP40000939-CH202-TPXREF101

But let's summarize:

QuickTime files use the concept of an Atom (which is more or less just a node in a tree structure).
Each Atom has a length and a code (FourCC), and can either contain other atoms or data. It's structured this way:

[Length 4 Bytes][FourCC 4 Bytes][Data = Length - 8 Bytes]

Yes, the length parameter includes itself and the FourCC.

Please note there is nothing in the file format to tell you automatically whether an Atom is a leaf (data) or a container (contains other Atoms), so you have to go read the documentation and find out for yourself.
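
A minimal atom header reader can be sketched like this (helper names are my own; all multi-byte fields in a MOV file are big-endian):

Code Snippet
// Sketch: read one atom header. A 32-bit length of 1 means an 8-byte
// extended length follows the FourCC (this is what "wide" reserves room for).
static void ReadAtomHeader(System.IO.BinaryReader reader, out long length, out string fourCC)
{
    length = ReadBigEndianUInt32(reader);
    fourCC = System.Text.Encoding.ASCII.GetString(reader.ReadBytes(4));
    if (length == 1)
    {
        byte[] ext = reader.ReadBytes(8);
        System.Array.Reverse(ext);
        length = (long)System.BitConverter.ToUInt64(ext, 0);
    }
}

static uint ReadBigEndianUInt32(System.IO.BinaryReader reader)
{
    byte[] b = reader.ReadBytes(4);
    System.Array.Reverse(b);
    return System.BitConverter.ToUInt32(b, 0);
}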

The first atom is called File Type Compatibility ("ftyp"); it contains a header to check that it's a valid QuickTime file, plus a few bits of version info.

Next we have "wide", which is a special one to allow to add a flag for large files.

Then we have "mdat" , which contains all the sample data (where we want to read from). But of course for now we don't know how data is organized.

So we need to go to the next one (called "moov"), which contains all the information we need. There's a really large number of options, but roughly, from there we retrieve frames per second and the track list ("trak" atoms).

We can already go into the track header ("tkhd" atom) to retrieve track length / size.

Our work is not finished though: we need to check that our file is Hap, which is indicated in the "stsd" atom (Sample Description).

Once we are in the sample table ("stbl" atom), the most important data is within reach: how to find the position/length of a frame.

First, data is organized in chunks. A chunk contains one or more samples (so, for example, you can load a whole chunk from the file instead of one sample at a time).

So we need to enumerate the file offset of each chunk, which is contained in the "stco" atom (Chunk Offset Atom).

The data is simply a count-prefixed table containing the file offset of each chunk. Please note the offsets are absolute within the file, which makes things much easier: once we have that data we don't need to walk child atoms anymore.
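
Reading it can be sketched like this, reusing the big-endian helper from the previous snippet (the payload starts with a version byte, 3 flag bytes and a 32-bit entry count):

Code Snippet
// Sketch: read the stco payload into an array of absolute chunk offsets.
static uint[] ReadChunkOffsets(System.IO.BinaryReader reader)
{
    reader.ReadBytes(4); // version (1 byte) + flags (3 bytes)
    uint count = ReadBigEndianUInt32(reader);
    uint[] offsets = new uint[count];
    for (int i = 0; i < count; i++)
        offsets[i] = ReadBigEndianUInt32(reader);
    return offsets;
}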

Here is the data for a test MOV file (powered by hex editor and Windows Calculator ;)

stco (chunkoffsets)
Chunk 1 Offset : 48
Chunk 2 Offset : 1015901
Chunk 3 Offset : 2030918
Chunk 4 Offset : 3045373
Chunk 5 Offset : 4058366
Chunk 6 Offset : 4348429

Pretty simple, and since everything is absolute, lookup is also much faster.

Now, we need to know each Sample (or frame) size.

All frame sizes are contained in a single atom ("stsz"), so we go through them and get each frame's length:

stsz (sample size table)
Sample 1 : 144974
Sample 2 : 145333
Sample 3 : 145210
Sample 4 : 145344
Sample 5 : 145065
Sample 6 : 144811
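
Reading stsz is the same exercise, with one twist: a leading uniform size field, which is 0 when sizes vary (as they do here) and a per-sample table follows. A sketch:

Code Snippet
// Sketch: read the stsz payload into one size per sample.
static uint[] ReadSampleSizes(System.IO.BinaryReader reader)
{
    reader.ReadBytes(4); // version + flags
    uint uniformSize = ReadBigEndianUInt32(reader); // 0 = table follows
    uint count = ReadBigEndianUInt32(reader);
    uint[] sizes = new uint[count];
    for (int i = 0; i < count; i++)
        sizes[i] = uniformSize != 0 ? uniformSize : ReadBigEndianUInt32(reader);
    return sizes;
}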

Now we still don't know how samples relate to chunks (the last missing piece of the puzzle).

For that we need to read the "stsc" atom (Sample to Chunk), which is another count-prefixed table.

In my sample mov, this is described this way:

stsc (sample to chunk)
First Chunk / Samples Per Chunk / Descriptor
1 / 7 / 1
5 / 2 / 1
6 / 1 / 1

As you can see, this table is compressed: summing the entries naively only gives 7 + 2 + 1 = 10, yet my file has 31 frames.

So this simply expands: from chunk 1 up to (but not including) chunk 5, i.e. the first 4 chunks, we have 7 samples per chunk; then chunk 5 has 2 samples and chunk 6 has 1.

Which is correct, since 7*4 + 2 + 1 = 31.
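
That expansion can be sketched like this (each entry applies from its first chunk up to the next entry's first chunk; function and parameter names are mine, and note stsc chunk indices are 1-based):

Code Snippet
// Sketch: expand stsc entries into a samples-per-chunk array.
// chunkCount comes from the stco table.
static int[] ExpandSampleToChunk(int[] firstChunk, int[] samplesPerChunk, int chunkCount)
{
    int[] result = new int[chunkCount];
    for (int entry = 0; entry < firstChunk.Length; entry++)
    {
        int start = firstChunk[entry] - 1; // 1-based -> 0-based
        int end = entry + 1 < firstChunk.Length ? firstChunk[entry + 1] - 1 : chunkCount;
        for (int chunk = start; chunk < end; chunk++)
            result[chunk] = samplesPerChunk[entry];
    }
    return result;
}

For the file above, ExpandSampleToChunk(new[] { 1, 5, 6 }, new[] { 7, 2, 1 }, 6) gives { 7, 7, 7, 7, 2, 1 }, which sums to 31.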


So that's about it: with all that data we are ready to roll. We first build a prefix sum of the samples-per-chunk table (giving the index of the first sample in each chunk):
0 - 7 - 14 - 21 - 28 - 30

From this it's pretty easy to retrieve in which chunk our frame is contained.

Once we know the chunk and the position within the chunk, we iterate over the sample lengths until we reach our sample's position.

So the third frame's location is:

chunk 1 offset + sample 1 size + sample 2 size = 48 + 144974 + 145333

Of course we can precompute all this (per chunk or per sample), so we end up with 2 arrays:
frameindex->location
frameindex->size

Then we just load our frame from the file into memory, and rinse and repeat the Hap upload.

So here we go: a Hap decoder (small SharpDX standalone sample), which reads Hap MOV files without any need for QuickTime to be installed.





Fun times

