Video Decode and Presentation API for Unix

Introduction

The Video Decode and Presentation API for Unix (VDPAU) provides a complete solution for decoding, post-processing, compositing, and displaying compressed or uncompressed video streams. These video streams may be combined (composited) with bitmap content to implement on-screen displays (OSDs) and other application user interfaces.

API Partitioning

VDPAU is split into two distinct modules:

  - The Core API
  - The Window System Integration Layer

The intent is that most VDPAU functionality exists and operates identically across all possible Windowing Systems. This functionality is the Core API.

However, a small amount of functionality must be included that is tightly coupled to the underlying Windowing System. This functionality is the Window System Integration Layer. One example is creation of the initial VdpDevice handle, since this requires intimate knowledge of the underlying Window System, such as its display and window handles.

Object Types

VDPAU is roughly object oriented; most functionality is exposed by creating an object (handle) of a certain class (type), then executing various functions against that handle. The set of object classes supported, and their purpose, is discussed below.

Device Type

A VdpDevice is the root object in VDPAU's object system. The Window System Integration Layer allows creation of a VdpDevice object handle, from which all other API entry points can be retrieved and invoked.

Surface Types

A surface stores pixel information. Various types of surface exist for different purposes:

  - VdpVideoSurface objects store decoded video fields/frames.
  - VdpOutputSurface objects store composited output, including any OSD or user interface, ready for display.
  - VdpBitmapSurface objects store static bitmap data, such as individual UI glyphs.

Transfer Types

A data transfer object reads data from a surface (or surfaces), processes it, and writes the result to another surface. Various types of processing are possible:

  - VdpDecoder objects decompress compressed video data into a VdpVideoSurface.
  - VdpOutputSurface rendering functionality composites VdpBitmapSurfaces and VdpOutputSurfaces into a VdpOutputSurface.
  - VdpVideoMixer objects post-process decoded video, composite it with VdpOutputSurfaces, and write the result into a VdpOutputSurface.
  - VdpPresentationQueue objects display a VdpOutputSurface on screen.

Data Flow

Compressed video data originates in the application's memory space. This memory is typically obtained using malloc, and filled via regular file or network read system calls. Alternatively, the application may mmap a file.

The compressed data is then processed using a VdpDecoder, which will decompress the field or frame, and write the result into a VdpVideoSurface. This action may require reading pixel data from some number of other VdpVideoSurface objects, depending on the type of compressed data and field/frame in question.

If the application wishes to display any form of OSD or user-interface, this must be created in a VdpOutputSurface.

This process begins with the creation of VdpBitmapSurface objects to contain the OSD/UI's static data, such as individual glyphs.

VdpOutputSurface rendering functionality may be used to composite together various VdpBitmapSurfaces and VdpOutputSurfaces into another VdpOutputSurface.

Once video has been decoded, it must be post-processed. This involves various steps such as color space conversion, de-interlacing, and other video adjustments. This step is performed using a VdpVideoMixer object. This object can not only perform the aforementioned video post-processing, but also composite the video with a number of VdpOutputSurfaces, thus allowing complex user interfaces to be built. The final result is written into another VdpOutputSurface.

Note that at this point, the resultant VdpOutputSurface may be fed back through the above path, either using VdpOutputSurface rendering functionality, or as input to the VdpVideoMixer object.

Finally, the resultant VdpOutputSurface must be displayed on screen. This is the job of the VdpPresentationQueue object.

[Figure vdpau_data_flow.png: VDPAU data flow]

Entry Point Retrieval

VDPAU is designed so that multiple implementations can be used without application changes. For example, VDPAU could be hosted on X11, or via direct GPU access.

The key technology behind this is the use of function pointers and a "get proc address" style API for all entry points. Put another way, functions are not called directly via global symbols set up by the linker, but rather through pointers.

In practical terms, the Window System Integration Layer provides factory functions which not only create and return VdpDevice objects, but also a function pointer to a VdpGetProcAddress function, through which all entry point function pointers will be retrieved.

Philosophy

It is entirely possible to envisage a simpler scheme whereby such function pointers are hidden. That is, the application would link against a wrapper library that exposed "real" functions. The application would then call such functions directly, by symbol, like any other function. The wrapper library would handle loading the appropriate back-end, and implementing a similar "get proc address" scheme internally.

However, the above scheme does not work well in the context of separated Core API and Window System Integration Layer. In this scenario, one would require a separate wrapper library per Window System, since each Window System would have a different function name and prototype for the main factory function. If an application then wanted to be Window System agnostic (making final determination at run-time via some form of plugin), it may then need to link against two wrapper libraries, which would cause conflicts for all symbols other than the main factory function.

Another disadvantage of the wrapper library approach is the extra level of function call required; the wrapper library would internally implement the existing "get proc address" and "function pointer" style dispatch anyway. Exposing this directly to the application is slightly more efficient.

Multi-threading

All VDPAU functionality is fully thread-safe; any number of threads may call into any VDPAU function at any time. VDPAU may not, however, be called from signal handlers.

Note, however, that this simply guarantees that internal VDPAU state will not be corrupted by thread usage, and that crashes and deadlocks will not occur. Completely arbitrary thread usage may not generate the results that an application desires. In particular, care must be taken when multiple threads are performing operations on the same VDPAU objects.

VDPAU implementations guarantee correct flow of surface content through the rendering pipeline, but only when function calls that read from or write to a surface return to the caller prior to any thread calling any other function(s) that read from or write to the surface. Invoking multiple reads from a surface in parallel is OK.

Note that this restriction is placed upon VDPAU function invocations, and specifically not upon any back-end hardware's physical rendering operations. VDPAU implementations are expected to internally synchronize such hardware operations.

In a single-threaded application, the above restriction comes naturally; each function call completes before it is possible to begin a new function call.

In a multi-threaded application, threads may need to be synchronized. For example, consider the situation where:

  - Thread 1 calls VdpDecoderRender to write decoded video into a VdpVideoSurface.
  - Thread 2 calls VdpVideoMixerRender with that same VdpVideoSurface as an input.

In this case, the threads must synchronize to ensure that thread 1's call to VdpDecoderRender has returned prior to thread 2's call(s) to VdpVideoMixerRender that use that specific surface. This could be achieved using the following pseudo-code:

 Queue<VdpVideoSurface> q_full_surfaces; 
 Queue<VdpVideoSurface> q_empty_surfaces; 
  
 thread_1() { 
     for (;;) {
         VdpVideoSurface s = q_empty_surfaces.get();
         // Parse compressed stream here
         VdpDecoderRender(s, ...);
         q_full_surfaces.put(s);
     }
 } 
  
 // This would need to be more complex if
 // VdpVideoMixerRender were to be provided with more
 // than one field/frame at a time.
 thread_2() { 
     for (;;) {
         // Possibly, other rendering operations to mixer
         // layer surfaces here.
         VdpOutputSurface t = ...;
         VdpPresentationQueueBlockUntilSurfaceIdle(t);
         VdpVideoSurface s = q_full_surfaces.get();
         VdpVideoMixerRender(s, t, ...);
         q_empty_surfaces.put(s);
         // Possibly, other rendering operations to "t" here
         VdpPresentationQueueDisplay(t, ...);
     }
 }

Finally, note that VDPAU makes no guarantees regarding any level of parallelism in any given implementation. Put another way, use of multi-threading is not guaranteed to yield any performance gain, and in theory could even slightly reduce performance due to threading/synchronization overhead.

However, the intent of the threading requirements is to allow for e.g. video decoding and video mixer operations to proceed in parallel in hardware. Given a (presumably multi-threaded) application that kept each portion of the hardware busy, this would yield a performance increase.

Surface Endianness

When dealing with surface content, i.e. the input/output of Put/GetBits functions, applications must take care to access memory in the correct fashion, so as to avoid endianness issues.

By established convention in the 3D graphics world, RGBA data is defined to be an array of 32-bit pixels containing packed RGBA components, not as an array of bytes or interleaved RGBA components. VDPAU follows this convention. As such, applications are expected to access such surfaces as arrays of 32-bit components (i.e. using a 32-bit pointer), and not as interleaved arrays of 8-bit components (i.e. using an 8-bit pointer.) Deviation from this convention will lead to endianness issues, unless appropriate care is taken.

The same convention is followed for some packed YCbCr formats such as VDP_YCBCR_FORMAT_Y8U8V8A8; i.e. they are considered arrays of 32-bit pixels, and hence should be accessed as such.

For YCbCr formats with chroma decimation and/or planar formats, however, this convention is awkward. Therefore, formats such as VDP_YCBCR_FORMAT_NV12 are defined as arrays of (potentially interleaved) byte-sized components. Hence, applications should manipulate such data 8-bits at a time, using 8-bit pointers.

Note that one common usage for the input/output of Put/GetBits APIs is file I/O. Typical file I/O APIs treat all memory as a simple array of 8-bit values. This violates the rule requiring surface data to be accessed in its true native format. As such, applications may be required to solve endianness issues. Possible solutions include:

  - Constraining the application to run only on platforms of a known endianness.
  - Byte-swapping the data after reading it, or before writing it, on platforms where the file's byte order differs from the native component order.

Note: Complete details regarding each surface format's precise pixel layout is included with the documentation of each surface type. For example, see VDP_RGBA_FORMAT_B8G8R8A8.

Video Mixer Usage

VdpVideoSurface Content

Each VdpVideoSurface is expected to contain an entire frame's-worth of data, irrespective of whether an interlaced or progressive sequence is being decoded.

Depending on the exact encoding structure of the compressed video stream, the application may need to call VdpDecoderRender twice to fill a single VdpVideoSurface. When the stream contains an encoded progressive frame, or a "frame coded" interlaced field-pair, a single VdpDecoderRender call will fill the entire surface. When the stream contains separately encoded interlaced fields, two VdpDecoderRender calls will be required; one for the top field, and one for the bottom field.

Implementation note: When VdpDecoderRender renders an interlaced field, this operation must not disturb the content of the other field in the surface.

VdpVideoMixerRender surface list

The VdpVideoMixerRender API receives VdpVideoSurface IDs for any number of fields/frames. The application should strive to provide as many fields/frames as practical, to enable advanced video processing algorithms. At a minimum, the current field/frame must be provided. It is recommended that at least 2 past and 1 future frame be provided in all cases.

Note that it is entirely possible, in general, for any of the VdpVideoMixer post-processing steps to require access to multiple input fields/frames.

It is legal for an application not to provide some or all of the surfaces other than the "current" surface. Note that this may cause degraded operation of the VdpVideoMixer algorithms. However, this may be required in the case of temporary file or network read errors, decode errors, etc.

When an application chooses not to provide a particular surface to VdpVideoMixerRender, then this "slot" in the surface list must be filled with the special value VDP_INVALID_HANDLE, to explicitly indicate that the picture is missing; do not simply shuffle other surfaces together to fill in the gap.

The VdpVideoMixerRender parameter current_picture_structure applies to video_surface_current. The picture structure for the other surfaces will be automatically derived from that for the current picture as detailed below.

If current_picture_structure is VDP_VIDEO_MIXER_PICTURE_STRUCTURE_FRAME, then all surfaces are assumed to be frames. Otherwise, the picture structure is assumed to alternate between top and bottom field, anchored against current_picture_structure and video_surface_current.

Applying de-interlacing

Note that VdpVideoMixerRender disables de-interlacing when current_picture_structure is VDP_VIDEO_MIXER_PICTURE_STRUCTURE_FRAME; frames by definition need no de-interlacing.

Weave de-interlacing may be obtained by giving the video mixer a surface containing two interlaced fields, but informing it that the picture structure is VDP_VIDEO_MIXER_PICTURE_STRUCTURE_FRAME.

Bob de-interlacing is the default for interlaced content. More advanced de-interlacing techniques may be available, depending on the implementation. Such features need to be requested when creating the VdpVideoMixer, and subsequently enabled.

If the source material is marked progressive, two options are available for VdpVideoMixerRender usage:

  1. Simply pass the allegedly progressive frames through the mixer, marking them as progressive. This equates to a so-called "flag following" mode.
  2. Apply any pulldown flags in the stream, yielding a higher rate stream of interlaced fields. These should then be passed through the mixer, marked as fields, with de-interlacing enabled, and inverse telecine optionally enabled. This should allow for so-called "bad edit" detection. However, it requires more processing power from the hardware.

If the source material is marked interlaced, the decoded interlaced fields should always be marked as fields when processing them with the mixer. Some de-interlacing algorithm is then always applied. Inverse telecine may be useful in cases where some or all of the interlaced stream is telecined film.

Extending the API

Enumerations and Other Constants

VDPAU defines a number of enumeration types.

When modifying VDPAU, existing enumeration constants must continue to exist (although they may be deprecated), and do so in the existing order.

The above discussion naturally also applies to enumerations defined "manually" using pre-processor macros.

Structures

In most cases, VDPAU includes no provision for modifying existing structure definitions, although they may be deprecated.

New structures may be created, together with new API entry points or feature/attribute/parameter values, to expose new functionality.

A few structures are considered plausible candidates for future extension. Such structures include a version number as the first field, indicating the exact layout of the client-provided data. Such structures may only be modified by adding new fields to the end of the structure, so that the original structure definition is completely compatible with a leading subset of fields of the extended structure.

Functions

Existing functions may not be modified, although they may be deprecated.

New functions may be added at will. Note the enumeration requirements when modifying the enumeration that defines the list of entry points.

Display Preemption

Please note that the display may be preempted away from VDPAU at any time. See Display Preemption for more details.

Trademarks

VDPAU is a trademark of NVIDIA Corporation. You may freely use the VDPAU trademark, as long as trademark ownership is attributed to NVIDIA Corporation.

Generated on Mon Dec 22 06:59:42 2008 for VDPAU by  doxygen 1.5.6