EIA-608 test captions
Hackipedia broadcast television documentation project

EIA-608 character set test

This CC data shows the entire character set in character code order:

1. Standard character set (codes 0x20-0x7F). Most of the character set mirrors ASCII, but some codes have been replaced with other symbols.
2. Special character set (codes 0x30-0x3F). The most commonly used special character is the musical note (code 0x37), but the set also covers the Registered and Trademark (TM) symbols.
3. Spanish/French character set.
4. Portuguese/German/etc. character set.

Note that there are quirks in the special, Spanish, and Portuguese sets that this CC data tests. One is the undocumented and unspoken yet widely used convention that a special character is transmitted twice, and the receiver treats two consecutive codes of the same value as one instance. Another is that Spanish/French/Portuguese extended characters are not printed normally but instead "overwrite" the last ASCII character, apparently so that the CC data can transmit the ASCII equivalent first and then overwrite it, for the benefit of older CC decoders.

Needless to say, EIA-608 is, in my humble opinion, a badly documented mess, and you'll find a lot of holes and gray areas in how the standard expects you to handle things. But once you know these cases, you'll probably handle the majority of closed caption data in the wild just fine.

Another weird quirk uncovered by this test CC data is that most TV sets will let you print the full 32 chars per row, yet will not properly print a special or Spanish character if the cursor is at the 32nd column. If you attempt to print 32 of them, the 32nd overwrites the 31st char (as you'll see in the test results).

------------------------------------------------------

How it was generated:

The code patterns were generated as a Scenarist closed caption (SCC) file, using a Perl script to run through the patterns. The SCC file is included here.

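
As a rough illustration of the kind of pattern generation described above, here is a minimal Python sketch. It is not the original Perl script. It assumes that a special character is sent as a two-byte code whose first byte is 0x11 (data channel 1) and whose second byte is the code from the special set (0x30-0x3F), that each byte carries odd parity in its MSB, and that an SCC line is a timecode, a tab, and space-separated 4-digit hex words. The function names and the timecode are made up for the example.

```python
def odd_parity(byte):
    """Return the 7-bit value with the MSB set so that the number of 1 bits is odd."""
    value = byte & 0x7F
    ones = bin(value).count("1")
    return value if ones % 2 == 1 else value | 0x80


def cc_word(first, second):
    """Format one byte pair as a 4-digit hex word, parity included."""
    return "%02x%02x" % (odd_parity(first), odd_parity(second))


def special_char_words(code, first_byte=0x11):
    """Return the doubled word for one special character.

    Assumption: first_byte 0x11 selects the special set on data channel 1.
    The word is emitted twice, following the doubling convention described
    in this document.
    """
    word = cc_word(first_byte, code)
    return [word, word]


def scc_line(timecode, words):
    """Join a (hypothetical) timecode and a list of words into one SCC line."""
    return timecode + "\t" + " ".join(words)


if __name__ == "__main__":
    # Musical note (code 0x37), sent twice.
    print(scc_line("00:00:01:00", special_char_words(0x37)))
```

The doubling is done at generation time here, so a receiver that follows the convention prints the character once.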
A binary file containing the raw 16-bit CC data is also provided. The format is very straightforward: it is an array of 16-bit WORDs, each WORD containing both caption bytes (cc1 and cc2) in the order transmitted on Line 21. If you strip off the MSB (parity bit) of each byte, you will see ASCII text mixed in with control codes.

For complete testing, a "donor" MPEG file is provided as well, in which the CC data has been inserted into the user data packets, one packet per GOP, in the I-frame.

For reference, recordings are provided of actual TV sets decoding the CC data.

-----------------------------------------------------------

UPDATE: Doh, I realize that in producing this file I made a minor mistake: I used a 25fps PAL framerate MPEG-1 video stream to test! It's not entirely a waste, of course, since it seems to reveal (when played on NTSC monitors) which TV sets do or do not count padding (0x8080) between consecutive special characters. But from this point on, I need to remind myself that the "donor" MPEG stream needs to use NTSC frame rates!

The 25fps snafu has revealed, for example, that Sylvania and LG TV sets don't count padding (0x8080) when considering duplicate chars, but Viewsonic TVs do count it. You'll notice in the stills that the Viewsonic does not decode the doubled special chars because of that: the 25fps to 29.97fps framerate conversion ensures that sometimes there is padding between the two copies!
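
To make the binary format and the padding behaviour concrete, here is a small Python sketch. It reads the raw data as consecutive byte pairs, strips the parity bit, and collapses doubled two-byte codes, with a switch for whether padding (0x8080) between the two copies breaks the match. The assumptions are mine: that two-byte codes are recognizable by a first byte in the range 0x10-0x1F after parity is stripped, and that "counting padding" means the second copy is then treated as a new code. All function names are hypothetical.

```python
PADDING = (0x00, 0x00)  # 0x8080 once the parity bits are stripped


def strip_parity(raw):
    """Split raw CC bytes into (cc1, cc2) pairs with the parity MSB cleared."""
    usable = len(raw) - (len(raw) % 2)  # ignore a trailing odd byte
    return [(raw[i] & 0x7F, raw[i + 1] & 0x7F) for i in range(0, usable, 2)]


def is_two_byte_code(pair):
    # Assumption: control, special, and extended codes start with 0x10-0x1F.
    return 0x10 <= pair[0] <= 0x1F


def collapse_doubles(pairs, padding_breaks_double):
    """Drop padding and the second copy of each doubled two-byte code.

    padding_breaks_double=False models a receiver that ignores padding when
    looking for duplicates; True models one that counts it.
    """
    out = []
    last = None
    for pair in pairs:
        if pair == PADDING:
            if padding_breaks_double:
                last = None  # the next code no longer counts as a repeat
            continue
        if is_two_byte_code(pair) and pair == last:
            last = None  # second copy consumed; a third copy is a new code
            continue
        out.append(pair)
        last = pair if is_two_byte_code(pair) else None
    return out


if __name__ == "__main__":
    # Musical note sent twice, with one padding word in between.
    raw = bytes([0x91, 0x37, 0x80, 0x80, 0x91, 0x37])
    pairs = strip_parity(raw)
    print(len(collapse_doubles(pairs, padding_breaks_double=False)))
    print(len(collapse_doubles(pairs, padding_breaks_double=True)))
```

Running the two modes over the provided binary file and comparing the results against the TV recordings is one way to see which behaviour a given set follows.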