New Version of the Rich Text Format (RTF) Specification

I have the pleasure, perhaps honor :), of being the principal editor of a revised version of the RTF file format specification. The focus of this revision is interoperability with Microsoft Word, RichEdit and all those other programs that support the venerable standard. More specifically, this new version contains definitions for all of the control words that show up in RTF files, fixes some errors, improves the English, and makes the formatting consistent.

It's really worth it, since RTF is so great at allowing documents to travel fluidly forward and backward through time. HTML is reasonably good at time travel, but it's not nearly as rich. Meanwhile, XML formats require new namespaces whenever we add new features. With RTF, one just defines appropriate new control words, which older readers happily ignore and new ones can understand if they choose to.
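To make the "older readers happily ignore" idea concrete, here is a minimal Python sketch of a reader that understands only a handful of control words and skips the rest. It is an illustration under stated assumptions, not how Word or RichEdit actually parse RTF: the names `read_rtf`, `KNOWN`, `\newfeature`, and `\futuredest` are made up for the example, and the sketch handles only plain control words, braces, and the `\*` marker that flags a destination as safe to skip.

```python
import re

# Control word: backslash, lowercase letters, optional signed integer parameter,
# and one optional space that acts as a delimiter. Otherwise a control symbol
# (backslash + one character), a brace, or a run of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)?[ ]?|\\(.)|([{}])|([^\\{}]+)")

# Hypothetical feature set of an "old" reader. This sketch recognizes \b and \i
# but does not track formatting; it only extracts text.
KNOWN = {"rtf", "ansi", "b", "i", "par"}

def read_rtf(rtf):
    """Return (plain text, list of ignored control words) for a tiny RTF subset."""
    text, ignored = [], []
    depth = 0
    skip_until = None   # group depth of an unknown \* destination being skipped
    starred = False     # True right after the \* control symbol
    for m in TOKEN.finditer(rtf):
        word, param, symbol, brace, chars = m.groups()
        if brace == "{":
            depth += 1
        elif brace == "}":
            if skip_until is not None and depth == skip_until:
                skip_until = None          # the skipped group has closed
            depth -= 1
        elif skip_until is not None:
            continue                       # inside a skipped destination
        elif symbol == "*":
            starred = True
        elif symbol in ("\\", "{", "}"):
            text.append(symbol)            # escaped literal character
        elif word is not None:
            if word == "par":
                text.append("\n")
            elif word not in KNOWN:
                ignored.append(word)       # unknown control word: ignore it
                if starred:
                    skip_until = depth     # unknown \* destination: skip group
            starred = False
        elif chars is not None:
            text.append(chars.replace("\r", "").replace("\n", ""))
    return "".join(text), ignored

sample = r"{\rtf1\ansi Hello \b bold\b0  and \newfeature7 more{\*\futuredest hidden data} text.\par}"
text, ignored = read_rtf(sample)
print(repr(text))   # 'Hello bold and more text.\n'
print(ignored)      # ['newfeature', 'futuredest']
```

The text survives and the unknown pieces are dropped, which is the trade-off in a nutshell: nothing breaks, but whatever the reader ignores is lost if the file is written back out.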

Sometimes it's tricky work because the people who wrote the underlying code have left the Word team, if not Microsoft, and one has to reverse engineer a lot. But that's something my colleagues and I have been doing for years in maintaining and generalizing RichEdit's RTF converters. You just enter a construct in Word, save it as RTF, and look at it with a plain-text editor. Also, Word can be "too helpful" in editing the document: AutoFormat, smart quotes, and background spelling need to be disabled, or else i's get capitalized when they shouldn't be, curly quotes get used when ASCII quotes should be used, and the second of two capitals gets lowercased.
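The "save it as RTF and look at it" approach can be partly automated. The following Python sketch, with hypothetical helper names `control_words` and `new_control_words`, lists the control words that appear in one RTF string but not in another, so you can see what a single construct adds. It assumes you have already saved a "before" and an "after" file from Word and read them into strings; the two short strings below are stand-ins for those files.

```python
import re
from collections import Counter

# Match a control word (group 1 = name, group 2 = optional numeric parameter)
# or a control symbol such as \\ or \{ (group 1 stays None), so that an
# escaped backslash followed by letters is not mistaken for a control word.
CONTROL = re.compile(r"\\(?:([a-z]+)(-?\d+)?|.)", re.DOTALL)

def control_words(rtf):
    """Count how often each control word name occurs in an RTF string."""
    return Counter(m.group(1) for m in CONTROL.finditer(rtf) if m.group(1))

def new_control_words(before, after):
    """Control word names present in `after` but absent from `before`."""
    return sorted(set(control_words(after)) - set(control_words(before)))

before = r"{\rtf1\ansi plain text\par}"
after = r"{\rtf1\ansi plain {\b bold} text\par}"
print(new_control_words(before, after))   # ['b']
```

With real Word output the lists are much longer, which is one more reason to turn off the automatic edits first: they add control words of their own.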

Word has a very cool Compare feature that's invaluable for projects like this one. It lets you see all the changes you've made to a document without confronting you with revision marks as you edit. Go to the Review tab, choose Compare, and enter the file names of the original and revised documents. Then after a little while (at least with the 300-page RTF specification), up comes the compared version with revision markings for all changes. You can also see a revision pane with a summary of the corrections as well as windows with the original and revised documents. To display all four windows, it's handy to have a large screen. In particular, the Compare facility readily reveals whether Word has made background changes you don't want.

Many people have helped with this revision, both in spotting problems by examining myriad RTF documents and in filling in gaps in understanding from personal experience and/or examination of the Word code. Now it's your turn. If you want to see changes in the RTF specification, please send them to me.

- Murray Sargent

Office Blogs Comments



  • It's worth remembering that Novell's MS antitrust complaint, available at gl.scofacts.org/gl-20041115214025458.html, alleges that changes to RTF were part of the barrier to competition it faced (paragraphs 90-92 excerpted below). The statements in this complaint seem to be at odds with your "travel fluidly forward and backward in time" comment.

    I note that you've had to reverse-engineer some items, since the original knowledge appears to have been lost, including generalizing some RichEdit conversion features. This does not inspire confidence that the format is a good basis for interoperability. Also, you claim that older readers can merely ignore new control words; how do you define interoperability in this case (e.g. an old reader discarding unrecognized items in a round-trip scenario)? It's hard to claim that RTF is a credible interoperable file format when you note that sometimes people had to read the MS Word source code to understand some things... too bad if you were a competitor that didn't have access to the Word sources.

    And finally, I'm interested in a clearer explanation of why RTF's single flat namespace is better than XML's multiple namespaces when extending a specification. --recondite

    -------------

    90. Third, Microsoft unilaterally made the proprietary Rich Text Format ("RTF") of Microsoft Word the standard file format for text-based documents in applications developed for Windows. Upon capturing the standard, Microsoft strategically withheld the specification to injure competitors, including Novell.

    91. As Microsoft knew, a truly standard file format that was open to all ISVs would have enhanced competition in the market for word processing applications, because such a standard allows the exchange of text files between different word processing applications used by different customers. A user wishing to exchange a text file with a second user running a different word processing application could simply convert his file to the standard format, and the second user then could convert the file from the standard format into his own word processor's format. Thus, a law firm, for instance, could continue to use WordPerfect (which was the favorite word processor of the legal profession), so long as it could convert and edit client documents created in Microsoft Word, if that is what clients happened to use. Microsoft knew that if it controlled the convertibility of documents through its control of the RTF standard, then Microsoft would be able to exclude competing word processing applications from the market and force customers to adopt Microsoft Word, as it soon did.

    92. The specifications for RTF were readily available to Microsoft's applications developers, because RTF was the format they themselves developed for Microsoft's office productivity applications. Microsoft withheld the RTF specifications from Novell, however, forcing Novell to engage in a perpetual, costly effort to comply with a critical "industry standard" that was, in reality, nothing more than the preference of its chief competitor, Word. Indeed, whenever Word changed its own file format, Microsoft unilaterally and identically changed the RTF standard for Windows, forcing Novell and other ISVs constantly to redevelop their applications. In this manner, Microsoft gave Word a permanent, insurmountable lead in time-to-market, and made document conversions difficult for users otherwise interested in running non-Microsoft applications. Many WordPerfect users were thus forced to switch to Microsoft Word, which predictably monopolized the word processing market.

  • recondite makes a number of comments on RTF format fidelity. The idea that RTF files can “travel fluidly forward and backward in time” does assume, of course, that any information a reader ignores is lost. But the degree of loss depends on the sophistication of the reader and is almost always much less than that incurred using plain text. For example, RichEdit doesn’t support footnotes and simply ignores footnote RTF. But people find RichEdit very useful anyway. In fact, some people deliberately read Word RTF into WordPad (which uses the OS RichEdit 4.1) and write it back out to obtain simplified RTF. A very high degree of fidelity can be achieved with the vast majority of RTF files without handling the rare RTF control words we researched and documented.

    If full interchange fidelity with Word is a primary concern, I’d recommend using OOXML instead of RTF. But don’t forget that RTF was invented long before XML, back when XML’s parent, SGML, was the only significant nonbinary format alternative. SGML was considerably more difficult to work with and didn’t have the extensive ecosystem that XML now enjoys.

    Note that due to the large number of RTF control words in Word 2007 and the correspondingly large research effort to verify the documentation, it was certainly useful to use whatever tools we had at our disposal, including examining the Word source code. Earlier versions of Word were considerably simpler, a fact reflected in the RTF they generated. Also, reverse engineering by examining Word-generated RTF is often the easiest way to understand the meaning of RTF sequences, regardless of how complete the documentation is. At least I’ve always found that learning by example is one of the best ways to learn, whether I’m learning physics, music, swimming, or RTF :)

    With regard to extensibility, namespaces in XML are fine for major extensions and provide useful clarity to a set of changes. But many changes are minor extensions of what’s already there, and a whole namespace for such things can lead to an awful lot of notation overhead. An example is adding a nifty small numeric fraction to the Office math XML (OMML). The underlying Office 2007 math engine can display fractions like ½ with arbitrary numerators and denominators and perfect typography. But the feature was developed too late to incorporate into Word 2007. To extend the OMML fraction type attribute to include the small numeric fraction, we’d need to add a new namespace. In RTF, we’d just add a definition for a new value. And if a reader doesn’t know what that value means, it’d ignore it and display the usual stacked fraction.

    Thanks

    Murray

  • Thanks for the response. I'll have a think about what you've said, and try to make a more coherent response in a day or so. [FYI, my starting point is to try and map out what software ecological niche RTF has served and/or sought to serve in the past, especially the pre-IS29500 days, versus its role and function now and into the future.] --recondite

  • G'day, I couldn't find a proper specification for RTF on the MS website. I only found a file named "Word2007RTFSpec9.doc" (or, alternatively, a .docx version) which describes itself as a "White Paper", with Microsoft claiming that the information in the document only represents its current opinion. The text on the first page is worth repeating in full:

    -- (start first-page excerpt)

    The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

    This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

    © 2008 Microsoft Corporation. All rights reserved.

    Microsoft, MS-DOS, Windows, Windows NT, Windows Server, ActiveX, Excel, FrontPage, InfoPath, IntelliSense, JScript, OneNote, Outlook, PivotChart, PivotTable, PowerPoint, SharePoint, ShapeSheet, Visual Basic, Visual C++, Visual C#, Visual Studio, Visual Web Developer, Visio are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

    All other trademarks are property of their respective owners.

    License Agreement

    -- (end first-page excerpt)

    I tried looking for a license agreement for this document prior to downloading, but couldn't find one easily... is the full text for the license governing the document accessible without downloading the document? --recondite

  • Regarding "namespaces": My quick scan of Wikipedia notes that namespaces are used to provide extra naming information to allow items to be uniquely described where the shorter name might be ambiguous, e.g. "Employee ID 123" might refer to "Fred Bloggs" in Company X, but to "Sally Jones" in Company Y. I see namespaces as valuable in interchange formats when they can reduce or preferably eliminate the likelihood that item naming conflicts will arise between independent users of the specification, and/or between current users and future versions of the specification.

    For example, I might have defined an item named "vista" as a custom extension to an RTF document workflow back in 1999 in order to serve some private purpose, but might find that in a future revision to the specification, Microsoft chooses to use the name "vista" to mean something completely different -- and my old documents become forwards-incompatible, new documents are backwards-incompatible, and both cannot "travel fluidly forward and backward through time". A namespace facility would have helped in the example above if it allowed me to define my extensions in a non-conflicting way, so that the different definitions of "vista" could coexist without conflicting.

    So, regarding namespaces, my problem with your initial announcement is that your use of terms and phrases like "interoperability" and "travel fluidly [...] through time" is based on a Microsoft-centric point of view, and these terms may not be equally applicable to third-party users. Given that RTF aspires to be expressive but relatively simple in syntax and semantics, and namespaces would appear to be outside its scope, my bottom line is that I'd prefer that announcements such as your initial blog posting be clearer or perhaps more careful when using terms such as "interoperability", as an independent reader of the announcement may form a different expectation to your intended meaning. --recondite

  • One last comment in this batch, probably at least until the weekend following any response: I believe that the meaning of a human-readable document cannot, in general, be separated from details of its presentation. For example, a name may be italicised to distinguish it from surrounding text; removing the italics may change the way that the text is interpreted by a "typical" reader. Given this starting point, the policy of an old reader to ignore unrecognised items may change the meaning of the transferred text, and so impair interoperability. How does RTF deal with this situation? (I notice, from the Wikipedia article, that the Unicode character item includes an alternative-character specification to use if the Unicode character can't be rendered. This is a specific case; is there a general mechanism?) --recondite

  • MurrayS said: "If full interchange fidelity with Word is a primary concern, I’d recommend using OOXML instead of RTF." Nonsense. It would make much more sense to use an older Word format. Most word processing applications support legacy Word formats rather well, in some cases even surpassing the support included in later versions of Microsoft Office. Microsoft has already released extensive documentation on these formats, and if one wanted to write one's own application, one could simply rip out the format code from a project such as OpenOffice.org (which is licensed under LGPL, a license common to open source libraries). You could argue that this cuts you off from using the latest and greatest features of Office 2007, but I would point out to you that many people are still using Office 2003 or earlier. You can get add-ins for Office 2003 to support the OOXML-like Office 2007 format, but I don't think one can really expect to get greater fidelity from such a document in Office 2003 than one would if the document were just saved as an Office 2003 file.


  • How come .rtf is so hard to use on smartphones? I have gone through iOS, BB and Android, and NONE of the apps/phone viewers is able to open a link to a .rtf file. Why?
