Book Master version 1 documentation
This specification is superseded by bookmaster-2.
This specification and the corresponding bixutls.py program evolved over a period of years as the technologies were being developed. Eventually, the limitations of the way this XML format evolved became cumbersome, and the original formats from which the bookmaster XML format was derived (bookindex and bookix) had become extinct. At that point, it was no longer necessary to retain XML as a basis for the format. Hence this specification, developed as "bookmaster" has now become "bookmaster version 1", and bixutls.py was enhanced to convert the bookmaster-1 format to the bookmaster-2 format.
As soon as all extant bookmaster-1 files have been converted to bookmaster-2 format, and the bookmaster-2 format is fully documented, this specification will have only historical significance.
For the record, some of the cumbersome issues with XML were:
- XML generators don't preserve the order of attributes for a tag. Due to the overuse of attributes on the <book> tag, order preservation became important, and generation of the format had to be done with custom code instead of just an XML dump.
- Overuse of attributes also made it difficult to generate lists of key/value pairs. Over time, the importance of lists of key/value pairs increased, and various workarounds were applied. The straw that broke the camel's back was when key/multi-value tuples were needed. This became extremely cumbersome.
- Standard XML parsers don't produce particularly useful error messages, although the one used here does pinpoint the exact line and character position of an error, which overcomes the deficiency of the messages somewhat. Errors in the source file always produced "interesting" stack traces.
- Limitations on the character set used for attribute names required extra work and features to work around them.
The Book Master file format can be used to produce a collection of short items with simple formatting such as hymns and poems in various e-book formats usable by various generic and customized e-book readers. Note that the goal of the .pdf and .txt formats is not to produce printable books (those are available from other sources) but to provide easily searchable books, so particular hymns or poems can be located quickly, based on remembered phrases or words.
The bixutls.py program is a set of Python 3.1+ utility functions and file format conversion functions that can be used to manipulate and convert between bookmaster, bookix, and bookindex formats. It can also produce other formats such as plain txt, PDF, epub, and mobi (the latter three being standardized publishing formats). bixutils.py makes most of its functionality available from the command line as a standalone program as well as the utility functions which can be imported into other Python 3.2+ programs.
The program was originally created to manipulate bookindex and bookix files. It then seemed appropriate to have a master file format (.master.txt) from which the others could be created as needed, and it was later observed that some other formats could usefully be generated as well. Programmers may contact me for more details about the bixutils.py program.
Song Printer for Windows
One application of the Book Master format and bixutls.py is to convert to and from Song Printer for Windows format. Song Printer for Windows is a music typesetting program targeted for creation of hymnbooks with 4-part harmony.
The details of the file formats are given below. The bookmaster format is the primary format, and the bookindex and bookix are similar, so are defined in terms of concepts defined for the bookmaster format.
All the file formats defined here use UTF-8 character encoding.
bookmaster (.master.txt) format
The bookmaster format is a variation of XML. A DTD has not been created for it, but probably could be. An XML parser will parse it successfully, although without validation. A listing of the tags is below, and for each tag, a listing of attributes that are defined, and their expected values. Programs that process bookmaster files must be prepared to do default handling of new attributes (possibly as simple as ignoring them) which may be defined in the future.
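As a minimal sketch of the default handling described above, the following uses Python's standard xml.etree.ElementTree parser to read a book and set aside attributes it does not recognize. The function name read_book, the KNOWN_BOOK_ATTRS subset, and the sample text (including the made-up futureattr attribute) are assumptions for illustration only; they are not taken from bixutls.py. The sketch also assumes that <item> tags are nested inside the <book> tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of the defined <book> attributes, for illustration only.
KNOWN_BOOK_ATTRS = {"version", "rootfile", "type", "language", "title"}


def read_book(xml_text, known=KNOWN_BOOK_ATTRS):
    """Parse a bookmaster-1 text; split <book> attributes into known and unknown."""
    book = ET.fromstring(xml_text)
    if book.tag != "book":
        raise ValueError("expected a <book> root element")
    recognized = {k: v for k, v in book.attrib.items() if k in known}
    # Unknown attributes are not an error: here they are simply reported and ignored.
    ignored = sorted(k for k in book.attrib if k not in known)
    items = [(dict(item.attrib), item.text or "") for item in book.iter("item")]
    return recognized, ignored, items


SAMPLE = (
    '<book version="15.06.30" title="Hymns" futureattr="x">'
    '<item number="1" title="First">Line one</item>'
    "</book>"
)

recognized, ignored, items = read_book(SAMPLE)
print(recognized, ignored, items)
```

Ignoring unknown attributes is the simplest default; a converter that must round-trip a file would need to carry them through unchanged instead.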
Programs using the bookmaster format include bixutls.py.
Used to enclose a complete book. Attributes are:
(1) Note that all characters used in the items in the file must be found in one of sort, ideograph, whitechars, or otherchars.
- version - version of the book. Generally, the last two digits of the year (or all four after 2099), a period, the month digits, a period, and the day digits of the last edit made to the master file.
- rootfile - consists of the following parts separated by ".":
The rootfile is also the base filename for derived file formats, with various extensions added to discriminate among those formats. Generally, the rootfile is used, followed by .master.txt, as the name of the bookmaster file. Whatever name is used for the .master.txt file must also be used for .cover.jpg, which is the file containing the cover image. Optionally, if the rootfile name is undesirable or inappropriate for a particular format, there can be other attributes, named the same as the derived file extension, but prefixed with "rootfile", that override the rootfile for that format. A special value of "None" means not to generate that format. The present list of format extensions follows without description. More may be added.
- type - the type of material contained in the book, currently known types are:
- language - the words used in the native language to describe itself (see also englishlanguage below). This should come from the text of the ISO 639-1 or ISO 639-2 standards. Not all languages are supported by the Amazon Kindle.
- title - title of the book in the native language (see also englishtitle and name below).
- rootfile.bookindex - see rootfile
- rootfile.bookix - see rootfile
- rootfile.txt - see rootfile
- rootfile.pdf - see rootfile
- rootfile.x.pdf - see rootfile
- rootfile.c.pdf - see rootfile (a format for printing)
- rootfile.t.pdf - see rootfile (a format for printing)
- rootfile.epub - see rootfile
- rootfile.mobi - see rootfile
- rootfile.gbk.twm - see rootfile
- rootfile.bok.mybible - see rootfile
- rootfile.tsv - see rootfile
- rootfile.wordlist.txt - see rootfile
- rootfile.words.js - see rootfile
- rootfile.toc.js - see rootfile
- rootfile.ix.js - see rootfile
- rootfile.metric.js - see rootfile
- rootfile.author.js - see rootfile
- rootfile.composer.js - see rootfile
- abbrevtitle - a short form of the book title, for use in tabs in theWord. Other similar uses can be envisioned.
- englishtitle - the title of the book in English, or an approximation. Needed for theWord format, but also handy for reference by English speaking people.
- name - the title of the book for use by the Hymnbooks program. Since Hymnbooks doesn't include the language in the listing, titles from different languages that share some common words may have non-unique titles, so the name is an expanded version of the title used to clarify which book is being referred to.
- language - the name of the language of the book as written in the language of the book
- englishlanguage - the name of the language of the book as written in the English language. This should come from the text of the ISO 639-1 or ISO 639-2 standards.
- sort - space separated list of groups of characters that sort equivalently. These, and only these, characters define the interesting part of a word, and are used for sorting of indices and word lists. (1)
- ideograph - a list of ranges of Unicode characters that should be treated as ideographs. (1)
- whitechars - a list of white space characters used to separate words. (1)
- splitchars - a list of characters used to split words. The space character should be included, and should be last.
- otherchars - a list of other characters, such as punctuation, that do not separate words, but are not parts of the words, either, so are stripped off when searching for words. Some may even be used within a word, separating parts of it from other parts of it (in English, hyphens and apostrophes are commonly found). Characters within a word may impact some search algorithms. If hyphen is included in the list, it must be first. (1)
- quotes - pairs of quote characters that should be balanced within a text. Note that if the same character is used to start and end a quote, no balancing calculations can be performed, or are useless.
- repeat - character sequence used to both begin and end a repeated text sequence of lyrics
- sequence & sequence.PRE - book item numbers are expected to be a monotonically non-decreasing sequence of positive numbers. By default, it is just the digits themselves, with a possible suffix for adjacent variations on the same theme. If multiple sections of a book exist with independent numbering, then sequence is used to specify a name for the main section, and all other sections are expected to have a prefix string prior to their numbers. The name of each section is given by the value of sequence.PRE where PRE is the prefix for the numbers of the named section. Each section is expected to be contiguous, with a monotonically non-decreasing sequence of positive numbers, or be a single item with no sequence number. The order of items in the bookmaster file is expected to contain one section after another, in the appropriate order. It would be foolish to include digits in the prefixes, particularly at either end.
- translate.CODE - used to provide translations to HTML Hymns ebooks so that the ebooks can appear in the same language as the book. CODE is a text used as a placeholder in the code to be replaced with the translated text.
- metatrans.N - used to manipulate metadata (item attributes). N is a number with value greater than 1, without leading zeros. Starting at 1, a search is made for metatrans attributes with increasing values. If more than 10 numbers are skipped, the search ends. The value is composed of three parts, separated by “¤” characters:
- attribute to match
- prefix of value to match
- new-attr = new-value
If the matches happen, the new-attr is created on that same item with new-value as its value. One use is to extract copyright information from the SPW attributes, and create a normal copyright attribute.
- metatranslate.N - also used to manipulate metadata like metatrans.N, but differently. metatranslate.N simply converts or translates an item attribute name into a better name for including in the book. The value is composed of two parts, separated by “¤” characters.
- attribute to match
- replacement text of attribute name
- hdrattr - overrides the use of title in the table of contents, and at the top of each item. This is particularly useful when there are no titles, and the table of contents would have just numbers. Specify the name of an item attribute to use instead, as the value for this book attribute.
- hdrsplit - provides a sequence of characters which are used to split the header value. More detail provided under calculated item attributes.
- useordinals - overrides the use of number in the table of contents, indexes, and chapter titles. This is particularly useful when there are no numbers, but you want numbers to appear in those places for reference.
- skipchorus - the __chorus attribute is generally derived from the first line of the chorus (which are identified as lines starting with the TAB character). If that first line matches the value of this attribute, the second line of the chorus is used instead.
- chorus - a word to insert in front of indented chorus lines: usually a translation of the word Chorus or Refrain.
- chorus.font-style - normal, italic, oblique
- font-family - comma-separated list of font names appropriate for this text
- sanitize.always - space-separated list of groups of characters that are translated to the final character of the group when analyzing user search input, and when transforming words to insert into the word list for searching.
- sanitize.maybe - space-separated list of groups of characters that should be treated equivalently in an expanded search.
- unmappedattrs - ¤-separated list of the names of item attributes that, when sorted, should not be mapped via the sort mapping for the book. This may be appropriate when punctuation or white space is important to the sort, such as for a metric index.
- cpdf.page - ¤ separated list of values for use with cpdf format. In order: page width inches ¤ page height inches ¤ page margin inches ¤ number columns ¤ inches between columns ¤ indent points ¤ numwid points ¤ padding points
- cpdf.font - ¤ separated list of values for use with cpdf format. In order: default font size ¤ default leading ¤ book title font size ¤ book title leading ¤ item title font size ¤ item title leading
- tpdf.page - ¤ separated list of values for use with tpdf format. In order: page width inches ¤ page height inches ¤ page margin inches ¤ cover width pixels ¤ cover height pixels ¤ indent points ¤ numwid points ¤ padding points
- tpdf.font - ¤ separated list of values for use with tpdf format. In order: default font size ¤ default leading ¤ book title font size ¤ book title leading ¤ item title font size ¤ item title leading
- ix.count - specify the number of indexes defined. This number must be exact. If omitted, no indexes will be generated.
- ix.N.formats - ¤ separated list of formats for which this index should be generated.
- ix.N.title - ¤ separated list of title and description of index N (where N is from 1 to the value of ix.count). The description is used in the table of contents.
- ix.N.what - ¤ separated list of item attributes that contribute to this index, in a single sorted order. For music, a common list would be "title ¤ __chorus ¤ __first". A separate index might have "¤¤meter", or "¤¤__hdrshort". Not all items are required to have all, or even any, of the attributes. Those that exist will be indexed. The first type of item in the list will be in bold font when included, the second will be in italic font. Other items will be in upright (normal) font. Use empty entries in the list to avoid the special fonts.
- ix.N.type - a list of values separated by “¤” characters. The first value selects a type of index, and defines any additional values in the list:
- numbers - starting with the second number, list a single group of every third number. No additional values are allowed. If this index is specified, it is placed before the book content, except for .txt and .pdf formats. ix.N.what, ix.N.min, and ix.N.max are ignored for this type of index.
- unsorted - entries are listed in the order they were discovered, which is the order in the bookmaster file. This is the default if this attribute is omitted. No additional values are allowed.
- sorted - a dictionary sort is applied to all the values contributing to the index. No additional values are allowed.
- sortedby - a dictionary sort is applied to all the items contributing to the index, but additional values in this entry specify corresponding sort attributes that are different than the displayed attributes. (They could be the same, but use "sorted" for that!)
- groups - attribute values are sorted, and groups of items with like attributes are listed under each group. A second value in the list specifies either “unsorted” or “sorted” which applies to the entries in each group. Subsequent entries specify the attributes to use to identify each item.
Groups ignore the special fonts from ix.N.what, but use a similar scheme here: the third item would be in bold font, the fourth italic, and the fifth and subsequent items would be in upright (normal) font.
If “number” is specified alone, they are listed across the line instead of one per line, and no special font is used.
If there is no specification, “__hdrshort” is assumed, and no special font is used.
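The font rule given above for ix.N.what (first entry bold, second italic, the rest upright, with empty entries used to skip the special fonts) can be sketched as follows. The function name parse_ix_what and the style labels are assumptions for illustration, as is the choice to trim white space around each “¤”; this is not the code used by bixutls.py.

```python
def parse_ix_what(value):
    """Split an ix.N.what value and pair each attribute name with a font style."""
    styles = ["bold", "italic"]  # positions 0 and 1; later positions are upright
    result = []
    for position, name in enumerate(value.split("¤")):
        name = name.strip()  # assumption: spaces around the separator are not significant
        if not name:
            continue  # an empty entry only consumes a position in the list
        style = styles[position] if position < len(styles) else "normal"
        result.append((name, style))
    return result


print(parse_ix_what("title ¤ __chorus ¤ __first"))
print(parse_ix_what("¤¤meter"))
```

With the two example values from the ix.N.what description, the first call pairs title with bold, __chorus with italic, and __first with normal, while the second yields only meter in normal font.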
No attributes. Must precede all <item> tags.
Content is descriptive text for the book, preferably in English, possibly also in the language of the book.
The content of each item in the book consists of the plain text of the item. Lines of text indented with one tab may be italicized in some output formats, and may be considered the chorus or refrain of a hymn. Lines of text beginning with ¤ followed by one tab will be elided in most output formats (used for echo voices for hymns for Song Printer for Windows). Song Printer for Windows needs syllabication, so additional ¤ characters have been used to divide syllables according to the rules for Song Printer for Windows (see its documentation) -- these are all stripped when exporting to other formats.
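The line conventions just described could be applied on export roughly as in the sketch below. The function name export_lines and the (is_chorus, text) result shape are assumptions for illustration and are not taken from bixutls.py.

```python
def export_lines(item_text):
    """Return (is_chorus, text) pairs for the lines of one item, ready for export."""
    result = []
    for line in item_text.splitlines():
        if line.startswith("¤\t"):
            continue  # echo-voice line: elided in most output formats
        is_chorus = line.startswith("\t")
        if is_chorus:
            line = line[1:]  # drop the single indenting tab
        # Remaining ¤ characters are syllable breaks and are stripped on export.
        result.append((is_chorus, line.replace("¤", "")))
    return result


sample = "A¤maz¤ing grace\n\tPraise the Lord\n¤\techo voice"
print(export_lines(sample))
```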
New rules for syllabication and note skipping have been established in 2015. These can be briefly described.
There are 5 special characters: space, hyphen, ¤, ␣, and █. Each text corresponding to a musical note (syllable) contains other characters, and ends with one of the first 3 special characters. Space means end of word. Hyphen means a syllable break between syllables of a compound word, where the hyphen is part of the spelling and must appear. ¤ is used for other syllable breaks, where a hyphen may optionally be generated if the adjacent syllables must be spaced apart during formatting. A space is implied at the end of a line if no other syllable end character is found. The 4th special character, ␣, is used as a placeholder between syllables or at the end of line where there is a soprano note that does not have a corresponding text. The 5th special character, █, is used to denote a desired horizontal gap in the music and lyrics, which may alternately be used as a line break for narrow formats.
These can be translated back to Song Printer for Windows notation if required.
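The 2015 rules above can be illustrated with a small tokenizer. This is a sketch only: the names split_syllables and plain_text, the token tuples, and the decision to drop ␣ and █ from the plain text are assumptions, and any leading chorus tab is assumed to have been removed beforehand.

```python
import re

# Either a standalone ␣ / █ marker, or syllable text plus an optional end character.
TOKEN = re.compile(r"([␣█])|([^ \-¤␣█]+)([ \-¤]?)")


def split_syllables(line):
    """Split one lyric line into (kind, text, end_character) tokens."""
    tokens = []
    for match in TOKEN.finditer(line):
        marker, text, end = match.groups()
        if marker == "␣":
            tokens.append(("placeholder", "", ""))  # note without text
        elif marker == "█":
            tokens.append(("gap", "", ""))  # desired horizontal gap
        else:
            # A space is implied when no end character follows the syllable.
            tokens.append(("syllable", text, end or " "))
    return tokens


def plain_text(line):
    """Rebuild readable text: keep spaces and hyphens, drop ¤ breaks and markers."""
    parts = []
    for kind, text, end in split_syllables(line):
        if kind == "syllable":
            parts.append(text + ("" if end == "¤" else end))
    return "".join(parts).strip()


print(split_syllables("A¤maz¤ing grace"))
print(plain_text("well-known ␣"))
```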
Programs that read bookmaster or bookix texts should be prepared to handle new attributes which may be added in the future.
Attributes fall into three categories, those used for Song Printer for Windows only, those used more generally, and those that can be calculated by bixutils.py.
Attributes specifically for Song Printer for Windows all start with "_SPW":
- _SPWCopyright - the text of the copyright notice
- _SPWEnglish - the number of this item in the English book, if it exists there.
- _SPWForm - N for normal
- _SPWprevtune - text to indicate that the hymn should be sung to the preceding tune.
- _SPWNumber - the number of the item in the printed book.
- author - the creator of the words or lyrics of the item.
- composer - the creator of the music for the item.
- meter - sequence of counts per line for determining metric foot.
- number - a sequence number of the item.
- _ebook_suffix - a unique identifier of the item, suffixed to repeated numbers (variations on the same theme) to make them visually unique to the user, and to make the file name for any related content unique as well.
- _ebook_alternate - an alternate number, all digits, for numbers that actually include other characters besides digits.
- title - the title of the item.
- topic - the topic of the item.
- tune - the name of the tune.
Calculated attributes (names start with two underscore characters):
- __ordinal - the count of the item, starting at 1 for the first item
- __first - the first line of text
- __chorus - the first line of the chorus
- __title1st - the part of the title attribute before the first instance of the value of the hdrsplit attribute
- __title2nd - the part of the title attribute after the first instance of the hdrsplit attribute and before the second instance of the value of the hdrsplit attribute
- __hdrlong - the "number" attribute if it exists (or the ordinal if there is a book attribute named "useordinals") and the "title" attribute if it exists (or the attribute specified in the book attribute named "hdrattr", if it is specified), separated by ". " if both exist. This value can be used to create an index that is like a traditional table of contents, and it is used in "groups" style indexes as well.
- __hdrshort - the value of __hdrlong up to the first instance of the hdrsplit book attribute value. Used in group indexes, and can be used in other indexes.
- __hdrsplit - the list of fragments of the __hdrlong value when split by the hdrsplit book attribute value. Used for chapter titles.
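A minimal sketch of how the header-related calculated attributes listed above might be derived is shown here. The function name calculated_headers and the plain-dict representation of item and book attributes are assumptions for illustration. Two behaviors are my own reading rather than stated rules: useordinals replaces the number whenever it is present, and a missing hdrsplit leaves the values unsplit.

```python
def calculated_headers(item, book, ordinal):
    """Derive header attributes for one item from item and book attribute dicts."""
    number = str(ordinal) if "useordinals" in book else item.get("number")
    heading = item.get(book.get("hdrattr", "title"))
    # Join with ". " only when both parts exist.
    hdrlong = ". ".join(part for part in (number, heading) if part)
    sep = book.get("hdrsplit")
    fragments = hdrlong.split(sep) if sep else [hdrlong]
    title = item.get("title", "")
    title_parts = title.split(sep) if sep else [title]
    return {
        "__ordinal": ordinal,
        "__hdrlong": hdrlong,
        "__hdrshort": fragments[0],
        "__hdrsplit": fragments,
        "__title1st": title_parts[0],
        "__title2nd": title_parts[1] if len(title_parts) > 1 else "",
    }


item = {"number": "12", "title": "Amazing Grace / New Britain"}
print(calculated_headers(item, {"hdrsplit": " / "}, ordinal=3))
```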
New books should all be created as .master.txt files, and bookix should only be an output format from bixutls.py. The bixutls.py program can convert from bookix format to bookindex format or bookmaster format, but the latter needs to be done with care, and with additional information supplied manually.
The bookix format is a variation of XML. A DTD has not been created for it, but probably could be. An XML parser will parse it successfully, although without validation. A listing of the tags is below, and for each tag, a listing of attributes that are defined, and their expected values. Programs that process .master.txt files must be prepared to do default handling of new attributes (possibly as simple as ignoring them) which may be defined in the future.
The bookix format is used as an input format by Book Index version 2.
The bookindex format can be created by bixutls.py from the bookmaster format.
Used to enclose a complete book. Attributes are the same as for the .master.txt format book tag.
Same as for .master.txt format.
Content is the same as for .master.txt format. The attributes are also the same, but none of the attributes starting with "_SPW" are defined for bookix format.
This format is officially unsupported and extinct, as is the Book Index program that used it. New books should all be created as .master.txt files, and bookindex should only be an output format from bixutls.py, with dwindling usage. The bixutls.py program can convert from bookindex format to bookix format or bookmaster format, but such needs to be done with care, and with additional information supplied manually.
The bookindex format is used as an input format by Book Index 2009-10-19.
The bookindex format can be created by bixutls.py from the bookmaster or bookix formats.
The bookindex format is a sequence of XML-like tags, but has no enclosing tag, so may not be properly formed XML. No attributes are used for any of the tags.
No attributes. Must be first if it exists.
The same as the value of the sort attribute for the book tag of the .master.txt format, except that this format used Perl regular expression syntax for the list of characters, requiring that metacharacters be escaped. In practice, with existing books created in bookindex format, this was only the "-" character, so it is the only one converted when reading bookindex format files.
No attributes. Must precede all <item> or <hymn> tags.
Content is descriptive text for the book.
Content is an optional item number (digits followed by a period and whitespace), followed by item text. If a number can be parsed, it is used (with a warning if non-sequential) and begins a new sequence; otherwise a sequence number is used, starting from 1 (or from the last parsed number). Duplicate numbers cause items to be combined; this is not recommended or guaranteed for the future.
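The numbering rule above might be sketched as follows. The function name number_items and the regular expression are assumptions for illustration. "From the last parsed number" is interpreted here as continuing with the next integer after it, and the combining of duplicate numbers is not modeled.

```python
import re
import warnings

# Optional leading number: digits, a period, then whitespace, then the item text.
NUMBERED = re.compile(r"\s*(\d+)\.\s+(.*)", re.DOTALL)


def number_items(contents):
    """Assign a number to each item content string, parsing one when present."""
    numbered = []
    expected = 1
    for content in contents:
        match = NUMBERED.match(content)
        if match:
            number = int(match.group(1))
            if number != expected:
                warnings.warn(f"non-sequential item number {number}, expected {expected}")
            text = match.group(2)
        else:
            number, text = expected, content
        numbered.append((number, text))
        expected = number + 1  # a parsed number begins a new sequence
    return numbered


print(number_items(["1. First", "Second", "5. Fifth", "Sixth"]))
```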
An older form of the <item> tag. Use either one, but don't mix them in the same file.