The Book Master file format can be used to produce a collection of short items with simple formatting such as song lyrics and poems in various e-book formats usable by various generic and customized e-book readers. Note that the goal of the .pdf and .txt formats is not to produce printable books (those are available from other sources) but to provide easily searchable books, so particular lyrics or poems can be located quickly, based on remembered phrases or words.
The bmutils.py program is a set of Python 3.1+ utility functions for generating ebooks and reports from a Book Master file. Most of its functionality is available from the command line as a standalone program, and the same utility functions can be imported into other Python 3.2+ programs.
The first version of bmutils.py is a simple rename of bixutls.py (documented here), but with support for simplified file system hierarchies. Support for older formats is subject to removal over time. Programmers may contact me for more details about the bmutils.py program.
Book Master 2 format is, at its heart, a UTF-8 TSV file format. UTF-8 refers to the Unicode definition of a particular character encoding scheme, and TSV refers to Tab Separated Values, a format supported by some spreadsheets. This format can be edited (with care) in a text editor (as long as it preserves TAB characters, and permits insertion of them), or in a spreadsheet program (as long as it does not expect or add extraneous quotation marks or other symbols to cell values). Examples of such programs include emacs, Notepad++, and OpenOffice Calc.
The file extension for a Book Master 2 file is .gbook.tsv
The full name of the file is expected to consist of 3 dot-separated parts prior to the extension. Before the first dot should be the type of content. Types presently supported are "Songs", "Hymns", "Poems", "Graces", "test". Between the first and second dot should be the language of the book, as known by speakers of the language, such as "English", "español", "Deutsch", etc. Between the second and third dot should be the title of the book in the language of the book and possibly the author or compiler or publisher and possibly the year of publication to make the title unique, such as "The United Methodist Hymnal 1989", "Praise! Our Songs and Hymns (Singspiration 1979)", "Don Juan by Lord Byron", "The Collected Poems of Emily Dickinson 1890", etc. If there have been or are expected to be multiple editions of the book with the same title, it is extremely useful to include the year of publication.
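The naming convention above can be checked mechanically. The following is a minimal sketch, not code taken from bmutils.py: the function name `parse_book_filename` and the `BookName` type are hypothetical, and keeping any later dots inside the title is an assumption that the description above does not confirm.

```python
from typing import NamedTuple

EXTENSION = ".gbook.tsv"
KNOWN_TYPES = {"Songs", "Hymns", "Poems", "Graces", "test"}


class BookName(NamedTuple):
    """Hypothetical container for the three parts of a Book Master file name."""
    content_type: str
    language: str
    title: str


def parse_book_filename(filename: str) -> BookName:
    """Split 'Type.Language.Title.gbook.tsv' into its three named parts."""
    if not filename.endswith(EXTENSION):
        raise ValueError(f"expected a name ending in {EXTENSION!r}: {filename!r}")
    stem = filename[: -len(EXTENSION)]
    # Assumption: split on the first two dots only, so a later dot stays in the title.
    parts = stem.split(".", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected Type.Language.Title before the extension: {filename!r}")
    content_type, language, title = parts
    if content_type not in KNOWN_TYPES:
        raise ValueError(f"unsupported content type: {content_type!r}")
    return BookName(content_type, language, title)


print(parse_book_filename("Hymns.English.The United Methodist Hymnal 1989.gbook.tsv"))
```

Rejecting unknown content types early gives the kind of specific error message that makes a hand-edited file easier to correct.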
The TSV format is extremely easy to parse. Perhaps it is a bit cumbersome to edit, but not extremely so. Good error messages produced when analyzing the input file can overcome the cumbersomeness. It is convenient to use spreadsheet terminology when talking about TSV files, such as row for line of text and cell for data between TAB characters (or before the first TAB or after the last TAB in a row). If a description is given for a limited number of cells in a row, it is presumed that all additional cells in that row should be empty.
The Book Master file is divided into a number of tables. Tables begin with a row containing only a '¶' character in the first cell, and the name of the table in the second cell. Each table is described in its own section following. Most rows in the tables have a keyword entry in the first cell and one or more data values in following cells in the same row. The specifics of the data values are given in the individual descriptions. Textual data values are often quoted in the descriptions, but should not be quoted in the file unless the quoting is part of the value. That is true only for JSON format entry values (all four are specifically documented as such) and for the lyrics part of items.
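As an illustration of the row, cell, and table terminology, here is a minimal sketch that splits the text of a Book Master file into its tables. The function name `split_tables` is hypothetical and this is not the parser used by bmutils.py; it assumes the file has already been decoded as UTF-8 and that rows are separated by newline characters.

```python
def split_tables(text):
    """Return a list of (table_name, rows) pairs; each row is a list of cells."""
    tables = []
    current_rows = None
    for line in text.split("\n"):
        cells = line.split("\t")
        if cells[0] == "¶":
            # A row whose first cell is only '¶' starts a new table;
            # the table name is in the second cell.
            name = cells[1] if len(cells) > 1 else ""
            current_rows = []
            tables.append((name, current_rows))
        elif current_rows is not None:
            # Blank rows are kept, because they are significant inside items.
            current_rows.append(cells)
        # Rows before the first '¶' row are ignored in this sketch.
    return tables


sample = "¶\tbook\nversion\t15.6.3\n¶\titem\ntitle\tExample\n\nFirst line"
for name, rows in split_tables(sample):
    print(name, rows)
```

A list of pairs is used rather than a dictionary because a table name can occur many times in one file and the order of the tables matters.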
Item numbering within a book is most often a single linear sequence of consecutive integers from 1 through the number of items, inclusive. But there are variations that have additional complexity. To deal with the complexity, some terms are defined.
A suffix is not allowed if there are no digits, and sequential items having the same sequence must be monotonically increasing (each number formed from the digits of one complete number must be greater than the number formed from the digits in the prior complete number). Neither the sequence specification nor the suffix may contain digits.
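The constraints in the previous paragraph can be sketched as a small parser. This is an illustration only: it assumes that a complete number is written as an optional sequence specification, then the digits, then an optional suffix, in that order, and the name `split_number` is hypothetical rather than part of bmutils.py.

```python
import re

# Assumed layout: non-digits (sequence), digits, non-digits (suffix).
_NUMBER = re.compile(r"(\D*)(\d*)(\D*)")


def split_number(number):
    """Split a complete number into (sequence, digits, suffix)."""
    match = _NUMBER.fullmatch(number)
    if match is None:
        # More than one run of digits, so a sequence or suffix would contain digits.
        raise ValueError(f"digits must form a single run: {number!r}")
    # With no digits the first group consumes everything, so the suffix is
    # always empty in that case, as the rule above requires.
    return match.group(1), match.group(2), match.group(3)


print(split_number("A12b"))
print(split_number("12"))
```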
Practical examples of needing the complexity include:
|version||version of the book data. Generally, the last two digits of the year (or all four after 2099), a period, one or two digits describing the month, a period, and one or two digits describing the day of the month of the last edit made to the master file. For example, 15.6.3 would represent 3 June 2015.|
|englishlanguage||The name of the language translated to English.|
|englishtitle||The title of the book translated to English.|
|language||The words used in the native language to describe itself. This is also the second part of the full name of the book. This should come from the text of the ISO 639-1 or ISO 639-2 standards.|
|repeat||A character sequence used to begin a repeated text sequence of lyrics. Because it may contain whitespace characters, this value is in JSON format for a string (surrounded by quotes, and JSON escape sequences usable to encode characters if necessary).|
|repeatend||A character sequence used to end a repeated text sequence of lyrics, if different from that used for beginning. Often the beginning sequence ends with a space, and the ending sequence begins with a space, making them different. Because it may contain whitespace characters, this value is in JSON format for a string (surrounded by quotes, and JSON escape sequences usable to encode characters if necessary).|
|_SPWpre||The prefix given to SPW output files (generally just one letter)|
|_SPWpost||The suffix given to SPW output files (generally .txt)|
|rootfile||Usually omitted. If there is some reason that the Book Master file cannot be named with the naming convention outlined above, this value can be used to override the first three parts of the output file names.|
|abbrevtitle||An extremely short form of the book title, used in directory structures on the web site, and as part of some file names of the GBook format.|
|sort||A primary sort order definition. A space separated list of groups of characters that are equivalent for sorting, in the order in which they should be sorted, at the highest level. Each group generally consists of a digit, or of both an uppercase and lowercase letter, and sometimes letters with added diacritical marks. The first character in each group is the one that will be displayed in index listings. Every character used in the file should also be included in the lists for "sort", "whitechars", "otherchars", or "ideograph".|
|sort.2||A secondary sort order definition. While it is the same syntax as "sort" above, the groupings are different. This secondary list only fine tunes the order of items that would be equivalent according to the primary sort order. An example is that some languages have characters with diacritical marks, which are generally treated as equivalent to the character without the diacritical mark for sorting... but if some character sequence would be otherwise equivalent, the accent marks may then play a role in the order. (N.B. Other languages that use characters with diacritical marks may treat the character with and without the diacritical mark as completely separate letters.)|
|whitechars||A list of white space characters that should be treated as word divisions. This list should generally always include the normal space character (U+0020), the TAB character (U+0009), and the newline character (U+000A), but may also include other characters. One common additional character is no-break space (U+00A0), which is used for synaloepha. Every character used in the file should also be included in the lists for "sort", "whitechars", "otherchars", or "ideograph". Because it contains white space characters, this list is in JSON format for a string (surrounded by quotes, and JSON escape sequences usable to encode characters if necessary).|
|otherchars||A list of other characters found in the text that do not contribute to the sorting of text, generally punctuation characters. If hyphen is included in the list, it must be first.|
When creating the word list, or searching for words, these characters are stripped off the ends of words, and if found in the middle of words, each sequence of these characters is substituted with a "•" character for more consistent matching. Every character used in the file should also be included in the lists for "sort", "whitechars", "otherchars", or "ideograph".
|caponly||A misnamed attribute used only to silence warnings about unused letters. The initial case, from which its name was derived, was for letters that happen to appear only in lowercase throughout a particular text, but which might want to be used in index groupings in uppercase form.|
|quotes||A pair of characters that should be paired in the text. bmutils.py counts the characters and warns if they are mismatched. There are some cases of poetry where mismatches might be considered appropriate, but they don't often arise. Typical pairs of characters are “”, «», and „“. Straight quotes ("") are not pairable, and are not recommended for use.|
|ideograph||A comma separated list of decimal codepoints or codepoint ranges of characters used in the book. Using this attribute is a cop-out when there is no source of linguistic input, the text is unreadable by the book creator, but is assumed to be correct, for lack of better input, and because there tend to be lots of separate characters, and creating the list for sort would be nearly impossible, due to lack of knowledge and/or input method. Sorting of ideographs is often done by codepoint anyway, and searching is more useful than sorted lists for ideographs as well. Every character used in the file should also be included in the lists for "sort", "whitechars", "otherchars", or "ideograph".|
|splitchars||A list of characters that are used to split words, even in the presence of synaloepha. Defaults to just a normal space character (U+0020), but often is defined to also include a no-break space (U+00A0). The space character should be included and should be last in the list. Because it contains whitespace characters, this list is in JSON format for a string (surrounded by quotes, and JSON escape sequences usable to encode characters if necessary).|
|moreitemattrs||Each cell may contain the name of an item attribute that is legal for this book, but not in the list of universally available item attributes. They should be listed in the order they should be emitted into the copy of the .gbook.tsv file in the report directory.|
|jsonitems||Each cell may contain an attribute to be included, if found, in the metadata at the bottom of a .json format book.|
|jsonnum||If not supplied, defaults to "_jsonnum", if supplied, specifies the name of an item attribute that should be used to obtain the number of the item for use in a .json format book. At the present time, a .json format item must have a number attribute that consists only of digits. If a book has other numbering, it must be translated before a .json format book can be produced. If the attribute named by jsonnum is not found in a particular item, the "number" attribute for that item will be used for the .json format number, and must consist only of digits. In general, the "number" attribute for an item isn't constrained to digits, but that is described in more detail in "¶ item".|
|skip||A list of items (one per cell) that name item metadata attributes that should be skipped. "number" and "title" are always skipped, as they are placed elsewhere, or used for other purposes. Item metadata attributes starting with "_" are always skipped also. The only ways to get their values into the item attributes are via the metatrans or metatranslate options.|
|interlinear||Default is "-", generally used when only one language is involved in the text of the book. Specifies a letter for the "native" translation for this ebook. "-" never appears in file names, the letters do. See also "wordsord" in this section, and "codes" in "¶ translate".|
|wordsord||A list of unique characters (one per cell) defining the order of languages, in addition to and subsequent to that specified by interlinear, that may be displayed in the text of the book. This requires the use of multiple Book Master files, one for each language, corresponding by ordinal. One Book Master file should define both "interlinear" and "wordsord", and the others should only define "interlinear".|
|chorus||Specifies the word to be used to introduce the chorus section of a poem or hymn containing a chorus. Default: Chorus|
|chorusstyle||Specifies the CSS font-style used for the text of the chorus. Note that some character sets (Chinese is an example) should never be styled with italics. Default: italic|
|sanitize.always||A space separated list of groups of characters that should always be treated as the last character of the group, if found, for the purpose of searching. This is particularly useful for languages that don't use diacriticals, except when the language is being abused. An example in English is when a word ending in the suffix "-ed" wants to use, for poetic reasons, the suffix as a separate syllable, even though it normally would merge with the prior characters to be pronounced as a single syllable. This is done by placing an accent on the "e", as "-éd". Note that some hymn typesetters have done this for the word "blesséd", even though that word can be pronounced as one syllable or two depending on the usage (see dictionary.com).|
|sanitize.maybe||Syntax is the same as sanitize.always, but this is useful for languages where the diacriticals are used with meaning as a natural part of the language, but in situations where it is hard to type them, or for folks learning the language, not finding what they were looking for, and are concerned they may have used the wrong diacriticals, they can be treated the same for the purpose of an expanded search.|
|font-family||The default font choices for displaying this language. Font selection can also be done by the user, but these choices are always used as fallback in case the user selects a font with fewer characters than needed.|
|syllabication||Default is trailing. The only effective other value is "leading", which means that the original Song Printer for Olivetti, Atari, and Windows syllabication rules are used in the source. This will be converted to trailing syntax automatically for use with the GBook format. If there is no syllabication in the item text, music mode cannot be used, and music should not be provided for items without syllabication.|
|hyphenation||Default is 0. If set to 1, it turns on the ability to count meter for songs and hymns (and poems, if syllabicated), and the option to display the hyphenation in words-only mode. Hyphenation is always displayed in music mode.|
|novno||For books containing items that do not have verses, and do not want paragraph numbers, specifying this entry turns off verse numbering throughout the book.|
|nocho||For books containing items that do not have choruses, and do not want chorus semantics applied to indented paragraphs, specifying this entry turns off chorus semantics throughout the book.|
|hdrsplit||No default. If defined, a text string that causes an item title attribute to be split into 2 parts. See discussion under Computed item attributes.|
|GBookJS.plugins||The first data value (2nd cell) contains a short form of the title to be used as the <title> attribute for the GBook. Subsequent data values each specify particular plugins that are to be activated for the GBook. See also weblinks, just below.|
|weblinks||A list of media collections for the ebook. For GBooks, these are additional media plugins, and can be accessed online, and even offline if the media format is selected when installing the offline version. For other formats that support links to web locations when online (all but .txt), links to the individual items in the media collections are listed in the metadata for that item.|
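Two kinds of book attribute values described in the table above need decoding before use: the JSON format strings ("repeat", "repeatend", "whitechars", "splitchars") and the space separated groups of "sort". The following is a minimal sketch of both, using only the standard `json` module. The function names are hypothetical, the sample values are invented for illustration, and dropping characters that are missing from the sort definition is an assumption, not the documented behavior of bmutils.py.

```python
import json


def decode_json_string(cell):
    """Decode a cell holding a JSON string, such as a "whitechars" value."""
    value = json.loads(cell)
    if not isinstance(value, str):
        raise ValueError(f"expected a JSON string, got {type(value).__name__}")
    return value


def sort_ranks(sort_value):
    """Map each character of a "sort" value to the position of its group."""
    ranks = {}
    groups = [group for group in sort_value.split(" ") if group]
    for rank, group in enumerate(groups):
        for char in group:
            ranks[char] = rank
    return ranks


def primary_key(text, ranks):
    """Build a primary sort key; characters without a rank are dropped here."""
    return [ranks[char] for char in text if char in ranks]


whitechars = decode_json_string('" \\t\\n\\u00a0"')
ranks = sort_ranks("Aa Bb Cc")
print(sorted(["cab", "Abc", "bca"], key=lambda word: primary_key(word, ranks)))
```

A secondary key for "sort.2" could be built the same way and used as the second element of a tuple key, so that it only breaks ties left by the primary order.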
|Each of these entries is additional configuration for the particular type of PDF format corresponding to the extension of that PDF file.|
The data value cells are alternating keywords and values. The legal keywords and default values are as follows (Units are given in parentheses after the value but are not part of the value):
|.txt||A list of the types of indexes to be included in the ebook of this format (list of types is below).|
|numbers||For each type of index listed (more can be created), the first data value is the organization of the index, and the remaining data values are the list of item attributes that are included in the index.|
Index organization values are:
There are no keywords for entries in this section. Instead, each section of the book is given a row with its sequence description in the first cell and its sequence designator in the second cell.
The description is used in the "lookup" tab of GBooks, and may possibly be explained in an Info section that appears in other formats of ebooks. The designator is used in GBooks as part of the navigation numbers.
Metatrans is a technique for transforming values of item attributes. Each rule is given a row in this table. The first cell is the name of an item attribute. The second cell is the value of a prefix of the item attribute. If an item attribute is found that matches, then a new item attribute is generated, having the name from the third cell and the value from the fourth cell in the rule.
Rules are limited in power, and cumbersome to use, but can be used effectively where common prefixes appear in the item attribute data, and the metadata for the ebooks wants to be simpler. A better technique might be to just transform the data in the ebook with a text editor, but there may be compatibility reasons not to do so.
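One possible reading of a metatrans rule is sketched below. It assumes that "matches" means the attribute exists and its value starts with the prefix, and that the original attribute is left in place. Neither point is stated above, and the function name and the sample attribute names are hypothetical.

```python
def apply_metatrans(attributes, rules):
    """Apply (name, prefix, new_name, new_value) rules to one item's attributes."""
    result = dict(attributes)
    for name, prefix, new_name, new_value in rules:
        value = attributes.get(name)
        # Assumption: a rule fires when the value begins with the given prefix.
        if value is not None and value.startswith(prefix):
            result[new_name] = new_value
    return result


item = {"_origin": "PD: published 1850", "title": "Example"}
rules = [("_origin", "PD", "status", "public domain")]
print(apply_metatrans(item, rules))
```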
Metatranslate is a technique for renaming item attributes. This can change what would be a visible attribute into an invisible one, or vice versa. It can change the text of the attribute name, which may be useful for translating, where attributes are named in one language, but should be displayed in a different language, or where attributes are given a coded name for some reason, but an uncoded name is desired in the ebook display.
Each translation rule is given a row in the table, with the first cell being the actual attribute name, and the second cell being the replacement attribute name.
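The renaming can be sketched as a dictionary lookup. The function name is hypothetical and this is not taken from bmutils.py. If a replacement name collides with an existing attribute, the later one wins in this sketch, which is an assumption.

```python
def apply_metatranslate(attributes, renames):
    """Rename item attributes; names without a rule are kept unchanged."""
    return {renames.get(name, name): value for name, value in attributes.items()}


# Renaming "_author" to "author" turns a hidden attribute into a visible one.
print(apply_metatranslate({"_author": "Anonymous", "title": "Example"},
                          {"_author": "author"}))
```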
N.B. Caution: these directives probably should be used once, then removed. Otherwise it would be impossible to manually edit the results of doing the cloning.
Each row has two cells. The first cell is an attribute name. If it exists, its value is used to create a new attribute, whose name is the second cell.
The operational aspects of an ebook are best provided in a language the user knows. For users using an ebook in their native tongue, that native tongue is also the best language for the operational aspects. However, for a student of a language, an ebook may be more useful if it also offers an option for the operational aspects to be provided in the native tongue of the student. For these reasons, all the text strings for the operational aspects of the GBooks are given code words that are translated into the text of the selected operational language, which may not be the language of the ebook.
The need for some of the media translations depends on the availability of media collections for a particular ebook. Most of the other codes should be translated. It is easiest to look at an existing collection of translations, and create another, but a few codes deserve specific description so the proper form of the Book Master file can be understood. Those without descriptions are "simply" translations.
|codes||This entry must exist and defines the language code for each available translation, each of which gets its own column in this table. It is best to use the same language codes as are defined in the "¶ book" section for "interlinear" and "wordsord", and these are the minimal columns required. However, additional languages may be provided for operational aspects only, and so additional language codes may be invented and used in additional rows of this table.|
|intfcfont||This item is not really a translation, but rather it is the selected CSS font-family to use for displaying operational aspects of the GBook program when this language is selected. It can be a list, for fallback reasons, as different browsers have different font collections available.|
|untranslated||This is a note that needs to be in one language only, the one that is provided because no real translations have been able to be made. This will generally be English for books I make, until I can find a translator to help.|
|nlang||This is the name of the language for use in a selection setting for choosing the operational language. Generally it should be the name of the language as used in the language itself.|
This is a collection of lines of text, preferably including some description of what the book is, in English, and may contain similar text in appropriate other languages as well. It is included in many formats, but not presently in the GBook. Much of the value is including it in the Book Master file.
An item defines one numbered item for the book. It consists of two parts separated by a blank row. The first part contains item attributes, which are similar to book attributes, but apply only to a single item. Everything after the first blank row is the text of the item.
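Splitting an item at its first blank row can be sketched as follows. The rows are assumed to be lists of cells, as produced by splitting each line on TAB characters, and the function name `split_item` is hypothetical.

```python
def split_item(rows):
    """Split an item's rows into (attributes, text_rows) at the first blank row."""
    def is_blank(cells):
        return all(cell == "" for cell in cells)

    split_at = len(rows)
    for index, cells in enumerate(rows):
        if is_blank(cells):
            split_at = index
            break
    # Keyword in the first cell, data values in the following cells.
    attributes = {cells[0]: cells[1:] for cells in rows[:split_at]}
    # Later blank rows stay in the text, where they separate groups of lines.
    return attributes, rows[split_at + 1:]


rows = [["title", "Example"], ["number", "1"], [""],
        ["First line"], [""], ["", "Chorus line"]]
attributes, text_rows = split_item(rows)
print(attributes)
print(text_rows)
```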
|title||Title of the item. If provided, is displayed at the top of the item with the display number.|
|number||Display number. It may or may not include a sequence or suffix, and should correspond to the number content in the printed book.|
|_sequence||If the number does not contain a sequence, but one is needed for uniqueness, it can be provided here.|
|_ebook_suffix||If the number does not contain a suffix, but one is needed for uniqueness, it can be provided here.|
|author||Author of the item.|
|_author||Hidden author of the item.|
|composer||Composer of the item.|
|meter||Meter of the item, if it is poetical and hyphenated. Note that bmutils.py will calculate the meter if the "hyphenation" attribute is set on the book. If the "meter" is provided, but differs from the calculated meter, the calculated meter is set as "meter2".|
|pitch||Starting pitch of the item, if it has music. [Someday this might be derived from MIDI or notation data if not provided.]|
|topic||A categorization of the item.|
|tune||Tune name of the item.|
|about||Information about the item.|
|notes||Notes regarding the item.|
|scripture||Scripture references pertaining to the item.|
|when-written||When the item was written.|
|where-written||Where the item was written.|
|_nometer||Indication that no meter should be calculated.|
|_json-skip||Indication that this item should be skipped in the .json format.|
|_jsonnum||The number that is to be used in the .json format when the display number contains duplicates or non-digits, neither of which are acceptable in that format. One can use different ranges of numbers for each section for the .json format, and, if desired, cross-reference them as metadata.|
|_jsonitems||This list of item metadata entries overrides the list defined by the "jsonitems" book attribute.|
|Copyright||An indication of the existence of a copyright. In most countries, including the U.S.A. since 1 March 1989, copyright notice is not required to get copyright protection (registration may be in some countries). A whole copyright notice can be included as the value if that is necessary or desired.|
|@||An indication that a © notice should be placed in the .json output (because it applies to words, and there is a notice in the book).|
|themusic||For items that are words-only in the music book, this specifies a __complete_num value of the music to which these lyrics should be sung. It is usually an item on the same pair of facing pages, but not always: some books have words-only sections at the back. There are even some books which reference tunes not found within the book. For this latter reason, any value starting with "→" is ignored (and could contain a human readable reference regarding where the tune might be found).|
Experience has shown that it is convenient to have some additional item attributes, that can be computed from other data values already available in a different form. Computed item attributes are given names starting with two "_" characters. The following item attributes are created as items are read.
|__ordinal||The ordinal number assigned to the item.|
|__sequence||Because the numbering can be complex, and the parts of an item's number can be split across the "number", "_sequence", and "_ebook_suffix" attributes, or only some of them, these four attributes are computed to give easy access to the 3 individual parts and the complete number.|
|__book_num||A concatenation of the "abbrevtitle", a colon, a space, and the "__complete_num".|
|meter||If "hyphenation" is set in the book attributes, and "_nometer" is not set in the item attributes, then the meter of the item is computed. If a "meter" attribute already exists, the computed meter is set as "meter2", but if "meter" doesn't already exist, the computed meter is set as "meter".|
N.B. This is an exception to the naming rule for computed item attributes.
|__first||Set to the de-syllabicated value of the first line of the first verse.|
|__chorus||Set to the de-syllabicated value of the first line of the chorus, if a chorus exists.|
|__title1st||These are not presently used, and may be eliminated.|
If "hdrsplit" is set in the book attributes, then the title is split on that character sequence, and the first part is used as the value of __title1st, and the second part is used as the value of __title2nd.
|__hdrlong||These are not presently used, and may be eliminated.|
__hdrlong is calculated as the value of the item's "number" attribute, and if there is a "title" attribute, then both ". " and the value of the "title" attribute are included. If "hdrsplit" is set in the book attributes, then "__hdrshort" is calculated as the first part of splitting the "__hdrlong" attribute just calculated, and "__hdrsplit" is calculated as a list of the parts resulting from the split. If "hdrsplit" is not defined, then "__hdrshort" is the "number" attribute, and "__hdrsplit" is a list containing the "number" and "title" attributes.
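The header calculation just described can be sketched directly. The function name is hypothetical. Leaving a missing title out of "__hdrsplit" when "hdrsplit" is not defined is an assumption, since that case is not covered above.

```python
def header_attributes(number, title=None, hdrsplit=None):
    """Compute __hdrlong, __hdrshort and __hdrsplit for one item."""
    hdrlong = f"{number}. {title}" if title else number
    if hdrsplit:
        parts = hdrlong.split(hdrsplit)
        hdrshort = parts[0]
    else:
        # Assumption: an absent title is simply omitted from the list.
        parts = [number] + ([title] if title else [])
        hdrshort = number
    return {"__hdrlong": hdrlong, "__hdrshort": hdrshort, "__hdrsplit": parts}


print(header_attributes("12", "First Part / Second Part", hdrsplit=" / "))
print(header_attributes("12", "Example Title"))
```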
Item attributes consist of information that may be somewhat private, information that is structural regarding the ebook, and information that is specific to particular applications, as well as metadata providing useful information about the item that should be available to the user. Item metadata is extracted from the item attributes using the following rules:
The text of the item consists of rows of text with occasional blank rows. Rows of text between blank rows are considered a group. A group can be treated as a paragraph, or as verses, or as a chorus. Paragraph support does not yet exist. Verses are identified as a group whose first row has its text in the first cell. A chorus is identified as a group whose first row has its first cell blank. Each row has up to four blank cells prior to the cell with text, which are used to indicate indentation. Most song and hymn lyrics would indent all the chorus lines one cell, but poetry may need additional indentation, not all lines of the group being indented the same amount. The existence of a chorus is optional. If a chorus contains rows containing "¤" in the first cell, and text in later cells, this is considered extra text, and is used in musical ebooks for echo voices or other text that is part of the lyrics in some less common manner.
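The grouping rules above can be sketched as follows, with rows again represented as lists of cells. The function names are hypothetical, and the sketch does not handle the "¤" extra-text rows.

```python
def group_rows(text_rows):
    """Collect consecutive non-blank rows into groups separated by blank rows."""
    groups, current = [], []
    for cells in text_rows:
        if all(cell == "" for cell in cells):
            if current:
                groups.append(current)
                current = []
        else:
            current.append(cells)
    if current:
        groups.append(current)
    return groups


def classify(group):
    """A group is a chorus when the first cell of its first row is blank."""
    return "chorus" if group[0][0] == "" else "verse"


def indentation(cells):
    """Count the blank cells that precede the cell containing text."""
    count = 0
    for cell in cells:
        if cell != "":
            break
        count += 1
    return count


rows = [["Verse line one"], ["Verse line two"], [""],
        ["", "Chorus line one"], ["", "", "Indented further"]]
for group in group_rows(rows):
    print(classify(group), [indentation(cells) for cells in group])
```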
Optionally, the item text can be syllabicated. The existence of syllabication in the item text is indicated by using the book attribute "hyphenation". Song Printer for Windows defined rules for syllabication of text, those are documented elsewhere, but are supported in Book Master by setting the book attribute "syllabication" to the value "leading". Without that setting, the syllabication rules are described here:
There are 5 special characters: space, hyphen, ¤, ␣, and █. Each text corresponding to a musical note (syllable) contains other characters, and ends with one of the first 3 special characters. Space means end of word. Hyphen means a syllable break between syllables of a compound word, where the hyphen is part of the spelling and must appear. ¤ is used for other syllable breaks, where a hyphen may optionally be generated if the adjacent syllables must be spaced apart during formatting. A space is implied at the end of a line if no other syllable end character is found. The 4th special character, ␣, is used as a placeholder between syllables or at the end of line where there is a soprano note that does not have a corresponding text syllable, or shares a text syllable with a prior note (i.e. slurred notes). The 5th special character, █, is used to denote a desired horizontal gap in the music and lyrics, which may alternately be used as a line break for narrow formats. When it is at the beginning of a poetic line, it indicates that that line should be formatted with the previous line, but with a horizontal gap. Also related is the no-break space used for synaloepha, discussed in the descriptions of the book attributes "whitechars" and "splitchars". These can be translated back to Song Printer for Windows notation if required.
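The role of the first three special characters can be illustrated with a small splitter. This is a sketch only: the function names are hypothetical, it does not attempt to interpret ␣ or █, and the de-syllabication shown (dropping ¤ and keeping hyphens) is an assumption about what the rules above imply.

```python
import re

# One syllable: characters other than the three terminators, then an
# optional terminator (space, hyphen, or ¤).
_SYLLABLE = re.compile(r"([^ \-¤]+)([ \-¤]?)")


def split_syllables(line):
    """Return (text, terminator) pairs for one line of syllabicated lyrics."""
    # A space is implied when no terminator is found at the end of the line.
    return [(text, end or " ") for text, end in _SYLLABLE.findall(line)]


def plain_text(line):
    """Rebuild readable text: ¤ breaks vanish, hyphens and spaces remain."""
    pieces = [text + ("" if end == "¤" else end)
              for text, end in split_syllables(line)]
    return "".join(pieces).rstrip()


print(split_syllables("A¤maz¤ing grace"))
print(plain_text("A¤maz¤ing grace"))
```

Counting the pairs returned for each line of a verse gives a syllable count per line, which is the raw material for a meter calculation.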
This signals the end of a Book Master file. Any content after the "¶ end" row is ignored, unless its first cell contains "¶". Avoid that. The regenerated Book Master file in the reports directory will not include any content after the "¶ end" row.