Electronic publishing standards
One key aspect of our new era of digital publishing is the standards upon which the publication process rests, from creation through distribution to preservation. The network of organizations developing standards in our community is varied and overlapping. So too are the standards. This chapter covers some background on standards and standards development, highlights some key existing standards in electronic publishing, and discusses some issues needing future standardization and some of the initiatives underway to address these issues.
Publishing is not a closed ecosystem involving only the publisher and the author. Whether you are looking at the supply chain, libraries and other intermediaries, or end users, interoperability is the key to cost-effective digital publishing that produces content that is discoverable, accessible and preservable. Standards are what ensure production efficiency and interoperability in every industry, and the use of standards has a long history in publishing. Along with the rapid growth of electronic content has come a new wave of standards for e-publishing. The complexity of the digital publishing supply chain requires significant consensus on distribution structures, authentication systems, identifiers and metadata to ensure overall interoperability and discoverability.
Advances in the distribution of knowledge have often been traced to advances in the technology for distributing that content. From the creation of incunabula1 (early printed books) in the 15th century and the development of the steam-powered printing press (Meggs, 1998) in the 19th century to the rapid shift toward digital distribution at the end of the 20th century, technological changes have radically altered how content is produced and distributed. When a new technology is first introduced, a variety of implementation methods are tested and implemented, but over time a few methods evolve into best practices. In our modern environment, one or more of these methods are eventually adopted as formal standards.
The ISO defines a standard as a:
‘Document, established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context.’2
While formal standards play an important role, equally important are the best practices and de facto standards that exist in our marketplace and which govern much of what is accomplished in a variety of industries, particularly one as old and established as publishing.
We don’t often give much thought to the design and production elements of a book or journal, except possibly in noticing their absence. Things like page numbers, binding, title pages, indices, paper, font styles and basic page design structures are taken for granted, but all are examples of different types of standards that most publishers adhere to voluntarily. Although these practices are not frequently enshrined in formal documents in a way that matches the ISO’s definition, they are an ever-present element of publishing production, library management or the reading experience.
Many standards that have played a critical role for centuries in printed book and journal production and content distribution need to be reconsidered and adapted to the digital content environment. One good example of this is page numbering, which began as a production system for keeping plates and the subsequently produced sheets of paper in the correct order for binding the book. Today in an era of digital content with reflowable text – where the end-user can adjust the typeface, font or other traditionally fixed elements – the concept of a page number is nearly meaningless. And yet readers still rely on page numbering because it has become a useful system beyond its original production purpose. In part, pages are important because reading is often a social activity; while the act of reading may not itself be social, readers frequently want to share what they have read and discuss it with others.
Page numbers are a way of referencing something within a book or journal within that social context. This is especially true in the scholarly world, where citations and referencing content are critical elements in the process of research and publication. So even though the electronic format of content does away with the structural need for physical page numbers, a need still exists for some citation point that can be used for referencing.
This is an example of why standards development is so critical in the ongoing transition to digital content. In the coming years, standards development organizations (SDOs) will need to work through many similar issues to ensure new technologies are functional, appropriately usable and fully capable of supplementing or replacing their print counterparts. Much work has already been done to develop needed standards for electronic content, as this chapter will illustrate; but much more remains to be done.
There are a variety of ways in which people and organizations develop consensus around standards or best practices. These can range from the completely independent, driven by the ‘free hand of the market’, to the extremely formal, driven by organizations focused specifically on the creation of standards. Both the method of development and the formality of the process used to develop the resulting standard determine its type. The two main recognized types of standards are de facto and de jure.
A de facto standard is one that has become accepted in practice but has not undergone any formal process to obtain consensus and may not even have publicly available documentation. Typically, de facto standards result from marketplace domination or common practice. In the publishing world, page numbering, as discussed above, is an example of a de facto standard. A de jure standard is one that is developed through a formal process, usually managed by an official SDO. These standards typically require adherence to principles such as openness, balance, consensus and due process (adapted from the American National Standards Institute’s Essential Requirements).3
Compliance with both de facto and de jure standards is voluntary. However, in some cases, de jure standards can be cited by legal code or regulation, which can make compliance mandatory within the affected legal jurisdiction.
In some cases, a de facto standard is taken through a formal standards process and thereby becomes a de jure standard. One recent example is Adobe’s Portable Document Format (PDF), which after many years as a de facto standard in the marketplace became an international standard (ISO 32000-14) in 2008.
The community of organizations that create standards is a complex one, operating at industry, national and international levels. Most countries have a national standards-setting body, sometimes independent or non-governmental and sometimes formed and/or authorized by the government. Examples of such national bodies are the American National Standards Institute5 (ANSI) in the US, the British Standards Institution6 (BSI) in the UK, the Deutsches Institut für Normung7 (DIN) in Germany, the Association Française de Normalisation8 (AFNOR) in France and the Standardization Administration of China9 (SAC).
The primary international standards body for publishing-related standards is the ISO.10 ISO’s voting members are the national standards bodies ‘most representative of standardization in [their] country’.11 ISO divides its standards work into Technical Committees (TCs), each of which has responsibility for a particular subject area and is identified by an ascending numbering scheme. A number of the standards related to electronic content, such as the International Standard Book Number (ISBN),12 the International Standard Serial Number (ISSN)13 and the Digital Object Identifier (DOI),14 are developed by TC 46 – Information and documentation.15 A joint committee – JTC1, Information Technology16 – of ISO and the International Electrotechnical Commission (IEC) is responsible for many of the standards related to format specifications and computer interactions, such as the JPEG 2000 image coding system.17 National participation in this process is organized differently in each participating country, but most countries use some form of ‘mirroring committee’ of national experts to provide input into the relevant ISO work.
Currently within the US, there are over 200 ANSI-accredited SDOs18 developing American National Standards in everything from manufacturing and safety to libraries and publishing. For electronic content, some of the key developers are: the Association for Information and Image Management19 (AIIM), the National Information Standards Organization20 (NISO), ARMA International21 and the InterNational Committee for Information Technology Standards22 (INCITS).
A number of other SDOs exist outside of the formal international and national bodies discussed above. Two international standards organizations, created specifically for internet- and web-related standards, are the Internet Society23 [responsible for the Internet Engineering Task Force24 (IETF)] and the World Wide Web Consortium25 (W3C).
Many trade or professional organizations have one or more committees developing standards and guidelines in areas of interest to their members. Within the e-publishing world, examples of such organizations include: the American Library Association26 (ALA), the Book Industry Study Group27 (BISG), EDItEUR,28 the International Digital Enterprise Alliance29 (IDEAlliance), the International Digital Publishing Forum30 (IDPF), the International Federation of Library Associations and Institutions31 (IFLA), the National Federation of Advanced Information Services32 (NFAIS) and United Kingdom Serials Group (UKSG).33
Some organizations form consortiums for the purpose of collaborating on the development of standards, such as the DAISY Consortium,34 the Entertainment Identifier Registry (EIDR),35 the Open Researcher and Contributor ID (ORCID)36 and the Organization for the Advancement of Structured Information Standards37 (OASIS).
Government agencies may also develop standards as part of their mission. In the US the National Institute of Standards and Technology38 (NIST) develops standards in a wide variety of areas to further US innovation and industrial competitiveness. The Library of Congress39 has developed numerous standards related to libraries, metadata and preservation.
What should be apparent from this SDO discussion is that there is no lack of organizations developing standards. Because the lines are not always clear where one organization’s mission ends and another begins, there are often overlaps and conflicts in the resulting standards. Most SDOs, though, are good at working together on areas of common interest and will try to stay abreast of each other’s activities and collaborate whenever possible. However, it is not unusual to encounter situations where two organizations develop competing or incongruent standards, due to a lack of awareness of the other’s work, to ‘forum shopping’ (when developers try to find a receptive community) or to the specific needs of a particular community that differ somewhat from those of another community.
A discussion of all relevant standards for electronic content could fill an entire book. As this discussion is limited, a representative sample of critical standards will be discussed in the areas of: (1) structure and markup, (2) identification, (3) description, (4) distribution and delivery, (5) authentication for discovery, (6) reference and linking and (7) preservation. A few standards discussed in this section have not yet been published, but they are included if the initiative is well on its way and/or a draft is available.
One basis for understanding the need for structuring and marking up content is the simple fact that machines do not have the same level of understanding of nuances and inferences that human beings do. A person glancing at the title page of a book can easily tell which text is the title, which is the author and which is the publisher. A computer must have explicit markings to understand which text is which.
There are several benefits to producing content in structured formats. The first of these is that the presentation of structured documents can easily be changed through the use of style sheets. In this way, the same content can be transformed from one page layout to another, for example, or from one screen size or rendering to another. This allows a publisher to produce a single source file for multiple final formats and reduces overall production costs. This capability is increasingly important as new rendering technologies and new mobile devices constantly come onto the market. Platform-agnostic file structures also aid long-term preservation, as they can be more easily migrated to new formats to prevent future obsolescence. Finally, structured content provides opportunities for enriching content or re-using it for a different publication or purpose, for example teacher and student versions of a textbook, a book compilation of journal articles, a web ‘mash-up’ of content or a multimedia version of a textual book. Increasingly, publishers are also exploring ways to semantically enrich content by linking words, phrases or references to additional materials outside of the text. It is far easier to add links to references and terms, or to add other tags and annotations, using structured content than it is through more fixed formats, such as PDF.
Markup languages, such as TeX,40 LaTeX41 (pronounced ‘tech’ and ‘la-tech’ as in ‘technical’, respectively) and Standard Generalized Markup Language (SGML),42 were originally developed for production and typesetting of content for print production. SGML, published as the international standard ISO 8879 in 1986, defined how to create Document Type Definitions (DTDs), which are specific markup language applications. Most markup languages have developed from this initial SGML standard, including TEI,43 DocBook44 and the Hypertext Markup Language45 (HTML) for presenting webpages.
A need developed for a slightly lighter-weight and more flexible structure than SGML that could also be used to transport and store data. The resulting language, developed by the W3C, was introduced in 1998 as the Extensible Markup Language46 or XML. At its heart, XML is a set of rules for how content should be structured and marked up using tags and a customized set of elements and attributes. The specific elements and attributes used and their relationship to one another are typically declared in a schema. By referencing this schema, a computer program can understand how to interpret the markup and act on it intelligently.
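To make the point about explicit markup concrete, the sketch below marks up a title page in XML and reads it back with Python’s standard `xml.etree` module. The element names (`titlePage`, `title`, `author`, `publisher`) are invented for illustration and are not drawn from any real DTD or schema:

```python
import xml.etree.ElementTree as ET

# A hypothetical marked-up title page; the element names are
# illustrative, not taken from any published schema.
TITLE_PAGE = """
<titlePage>
  <title>Electronic Publishing Standards</title>
  <author>A. N. Author</author>
  <publisher>Example Press</publisher>
</titlePage>
"""

def describe(xml_text):
    """Return a dict of the explicitly marked-up fields."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

fields = describe(TITLE_PAGE)
print(fields["title"])  # the program now 'knows' which text is the title
```

Because each role is tagged explicitly, a program can act on the title or author without any of the visual inference a human reader would use.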
In 1996, the US National Library of Medicine (NLM) launched an online reference system called PubMed, which pushed the MEDLINE database of references and abstracts in the life sciences and biomedical sciences onto the world wide web. In the development of this system, the National Center for Biotechnology Information (NCBI) of the NLM created the Journal Archiving and Interchange Tag Suite with the intent of providing a common format in which publishers and archives can exchange journal content. Since its enhancement and expansion in the early 2000s, the suite has become a de facto exchange format for journal content. In 2009, the NLM put forward the suite, now known as the Journal Article Tag Suite (JATS), for national standardization within NISO; it is currently under public review and is due for publication in 2012.
Another important benefit of structured content is the ability to improve accessibility of content to people with disabilities, particularly the visually impaired. One important standard in this space is the Digital Accessible Information System, or DAISY, standard.47 This standard provides a means of creating books that include text-to-speech functionality for people who wish to hear – and navigate – written material presented in an audible format. The standard has recently been revised48 to make it more modular and to expand its scope beyond the ‘talking book’ to cover a broad range of digital content types in a variety of formats. The revision specifies a framework that will ‘provide the increased flexibility that producers were requesting to allow markup to be tailored to their unique needs’ (Garrish and Gylling, 2011).
We use identifiers to distinguish things from one another (disambiguation) and to serve as a shorthand way of referring to the item. Ideally, every content item would have at least one unique identifier that applies to it and only to it. Because identifiers are used for different purposes, many items may have multiple identifiers. A book, for example, could be assigned an ISBN by the publisher and also receive a call number by a library that adds the book to its collection.
The ISBN was designed as a product identifier for use in the supply chain. Separate ISBNs are assigned to a hardback, trade paperback and mass-market paperback version of the same book so that every trading partner would know exactly which of these products was involved in the transaction. The latest version of the ISBN standard (ISO 2108:2005) extends this concept to e-books, stating: ‘Each different format of an electronic publication (e.g. “.lit”, “.pdf”, “.html”, “.pdb”) that is published and made separately available shall be given a separate ISBN.’49
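An ISBN-13 carries a final check digit computed with alternating weights of 1 and 3, which lets any trading partner detect most transcription errors. A minimal sketch in Python, using the commonly cited test number 978-0-306-40615-7:

```python
def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3)
                for i, d in enumerate(first12))
    return (10 - total % 10) % 10

def is_valid_isbn13(isbn):
    """Validate a hyphenated or plain 13-digit ISBN."""
    digits = isbn.replace("-", "")
    if len(digits) != 13 or not digits.isdigit():
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])

print(is_valid_isbn13("978-0-306-40615-7"))  # True
```

Note that the check digit only validates the number itself; it says nothing about which product format the ISBN was assigned to, which is the crux of the e-book assignment debate discussed below.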
Due to confusion in the industry, the International ISBN Agency issued Guidelines for the Assignment of ISBNs to E-Books50 in 2009. Controversy over these guidelines remains due to many unique situations when dealing with electronic content. The number of ways that digital versions of the same content can differ from one file to another is exponentially higher than it is for the print world. This could result in an extremely large number of ISBNs being assigned to all the different variations of the same digital text.
For example, a digital file produced by a publisher could be wrapped in different digital rights management (DRM) by multiple suppliers further down the chain. Because the experience provided (or limited) by DRM impacts the user experience, it might be valuable to identify those digital objects separately. However, does each supplier need to obtain a new ISBN? A PDF file might have specific page layout formatting, while a reflowable version of the text for a mobile device might lack this presentation information. Some versions might have active links and searchable text, while others might simply be static page images.
An even more complex situation might be when the underlying file that is distributed is exactly the same, but a reader’s ability to access certain features or functions is associated with the rights or access keys that the reader purchases. It is an open debate as to whether it makes sense to assign unique ISBNs to each of these manifestations of the underlying work.
The International Standard Text Code (ISTC) (ISO 2104751) is a relatively new identifier that was designed to uniquely identify the content of a textual work, regardless of its format or product packaging. The intended users of the ISTC are ‘publishers, bibliographic services, retailers, libraries, and rights management agencies to collocate different manifestations of the same title within a work-level record’ (Weissberg, 2009). There are a variety of use cases for the work-level identification provided by the ISTC. From a library perspective, it would be valuable to know that you have the same text of a particular work in a variety of different forms.
Another use for the ISTC is in the compiling of sales data for best-seller lists across all the different formats of a work. When there were generally two manifestations (a paperback and a hardcover version) of a book, compiling sales data was comparatively simple. In a digital environment, when there could be dozens of manifestations, the value of having an ISTC as a work-level identifier could be quite significant.
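The best-seller use case can be sketched as a simple roll-up: each manifestation carries its own ISBN, but grouping sales on a shared work-level identifier collapses the formats into a single figure. The records and ISTC values below are invented placeholders:

```python
from collections import defaultdict

# Invented sample records: each manifestation has its own ISBN,
# but all manifestations of a work share one ISTC (placeholder values).
sales = [
    {"isbn": "9780000000017", "format": "hardcover", "istc": "WORK-A", "units": 1200},
    {"isbn": "9780000000024", "format": "paperback", "istc": "WORK-A", "units": 3400},
    {"isbn": "9780000000031", "format": "epub",      "istc": "WORK-A", "units": 2100},
    {"isbn": "9780000000048", "format": "hardcover", "istc": "WORK-B", "units": 900},
]

def sales_by_work(records):
    """Aggregate unit sales at the work level using the ISTC."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["istc"]] += rec["units"]
    return dict(totals)

print(sales_by_work(sales))  # {'WORK-A': 6700, 'WORK-B': 900}
```

Without the work-level key, the three WORK-A products would be counted as three unrelated titles.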
The International ISTC Agency52 reported in April 2011 that some 7000 ISTCs had been assigned and at least 70 000 more were pending. At this writing it is still too early to predict uptake of the ISTC and the different ways it might be applied.
The e-resource isn’t the only thing that could be identified in the publication process. There are a variety of people associated with content, including authors, editors, compilers, translators, performers, songwriters, producers and directors. While libraries have created and maintained name authority files for their collections for decades, other organizations, such as society publishers, abstracting and indexing services and rights organizations to name just a few, have managed their own repositories of name-related information. As a result, there is no unique naming convention or identifier for a given individual or entity that crosses organizations and systems and might link together the various information about that person or entity.
Among the many serious challenges of maintaining information related to people is the rapidity with which information changes. People are constantly changing roles and positions, creating ever more content in a variety of forms and – with less frequency – changing names, retiring or dying.
In addition to the logistic challenges of continually updating this information, there are legal and policy issues surrounding privacy of personal data. Many countries have passed laws53 regarding the handling of personally identifiable information and more legislation is pending.54 As the number and scope of online name identifier registries increase, it is likely that the concerns regarding privacy and therefore the potential regulation of this information will increase in the future.
A soon-to-be-published international standard, the International Standard Name Identifier (ISNI)55 (ISO 27729), specifies an identifier for the public identity and disambiguation of parties associated with digital content creation, distribution and management. The party being identified could be a real person, a pseudonym, a fictional character, a group, an abstract concept or even a corporate entity. The administration system described in the standard provides for a minimum set of core metadata about the party to be provided at the time of registration and stored in a central repository. These metadata are used to determine if a party already has an ISNI when one is requested or if a new identifier is needed.
The VIAF (Virtual International Authority File) project56 – a name authority system developed as a partnership between the Library of Congress (US), the Deutsche Nationalbibliothek and the Bibliothèque nationale de France, in cooperation with a number of other national libraries and bibliographic agencies – has provided its database of over 14 million authority files to the ISNI International Agency,57 which will manage registration. The ISNI Agency is matching the data files provided by its founding members to the VIAF files to make initial ISNI assignments where there are more than three VIAF sources or two independent sources confirming the data (Gatenby and MacEwan, 2011).
The Open Researcher and Contributor ID (ORCID)58 initiative was the outcome of a meeting co-organized by the Nature Publishing Group and Thomson Reuters in November 2009. The ORCID consortium was chartered with a mission to create ‘a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current author ID schemes.’59 Potential uses of the system are envisioned as: linking together a researcher and his or her publication output, aiding institutions and professional societies in compiling their researchers’ publication output, improving efficiency of project funding and aiding researchers to find potential collaborators.
Although the system is not yet publicly available, it has been described in its principles as a mix of computed information about researchers and user-provided information. Researchers will be able to create and maintain their own profile and set their own privacy settings; data will be released under the Creative Commons CC0 waiver. Public launch of the service is projected for autumn 2012.
There is potential synergy between the ISNI and ORCID systems. Although no agreement yet exists, there is a possibility that the ORCID system could use the ISNI identifier scheme for the assignment of ORCID IDs, which would mean both systems would be using the same number format and could interoperate. There could also be greater collaboration where, for example, the ORCID project could participate directly in the ISNI system as one of the Registration Agencies representing the researcher community.
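The shared number format mentioned above is a 16-character string whose final character is a check computed with the ISO 7064 MOD 11-2 algorithm (with ‘X’ standing for the value 10). A minimal sketch, using ORCID’s published example identifier 0000-0002-1825-0097:

```python
def mod11_2_check(base_digits):
    """ISO 7064 MOD 11-2 check character over the first 15 digits."""
    total = 0
    for d in base_digits:
        total = (total + int(d)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)

def is_valid_identifier(identifier):
    """Validate a hyphenated 16-character ISNI/ORCID-style identifier."""
    chars = identifier.replace("-", "")
    if len(chars) != 16:
        return False
    return mod11_2_check(chars[:15]) == chars[15]

print(is_valid_identifier("0000-0002-1825-0097"))  # True
```

Sharing this format is what would allow the two registries to interoperate at the identifier level, even while each maintains its own community-specific metadata.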
The key difference between these two seemingly overlapping standards is that the ORCID is a sector-specific project for the researcher community, while ISNI is a generalized bridge identification structure bringing people-related data together from many diverse communities for disambiguation and linking. The metadata that would be relevant to describe the work of a researcher, such as their institutions, funding sources, research focus, papers and research affiliations, are quite specific and different from the data that would be necessary to describe the work of a guitarist in a band.
Works, books and authors are not the only entities that need identification in a well-functioning content supply chain. It is also extremely useful to identify the organizational entities trading information. This is critical not only for the sales process from publisher to library, but also among libraries for resource sharing. There are a number of existing organization identification systems, although they are limited in scope and none takes into account the linkages and relationships between organizations in the scholarly community. To address this gap, NISO launched the Institutional Identifier (I2) initiative to build on work from the Journal Supply Chain Efficiency Improvement Pilot,60 which demonstrated the improved efficiencies of using an institutional identifier in the journal supply chain.
The I2 working group has developed a metadata schema to map unambiguous identification of institutions and their related entities. For example, the Welch Medical Library at Johns Hopkins University (JHU) Medical School is related to the Eisenhower Library at the Homewood campus of JHU, but in many instances, the two entities operate independently. There are also departments on campus that have similar ties and relationships with other entities within JHU and outside of it. In addition, the library and institution are members of a variety of consortia and regional entities. Mapping these relationships is critical for effective management of subscriptions, access control systems and other institutional cooperation.
I2 is currently exploring collaboration with the ISNI International Agency to extend the use of the ISNI infrastructure and business model to institutions and to harmonize their metadata profiles. A likely requirement will be one or more ISNI Registration Agencies that specifically assign and register institution IDs within the ISNI system (DeRidder and Agnew, 2011). It is anticipated that this system will be operational some time in 2012.
Metadata, often called ‘data about data’, are information used to describe, explain, locate or otherwise make it easier to retrieve, use or manage an information resource.61 The metadata we choose to describe a thing are often for a specific purpose. When one is buying a journal, for example, the title, price, frequency of publication, and the address of the publisher where the payment should be sent are each relevant and critical data elements for that transaction. This can be contrasted with the back-end production metadata required for long-term electronic archiving of the journal, such as file format, creation application, encoding system, embedded fonts and color space.62
These distinctions in the metadata used to describe a thing are based on a concept called functional granularity, which refers to the need to provide only as much information as is needed to conduct a specific task. To conduct a purchase transaction, one must know the price. However, in the context of preservation, price is an irrelevant data element. Alternatively, knowing the title and author of a book might be sufficient in a rights transaction, but the size, shape and weight of a particular copy would be important to a shipping fee calculation. Curating metadata is time-consuming and expensive, and the investment often requires a business rationale. If there is no need to create or maintain metadata for a particular purpose, then it doesn’t make sense to invest the significant costs required to maintain it. The challenge, however, is that one may not know all of the future needs related to the content item. For this reason there is no simple rule for the appropriate level of metadata quality or detail.
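Functional granularity can be made concrete with a small sketch: one fuller record for a title, from which purpose-specific views are projected. The field names and profiles below are invented for illustration, not drawn from any metadata standard:

```python
# A single, fuller metadata record (invented field names and values).
record = {
    "title": "Journal of Examples",
    "issn": "1234-5679",
    "price": 350.00,
    "publisher_address": "1 Example St",
    "file_format": "application/pdf",
    "embedded_fonts": ["Times"],
    "color_space": "sRGB",
}

# Each task needs a different slice of the same record.
PROFILES = {
    "purchase":     ["title", "issn", "price", "publisher_address"],
    "preservation": ["title", "issn", "file_format",
                     "embedded_fonts", "color_space"],
}

def view(record, purpose):
    """Project only the fields relevant to the given task."""
    return {k: record[k] for k in PROFILES[purpose] if k in record}

print(sorted(view(record, "purchase")))
```

The purchase view carries the price but none of the archival fields; the preservation view does the reverse, mirroring the price-versus-file-format contrast described above.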
One of the more widely adopted standards for describing published works in the library community is the MAchine-Readable Cataloging (MARC) standard.63 The current version, MARC 21, is a harmonization of the US and Canadian MARC standards and is jointly maintained by the Library of Congress and the Library and Archives Canada. This system for conveying the bibliographic information about a work consists of three core elements: the record structure, the content designation and the data content of the record. The record structure is described in two standards, Information Interchange Format64 (ANSI/NISO Z39.2) and Format for Information Exchange65 (ISO 2709).
The content designation is ‘the codes and conventions established to identify explicitly and characterize … data elements within a record’66 and supports their manipulation. The content of data elements in MARC records is defined by standards outside of the MARC formats, such as AACR2,67 Library of Congress Subject Headings68 and Medical Subject Headings69 (MeSH). MARC records form the basis for most collaborative cataloging initiatives and most library systems are built around the basis of MARC record data structures. The MARC record format is maintained by the Library of Congress, with input from the American Library Association’s ALCTS/LITA/RUSA Machine-Readable Bibliographic Information Committee (MARBI).
There has been considerable movement toward adapting the MARC record structure as systems have improved and changed over the decades since it was introduced. This has included MARCXML,70 a framework for working with MARC data in an XML environment, and MARCXchange,71 a standard (ISO 25577) for representing a complete MARC record or a set of MARC records in XML.
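A flavor of the MARCXML structure can be seen in the sketch below, which reads a minimal record with Python’s standard `xml.etree` module. The record content is invented, but the namespace, the `datafield`/`subfield` elements and the use of tag 245 $a for the title and 100 $a for the main personal name follow published MARC 21/MARCXML practice:

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# A minimal, invented MARCXML record; field 245 $a holds the title
# and 100 $a the main-entry personal name, per MARC 21 practice.
RECORD = """
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Author, Example</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>
"""

def get_subfield(xml_text, tag, code):
    """Return the first matching subfield value, or None."""
    root = ET.fromstring(xml_text)
    path = f'marc:datafield[@tag="{tag}"]/marc:subfield[@code="{code}"]'
    node = root.find(path, NS)
    return node.text if node is not None else None

print(get_subfield(RECORD, "245", "a"))  # An example title
```

Because the tagged structure survives the move from the binary ISO 2709 format to XML, the same field/subfield addressing continues to work across both encodings.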
The modern cataloging rules, the Anglo-American Cataloguing Rules, 2nd edition (AACR2), that help to define what a cataloging record should include (i.e. the information contained within a MARC record) are also being changed to adapt to our new web-based, linked-data environment. The AACR2 cataloging rules were revised in 2010 into a linked-data-oriented successor called Resource Description and Access72 (RDA). These changes will significantly impact the systems that include or exchange bibliographic information. As use of RDA requires significant changes in cataloging practices and supporting systems, a full transition to the new format will take some time.
ONIX is an acronym for ONline Information Exchange and is used to designate a family of standards ‘to support computer-to-computer communication between parties involved in creating, distributing, licensing or otherwise making available intellectual property in published form, whether physical or digital.’73 ONIX for Books74 is an XML structure for the communication of book product metadata in supply chain transactions. Originally developed by the Association of American Publishers and a variety of trade entities in the late 1990s, ONIX for Books is now published and maintained by EDItEUR in association with the Book Industry Study Group (BISG) and Book Industry Communications (BIC).
A typical ONIX record would include basic bibliographic metadata about the title, including author, subject classification, description and ISBN, as well as sales information such as price, availability, territorial rights and reviews. The ONIX for Books standard has been widely adopted by major publishers, trade organizations, major retailers and distributors as the primary way in which product information is distributed in retail trade. There have been a few initiatives to create mappings between the ONIX for Books system and the MARC record system to help improve interoperability75 of metadata between the publishing and library communities.
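The following sketch illustrates the kind of product metadata an ONIX message carries. The element names loosely follow ONIX 3.0 reference tags, but the record is abbreviated, hypothetical and not schema-valid:

```python
import xml.etree.ElementTree as ET

# An abbreviated, invented product record. Element names loosely follow
# ONIX 3.0 reference tags; a real message carries many more elements
# (pricing, territorial rights, descriptions, contributor details, etc.).
ONIX = """<Product>
  <ProductIdentifier>
    <ProductIDType>15</ProductIDType>  <!-- 15 = ISBN-13 in the ONIX code lists -->
    <IDValue>9780000000002</IDValue>
  </ProductIdentifier>
  <DescriptiveDetail>
    <TitleDetail>
      <TitleElement>
        <TitleText>An Invented Example Title</TitleText>
      </TitleElement>
    </TitleDetail>
  </DescriptiveDetail>
</Product>"""

product = ET.fromstring(ONIX)
isbn = product.findtext("ProductIdentifier/IDValue")
title = product.findtext("DescriptiveDetail/TitleDetail/TitleElement/TitleText")
```

Because both ONIX and MARCXML are structured XML, mappings between the two are largely a matter of matching element paths like these to MARC tags, which is what the crosswalk initiatives mentioned above attempt to codify.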
Once content is created it needs to find its way out of the publisher’s production department and through a supply chain of distributors, possibly libraries, and finally into the hands of a reader. In a print environment, this process included printing the content, as well as warehousing and then shipping physical copies of the book to a retailer or end-user. In a digital environment, content is similarly produced, but packaging and distribution take place in quite different ways, and different and additional distributors are involved in the process.
The format of these digital files has been shifting quickly since electronic content distribution first became available. In the early days of electronic distribution, content was distributed as simple text files. These eventually gave way to more complex structured documents; to digitally recreate the ‘page’ format, Adobe’s PDF became the de facto standard for delivering a facsimile of the page as it had appeared in print. PDF eventually became an international standard, as noted above.4 Many publishers now provide content directly in HTML format via their websites and in various e-reader formats through third-party providers.
As content distribution has grown in complexity and the variety of devices on which content needs to be rendered has increased, there has developed a need for a common distribution format. The EPUB specification is the leading contender to resolve this challenge.
EPUB® is a set of standards for creating and distributing reflowable digital content. Developed by the International Digital Publishing Forum (IDPF), the latest version, EPUB 3,76 supports ‘media queries’, which allow the use of style sheets to control the content layout while still allowing for reflowable text, and to ‘produce, for example, a two-page spread on a tablet held in landscape mode, a one-page two-column layout when that tablet is turned to portrait mode, and a single column format on a mobile phone, all from the same XHTML5 file’ (Kasdorf, 2011). XHTML5 is the XML serialization of HTML5, the latest version of the HTML specification from the W3C. This markup form closely matches what is renderable by a modern web browser.
EPUB consists of three main elements: the content itself, some descriptive information about the content and its components, and a packaging structure to combine all of the content and metadata elements. The need for a standard in this space is significant because of the complexity of interoperation needed not only between the publishers and the supply chain, but also with the reading-device manufacturers. In addition, as publications become more interactive – including things like audio, video and other multi-media features – the need to rely on a common structure for distributing electronic content is increasingly critical.
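The packaging layer can be illustrated with a short sketch. An EPUB file is a ZIP container whose first entry must be an uncompressed `mimetype` file, with a `META-INF/container.xml` pointing to the package document. The package and content files below are empty placeholders, not a valid publication:

```python
import io
import zipfile

# Sketch of the EPUB packaging (OCF container) layer only; the package
# document and content document here are minimal, hypothetical stand-ins.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="EPUB/package.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as epub:
    # The container format requires 'mimetype' to be the first entry, stored
    # uncompressed, so readers can identify the file without unpacking it.
    epub.writestr("mimetype", "application/epub+zip",
                  compress_type=zipfile.ZIP_STORED)
    epub.writestr("META-INF/container.xml", CONTAINER_XML)
    epub.writestr("EPUB/package.opf", "<package/>")      # placeholder package document
    epub.writestr("EPUB/chapter1.xhtml", "<html/>")      # placeholder content document

with zipfile.ZipFile(buffer) as epub:
    names = epub.namelist()
```

The three elements named above map directly onto this structure: the content documents, the package document carrying the descriptive metadata and manifest, and the ZIP container binding them together.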
Publishers who license or sell subscriptions to their electronic content are interested in managing access to that content. There are a variety of systems for limiting access based on some form of end-user credentials, authentication of IP addresses or use of proxy servers. All of these methods have drawbacks, particularly given the increasing desire of users to access resources remotely and from mobile devices. As organizations that license the content, such as libraries, add more and more content from multiple providers, the maintenance of authentication credentials becomes more onerous. End-users may be faced with having to re-authenticate every time they change from one content collection to another. Balancing ease of use and security is the biggest challenge in selecting an authentication method.
A newer type of authentication system is gaining traction as a single sign-on solution, based on Security Assertion Markup Language (SAML),77 a standard for exchanging authentication and authorization data between an identity provider and a service provider. The most widely known implementation of SAML in the publishing community is Shibboleth.78 With this system, the service provider’s system requests a user’s credentials from the organization’s identity management system, which ‘asserts’ to the provider’s system the relevant access rights. A user has only to log on to the home system once per session and all assertions passed between the home system and different providers’ systems occur in the background without the user’s involvement.
For many decades, the only content available online was abstracting and indexing databases; full text still had to be delivered on paper, usually at an additional cost to the end-user, and always with a delay. As more content, especially journals, became available online, the issue became how to redirect the end-user from a search result to the full-text content, especially when many databases and collections from different content providers were involved. Reference and linking standards were the solution, allowing interoperability between disparate systems.
The Digital Object Identifier (DOI®)79 is a system for providing persistent and actionable identification and resolution to resource objects in a digital environment. A continuing criticism of the Web is the problem of broken links. The DOI addresses this by assigning an identifier that is separate from the location of the item. A registry system80 of DOIs contains metadata and a resolution link to the actual content. If the content’s location changes, the resolution link can be easily updated in one place and all existing links to it will still work.
The strength of the DOI system lies not simply in unambiguous identification, as with identifiers like the ISBN, but in the use of the DOI name as a Uniform Resource Identifier (URI) and the application of the Handle system,81 a global name service enabling secure name resolution over the Internet. The DOI offers the capability of embedding another identifier, such as a journal’s ISSN or an e-book’s ISBN, to allow even more interoperability. The syntax for a DOI is described in the ANSI/NISO Z39.84 standard,82 and the entire DOI system has been approved as an international standard (ISO 26324)14 that was published in early 2012.
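A rough sketch of the syntax and resolution model: a DOI name consists of a prefix beginning with the directory indicator ‘10.’, a slash, and a registrant-assigned suffix, and it becomes actionable when appended to a resolver proxy. The helper functions below are illustrative, not part of any DOI toolkit; 10.1000/182 is the example DOI name used in the DOI Handbook:

```python
# Minimal sketch of DOI name syntax and proxy-based resolution.
def split_doi(doi):
    """Split a DOI name into its prefix (directory code + registrant) and suffix."""
    prefix, _, suffix = doi.partition("/")
    if not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a valid DOI name: {doi!r}")
    return prefix, suffix

def resolution_url(doi, proxy="https://doi.org/"):
    """Build the actionable URL that the Handle-based proxy resolves."""
    return proxy + doi

prefix, suffix = split_doi("10.1000/182")
url = resolution_url("10.1000/182")
```

Because the link users follow is always the proxy URL rather than the content’s current location, a publisher can move content and update only the registry’s resolution record, leaving every published link intact.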
The DOI system is in wide application and it forms the basis for the CrossRef system, initially designed for scholarly publishing reference linking. As of August 2010, CrossRef managed a database of more than 43 million items associated with a DOI. In 2010, the DataCite group formed to provide data linking and metadata for scholarly research data using the DOI system.
Due to the very nature of the Internet, there are often multiple copies of the same content available on the network. These may exist on mirrored sites, in institutional repositories, or in licensed content databases managed by the publisher and/or by third-party aggregation services. For libraries that provide their patrons access to many of these content resources, the question is how to point the end-user from a search result to the most appropriate copy of a resource, for example one that the library has already paid for. The solution was OpenURL (Van de Sompel and Beit-Arie, 2001), which provides a context-sensitive link through the use of a resolver system built on a knowledge base. The knowledge base matches the institution’s specific resource availability, e.g. licensed journal databases, to the requested item and presents the user with one or more available options for linking directly to the resource, or even ordering it if no electronic version is available.
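A context-sensitive link of this kind can be sketched as an OpenURL in key/encoded-value (KEV) form, with the citation metadata carried in `rft.` keys. The resolver base URL below is invented, since each institution runs its own link resolver:

```python
from urllib.parse import urlencode

# Hypothetical example: building an OpenURL 1.0 query in key/encoded-value
# (KEV) format for a journal article. The resolver host is made up.
RESOLVER = "https://resolver.example.edu/openurl"

citation = {
    "url_ver": "Z39.88-2004",                       # identifies the OpenURL version
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",  # metadata format: journal
    "rft.atitle": "Open linking in the scholarly information environment",
    "rft.jtitle": "D-Lib Magazine",
    "rft.volume": "7",
    "rft.issue": "3",
    "rft.date": "2001",
}

openurl = RESOLVER + "?" + urlencode(citation)
```

The resolver parses these keys back into a citation, consults its knowledge base for the institution’s holdings, and presents the user with the appropriate copy, which is why the same link behaves differently at different institutions.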
In 2004, a NISO working group took the de facto OpenURL specification and created a de jure standard (ANSI/NISO Z39.88-200483) that expanded the capability of the OpenURL to be used in applications beyond the original ‘available copy’ solution. One example of such a new application is COinS (Hellman, 2009) (ContextObjects in Spans), which allows OpenURLs to be embedded in HTML. The OpenURL system has seen wide adoption among library systems and content distributors. The successful implementation of OpenURL depends on the accuracy of the information found in knowledge bases, and inaccurate data was becoming a growing problem. In 2008, NISO and the UKSG jointly launched the Knowledge Bases and Related Tools (KBART) project84 to raise awareness of the importance of data quality in OpenURL knowledge bases and to improve the overall efficiency of the resolution system. The group issued a recommended practice (NISO RP-9-201085) with data formatting and exchange guidelines for publishers, aggregators, agents, technology vendors and librarians to adhere to when exchanging information about their respective content holdings. As of mid-summer 2011, 24 organizations (representing 55 publishers) had endorsed86 the KBART recommendations and were adhering to the best practices to improve their resolution services.
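A KBART holdings file is a tab-delimited text file with a standard header row, which makes it straightforward to process. The sketch below reads a file containing one invented journal title, using only a few of the recommended KBART columns:

```python
import csv
import io

# Sketch of reading a KBART holdings file. KBART files are tab-delimited with
# a header row; only a handful of the recommended columns are shown here, and
# the journal data is invented.
KBART_DATA = (
    "publication_title\tprint_identifier\tonline_identifier"
    "\tdate_first_issue_online\ttitle_url\n"
    "Journal of Invented Studies\t1234-5678\t8765-4321"
    "\t1995\thttp://journals.example.com/jis\n"
)

holdings = list(csv.DictReader(io.StringIO(KBART_DATA), delimiter="\t"))
coverage_start = holdings[0]["date_first_issue_online"]
```

Because every supplier exchanges the same columns in the same plain-text layout, knowledge base vendors can load holdings updates automatically rather than reconciling ad hoc spreadsheets, which is the data-quality gain KBART was created to deliver.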
Preservation is an activity that ‘future proofs’ content. While preserving physical items entails managing environmental factors such as acidity, humidity and temperature, digital preservation falls broadly into four levels of very different preservation activities.
The first level of preservation of digital content is physical: preserving the bits that comprise the document on some storage medium, whether optical discs, flash drives, computer hard drives or network drives. The second is a technological layer: not only the physical media must be preserved, but also some method of reading or extracting the content from the storage device. The third is a formatting layer: software must remain available that can render the extracted files. Finally, there is a semantic layer of preservation, which ensures the rendered content has not been altered through the transformation process.
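Verifying that preserved bits have not changed is typically done by recording ‘fixity’ information, such as a cryptographic checksum, when content is ingested and recomputing it later. A minimal sketch (the choice of SHA-256 here is illustrative; repositories record whichever digest algorithms their policies require):

```python
import hashlib

# Minimal sketch of fixity checking: record a cryptographic digest at ingest,
# then recompute and compare it later to detect silent alteration of the bits.
def fixity(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

original = b"The content of a preserved document."
stored_digest = fixity(original)  # recorded at ingest, stored alongside the object

# Later audits recompute the digest and compare it with the stored value.
unchanged = fixity(original) == stored_digest
tampered = fixity(b"The content of a preserved docum3nt.") == stored_digest
```

Fixity checks of this kind are among the event types that preservation metadata schemes such as PREMIS are designed to record.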
The process of preservation relies heavily on standards adherence to allow for content preservation. The creation of metadata describing the technical and format structures is crucial to understanding the form, structure and rights related to a content object. The PREMIS Data Dictionary for Preservation Metadata87 defines preservation metadata as ‘the information a repository uses to support the digital preservation process’.
The common file structures for creating and distributing content, described previously, will also help to ensure preservation of content and simplify the software needed to re-create content in its original form. There are models for how content should be replicated and stored, and many projects, such as LOCKSS88 and CLOCKSS,89 build on network replication to ensure long-term availability of content. Finally, there are social and organizational structures related to long-term preservation, such as organizations like Portico,90 whose mission is to actively pursue the preservation of content in partnership with publishers and libraries.
Most journals today are distributed in electronic form and take advantage of the possibilities provided by digital distribution by adding citation linking (see DOI and OpenURL above), animated images, video, audio, extended tables, data sets or other related content that does not fit within the construct of a print journal. Authors began to see the value in these new forms and began submitting this supplemental material along with their articles. Some publishers welcomed these submissions to enhance their electronic journal offerings,91 while others were at a loss as to what to do with this content.
The rapid increase in this non-traditional content is straining even the electronic distribution publication model because of the time, effort and skills necessary to vet these additional materials, the specialized software and practices needed to distribute and store them, and questions about the responsibility for long-term preservation. Some publishers have declared they will no longer accept supplemental materials with journal articles (e.g. Maunsell, 2010) or are limiting what can be submitted (e.g. Borowski, 2011). Still other journals are mandating that supplemental materials, such as data sets, be made openly and publicly available. Several grant-funding agencies are also requiring that data generated as a result of funded research be made accessible. Whether these materials will find their way into the publication process remains an open question, but certainly linking at a minimum, and more likely interoperation at some level, will be required.
To help address these issues, a joint NISO and NFAIS working group92 is exploring the business practices and technological infrastructure needs for Supplemental Journal Article Materials. The question of what constitutes supplemental materials is critical to understanding any standard practices and was the first step in the group’s project. The working group is expected to issue its recommended practices for publishers handling these supplemental materials by the end of 2011.
Traditional journal articles with supplemental information are only one part of a much larger challenge related to the increasing reliance on data in the scientific process. New forms of publications are developing, such as ‘data papers’, which are simply available data sets with some associated metadata. The entire realm of data management, data citation and data reuse is fraught with complicated questions, which will require significant study and standards development. For example, traditional provenance questions need to be addressed, such as: Who created this data set? How can we be assured that the data set hasn’t been altered or manipulated since its release or publication? If the data are a sub-set of a larger data collection, how was that subset created, and can it be recreated? If a data set is constantly being updated with new data, how can we return to the state of the database at the time the initial query was run, as new data may influence the results of an analytical tool?
There are a variety of communities looking at these questions. A W3C Provenance Incubator Group published a state-of-the-art paper with a roadmap for possible standardization efforts.93 Another group, organized by CODATA in partnership with the International Council for Scientific and Technical Information (ICSTI), is exploring the issue of data citation practices.94 Other groups have conducted surveys of researchers (Tenopir et al., 2011) and of institutional repository practice (Soehner et al., 2010). The DataCite95 organization is working with some of the largest repository providers to develop standards for the application of DOIs to data sets and developing institutional best practices regarding data preservation and sharing. Finally, the semantic web community is working to provide linking opportunities96 to connect disparate data sets and expand the ability to reuse and mash up heterogeneous data sets.
As the creation and distribution of content moves increasingly toward digital forms, the publishing community must address the many challenges confronting an industry being turned on its head because of new technology and user expectations. Systems for describing, communicating and preserving content that have served us well for decades, if not centuries, need to be re-evaluated and often revamped. The standards process provides an opportunity to conduct this evaluation in a thoughtful way that engages all of the relevant stakeholders in the process. Although there are many organizations engaged in developing standards in the electronic publishing space, stakeholders must work together to achieve an interoperable environment where content is created efficiently, is discoverable and retrievable in a digital environment, and will also be accessible not only for all users today but into the future as well. A lot of progress has been made over the past few decades, but even more work remains, especially given the near certainty that new technologies will continue to develop, expanding the need for new standards. As an industry, we certainly have our work cut out for us.
Borowski, C. Enough is enough [editorial]. The Journal of Experimental Medicine. 208(7): 1337. http://jem.rupress.org/content/208/7/1337.full.pdf, 2011.
Garrish, M., Gylling, M. The evolution of accessible publishing: revising the Z39.86 DAISY standard. Information Standards Quarterly. 23(2): 35–9. doi:10.3789/isqv23n2.2011.08 http://www.niso.org/publications/isq/2011/v23no2/garrish, 2011.
Hellman, E. OpenURL COinS: A Convention to Embed Bibliographic Metadata in HTML, stable version 1.0, 16 June. http://ocoins.info/, 2009.
Kasdorf, B. EPUB 3 (Not your father’s EPUB). Information Standards Quarterly. 23(2): 4–11. doi:10.3789/isqv23n2.2011.02 http://www.niso.org/publications/isq/2011/v23no2/kasdorf, 2011.
Luther, J. Streamlining Book Metadata Workflow. NISO and OCLC, 30 June. http://www.niso.org/publications/white_papers/StreamlineBookMetadataWorkflowWhitePaper.pdf, 2009.
Maunsell, J. Announcement regarding supplemental material. The Journal of Neuroscience. 30(32): 10599–600. http://www.jneurosci.org/content/30/32/10599.full.pdf, 2010.
Soehner, C., Steeves, C., Ward, J. e-Science and data support services: a survey of ARL members. Presented at: International Association of Scientific and Technological University Libraries, 31st Annual Conference, 23 June. http://docs.lib.purdue.edu/iatul2010/conf/day3/1, 2010.
Tenopir, C., Allard, S., Douglass, K., et al. Data sharing by scientists: practices and perceptions. PLoS ONE 6(6): e21101. doi:10.1371/journal.pone.0021101 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0021101, 2011.
Van de Sompel, H., Beit-Arie, O. Open linking in the scholarly information environment using the OpenURL Framework. D-Lib Magazine. 7(3). http://www.dlib.org/dlib/march01/vandesompel/03vandesompel.html, 2001.
1Incunabula – Dawn of Western Printing [website]. Based on Orita, Hiroharu. Inkyunabura no Sekai (The World of Incunabula). Compiled by the Library Research Institute of the National Diet Library. Tokyo: Japan Library Association, July 2000. http://www.ndl.go.jp/incunabula/e/
2ISO/IEC Guide 2:2004, Standardization and related activities – General vocabulary. Geneva: International Organization for Standardization, 2004.
3ANSI Essential Requirements: Due process requirements for American National Standards. New York: American National Standards Institute (ANSI), January 2010 edition. http://publicaa.ansi.org/sites/apdl/Documents/Standards%20Activities/American%20National%20Standards/Procedures,%20Guides,%20and%20Forms/2010%20ANSI%20Essential%20Requirements%20and%20Related/2010%20ANSI%20Essential%20Requirements.pdf
4ISO 32000-1:2008, Document management – Portable document format – Part 1: PDF 1.7. Geneva: International Organization for Standardization, 2008.
11Member bodies. International Organization for Standardization website. http://www.iso.org/iso/about/iso_members/member_bodies.htm
12ISO 2108:2005, Information and documentation – International Standard Book Number (ISBN). Geneva: International Organization for Standardization, 2005.
13ISO 3297:2007, Information and documentation – International Standard Serial Number (ISSN). Geneva: International Organization for Standardization, 2007.
14ISO 26324:2012, Information and documentation – Digital object identifier system. Geneva: International Organization for Standardization, 2012.
15TC46, Information and documentation. International Organization for Standardization website. http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=48750
16JTC1, Information Technology. International Organization for Standardization website. http://www.iso.org/iso/standards_development/technical_committees/list_of_iso_technical_committees/iso_technical_committee.htm?commid=45020
17ISO/IEC 15444-1:2004, Information technology – JPEG 2000 image coding system: Core coding system. Geneva: International Organization for Standardization, 2004.
18ANSI Accredited Standards Developers [Listing with Complete Scope Info]. New York: American National Standards Institute, 12 July 2011. http://publicaa.ansi.org/sites/apdl/Documents/Standards%20Activities/American%20National%20Standards/ANSI%20Accredited%20Standards%20DevelopersJULY11ASD-2.pdf
26American Library Association (ALA) Standards and Guidelines [webpage]. http://www.ala.org/ala/professionalresources/guidelines/standardsguidelines/index.cfm
40TeX Frequently Asked Questions on the Web, version 3.22. UK TeX Users’ Group (UK TUG), last modified 27 April 2011. http://www.tex.ac.uk/cgi-bin/texfaq2html?introduction=yes
42ISO 8879:1986, Information processing – Text and office systems – Standard Generalized Markup Language (SGML). Geneva: International Organization for Standardization, 1986.
49See clause 5.5 in ISO 2108:2005, Information and documentation – International Standard Book Number (ISBN). Geneva: International Organization for Standardization, 2005.
51ISO 21047:2009, Information and documentation – International Standard Text Code (ISTC). Geneva: International Organization for Standardization, 2009.
53See for example the European Union Data Protection Directive (Directive 95/46/EC). http://en.wikipedia.org/wiki/Directive_95/46/EC_on_the_protection_of_personal_data
54In the United States, the Personal Data Privacy and Security Act has been introduced in three of the last four congressional sessions but has yet to be approved.
55ISO 27729, Information and documentation – International Standard Name Identifier (ISNI). Geneva: International Organization for Standardization, publication forthcoming (expected in 2011).
61Understanding Metadata. Bethesda, MD: NISO Press, 2004. http://www.niso.org/publications/press/UnderstandingMetadata.pdf
63Library of Congress Network Development and MARC Standards Office and the Library and Archives Canada Standards and Support Office. MARC 21 [documentation website]. Washington, DC: Library of Congress, date varies depending on section. http://www.loc.gov/marc/
65ISO 2709:2008, Information and documentation – Format for information exchange. Geneva: International Organization for Standardization, 2008.
66American Library Association’s ALCTS/LITA/RUSA Machine-Readable Bibliographic Information Committee (MARBI). The MARC 21 Formats: Background and Principles. Washington, DC: Library of Congress, November 1996. http://www.loc.gov/marc/96principl.html
71ISO 25577:2008, Information and documentation – MarcXchange. Geneva: International Organization for Standardization, 2008.
76EPUB 3, Proposed Specification. International Digital Publishing Forum, 23 May 2011. [This draft is expected to be finalized and published in autumn 2011.] http://idpf.org/epub/30/spec/epub30-overview.html
85NISO RP-9-2010, KBART: Knowledge Bases and Related Tools. Baltimore, MD: National Information Standards Organization, January 2010. http://www.niso.org/publications/rp/RP-2010-09.pdf
87Introduction and Supporting Materials from PREMIS Data Dictionary for Preservation Metadata, version 2.1. PREMIS Editorial Committee, January 2011. http://www.loc.gov/standards/premis/v2/premis-report-2-1.pdf
91Elsevier’s ‘Article of the Future’ is now available for all Cell Press Journals. Elsevier Press Release, 7 January 2010. http://www.elsevier.com/wps/find/authored_newsitem.cws_home/companynews05_01403
93W3C Incubator Group. Provenance XG Final Report. World Wide Web Consortium, 8 December 2010. http://www.w3.org/2005/Incubator/prov/XGR-prov-20101214/