What is Scholarly HTML?
Abstract
Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.
Motivation
Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.
According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.
This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.
To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «schema.org» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.
Our high-level goals were:
- Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
- Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
- Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
- Entirely built on top of open, royalty-free standards.
- Long-term viability as a data format.
Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.
An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.
Where semantic modelling is concerned, our approach is to stick as much as possible to schema.org. Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.
In a small, circumscribed number of cases we have had to depart from
schema.org, using the https://ns.science.ai/
(prefixed with sa:
) vocabulary instead
(Science.AI, 2015a). Our goal is
to work with schema.org in order to extend their
vocabulary, and we will align our usage with the outcome of these discussions.
Definition
A Scholarly HTML document is a valid HTML document that follows some additional rules to specialise its meaning and make it predictable to processors wishing to produce or consume scholarly articles. These rules are outlined in the following sections.
Please note that in its current state this specification is often informal in the manner in which it describes its constraints. This is to facilitate review by people unfamiliar with formal specification writing. As the format solidifies, it will be made more formal progressively (but attempt to remain readable).
File & Supporting Structure
The document must be encoded in UTF-8, and transmitted with a media type of
text/html
. It must feature a DOCTYPE
as its preamble.
The html
root element must feature a valid lang
attribute.
The head
element of the document must contain a
<meta charset="utf-8">
element (preferable as its
first child), a
<meta name="viewport" content="width=device-width">
element (and no other viewport meta
), and a title
element. All
the other content of the head
is ignored.
The body
element must have a prefix
attribute, which must
declare the following mapping:
Prefix | URL |
---|---|
schema | http://schema.org/ |
xsd | http://www.w3.org/2001/XMLSchema# |
sa | https://ns.science.ai/ |
Having to declare prefixes is undoubtably an annoyance and it does hurt the human
authorability of the format (since hand-creating a document essentially requires a
boilerplate prefix declaration). This trade-off is made for several reasons. The most
important motivation is that having predictable prefixes means that the content can be
styled with CSS using reliable attribute selectors on the semantic information that
describes the document’s structure. The alternative would be to use URLs everywhere,
such that instead of sa:Abstract
we would have
https://ns.science.ai/Abstract
; but in practice that approach is more
painful since the content then becomes bloated with URLs that are longer than is
comfortable.
The RDFa content of the article must systematically make use of these prefixes when the
values match their URLs as prefixes. Authors may declare other prefix-URL mappings in
the prefix
attribute of the body
element (or
prefix
attributes elsewhere), including other prefixes mapping to the same
URLs if needed, but inside of the article’s content these prefixes must be used for
these URLs.
Article Structure
The article content is everything that is contained inside the first
article
element in document order that has a
typeof="schema:ScholarlyArticle"
. Everything part of the body
outside of that subtree is ignored. This enables publishers to surround the article
content with any amount of supporting markup, for instance for headers, footers,
or navigation, as well as to wrap the article inside arbitrary markup that may be
needed for stylistic reasons.
The article
element should have
a resource
attribute, usually with a value
of #
. The reason for that is to grant it a
URL that can be targeted by other properties. The
resource
attribute can take any value, but it
must then be matched by the
about
attributes of the properties targeting
it. If resource
is omitted, the only way in
which those properties can target it is by knowing the URL
from which the document was retrieved.
The first element child of the article
element must be an h1
heading that serves as the primary title for the document. It may itself contain markup.
The white-space-normalised text value of the h1
must appear as a substring
of the white-space-normalised text value of the title
element. This ensures
semantic alignment between the two, while enabling publishers to add their name to the
title
so as to identify themselves there alongside the content.
Any children of article
that are not section
elements are
ignored.
The first section
child element of the article
must be the
Authors and Affiliations section. It has no typeof
and its specific rules are outlined in its own chapter below.
The section
elements can be nested arbitrarily deep. Each
section
element must have as its first element child an hX
heading element the numeric part of which must be the number of section
ancestor elements that heading elements has up to the article
element, plus
one. If the numeric part is greater than 6, then h6
must be used but an
aria-level
attribute must be added that reflects the accurate depth.
(The aria-level
attribute can be used at lower depths but is not required
there.)
Each section
element may contain an arbitrary number of
hunk elements, followed by an arbitrary number of
section
elements being subsections. Note that
hunk elements must imperatively appear before the
subsections.
Sections are expected to be typed using the typeof
attribute. The following
typeof
values are currently understood:
sa:Funding
(which has its specific structure)sa:Abstract
sa:MaterialsAndMethods
sa:Results
sa:Conclusion
sa:Acknowledgements
sa:ReferenceList
Hopefully these types are largely self-documenting, they are described further in the Scholarly Article ontology (Science.AI, 2015a).
The section typed sa:ReferenceList
has special processing rules described
in the References section.
Hunk Elements
Hunk elements are the meaningful blocks from which sections are built. They contain text and inline elements. There are several types of hunk elements.
The most common hunk element is p
, which is used to capture paragraphs. It
requires no special processing.
The blockquote
, ul
, ol
, and dl
elements can be used as they typically would and require no special treatment.
The aside
hunk element is used to capture text boxes. If it contains an
hX
heading element, that element must be its first element child and its
numeric part must reflect its depth, making use of aria-level
according to
the same rules as apply for section
. The other children of
aside
must all be hunk elements.
The figure
element is a general container for content units that are
embedded inside the main body of the text. It can come in several flavours that are
dictated by its typeof
attribute.
If figure
has typeof="sa:Image"
then it is an image container.
It must contain an img
child element and should contain a
figcaption
labelling that image. An example of an image figure would be:
If figure
has typeof="sa:Table"
then it is a table container.
It must contain nothing other than a table
element. If a caption is
available, it should be included using the caption
child element of the
table
, and not the figcaption
child of the
figure
.
If figure
has typeof="sa:Formula"
then it is a formula
container. It must contain a math
element and optionally a
figcaption
describing the formula. The math
element must be
valid MathML 3. Additionally, given the dismal state of support for MathML in Web
browser the math
element must contain an annotation
descendant with the TeX equivalent of the formula.
If figure
has typeof="schema:SoftwareSourceCode"
then it is a code container. It must contain a pre
element
and optionally a figcaption
. The pre
element must contain as
its only child a code
element.
If you wish to specify the type of the language used in the code, the
figure
needs to have a schema:programmingLanguage
property
containing a type schema:Language
, itself with a schema:name
containing the the lowercase name of one of the languages from the
list
of programming languages. Canonically, this would look like the following source:
Inline Elements
Inline elements essentially decorate, describe, and enrich text. Inside of
hunk elements, of heading elements, and of captioning
elements (caption
and figcaption
) the following inline
elements can be used (and where applicable they can nest within one another):
a
abbr
bdi
bdo
cite
code
data
del
dfn
em
img
(for small, contextual images that should not be figures)ins
kbd
mark
math
(for inline equations that should not be figures; they must also contain a TeX annotation)meter
q
ruby
(with embeddedrb
,rt
,rtc
, andrp
)samp
span
strong
sub
sup
svg
(for small, contextual images that should not be figures)time
var
wbr
If an a
element is linking to a citation, then it must have
property="schema:citation"
; if it is linking
to a figure or another creative work, it must
have property="schema:hasPart"
or property="schema:isBasedOnUrl"
. These are
known as flavoured links, they can be used to
enhance the user experience by treating their behaviour
differently from regular links.
The References Section
The references section is a special type of section
element with
typeof="sa:ReferenceList"
.
Apart from its heading element, it must contain nothing other than an ol
or a dl
element.
If using a dl
element, its content must be exclusively a strictly
alternating sequence of dt
then dd
elements, with the latter
being the citation-bearing element. The dt
is used as a label in some
citation formats.
If using an ol
, then its content is only li
elements that are
the citation-bearing elements.
The citation-bearing element will have an id
and be
schema:Book
for books or
typeof="schema:ScholarlyArticle"
(or its subclass
schema:MedicalScholarlyArticle
, with probably more to come). Its
content follows the «flexcite» format (being defined as part of this document, see
#4). The
references section of this document is an example.
At the semantics level, a citation is a schema:ScholarlyArticle
(or subtype)
with an id
to reference it internally in the document and a
resource
that is a URL identifying it (its DOI for instance, preferable in
HTTP-retrievable form).
That schema:ScholarlyArticle
has any number of schema:author
which are schema:Person
(with the usual schema:givenName
,
schema:familyName
, etc.). A child cite
element, with
property="schema:name"
(and optionally a link child) provides the title
of the article.
The publisher is described using a nested schema:isPartOf
structure of
schema:PublicationIssue
, schema:PublicationVolume
,
and schema:Periodical
(with only those that are known being used). Both
schema:volumeNumber
and schema:issueNumber
may be used on the
volume and issue.
A time
element with property="schema:datePublished"
provides
the publication date, which is expressed in text in human-readable form and in the
datetime
attribute in standard form. A datatype
attribute
matching the date format must be provided.
Both schema:pageStart
and schema:pageEnd
may be provided.
Beyond the semantics, a more specific serialisation known a «Flexcite» is in the works and will be added here soon. Its properties are simple: when unstyled it reads linearly in a human-friendly manner (so as to be accessible), and it can be styled with CSS to be turned into arbitrary citation style preferences.
The Funding Section
The funding information attached to an article involves a list of sponsors each of which offers a list of funding sources. Again, an example probably makes the idea clearer:
The section has typeof="sa:Funding"
and an arbitrary heading title, like
other sections.
It contains a series of hunks that are
rel="schema:sponsor"
(there can also be other content, it is ignored for
our purposes). The example above uses a p
and a narrative style for its
content, but you have freedom to use other encodings.
As for contributor affiliations, source of fundings are
expressed using a subclass of schema:Role
(sa:SponsorRole
).
The usage of schema.org Role
type is needed
to describe the source of fundings specific to a scholarly
article as opposed to all the source of funding of an
organization (relevant or not to our scholarly article of
interest).
The funder will be typeof="schema:Organization"
(or a subtype thereof), as the object of a
schema:sponsor
property on the sa:SponsorRole
. It
will be identified through its URL (as in the a
above), and will typically
have schema:name
and often schema:alternateName
.
The specific source of funding is of type sa:FundingSource
, as the object of a
sa:roleOffer
property on the sa:SponsorRole
. It
should have a URL identifying it and a schema:serialNumber
that is its
labeled identity.
Data rich scholarly articles
Scholarly Articles are often part of a larger network of creative works containing dataset, code, additional figures, tables or media (audio, video). Even within a scholarly article, some creative works can be encoded in different ways (for instance, figures typically comes in different resolutions). Scholarly HTML aims to formally describe (and help archive) this larger context.
Further data about a scholarly graph can be exposed within the article in RDFa or as JSON-LD islands. When JSON-LD is used, it is recommended not to duplicate the data already serialized in RDFa. JSON-LD should be reserved to expose data not directly present in the HTML markup.
A scholarly graph provides a manifest for a scholarly
article listing all the creative works, their encodings
and the relationship between these objects (expressed
with schema:hasPart
and schema:isBasedOnUrl
).
Creative Work | Property | Encoding |
schema:ScholarlyArticle |
schema:encoding |
sa:DocumentObject |
sa:Image |
schema:encoding |
schema:ImageObject |
sa:Audio |
schema:encoding |
schema:AudioObject |
sa:Video |
schema:encoding |
schema:VideoObject |
schema:Dataset |
schema:distribution |
schema:DataDownload |
schema:Table |
schema:encoding |
sa:TableObject |
schema:SoftwareSourceCode |
schema:encoding |
schema:MediaObject |
Semantic context of a scholarly article
Scholarly Articles (and their associated resources) are
frequently tagged to improve their discoverability. For
instance,
the National
Library of Medicine uses
the Medical
Subject Headings (MeSH) controlled vocabulary to index
journal articles in the life sciences. Scholarly HTML
leverages schema.org and the schema:about
property to efficiently expose this information to search
engines. When possible,
schema.org MedicalEntity
(and subclasses) should be used to describe biomedical
concepts.
Hypermedia controls
A Scholarly Article (or any resource part of a scholarly graph) can be made actionable with the addition of hypermedia controls provided through schema.org actions. Readers not familiar with schema.org Actions should refer to the actions overview document for a quick introduction.
Acknowledgements
Scholarly HTML would like to thank Scholarly HTML (you read that right) for blazing the trail perhaps a few years too soon. Particularly, the following people were particularly kind and helpful: Peter Sefton, Richard Smith-Unna, and Peter Murray-Rust.
PLOS has a short history of Scholarly HTML that is worth reading (and would be worth updating).
Dan Brickley was kind enough to drop by the office to chat about our usage of schema.org even though he was tired and hungry. As always, examples involving fish tanks are the most helpful. Tzviya Siegman and Dave Cramer have shared ideas that we happily stole.
Patrick Johnston's input has been crucial, notably in modelling authoring. We can only hope that getting those details exactly right have not caused him to lose too much sleep.
We also received very useful feedback and pointers from: Kjetil Kjernsmo (DAHUT!), Silvio Peroni, Justin Johansson, Alf Eaton, Raniere Silvia, and Mike Smith.
If we somehow forgot you in this list and you are too gracious to complain, we love you all the same.
Conclusion
Scholarly HTML is currently a work in progress and is open to change. If you have feedback, simply open an issue on GitHub, or make a pull request.
We believe that this vernacular establishes that it is possible to capture scholarly information accurately, while retaining a clean HTML structure that does minimum violence to the language’s spirit.
The science.ai platform currently supports (most of) Scholarly HTML, but our goal is not to make this a proprietary standard — quite the opposite. We would like the scholarly publishing as much as possible to align with common practices so that we can all focus on problems more interesting than content conversion.
References
- NYT, 2014
- The Full New York Times Innovation Report, by New York Times ; .
- Science.AI, 2015a
- The Scholarly Article ontology, by Science.AI ; .
- Science.AI, 2015b
- Vernacular — HTML Made Special, by Science.AI ; .