URIs and SBOL
The SBOL specification uses URIs to identify data instances. This is captured in the UML diagram as a uri property. The intention is that this URI uniquely identifies the SBOL data instance, allowing the URI to be used in place of the instance where it makes sense, perhaps as a far pointer, for checking equality, or for computing a hash code. In data integration applications it should be possible to rely upon the URIs for data alignment.
Here we give some best-practice suggestions to help you get the best out of these URIs, and give us the best-possible chance of all our URIs playing nicely together.
URIs do Identify Things, don't Locate Them
The role of a URI is to uniquely identify a resource, not necessarily to locate it. In the case of SBOL, we use URIs to identify SBOL data instances that are self-standing. By this we mean that they can meaningfully be referred to, either by other SBOL instances, or by things outside of the specification, without needing to also mention some 'context' that contains them. For example, it makes sense to be able to refer to a DNA Component in its own right, independently of any catalogue of parts that it may be part of, so DNA Components have an associated URI property.
Of course, at some point a software agent needs to locate information about that DNA Component, but it is important that the URI is not simply this location. The same DNA component may be in several catalogues, or stored on your local disk, or exist solely within the memory of an application. In all these locations it has the same URI. It is up to SBOL tooling to work out how to resolve this URI to one or more locations from which the data can be fetched.
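To make the distinction between identity and location concrete, here is a minimal Python sketch. The `DnaComponentRef` class, the `LOCATIONS` table and the `locate` function are all hypothetical names invented for illustration, as are the file and catalogue locations; SBOL does not define a resolution mechanism. The sketch only shows that equality and hashing depend on the URI, while the places the data can be fetched from are looked up separately.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DnaComponentRef:
    """Hypothetical stand-in for an SBOL instance, identified by its URI alone."""
    uri: str  # equality and hash are generated from this single field


# Hypothetical lookup table: one identity URI, several fetchable locations.
LOCATIONS: Dict[str, List[str]] = {
    "http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake": [
        "file:///home/me/designs/leadUptake.xml",       # a copy on local disk
        "http://catalogue.example.org/parts/1234.rdf",  # a copy in some catalogue
    ],
}


def locate(uri: str) -> List[str]:
    """Return every known location for the resource identified by `uri`."""
    return LOCATIONS.get(uri, [])


a = DnaComponentRef("http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake")
b = DnaComponentRef("http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake")
print(a == b)             # same URI, so the same component
print(len({a, b}))        # both collapse to one entry in a set
print(locate(a.uri))      # identity resolved to zero or more locations
```

However the lookup is implemented, the component keeps the same URI wherever a copy of it happens to live.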
This leaves you a lot of scope for inventing naming schemes for your collections, components, sequences and annotations. While much of the leg-work will be handled by tooling, to reduce the space a bit, we recommend the following:
- Keep in mind that the URI is intended to identify the resource, not primarily to locate it. Choose URIs that capture this identity in a meaningful and systematic way.
- Consider using human-meaningful URIs for anything you author that may be referred to by other people. While there's no technical reason to use URIs that people can read, it certainly makes it easier when coming back to things later or sharing them with others.
- Think about the naming scope. Are you generating these designs as part of a project? If you are, then you could start the URIs of these designs with a URI for the project. Are you working as part of a larger organisation? If so, you probably want to include that organisation in the URI. For example, if you are at synbio-alchemy.com in the leadToGold division authoring a collection of parts for the alchemy pathway, you could use a URI like: http://synbio-alchemy.com/divisions/leadToGold/alchemy for the collection and URIs like http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake for individual components.
- For nested objects, you may want to use URIs based upon their containing object. As an example, SequenceAnnotation instances have URIs, but always exist relative to a containing DnaComponent. You may wish to use the URI of the container as a base for the URI of the annotation. For example, you could use
http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake#pump_operon_promoter, rather than
By sticking to these principles, we may be able to handle large numbers of SBOL instances from many sources without instantly getting into a horrible identifier mess. It may be appropriate for SBOL to give best-practice guidance on, or a formal specification of, the mechanism for dereferencing these URIs to data, but this has not yet been addressed.
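The naming suggestions above can be sketched as two small helpers. This is only one possible scheme, written in Python under the assumption that components hang off the collection URI as path segments and annotations hang off their component as fragments; the function names `component_uri` and `annotation_uri` are hypothetical and not part of any SBOL library.

```python
from urllib.parse import quote


def component_uri(collection_uri: str, name: str) -> str:
    """Scope a component under its collection by appending a path segment."""
    return collection_uri.rstrip("/") + "/" + quote(name, safe="")


def annotation_uri(parent_component_uri: str, name: str) -> str:
    """Scope a nested annotation under its containing component via a fragment."""
    return parent_component_uri + "#" + quote(name, safe="")


collection = "http://synbio-alchemy.com/divisions/leadToGold/alchemy"
component = component_uri(collection, "leadUptake")
annotation = annotation_uri(component, "pump_operon_promoter")

print(component)
print(annotation)
```

Percent-encoding the local name with `quote` keeps stray characters such as spaces or slashes from silently changing the structure of the URI.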
Absolute and Relative URIs
The URI specification allows URIs to be absolute or relative. An absolute URI stands alone as a globally unique resource identifier. It is composed of a scheme (the protocol, e.g. HTTP) and a scheme-dependent path. It may also include a fragment suffix (e.g. #item_3). An absolute URI contains the complete information needed to uniquely identify the resource. If the URI is a URL then it contains all the information needed to dereference (fetch) the associated resource.
In contrast to absolute URIs, relative URIs contain incomplete information for uniquely identifying the resource. They are intended to be resolved relative to another URI that provides context. For example, a relative link in an HTML document does not specify either the protocol or host name, but is interpreted in the context of the surrounding web page. By resolving the relative link against the page's URL, an absolute URL is obtained that can then be dereferenced.
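As a small illustration of resolution, Python's standard `urllib.parse.urljoin` resolves a relative reference against a base URI. The base URL below is a made-up page address, and `is_absolute` is a hypothetical helper that simply checks for a scheme.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(uri: str) -> bool:
    """Treat a URI as absolute if it carries its own scheme (e.g. http)."""
    return bool(urlparse(uri).scheme)


# Hypothetical page URL that provides the resolution context.
base = "http://synbio-alchemy.com/divisions/leadToGold/index.html"

for reference in ("alchemy/leadUptake", "/about", "#item_3"):
    # Each relative reference only becomes globally meaningful once resolved.
    print(is_absolute(reference), reference, "->", urljoin(base, reference))
```

The same relative reference resolves to a different absolute URI under a different base, which is why a relative URI on its own cannot serve as a globally unique identifier.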
The SBOL data model allows both absolute and relative URIs to be used as values of the uri property. However, as it doesn't also include an explicit notion of the resolution context, there is no standard way to resolve these for the purpose of comparison. This makes working with relative URIs problematic in practice. To mitigate this, we suggest the following:
- In documents, always use absolute URIs for far-pointers. If you have an SBOL document, in whatever format, and you need to refer to an entity by URI that resides in another document, don't be tempted to use a relative URI. Your document may be moved or copied to another location, so all relative URIs that depend upon the location of your document will break.
- If you use relative URIs, make them 'fragment only' URIs. By this we mean restrict them to the #something form, and don't be tempted to include any relative path information. When in a database or in-memory representation, there is usually an implicit current context against which fragments can be resolved. In XML documents, using fragment relative identifiers can help avoid identifier clashes.
- Resolve relative URIs when publishing SBOL documents to hide any internal, bespoke URI resolution context. It may be convenient in your application to use various relative URI schemes to reduce computational overhead. For example, a database storing many collections each with many DNA Components may store an absolute URI for a collection and then for each DNA Component give a fragment to be resolved relative to the collection. The outside world doesn't know about, and shouldn't need to care about, your internal URI resolution schemes. At data publication time, resolve these URIs so that your users can remain oblivious.
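The last two suggestions can be combined into a publication step. The sketch below is an assumption about how a tool might do this, not a prescribed mechanism: `resolve_for_publication` is a hypothetical function that keeps absolute URIs untouched, resolves fragment-only URIs against the collection's absolute URI, and rejects any other relative form.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlparse


def resolve_for_publication(context_uri: str, uris: Iterable[str]) -> List[str]:
    """Turn internally stored URIs into absolute URIs before publishing."""
    resolved = []
    for uri in uris:
        if urlparse(uri).scheme:
            resolved.append(uri)  # already absolute: publish unchanged
        elif uri.startswith("#"):
            resolved.append(urljoin(context_uri, uri))  # fragment-only: resolve
        else:
            # Relative paths depend on where the document lives, so refuse them.
            raise ValueError("not an absolute or fragment-only URI: %r" % uri)
    return resolved


collection = "http://synbio-alchemy.com/divisions/leadToGold/alchemy"
stored = ["#leadUptake", "http://example.org/parts/p1"]
print(resolve_for_publication(collection, stored))
```

Consumers of the published document then see only absolute URIs and never need to know which internal context was used.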
Minting a URI is the fancy name for publishing a new one. There are some rules about minting URIs that are helpful to re-iterate here.
- A URI MUST only be minted by the URI's owner or delegate.
- If the URI scheme includes a domain name, then the URI's owner is the owner of that domain.
- Do not mint new URIs for existing data; reuse the existing URI.
SBOL entities will come from two primary sources. Firstly, there are the parts and design databases. These will typically be responsible for minting URIs for the parts and designs that they publish, and we would expect these URIs to relate to the data source as a whole. It is possible that aggregate databases may (re)expose designs with URIs originally minted by 3rd parties. It is perfectly legal in this case to publish SBOL instances with these 3rd party URIs intact, and indeed this is the recommended behaviour. What they cannot do is generate new URIs within the 3rd party's domain.
The second case where URIs need minting is by software running on behalf of a user. In this case, it is the responsibility of the tool to take reasonable steps to ensure that the user is the owner or delegate for these minted URIs. This may be achieved by getting the user to fill in a wizard or preference for the URI root, or perhaps some URI scheme derived from their email address. The tool should not mint URIs in some tool-specific domain, as it is acting as a delegate of the user, and the user is not a delegate of the owner of the tool's domain.
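A tool acting on behalf of a user might enforce these rules along the following lines. This is a sketch under stated assumptions: the `Minter` class and the `is_within_root` helper are hypothetical, the URI root is assumed to come from a user preference, and ownership of that root is taken on trust rather than verified.

```python
from urllib.parse import urlparse


def is_within_root(uri: str, root: str) -> bool:
    """True if `uri` has the same scheme and host as `root` and sits under its path."""
    u, r = urlparse(uri), urlparse(root)
    if (u.scheme, u.netloc.lower()) != (r.scheme, r.netloc.lower()):
        return False
    root_path = r.path.rstrip("/")
    return u.path == root_path or u.path.startswith(root_path + "/")


class Minter:
    """Mints URIs only beneath a root that the user says they own."""

    def __init__(self, user_root: str):
        self.user_root = user_root.rstrip("/")

    def mint(self, local_name: str) -> str:
        return self.user_root + "/" + local_name

    def uri_for(self, local_name: str, existing_uri: str = None) -> str:
        # Existing data keeps its URI, including URIs minted by third parties.
        return existing_uri if existing_uri else self.mint(local_name)


minter = Minter("http://awesome.synbio.com/myDesigns")
print(minter.uri_for("collection1"))
print(minter.uri_for("p1", existing_uri="http://example.org/parts/p1"))
```

The path check compares whole segments, so a root of `/myDesigns` does not accidentally claim `/myDesignsOther`.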
Consider using URLs
URLs are one kind of URI. A URL uniquely identifies a location on the web. There are lots of existing tools and a great deal of lore for working with URLs, much of which applies when using them as URIs. Additionally, using URLs gives you the possibility of having them point to some actual resource, ideally a document providing the data for the SBOL entity itself.
To get the most out of HTTP URIs, we recommend the following:
- You must own the domain, or be acting on behalf of the domain owner. If you want to use HTTP URLs to uniquely identify SBOL entities, then please make sure that you own the domain name. If you are using URLs under http://awesome.synbio.com/myDesigns then please first make sure that you own the awesome.synbio.com domain name, and if you are part of a larger organisation, that you own the myDesigns space within that domain. Since domains are each unambiguously owned by an entity, sticking to this will help URIs to be unique, and will prevent misattribution of designs.
- Make the path meaningful. Although the URL's path is opaque from the point of view of its use as a URI, it will help you if it forms part of a naming scheme for locating and categorising your SBOL data.
- Avoid file extensions. For the entity URIs, prefer URLs like http://awesome.synbio.com/myDesigns/collection1 to http://awesome.synbio.com/myDesigns/collection1.rdf wherever possible.
URIs are intended to be passed to a lookup service to find one or more URLs that can be used to fetch the resource. It is the lookup service that is responsible for knowing to add .rdf to the end before making the GET transaction to fetch the data. It is likely that data exposed over RESTful interfaces will provide a whole range of URLs with different extensions, probably including rdf, json and xml, and there's no reason for you to make guesses about this when authoring your URIs.
- Avoid localhost, IP addresses, and anything else that is not stable or globally meaningful. The URI is intended to be globally meaningful, so using HTTP URLs with localhost defeats the object entirely. Raw IP addresses make the URL brittle over time: a machine name may stay stable for an extended period, but its IP address may change frequently. Likewise, avoid volatile machine names such as those assigned by on-demand compute services. It is also probably a bad idea to use URLs that are only visible within an organisation for SBOL entities that you will publish outside that organisation.
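Some of these recommendations are mechanical enough to check automatically. The following Python sketch shows a hypothetical `uri_warnings` function that flags relative URIs, localhost, raw IP addresses and file extensions. It cannot check domain ownership, volatile machine names or organisation-internal hosts, which need knowledge the URI itself does not carry.

```python
import ipaddress
import posixpath
from typing import List
from urllib.parse import urlparse


def uri_warnings(uri: str) -> List[str]:
    """Return a list of best-practice warnings for a candidate entity URI."""
    warnings = []
    parts = urlparse(uri)
    host = parts.hostname or ""

    if not parts.scheme:
        warnings.append("relative URI")
    if host == "localhost":
        warnings.append("localhost")
    else:
        try:
            ipaddress.ip_address(host)  # raises ValueError for ordinary host names
            warnings.append("raw IP address")
        except ValueError:
            pass
    # splitext only inspects the final path segment, so "/v1.2/part" is fine.
    if posixpath.splitext(parts.path)[1]:
        warnings.append("file extension")
    return warnings


for candidate in (
    "http://awesome.synbio.com/myDesigns/collection1",
    "http://localhost/myDesigns/collection1",
    "http://192.168.0.12/myDesigns/collection1.rdf",
):
    print(candidate, uri_warnings(candidate))
```

A tool might surface these as warnings at authoring time rather than rejecting the URI outright, since each recommendation is advice and not a rule of the specification.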
Use Standard Prefixes
Some data-serializations support URI prefixes. We recommend the following standard prefixes:
- sbol-v1 = http://sbols.org/v1# : SBOL core vocabulary.
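For serializations that support prefixes, expanding a prefixed name back to a full URI is a simple string operation. The sketch below is illustrative only: the `expand` function is hypothetical, and the term `DnaComponent` is assumed to be a member of the core vocabulary purely for the sake of the example.

```python
# The one standard prefix recommended above.
PREFIXES = {"sbol-v1": "http://sbols.org/v1#"}


def expand(name: str) -> str:
    """Expand 'prefix:local' to a full URI; leave anything else unchanged."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return name


print(expand("sbol-v1:DnaComponent"))
print(expand("http://example.org/parts/p1"))  # 'http' is not a known prefix
```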
Data stability and Versioning
Some SBOL documents will be used for real-time data exchange. Here, the URIs are forgotten about almost before they have been minted. In these cases, it makes sense to use a mechanical naming scheme, either a session-scoped counter, or some digest of key data fields, for example. In other applications, such as long-term archiving in public databases, the URIs will exist for some extended period of time. Versioning schemes are outside the scope of the current specification, and we have little experience of long-term maintenance of SBOL URIs. We tentatively recommend the following in the hope that it may reduce the chances of versioning causing problems in the future:
- If you are using HTTP URIs, consider using the query component to distinguish a versioned URI from an unversioned one. As an example, we can take
http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake as the unversioned URI and derive
http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake?version=0.2.7 as the versioned URI with version 0.2.7.
Tooling should not blindly rely upon this scheme for converting between versioned and unversioned URIs from 3rd parties.
- If a URI can be dereferenced to an SBOL instance, it should always dereference to an equivalent instance. We don't yet have a water-tight definition of what 'equivalent' means in this context, but the intuition is that if we dereference the same URI multiple times, the entities retrieved should convey the same knowledge to the tool or user. As an example, it may be reasonable to judge that editing a collection to fix spelling mistakes within the description produces an equivalent SBOL instance, but that altering the members of the collection does not. Similarly, a DNA Component may be considered equivalent with or without the reference to the associated DNASequence, but would probably be considered non-equivalent if in two cases it referred to non-equivalent DNASequence instances.
- Multiple versions of the 'same' entity should have different URIs. If an entity has been altered enough that it requires a new version, then it is no longer equivalent.
- It may be necessary to have unversioned URIs for resources that persist in a meaningful way but change data over time. It would be reasonable for these to not be directly dereferenced, but for dereferencing to be indirected via a service that finds the most recent versioned URI.
- Where possible, the uri property of an entity should give its versioned URI. Far-pointers to other entities may use versioned or unversioned URIs. Any future versioning standard will address the use of versions, version ranges and unversioned URIs.
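The tentative query-based versioning scheme can be sketched with Python's standard `urllib.parse` functions. The helper names `with_version`, `without_version` and `version_of` are hypothetical, and the `version` parameter name follows the example above. As already noted, such helpers are only safe for URIs you minted yourself under this scheme, not for arbitrary 3rd party URIs.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def _other_params(query: str):
    """All query parameters except 'version', in their original order."""
    return [(k, v) for k, v in parse_qsl(query) if k != "version"]


def with_version(uri: str, version: str) -> str:
    """Derive a versioned URI by setting the 'version' query parameter."""
    parts = urlsplit(uri)
    query = _other_params(parts.query) + [("version", version)]
    return urlunsplit(parts._replace(query=urlencode(query)))


def without_version(uri: str) -> str:
    """Derive the unversioned URI by dropping the 'version' query parameter."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(query=urlencode(_other_params(parts.query))))


def version_of(uri: str) -> Optional[str]:
    """Return the version carried by the URI, or None if it is unversioned."""
    return dict(parse_qsl(urlsplit(uri).query)).get("version")


unversioned = "http://synbio-alchemy.com/divisions/leadToGold/alchemy/leadUptake"
versioned = with_version(unversioned, "0.2.7")
print(versioned)
print(version_of(versioned), without_version(versioned) == unversioned)
```

Because the version lives in the query component, the path of the unversioned URI is left intact and two versions of the 'same' entity still receive different URIs.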