This blog post is by Haydn Maidment, project manager at ocProducts.
The web often seems like a wild place, semi-anarchic and filled with the thoughts and ideas of innumerable people, each seeking the right audience for the right idea at the right time. Imagine how it must look to a computer: the content is written for humans, by humans, but sorted and delivered by machine.
The problem with data formatted solely for humans is that it is often messy, non-standardised and poorly structured, leaving our cognitive abilities to make sense of what we see on screen. We all have our preferred writing style and sense of aesthetics, which inevitably leads to inconsistent content that lacks the defined structure required for machine indexing. Without a way of identifying the structure and meaning of web content, machines will always have difficulty identifying and classifying the types of information that humans would recognise immediately.
For example, a machine reading plain text will be unable to readily distinguish a list of ingredients for a recipe from the cast list for a movie; additional information is required to place the text in context. Artificial Intelligence (AI) can be used to try to work out context from the particular words and word structure, but to return to our example: what if some words associated with the movie can equally describe food, as with the Hong Kong horror film “Dumplings” by director Fruit Chan?
Any machine written to interpret plain text has no means of identifying the true context of these terms, which may lead to inappropriate cross-referencing. We can hope for better AI, but unfortunately decades of computer science research have failed to simulate the brain's information processing abilities.
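To make the ambiguity concrete, here is a minimal Python sketch. The keyword lists and the function names (guess_type, declared_type) are hypothetical and invented purely for illustration; they are not taken from any real search engine or from ocPortal. The sketch compares guessing a content type from plain text with simply reading a type that the publisher has declared.

```python
# Hypothetical keyword lists; a real classifier would be far more elaborate.
FOOD_WORDS = {"dumplings", "flour", "pork", "cabbage"}
FILM_WORDS = {"director", "cast", "film", "starring"}


def guess_type(text: str) -> str:
    """Guess a content type from plain text by counting keyword hits."""
    words = {w.strip(".,:;\"'()").lower() for w in text.split()}
    food_hits = len(words & FOOD_WORDS)
    film_hits = len(words & FILM_WORDS)
    if food_hits == film_hits:
        return "unknown"
    return "recipe" if food_hits > film_hits else "movie"


def declared_type(item: dict) -> str:
    """Read the type the publisher declared, with no guessing involved."""
    return item["type"]


# A bare film title that happens to be a food word misleads the guesser.
title_only = "Dumplings"
print(guess_type(title_only))

# The same title with an explicit type attached (illustrative record shape).
annotated = {"type": "http://schema.org/Movie", "name": "Dumplings"}
print(declared_type(annotated))
```

The guesser can only count words, so a film title that is also a food word tips it the wrong way, whereas the declared type needs no interpretation at all.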
The challenge, therefore, for site owners and search engines alike is to provide metadata that allows this information to be invisibly structured, sorted and filtered. Implementing semantic markup would simplify the process of locating topic-specific information, with the net benefit of allowing greater interoperability between data sources and aggregators.
The importance of interoperability has been raised in recent years by the boom in mobile apps, many of which source information from web-based services such as third-party websites and social networks.
Implementing standardised semantic markup would allow cross-site content to be indexed and classified much more readily without adversely affecting the end-user experience. Search engines already use metadata systems such as microformats and Dublin Core to extract and display more relevant details in their listings, but these have not been widely adopted and are limited in scope.
There have been many more thorough attempts to bring order to the chaotic mass of information available online, but development has been blighted by slow uptake of the complex semantic markup languages involved (such as RDF and OWL) and by the lack of agreement on any actual schema to use on top of those languages. This now looks set to change with the introduction of HTML5 microdata, which is simple and compact and, combined with Schema.org, expansive in the range of content it can describe.
In particular, the Schema.org initiative has taken a giant leap towards creating a de facto standard through the collaboration of three industry giants: Google, Microsoft and Yahoo.
As supporting the Schema.org standard offers numerous benefits to the users of my company's CMS, ocPortal, we ensured that ocPortal 7.1 was the first CMS on the market to provide Schema.org support as an integrated feature. The obvious benefits for search engines will be a boon to amateur webmasters and niche-content providers, as web content that appeals to a limited audience will be much easier for the audience to locate.
Mobile apps are another area that can benefit from the ability to extract and display information from a wider array of sources. Many apps are currently tied to a specific data source and would benefit greatly from being able to search and extract information from any page that matches the right semantic criteria.
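As a rough sketch of what "any page that matches the right semantic criteria" could mean in practice, the Python below pulls microdata items out of two sample pages and keeps only those of a requested type. It uses only the standard library's html.parser. The MicrodataParser class, the items_of_type helper and both sample pages are hypothetical and written for this illustration; the sketch assumes flat items (no nested itemscope) and well-formed markup with no void elements such as <meta> or <br>, so it is far from a complete microdata implementation.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect flat microdata items (no nested itemscope) from an HTML page."""

    def __init__(self):
        super().__init__()
        self.items = []        # every item found so far
        self._current = None   # the item whose scope we are inside, if any
        self._depth = 0        # element depth inside the current item scope
        self._prop = None      # itemprop name waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1
            if "itemprop" in attrs:
                self._prop = attrs["itemprop"]
        elif "itemscope" in attrs:
            self._current = {"type": attrs.get("itemtype"), "props": {}}
            self.items.append(self._current)
            self._depth = 1

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._prop = None
        self._depth -= 1
        if self._depth == 0:   # left the element that opened the item scope
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current is not None and self._prop and text:
            self._current["props"].setdefault(self._prop, []).append(text)


def items_of_type(pages, wanted_type):
    """Return the items from all pages whose itemtype equals wanted_type."""
    found = []
    for html in pages:
        parser = MicrodataParser()
        parser.feed(html)
        parser.close()
        found.extend(i for i in parser.items if i["type"] == wanted_type)
    return found


# Two illustrative pages from unrelated sites, both mentioning "Dumplings".
MOVIE_PAGE = """
<div itemscope itemtype="http://schema.org/Movie">
  <span itemprop="name">Dumplings</span>
  <span itemprop="director">Fruit Chan</span>
</div>
"""
RECIPE_PAGE = """
<div itemscope itemtype="http://schema.org/Recipe">
  <span itemprop="name">Dumplings</span>
  <span itemprop="recipeIngredient">pork</span>
  <span itemprop="recipeIngredient">cabbage</span>
</div>
"""

movies = items_of_type([MOVIE_PAGE, RECIPE_PAGE], "http://schema.org/Movie")
print(movies)
```

Because the type travels with the markup, an app written this way does not need to know in advance which site a page came from; it only asks for the type it understands.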
Schema.org may have created a new means of syndicating content that surpasses primitive semantic technologies, and feed mechanisms such as RSS, in every respect; what digital developer could not be excited by this prospect?
We're excited that, having done the work of switching to HTML5 and setting up schemas for all the inbuilt ocPortal content types, we can now roll this technology out for our clients' websites with little extra thought or cost. The whole ocProducts team looks forward to a new wave of innovation as our community puts the technology to use.
Welcome to the Semantic Web.
Schema.org: Setting your content free