Universities in RDF - process

As promised, here is the much longer and more boring explanation of my RDF project. I also have a better, more entertaining explanation of how I think Universities should use RDF.


This project began in November, when I started in my position as Web Developer at the University of Pittsburgh. Since then, I have been looking for a way to effectively model faculty members (and communicate the connections between them) for publishing on front-facing University Web sites.

Criteria for effectiveness are:

  • creating a write once, publish anywhere model that centralizes the information about individual faculty members and distributes that information as needed throughout all department, school, and center sites
  • enabling faceted search, which allows users to dissect the information in different ways, giving a better understanding of the connections between faculty members and between research areas
  • optimally, allowing for visualization of these connections
  • potentially, opening this information for use by other organizations (this would be particularly useful if other Universities used the same modeling technique -- information about researchers from two separate universities collaborating on the same project could be shared on both universities' Web sites)

Requirements & Considerations

There are technical requirements that must be met to fulfill the above criteria. There are also issues that must be considered when trying to fulfill the criteria.

Open to Revision

The fact that we cannot perfectly model faculty is a given. The community is very large, diverse, and committed to individual ways of expressing information. Attempting to find all edge cases before starting the modeling process so those edge cases can be accommodated would be onerous.

Even if we could derive a schema that includes all special cases today, there is a good chance that it would be outdated within a year. We are discovering and inventing new things to know about people (mobile phone number, email address, LinkedIn profile) all the time, so any schema created today needs to be extensible or it will be obsolete before we can import all the data.

Interoperable with Other Systems

The goal of the semantic web is to develop shared meaning that can be understood by machines. When shared semantics begin to emerge and data is structured with those shared terms, data can be used in ways that the creators did not imagine. Either the creators find different ways to manipulate the data or, if it is exposed, others find ways to mash it up and provide new (and sometimes profound) information or services. As some Oxford scientists said in the Journal of the International Society for Computational Biology recently, "Scientific innovation depends on finding, integrating, and re-using the products of previous research," and it's hard to reuse that research if it's not expressed in a compatible way.

In the context of a faculty database, interoperability would have multiple benefits. If two faculty from different universities are working on the same research project, their profile information can be drawn from their respective university Web pages. If they are applying for grants, the faculty member's bio info could be queried automatically at the time of saving the grant application, so the grant application is sure to have consistent and up to date data on all participating faculty members.

If enough universities exposed their data, researchers could find potential partners for research simply by querying faculty databases based on the foaf:interest term. Grad students could find potential advisors the same way. Results could even be displayed on a map, using something like MIT's Exhibit, which could identify loci of activity for particular research interests.
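As a rough illustration of the kind of lookup this would allow, the sketch below matches people by foaf:interest over a handful of in-memory triples. It is plain Python standing in for a real SPARQL query, and every person, URI, and topic in it is invented for the example.

```python
# Toy triple store of (subject, predicate, object) tuples.
# All people and topic URIs below are hypothetical examples.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_INTEREST = "http://xmlns.com/foaf/0.1/interest"

TRIPLES = [
    ("http://example.edu/a-smith#me", FOAF_NAME, "A. Smith"),
    ("http://example.edu/a-smith#me", FOAF_INTEREST, "http://example.org/topics/semantic-web"),
    ("http://example.org/b-jones#me", FOAF_NAME, "B. Jones"),
    ("http://example.org/b-jones#me", FOAF_INTEREST, "http://example.org/topics/semantic-web"),
    ("http://example.org/c-lee#me", FOAF_NAME, "C. Lee"),
    ("http://example.org/c-lee#me", FOAF_INTEREST, "http://example.org/topics/genomics"),
]


def people_interested_in(triples, topic_uri):
    """Return the sorted names of everyone whose foaf:interest is topic_uri."""
    # First pass: collect subjects that declare the interest.
    subjects = {s for s, p, o in triples if p == FOAF_INTEREST and o == topic_uri}
    # Second pass: look up the foaf:name of each matching subject.
    return sorted(o for s, p, o in triples if p == FOAF_NAME and s in subjects)


print(people_interested_in(TRIPLES, "http://example.org/topics/semantic-web"))
```

Because the two researchers in the first result come from different (made-up) domains, the same lookup would work across universities as long as both publish the same FOAF terms.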

Address Privacy Concerns

In a Nodalities podcast, Ivan Herman talks about how the W3C is very much in the R&D stage on privacy in the Semantic Web. Privacy is an issue that this working group had not been focusing on much up to this point. However, it is a major concern, especially for people who aren't as enthusiastic about the potential of the Semantic Web.

The FOAF specification notes this concern and offers one solution. "Many people are wary of sharing information about their mailbox addresses in public. To address such concerns whilst continuing the FOAF convention of indirectly identifying people by referring to widely known properties, FOAF also provides the foaf:mbox_sha1sum mechanism, which is a relationship between a person and the value you get from passing a mailbox URI to the SHA1 mathematical function."
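A minimal sketch of how such a value could be computed, assuming the mailbox URI is simply "mailto:" followed by the address. The email address is a made-up example, and the helper name is my own rather than part of any FOAF tooling.

```python
import hashlib


def mbox_sha1sum(email):
    """Return the SHA1 hex digest of the mailto: URI for an email address."""
    # The hash is taken over the full mailbox URI, not the bare address.
    mailbox_uri = "mailto:" + email.strip()
    return hashlib.sha1(mailbox_uri.encode("utf-8")).hexdigest()


# Hypothetical address, used only for illustration.
digest = mbox_sha1sum("jane.doe@example.edu")
print(digest)
```

Two documents that publish the same digest can be matched to the same person without either one exposing the address itself.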

There is also the more privacy friendly XRI Data Interchange that offers "link contracts that enable control over the authority, security, privacy, and rights of shared data to be expressed in a standard machine-readable format."

Herman does mention in the podcast that existing tools can be used, and there are Drupal modules that restrict access to feeds, which would be one step. I did not address the privacy concern directly in my work for this assignment, but I would need to before proposing this as a solution for the University.

Solution & Process

For my deliverables, I have:

1. Created a visualization of an RDF model for describing faculty members, and their publications and courses. This model is (to borrow phrasing from the Semantic Web Conference Ontology specification) mainly a convention of how to use classes and properties from other ontologies. This model is still in a very roughed out first draft and details will still be filled in, but it encompasses the breadth of domains that the initial project would model.
2. Created a proof of concept for automatically generating RDFa. I have used the more basic data from the faculty type, creating a form that can be used to enter new faculty. All fields are automatically mapped to RDF and those terms are embedded as RDFa in the XHTML. The URI for the resources is automatically generated using the first, middle, and last name of the faculty member. The new faculty member is included in the RDF document that can be queried using SPARQL. (See Deliverables for login information to test out the system)
3. Created example SPARQL queries to demonstrate how the data can be extracted.
4. Created an Exhibit of the data to demonstrate basic faceted search. The Exhibit project offers many possibilities, which I will explore further in the future.
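The second deliverable can be sketched roughly as follows. This is not the Drupal module's code: the function names, the base URL scheme, and the exact shape of the markup are my assumptions, written only to show how a name could become a resource URI and how form fields could be emitted as RDFa.

```python
import re
from html import escape

# Assumed base URL; the host appears later in this post, the scheme is a guess.
BASE_URL = "http://www.faculty.pitt.edu/"


def faculty_uri(first, middle, last):
    """Build a resource URI from the name parts, ending in '#me'."""
    slugs = [
        re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        for part in (first, middle, last)
        if part
    ]
    return BASE_URL + "-".join(s for s in slugs if s) + "#me"


def faculty_rdfa(first, middle, last):
    """Return an XHTML fragment with FOAF terms embedded as RDFa (illustrative)."""
    uri = faculty_uri(first, middle, last)
    return (
        '<div xmlns:foaf="http://xmlns.com/foaf/0.1/" '
        f'about="{escape(uri)}" typeof="foaf:Person">'
        f'<span property="foaf:firstName">{escape(first)}</span> '
        f'<span property="foaf:lastName">{escape(last)}</span>'
        "</div>"
    )


print(faculty_uri("Lindsay", "Wardrop", "Clark"))
print(faculty_rdfa("Lindsay", "Wardrop", "Clark"))
```

Deriving the URI from the name keeps it readable, though a real system would also need a rule for two faculty members who share a name.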

Finding External Vocabularies

If interoperability is the goal, we must use the same vocabularies to describe our information.

In my XML schema project, I attempted to model faculty with an external vocabulary to increase the potential for interoperability. However, most of the XML vocabularies I found were more appropriate for modeling documents, not for modeling the complex properties and relationships that define people.

The one vocabulary I did find for modeling people was developed for homeland security by the government. While I did attempt to integrate that vocabulary into my schema, I noted that it was rather creepy to use a vocabulary that was meant to collect data on suspected terrorists to model faculty and that I would continue to look for an alternative.

In exploring RDF vocabularies, I found many more that were intended to model humans and their relationships. Some that seemed particularly suited to this project included:

  • FOAF—the most well known vocabulary for modeling people (and one of the most well known RDF vocabularies), FOAF can represent basic person information, such as contact details, and basic relationships, such as who a person knows.
  • Academic Institution Internal Structure Ontology (AIISO)—effectively models organizational relationships, such as Institution->School->Department->Faculty with the property part_of and defines courses taught by those Departments with the teaches property.
  • AIISO-Roles and Participation Schema—AIISO-Roles used with the Participation Ontology can relate the individuals (modeled with FOAF) to the institution (modeled with AIISO)
  • University Ontology—University ontology is undergoing active development and is currently unstable, but does a good job of modeling the details of course scheduling. It is being developed by Patrick Murray-John at University of Mary Washington, who is in touch with the developers of the AIISO ontology at Talis.
  • Semantic Web for Research Communities—there is much overlap between AIISO and SWRC. While there is a text on the development of SWRC, it is hard to find a clear documentation of the ontology itself, so a comparison of the two would take more time.
  • Dublin Core—One of the original and most widely used vocabularies, Dublin Core can be used for cataloging publications.
  • bibTeX.owl—bibTeX is a format description for source citation. bibTeX.owl is the bibTeX ontology chosen by Nick Matsakis to use in his BibTeX RDFizer that is part of MIT's SIMILE project. Depending on whether bibTeX data is prevalent and used throughout the community, this may be another option for cataloging publications.
  • Bibliography ontology—Bibliography reuses many existing ontologies, such as Dublin Core and FOAF. Its goal is to be a superset of legacy formats like BibTeX. It has multiple levels: level one is for simple bibliographic data, while level three can aggregate sources in many media, such as writings, speeches, and conferences. It is used in the University ontology.

    An essential criterion for selecting a vocabulary is whether it will be adopted across the domain (and across many domains, preferably). As more organizations expose their data, FOAF will likely see widespread adoption, as it is quite general, easily extended, and extensive work has already gone into establishing it as a standard. Dublin Core is also quite established.

    However, it is unclear whether the AIISO ontologies will be adopted. AIISO was developed within the past year by Talis, a software company dedicated to semantic technologies, for their academic resource list management system, Talis Aspire. It has since been picked up by the Building the Research Information Infrastructure Project at Oxford. BRII attempts to enable efficient sharing of research management information through the use of semantic technologies.

    University ontology is a smaller project than either FOAF or AIISO, but it fills a niche modeling purpose that other ontologies do not, so it may see adoption.

    Dublin Core is well known and widely used. However, Bibliography may see an upswing as people try to model newer kinds of artifacts.

    In the end, I chose to use FOAF to model people, AIISO to model the institution, University to model the courses, and Bibliography to model the publications. I am also using the Address ontology, since FOAF does not have any terms for physical addresses.

    Choosing an RDF output type

    There are three possible ways to output the RDF: a separate RDF page (i.e., www.example.com/page/rdf), an RSS 1.0 feed, or embedded RDFa. Throughout the project, I vacillated between the three for the following reasons:

    • RSS 1.0 feeds could be easily imported by sites, but I found that the SPARQL query applications could not query them.
    • Separate RDF pages could easily be queried by the SPARQL query applications, but could not be aggregated into one RDF document (at least, I did not see an easy way to do it). Separate RDF pages are also not as intuitive as having RDF embedded on the page.
    • I found that RDFa was queryable and could conceptually be aggregated into one RDF document with little effort. It is also a more intuitive way of handling the information related to the HTML page.

    I eventually decided to use RDFa. I have described the challenges in using RDFa below. Additionally, I am using the RSS feed (which I have serialized as RDF/XML) to feed into the Exhibit, described below, although an Exhibit JSON feed could be created directly from Drupal.

    Creating an Exhibit

    Exhibit is a data visualization toolkit developed by MIT's SIMILE project. It requires structured data to function, but is quite simple to implement once you have that data.

    Exhibit requires data be formatted as Exhibit JSON. There is a tool called Babel that can take other input formats such as RDF/XML and output Exhibit JSON. Arto Bendiken, who was the force behind the original RDF module for Drupal, published a tutorial in late March, RDFizing Drupal: Upgrading the RSS Feeds, which shows how to output a feed in RDF/XML format.
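To give a sense of the target format, here is a small sketch that writes a few faculty records as Exhibit JSON: a top-level "items" list in which each item carries a "label" and a "type". The records, the "Faculty" type name, and the "department" and "interest" property names are made up for illustration and are not taken from my actual feed.

```python
import json

# Hypothetical faculty records; names and values are invented.
SAMPLE_FACULTY = [
    {"name": "A. Smith", "department": "Biology", "interests": ["genomics"]},
    {"name": "B. Jones", "department": "Information Science",
     "interests": ["semantic web", "visualization"]},
]


def to_exhibit_json(faculty):
    """Serialize faculty records as an Exhibit JSON document."""
    items = []
    for person in faculty:
        items.append({
            "type": "Faculty",                # assumed item type name
            "label": person["name"],          # label identifies the item
            "department": person["department"],
            "interest": person["interests"],  # a list gives a multi-valued property
        })
    return json.dumps({"items": items}, indent=2)


print(to_exhibit_json(SAMPLE_FACULTY))
```

Properties such as "department" and "interest" are the kind of fields that could then be exposed as facets in the Exhibit.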

    Once the data is in your exhibit, you can show it in different kinds of views. For instance, you could show it on a map if it is geo-data, or on a timeline if it is chronology data.

    I have only scratched the surface of the data visualization that could be achieved with Exhibit. For instance, if we had geo data for the courses and faculty offices, we could easily use Exhibit to plot each freshman's classes and teachers' offices on a personalized map to send with their intro package.

    Challenges

    Embedding RDFa

    The module for embedding RDFa in Drupal is still under active development. It was released in mid-March by Stéphane Corlosquet, whom I discuss further below. When I used it, I found that it embedded the local terms for properties but not the external vocabulary to which the properties were mapped. For instance, the first name field had a property of "site:field_first_name" instead of "foaf:firstName".

    I altered the module to use the external vocabulary instead and posted my comment on one of the project pages. Corlosquet has since revised the module so that others will have external vocabulary embedded.

    After that, I found another challenge. I wanted to have one URI to query for all information in the faculty database. However, when I queried that URI of aggregated faculty data, all properties took that URI as their subject, instead of the URI for the individual faculty pages.

    I was able to make a slight modification to alter this. Now all properties take the individual faculty page URI with '#me' appended as their subject (a common convention for identifying the person rather than the page about them). My method is quite hacky, so I now need to generalize this modification before I can contribute it back to the community.

    Referencing across domains

    This is the portion of the project I was most excited about, so I am eager to work on it in the future.

    MK Bergman says in "Advantages and Myths of RDF", "RDF ontologies and controlled vocabularies also have some hidden power, not yet often seen in standard applications: by virtue of its structure and label properties, we can populate context-relevant dropdown lists and auto-complete entries in user interfaces solely from the input data and structure. This ability is completely generalizable solely on the basis of the input ontology(ies)."

    At DrupalCon 2009, Stéphane Corlosquet, a researcher with the Digital Enterprise Research Institute in Galway, demonstrated a module that he is developing for Drupal. This module does exactly what Bergman suggested; it queries a dataset and uses the results of the query to populate an autocomplete field.

    This could be used to connect publications to their faculty authors and courses to their faculty teachers. For instance, say a data entry person is inputting new courses to the course database. When he reaches the Instructor field, he would simply begin entering the first few letters of the instructor's last name (for instance 'Clar'). As he typed, all faculty whose last names begin with those characters would appear in a list below for him to choose from. When he chose the instructor, the field would point to the URI (www.faculty.pitt.edu/lindsay-wardrop-clark#me) instead of the literal value (Lin Clark). This all happens without any additional work on the part of the data entry person. In fact, his job is made easier, as he doesn't have to type the full name and typos are eliminated.
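The lookup described above might work along these lines. This is only a sketch of the idea, not Corlosquet's module: the suggestions come from a hard-coded list rather than a live query, and every entry other than the Clark URI mentioned above is hypothetical.

```python
# (last name, display name, resource URI). Only the first URI comes from
# this post; the other entries are hypothetical.
FACULTY = [
    ("Clark", "Lin Clark", "http://www.faculty.pitt.edu/lindsay-wardrop-clark#me"),
    ("Clarke", "J. Clarke", "http://www.faculty.pitt.edu/j-clarke#me"),
    ("Davis", "M. Davis", "http://www.faculty.pitt.edu/m-davis#me"),
]


def suggest(prefix, faculty=FACULTY):
    """Return (display name, URI) pairs whose last name starts with prefix."""
    needle = prefix.casefold()
    return [
        (display, uri)
        for last, display, uri in faculty
        if last.casefold().startswith(needle)
    ]


# Typing 'Clar' offers both Clark and Clarke; picking one stores the URI.
for display, uri in suggest("Clar"):
    print(display, "->", uri)
```

The form would show the display name to the data entry person but save the URI, which is what lets the course record link to the faculty member rather than to a string.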

    Unfortunately, at the time of this writing, Corlosquet does not feel the module is ready for contribution to the larger Drupal community, but says he is working on getting it ready.

    Conclusion

    The solution outlined would consolidate information about faculty, courses, and publications. It would streamline the publishing process and make it easier for marketing staff, support staff, and the faculty themselves to publicize faculty and their work. It would give end users a better view of the University structure, allowing them to traverse the relationships between faculty and research groups using faceted search. If this standard were adopted by other universities, it would help collaborating faculty streamline their grant seeking and publicizing efforts. It could also open the doors to an entirely new way of making academic connections.

    There would be some cost in implementing the solution. However, most of the cost could be distributed across the Web projects that the University already engages in. New faculty could be added to the database as their departments' sites were being redeveloped. As more faculty were added and new kinds of information came to light, the system could easily be extended, so there wouldn't be much up-front cost in planning for the entire University system.