



























































This thesis explores the application of Linked Data principles to tackle data heterogeneity and integration issues in the context of the 101companies project. It covers related work, the concept of Linked Data, and its implementation through RDF. The goal is to enable the exploration and querying of data from various sources in a format that is readable by both machines and humans.
1.2 The problem
As storing contributions and data for as many languages, technologies and concepts as possible is a necessity for a software chrestomathy, one can easily imagine it as a collection of highly heterogeneous code artifacts, documentation and relationships. This is problematic, as 101 is aimed at representing and conveying knowledge about these things in a structured manner. In particular, all code artifacts and documentation should be conveniently explorable, and relationships and all available data should be discoverable. The data should also be consumable by humans as well as machines. This thesis tries to enrich the 101companies chrestomathy with a Linked Data approach to tackle these problems. By applying the principles described by Tim Berners-Lee [BL07], the data is expected to become more structured and easier to consume, both from a human perspective and from the perspective of a machine operating over the data set. A summary of the results presented in this thesis, co-authored by Kevin Klein, Ralf Lämmel, Thomas Schorleiz and Andrei Varanovich, has been submitted for publication [KLL+13].
Within the 101companies project, relevant work has been published both with and without participation of the present author. The concept of the 101companies chrestomathy was introduced in [FLSV12]. The idea of linking entities to resources like languages and technologies was presented in [FLV12] and extended to the automatic recovery of such links from source artifacts in [FLL+12].
Exposing heterogeneous data through Linked Data is not a new problem. A related approach has been described by [KFH+12] and [KFRC11]. There, Linked Data-enabled software repositories are used to expose software artifacts as well as the results of preprocessing and analysis steps on these artifacts in the context of software repository mining. Other research on applying Linked Data principles to software repositories includes the linking and documentation of data in different repositories through RDF as described by [How08]. The goal is to overcome the heterogeneous documentation techniques used by the different repositories, such as documenting in wikis or database schemas, in order to improve the usability of the data. Another approach is the Linked Data Driven Software Development methodology as described by [IUHT09]. There, the goal is to transform "data from version control systems, bug
As presented in the introduction, the 101companies system stores contributions as well as their documentation. Additionally, it stores data about the concepts, languages and technologies used in these contributions. In overview, the 101project is defined by three major systems [FLL+12]:
− 101repo for storing code artifacts.
− 101wiki for storing 101-specific documentation.
− 101worker for creating derived resources containing metadata about 101repo artifacts.
3.1.1 101repo
As the contributions are small, self-running and independent software systems, it makes sense to store them in repositories. 101 uses 101repo, a GitHub-based distributed repository, to store all software artifacts related to the chrestomathy. The confederation is necessary to enable a smooth collaboration process in which new contributions can be added easily without authors having to check out the complete repository. 101repo does not only store contribution data. Code artifacts are sometimes created to highlight special concepts or to serve as "Hello World" programs for languages, and 101repo stores this data as well.
3.1.2 101wiki
The code artifacts in 101repo are documented in two ways. For one, they are treated as regular source code and should therefore be documented according to software engineering best practices. Code comments and Readme files should provide guidance on what the code does and how it does it. However, this documentation does not cover the 101-specific concerns, as it would be too disruptive to store these with the code artifacts. A wiki-based approach is used for the 101-related documentation, which highlights the interesting code parts and integrates the contribution into the 101companies ecosystem.
3.1.3 101worker
The 101worker system is tasked with deriving knowledge about the code artifacts in 101repo. This is done in several preprocessing and analysis steps, such as fact extraction, tokenization, metrics computation or metadata association based on a rule set. The worker also tries to compute the links that exist between the code artifacts and the documentation. For example, a file might use a certain language or technology, introducing a link from that file to the documentation for the language or technology.
The worker system is completely based on file I/O, meaning that it serializes the results of every analysis step to files. These files then contain the valuable derived knowledge and are referred to as derived resources.
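The file-based workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the extension-to-language rule table, the function names `derive_metadata` and `write_derived_resource`, the `.metadata.json` suffix and the example path are all invented for this sketch and are not taken from the actual 101worker implementation.

```python
import json
import os
import tempfile

# Hypothetical rule set: file extension -> language name.
RULES = {".java": "Java", ".hs": "Haskell", ".py": "Python"}


def derive_metadata(path):
    """Apply the rule set to one file path and return its metadata."""
    _, extension = os.path.splitext(path)
    # Files without a matching rule get no language association (None).
    return {"file": path, "language": RULES.get(extension)}


def write_derived_resource(source_path, out_dir):
    """Serialize the metadata for source_path into a derived resource file."""
    metadata = derive_metadata(source_path)
    out_name = os.path.basename(source_path) + ".metadata.json"
    out_path = os.path.join(out_dir, out_name)
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2)
    return out_path


if __name__ == "__main__":
    # The path below is a made-up example, not a real 101repo location.
    with tempfile.TemporaryDirectory() as out_dir:
        result = write_derived_resource("contributions/example/Company.java", out_dir)
        with open(result, encoding="utf-8") as handle:
            print(handle.read())
```

Each analysis step writes its result next to nothing but a file name convention, which is why later sections ask for explicit links between a source artifact and the resources derived from it.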
3.2 Linked Data
Linked Data can best be described as a "set of best practices for publishing and connecting structured data on the Web" [BHBL09]. The goal is not only to provide data, but also to create typed links between the data of different, possibly very diverse, data sources. The interoperability of these systems is achieved by a machine-readable description language used to encode the information.
Through the adoption of these best practices, a global data space has been created that already contains billions of assertions [BHBL09].
a graph. In this example, a resource with the name "Joe Hackathon" knows another resource called "Olga Subbotnik".
These RDF descriptions are represented as triples, consisting of a subject, predicate and object, making basic assertions about a resource. Subjects are always resources and therefore URIs. Objects can either be literal values, like strings and numbers, or links to other resources. Predicates express the type of relationship that exists between subject and object. They are basically URIs pointing to their definition in vocabularies. Figure 3.2 shows the same graph as figure 3.1, this time in triple form.
ex:Joe_Hackathon ex:has_name "Joe Hackathon"
ex:Joe_Hackathon ex:knows ex:Olga_Subbotnik
ex:Olga_Subbotnik ex:has_name "Olga Subbotnik"
ex:Olga_Subbotnik ex:knows ex:Joe_Hackathon
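To make the triple model concrete, the four statements can be held as plain Python tuples and queried by pattern matching. This is a minimal sketch: the helper names `match` and `names_of_acquaintances` are assumptions made for this example, prefixed names such as `ex:knows` stand in for full URIs, and the distinction between literals and resources is not modeled.

```python
# Triples from the example graph, written as (subject, predicate, object).
# Prefixed names such as "ex:knows" stand in for full URIs.
TRIPLES = [
    ("ex:Joe_Hackathon", "ex:has_name", "Joe Hackathon"),
    ("ex:Joe_Hackathon", "ex:knows", "ex:Olga_Subbotnik"),
    ("ex:Olga_Subbotnik", "ex:has_name", "Olga Subbotnik"),
    ("ex:Olga_Subbotnik", "ex:knows", "ex:Joe_Hackathon"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given components (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def names_of_acquaintances(triples, person):
    """Follow ex:knows links from person and collect each ex:has_name literal."""
    names = []
    for _, _, friend in match(triples, subject=person, predicate="ex:knows"):
        for _, _, name in match(triples, subject=friend, predicate="ex:has_name"):
            names.append(name)
    return names


if __name__ == "__main__":
    print(names_of_acquaintances(TRIPLES, "ex:Joe_Hackathon"))
```

Following a link is just a second lookup with the object of one triple used as the subject of the next, which is the navigation pattern that Linked Data relies on.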
Vocabularies can be expressed in RDF Schema (RDFS). This is a "declarative, machine-processable language" that "can be used to formally describe an ontology or metadata schema as a set of classes (resource types) and their properties" [Jac03]. It is also used to specify relations between classes and properties as well as to specify some constraints on these properties. It supports basic inheritance through the "subClassOf" predicate, while "domain" can be used to state the class of the subject. The predicate "range" is used to state the "object" class of a property. Figure 3.3 shows an example of RDFS statements that fit the previous examples. It is important to note that RDFS is a relatively simple ontology language: it can define the right usage of a predicate, but it has no advanced restrictions like cardinality.
ex:Person a rdfs:Class
ex:has_name a rdf:Property
ex:has_name rdfs:domain ex:Person
ex:has_name rdfs:range rdfs:Literal
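One way to see what a "domain" statement expresses is to apply it to data: any subject that uses `ex:has_name` can be concluded to be an `ex:Person`. The sketch below illustrates this single inference step in plain Python. The function name `infer_types_from_domain` is an assumption made for this example, the schema is written with `rdf:Property` (the class of properties in the RDF vocabulary), and a full RDFS reasoner covers many more rules than the one shown.

```python
RDF_TYPE = "rdf:type"

# Schema statements for the running example.
SCHEMA = [
    ("ex:Person", RDF_TYPE, "rdfs:Class"),
    ("ex:has_name", RDF_TYPE, "rdf:Property"),
    ("ex:has_name", "rdfs:domain", "ex:Person"),
    ("ex:has_name", "rdfs:range", "rdfs:Literal"),
]

# Instance data that uses the property without stating any class.
DATA = [
    ("ex:Joe_Hackathon", "ex:has_name", "Joe Hackathon"),
]


def infer_types_from_domain(schema, data):
    """For every data triple, add a type statement for each declared domain."""
    domains = {}
    for prop, predicate, cls in schema:
        if predicate == "rdfs:domain":
            # A property may declare more than one domain, so collect a set.
            domains.setdefault(prop, set()).add(cls)

    inferred = set()
    for subject, predicate, _ in data:
        for cls in domains.get(predicate, ()):
            inferred.add((subject, RDF_TYPE, cls))
    return inferred


if __name__ == "__main__":
    for triple in sorted(infer_types_from_domain(SCHEMA, DATA)):
        print(triple)
```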
RDF is just a data model and can be serialized in several ways [HB11]. One of the most common formats, which is also used in this thesis, is RDF/XML [BM03], where the RDF statements are serialized as XML. A central root node exists, and every resource described in this language is a child of this root node. The children of every resource node make up the stated triples for this resource. Figure 3. shows the previously introduced example in such a serialization, while figure 3.5 shows the RDFS for the example.
All RDF data and schemas shown in this thesis are displayed in their serialized RDF/XML form.
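The structure described above (one root node, one child per resource, one grandchild per triple) can be reproduced with Python's standard `xml.etree.ElementTree` module. This is a hedged sketch and not the tooling used in the thesis: the namespace URI `http://example.org/` for the `ex:` prefix and the function name `build_rdf_xml` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/"  # hypothetical namespace for the ex: prefix

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def build_rdf_xml():
    """Build the example graph as an RDF/XML element tree."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")  # the central root node
    people = {
        "Joe_Hackathon": ("Joe Hackathon", "Olga_Subbotnik"),
        "Olga_Subbotnik": ("Olga Subbotnik", "Joe_Hackathon"),
    }
    for local_name, (name, knows) in people.items():
        # One child of the root per described resource.
        node = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": EX_NS + local_name},
        )
        # Literal object: stored as element text.
        ET.SubElement(node, f"{{{EX_NS}}}has_name").text = name
        # Resource object: stored as an rdf:resource attribute.
        ET.SubElement(
            node,
            f"{{{EX_NS}}}knows",
            {f"{{{RDF_NS}}}resource": EX_NS + knows},
        )
    return root


if __name__ == "__main__":
    tree_root = build_rdf_xml()
    ET.indent(tree_root)  # pretty-printing; requires Python 3.9 or newer
    print(ET.tostring(tree_root, encoding="unicode"))
```

Literal objects become element text, whereas links to other resources become `rdf:resource` attributes, mirroring the two kinds of objects a triple can have.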
The basic idea of this thesis is to tackle the problems created by the diversity of the data and the heterogeneity of the systems through the Linked Data principles described in section 3.2.1. Applying the principles to the 101companies project implies that all entities in 101, which means all wiki pages, all code artifacts, all derived resources and all ontological entities, must be referable through HTTP URIs. Dereferencing those URIs must reveal meaningful data - this includes the actual content of the entity (if applicable) as well as available metadata. It should be possible to display all data in the machine-readable RDF data format. Also, everything should be as interlinked as possible, meaning that every entity should provide links to all other entities in the chrestomathy that are relevant to it.
As the description in the previous section is rather general and hard to validate, more specialized requirements that can be evaluated later are formulated in the next sections.
4.2.1 Navigate from wiki to repo
As everything has to be interlinked, it must be possible to navigate from the wiki to the repository to get from the documentation to the actual source code. This is complicated by the fact that the repository is distributed, meaning that the distribution mechanism must be exploited to actually create the links to the physical repositories.
4.2.2 Navigate from repo to wiki
The navigation should be bidirectional, meaning that it should also be possible to navigate from the files and folders of the repository to associated wiki pages. As the physical repositories are not under the control of 101companies, a view that can act as a replacement for direct repository access shall be used.
4.2.3 Referencing source code on wiki pages
As documentation on the wiki has to highlight source code from 101repo, it will refer to certain parts of the code. The unambiguous identification through URIs, as required by Linked Data principles, shall be exploited so that wiki pages can reference parts of the source code. The wiki can then dereference the URI and directly display the source code on the wiki page.
4.2.4 Displaying of metadata for 101repo entities
As the Linked Data principles require meaningful data when dereferencing a URI, the resources in 101repo shall be extended with metadata derived through 101worker. An example of such meaningful metadata is the language a file uses.
4.2.5 Associate derived resources with primary resources
The derived resources that 101worker creates can be difficult to use, as it is hard to discover what data exists and where it can be found. Therefore, in accordance with the Linked Data principle of providing links to all other relevant resources, derived resources should be linked to their source code artifacts. Additionally, there should be a link describing how the metadata was derived. Thus:
− For every source code artifact, link to the derived metadata.
− For the derived metadata, link to the source code artifact from which it was taken.
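The two linking directions listed above can be illustrated by generating a pair of triples for every derivation. This is a sketch under stated assumptions: the base URI `http://example.org/101/`, the predicate names `ex:hasDerivedResource` and `ex:derivedFrom`, the file names and the function name `link_resources` are hypothetical and are not taken from the 101companies vocabulary.

```python
# Hypothetical base URI and predicate names, chosen only for illustration.
BASE = "http://example.org/101/"
HAS_DERIVED = "ex:hasDerivedResource"
DERIVED_FROM = "ex:derivedFrom"


def link_resources(derivations):
    """Create forward and backward links for every derived resource.

    derivations maps a source artifact path to the list of paths of the
    resources derived from it.
    """
    triples = []
    for source, derived_paths in derivations.items():
        for derived in derived_paths:
            # Source artifact -> derived metadata.
            triples.append((BASE + source, HAS_DERIVED, BASE + derived))
            # Derived metadata -> source artifact it was taken from.
            triples.append((BASE + derived, DERIVED_FROM, BASE + source))
    return triples


if __name__ == "__main__":
    example = {"Company.java": ["Company.java.metrics.json"]}
    for triple in link_resources(example):
        print(triple)
```

Emitting both directions explicitly means that a client can start from either the artifact or the derived file and still discover the other without a reverse lookup.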