Linked Data Principles in Action: A Thesis on Data Heterogeneity and Integration, Essays (university) of Information Technology

This thesis explores the application of Linked Data principles to tackle data heterogeneity and integration issues in the context of the 101companies project. It covers related work, the concept of Linked Data, and its implementation through RDF. The goal is to enable the exploration and querying of data from various sources in a machine- and human-readable format.

What you will learn

  • What are the benefits of using Linked Data for data exploration and querying?
  • What are the Linked Data principles and how do they contribute to data integration?
  • What are the challenges and limitations of applying Linked Data principles to non-Linked Data sources?
  • How is Linked Data applied to software repositories and source code?
  • How does the 101companies project implement Linked Data principles?

Typology: Essays (university)

2018/2019

Uploaded on 12/16/2019

nguyen-minh-truc
Department 4: Computer Science
Enhancement of a software chrestomathy for open linked data
Master's thesis
submitted for the degree of Master of Science
by
Martin Leinberger
First reviewer: Prof. Dr. R. Lämmel
Institute for Software Technology
Second reviewer: M. Sc. A. Varanovich
Institute for Software Technology
Koblenz, July 2013

Declaration

I affirm that I have written this thesis independently and have used no sources or aids other than those indicated.

Yes No

I agree to this thesis being placed in the library.

I consent to the publication of this thesis on the Internet.

(Place, date) (Signature)

Contents

  • 1 Introduction
    • 1.1 101companies
    • 1.2 The problem
  • 2 Related work
    • 2.1 Previous work within the 101companies project
    • 2.2 Related work inside a Linked Data context
    • 2.3 Related work outside a Linked Data context
  • 3 Background
    • 3.1 101companies
      • 3.1.1 101repo
      • 3.1.2 101wiki
      • 3.1.3 101worker
    • 3.2 Linked Data
      • 3.2.1 Principles
      • 3.2.2 Data model
    • 3.3 JSON schema
  • 4 Linked Data requirements
    • 4.1 Linked Data principles and 101companies
    • 4.2 Further requirements
      • 4.2.1 Navigate from wiki to repo
      • 4.2.2 Navigate from repo to wiki
      • 4.2.3 Referencing source code on wiki pages
      • 4.2.4 Displaying of metadata for 101repo entities
      • 4.2.5 Associate derived resources with primary resources
      • 4.2.6 Operate the wiki like a graph
      • 4.2.7 Operate the repo like a tree
      • 4.2.8 Querying for data
      • 4.2.9 Human and machine readable data
  • 5 Data modeling
    • 5.1 101repo data
      • 5.1.1 Mounting of repositories
      • 5.1.2 Data model behind the stored data
    • 5.2 101wiki data
      • 5.2.1 Relation between 101wiki and 101repo
      • 5.2.2 Links in 101wiki
    • 5.3 101worker data
      • 5.3.1 Metadata derivation for code artifacts
      • 5.3.2 Enabling of fragments
      • 5.3.3 Creation of derived resources
  • 6 Implementation
    • 6.1 101wiki
    • 6.2 101worker
      • 6.2.1 Module Implementations
        • 6.2.1.1 Assembly of 101repo through a module
        • 6.2.1.2 Assignment of metadata
      • 6.2.2 Web access
    • 6.3 The exploration service
      • 6.3.1 Serving of 101repo entities
      • 6.3.2 Link construction
  • 7 Evaluation
    • 7.1 Evaluation through the requirements
      • 7.1.1 Navigate from wiki to repo
      • 7.1.2 Navigate from repo to wiki
      • 7.1.3 Referencing source code on wiki pages
      • 7.1.4 Displaying of metadata for 101repo entities
      • 7.1.5 Associate derived resources with primary resources
      • 7.1.6 Operate the wiki like a graph
      • 7.1.7 Operate the repo like a tree
      • 7.1.8 Querying for data
      • 7.1.9 Human and machine readable data
    • 7.2 Further scenario based evaluation
      • 7.2.1 Clone detection
      • 7.2.2 Metrics based comparison of contributions
      • 7.2.3 Concept analysis
  • 8 Conclusion
    • 8.1 Summary
    • 8.2 Future work

List of Figures

  • 3.1 Simple RDF graph example
  • 3.2 Examples of RDF Triples
  • 3.3 Examples for RDFS statements
  • 3.4 Example RDF/XML document
  • 3.5 Example RDFS serialized in RDF/XML
  • 3.6 Example for a JSON schema
  • 5.1 Registry for mounting repositories in 101repo
  • 5.2 Data model of 101repo
  • 5.3 JSON schema for file entities in 101repo
  • 5.4 RDF schema for metadata in 101repo
  • 5.5 Wiki namespaces and folders in 101repo
  • 5.6 Properties for typed links internal to the wiki
  • 5.7 Properties (types) for typed links to external resources
  • 5.8 RDF schema of 101wiki
  • 5.9 Rules and preconditions (excerpt) in 101meta
  • 5.10 Assignments (excerpt) in 101meta.
  • 5.11 Model for facts
  • 5.12 Query language for fragment locators
  • 5.13 Derived files for every source artifact in 101repo
  • 5.14 Derived files for folders in 101repo
  • 5.15 Derived dumps
  • 5.16 JSON schema for module descriptions.
  • 6.1 Overview over the wiki
  • 6.2 Triples for the language Haskell as exposed by the server
  • 6.3 A page as returned by the server
  • 6.4 Rendered page of the language Haskell
  • 6.5 Triples for the language Haskell rendered in 101wiki.
  • 6.6 Modules executed as a batch process
  • 6.7 Serialized version of the repository registry (excerpt).
  • 6.8 Assembling 101repo
  • 6.9 The dump created based on assembling 101repo
  • 6.10 Serialized version of 101meta identifying Java files.
  • 6.11 Principal workflow of executing 101meta rules.
  • 6.12 Services on 101worker
  • 6.13 Screenshot of the exploration service
  • 6.14 Some simple metadata units about a Java file
  • 6.15 Excerpt from the dumped wiki data
  • 6.16 Architecture of the exploration service
  • 6.17 Workflow of the exploration service
  • 6.18 Summary of module descriptions
  • 6.19 Creation of links to other data sources
  • 7.1 Navigating from the wiki to the repo
  • 7.2 Navigating from the exploration view to the wiki
  • 7.3 Metadata in the exploration service
  • 7.4 Derived resources in the exploration service
  • 7.5 101wiki data for a Haskell-based contribution
  • 7.6 Browsing through 101repo with the exploration service
  • 7.7 SPARQL query extracting all Java files
  • 7.8 SPARQL query on 101wiki
  • 7.9 JSON serialized data about a file
  • 7.10 RDF/XML serialized data about a file
  • 7.11 Python code for clone detection with the exploration service
  • 7.12 Results of perfect clone detection
  • 7.13 LOC for contributions of the same feature set
  • 7.15 Concepts associated with the functional and object oriented paradigms
  • 7.16 Concept analysis in Groovy

1.2 The problem

As storing contributions and data for as many languages, technologies and concepts as possible is a necessity for a software chrestomathy, one can easily imagine it as a collection of highly heterogeneous code artifacts, documentation and relationships. This is problematic, as 101 is aimed at representing and conveying knowledge about these things in a structured manner. In particular, all code artifacts and documentation should be conveniently explorable, and all relationships and available data should be discoverable. The data should also be consumable by humans as well as machines.

This thesis tries to enrich the 101companies chrestomathy with a Linked Data approach to tackle these problems. By applying the principles described by Tim Berners-Lee [BL07], the data is expected to become more structured and easier to consume, both from a human perspective and from the perspective of a machine operating over the data set.

A summary of the results presented in this thesis, co-authored by Kevin Klein, Ralf Lämmel, Thomas Schorleiz and Andrei Varanovich, has been submitted for publication [KLL+13].

Chapter 2

Related work

2.1 Previous work within the 101companies project

Within the 101companies project, relevant work has been published with and without participation of the present author. The concept of the 101companies chrestomathy was introduced in [FLSV12]. The idea of linking entities to resources like languages and technologies was presented in [FLV12] and extended to the automatic recovery of such links from source artifacts in [FLL+12].

2.2 Related work inside a Linked Data context

Exposing heterogeneous data through Linked Data is not a new problem. A related approach has been described by [KFH+12] and [KFRC11]. There, Linked Data enabled software repositories are used to expose software artifacts as well as the results of preprocessing and analysis steps on these artifacts in the context of software repository mining.

Other research on applying Linked Data principles to software repositories includes the linking and documentation of data in different repositories through RDF as described by [How08]. The goal is to overcome heterogeneous documentation techniques used by the different repositories, such as documenting in wikis or database schemas, in order to improve usability of the data. Another approach is the Linked Data Driven Software Development methodology as described by [IUHT09]. There, the goal is to transform "data from version control systems, bug

Chapter 3

Background

3.1 101companies

As presented in the introduction, the 101companies system stores contributions as well as their documentation. Additionally, it stores data about the concepts, languages and technologies used in these contributions. In overview, the 101project is defined by three major systems [FLL+12]:

− 101repo for storing code artifacts.

− 101wiki for storing 101-specific documentation.

− 101worker for creating derived resources containing metadata about 101repo artifacts.

3.1.1 101repo

As the contributions are small, self-running and independent software systems, it makes sense to store them in repositories. 101 uses 101repo, a GitHub-based distributed repository, to store all software artifacts related to the chrestomathy. The confederation of repositories is necessary to enable a smooth collaboration process, in which new contributions can be added easily without authors having to check out the complete repository. 101repo does not store contribution data only: code artifacts are sometimes created to highlight special concepts or to serve as "Hello World" programs for languages, and these have to be stored as well.

3.1.2 101wiki

The code artifacts in 101repo are documented in two ways. For one, they are treated as regular source code and should therefore be documented according to software engineering best practices. Code comments and Readme files should provide guidance on what the code does and how it does it. However, this documentation does not consider the 101-specific concerns, as it would be too disruptive to store these with the code artifacts. Therefore, a wiki-based approach is used for the 101-related documentation, which highlights the interesting code parts and integrates the contribution into the 101companies ecosystem.

3.1.3 101worker

The 101worker system is tasked with deriving knowledge about the code artifacts in 101repo. This is done in several preprocessing and analysis steps, such as fact extraction, tokenization, metrics computation or metadata association based on a rule set. The worker also tries to compute the links that exist between the code artifacts and the documentation. For example, a file might use a certain language or technology, introducing a link from that file to the documentation for the language or technology.

The worker system is completely based on file I/O, meaning that it serializes the results of every analysis step in files. These files then contain the valuable derived knowledge and are referred to as derived resources.
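The file-based workflow can be illustrated with a minimal Python sketch. It is not taken from 101worker: the lines-of-code metric, the function names and the ".metrics.json" suffix are assumptions chosen for illustration only.

```python
import json
import tempfile
from pathlib import Path


def count_loc(source: str) -> int:
    """Count non-blank lines; a deliberately simple stand-in for a metrics step."""
    return sum(1 for line in source.splitlines() if line.strip())


def derive_metrics(artifact: Path) -> Path:
    """Run one analysis step and serialize its result next to the artifact."""
    metrics = {"loc": count_loc(artifact.read_text(encoding="utf-8"))}
    # Hypothetical naming convention: <artifact name> + ".metrics.json".
    derived = artifact.with_name(artifact.name + ".metrics.json")
    derived.write_text(json.dumps(metrics), encoding="utf-8")
    return derived


def demo() -> dict:
    """Create a throwaway artifact, derive its metrics file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "Company.java"
        artifact.write_text("class Company {\n\n}\n", encoding="utf-8")
        derived = derive_metrics(artifact)
        content = json.loads(derived.read_text(encoding="utf-8"))
        return {"name": derived.name, "content": content}
```

Because every step only reads and writes files, a later step can consume the derived file without knowing how it was produced.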

3.2 Linked Data

Linked Data can best be described as a "set of best practices for publishing and connecting structured data on the Web" [BHBL09]. The goal is not only to provide data, but also to create typed links between the data of different, possibly very diverse, data sources. The interoperability of these systems is achieved by a machine-readable description language used to encode the information. Through the adoption of these best practices, a global data space has been created that already contains billions of assertions [BHBL09].

a graph. In this example, it is described that a resource with the name "Joe Hackathon" knows another resource called "Olga Subbotnik".

Figure 3.1: Simple RDF graph example.

These RDF descriptions are represented as triples, consisting of a subject, predicate and object, making basic assertions about a resource. Subjects are always resources and therefore URIs. Objects can either be literal values, like strings and numbers, or they can be links to other resources. Predicates express the type of relationship that exists between subject and object. They are basically URIs pointing to their definition in vocabularies. Figure 3.2 shows the same graph as Figure 3.1, just this time in triple form.

ex:Joe_Hackathon ex:has_name "Joe Hackathon"
ex:Joe_Hackathon ex:knows ex:Olga_Subbotnik
ex:Olga_Subbotnik ex:has_name "Olga Subbotnik"
ex:Olga_Subbotnik ex:knows ex:Joe_Hackathon

Figure 3.2: Examples of RDF Triples (URIs replaced by "ex").
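To make the triple form more tangible, the following Python sketch stores the four statements of Figure 3.2 as plain tuples and answers simple pattern queries over them. The names Triple, GRAPH and match are illustrative assumptions and do not refer to any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # either a resource name or a quoted literal


GRAPH = [
    Triple("ex:Joe_Hackathon", "ex:has_name", '"Joe Hackathon"'),
    Triple("ex:Joe_Hackathon", "ex:knows", "ex:Olga_Subbotnik"),
    Triple("ex:Olga_Subbotnik", "ex:has_name", '"Olga Subbotnik"'),
    Triple("ex:Olga_Subbotnik", "ex:knows", "ex:Joe_Hackathon"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Whom does Joe know?
known_by_joe = [t.obj for t in match(GRAPH, "ex:Joe_Hackathon", "ex:knows")]
```

Leaving a position open in the pattern corresponds to a variable in a query language such as SPARQL, although a real query engine offers far more than this wildcard lookup.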

Vocabularies can be expressed in RDF schema (RDFS). This is a "declarative, machine-processable language", that "can be used to formally describe an ontology or metadata schema as a set of classes (resource types) and their properties" [Jac03]. It is also used to specify relations between classes and properties as well as to specify some constraints on these properties. It supports basic inheritance through the "subClassOf" predicate, while "domain" can be used to state the class of the subject. The predicate "range" is used to state the "object" class of a property. Figure 3.3 shows an example of RDFS statements that fit the previous examples. It is important to notice that RDFS is a relatively simple ontology language, meaning that it can define the right usage of a predicate, but it has no advanced restrictions like cardinality.

ex:Person a rdfs:Class

ex:has_name a rdf:Property
ex:has_name rdfs:domain ex:Person
ex:has_name rdfs:range rdfs:Literal

Figure 3.3: Examples for RDFS statements.
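In RDFS semantics, a "domain" statement acts as an entailment rule: every subject that uses the property can be inferred to be an instance of the stated class. The following Python sketch applies this single rule to the example data. The tuple representation and the function name are assumptions for illustration; a full RDFS reasoner covers many more rules.

```python
RDF_TYPE = "rdf:type"

SCHEMA = [
    ("ex:has_name", "rdfs:domain", "ex:Person"),
    ("ex:has_name", "rdfs:range", "rdfs:Literal"),
]

DATA = [
    ("ex:Joe_Hackathon", "ex:has_name", '"Joe Hackathon"'),
    ("ex:Olga_Subbotnik", "ex:has_name", '"Olga Subbotnik"'),
]


def infer_types_from_domain(schema, data):
    """Apply the rdfs:domain rule: (p domain C) and (s p o) entail (s rdf:type C)."""
    domains = [(p, c) for (p, rel, c) in schema if rel == "rdfs:domain"]
    return {
        (s, RDF_TYPE, c)
        for (s, p, _) in data
        for (domain_property, c) in domains
        if p == domain_property
    }
```

With the statements above, both resources are inferred to be of type ex:Person, even though no triple states this explicitly.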

RDF is just a data model and can be serialized in several ways [HB11]. One of the most common formats, which is also used in this thesis, is RDF/XML [BM03], where the RDF statements are serialized as XML. A central root node exists, while every resource described in this language is a child of this root node. The children of every resource node make up the stated triples for this resource. Figure 3.4 shows the previously introduced example in such a serialization, while Figure 3.5 shows the RDFS for the example.

[RDF/XML listing; markup lost in extraction, surviving literals: "Joe Hackathon", "Olga Subbotnik"]

Figure 3.4: Example RDF/XML document.
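The structure described above (one root node, one child node per resource, one grandchild per triple) can be sketched with Python's standard xml.etree.ElementTree module. The namespace URI http://example.org/ is an assumption, since the thesis only abbreviates it as "ex", and the function to_rdf_xml is a hypothetical helper, not part of any RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX_NS = "http://example.org/"  # assumption: the thesis only writes "ex"

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ex", EX_NS)


def to_rdf_xml(descriptions):
    """Serialize {subject: [(predicate, value, is_resource), ...]} as RDF/XML."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")  # the central root node
    for subject, statements in descriptions.items():
        # One child of the root per described resource.
        node = ET.SubElement(
            root,
            f"{{{RDF_NS}}}Description",
            {f"{{{RDF_NS}}}about": EX_NS + subject},
        )
        for predicate, value, is_resource in statements:
            # One child of the resource node per stated triple.
            child = ET.SubElement(node, f"{{{EX_NS}}}{predicate}")
            if is_resource:
                child.set(f"{{{RDF_NS}}}resource", EX_NS + value)
            else:
                child.text = value
    return ET.tostring(root, encoding="unicode")


DOCUMENT = to_rdf_xml({
    "Joe_Hackathon": [("has_name", "Joe Hackathon", False),
                      ("knows", "Olga_Subbotnik", True)],
    "Olga_Subbotnik": [("has_name", "Olga Subbotnik", False),
                       ("knows", "Joe_Hackathon", True)],
})
```

Literal objects end up as element text, while links to other resources are expressed through the rdf:resource attribute.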

All RDF data and schemas shown in this thesis are displayed in their serialized RDF/XML form.

Chapter 4

Linked Data requirements

4.1 Linked Data principles and 101companies

The basic idea of this thesis is to tackle the problems created by the diversity of the data and the heterogeneity of the systems through the Linked Data principles described in section 3.2.1. Applying the principles to the 101companies project implies that entities in 101, which means all wiki pages, all code artifacts, all derived resources and all ontological entities, must be referable through HTTP URIs. Dereferencing those must reveal meaningful data; this includes the actual content of the entity (if applicable) as well as available metadata. It should be possible to display all data in the machine-readable RDF data format. Also, everything should be as interlinked as possible, meaning that every entity should provide links to all other entities in the chrestomathy that are relevant to it.
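One common way to serve the same URI to humans and machines is HTTP content negotiation. The following Python sketch shows the idea under the assumption that the server inspects the Accept header; the media type table and the function name are illustrative and do not describe the actual 101companies services.

```python
# Assumed mapping from media types to the representations a server could offer.
REPRESENTATIONS = {
    "application/rdf+xml": "rdf",
    "application/json": "json",
    "text/html": "html",
}


def choose_representation(accept_header: str, default: str = "html") -> str:
    """Pick the first supported media type listed in an HTTP Accept header.

    Simplification: quality values (q=...) are ignored and order decides.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return default
```

A browser asking for text/html would receive a rendered page, while a client asking for application/rdf+xml would receive the triples for the same entity.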

4.2 Further requirements

As the description in the previous section is rather general and hard to validate, more specialized requirements that can be evaluated later are formulated in the next sections.

4.2.1 Navigate from wiki to repo

As everything has to be interlinked, it must be possible to navigate from the wiki to the repository to get from the documentation to the actual source code. This is complicated by the fact that the repository is distributed, meaning that the distribution mechanism must be exploited to actually create the links to the physical repositories.
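How such a distribution mechanism could be exploited is sketched below in Python: a registry maps folders of the virtual repository to physical locations, and a link is built by finding the longest registered prefix of a path. The registry entries, the example.org URLs and the function name are hypothetical and only serve as an illustration.

```python
# Hypothetical registry: virtual 101repo folder -> physical repository location.
REGISTRY = {
    "contributions/exampleA": "https://example.org/repos/userA/exampleA",
    "languages": "https://example.org/repos/org/languages",
}


def resolve(virtual_path, registry=REGISTRY):
    """Map a virtual 101repo path to a location in a physical repository.

    The longest registered prefix wins; None means the path is not mounted.
    """
    parts = virtual_path.strip("/").split("/")
    for end in range(len(parts), 0, -1):
        prefix = "/".join(parts[:end])
        if prefix in registry:
            rest = "/".join(parts[end:])
            return registry[prefix] + ("/" + rest if rest else "")
    return None
```

A wiki page would then only need to know the virtual path of a contribution, while the registry decides which physical repository the link points to.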

4.2.2 Navigate from repo to wiki

The navigation should be bidirectional, meaning that it should also be possible to navigate from the files and folders of the repository to associated wiki pages. As the physical repositories are not under the control of 101companies, a view that can act as a replacement for direct repository access shall be used.

4.2.3 Referencing source code on wiki pages

As documentation on the wiki has to highlight source code from 101repo, it will refer to certain parts of the code. The unambiguous identification through URIs, as required by the Linked Data principles, shall be exploited so that wiki pages can reference parts of the source code. The wiki can then dereference the URI and directly display the source code on the wiki page.

4.2.4 Displaying of metadata for 101repo entities

As the Linked Data principles require meaningful data when dereferencing a URI, the resources in 101repo shall be extended with metadata derived through 101worker. An example of such meaningful metadata is the language a file uses.

4.2.5 Associate derived resources with primary resources

The derived resources that 101worker creates can be difficult to use, as it is hard to discover what data exists and where it can be found. Therefore, in accordance with the Linked Data principle of providing links to all other relevant resources, derived resources should be linked to their source code artifacts. Additionally, links should indicate how the metadata was derived. Thus:

− For every source code artifact, link to the derived metadata.

− For the derived metadata, link to the source code artifact which it was taken from.
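The two link directions listed above can be kept in a simple index, as the following Python sketch shows. The file names, the step names and the function build_link_index are assumptions made for illustration and are not taken from 101worker.

```python
from collections import defaultdict


def build_link_index(derivations):
    """Index (artifact, derived resource, deriving step) records in both directions."""
    derived_for = defaultdict(list)  # artifact -> its derived resources
    source_of = {}                   # derived resource -> (artifact, deriving step)
    for artifact, derived, step in derivations:
        derived_for[artifact].append(derived)
        source_of[derived] = (artifact, step)
    return dict(derived_for), source_of


# Hypothetical file and step names, for illustration only.
DERIVATIONS = [
    ("Company.java", "Company.java.metrics.json", "metrics"),
    ("Company.java", "Company.java.tokens.json", "tokenizer"),
]

derived_for, source_of = build_link_index(DERIVATIONS)
```

Recording the deriving step next to the backward link also captures how each piece of metadata was derived.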