diff --git a/_episodes/00-Introduction-to-Module.md b/_episodes/00-Introduction-to-Module.md index 1178980..889801a 100644 --- a/_episodes/00-Introduction-to-Module.md +++ b/_episodes/00-Introduction-to-Module.md @@ -26,7 +26,7 @@ A ReproNim module is a set of steps or "lessons", in which we have gathered mate ### Who is this module for? -The module is for you, if you are a biomedical researcher or an informatics researcher or student, is working with neuroimaging (or not) and you want to know about reproducibility and data. To ensure data supports reproducible research, the FAIR principles were issued through [FORCE11](http://force11.org): the Future of Research Communications and e-Scholarship. The FAIR principles put forth characteristics that contemporary data resources, tools, vocabularies and infrastructures should exhibit to ensure understandability, assist discovery and reuse by third-parties. [Wilkinson et al.,2016](https://www.nature.com/articles/sdata201618). FAIR stands for: Findable, Accessible, Interoperable and Re-usable. +The module is for you if you are a biomedical researcher, an informatics researcher, or a student, working with neuroimaging (or not), who wants to know about reproducibility and data. To ensure data supports reproducible research, the FAIR principles were issued through [FORCE11](http://force11.org): the Future of Research Communications and e-Scholarship. The FAIR principles put forth characteristics that contemporary data resources, tools, vocabularies and infrastructures should exhibit to ensure understandability and to assist discovery and reuse by third parties ([Wilkinson et al., 2016](https://www.nature.com/articles/sdata201618)). FAIR stands for: Findable, Accessible, Interoperable and Re-usable. ### What will you learn @@ -45,7 +45,7 @@ That really depends on your familiarity with concepts covered in the episodes an 5. 
[Lesson 5.]({{site.root}}/06-Semantic-Data-Representations) How to setup a testing framework to revalidate analyses as data and software change -### Do I need to code ? What language ? +### Do I need to code? What language? You can learn a lot without coding, however, some of the lessons and exercises will require some coding. So, yes, you should code. We have (mostly) adopted python for the language, it may not be your first choice but we think some knowledge of python coding will help you anyways. We will try to help as much as possible by providing tutorials, examples, and links to installation instructions. @@ -53,4 +53,4 @@ You can learn a lot without coding, however, some of the lessons and exercises w The ReproNim training events can only accommodate a limited number of participants. Nevertheless, we are committed to openness and we are committed to providing our -materials in an open format, with liberal licenses, and through a publicly accessible website. You can also contribute to this training module - just [fork us on Gihub](https://github.com/ReproNim/module-FAIR-data)! +materials in an open format, with liberal licenses, and through a publicly accessible website. You can also contribute to this training module - just [fork us on GitHub](https://github.com/ReproNim/module-FAIR-data)! diff --git a/_episodes/01-Web-of-Data.md b/_episodes/01-Web-of-Data.md index 2ea730f..3aa8ea6 100644 --- a/_episodes/01-Web-of-Data.md +++ b/_episodes/01-Web-of-Data.md @@ -47,16 +47,18 @@ This lesson provides an overview of strategies for making research outputs avail The advent of computers and networks have allowed scientists to share not only the final report of the work, i.e., the scientific article, but also additional research products that form an integral part of the work, e.g., data, code, workflows and even works like slide presentations. The term "research object" in a general sense connotes these many outputs of scientific research. 
The term ["research object"](https://en.wikipedia.org/wiki/Research_Object) is also used specifically to refer to a method for the identification, aggregation and exchange of scholarly information on the Web. The basic idea is that each of these objects should have its own persistent identifier (see below) and objects that belong together, e.g., an article with its associated code and data, should have some means of being aggregated, so that all associated research objects can be discovered together. Although this might seem to be obvious, as research objects are scattered across different repositories on the web, the connections between them are often lost. -One of the most important concepts to understand and implement when developing information systems on the web for robust and reproducible science is that of globally unique and persistent identifiers [(PIDs)](https://en.wikipedia.org/wiki/Persistent_identifier). You will see references to PIDs throughout this tutorial. Globally unique and peristent identifiers are exactly what they sound like: a unique and long-lasting reference to something, e.g., documents, files, books, people and even the concepts that define a field (ontologies). Globally unique means that the identifier points to only a single entity on a global scale. Many of you may be familiar with locally unique identifiers, e.g., database accession numbers, that are unique within a single source. But when utilizing the web to retrieve information, we have to consider the entire web.For example, a web URL is a globally unique identifier because each URL is different. If they weren't, then there would be no sure way to locate documents on the web. +One of the most important concepts to understand and implement when developing information systems on the web for robust and reproducible science is that of globally unique and persistent identifiers [(PIDs)](https://en.wikipedia.org/wiki/Persistent_identifier). 
You will see references to PIDs throughout this tutorial. Globally unique and persistent identifiers are exactly what they sound like: a unique and long-lasting reference to something, e.g., documents, files, books, people and even the concepts that define a field (ontologies). Globally unique means that the identifier points to only a single entity on a global scale. Many of you may be familiar with locally unique identifiers, e.g., database accession numbers, that are unique within a single source. But when utilizing the web to retrieve information, we have to consider the entire web. For example, a web URL is a globally unique identifier because each URL is different. If they weren't, then there would be no sure way to locate documents on the web. -Identifiers for non-digital objects: Persistent identifiers can still be assigned to things that are not natively digital, e.g., people, reagents, concepts. In this case, some authority issues identifiers for these entities. The PID can be thought of as "digital middle names" for things we refer to. If they are unique-that is, each identifier identifies only one thing-and they are designed to be used by computers, then PIDs can be used to locate references to these particular entities reliably and disambiguate that one entity from others that may have similar names. For example, the concept "nucleus" can point to the nucleus of a cell, of an atom or of the brain. But these three concepts each have unique identifiers in biological terminologies such as the Gene Ontology or UMLS (Unified Medical Language System). When these identifiers are used inside of databases or research articles, a machine can easily tell them apart. +#### Identifiers for non-digital objects -Although a URL can be thought of as a form of PID, PIDs are actually distinct from locators like URL's in that they retrieve an object independent of where the object is located on the web. We are all familiar with broken links on the web. 
A broken link happens when a web browser sends out a request to a server for a document with a specified URL. If that URL no longer exists, because, for example, the system administrator migrated systems and created a new URL or if someone decided no longer to pay the fee to keep up a particularly domain name on the internet, we get a 404: Document not found error. In scholarship, the [404 error](http://www.cs.umd.edu/~golbeck/LBSC690/SemanticWeb.html) is particularly pernicious as it breaks the chain of evidence that is provided within a scientific paper. If we refer to a web page that no longer exists, we have no way to verify or benefit from the information in that web page. Broken links can be avoided by making sure that re-direct pages are available for old URLs, but how do you know that the new URL actually points to the same document? Also, what happens when multiple copies of the same article exist in different places? Each of those places would have its own URL, making it look to a computer as if the objects they reference are different. +Persistent identifiers can still be assigned to things that are not natively digital, e.g., people, reagents, concepts. In this case, some authority issues identifiers for these entities. The PID can be thought of as "digital middle names" for things we refer to. If they are unique -- that is, each identifier identifies only one thing -- and they are designed to be used by computers, then PIDs can be used to locate references to these particular entities reliably and disambiguate that one entity from others that may have similar names. For example, the concept "nucleus" can point to the nucleus of a cell, of an atom or of the brain. But these three concepts each have unique identifiers in biological terminologies such as the Gene Ontology or UMLS (Unified Medical Language System). When these identifiers are used inside of databases or research articles, a machine can easily tell them apart. 
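To make the disambiguation idea concrete, here is a minimal Python sketch of how PIDs separate the three senses of "nucleus". The CURIE-to-IRI expansion follows the common OBO PURL pattern, but the specific identifiers and the `PHYS:` namespace are illustrative assumptions, not an authoritative registry:

```python
# Toy sketch: three senses of "nucleus", each with its own globally
# unique identifier. The CURIEs below are illustrative, not authoritative.
NUCLEUS_SENSES = {
    "cell nucleus": "GO:0005634",       # Gene Ontology cellular component
    "brain nucleus": "UBERON:0000125",  # anatomy ontology (illustrative)
    "atomic nucleus": "PHYS:nucleus",   # hypothetical physics vocabulary
}

# Namespace prefixes, following the OBO PURL convention for the first two;
# the PHYS namespace is a made-up placeholder.
NAMESPACES = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
    "PHYS": "http://example.org/physics/",
}

def expand_curie(curie: str) -> str:
    """Turn a prefixed local ID (CURIE) into a globally unique IRI."""
    prefix, local = curie.split(":", 1)
    return NAMESPACES[prefix] + local

print(expand_curie(NUCLEUS_SENSES["cell nucleus"]))
# -> http://purl.obolibrary.org/obo/GO_0005634
```

Because each sense expands to a different IRI, a machine reading a database or article that uses these identifiers can tell the three concepts apart without any linguistic guesswork.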
-PIDs actually address both of these problems. The identifier is separated from the location so that it buffers against changes in location. If a web document is moved to a new location, the new location is registered with the resolving system so that it now points to the same object in a new place. If the same object is present in multiple locations, e.g., a scientific article can be found at the publisher's web site, in Pub Med Central and in platforms such as Mendeley,the DOI listed as part of the metadata is the same, so we know that it is the same article in different places. It is important to note that these identifier systems, as will discussed below in the FAIR principles, are not magic. Rather they are a social contract between the publisher of research objects and users that they will maintain the integrity of the resolution services. +Although a URL can be thought of as a form of PID, PIDs are actually distinct from locators like URLs in that they retrieve an object independent of where the object is located on the web. We are all familiar with broken links on the web. A broken link happens when a web browser sends out a request to a server for a document with a specified URL. If that URL no longer exists, because, for example, the system administrator migrated systems and created a new URL or if someone decided to stop paying the fee to keep up a particular domain name on the internet, we get a 404: Document not found error. In scholarship, the [404 error](http://www.cs.umd.edu/~golbeck/LBSC690/SemanticWeb.html) is particularly pernicious as it breaks the chain of evidence that is provided within a scientific paper. If we refer to a web page that no longer exists, we have no way to verify or benefit from the information in that web page. Broken links can be avoided by making sure that re-direct pages are available for old URLs, but how do you know that the new URL actually points to the same document? 
Also, what happens when multiple copies of the same article exist in different places? Each of those places would have its own URL, making it look to a computer as if the objects they reference are different. + +PIDs actually address both of these problems. The identifier is separated from the location so that it buffers against changes in location. If a web document is moved to a new location, the new location is registered with the resolving system so that it now points to the same object in a new place. If the same object is present in multiple locations, e.g., a scientific article can be found at the publisher's web site, in PubMed Central and in platforms such as Mendeley, the DOI listed as part of the metadata is the same, so we know that it is the same article in different places. It is important to note that these identifier systems, as will be discussed below in the FAIR principles, are not magic. Rather they are a social contract between the publisher of research objects and users that they will maintain the integrity of the resolution services. #### Persistent Identifiers in Neuroimaging -The application of persistent identifiers in neuroimaging has been disdcussed in the context of data citation: "Data Citation in Neuroimaging: Proposed Best Practices for Data Identification and Attribution" (Honor, Haselgrove, Frazier, Kennedy 2016; https://doi.org/10.3389/fninf.2016.00034). This paper presents a prototype system to integrate Digital Object Identifiers (DOI) and a standardized metadata schema into a XNAT-based repository , allowing for identification of data at both the project and image level. This identification scheme allows any newly defined combination of images, aggregated from any number of projects, to be tagged with a new group-level DOI that automatically inherits the individual attributes and provenance information of its constituent parts. 
This allows for the consistent identification of data used as part of an analysis - a key aspect in ensuring analyses are reproducible. +The application of persistent identifiers in neuroimaging has been discussed in the context of data citation: "Data Citation in Neuroimaging: Proposed Best Practices for Data Identification and Attribution" (Honor, Haselgrove, Frazier, Kennedy 2016; https://doi.org/10.3389/fninf.2016.00034). This paper presents a prototype system to integrate Digital Object Identifiers (DOIs) and a standardized metadata schema into an XNAT-based repository, allowing for identification of data at both the project and image level. This identification scheme allows any newly defined combination of images, aggregated from any number of projects, to be tagged with a new group-level DOI that automatically inherits the individual attributes and provenance information of its constituent parts. This allows for the consistent identification of data used as part of an analysis -- a key aspect in ensuring analyses are reproducible. > ## Selected External Lesson Material > 1. An overview of persistent identifiers from the Australian National Data Service: [Unpacking Persistent Identifiers for Research](https://www.slideshare.net/AustralianNationalDataService/unpacking-persistent-identifiers-for-research)
You will see the [the article about the FAIR data principles](https://www.ncbi.nlm.nih.gov/pubmed/26978244) but also lots of other things identifed by this number, e.g., [an image of a soccer player](https://www.dreamstime.com/stock-images-sk-rapid-vs-austria-wien-image26978244), [a house for sale](http://www.rightmove.co.uk/property-for-sale/property-26978244.html). In other words, divorced from a particular database, the identifer 26978244 is meaningless. +> Consider the PubMed database. PubMed assigns a unique identifier, the PMID, to each article, e.g., `PMID:26978244`. If you type in the identifier, [26978244](https://www.ncbi.nlm.nih.gov/pubmed/?term=26978244), into the PubMed search box, you will get exactly one article, in this case on the FAIR data principles. But now type that number into Google search. You will see [the article about the FAIR data principles](https://www.ncbi.nlm.nih.gov/pubmed/26978244) but also lots of other things identified by this number, e.g., [an image of a soccer player](https://www.dreamstime.com/stock-images-sk-rapid-vs-austria-wien-image26978244), [a house for sale](http://www.rightmove.co.uk/property-for-sale/property-26978244.html). In other words, divorced from a particular database, the identifier `26978244` is meaningless. > -> In contrast, when you type in a globally unique identifer, e.g., a DOI, it should identify one and only one object on the web, in this case the article about the FAIR data principles. To see the difference, notice the list of search results when you type in the DOI for this article: 10.1038/sdata.2016.18. -> As we will discuss in a later session, it is possible to turn a locally unique ID into a globally unique ID by adding additional features, e.g., namespaces before the ID, e.g., pubmed/. +> In contrast, when you type in a globally unique identifier, e.g., a DOI, it should identify one and only one object on the web, in this case the article about the FAIR data principles. 
To see the difference, notice the list of search results when you type in the DOI for this article: `10.1038/sdata.2016.18`. +> As we will discuss in a later session, it is possible to turn a locally unique ID into a globally unique ID by adding additional features, e.g., a namespace such as `pubmed/` before the ID. > > > **What is the difference between searching for a PID and resolving a PID?** > -> There is a difference between identifying an object and accessing it. In the above exercise, when you type the DOI or PMID into a web search, it performs a search for that number just like it would for any search string. Note that the list of results includes URl's that will take you to the article itself, e.g., https://www.nature.com/articles/sdata201618, or that contain references to the article, e.g., http://www.ontoforce.com/why-data-should-be-fair/. A resolving service is more specific; it is designed to resolve to the object that is being referenced. To see how a resolver works, copy http://dx.doi.org/ into your browser followed by the DOI for the FAIR principles article 10.1038/sdata.2016.18. Note that it takes you to the web URL https://www.nature.com/articles/sdata201618 where the article is found. This exercise also illustrates the difference between a PID and URL. If the URL for the article changes, Nature is obliged to notify the DOI maintaniner of the new location. +> There is a difference between identifying an object and accessing it. In the above exercise, when you type the DOI or PMID into a web search, it performs a search for that number just like it would for any search string. Note that the list of results includes URLs that will take you to the article itself, e.g., https://www.nature.com/articles/sdata201618, or that contain references to the article, e.g., http://www.ontoforce.com/why-data-should-be-fair/. A resolving service is more specific; it is designed to resolve to the object that is being referenced. 
To see how a resolver works, copy http://dx.doi.org/ into your browser followed by the DOI for the FAIR principles article 10.1038/sdata.2016.18. Note that it takes you to the web URL https://www.nature.com/articles/sdata201618 where the article is found. This exercise also illustrates the difference between a PID and URL. If the URL for the article changes, Nature is obliged to notify the DOI maintainer of the new location. > {: .challenge} @@ -91,9 +93,9 @@ ### Short history of open & linked data technologies -The concepts introduced in the above section: research objects and PIDs, are important for understanding how we can adapt the web for sharing and integrating data. We are all familiar with web-accessible databases, that is, databases that are available for searching through the web. PubMed is one well known example. But just because a database is web accessible, doesn't mean that its data are designed for use on the web. In fact, data contained in dynamic databases are often considered to be part of the [hidden web](https://en.wikipedia.org/wiki/Deep_web), that is, it is not easily indexed in search engines like Google. The [Neuroscience Information Framework](http://neuinfo.org) was designed to search inside of these databases. But there are ways to design databases so that they are more web friendly. A major proponent of using the internet for sharing data and not simply web document is [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee). +The concepts introduced in the above section, research objects and PIDs, are important for understanding how we can adapt the web for sharing and integrating data. We are all familiar with web-accessible databases, that is, databases that are available for searching through the web. PubMed is one well-known example. But just because a database is web-accessible doesn't mean that its data are designed for use on the web. 
In fact, data contained in dynamic databases are often considered to be part of the [hidden web](https://en.wikipedia.org/wiki/Deep_web); that is, they are not easily indexed by search engines like Google. The [Neuroscience Information Framework](http://neuinfo.org) was designed to search inside of these databases. But there are ways to design databases so that they are more web-friendly. A major proponent of using the internet for sharing data and not simply web documents is [Tim Berners-Lee](https://en.wikipedia.org/wiki/Tim_Berners-Lee). -Linked data is a method of publishing structured data on the web so that it can be interlinked and queried using a formal semantic query language (https://en.wikipedia.org/wiki/Linked_data). Linked Open Data is linked data that is covered under on open source license so that it can be reused by third parties. Linked data builds upon standard Web technologies, e.g., HTTP, RDF (Resource Description Framework) and URIs (Uniform resource identifiers). But rather than using these technologies to share web pages, linked data extends them to share information in a way that can be read automatically by computers. The goal is to allow data from different sources to be connected and queried through a web browser. +Linked data is a method of publishing structured data on the web so that it can be interlinked and queried using a formal semantic query language (https://en.wikipedia.org/wiki/Linked_data). Linked Open Data is linked data that is covered under an open source license so that it can be reused by third parties. Linked data builds upon standard Web technologies, e.g., HTTP, RDF (Resource Description Framework) and URIs (Uniform Resource Identifiers). But rather than using these technologies to share web pages, linked data extends them to share information in a way that can be read automatically by computers. The goal is to allow data from different sources to be connected and queried through a web browser. 
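To make "structured data that machines can query" concrete before we get into the details, here is a toy Python sketch of linked-data-style triples and a pattern-matching query. Real linked data would use RDF serializations and SPARQL rather than tuples and list comprehensions, and all of the `ex:` terms below are made-up placeholders, not real vocabulary identifiers:

```python
# Minimal sketch of linked-data-style triples: (subject, predicate, object).
# Every term is a placeholder URI in a fake "ex:" namespace.
triples = [
    ("ex:SNCA", "ex:implicated_in", "ex:ParkinsonsDisease"),
    ("ex:image42", "ex:depicts_region", "ex:Cerebellum"),
    ("ex:image42", "ex:shows_expression_of", "ex:SNCA"),
    ("ex:image99", "ex:depicts_region", "ex:Cortex"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# A query in the spirit of "find all images of cerebellum showing genes
# implicated in Parkinson's disease", answered by joining on shared terms:
parkinsons_genes = {s for s, _, _ in
                    match(p="ex:implicated_in", o="ex:ParkinsonsDisease")}
cerebellum_images = {s for s, _, _ in
                     match(p="ex:depicts_region", o="ex:Cerebellum")}
hits = [img for img in cerebellum_images
        if any(match(s=img, p="ex:shows_expression_of", o=g)
               for g in parkinsons_genes)]
print(hits)  # -> ['ex:image42']
```

The point of the sketch is the join: because the three statements about `ex:image42` and `ex:SNCA` share identifiers, a query can hop across them, which is exactly what linked data enables across independent databases when they agree on URIs.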
Tim Berners-Lee coined the term [Linked Data in 2006](https://www.w3.org/DesignIssues/LinkedData.html) and laid out 4 basic principles: @@ -104,18 +106,18 @@ Tim Berners-Lee coined the term [Linked Data in 2006](https://www.w3.org/DesignI We've introduced a few new terms here, so let's give a few explanations. -[URI](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier): A URI is a PID, i.e, a string of characters used to identify an entity. URI's have a specific syntax (that's what makes them uniform), which we will not go into here (see [Wikipedia](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)). One of the biggest points of confusion in the world of PIDs is the difference between a URI and a URL. In fact, the terms URI and URL are often used interchangeably, but they are distinct. The easiest way to think about it is that a URL is a URI that happens to point to a resource over a network ([Wikipedia](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)). In other words, all URLs are URIs but not all URIs are URLs. In our above examples, we could turn a local identifier into a globally unique identifier by adding a namespace before the ID, pubmed/26978244 If we add a network access protocol to that identifier, https://www.ncbi.nlm.nih.gov/pubmed/26978244, we have a URI that is also a URL. Because we are dealing here with data on the web, most of our URI examples will, in fact, be URL's in accordance with rule 2 of Linked Data above. +[URI](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier): A URI is a PID, i.e., a string of characters used to identify an entity. URIs have a specific syntax (that's what makes them uniform), which we will not go into here (see [Wikipedia](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)). One of the biggest points of confusion in the world of PIDs is the difference between a URI and a URL. In fact, the terms URI and URL are often used interchangeably, but they are distinct. 
The easiest way to think about it is that a URL is a URI that happens to point to a resource over a network ([Wikipedia](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier)). In other words, all URLs are URIs but not all URIs are URLs. In our above examples, we could turn a local identifier into a globally unique identifier by adding a namespace before the ID, `pubmed/26978244`. If we add a network access protocol to that identifier, https://www.ncbi.nlm.nih.gov/pubmed/26978244, we have a URI that is also a URL. Because we are dealing here with data on the web, most of our URI examples will, in fact, be URLs in accordance with rule 2 of Linked Data above. -[RDF](https://www.w3.org/RDF/): According to the W3C,the standards body for the web, RDF is "a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. +[RDF](https://www.w3.org/RDF/): According to the W3C, the standards body for the web, RDF is "a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications. -This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations." 
+This linking structure forms a directed, labeled graph, where the edges represent the named links between two resources, represented by the graph nodes. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations." [SPARQL](https://en.wikipedia.org/wiki/SPARQL) is the query language that is used to query RDF databases. **Examples**: -An RDF triple comprises a subject, predicate and object, each identified by a URI. In a simple example, a statement such as: Kevin Bacon stars in the movie Footloose can be turned into RDF by coding the subject (Kevin Bacon), the predicate (is star of) and object (Movie:Footloose). A SPARQL query can then be formulated to Find all movies where "is star of" = Kevin Bacon. Note that this example is highly simplified: please refer to [Wikipedia](https://en.wikipedia.org/wiki/Resource_Description_Framework) and [W3C](https://www.w3.org/RDF/) examples for actual coding examples. +An RDF triple comprises a subject, predicate and object, each identified by a URI. In a simple example, a statement such as "Kevin Bacon stars in the movie Footloose" can be turned into RDF by coding the subject (Kevin Bacon), the predicate (is star of) and object (Movie:Footloose). A SPARQL query can then be formulated to `Find all movies where "is star of" = Kevin Bacon`. Note that this example is highly simplified: please refer to [Wikipedia](https://en.wikipedia.org/wiki/Resource_Description_Framework) and [W3C](https://www.w3.org/RDF/) examples for actual coding examples. Although the majority of biomedical data sources have been slow to adopt the [Open Linked Data protocol](http://lod-cloud.net/), a perusal of the Linked Open Data Cloud indicates that life sciences comprises a substantial part of the web of data. @@ -129,7 +131,7 @@ What these benefits mean is that we can use the general architecture of the web With Linked Open Data (LOD), this process becomes much easier. 
If each of these databases had exposed their data as LOD, it would be possible to construct the query: find all images of cerebellum showing genes known to be implicated in Parkinson's disease, so that it returned the same data in far fewer steps. Why? Because the links between these different data sources would be explicit, so the "mash up" of results from different data sources is facilitated. -Of course, this type of integration is only possible if developers of databases for the life sciences agree to the same set of URI's for the things they refer to, in this case, concepts like "cerebellum" and "disease" and the relationships that connect them. These concepts are typically assigned URI's when they are represented in data structures called ontologies, essentially, formal models of knowledge within a domain. If each database develops its own identifier for Parkinson's Disease, then we are in the same boat as today: we'd have to go to each database, find their data dictionaries and extract the particular identifier. A single set of identifiers for entities in the life sciences is one of the goals of the [OBO Foundry: Open Biological Ontologies](http://www.obofoundry.org/). Where it has been achieved, e.g., the Gene Ontology in genomics, cross query of databases is highly facilitated. But for other biological domains, e.g., neuroscience, consistent use of identifiers for concepts and relationships across databases has been less successful. For this reason, life sciences has also had to maintain a lot of cross-mapping between concepts in ontologies and other controlled vocabularies. +Of course, this type of integration is only possible if developers of databases for the life sciences agree to the same set of URIs for the things they refer to, in this case, concepts like "cerebellum" and "disease" and the relationships that connect them. 
These concepts are typically assigned URIs when they are represented in data structures called ontologies, essentially, formal models of knowledge within a domain. If each database develops its own identifier for Parkinson's Disease, then we are in the same boat as today: we'd have to go to each database, find their data dictionaries and extract the particular identifier. A single set of identifiers for entities in the life sciences is one of the goals of the [OBO Foundry: Open Biological Ontologies](http://www.obofoundry.org/). Where it has been achieved, e.g., the Gene Ontology in genomics, cross-querying of databases is highly facilitated. But for other biological domains, e.g., neuroscience, consistent use of identifiers for concepts and relationships across databases has been less successful. For this reason, the life sciences have also had to maintain a lot of cross-mapping between concepts in ontologies and other controlled vocabularies. > ## Selected External Lesson Material > 1. Open Data Support's (a project of DG CONNECT of the European Commission) [Linked Open Data Principles, Technologies and Examples](https://www.slideshare.net/OpenDataSupport/linked-open-data-principles-technologies-and-examples) @@ -163,13 +165,13 @@ R1.1 |(meta)data are released with a clear and accessible data usage license R1.2 |(meta)data are associated with detailed provenance R1.3 |(meta)data meet domain-relevant community standards -A detailed explanation of each of these is included in the [Wilkinson et al.,2016](https://www.nature.com/articles/sdata201618) article, and the Dutch Techcenter for Life Sciences has a set of excellent [tutorials](https://www.dtls.nl/fair-data/fair-principles-explained/), so we won't go into too much detail here. 
+A detailed explanation of each of these is included in the [Wilkinson et al., 2016](https://www.nature.com/articles/sdata201618) article, and the Dutch Techcentre for Life Sciences has a set of excellent [tutorials](https://www.dtls.nl/fair-data/fair-principles-explained/), so we won't go into too much detail here.

## Findable

-Findable comprises two major attributes: the data set be identifed by a PID, so that it can be unambiguously located by a machine and the data be described by sufficient metadata so that it an be discovered by a human. F3 connects the identifier to this rich metadata so that a human can verify that the dataset is corretly resolved. Finally, the data must be registered or indexed so that it can be found. The more machine-readable metadata that a data resource provides for its contents, the more likely a web search through a search engine like Google will find them.
+Findable comprises two major attributes: that the data set be identified by a PID, so that it can be unambiguously located by a machine, and that the data be described by sufficient metadata so that it can be discovered by a human. F3 connects the identifier to this rich metadata so that a human can verify that the dataset is correctly resolved. Finally, the data must be registered or indexed so that it can be found. The more machine-readable metadata that a data resource provides for its contents, the more likely a web search through a search engine like Google will find them.

## Accessible

-Accessible provides requirements for ensuring that third parties can access the data by specifying that the PID is resolvable and that the access protocol is standard, e.g., http, and open. The access protocol should allow for authentiation and authorization to protect sensitive data. Finally, a critical requirement to ensure the integrity of links across the web is that the metadata persist, even if the data they describe have been removed for some reason.
In this way, a basic description of the data can still be found at the same PID, much like descriptive metadata of out of print books can still be found. Ideally, the page describing the metadata includes a statement about when and why the data were removed [(Starr et al., 2015)](https://peerj.com/articles/cs-1/). +Accessible provides requirements for ensuring that third parties can access the data by specifying that the PID is resolvable and that the access protocol is standard, e.g., http, and open. The access protocol should allow for authentication and authorization to protect sensitive data. Finally, a critical requirement to ensure the integrity of links across the web is that the metadata persist, even if the data they describe have been removed for some reason. In this way, a basic description of the data can still be found at the same PID, much like descriptive metadata of out of print books can still be found. Ideally, the page describing the metadata includes a statement about when and why the data were removed [(Starr et al., 2015)](https://peerj.com/articles/cs-1/). ## Interoperable Interoperable outlines requirements for ensuring that data across databases can be combined with minimal effort. Note that this principle specifically calls for the use of formal vocabularies widely used in a community that are themselves FAIR, i.e., have a persistent identifier and access protocol. @@ -177,7 +179,7 @@ Interoperable outlines requirements for ensuring that data across databases can ## Reusable Reusable also emphasizes rich metadata, but in this case so that the dataset can be sufficiently understood by a human so that it can be reused appropriately. Many communities have published minimal information models that provide the minimum set of attributes required for proper interpretation of a given data set, e.g., [MIAME](http://fged.org/projects/miame/). 
Reusable also specifies that a data license, preferably in a machine-readable form, accompanies the data specifying how it may be re-used, e.g., a CC-BY license specifies that the data may be re-used without restriction, but attribution of the source must accompany the reuse. And, of course, re-usable emphasizes the use of community-accepted standards for metadata and for data themselves. A major goal of ReproNim is to help define these standards for neuroimaging.

-It is important to reiterate that, unlike LOD described in the previous section, the FAIR principles are not a protocol or a standard. Rather, they are viewed as characteristics modern databases, and, indeed, any researh object, should have if they are to be useful. Along these lines, some of the original authors of the FAIR paper published a follow up paper [Mons et al., 2017](http://content.iospress.com/articles/information-services-and-use/isu824), outlining what the FAIR principles are not. FAIR is not:
+It is important to reiterate that, unlike LOD described in the previous section, the FAIR principles are not a protocol or a standard. Rather, they are viewed as characteristics that modern databases, and, indeed, any research object, should have if they are to be useful. Along these lines, some of the original authors of the FAIR paper published a follow-up paper [Mons et al., 2017](http://content.iospress.com/articles/information-services-and-use/isu824), outlining what the FAIR principles are not. FAIR is not:
* A standard
* Equivalent to RDF, Linked Data, or the Semantic Web
* Equivalent to open
@@ -213,11 +215,11 @@ The Linked Data protocol can be fully FAIR, if implemented properly, but as the
> 3.
"The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments" (Gorgolewski et al., 2016) [doi:10.1038/sdata.2016.44](http://n2t.net/doi:10.1038/sdata.2016.44)
{: .callout}

-> ## Exercise: Create a BIDS Compliant Dataset(click on the arrow to the right to open)
+> ## Exercise: Create a BIDS Compliant Dataset (click on the arrow to the right to open)
>
> In this exercise you will work through the creation of a BIDS dataset using sample data originally obtained from [OpenFMRI](http://openfmri.org). Please follow these steps:
>
-> 1. Download the [sample base dataset]({{site.root}}/data/ds000030_single_subj_base.zip) This sample data has been modified from the original OpenFMRI distribution for use in this exercise.
+> 1. Download the [sample base dataset]({{site.root}}/data/ds000030_single_subj_base.zip). This sample data has been modified from the original OpenFMRI distribution for use in this exercise.
> 2. Using the BIDS material provided (above) and information from the [data publication in Nature Scientific Data](https://www.nature.com/articles/sdata2016110), re-work the base dataset to conform to BIDS.
> 3. As you work through this exercise, you can use the [BIDS validator](http://incf.github.io/bids-validator/) to check your progress (must use Google Chrome or Firefox).
>
diff --git a/_episodes/03-Ethics.md b/_episodes/03-Ethics.md
index 895986d..58c4b76 100644
--- a/_episodes/03-Ethics.md
+++ b/_episodes/03-Ethics.md
@@ -19,7 +19,7 @@ This lesson links to externally available information to introduce the student
> ## Selected External Lesson Material
> ### Online courses:
>
-> - [edX: Data Science Ethics](https://www.edx.org/course/data-science-ethics-michiganx-ds101x-1#!)
+> - [edX: Data Science Ethics](https://www.edx.org/course/data-science-ethics) > > **Abstract**: As patients, we care about the privacy of our medical record; but as patients, we also wish to benefit from the analysis of data in medical records. As citizens, we want a fair trial before being punished for a crime; but as citizens, we want to stop terrorists before they attack us. As decision-makers, we value the advice we get from data-driven algorithms; but as decision-makers, we also worry about unintended bias. Many data scientists learn the tools of the trade and get down to work right away, without appreciating the possible consequences of their work. This course focused on ethics specifically related to data science will provide you with the framework to analyze these concerns. This framework is based on ethics, which are shared values that help differentiate right from wrong. Ethics are not law, but they are usually the basis for laws. Everyone, including data scientists, will benefit from this course. No previous knowledge is needed. > @@ -29,7 +29,7 @@ This lesson links to externally available information to introduce the student > > Floridi, L., & Taddeo, M. (2016). What is data ethics? Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences, 374(2083), 20160360. http://doi.org/10.1098/rsta.2016.0360 > -> **Abstract**: This theme issue has the founding ambition of landscaping data ethics as a new branch of ethics that studies and evaluates moral problems related to data (including generation, recording, curation, processing, dissemination, sharing and use), algorithms (including artificial intelligence, artificial agents, machine learning and robots) and corresponding practices (including responsible innovation, programming, hacking and professional codes), in order to formulate and support morally good solutions (e.g. right conducts or right values). 
Data ethics builds on the foundation provided by computer and information ethics but, at the same time, it refines the approach endorsed so far in this research field, by shifting the level of abstraction of ethical enquiries, from being information-centric to being data-centric. This shift brings into focus the different moral dimensions of all kinds of data, even data that never translate directly into information but can be used to support actions or generate behaviours, for example. It highlights the need for ethical analyses to concentrate on the content and nature of computational operations-the interactions among hardware, software and data-rather than on the variety of digital technologies that enable them. And it emphasizes the complexity of the ethical challenges posed by data science. Because of such complexity, data ethics should be developed from the start as a macroethics, that is, as an overall framework that avoids narrow, ad hoc approaches and addresses the ethical impact and implications of data science and its applications within a consistent, holistic and inclusive framework. Only as a macroethics will data ethics provide solutions that can maximize the value of data science for our societies, for all of us and for our environments.This article is part of the themed issue 'The ethical impact of data science'. +> **Abstract**: This theme issue has the founding ambition of landscaping data ethics as a new branch of ethics that studies and evaluates moral problems related to data (including generation, recording, curation, processing, dissemination, sharing and use), algorithms (including artificial intelligence, artificial agents, machine learning and robots) and corresponding practices (including responsible innovation, programming, hacking and professional codes), in order to formulate and support morally good solutions (e.g. right conducts or right values). 
Data ethics builds on the foundation provided by computer and information ethics but, at the same time, it refines the approach endorsed so far in this research field, by shifting the level of abstraction of ethical enquiries, from being information-centric to being data-centric. This shift brings into focus the different moral dimensions of all kinds of data, even data that never translate directly into information but can be used to support actions or generate behaviours, for example. It highlights the need for ethical analyses to concentrate on the content and nature of computational operations-the interactions among hardware, software and data-rather than on the variety of digital technologies that enable them. And it emphasizes the complexity of the ethical challenges posed by data science. Because of such complexity, data ethics should be developed from the start as a macroethics, that is, as an overall framework that avoids narrow, ad hoc approaches and addresses the ethical impact and implications of data science and its applications within a consistent, holistic and inclusive framework. Only as a macroethics will data ethics provide solutions that can maximize the value of data science for our societies, for all of us and for our environments. This article is part of the themed issue 'The ethical impact of data science'. > > ### Relevant Organizations: > - [Council for Big Data, Ethics, and Society](http://bdes.datasociety.net) diff --git a/_episodes/04-Data-Publishing.md b/_episodes/04-Data-Publishing.md index 9548d0f..20663ec 100644 --- a/_episodes/04-Data-Publishing.md +++ b/_episodes/04-Data-Publishing.md @@ -10,7 +10,7 @@ objectives: - "To learn about resources that can be used to assist in the process of data stewardship and publication" keypoints: - There exist well defined recommendations for publishing data. -- There are prctical guidelines and tools to assist in the publication of data. 
+- There are practical guidelines and tools to assist in the publication of data. --- @@ -20,15 +20,15 @@ This lesson provides an overview of best practices in data publishing. Note tha #### History of open mandates and guidelines -Prior to computers and the internet, pubishing data routinely was really not possible beyond what we could publish in articles or books. Consequently, a culture grew up around scholarly publishing where data were considered disposable after some specified regulatory period. Rather it was the hypotheses proposed, the experimental design, the analysis and the insights gained from collected data that were valued and preserved through our system of journals and books. +Prior to computers and the internet, publishing data routinely was really not possible beyond what we could publish in articles or books. Consequently, a culture grew up around scholarly publishing where data were considered disposable after some specified regulatory period. Rather it was the hypotheses proposed, the experimental design, the analysis and the insights gained from collected data that were valued and preserved through our system of journals and books. Almost all major funding agencies in the US and abroad are developing policies around open sharing of research data and other research products. These policies have been put into place to promote the integrity of scientific research through greater transparency in light of recent concerns about scientific reproducibility across several fields. These policies are also driven by the promises of new insights to be gained from increased human- and machine-based access to research outputs. -Within Neuroimaging there exist a set of recommendations for best practices in data analysis and sharing. To o advance open science in neuroimaging the Organization for Human Brain Mapping's Committee on Best Practice in Data Analysis and Sharing (COBIDAS) has released a number of recommendations. 
(doi: https://doi.org/10.1101/054262). These guidelines for various aspects of a study are provided via tabular listings of items that will help plan, execute, report and finally share research in support of reproducible neuroimaging.
+Within neuroimaging, there exists a set of recommendations for best practices in data analysis and sharing. To advance open science in neuroimaging, the Organization for Human Brain Mapping's Committee on Best Practice in Data Analysis and Sharing (COBIDAS) has released a number of recommendations (doi: https://doi.org/10.1101/054262). These guidelines for various aspects of a study are provided via tabular listings of items that will help plan, execute, report and finally share research in support of reproducible neuroimaging.

#### Data sharing versus data publishing

-Towards these ends, we are seeing more and more calls for domains not just to establish a basic culture of human-centric sharing, i.e., “upon request”, but to move towards a more e-Science vision, where researchers conduct their research digitally and where data standards and programmatic interfaces make it easy for machines to access and ingest large amounts of data. To achieve this goal requires that the products of research, including the data, be easily coupled to computational resources that can operate on them at scale, and that the provenance of these research products can be tracked as they transition between uses. It requires the ability to find, access and reuse digital artifacts using computational interfaces (API’s) with minimal restrictions.
+Towards these ends, we are seeing more and more calls for domains not just to establish a basic culture of human-centric sharing, i.e., “upon request”, but to move towards a more e-Science vision, where researchers conduct their research digitally and where data standards and programmatic interfaces make it easy for machines to access and ingest large amounts of data.
To achieve this goal requires that the products of research, including the data, be easily coupled to computational resources that can operate on them at scale, and that the provenance of these research products can be tracked as they transition between uses. It requires the ability to find, access and reuse digital artifacts using computational interfaces (APIs) with minimal restrictions. In other words, to take advantage of data, time, attention and resources must be devoted to publishing data and code, just as we spend time and energy publishing enduring narrative works to be understood and used by third parties, now and in the future. @@ -42,15 +42,15 @@ We are still in the early days of publishing data, and the FAIR principles have Funding agencies are starting to require that researchers have a data management plan in their grant applications. Towards that end, many research libraries are now creating [practical guidance and resources for effective data management](https://library.ucsd.edu/research-and-collections/data-curation/data-management/best-practices.html) and for creating acceptable data management plans for funding agencies. -**2) Ensure that your data is deposited into a reputable repository when publishing data:** The simplest and most effective mechanism for making data FAR is to deposit data in a qualified data repository (Roche et al., 2014; Gewin, 2016; White et al., 2013). Hosting data on personal websites or even as supplementary material for a published article is not ideal. The first instance is prone to link rot while both generally mean that data will not be further curated. Known impediments to re-usability, e.g, proprietary formats, may not be caught. Putting data in the cloud is also not a panacea, if the FAIR principles are not followed. 
+**2) Ensure that your data is deposited into a reputable repository when publishing data:** The simplest and most effective mechanism for making data FAIR is to deposit data in a qualified data repository (Roche et al., 2014; Gewin, 2016; White et al., 2013). Hosting data on personal websites or even as supplementary material for a published article is not ideal. The first instance is prone to link rot while both generally mean that data will not be further curated. Known impediments to re-usability, e.g., proprietary formats, may not be caught. Putting data in the cloud is also not a panacea if the FAIR principles are not followed.

-Many data repositories have been created for scientific data (see Exercise 1: Finding a data repository for your data). Qualified data repositories ensure long term maintenance of the data, generally enforce enforce community standards, and handle things like obtaining an identifier, maintaining a landing page with appropriate descriptive metadata, and providing programmatic access. Many institutions are beginning to provide data management services
+Many data repositories have been created for scientific data (see Exercise 1: Finding a data repository for your data). Qualified data repositories ensure long-term maintenance of the data, generally enforce community standards, and handle things like obtaining an identifier, maintaining a landing page with appropriate descriptive metadata, and providing programmatic access. Many institutions are beginning to provide data management services.

-Types of repositories: A variety of types of repositories are available, from specialized repositories developed around a specific domain, e.g., NITRC-IR, openfMRI, to general repositories that will take all domains and most data types, e.g.,Figshare, Dryad, OSF, DataVerse, Zenodo (Table 2). Many research institutions are maintaining data repositories for their researchers as well (e.g., University of California DASH).
+Types of repositories: A variety of types of repositories are available, from specialized repositories developed around a specific domain, e.g., NITRC-IR, openfMRI, to general repositories that will take all domains and most data types, e.g., Figshare, Dryad, OSF, DataVerse, Zenodo (Table 2). Many research institutions are maintaining data repositories for their researchers as well (e.g., University of California DASH). The advantage of more specific repositories is that they can invest in much more specialized metadata, data models, formats and tools, compared to the more generalist repositories. Because the generalist repositories contain mixtures of different data types across many domains, they have a difficult time harmonizing across different data sets or developing data representations that allow programmatic access to the full data without the need for significant human intervention. -**3) Plan ahead when publishing your data:** The process and costs associated with publishing research articles are well known to researchers, and they ensure that they include adequate resources within their research proposals to ensure that they can prepare and publish articles. Data are a lot more varied in size and complexity than research articles, and thought must be given to how they are going to be published before they are collected. For data of a reasonable size, most repositories are still hosting the data free of charge. Some repositories, e.g., NDAR, require a fee to deposit data, but they also provide cost estimates that can be included within grant proposals.In addition to costs, just as with publishing articles, you need to ensure that all parties who have contributed to the data are credited and agree to publishing the data (see below: Data citation). 
+**3) Plan ahead when publishing your data:** The process and costs associated with publishing research articles are well known to researchers, who make sure to include adequate resources within their research proposals to prepare and publish articles. Data are a lot more varied in size and complexity than research articles, and thought must be given to how they are going to be published before they are collected. For data of a reasonable size, most repositories are still hosting the data free of charge. Some repositories, e.g., NDAR, require a fee to deposit data, but they also provide cost estimates that can be included within grant proposals. In addition to costs, just as with publishing articles, you need to ensure that all parties who have contributed to the data are credited and agree to publishing the data (see below: Data citation).

Here are some things to consider:

@@ -80,11 +80,11 @@ DataLad builds on top of git-annex and extends it with an intuitive command-line

Publishing data should be no different than publishing an article: the creators should be credited and the authors and data formally cited when it is reused. This view is expressed in the [Joint Declaration of Data Citation principles](https://www.force11.org/group/joint-declaration-data-citation-principles-final), issued and endorsed by organizations around the globe.

-In recognition of the growing importance of publishing data, publishers are providing specilized journals, e.g., Scientific Data, published by Springer-Nature, or a specialized article type called a data paper, specifically for publishing well curated and described data sets. These papers are published using traditional publishing metadata and article structure, but are not expected to include any analyses or conclusions; rather the paper is devoted to providing rich metadata and a rigorous description of experimental and data collection mechanisms.
Scientific Data also implements a standard format for structuring metadata, to ensure that the data is FAIR. These journals usually require that data are deposited in an approved community repository and they provide lists of [recommended repositories](https://www.nature.com/sdata/policies/repositories). +In recognition of the growing importance of publishing data, publishers are providing specialized journals, e.g., Scientific Data, published by Springer-Nature, or a specialized article type called a data paper, specifically for publishing well curated and described data sets. These papers are published using traditional publishing metadata and article structure, but are not expected to include any analyses or conclusions; rather the paper is devoted to providing rich metadata and a rigorous description of experimental and data collection mechanisms. Scientific Data also implements a standard format for structuring metadata, to ensure that the data is FAIR. These journals usually require that data are deposited in an approved community repository and they provide lists of [recommended repositories](https://www.nature.com/sdata/policies/repositories). Publishing a data paper is one way to ensure that data can be cited. But many journals are currently working on implementing more formal systems of data citation, led by community efforts to push for equal status for data in the publishing pipeline (e.g., Joint Declaration of Data Citation Principles; Starr et al., 2015)). Citations to data sets would look like citations to articles, with a standard set of metadata, and would appear in the reference list. With the ability to list published datasets on a scientific CV, cite them within published articles, and search for them via tools like DataMed, data is finally taking its place as a primary product of scholarship. 
-A formal citation system assigns credit for the re-use of data, but also establishes links to the evidence on which claims are based while providing the means for tracking and therefore measuring impact of data re-use. In our current publishing system, authors adopt a variety of styles for referencing data when they are re-used, from accession numbers, to URL’s, to citing a paper associated with the data set. Some journals set aside a special section on data use which contain lists of data sets and other resources. Unlike references to articles which have a standard format and tools and services to insert them and analyze citations, uncovering and tracking re-use of data typically involves using manual identification, text mining and natural language processing approaches, requiring full text access and considerable time and effort (Read et al., 2015).
+A formal citation system assigns credit for the re-use of data, but also establishes links to the evidence on which claims are based while providing the means for tracking and therefore measuring impact of data re-use. In our current publishing system, authors adopt a variety of styles for referencing data when they are re-used, from accession numbers, to URLs, to citing a paper associated with the data set. Some journals set aside a special section on data use which contains lists of data sets and other resources. Unlike references to articles which have a standard format and tools and services to insert them and analyze citations, uncovering and tracking re-use of data typically involves using manual identification, text mining and natural language processing approaches, requiring full text access and considerable time and effort (Read et al., 2015).

[Honor et al.
(2016)](http://journal.frontiersin.org/article/10.3389/fninf.2016.00034/full) have published a recommendation for citing neuroimaging data sets: @@ -98,6 +98,5 @@ A formal citation system assigns credit for the re-use of data, but also establi > - [Honor, L. B., Haselgrove, C., Frazier, J. A., & Kennedy, D. N. (2016). Data Citation in Neuroimaging: Proposed Best Practices for Data Identification and Attribution. Frontiers in Neuroinformatics, 10, 34. doi:10.3389/fninf.2016.00034](http://journal.frontiersin.org/article/10.3389/fninf.2016.00034/full) > > - [White EP, Baldridge E, Brym ZT, Locey KJ, McGlinn DJ, Supp SR. (2013) Nine simple ways to make it easier to (re)use your data. PeerJ PrePrints 1:e7v2 https://doi.org/10.7287/peerj.preprints.7v2](https://doi.org/10.7287/peerj.preprints.7v2) -> - [Description](http://URL) > {: .callout} diff --git a/_episodes/05-Your-Laboratory-Datastore.md b/_episodes/05-Your-Laboratory-Datastore.md index 3e58aad..fe5b199 100644 --- a/_episodes/05-Your-Laboratory-Datastore.md +++ b/_episodes/05-Your-Laboratory-Datastore.md @@ -9,7 +9,7 @@ objectives: - "Learn about different databasing options if a custom solution is desired" keypoints: - There are a number of tools, developed by the research community and also by companies, to assist in stewardship of laboratory data. -- There are a number of options for developiong your own custom database solution. +- There are a number of options for developing your own custom database solution. --- @@ -47,7 +47,7 @@ When managing data in your own laboratory, there are a number of options availab > > **Overview**: Flywheel is a data management platform designed to ease the IT burden of the researcher by creating a collaborative environment for reproducible, computational science. Data can be uploaded directly from devices or can be manually uploaded into Flywheel. Once loaded, users can organize and search through data. 
> 
-> **Documentation**: Documentation for FLyWheel can be found on their documentation page (https://docs.flywheel.io)
+> **Documentation**: Documentation for Flywheel can be found on their documentation page (https://docs.flywheel.io)
> 
{: .callout}

@@ -86,7 +86,7 @@ While it is recommended that one try and utilize (and potentially contribute to)
> 
> - [MongoDB University](https://university.mongodb.com)
> 
-> **Abstract**: MongoDB provides a colletion of free lessons covering all aspects of the MongoDB platform.
+> **Abstract**: MongoDB provides a collection of free lessons covering all aspects of the MongoDB platform.
> 
> ##### Neo4J:
> 
diff --git a/_episodes/06-Semantic-Data-Representations.md b/_episodes/06-Semantic-Data-Representations.md
index 7546c63..63217a0 100644
--- a/_episodes/06-Semantic-Data-Representations.md
+++ b/_episodes/06-Semantic-Data-Representations.md
@@ -12,11 +12,11 @@ keypoints:

### Introduction

-We provided a very brief overview of some of the concepts involved in the semantic web and linked data in [Episode 01: Web of Data](https://github.com/ReproNim/module-FAIR-data/blob/gh-pages/_episodes/01-Web-of-Data.md). Please review this module if you need an introduction to concepts such as persistent identifiers, URI's, RDF and triples.
+We provided a very brief overview of some of the concepts involved in the semantic web and linked data in [Episode 01: Web of Data](https://github.com/ReproNim/module-FAIR-data/blob/gh-pages/_episodes/01-Web-of-Data.md). Please review this module if you need an introduction to concepts such as persistent identifiers, URIs, RDF and triples.

Here we will focus specifically on a set of technologies collectively called the [semantic web](http://www.linkeddatatools.com/semantic-web-basics). The concept of the [semantic web](https://en.wikipedia.org/wiki/Semantic_Web) was introduced by Tim Berners-Lee to describe a web of data accessible by machines through the web.
The semantic web is often called web 3.0, "a way of linking data between systems or entities that allows for rich, self-describing interrelations of data available across the globe on the web" [Linked Data Tools](http://www.linkeddatatools.com/semantic-web-basics).

-What does that mean exactly? A more detailed and technical explanation is provided through the lessions below. But in plain terms, the semantic web provides a protocol for publishing data on the web that allows machines to access them based on a set of relationships. If fully realized, it allows bits of information about the same entity contained in different different databases, developed and maintained by different people, to be easily and automatically assembled without having to go through any manual mapping.
+What does that mean exactly? A more detailed and technical explanation is provided through the lessons below. But in plain terms, the semantic web provides a protocol for publishing data on the web that allows machines to access them based on a set of relationships. If fully realized, it allows bits of information about the same entity contained in different databases, developed and maintained by different people, to be easily and automatically assembled without having to go through any manual mapping.

The basis of this ability is the encoding of data as a graph where each data point is connected to other data points through specific relationships. This graph structure contrasts with the usual tabular form that we use to store our data, e.g., in a relational database or spreadsheet. Consider the following table for an fMRI study:

@@ -56,7 +56,7 @@

Similarly, database 2 might have encoded statements like:

In this data structure, it would be straightforward to compose a query to return any brain part that expresses a gene that encodes a glutamate receptor. Again, each of these concepts and relationships has its own URI (Uniform Resource Identifier) as a unique and persistent identifier.
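This kind of graph query can be sketched in plain Python, with tuples standing in for RDF triples. Every URI and predicate name below is an invented placeholder, and a real system would issue SPARQL queries against an RDF store rather than use set comprehensions:

```python
# Two toy "databases" of (subject, predicate, object) triples.
# All URIs and predicates are invented placeholders for illustration.
db1 = {  # activation database
    ("uri:hippocampus", "uri:activated_by", "uri:working_memory_task"),
    ("uri:cerebellum", "uri:activated_by", "uri:motor_task"),
}
db2 = {  # gene-expression database
    ("uri:hippocampus", "uri:expresses", "uri:GRIN1"),
    ("uri:GRIN1", "uri:encodes", "uri:glutamate_receptor"),
    ("uri:cerebellum", "uri:expresses", "uri:CALB1"),
}

def objects_of(triples, subject, predicate):
    """All objects reachable from `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Brain regions activated by working-memory tasks, per db1.
regions = {s for s, p, o in db1
           if p == "uri:activated_by" and o == "uri:working_memory_task"}

# "Mash up": keep regions that also express a glutamate-receptor gene in db2.
# The join works only because both databases use the same URI for hippocampus.
hits = {r for r in regions
        for gene in objects_of(db2, r, "uri:expresses")
        if "uri:glutamate_receptor" in objects_of(db2, gene, "uri:encodes")}

print(hits)  # → {'uri:hippocampus'}
```

If each database had minted its own identifier for the hippocampus, the join would return an empty set; the shared URI is doing all of the integration work.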
-Now imagine the case where both databases are building their databases from the same ontology, that is, a set of terms and the relationships among them, e.g., a set of brain regions, genes and tasks and a set of relations that connect them. Both databases therefore use the same URI's to identify elements in their databases. The use of the same set of URI's, even in separate databases, allows someone to "mash up" the results between the two databases, by connecting results through a set of common URI's. So one would be able to compose a query across the two databases for the set of brain regions that both express glutamate receptors and are activated by working memory tasks, without the need for human intervention. Because "hippocampus" has the same URI in both databases, there is no ambiguity in joining results from one database to another.
+Now imagine the case where both databases are built from the same ontology, that is, a set of terms and the relationships among them, e.g., a set of brain regions, genes and tasks and a set of relations that connect them. Both databases therefore use the same URIs to identify elements in their databases. The use of the same set of URIs, even in separate databases, allows someone to "mash up" the results between the two databases, by connecting results through a set of common URIs. So one would be able to compose a query across the two databases for the set of brain regions that both express glutamate receptors and are activated by working memory tasks, without the need for human intervention. Because "hippocampus" has the same URI in both databases, there is no ambiguity in joining results from one database to another.

> ## Selected External Material
> An excellent set of tutorials is available on the semantic web and associated technologies through [Linkeddatatools.com](http://www.linkeddatatools.com/index.php) and so we won't replicate them here. The tutorials cover: