Idra Scraping Guide

Idra supports the federation of generic custom Open Data Catalogues (the WEB type) that publish their Open Datasets as web pages but do not expose any API to retrieve them programmatically. The website must provide either a page listing the links to the dataset pages, or a URL for each dataset (e.g. with a Path or Query parameter in the URL that identifies the dataset).

In this case Idra can navigate the site, scrape all the dataset pages and map the metadata they contain to the DCAT-AP format. To do that, it needs to know how to navigate the site and how to extract the dataset metadata from the HTML elements of each page. This information is defined in the so-called Sitemap, which can be created directly on the browser page through the provided Web Scraper plugin for Chrome (see the Installing the Scraper Plugin section for installation instructions).

Installing the Scraper Plugin

The Sitemap can be created through the provided browser plugin, which is a forked version of the WebScraper.io plugin for Google Chrome.

In order to install the plugin, perform the following steps in the Chrome browser:

1) Clone the repository: git clone https://github.com/OPSILab/web-scraper-chrome-extension.git
2) In Chrome, go to chrome://extensions/ and check the box for Developer mode in the top right.
3) Click the Load unpacked extension button and select the previously cloned folder to install the plugin.
4) Press F12 to open the Developer Tools, then click on the Web Scraper tab.

Plugin Documentation

A video tutorial of the whole federation process can be found here.
Since the main functionalities, such as sitemap creation, are essentially the same as in the original plugin, you can also refer to its documentation and video.

Federating a Web Open Data Catalogue

The federation of such a Web Open Data Catalogue consists of the following steps:
1) Create the Sitemap through the browser plugin, defining the selectors that extract the metadata fields to be mapped to the DCAT-AP ones.
2) Insert the Sitemap metadata, which define how to navigate the site.
3) Export the created Sitemap as JSON.
4) Import the JSON Sitemap in the "Add Catalogue" form.

1. Creating the Sitemap

The Sitemap consists of a set of Selectors. Each Selector holds the information (a CSS selector) needed to extract, from the HTML elements of the dataset page, the data to be mapped to a DCAT-AP field. Please see the DCAT-AP v1.1 specification.

Creating Selectors for the DCAT-AP fields

The Sitemap MUST contain at least the mandatory dataset Selectors, reported in the table below. Each selector will have a specific name as reported in the table and will represent a single dataset field to be extracted.

Selector name	Mandatory	Cardinality
title	Yes	1..n
description	Yes	1..1
publisher_name	No	0..1
publisher_mbox	No	0..1
publisher_homepage	No	0..1
publisher_type	No	0..1
publisher_identifier	No	0..1
publisher_uri	No	0..1
contact_fn	No	0..1
contact_email	No	0..1
contact_telephone	No	0..1
contact_url	No	0..1
keywords	No	0..n
accessRights	No	0..1
conformsTo_identifier	No	0..1
conformsTo_title	No	0..1
conformsTo_description	No	0..1
conformsTo_referenceDocumentation	No	0..1
documentation	No	0..1
frequency	No	0..1
hasVersion	No	0..1
isVersionOf	No	0..1
landingPage	No	0..1
language	No	0..1
provenance	No	0..1
releaseDate	No	0..1
updateDate	No	0..1
source	No	0..1
sample	No	0..1
spatialCoverage_geographicalIdentifier	No	0..1
spatialCoverage_geographicalName	No	0..1
spatialCoverage_geometry	No	0..1
temporalCoverage_startDate	No	0..1
temporalCoverage_endDate	No	0..1
type	No	0..1
version	No	0..1
versionNotes	No	0..1
rightsHolder_name	No	0..1
rightsHolder_mbox	No	0..1
rightsHolder_homepage	No	0..1
rightsHolder_type	No	0..1
rightsHolder_uri	No	0..1
rightsHolder_identifier	No	0..1
creator_name	No	0..1
creator_mbox	No	0..1
creator_homepage	No	0..1
creator_type	No	0..1
creator_uri	No	0..1
creator_identifier	No	0..1
subject	No	0..1
theme	No	0..n
distribution_title	No	1..1
distribution_license_uri	No	0..1
distribution_downloadURL	Yes	1..1
distribution_license_versionInfo	No	0..1
distribution_license_name	No	0..1

Note: The distribution_downloadURL selector is mandatory only if you want to create at least one Distribution, which is optional.
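The rule in the table and note above can be checked mechanically before importing a Sitemap. The following is a minimal sketch, not part of Idra: it assumes the exported JSON keeps its selectors in a "selectors" list with the selector name in an "id" field (as the original WebScraper.io export does), and the function name and sample Sitemap are hypothetical.

```python
import json

# Selector names taken from the table above.
MANDATORY = {"title", "description"}
DISTRIBUTION_REQUIRED = "distribution_downloadURL"


def missing_selectors(sitemap_json: str) -> list[str]:
    """Return the required selector names that the Sitemap does not define."""
    sitemap = json.loads(sitemap_json)
    # Assumption: selectors live in a "selectors" list and carry their name in "id".
    names = {s.get("id") for s in sitemap.get("selectors", [])}
    missing = sorted(MANDATORY - names)
    # The download URL is only required once any distribution_* selector is present.
    has_distribution = any(n and n.startswith("distribution_") for n in names)
    if has_distribution and DISTRIBUTION_REQUIRED not in names:
        missing.append(DISTRIBUTION_REQUIRED)
    return missing


# Hypothetical Sitemap: "description" is absent, and a distribution is started
# without its download URL.
example = json.dumps({
    "_id": "example-catalogue",
    "selectors": [
        {"id": "title", "type": "SelectorText", "selector": "h1"},
        {"id": "distribution_title", "type": "SelectorText", "selector": ".file-name"},
    ],
})
print(missing_selectors(example))  # ['description', 'distribution_downloadURL']
```

An empty list means that the Sitemap defines every selector the table marks as required.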

In order to add new selectors:

1) Click on the Sitemap menu, then Selectors.
2) In the new tab, which shows the current Sitemap Selectors, click on Add new selectors.
3) As the Figure below depicts, choose the HTML element to be extracted: click the Select button, click on the page element, and finally click Done Selecting.

[Figure: selecting a page element for a new selector]

2. Inserting Sitemap metadata and navigation modes

The Idra Scraper supports websites that offer one of two ways of navigating the dataset pages: Range and Page.

Depending on the navigation mode (Range or Page) and on the parameter type (Query or Path), the Navigation Parameters to enter in the Edit Sitemap metadata section (in the Sitemap menu) will vary.

Navigation by Url Range

The dataset URLs have a parameter (e.g. "id") whose value varies within a numeric range:

Query parameter: e.g. www.example.com/datasets?id=0 to www.example.com/datasets?id=50. In this case, enter the following Navigation Parameters:
- Nav Param Name: id
- Nav Param Type: QUERY RANGE
- Nav Param Start: 0
- Nav Param End: 50
- Start URL: www.example.com/datasets

The Idra Scraper will fetch all the dataset pages from www.example.com/datasets?id=0 to www.example.com/datasets?id=50.

Path parameter: e.g. www.example.com/datasets/id/0 to www.example.com/datasets/id/50. In this case, enter the following Navigation Parameters:
- Nav Param Name: id
- Nav Param Type: PATH RANGE
- Nav Param Start: 0
- Nav Param End: 50

The Idra Scraper will fetch all the dataset pages from www.example.com/datasets/id/0 to www.example.com/datasets/id/50.

For each dataset page, it will extract the metadata fields of the single dataset, according to the dataset selectors defined in the "Creating Selectors for the DCAT-AP fields" section.
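The two Range cases above differ only in how the parameter is attached to the Start URL. As an illustration, the following sketch rebuilds the list of dataset URLs from the Navigation Parameters. It is not Idra's implementation: the function name is hypothetical, and the URL patterns are simply the ones shown in the examples above.

```python
def range_urls(start_url: str, name: str, nav_type: str, start: int, end: int) -> list[str]:
    """Build the dataset page URLs for Range navigation (both bounds included)."""
    if nav_type == "QUERY RANGE":
        template = "{base}?{name}={value}"
    elif nav_type == "PATH RANGE":
        template = "{base}/{name}/{value}"
    else:
        raise ValueError(f"unsupported Nav Param Type: {nav_type}")
    base = start_url.rstrip("/")
    return [template.format(base=base, name=name, value=v) for v in range(start, end + 1)]


query_urls = range_urls("www.example.com/datasets", "id", "QUERY RANGE", 0, 50)
path_urls = range_urls("www.example.com/datasets", "id", "PATH RANGE", 0, 50)

print(query_urls[0], query_urls[-1])
# www.example.com/datasets?id=0 www.example.com/datasets?id=50
print(path_urls[0], path_urls[-1])
# www.example.com/datasets/id/0 www.example.com/datasets/id/50
```

With Nav Param Start 0 and Nav Param End 50, each list holds 51 URLs, one per dataset page.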

Navigation by Pagination

The website has a paginated list of all the datasets, and each element in the list links to the page of a single dataset, e.g. the Data Grand Lyon portal.

[Figure: paginated dataset list]

In this case you have to:

1) Define how to extract the dataset links from the list, by creating the specific "datasetLink" selector with type Element Attribute and href as Attribute Name (see the Figure).

Make sure to select ALL the dataset links (after clicking the second link, all the others will be highlighted automatically). Use Data preview to check that the link URL is extracted correctly.

2) For the pagination you can either:
- Manually specify the number of pages of the dataset list (in the Nav Pages Number field of the Edit Sitemap metadata section).
- Define the dedicated "lastPage" selector in the Sitemap (see the section below); the Scraper will then automatically extract the number of pages from the specified "last page" element.

3) After defining the "datasetLink" selector and the number of pages (through the Nav Pages Number metadata or the lastPage selector), you must specify the type and name of the pagination parameter, namely the parameter in the URL that varies when navigating the list pages:

Query parameter: e.g. https://data.grandlyon.com/search?P=10. In this case, enter the following Navigation Parameters:
- Nav Param Name: P
- Nav Param Type: QUERY PAGE
- Start URL: https://data.grandlyon.com/search

The Idra Scraper will fetch all the list pages from https://data.grandlyon.com/search/?P=0 to https://data.grandlyon.com/search/?P=81.

Path parameter: e.g. https://example.com/search/P/10. In this case, enter the following Navigation Parameters:
- Nav Param Name: P
- Nav Param Type: PATH PAGE
- Start URL: https://example.com/search

The Idra Scraper will fetch all the list pages from https://example.com/search/P/0 to https://example.com/search/P/81.

For each page, it will extract all the dataset links using the "datasetLink" selector. For each dataset link, it will go to the corresponding page and extract the metadata fields of the single dataset (according to the dataset selectors defined in the "Creating Selectors for the DCAT-AP fields" section).

Note. The last page value (81) is the one retrieved either from the "Nav Pages Number" metadata field or through the "lastPage" selector (as described in the example below).

Note. In this case, Nav Param Name represents the parameter used in the URLs of the list pages (e.g. https://data.grandlyon.com/search?P=10), unlike the Range navigation case, in which the parameter is the one used directly in the URLs of the datasets (e.g. www.example.com/datasets?id=0).
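To make the Page navigation flow concrete, the sketch below builds the list page URLs and collects the href attribute of every dataset link on one list page, which is what a "datasetLink" selector of type Element Attribute does. It is only an approximation under stated assumptions: the plugin uses CSS selectors, whereas this example filters on a hypothetical "dataset-link" class with the standard library parser, and the sample HTML, function and class names are invented for illustration. Page numbering starts at 0, as in the examples above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def page_urls(start_url: str, name: str, nav_type: str, last_page: int) -> list[str]:
    """Build the list page URLs from page 0 up to and including last_page."""
    base = start_url.rstrip("/")
    if nav_type == "QUERY PAGE":
        return [f"{base}?{name}={p}" for p in range(last_page + 1)]
    if nav_type == "PATH PAGE":
        return [f"{base}/{name}/{p}" for p in range(last_page + 1)]
    raise ValueError(f"unsupported Nav Param Type: {nav_type}")


class DatasetLinkParser(HTMLParser):
    """Collect the href of every <a> element carrying the given class."""

    def __init__(self, base_url: str, css_class: str):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and self.css_class in classes:
            # Resolve relative links against the URL of the list page.
            self.links.append(urljoin(self.base_url, href))


# Hypothetical list page markup.
SAMPLE_HTML = """
<ul>
  <li><a class="dataset-link" href="/dataset/air-quality">Air quality</a></li>
  <li><a class="dataset-link" href="/dataset/bike-stations">Bike stations</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""

pages = page_urls("https://example.com/search", "P", "QUERY PAGE", 2)
parser = DatasetLinkParser(pages[0], "dataset-link")
parser.feed(SAMPLE_HTML)
print(pages)
print(parser.links)
```

Only the two anchors with the "dataset-link" class are collected; the navigation link is ignored, which mirrors the advice to check in Data preview that only dataset links are extracted.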

The LastPage selector

If the list has an HTML element that points to the last page of the paginated dataset list (e.g. a "last page" button as in the Figure), you can create the specific "lastPage" selector in the Sitemap, specifying how to extract the number of pages from that element.

Example

The last page element is a ">>" arrow, rendered as an HTML a (link) element, as in the Figure below:

[Figure: the ">>" last page arrow]

<a href="https://data.grandlyon.com/search/?Q=&P=81#searchResult"></a>

In this case, define the selector as follows:
a) Create a selector with the "lastPage" name and the "Element Attribute" type, since the href attribute has to be extracted from the a element.
b) Select the HTML element with the "Select" functionality (don't forget to click on Done Selecting after highlighting the element).
c) Fill the "Attribute name" field with href.
d) Use Data preview to check that the link URL is extracted correctly.
e) Save the selector.
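Once the href of the "last page" element has been extracted, the number of pages is the value of the pagination parameter inside that URL. The sketch below shows one way to read it with the Python standard library. How the Idra Scraper actually does this is not described here, so the function name and the handling of the Path case are assumptions.

```python
from urllib.parse import parse_qs, urlparse


def last_page_from_href(href: str, name: str, nav_type: str) -> int:
    """Read the last page number from the href extracted by the lastPage selector."""
    parsed = urlparse(href)
    if nav_type == "QUERY PAGE":
        values = parse_qs(parsed.query).get(name)
        if not values:
            raise ValueError(f"query parameter {name!r} not found in {href}")
        return int(values[0])
    if nav_type == "PATH PAGE":
        # Assumption: the value is the path segment right after the parameter name.
        segments = [s for s in parsed.path.split("/") if s]
        if name not in segments or segments.index(name) + 1 >= len(segments):
            raise ValueError(f"path parameter {name!r} not found in {href}")
        return int(segments[segments.index(name) + 1])
    raise ValueError(f"unsupported Nav Param Type: {nav_type}")


href = "https://data.grandlyon.com/search/?Q=&P=81#searchResult"
print(last_page_from_href(href, "P", "QUERY PAGE"))  # 81
print(last_page_from_href("https://example.com/search/P/81", "P", "PATH PAGE"))  # 81
```

The empty Q parameter and the #searchResult fragment in the example link do not interfere, since only the P parameter is read.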

3. Export the Sitemap

Once all the DCAT-AP dataset selectors, the Sitemap metadata and (in case of navigation by paginated list) the lastPage and datasetLink selectors have been defined, you can export the Sitemap as JSON. Perform the following steps:
- Click on the Sitemap menu and then on Export Sitemap.
- Copy the generated JSON text.


4. Import the Sitemap

In the Add Catalogue form of the Idra Catalogues Management section, fill in the required fields, in particular:
- API Endpoint: must match the one entered in the Start URL Sitemap metadata.
- Type: select WEB.
- Click the Update File button and paste in the File editor the text copied previously when exporting the Sitemap JSON.
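Since the API Endpoint has to match the Start URL stored in the Sitemap, it can be worth comparing the two before submitting the form. The following is a hedged sketch: it assumes the exported JSON stores the Start URL under a "startUrl" key (a string or a list, as in the original WebScraper.io export), and it ignores a trailing slash, which may be more lenient than the comparison Idra performs. All names are hypothetical.

```python
import json


def start_urls(sitemap_json: str) -> list[str]:
    """Return the Start URL(s) stored in the exported Sitemap."""
    # Assumption: the key is "startUrl" and holds either a string or a list.
    value = json.loads(sitemap_json).get("startUrl", [])
    return [value] if isinstance(value, str) else list(value)


def endpoint_matches(api_endpoint: str, sitemap_json: str) -> bool:
    """Check that the API Endpoint equals one of the Sitemap Start URLs."""
    def normalise(url: str) -> str:
        return url.strip().rstrip("/")

    return normalise(api_endpoint) in {normalise(u) for u in start_urls(sitemap_json)}


exported = json.dumps({"_id": "example-catalogue", "startUrl": ["https://example.com/search"]})
print(endpoint_matches("https://example.com/search/", exported))  # True
print(endpoint_matches("https://example.com/datasets", exported))  # False
```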

