Methodology of OpenDataMonitor
Contents
- Terms and definition
- Methodology
- Quality metrics
- Quantity metrics
- Data download
- Quality assurance
- Reporting issues
- Harvest my catalogue
- Known issues
- Contact details
Terms and definition
- Licence: A licence specifies what can and cannot be done with a piece of work. It grants permissions and states restrictions. An open licence is one which grants permission to access, reuse and redistribute a work with few or no restrictions. An open licence is the most fundamental requirement for open data.
- Format: A format is a pre-established layout for data. Data can be stored in a structured or unstructured format. Data is machine readable if it is in a format that can be easily read and parsed by a computer. The list of machine readable formats we have defined for OpenDataMonitor can be accessed on the ODM GitHub.
- Distribution: A distribution of a certain data set represents a specific available form of that data set. Each data set might be available in different forms, and these forms might represent different formats of the data set or different endpoints. For example, distributions could include a downloadable CSV file, an API or RSS feed.
- Metadata: Metadata is data that describes and gives information about other data. It makes the data easier to retrieve, use and manage. Metadata of open data should contain at least the following information: licence, author, organisation, date released and date updated.
- Official catalogue: A catalogue or data portal which is administered and/or authorised by a government body (territorial jurisdiction or agency).
- Unique publishers: These are data publishing organisations that are separate entities.
- Software platform: Software or platforms that are used to deploy open data portals and catalogues. There are both open source and commercial platforms that streamline data publishing, sharing, management, analysis and reuse.
- Category: Policy field used to categorise the dataset. Categories are represented as metadata and include policy fields like Agriculture, Economy, Environment, Transportation and more.
- Overall quality score: This term was created by OpenDataMonitor for ranking and sorting countries and catalogues. It is defined as the average of the most relevant metrics: open licences, machine readability, availability and metadata completeness.
- Harvested catalogues: Data catalogues identified by OpenDataMonitor and harvested in order to give end users a high level of aggregation and analysis. After all the datasets have been harvested, a thorough harmonisation process is run to clean the data before further analysis. The list of harvested catalogues can be found here.
- Pending catalogues: Data catalogues identified by OpenDataMonitor that are waiting in the queue to be harvested by the technical team. Pending catalogues are also considered harvestable. The list of catalogues pending harvest can be found here.
- Unable to harvest catalogues: Some of the identified data catalogues could not be harvested, for various reasons. These reasons have been documented by OpenDataMonitor and are presented in the Advanced Search interface. Find the list of "Unable to harvest" catalogues here.
- Other sources: Open data sources identified during the project's collection process and classified as other sources. They include potential open data in various categories that cannot necessarily be harvested.
- Alert: An open data alert is raised when a dataset does not contain the following metadata: licence, author, organisation, date released and date updated.
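The Alert definition above can be illustrated with a minimal sketch. The field keys (`licence`, `date_released`, and so on) are hypothetical names for the metadata listed in the definition, and the sketch assumes that a single missing field is enough to raise an alert; the definition does not say whether one or all fields must be missing, so treat that rule as an assumption.

```python
# Hypothetical keys for the metadata named in the Alert definition.
ALERT_FIELDS = ("licence", "author", "organisation", "date_released", "date_updated")


def missing_alert_fields(dataset):
    """Return the alert-relevant metadata fields that are absent or empty."""
    return [f for f in ALERT_FIELDS if not str(dataset.get(f) or "").strip()]


def has_alert(dataset):
    # Assumption: one missing field is enough to raise an alert.
    return bool(missing_alert_fields(dataset))


example = {
    "licence": "cc-by",
    "author": "",
    "organisation": "City council",
    "date_released": "2015-07-01",
}
print(missing_alert_fields(example))
print(has_alert(example))
```

Returning the list of missing fields, rather than only a flag, leaves room to apply a stricter or looser alert rule than the one assumed here.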
Methodology
Harvesting: This is the process of collecting metadata from external portals and catalogues. The process can be scheduled to run either periodically or on demand. In order to address the different data platforms and APIs that exist, different harvesters need to be implemented. For OpenDataMonitor we have three harvesters: a CKAN harvester, which is responsible for extracting metadata from CKAN-based catalogues; an HTML harvester, which extracts metadata from the HTML pages where the registered catalogue’s metadata is described; and a Socrata harvester, which collects metadata from catalogues built upon the Socrata platform. If you would like to register your catalogue or portal, please email harvest@opendatamonitor.eu
Harmonisation: This handles the heterogeneity of collected metadata. It transforms and maps different metadata sets to an internal common schema, which makes it possible to interpret, aggregate and use the metadata in a consistent and meaningful way for analysis. The heterogeneity can involve metadata with different schemas, different attribute descriptions or different value representations (for example, abbreviations used). Mappings related to data catalogues’ attribute names and values that were identified and used for harmonisation can be found here.
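As an illustration of the harmonisation step, the sketch below maps attribute names and values from a harvested record onto a common schema. Both mapping tables and the function name `harmonise` are hypothetical; the mappings actually used by OpenDataMonitor are the published ones referred to above, and this is not the project's implementation.

```python
# Hypothetical attribute-name mapping: source key -> common schema key.
FIELD_MAP = {
    "license": "licence",
    "license_id": "licence",
    "maintainer": "author",
    "publisher": "organisation",
}

# Hypothetical value mapping per common field: variant -> canonical value.
VALUE_MAP = {
    "licence": {
        "cc by": "cc-by",
        "creative commons attribution": "cc-by",
    },
}


def harmonise(record):
    """Map one harvested metadata record onto the common schema."""
    out = {}
    for key, value in record.items():
        source_key = key.strip().lower()
        field = FIELD_MAP.get(source_key, source_key)
        if isinstance(value, str):
            cleaned = value.strip()
            # Replace known variants; keep unknown values unchanged.
            value = VALUE_MAP.get(field, {}).get(cleaned.lower(), cleaned)
        out[field] = value
    return out


print(harmonise({"License": " CC BY ", "Publisher": "City of X"}))
```

Unknown attribute names and values pass through unchanged in this sketch, so gaps in the mapping tables stay visible in the output instead of being silently dropped.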
APIs (public APIs): The application programming interface (API) is a set of routines, standards and tools that makes it possible for other software or components to communicate with the metadata database of the OpenDataMonitor platform. It provides access to the metadata records themselves or computes various pre-defined aggregated metrics.
Quality metrics
- Open licence
This metric represents the total count of open licences over the total count of distributions with a licence. You can access our published open licence mappings here.
- Machine readable
This metric represents the machine readability of a dataset, calculated from the formats of its distributions. The distribution formats include file types such as CSV, XLS, JSON, XML and RDF. You can access the full published list of machine readable formats here.
- Open metadata
This metric represents the average of missing metadata across a defined set of fields: licence, author, organisation, and at least one of date released or date updated.
- Open access
This metric represents the number of datasets qualified as publicly available over the total number of datasets in a catalogue. The availability score is calculated across a defined set of fields: a description, at least one resource with a functional link and an available email of the author.
- Discoverability
This metric represents an estimate of how important a catalogue is on the web, based on two traffic ranking systems: Google and Alexa.
- Open formats
This complements the machine readable metric. It represents the total count of distributions with a non-proprietary format over the total count of distributions with a format. You can access our published list of non-proprietary formats here.
- Overall quality score
The overall quality score is calculated as the average of open licences, machine readable, open access and open metadata. This score involves the most important metrics defined based on the current open data standards, and is used to rank and sort catalogues and countries respectively.
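The ratio-style metrics and the overall score described above can be sketched as follows. The lookup sets, the record keys (`licence`, `format`) and the function names are hypothetical, and returning 0.0 when no distribution carries the field is an assumption; the published ODM lists remain the authoritative source for which licences and formats qualify.

```python
from statistics import mean

# Hypothetical lookup sets, for illustration only.
OPEN_LICENCES = {"cc-by", "cc0", "odc-pddl"}
NON_PROPRIETARY_FORMATS = {"csv", "json", "xml", "rdf"}


def share(distributions, field, accepted):
    """Fraction of distributions whose `field` value is in `accepted`,
    counted only among distributions that have the field at all."""
    values = [d[field].strip().lower() for d in distributions if d.get(field)]
    if not values:
        return 0.0  # assumption: no usable values scores zero
    return sum(v in accepted for v in values) / len(values)


def overall_quality_score(open_licence, machine_readable, open_access, open_metadata):
    """Plain average of the four metrics named in the text."""
    return mean([open_licence, machine_readable, open_access, open_metadata])


distributions = [
    {"licence": "CC-BY", "format": "CSV"},
    {"licence": "proprietary", "format": "XLS"},
    {"format": "JSON"},  # no licence: excluded from the licence denominator
]
print(share(distributions, "licence", OPEN_LICENCES))
print(share(distributions, "format", NON_PROPRIETARY_FORMATS))
print(overall_quality_score(0.5, 1.0, 0.75, 0.75))
```

The same `share` helper serves both the open licence and open formats metrics because both are defined as a count of qualifying distributions over the count of distributions that have the relevant field.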
Quantity metrics
- Total distribution size
This metric represents the total size of all resources, regardless of their format, for every dataset in a catalogue. The size is in KBytes.
- Number of distributions
This metric represents the total number of distributions of a dataset or the total number of distributions of all datasets in a specific catalogue. A distribution of a certain data set refers to a specific available form of that data set.
- Number of datasets
This metric represents the total number of available datasets in a catalogue.
- Number of unique publishers
This metric represents the total number of unique publishing organisations of a specific catalogue.
- Catalogues harvested
Number of catalogues per country harvested and harmonised.
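The per-catalogue quantity metrics above amount to simple counts and sums, as in this minimal sketch. The record layout (`publisher`, `distributions`, `size_kb`) is a hypothetical one chosen for illustration, not the platform's internal schema, and publisher names are compared case-insensitively as an assumption.

```python
def quantity_metrics(datasets):
    """Compute the per-catalogue quantity metrics for a list of dataset records."""
    distributions = [
        dist for ds in datasets for dist in ds.get("distributions", [])
    ]
    # Assumption: publishers differing only in case or spacing are the same entity.
    publishers = {
        ds["publisher"].strip().lower() for ds in datasets if ds.get("publisher")
    }
    return {
        "datasets": len(datasets),
        "distributions": len(distributions),
        "total_size_kb": sum(dist.get("size_kb", 0) for dist in distributions),
        "unique_publishers": len(publishers),
    }


catalogue = [
    {"publisher": "City of X", "distributions": [{"size_kb": 120}, {"size_kb": 30}]},
    {"publisher": "city of x", "distributions": [{"size_kb": 50}]},
    {"publisher": "Agency Y", "distributions": []},
]
print(quantity_metrics(catalogue))
```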
Data download
The latest complete copy of all the data from OpenDataMonitor can be downloaded in various formats here. Users can interact with the data to download their preferred selection.
Quality assurance
Numerous quality assurance procedures are undertaken in publishing this data on the OpenDataMonitor platform. These include: duplication reduction, community feedback and expert eye assurance, quality checks and mapping.
Reporting issues
If you spot a problem with the data, please report it to odm.issues@synyo.com
Harvest my catalogue
If you would like us to harvest your catalogue on OpenDataMonitor, please contact us at: harvest@opendatamonitor.eu
Known issues
- The project is aware of the new and growing open data market and will therefore make every effort to harvest as many catalogues as possible within the allocated resources.
- The results of the analysis and the visualisations present the data that is harvested and revisited weekly. Some of the harvested data is not of the best quality.
- The time snapshots capture open data from July 2015 onwards. For earlier periods the platform does not provide accurate data on the status of the various metrics defined by this monitor.
- The project is aware of some inconsistencies during the first months of the time capture procedure. One or two metrics were developed further during the implementation of the project, which may also have caused radical changes in the data. From September 2015 onwards, the core business rules behind the visual content have not changed.