The journal Data Science is an interdisciplinary journal that aims to publish novel and effective methods for using scientific data in a principled, well-defined, and reproducible fashion; concrete tools based on these methods; and applications thereof. The ultimate goal is to unleash the power of scientific data to deepen our understanding of physical, biological, and digital systems, to gain insight into human social and economic behavior, and to design new solutions for the future.

The rising importance of scientific data, both big and small, brings with it the challenge of combining structured but often siloed data with messy, incomplete, and unstructured data from text, audio, and visual content, as well as from sources such as sensors and weblogs. New methods to extract, transport, pool, refine, store, analyze, and visualize data are needed to unleash their power while simultaneously making tools and workflows easier for the public at large to use.

The journal invites contributions ranging from theoretical and foundational research to platforms, methods, applications, and tools in all of these areas. We welcome papers that add a social, geographical, or temporal dimension to data science research, as well as application-oriented papers that prepare and use data in discovery research.
This journal focuses on methods, infrastructure, and applications around the following core topics:
- scientific data mining, machine learning, and Big Data analytics
- data management, network analysis, and scientific knowledge discovery
- scholarly communication and (semantic) publishing
- research data publication, indexing, quality, and discovery
- data wrangling, integration, and provenance
- trend analysis, prediction, and visualization
- crowdsourcing and collaboration
- corroboration, validation, trust, and reproducibility
- scalable computing, analysis, and learning
- smart and semantic web services, executable workflows
- analytics, intelligence, and real-time decision making
- socio-technical systems
- social impacts of data science
Semantic publishing has been defined as anything that enhances the meaning of a published journal article, facilitates its automated discovery, enables its linking to semantically related articles, provides access to the data within the article in actionable form, or facilitates integration of data between papers. Towards the goal of genuine semantic publishing, where a work may be published with its content and metadata represented in a machine-interpretable semantic notation, this journal will work with a global set of partners to develop standardized methods that ensure our publications serve as a machine-accessible store of knowledge.
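To make this concrete, here is a minimal sketch of what machine-interpretable article metadata could look like, using the Python rdflib library and Dublin Core terms. The article URI, the chosen properties, and the values are illustrative assumptions, not the journal's actual publishing schema.

```python
# A minimal sketch of machine-interpretable article metadata.
# The article URI and the vocabulary choices below are illustrative
# assumptions, not the journal's actual schema.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, RDF, FOAF

g = Graph()
article = URIRef("https://example.org/article/123")  # hypothetical identifier

# Describe the article with Dublin Core terms so that machines can
# discover it, link it to related work, and integrate its metadata.
g.add((article, RDF.type, FOAF.Document))
g.add((article, DCTERMS.title, Literal("Example article title")))
g.add((article, DCTERMS.creator, Literal("A. Author")))
g.add((article, DCTERMS.subject, Literal("knowledge graph completion")))

# Serialize as Turtle, a common machine-readable semantic notation.
print(g.serialize(format="turtle"))
```

Serialized this way, such statements can be harvested and linked by machines alongside the human-readable article.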
An important goal of the journal is to promote an environment in which annotated data are produced and shared with the wider research community. The development and use of data and metadata standards are critical to achieving this goal. Authors should ensure that any data used or produced in a study are represented using community-based data formats and metadata standards.
Rapid, Open, Transparent, and Attributed Reviews
The Data Science journal relies on an open and transparent review process. Submitted manuscripts are posted on the journal’s website and are publicly available. In addition to solicited reviews selected by members of the editorial board, public reviews and comments from any researcher are welcome and can be uploaded via the journal website. All reviews and responses from the authors are posted on the journal homepage, and all involved reviewers and editors are acknowledged in the final published version. While we strongly encourage reviewers to participate in the open and transparent review process, it is still possible to submit anonymous reviews. The names of the editors and non-anonymous reviewers are included in all published articles. The journal aims to complete reviews within 2-4 weeks of submission.
The journal will provide editor and reviewer profiles and metrics (links to ORCID, Google Scholar, etc.).
Abstract: The open nature of Knowledge Graphs (KGs) often implies that they are incomplete. Knowledge graph completion (a.k.a. link prediction) consists of inferring new relationships between the entities of a KG based on existing relationships. Most existing approaches rely on learning latent feature vectors that encode entities and relations. In general, however, latent features cannot be easily interpreted. Rule-based approaches offer interpretability, but a distinct ruleset must be learned for each relation. In both latent- and rule-based approaches, the training phase has to be run again whenever the KG is updated. We propose a new approach that does not need a training phase, and that can provide interpretable explanations for each inference. It relies on the computation of Concepts of Nearest Neighbours (C-NN) to identify clusters of similar entities based on common graph patterns. Different rules are then derived from those graph patterns and combined to predict new relationships. We evaluate our approach on standard benchmarks for link prediction, where it achieves competitive performance compared to existing approaches.
Keywords: Knowledge Graph, multi-relational data, knowledge graph completion, link prediction, graph pattern, Concepts of Nearest Neighbours, inference rules, analogical inference, explainable AI
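To give the general flavor of rule-style link prediction over a knowledge graph, here is a hypothetical toy sketch in Python. It is not the authors’ C-NN algorithm: it merely suggests a missing relationship for an entity by copying relations from entities that share features with it, and the triples, entities, and similarity test are all invented for illustration.

```python
# Hypothetical toy illustration of rule-style link prediction over a
# tiny triple store; NOT the authors' C-NN algorithm.
from collections import defaultdict

triples = {
    ("alice", "works_at", "acme"),
    ("alice", "lives_in", "paris"),
    ("bob", "works_at", "acme"),
}

# Index each entity by its outgoing (relation, object) pairs.
features = defaultdict(set)
for s, p, o in triples:
    features[s].add((p, o))

def predict(entity, relation):
    """Suggest objects for (entity, relation, ?) by copying the relation
    from entities that share at least one (relation, object) pair."""
    candidates = set()
    for other, feats in features.items():
        if other != entity and features[entity] & feats:
            candidates |= {o for p, o in feats if p == relation}
    # Exclude relationships the entity already has.
    return candidates - {o for p, o in features[entity] if p == relation}

print(predict("bob", "lives_in"))  # {'paris'}: bob resembles alice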
Abstract: Adopting open science principles can be challenging, requiring conceptual education and training in the use of new tools. This paper introduces the Workflow for Open Reproducible Code in Science (WORCS): a step-by-step procedure that researchers can follow to make a research project open and reproducible. The workflow is intended to lower the threshold for adoption of open science principles. It is based on established best practices and can be used either in parallel to, or in the absence of, top-down requirements by journals, institutions, and funding bodies. To facilitate widespread adoption, the WORCS principles have been implemented in the R package worcs, which offers an RStudio project template and utility functions for specific workflow steps. This paper introduces the conceptual workflow, discusses how it meets different standards for open science, and addresses the functionality provided by the R implementation, worcs. It is primarily targeted at scholars conducting research projects in R that involve academic prose, analysis code, and tabular data. However, the workflow is flexible enough to accommodate other scenarios and offers a starting point for customized solutions. The source code for the R package and manuscript, and a list of examples of WORCS projects, are available at https://github.com/cjvanlissa/worcs.
Keywords: Open science, reproducibility, R, dynamic document generation, version control
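Although worcs itself is an R package, one language-agnostic ingredient of reproducible workflows of this kind is verifying that data files are byte-identical to the versions used in the original analysis. The Python sketch below illustrates that idea under its own assumptions (the checksum file location and helper names are hypothetical); it is not part of the worcs package.

```python
# Minimal sketch of data-integrity checking for a reproducible workflow:
# record a SHA-256 checksum per data file, then verify it on later runs.
# The checksum file location and function names are hypothetical.
import hashlib
import json
from pathlib import Path

CHECKSUM_FILE = Path("checksums.json")  # hypothetical location

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def record(path: Path) -> None:
    """Store the file's checksum so later runs can verify it."""
    sums = json.loads(CHECKSUM_FILE.read_text()) if CHECKSUM_FILE.exists() else {}
    sums[str(path)] = sha256_of(path)
    CHECKSUM_FILE.write_text(json.dumps(sums, indent=2))

def verify(path: Path) -> bool:
    """Check that the file still matches the recorded checksum."""
    sums = json.loads(CHECKSUM_FILE.read_text())
    return sums.get(str(path)) == sha256_of(path)
```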
Abstract: One of the most popular methods to visualize the overlap and differences between data sets is the Venn diagram. Venn diagrams are especially useful when they are ‘area-proportional’, i.e., when the sizes of the circles and the overlaps correspond to the sizes of the data sets. In 2007, the BioVenn web interface was launched, and it is used by many researchers. However, this web implementation requires users to copy and paste (or upload) lists of IDs into the web browser, which is not always convenient and makes it difficult for researchers to create Venn diagrams ‘in batch’ or to automatically update the diagram when the source data change; such automation is only possible with software such as R or Python. This paper describes the BioVenn R and Python packages, which are easy-to-use packages that generate accurate area-proportional Venn diagrams of two or three circles directly from lists of (biological) IDs. The only required input is two or three lists of IDs. Optional parameters include the main title, the subtitle, the printing of absolute numbers or percentages within the diagram, colors, and fonts. The function can show the diagram on the screen, or it can export the diagram in one of the supported file formats. The function also returns all thirteen resulting ID lists. The BioVenn R and Python packages were created for biological IDs, but they can be used for other IDs as well. Finally, BioVenn can map Affymetrix and EntrezGene IDs to Ensembl IDs. The BioVenn R package is available in the CRAN repository and can be installed by running ‘install.packages(“BioVenn”)’. The BioVenn Python package is available in the PyPI repository and can be installed by running ‘pip install BioVenn’. The BioVenn web interface remains available at https://www.biovenn.nl.
Keywords: Bioinformatics, visualization, Venn diagram, combinatorics, set theory, genomics, data science, R, Python
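As a usage illustration for the Python package described above, the sketch below draws a three-circle diagram from toy ID lists. The abstract confirms only that two or three ID lists are the required input and that titles, subtitles, and number formatting are optional; the function name draw_venn, the exact keyword arguments, and the keys of the returned lists are best-effort assumptions that should be checked against the PyPI documentation.

```python
# Usage sketch for the BioVenn Python package (pip install BioVenn).
# Argument names and return keys below are assumptions to verify
# against the package documentation.
from BioVenn import draw_venn

list_x = ["gene1", "gene2", "gene3"]
list_y = ["gene2", "gene3", "gene4"]
list_z = ["gene3", "gene4", "gene5"]

# Draws an area-proportional diagram and returns the thirteen ID lists.
result = draw_venn(list_x, list_y, list_z,
                   title="Example", subtitle="Three toy gene lists",
                   nrtype="abs")  # 'abs' = absolute numbers (assumed)
print(result["xyz"])  # hypothetical key for the triple intersection
```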