# Changes in academic libraries in the era of Open Science

#### Abstract

In this paper we study the changes in academic library services inspired by the Open Science movement, especially the changes prompted by Open Data as a founding part of Open Science. We argue that academic libraries face even bigger challenges in accommodating and supporting Open Big Data, composed of existing raw data sets and new massive sets generated by data-driven research. Ensuring the veracity of Open Big Data is a complex problem dominated by data science. For academic libraries, that challenge not only triggers the expansion of traditional library services but also leads to the adoption of a set of new roles and responsibilities. These include, but are not limited to, developing supporting models for Research Data Management, providing Data Management Plan assistance, expanding the qualifications of library personnel toward data science literacy, and integrating library services into the research and educational process by taking part in research grants. We outline several approaches taken by some academic libraries and by libraries at the City University of New York (CUNY) to meet the needs imposed by doing research and education with Open Big Data – from changes in libraries’ administrative structure and in personnel qualifications and duties, to leading interdisciplinary advisory groups and actively collaborating in principal projects.

## 1. Introduction

Even if it is clear that the ultimate goal of Open Science is to maximize research output by removing barriers and promoting collective science, a single universal definition of Open Science does not exist. Some authors discuss Open Science only as the establishment of Open Access Publishing and Open Source, and identify the development of the Internet and easy cross-boundary communication as the main driving force (Grand et al., 2012). Others analyze Open Science in the framework of communication theory (Kulczycki, 2016). In the context of the science-society relation, Open Science has also been analyzed as a cultural change aiming to expand the economic impact of science on society by removing barriers to knowledge exchange and by changing the ways scientific data are created, disseminated, stored, and delivered (David, 2008). Similarly, Seaz and Martinez-Fuentes (2018) recognize Open Science as a “global movement that brings up socio-cultural and technological change, based on openness and connectivity on how research is designed, conducted captured and accessed”. As a starting point, in this paper we adopt Fecher and Friesike’s quite general definition of Open Science as an “umbrella term encompassing a multitude of assumptions about the future of knowledge creation and dissemination” (Fecher & Friesike, 2014).

Kraker et al. (2011) define the four instruments of Open Science: Open Access (OA), Open Source (OS), Open Data (OD), and Open Methodology (OM), all representing the application of the concept of openness to each step in a research workflow. In the educational context, particularly in learning and teaching, Open Educational Resources (OER) have joined the movement, too. It is important to mention that although Open Science is currently most visible in the “hard sciences” (owing to the large data sets generated by high-throughput experiments and simulations), it is not limited to the STEM fields but is also applicable to other types of scientific research. For example, the Open Data project “Brain Research through Advancing Innovative Neuro Technologies” (BRAIN) explores the perspectives of Open Science in psychology (Hesse, 2018). Open Access and Open Data are the most “subject independent” components of Open Science because they appear in any study regardless of its topic and scope. Consequently, and as a global trend, academic libraries emphasize building support for OA and OD in addition to the traditional information services they provide. In particular, many academic libraries are speeding up their own development toward offering “non-traditional” data-driven and data-oriented services in order to support the research requirements, scenarios and workflows typical of collaborative and highly communicative open science projects. These new data services necessitate Internet-based models of communication and the use of a large set of complex digital technologies. Accordingly, acquiring a set of new qualifications and skills becomes mandatory for 21st century librarians (Affelt, 2015).

## 2. Open Science – revolution or evolution in scientific research

Scientific research can be defined as the planned, organized, and systematic collection, interpretation and analysis of data with the purpose of contributing to global scientific knowledge. In other words, scientific research has an intrinsically public nature – as the French physiologist Claude Bernard once said, “Art is I, science is we” (Bernard, 1957). Ensuring open and reproducible research has become a main goal across scientific communities and is supported by political circles and funding organizations (Boulton, 2016). The understanding is that open and reproducible research practices enable scientific re-use, accelerating future projects and discoveries in any discipline (Chen et al., 2019).

However, the current system of disseminating scientific knowledge does not serve the public nature of science. The subscription-based model practiced by journal publishers, with its for-profit, marketing-based system of dissemination, does not support the research process. The first obstacle is purely financial – almost 75% of published scientific articles are behind paywalls and are therefore accessible only to those who work at institutions able to afford the steep subscriptions (Tennant et al., 2016). Unfortunately, subscription to all peer-reviewed journals is not affordable for any single individual, research institute or university, meaning that the potential impact of published research is never fully reached. Second, additional barriers are posed by widespread disagreement regarding the availability of data and curated samples and their corresponding metadata, especially in the field sciences (McNutt et al., 2016). The third obstacle is data collection and the data themselves. The data and metadata practices of researchers often appear incomplete or deficient because data acquisition processes differ between sciences (Van Tuyl & Whitmire, 2016). For example, for laboratory scientists, data are usually computer generated and hence in digital format, so they can be uploaded automatically to repositories with little or no human intervention. For field scientists (ecologists, archeologists, etc.), however, data collected in the field are later recovered with a large degree of human improvisation before being incorporated into data repositories (Gitleman, 2013). Finally, many scientists are unwilling to share their data for fear of exploitation of data sets rich enough to produce several publications (Molloy, 2012).

Under the subscription-based journal model, sharing research data is possible within the scope of a particular paper, but that does not necessarily cover the complete data sets for a particular research project. Indeed, many journals offer mechanisms to upload research data (Stuart et al., 2018), but sets of raw experimental and modeling data, details about non-traditional or unique experimental methodologies, results from failed experiments, results from failed theories and many other research byproducts are rarely or never published in subscription journals.

Open Science aims to alleviate most of the above problems by changing the ways knowledge is both created and disseminated across society. Increased access to research outputs might help foster a culture of greater scientific education and literacy, which in turn could have a direct impact on public policy (European Commission, 2012; Zuccala, 2010), particularly in domains such as climate change and global health, as well as increasing public engagement in scientific research. It is important to emphasize that openness and data sharing will affect not only knowledge creation and dissemination but will also increase the effectiveness of education and of data and knowledge processing. As Stodden et al. (2016) pointed out, “access to the computational steps taken to process data and generate findings is as important as access to the data themselves”. However, there is significant confusion about what open research data should look like and about the compliance of these data with the Open Knowledge/Open Data definition (Molloy, 2012). That is understandable, since very little scientific content is created outside the scientific communities (Fecher & Friesike, 2014). No doubt the Open Science movement is an effort to make scientific data a public good, in contrast to the expansion of intellectual property rights over knowledge.

Indeed, the propagation of Open Science is facilitated by the development of digital technologies and the exponential growth of data produced by the global scientific community. Owing to the advancement of information technologies and computers, scientific experiments generate unprecedented amounts of data, which can be made accessible in any place or country to any researcher via the World Wide Web. Open Science is also a direct result of changes in the research process and the increasing need for collaborative and interdisciplinary research. However, it is important to mention that Open Science as a global phenomenon also requires significant socio-cultural changes at all levels, along with harmonized legislation and political support. In Europe the European Open Science Cloud (EOSC) is an umbrella for academic and research libraries, universities and research centers with the goal of providing solutions for the scientific community in the context of Open Science (Mons et al., 2017). In the US the Open Science Chain (OSC), a project in progress funded by the National Science Foundation (NSF), aims to develop a cyberinfrastructure platform that would allow researchers to make available metadata and verification information about their scientific datasets and to update this information as the datasets change over time.

## 3. Open Science as a cultural and social phenomenon

From the perspective of knowledge dissemination, Open Science goes beyond the transmission of knowledge, facts, ideas or information among participants in communication channels to remapping social relations and creating new sets of social interactions. In other words, openness influences the entire process of knowledge creation. For instance, scientific publications, as a high-quality final product, have little or no socio-cultural dimension. By making all steps of the research process visible to society, Open Science creates the conditions to involve different and often wider social groups in the creative process. Therefore, Open Science is more a social and cultural phenomenon aiming to recover the founding principles of scientific research than an alternative form of knowledge exchange. Thus, as an object of study, Open Science should not be modeled via a simple provisional communication model, but rather with constitutive models (Craig, 1999), because Open Science is not defined only by the processes of communication between scientists, even if it is true that openness in science depends on the application of various new communication technologies that aid both scholarly communication and research impact evaluation. Indeed, Open Science rises on a base of economic and technological developments (for example, the ability to propagate knowledge free of charge, the existence of common platforms for sharing information not limited in time and space, global social networks, Internet 2, etc.), but it also goes beyond these technological achievements and changes the understanding of the value of scientific knowledge across society. As such, Open Science manifests itself as a form of social organization built on a base of technology, which aims to maximize the rate of accumulation of knowledge in society and consequently to maximize the rate of growth induced by research activities.
The driving forces behind Open Science are not-for-profit organizations, scientists and their professional organizations, informal groups, public libraries, academic libraries, universities, foundations, government agencies, and individuals.

## 4. Open Science – an academic library perspective

As mentioned above, Open Science has several constituent components that aim to support high-quality reproducible science. Open Science is often described as a multifaceted notion encompassing open access to publications, open research data, open source software, open collaboration, open peer review, open notebooks, open educational resources, open monographs, citizen science, or research crowdfunding (FOSTER, 2017) in order to remove barriers to the sharing of scientific research output and raw data. In this section we focus on OA and OD because these two components inspire the largest changes in academic libraries’ services and operations under the Open Science model. Historically, the driving force behind OD and Open Big Data (OBD) originated from scientific communities and not-for-profit organizations, but OD and OBD are now the result of governments’ efforts (and consequently of institutional requirements) to set up mediated data repositories and to formulate rules and policies for sharing the research data from all publicly funded projects. For example, pilot initiatives such as the “H2020 Programme” (European Commission, 2011) in Europe require any research funded by public sources to be published in Open-Access journals and the data to be deposited in Open-Access data collections. In the US, the two main research funding agencies – the National Science Foundation (NSF) and the National Institutes of Health (NIH) – have similar mandatory requirements, in line with FASTR (Fair Access to Science and Technology Research Act), a bill introduced in Congress in 2013 that would instruct all U.S. science funding agencies to provide public access to federally supported research outputs.

As defined by Open Society, Open Access (OA) is a publishing and distribution model that makes scholarly research literature – much of which is funded by taxpayers around the world – freely available to the public online, without restrictions. In the context of Open Science, OA refers to free access to scientific publications and to databases with results from studies in particular scientific disciplines (e.g., metabolomics databases). More precisely, the Berlin Declaration on Open Access to Knowledge in the Sciences and Humanities (Max Planck Gesellschaft, 2003) defines OA as “a comprehensive source of human knowledge and cultural heritage that has been approved by the scientific community”. Many important and rich data sets are the result of projects that rely on data collected by non-professional scientists (citizen science projects). As Groom (2017) points out: “citizen science data sets comprise 10% of data sets on GBIF (Global Biodiversity Information Facility), but account for 60% of all observations”. However, owing to the large costs of citizen science projects, and “contrary to what many people assume, data sets from volunteers are among the most restrictive in how they can be used”. Typically these open data sets are accessible to library patrons via registration. Some examples are the data sets accumulated for the urban microbiome project – AREM – led by the City University of New York, and the eBird database in ornithology. In addition, citizen science data sets often come in different formats and are almost always web based. To ensure access to these resources, academic libraries have to develop and expand their metadata services. These include, but are not limited to, metadata consultation services for patrons, hands-on tutorials and manuals.

Open Standards are broadly defined as standards “independent of any single institution or manufacturer, and to which users may propose amendments” (Pountain, 2003). Academic libraries recognize Open Standards as a vehicle to ensure long-term preservation of digital content and interoperability between systems, and to solve obsolescence problems caused by advances in computer hardware and consequent changes in specifications, storage formats and access mechanisms. In particular, academic libraries engage actively with the Open Archives Initiative (OAI) Protocol for Metadata Harvesting (OAI-PMH), which provides an application-independent interoperability framework based on metadata harvesting. In addition, academic libraries utilize various open standards for information retrieval, such as OpenURL and the Dublin Core Metadata Initiative (Corrado, 2005).
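
To make the harvesting model more concrete, the following is a minimal sketch in Python of how a harvester might form an OAI-PMH `ListRecords` request and read Dublin Core fields from the response. The base URL `https://repository.example.org/oai`, the helper names, and the embedded sample response are illustrative assumptions and are not taken from any particular repository; a real harvester would also fetch the response over HTTP and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 responses and by unqualified Dublin Core.
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a (hypothetical) repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hand-written sample response, standing in for what a repository might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample data set</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return identifier, titles and creators for each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        titles = [e.text for e in record.iter(f"{{{DC_NS}}}title")]
        creators = [e.text for e in record.iter(f"{{{DC_NS}}}creator")]
        records.append(
            {"identifier": identifier, "titles": titles, "creators": creators}
        )
    return records


if __name__ == "__main__":
    print(build_list_records_url("https://repository.example.org/oai"))
    for rec in extract_records(SAMPLE_RESPONSE):
        print(rec)
```

Because the request is a plain URL and the response is plain XML, any system can harvest the metadata without knowing how the repository stores it internally, which is the sense in which the framework is application independent.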

## 5. Academic libraries and Big Data

Modern research methods and sampling techniques generate large data sets. Data-driven projects necessitate a Research Data Management (RDM) strategy during all stages of the project – initial planning, collection of data from different sources, identification and labeling of data sets, processing of these data sets, and preservation and sharing of the results and raw data with the research community (Cox & Pinfield, 2014). It is important to mention that RDM is a complex activity that incorporates data curation, an important part of archiving and preserving data for re-use. Academic librarians are investing resources to curate research data, especially Big Data (Akers, 2014). Big Data are characterized by the 3 Vs – high volume, high velocity, and high variety. In the context of the research cycle, Big Data are also characterized by veracity, which refers to the quality of Big Data, understood in terms of accuracy (reliable methods of data acquisition), completeness (are there duplicates or missing data), consistency (are measurements and unit conversions accurate), uncertainty about sources, and model approximations (Lukoianova & Rubin, 2014).
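
The completeness aspect of veracity can be illustrated with a small check for missing values and duplicate records. The sketch below is a hedged illustration only: the function name `quality_report`, the field names, and the sample measurements are hypothetical, and it does not implement the assessment framework of Lukoianova and Rubin (2014).

```python
from collections import Counter


def quality_report(records, required_fields):
    """Summarize completeness per field and count duplicate records.

    A record counts as a duplicate when all of its required fields
    match those of an earlier record.
    """
    total = len(records)
    missing = {field: 0 for field in required_fields}
    for rec in records:
        for field in required_fields:
            value = rec.get(field)
            if value is None or str(value).strip() == "":
                missing[field] += 1

    # Count how often each combination of required values occurs.
    keys = Counter(tuple(rec.get(f) for f in required_fields) for rec in records)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)

    completeness = {
        field: (1 - missing[field] / total) if total else 0.0
        for field in required_fields
    }
    return {"total": total, "completeness": completeness, "duplicates": duplicates}


if __name__ == "__main__":
    # Hypothetical field measurements: one repeated row, one missing unit.
    sample = [
        {"sample_id": "S1", "value": 7.2, "unit": "pH"},
        {"sample_id": "S2", "value": 6.9, "unit": "pH"},
        {"sample_id": "S2", "value": 6.9, "unit": "pH"},
        {"sample_id": "S3", "value": 7.0, "unit": ""},
    ]
    print(quality_report(sample, ["sample_id", "value", "unit"]))
```

Checks of this kind address only completeness; accuracy, consistency of units, and uncertainty about sources require domain knowledge that a generic script cannot supply.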

It is worth mentioning that Big Data typically need a set of specific data cleaning, data validation and analytical tools in order to become valuable. From that perspective, support for open data demands that librarians expand their qualifications toward data science and accept more data-centric roles (Hoy, 2014; Federer, 2016) in order to provide research data services. These include, but are not limited to, data sharing, data reuse, data collection, data visualization, data preservation and data curation. The technical side of the process requires the ability to interface with large stored data sets and the technical ability to present data in a format suitable for decision making (Affelt, 2015). Within the scope of OD the volumes of shared data increase dramatically, which will transform libraries into Big Data libraries. However, academic and public libraries with large collections are well positioned to make the transition toward OD and Open Big Data (OBD) libraries by cross-sharing their collections and using Big Data Analytics (BDA) to identify similarities and consequently combine resources effectively. Such an approach is possible through advanced cloud-based data sharing technologies, whereby resources become virtual resources in a cloud (public or private) and thus become fully open to the research community. In addition, BDA requires high-performance algorithms and specific interfaces in order to make knowledge extraction effective. On the library side, the reference interview skills aimed at understanding customer needs should be enriched with Big Data knowledge so that librarians can offer OBD solutions as well (Hoy, 2014). Librarians’ roles should expand to cover at least part of a data scientist’s duties: to curate, clean, remove duplicates, and maintain data.
That includes complex metadata-related activities, content development, classification and other activities in more complex research domains defined by cross-disciplinary and interdisciplinary research. In these complex domains, however, the existing expertise accumulated for well-defined social science data repositories, bioinformatics repositories, and geo-referenced data repositories is of little help, because new multimodal strategies come into play and because cross-disciplinary and multidisciplinary research generates large volumes of different types of data. The variability and volume of data shift the focus toward effective data management, security, protecting the privacy of individual researchers, preserving sensitive data according to federal or other government regulations, anonymization and many other concerns. With the current lack of a standard data quality assessment protocol for OD and OBD, libraries face the challenge of ensuring the quality of resources in terms of accessibility, reusability, and trustworthiness. In this regard, academic libraries should take a leading role in developing (in collaboration with IT) extensive metadata policies and implementation methodologies that enable the use of the same data by different researchers or groups. Nevertheless, the variability and volume of OD and OBD require academic libraries to take steps in expanding, tuning and justifying the roles of librarians in the realm of OD and OBD by requiring basic data science competencies in addition to traditional library science expertise and qualifications.

### 6.3 Promoter of research diversity

Some open data already exist in grey literature, social media, blogs and social networking sites (e.g. ResearchGate), but those data are rarely curated or validated by researchers and thus are not suitable for re-use in scientific research. By taking on the function of research data curation (along with open access), academic libraries guarantee the reliability of diverse open data sets for the community and consequently establish the conditions to contest the citation bias and publication bias common in scientific research. As several empirical studies have shown (Czarnitzki et al., 2015), publication bias stems from editors’ selection of the works to be published based on criteria not always driven by research quality, from researchers’ willingness to pick topics based on the political conjuncture set up by journal editors, and from researchers’ willingness to publish only selected parts of their research. All those biases could limit the scope and directions of further scientific research. Similarly, citation bias, resulting from researchers’ willingness to publish only in subscription journals with high impact factors, or from researchers’ willingness to abandon parts of their research that they do not believe to be “highly citable”, affects and limits the choice of the next research topic(s) as well. As a result of citation and publication bias, a significant part of the data remains hidden, which defeats the purpose of data as a generator of research ideas, research topics or implementations.

By building and maintaining open access institutional repositories, academic libraries can host the entire (published and unpublished) research output of their home institutions, including both scientific publications and curated research data. Data that have not been included in published works can be re-used for verification and reproducibility purposes. Data curation “involves maintaining, preserving and adding value to digital research data throughout its lifecycle” (Digital Curation Centre, 2014) and thus goes a step further than digital preservation, which ensures long-term integrity but not accessibility for immediate scientific re-use. The challenge there is not the technological capacity of such repositories but the creation of adequate metadata and policies ensuring sustained access to the curated data. For CUNY Academic Works, such submission policies and practices are already established – all current faculty, students and staff can submit their work to the repository (CUNY Academic Works, n.d.). Included works are selected and deposited at individual campuses, where library coordinators provide consultation as needed in accordance with the submission policies, pending approval by the Scholarly Communications Librarian at the Office of Library Services, who is the gatekeeper of the repository.

### 6.4 Resources assessment

Under OD, academic libraries take on a new role: evaluating and grading data storage resources, both private and public. In 2010, life sciences repositories alone numbered more than 1,000 (Marcial & Hemminger, 2010). Currently, there are more than 2,000 open research data repositories of different types and with many different policies (Kindling et al., 2017). The journal Nature (2016) recently published a list of recommended repositories, stating that “Repositories included on this page have been evaluated to ensure that they meet our requirements for data access, preservation and stability”. In Europe, RE3DATA (re3data.org), the most comprehensive source of reference for research data repositories, allows searching for research repositories using over 40 criteria, such as subject, domain, content type and country (Pampel et al., 2013). It is very important to mention that research data management is discipline specific, and thus the most suitable data repository should be carefully selected on the basis of a set of criteria that cover the whole process of data archiving and reuse in the discipline, within the framework of local and national legislative constraints and the limitations defined by funding agencies. For example, the answers to the following questions (Hart et al., 2016) might help to provide an integral understanding of the suitability of a data repository for a particular type of research activity: Does the facility provide data curation or not? What types of data are accepted (structured and/or unstructured)? Are data backed up and, if so, how often? What is the data retention policy? And so on. The criteria for evaluating data repositories are usually developed by academic libraries in collaboration with IT departments.
At the City University of New York (CUNY), in addition to the institutional repository Academic Works, the CUNY High Performance Computing Center (HPCC) provides storage space that holds long-term data for different research projects across the University (CUNY HPCC, n.d.). These data are annotated with metadata by the researchers and are backed up on a centralized data silo once every 24 hours. The project data repository currently holds about 600 terabytes of research data from several NSF and NIH funded projects. CUNY is also building a research data cloud with the capacity to store all types of research data deposited by different research groups across the university.
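
The repository-selection questions listed above (curation, accepted data types, backup frequency, retention policy) can be expressed as a simple checklist. The following Python sketch is one possible way to encode such a checklist; the class names, fields, thresholds and the example repository are all hypothetical and do not describe the actual criteria used by any library or by Hart et al. (2016).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RepositoryProfile:
    """Answers to the selection questions for one (hypothetical) repository."""
    name: str
    offers_curation: bool
    accepts_structured: bool
    accepts_unstructured: bool
    backup_interval_hours: Optional[float]  # None means no backups are made
    retention_years: Optional[int]          # None means no stated policy


@dataclass
class ProjectNeeds:
    """What a particular research project requires from a repository."""
    needs_curation: bool
    has_structured_data: bool
    has_unstructured_data: bool
    max_backup_interval_hours: float
    min_retention_years: int


def unmet_requirements(repo: RepositoryProfile, needs: ProjectNeeds) -> List[str]:
    """List the project requirements that the repository does not satisfy."""
    problems = []
    if needs.needs_curation and not repo.offers_curation:
        problems.append("no data curation")
    if needs.has_structured_data and not repo.accepts_structured:
        problems.append("structured data not accepted")
    if needs.has_unstructured_data and not repo.accepts_unstructured:
        problems.append("unstructured data not accepted")
    if repo.backup_interval_hours is None:
        problems.append("no backups")
    elif repo.backup_interval_hours > needs.max_backup_interval_hours:
        problems.append("backups too infrequent")
    if repo.retention_years is None:
        problems.append("no stated retention policy")
    elif repo.retention_years < needs.min_retention_years:
        problems.append("retention period too short")
    return problems


if __name__ == "__main__":
    repo = RepositoryProfile(
        name="Example Repository",
        offers_curation=False,
        accepts_structured=True,
        accepts_unstructured=True,
        backup_interval_hours=24,
        retention_years=10,
    )
    needs = ProjectNeeds(
        needs_curation=True,
        has_structured_data=True,
        has_unstructured_data=False,
        max_backup_interval_hours=24,
        min_retention_years=5,
    )
    print(unmet_requirements(repo, needs))
```

Returning the list of unmet requirements, rather than a single score, keeps the assessment transparent: a librarian can see why a repository was judged unsuitable for a given discipline and discuss each point with the researchers.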