USC Information Sciences Institute (ISI)

Geoscience Papers of the Future: Lessons Learned from Practicing Reproducible Resea...

Yolanda Gil

and 15 more

April 18, 2017

INTRODUCTION

The Geoscience Paper of the Future (GPF) Initiative was created by the EarthCube OntoSoft project and its Early Career Advisory Committee, formed by 30 geoscientists from different disciplines, to disseminate best practices for reproducible publications, open science, and digital scholarship. The Initiative consists of three major efforts:

1. the compilation of best practices from a variety of community organizations (e.g., ESIP, RDA), scientific societies (e.g., AGU, AAAS, CODATA), curators (e.g., IEDA, NSIDC), and publishers (e.g., Nature, Science);
2. the dissemination of best practices through training sessions at major scientific conferences (e.g., AGU, GSA, ASLO, CEDAR) and research institutions (e.g., WHOI, USGS). The training materials are openly available, include a summary checklist for authors, and show authors how to manage their scholarly identity, reputation, and impact throughout their careers; and
3. the publication of a special issue of the AGU Earth and Space Science journal on Geoscience Papers of the Future, containing articles that illustrate how to apply these best practices in different areas of the geosciences, with another special issue of the journal Geophysics under way.

A Geoscience Paper of the Future follows best practices to document all the associated digital products that result from the research reported in the paper. This means that a paper would include:

- Data available in a public repository, with documented metadata, a clear license specifying conditions of use, and citable using a unique and persistent identifier
- Software available in a public repository, with documentation, a license for reuse, and citable using a unique and persistent identifier
- Provenance of the results, given by explicitly describing the series of computations and their outcome in a workflow sketch, a formal workflow, or a provenance record, possibly in a shared repository and with a unique and persistent identifier

These best practices are described in detail in . The Geoscience Papers of the Future published to date not only serve as exemplars of how to implement best practices, but also expose the limitations of existing cyberinfrastructure in supporting scientists in their work.

In this paper, we synthesize the perspectives of GPF authors, contrasting the approaches they used to implement GPF best practices in their own disciplines, the lessons learned, the challenges encountered, and the benefits found. _We should summarize here the main findings_.

The paper starts with an overview of the articles that illustrates the breadth of disciplines, motivations, and approaches covered by the GPFs. We then compare the papers along common dimensions and discuss the benefits and challenges found. We conclude with prospects for the future.

NOTE from 5/15/17 meeting: Add a comment about the different levels of reproducibility.

What to Keep and How to Analyze It: Data Curation and Data Analysis with Multiple Pha...

Alyssa Goodman

and 12 more

April 22, 2013

Overview

This open document is being used to describe and record the events at the Radcliffe Exploratory Seminar on Data Curation and Analysis, to be held at the Radcliffe Institute for Advanced Study, May 9-10, 2013. This Google Drive Directory should be used to deposit all files contributed by participants before and during the meeting. (Click "Open in Drive" on your browser to make a new folder, e.g. with your name as its name.) This Google Doc is used for collaborative real-time note-taking.

ABSTRACT: Rapid advances in technology have allowed us to collect vast amounts of data in myriad fields and forms, but our ability to manage and analyze these data has not kept pace. As a result, the amount of data collected far exceeds what can be analyzed and, often, what can be archived. These issues only become more pressing as data collection accelerates. Astronomers and astrophysicists, for example, collect terabytes of data per night; the phrase “drowning in a data tsunami” is increasingly used to describe this situation. The issues of what to keep and what to distribute are surprisingly complex, even when we put aside technological issues such as long-term storage and retrieval. A central challenge is the fundamental conflict between reducing the size of data and preserving information for future scientific inquiries and statistical analyses. Complicating matters further, the parties and teams involved in the data collection, curation, and analysis process often have only limited communication with each other, owing to the sequential nature of this process. This seminar brings together a core group of leading experts and emerging scholars in information and natural sciences to discuss, debate, and design principles and strategies to address this grand challenge, which increasingly affects almost every aspect of science and society.
GOAL: By gathering experts from information and natural sciences, we aim to start building a set of principles and methods that will allow us to understand such problems and to provide better preprocessing, analyses, and data preservation, especially in the context of the natural sciences. The ultimate goals of this research include providing methods for assessing the validity of such collaborative analyses, guidance on statistically principled preprocessing, and a rich new theory of statistical learning and inference with multiple parties. We believe that this collaboration will simultaneously sow the seeds for innovative mathematical theory and yield directly usable guidelines for the construction and curation of scientific databases.