An NSF-funded project to enhance data citation information in DDI brought together a group of ten people from different stakeholder communities. One of the issues the project plans to address is how data citation can be extended to acknowledge the contributions of different types of contributors to the development of research data, opening the possibility of generating metrics to better understand and measure those contributions. To underpin this, we need structured metadata. From the proposal, the group noted the following key questions:
- Which elements do we need?
- What objects should have metadata?
- How should reuse be handled?
- What infrastructure is needed for location?
- Which elements need a controlled vocabulary?
- What special information is needed for the citation of streaming resources?

In terms of which items relating to data should be cited, we can think about:
- Data files
- Segments
- Qualitative data segments
- Extracts from dynamic data
- Replication data

We may also want to cite various aspects of instruments and the data collection process:
- Questions
- Categorizations
- Equipment
- Software
- Algorithms

Procedures and conceptual components may also need citing. In addition, we want to think about data reuse and the infrastructure for forward and backward searching.

Other questions for the group:
- What contributor roles should be captured?
- How can we come up with a flexible controlled vocabulary for contributor roles?
- Should we look at the Generic Longitudinal Business Process Model (GLBPM)? This may provide some ideas for roles as we look across the lifecycle of producing longitudinal data.
- How might we measure degree of contribution? Some ideas: order of listing; percent of total project; FTE; importance to project.

The group will look at the current citation elements in DDI. Numeric fingerprints/qualitative fingerprints may be possible elements of data citations that we could add. We also need to think about tools to make it easy to capture information in the research stream. Having the capacity to track the provenance of questions would be very useful. Also, what happens when you want to cite multiple combinations of objects?

Question 1: Which DDI objects should have citation metadata associated with them?

All identifiable objects. Every identifiable object should have a specific set of citation metadata as an optional property of the object. It is implicit that there is a relationship to the larger entity of which the object is a part; there is no need for a specific information object indicating that relationship.

Rationale for decision: Choices of what to cite are social choices - chasing this by choosing some objects and not others is a never-ending task. It is not the standard's job to settle these social questions, as different organizations will have different needs. The standard should enable citation as practiced by different organizations. The structure should be there, but populating the citation metadata is optional.

Related rule: You can cite something only when isPublished is true. In development you can reference something, but you can't cite it formally. The modelers may want to focus on only those objects that are administered/versionable.

Discussion
Each object in DDI 4 will have an identifier, but not all of them have additional metadata and are administered or versionable. Which objects should be citable? This is a social issue. We need to separate the concepts of reference and citation.
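A minimal sketch of how this decision might look in a hypothetical DDI 4-style serialization - the element names, identifier, and DOI here are illustrative assumptions, not actual DDI 4 schema:

    <Variable id="urn:ddi:us.example:var-age:1" isPublished="true">
      <Name>AGE</Name>
      <!-- Optional citation property; omitted entirely when not populated.
           The object is formally citable only because isPublished="true". -->
      <Citation>
        <Title>Age of Respondent</Title>
        <Creator>Scientist, Sally</Creator>
        <Publisher>Example Archive</Publisher>
        <PublicationDate>2014</PublicationDate>
        <Identifier type="DOI">10.3886/example</Identifier>
      </Citation>
    </Variable>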
In general, we need to be clearer about the distinctions among:
- A citation in the sense of a pointer to an object (e.g., Title, Creator, Publisher, PublicationDate, Identifier) - as in citing a dataset in a journal article
- The collection of citation-related objects (a record) related to an object (in DDI Codebook and Lifecycle this is called a Citation element)
- Pointing to an object via an identifier (a reference)

By "referenceable," we mean "addressable" - anything that can have a URL. Note that all DDI 4 objects have unique identifiers. When an object is reused, we reference it within its context. This could be an opportunity to make things more consistent beyond name, label, and description. The upper level of containers should come with a full citation. An object is created once but may then be reused. Likert scales are an example: the citation metadata for a widely used object should have a pointer to the seminal paper (where one exists); alternatively, where there is no paper directly describing the object, there should be a pointer to its first use or instantiation. Heavily managed objects - e.g., medical classifications - are another case.

We already have a socially acceptable way to cite through writing a paper and citing something in the references section, but we want to give credit to others who have contributed to a dataset. DataCite was set up to mainstream the convention of citing data without scholarly papers; there are 4 million DOIs pointing to things that don't have papers. A core idea of DDI is that a variable as an object could be put into a repository and reused; is there a way to attach the credit to that variable?

The Altman and King paper on citation has a section on "deep citation" and provides an example of using three variables from a dataset for a table in a publication. Variable names are added to the citation of the data file. When creating an analysis file, we should reference the original data file and describe the selection criteria for the analysis file through syntax and codes. (See the sketch at the end of this section.)

Proposal: Give every DDI object the "Citation object" metadata (but call it something else). There is the potential for a "creator" at the data file level and at the variable level, for example. There is a relationship between the variable and the data file, so the best practice would be to treat it by inheritance and follow the relationships to the dataset. Administered objects need citation metadata, but do the others? Do we need to specify when in a process this information needs to be populated? In DDI, this happens when the isPublished flag is set to true. As some terms are not clear, we should provide a glossary defining what publishing means and what citation means. Publishing in this community means making the object available, and this is static. The definition for isPublished says: "Indicates that the maintainable will not be changed without versioning and is a stable target for referencing." This definition is clear, but instead of isPublished we might think about using another term - for example, "IsRegistered", "IsFinal", "IsCommitted", or "IsVersioned". In clinical trials research, something might become citable when it becomes part of the audit trail.
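A sketch of how a deep citation in the Altman and King style might be captured as structured metadata - the element names and the UNF value are illustrative assumptions, not defined DDI structures:

    <Citation>
      <Title>Coffee Consumption at Schloss Dagstuhl</Title>
      <Creator>Scientist, Sally</Creator>
      <Identifier type="DOI">10.3886/example</Identifier>
      <!-- Deep citation: the specific variables used, per Altman and King -->
      <VariableReference>CUPS_PER_DAY</VariableReference>
      <VariableReference>ROAST_TYPE</VariableReference>
      <VariableReference>RESPONDENT_AGE</VariableReference>
      <!-- Numeric fingerprint of the extract; this UNF value is made up -->
      <Fingerprint type="UNF">UNF:5:EXAMPLExxxxxxxxxxxx==</Fingerprint>
    </Citation>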
Question 2: Which information elements are needed for comprehensive data citation?

Discussion
To answer this fully, we first need to generate a document with a list of all citation information elements from 2.5 and 3.2. [Sanda to do this.] The othId element is the contributor element in DDI Codebook and maps to the Dublin Core contributor. There is a Contributor element in DDI Lifecycle.

You have an addressable object, and you can associate the citation metadata with it. The idea was raised that we could add an actionable citation pointer to any object, and the user could then decide what elements to use. Users would structure the information in the referenced object. A hierarchy of typed pointers could degrade gracefully from DDI to something else, to a lesser something else: #1, structured citation metadata in DDI; #2, a link to a structured citation located elsewhere but in a recognized/sanctioned format; #3, a link to a citation in an unknown format. (A sketch follows the use-case list below.) Could we borrow DataCite as the core metadata for citation? It is focused on data, but what about other kinds of objects? The consequence of having a large list of attributes is that it may put a burden on implementers.

The citee has all the information about the object and can provide instructions on how to cite; the citer can use these instructions. This allows for corner cases. To make this machine-actionable, the citee would provide the bare minimum requirements. The actual citation might be unstructured. However, we don't want to break backwards compatibility with machine-actionable objects. The citee describes how to be cited. Edge case: the citee provides a citation to a different object outside of DDI, e.g., they want the paper cited as a proxy for the dataset. A use case might be to find all roles a specific person has played with respect to published research data; this requires structure that the citee can provide.

JK proposed: Citation involves objects, cite-ers, and cite-ees. Cited objects have, at minimum, a URL (with embedded URN). The cite-er determines citation metadata appropriate to stage (embryonic, adolescent, mature, version 1, version 2) ...

A set of use cases was developed:
1. Sally Scientist, working alone, wants to create a data file with a codebook including enough information about how to cite the work in an article she's writing. The citable object is a DataSerialisation. What is in DDI now may be adequate, but it is currently scattered through the structure. Right now her role in the components of the project is only determined by implication from the study-level citation. She gets a DOI and generates a UNF. Should the fingerprint go into the citation? Yes, as an optional element. For a file-level citation to be possible, what elements should she use? Title, Creator, Publisher, PublicationDate, Identifier (the DataCite kernel); also ResourceType? There may be a separate citation for the codebook (a different use case).
2. The same dataset, but including enough information to credit many co-workers, defining the roles of those co-workers and degree of contribution. Elements needed: Title, Creator, Contributor(s) with roles, Publisher, PublicationDate, Identifier. The codebook would have the complete set of metadata about the contributors, including degree of contribution. Degree of contribution is difficult to address, as there is no right or wrong way. Suggestions included author order, primary-secondary, and a 0-1 weight with a key-value pair defining the semantics.
3. The same dataset cited by a scholar wanting to reference the work in a publication.
4. Colleagues in a given field need to share references to the same embryonic data.
5. Data citations harvested for indexing activity.
6. Citation for data files received and then harmonized: the original citations plus a citation for the harmonization effort.
7. Citation via an OAI-ORE resource map.
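Returning to the graceful-degradation idea above, a sketch of typed citation pointers - the type values, element names, and URLs are illustrative assumptions:

    <!-- #1: structured citation metadata held in DDI itself -->
    <CitationPointer type="ddi-structured">
      <Citation>...</Citation>
    </CitationPointer>
    <!-- #2: link to a structured citation in a recognized/sanctioned format -->
    <CitationPointer type="external-sanctioned" format="DataCite">
      <URI>http://example.org/metadata/11223344.xml</URI>
    </CitationPointer>
    <!-- #3: link to a citation in an unknown format -->
    <CitationPointer type="external-unknown">
      <URI>http://example.org/how-to-cite.html</URI>
    </CitationPointer>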
8. Joe creates a variable to be stored in an archive for others to discover and reuse.

Use case structure:
- scenario (include actors, systems, goals)
- citable objects (from the DDI model)
- elements used, where they come from
- example citation(s)
- discussion

Continuing the use cases:

1. Scenario
Sally Scientist, working alone, wants to create a data file with a codebook including enough information about how to cite the work in an article she's writing.
Citable object(s): StudyUnit; DataFile
Elements used: Title, Creator, Publisher, PublicationDate, Identifier (the DataCite kernel); also ResourceType?
Example citation: Scientist, Sally. Coffee Consumption at Schloss Dagstuhl. Leibniz Institute for Informatics [publisher], 2014. http://doi.org/10.3886/11223344
Discussion: There is citation metadata at different levels. Should there be a rule for inheritance built into the model? DDI can enable inheritance by reference, but tools are needed to enforce this. What is in DDI now may be adequate, but it is currently scattered through the structure. Right now her role in the components of the project is only determined by implication from the study-level citation. She gets a DOI and generates a UNF. Should the fingerprint go into the citation? Yes, as an optional element. For a file-level citation to be possible, what elements should she use? There may be a separate citation for the codebook (a different use case). The citation is located in the StudyUnit; citation information in components of the study could reference the study-level citation. There is a need for best practices: when the user goes to a landing page, what should they find there? The group reviewed the elements and attributes that are currently used for citation in DDI Codebook and Lifecycle. Copyright should probably be moved to another location, as it is not really related to the citation. Publisher has attributes of producer and distributor.

2. Scenario
The same dataset, but including enough information to credit many co-workers, defining the roles of those co-workers and degree of contribution. Bill does data collection, Jane does (statistical) data analysis, and NIH was the funder.
Citable object(s): StudyUnit, DataCollection, Instrument?, statistical analysis (there is nothing in DDI now about analysis)
Elements used: Title, Creator, Contributor(s) with roles, Publisher, PublicationDate, Identifier
Example citation(s): Scientist, Sally, Bill Smith, and Jane Doe. Coffee Consumption at Schloss Dagstuhl. Leibniz Institute for Informatics [publisher], 2014. http://doi.org/10.3886/11223344
Bill and Jane's respective contributions would be acknowledged in the journal's Note/Acknowledgments, which are increasingly structured. For example, author contributions from an article in PLOS (http://www.plosone.org/article/authors/info%3Adoi%2F10.1371%2Fjournal.pone.0109687): "Conceived and designed the experiments: DR ZE. Performed the experiments: DR ZE. Analyzed the data: DR ZE. Wrote the paper: DR ZE."
Discussion: The codebook would have the complete set of metadata about the contributors, including degree of contribution. Degree of contribution is difficult to address, as there is no right or wrong way. Suggestions included author order, primary-secondary, and a 0-1 weight with a key-value pair defining the semantics. Citation information for the funder sits at the study level (in DDI 3.2 this would be in a separate element); citation information at some object at the DataCollection level covers Bill's role; citation information at the Methodology level? for Jane's role. Modelers need to ensure that all roles inherit from the same core class. Make Creator a subset of Role so that we are not missing any roles. (A DataCite-style sketch of this citation follows.)
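A sketch of how scenario 2's example citation might look as a DataCite record (kernel-3 layout from memory; the contributorType values chosen for Bill, Jane, and NIH are our assumptions):

    <resource xmlns="http://datacite.org/schema/kernel-3">
      <identifier identifierType="DOI">10.3886/11223344</identifier>
      <creators>
        <creator><creatorName>Scientist, Sally</creatorName></creator>
      </creators>
      <titles>
        <title>Coffee Consumption at Schloss Dagstuhl</title>
      </titles>
      <publisher>Leibniz Institute for Informatics</publisher>
      <publicationYear>2014</publicationYear>
      <contributors>
        <contributor contributorType="DataCollector">
          <contributorName>Smith, Bill</contributorName>
        </contributor>
        <contributor contributorType="Researcher">
          <contributorName>Doe, Jane</contributorName>
        </contributor>
        <contributor contributorType="Funder">
          <contributorName>National Institutes of Health</contributorName>
        </contributor>
      </contributors>
      <resourceType resourceTypeGeneral="Dataset">Survey data</resourceType>
    </resource>

Note that DataCite has no slot for degree of contribution, which is part of what this group would need to add.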
3. Scenario
The same dataset cited by a scholar wanting to reference the work in a publication. Steve, Mary, Sam, and Stu.
Citable objects: StudyUnit; DataFile (PhysicalInstance); Codebook (DDIInstance). (Some people cite codebooks because they don't know that they should cite the data.)
Elements used: Title, Creator, Publisher, PublicationDate, Identifier
Example citation(s): Scientist, Sally. Coffee Consumption at Schloss Dagstuhl. Leibniz Institute for Informatics [publisher], 2014. http://doi.org/10.3886/11223344
Discussion: Pointers are needed to both data and metadata. We need to know which data file (physical instance) was used.

4. Scenario
Suzie Student in genomics generates lots of unpublished data over the course of four years and needs to share citations to intermediate findings for the purpose of review. Suzie includes all the fields she can from an official citation and adds "[work in progress, subject to change]" or "[work in progress, fixed as of 2014-10-21]".
Discussion: Permanence categories might apply.

5. Scenario
Sam Streamer uses currency exchange data to normalize price data over time.
Citable object(s): StudyUnit, PhysicalInstance
Elements used: Title, Creator, Publisher, PublicationDate, Identifier, last revision date, queryDateTime, PID to the query/selection. If a subset was extracted, add a queryID with a timestamp. (See the sketch after scenario 6.)
Example citation(s):
Streamer, Sam. Normalized Coffee Price Data at Dagstuhl October 2013, DDI Alliance [distributor], http://doi.org/10.3886/11223354, 2014-10-21T10:24:15
Sam cites: Bucks are Us, Streaming Currency Exchange Rates, Daddy Warbucks Inc. [publisher], last revision date of 2014-10-21T10:00:00, http://BucksAreUs.com/?myquery
Discussion: The permanence level should be in the metadata; also the queryID.

6. Scenario
Thomson Reuters Data Citation Index harvests data citations for indexing activity and discovery. Michael, Jay, Wolfgang, and Jenny.
a) Scenario (actors, systems, goals): A dataset is deposited in a repository (e.g., ICPSR) that puts metadata into the DataCite Metadata Store, which is then harvested by the Data Citation Index (TR, a commercial service) using OAI-PMH. DCI then processes the DDI and includes other sources to enrich it (e.g., normalize to their proprietary schema and associate with citations to the literature that it also indexes) and makes it searchable in Web of Science/Knowledge. For example, they provide metrics of use and present data citations with the papers that cite them.
b) Citable objects (from the DDI model): DDIInstance (i.e., codebook), StudyUnit, PhysicalInstance, SummaryStatistics, QuestionGroup [there could be more...]
c) Elements used, where they come from: We will finish this later! 1. DDIInstance 2. StudyUnit 3. PhysicalInstance 4. SummaryStatistics 5. QuestionGroup
d) Example citation(s): We will finish this later! 1. DDIInstance 2. StudyUnit 3. PhysicalInstance 4. SummaryStatistics 5. QuestionGroup
e) Discussion: Do we supply citations of data with the harvest (e.g., for DCI to act upon)? Currently not in DDI? Jay shared Aqueduct. There is how DCI works now, and then what more they could do if we gave them more/better metadata. Jay can provide an example of NIH with a requirement for description of tables. Caveat: DCI may not actually use the DataCite MDS now; some of this is make-believe. Scales are licensed for purchase.
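A sketch of how scenario 5's query-based extract citation might be captured as structured metadata - the element names (LastRevisionDate, QueryID, QueryDateTime) are illustrative assumptions built from the example above:

    <Citation>
      <Title>Streaming Currency Exchange Rates</Title>
      <Contributor role="Publisher">Daddy Warbucks Inc.</Contributor>
      <LastRevisionDate>2014-10-21T10:00:00</LastRevisionDate>
      <!-- PID to the query/selection plus the time at which it was run,
           so the extract from the dynamic source can be re-identified -->
      <QueryID>http://BucksAreUs.com/?myquery</QueryID>
      <QueryDateTime>2014-10-21T10:24:15</QueryDateTime>
    </Citation>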
7. Scenario
Citation for data files received and then harmonized: the original citations plus a citation for the harmonization effort. Wendy Thomas and Knut.
Citable object(s): StudyUnit, the data sources for the study
Elements used: For the harmonized work: Title, Creator, Contributor (role), Publisher, PublicationDate, Identifier. For the original data contributors: Contributor (role).
Example citation(s): Minnesota Population Center. Integrated Public Use Microdata Series, International: Version 6.3 [Machine-readable database]. Minneapolis: University of Minnesota, 2014. http://international.ipums.org [Source Data Contributor: France, National Institute of Statistics and Economic Studies; Germany, Federal Statistical Office]
Discussion: What is the purpose of noting the sample sources? We have the strong feeling that this is not required and is basically a social/political motivation to acknowledge contributors to the overall project. This should be handled by capturing the extract specification if one is citing the extract rather than the project database (IPUMS); this is the only technically satisfying solution. Should it be noted in the citation? If so, how do we differentiate between the direct source (harmonized, in this case) and the contributing sources (listed here as Source Data Contributor)?

8. Scenario
Joe Vari creates a variable "AgeInMilliseconds" to be stored in an archive for others to discover and reuse. Larry and Jeremy.
Citable object(s): Variable
Elements used: Title, Creator, Publisher, PublicationDate, Identifier, ResourceType
Example citation(s): Vari, Joe. AMs - Age in Milliseconds, DDI Alliance Repository, 2014, http://doi.org/10.3886/11223345, [Variable] ResourceType: Variable
Discussion: Note that the identifier will include a version; in this use case the variable is likely to evolve over time. For others to reuse the variable, they would likely need a reference to the RepresentedVariable. On the other hand, an InstanceVariable might be cited in the reuse of a dataset or in a replication study where the original variable was compared to a new instance variable.

9. Scenario
Robert Researcher uses an extract from a larger published dataset (e.g., IPUMS xxx) created in a controlled-access environment. Robert is able to construct a citation identifying the selection criteria for the subset of data that he used. Sanda and Ornolf.
Citable objects: DDIInstance, StudyUnit, PhysicalInstance (data file)
Elements used:
Author: Robert Researcher
Title: Subset of Original Title
Publisher: Robert Researcher
Publication date: date when the new subset was published (2004)
Related resource: Author: Census Bureau; Title: IPUMS; Publisher: Census Bureau; Publication date: 2000; Relation type: Is Derived From
Selection criteria: list of variables selected; set of KEEP criteria (case selection)
Example citation: Researcher, Robert. Subset of IPUMS. Robert Researcher, 2004. Is Derived From: Census Bureau, IPUMS, Census Bureau, 2000.
Discussion: A link to the selection criteria is important. MPC provides the selection criteria and keeps them (time-stamped and versioned). RESTful interfaces might be used. Discussion of permanence, re: NLM permanence levels - Citation.Contributor reference in 3.2. (A sketch of the derivation relationship follows.)
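A sketch of how scenario 9's derivation might be expressed with DataCite-style related identifiers (IsDerivedFrom appears in later DataCite kernels; the DOI, the variable names, and the use of a Methods description to carry selection criteria are our assumptions):

    <relatedIdentifiers xmlns="http://datacite.org/schema/kernel-3">
      <relatedIdentifier relatedIdentifierType="DOI"
                         relationType="IsDerivedFrom">10.3886/ipums-example</relatedIdentifier>
    </relatedIdentifiers>
    <!-- Selection criteria carried alongside the relation; made-up layout -->
    <descriptions xmlns="http://datacite.org/schema/kernel-3">
      <description descriptionType="Methods">
        Variables kept: AGE, SEX, INCTOT. Case selection: KEEP if YEAR = 2000.
      </description>
    </descriptions>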
Brainstorming ideas for populating/refining contributor roles:
1. Use the taxonomy from Allen et al., Nature 508, 312-3, or its follow-up workshop.
2. Are there other taxonomies or lists that we might look at? Perform a scan to identify them (some are already in the Google Drive folder).
3. Research examples from current practice by looking at notes, etc., in published data papers and datasets.
4. Scan citation style guides and author guidelines and note roles.
5. Free brainstorming in the group.
6. Look at focus groups, case studies, data curation profiles, etc., from different research communities.
7. Identify some exemplars: in current practice, what is the most granular, complex citation or reference to data that we can find?

How important will a shared community list be? Do we need some sort of hybrid of a basic list and an additional controlled vocabulary? Lifecycle events - agents in the process model in DDI 4 will be a source of roles. Will a controlled vocabulary for roles fit into some activity object in the lifecycle model? GLBPM?

Other notes
We need to discuss inheritance of citation information. Rules for this need to be codified. Could these be explicitly part of the model? Must enforcement be built into the tools? We have to be careful about the implications of inheritance - did the data collector design the collection, or was the collector just responsible for carrying it out? DDI 4 should cover capturing statistical analysis activity.

National Library of Medicine permanence ratings: almost everything is subject to change. Classification of dynamic data sources - growth by accumulation - older data don't change. Characterize the citation as replicable or not. See http://www.nlm.nih.gov/psd/pcm/devpermanence.html:

"Default values or drop-down menus are provided wherever possible. This minimal set includes: Title, Heading, Date Published, Date Last Modified, Next Review Date, Contact email address, Publisher, Rights, Permanence Level, Permanence Guarantor, Language"

"Permanent: Unchanging Content - The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content will not change. Example: Minutes of the NLM Board of Regents meetings.
Permanent: Stable Content - The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content is subject only to minor corrections or additions. Example: NLM Annual Report.
Permanent: Dynamic Content - The National Library of Medicine has made a commitment to keep this resource permanently available. Its identifier will always provide access to the resource. Its content could be revised or replaced. Example: NLM Home Page.
Permanence Not Guaranteed - The National Library of Medicine has made no commitment to keep this resource available. It could become unavailable at any time. Its identifier could be changed. Example: Frequently Asked Questions."

(A sketch of attaching these levels to citation metadata follows below.) Consider a published paper - a digital library journal? IASSIST Quarterly?
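A sketch of how the NLM permanence vocabulary quoted above might attach to citation metadata, keeping the separate dimensions discussed later (permanence of the identifier vs. stability of the content); the element and attribute names are illustrative assumptions:

    <Citation>
      <Title>NLM Annual Report</Title>
      <!-- Separate dimensions: identifier permanence vs. content stability -->
      <Permanence vocabulary="NLM">
        <IdentifierPermanence>Permanent</IdentifierPermanence>
        <ContentStability>Stable Content</ContentStability>
        <PermanenceGuarantor>National Library of Medicine</PermanenceGuarantor>
      </Permanence>
    </Citation>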
Roles
The Nature article roles are categories. Could we have a hierarchy within these? We can approach this from the standpoint of the data lifecycle process, to capture all the steps in the data lifecycle. Allen et al. will be working with NISO as the next step, to create a taxonomy with a standards body. The group compared the DDI lifecycle events with the Allen et al. taxonomy of categories. Sanda created a list of roles. A proposal is part of study conception. Funding is in both. The DDI list is more detailed: methodology is broken down into several steps of the lifecycle. The DDI list is about metadata-driven design and machine-actionability, while the taxonomy is about attribution. We could share the DDI lifecycle with the Nature authors. We could look at the Chicago Manual of Style to see what is already being acknowledged.

There is a distinction between categories and roles; start with categories (map between DDI and Nature and identify matches, mismatches, and gaps), talk to the Nature authors, then populate roles. Mapped the DDI Lifecycle CVG to Nature: https://docs.google.com/spreadsheets/d/1cHTb2tth1G9fjP4vXSmeBgBDSxpYq6Ykxikn4LQbslw/edit#gid=0. We should also flip the mapping around: use Nature to potentially improve the DDI CVG lifecycle events and roles.

There is concern about the difficulty of parcelling out credit accurately: "I have to get listed in this role as opposed to that one." Who will do these allocations of role, and how will it be done? Should we keep the list of roles small? Should we consider researcher burden? What about those who are not now listed first on the author list? Details give them an opportunity for credit.

Questions for the Nature taxonomy authors:
- Clarify the distinction between categories of contributors versus roles. Answer: categories (journals can define and populate more granularity, e.g., roles).
- Is there an effort to make this information in journals machine-actionable?
- Have you thought about degree of contribution?
- Are you open to adding new terms?
- Has the planned workshop been held yet? What are the future plans?
- Do you see the DDI lifecycle roles as an appropriate application of your taxonomy? Broader lifecycle, e.g., what about reviewers (for funding proposals or articles)?

Proposal: Use case 10 - data from this workshop:
- Minutes (a qualitative collection)
- Quantitative dataset(s) - we can't survey ourselves (no IRB approval), but we could make a dataset from our observations here (weather, food?)
- Some parts without the PIs' contribution; credit for variables? metadata?
- Different measures of degree of contribution. How difficult? Potential for hard feelings?
- To be archived: KU ScholarWorks (with a handle); ICPSR (with a DOI)
- Generate DDI 3.2
- Make a SAS quantitative dataset from the minutes

Notes
Does reuse imply integrating the data with other data, or just, e.g., another analysis? Rely on discovery - robots search for references. RDA working group (two other groups: RMap and the National Data Service). Should the DDI Alliance take on an advocacy role re: citation?

Elements needed
Should we borrow from DataCite their list of elements and attributes? How does DataCite metadata relate to DDI? Does DDI lack the ability to describe relationships such as: 1) a new dataset is a subset of an old dataset - does the citation of the new dataset indicate the parent dataset? 2) a harmonized dataset - noted not in the citation in a publication but in the supporting information, e.g., the ICPSR fertility dataset "isDerivedFrom" etc. DDI has these relationships, but how do you serialize them? Will everyone using DDI use DataCite? No. A DDI 4 vocabulary for relationships? Provenance chain: Is Derived From; Is Part Of. We can furnish relationships inside DDI; how can we expose them so they are serialized? One way of linking to another standard is to say "go here." Steal from DataCite? Do an XSLT of DDI and DataCite to find gaps. (A sketch of such a transform follows.)
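A minimal sketch of such a gap-finding transform, assuming DDI 3.2 element paths (r:Citation/r:Title/r:String, etc. - from memory, so they would need checking against the 3.2 schema):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:r="ddi:reusable:3_2"
        xmlns="http://datacite.org/schema/kernel-3">
      <!-- Map a DDI 3.2 Citation onto the DataCite kernel; whatever
           cannot be filled from DDI shows up as a gap. -->
      <xsl:template match="r:Citation">
        <resource>
          <titles>
            <title><xsl:value-of select="r:Title/r:String"/></title>
          </titles>
          <publisher><xsl:value-of select="r:Publisher/r:String"/></publisher>
          <publicationYear>
            <xsl:value-of select="substring(r:PublicationDate/r:SimpleDate, 1, 4)"/>
          </publicationYear>
          <!-- identifier, creators, resourceType, relatedIdentifiers ... -->
        </resource>
      </xsl:template>
    </xsl:stylesheet>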
Recommendation: There are hundreds of standards for serializations of citations. Publisher (funder? data provider?) requirements will determine the serialization of the citation metadata. It is not our role to recommend one serialization style; we need to ensure that the information is there to be serialized. We should be in the modeling business, not the serialization business.

Should we remodel and make creator and publisher a role of contributor? Dublin Core could support this approach. It would have the advantage of the degree-of-contribution weight having the same scale everywhere. Should degree of contribution even be part of our role?

A simple citation - what is core? There could be a best practice that some elements should be there. A minimal serialization is needed to test our theories.

Question 1: Which elements do we need for citation?

Minimal set (see the sketch at the end of this section):
- Title
- ID
- Contributor (OtherID) (Role attribute - needs a controlled vocabulary)
  - Author (creator)
  - Publisher
  - Publication date
- Version
  - Number
  - Date
  - Responsibility
- Resource Type (needs a controlled vocabulary; could be derived from DDI or from external sources) (Jay)
- Locator

Recommended but not required?
- Pointer to metadata - we need Wolfgang's guidance on this. What will this imply for data creators - is this enough of a burden to make the pointer impractical? An alternative is that the link to metadata is loaded onto the locator. NIH DDI - requirement for a URI for the metadata.

The UML model allows extension beyond the minimal set. A short list of conditionals could specify additional elements under sets of circumstances. These instructions could be in a profile, which in turn could be used for validation.

Additional set of recommended elements:
- Actionable link to the dataset
- Copyright (access restrictions)

Additional set of objects needing development:
- Permanence or stability information (NLM vocabulary?). There are multiple dimensions: permanence of the object, permanence of the identifier, stability of the object.
- Data fingerprint (enable this at the citation level? Could it apply to DDI elements, e.g., a question?). There is a lack of an agreed standard, but stable possibilities exist (UNF; reference Altman et al.).
- Actionable link to other metadata (part of the locator?)

Permanence - is this term misleading? Streaming is a separate category from the permanence levels; there may be future work on permanence needed for streaming resources. Big data: you don't get the analysis dataset until you push it through a pipeline - at what stage is the data sitting? Data lake: key-value stores where the value can be anything - warehousing data without worrying about the data type, though to use it you need to know something about the data type.

The Library of Congress vocabulary of roles? Citing algorithms as part of a process - this can happen. Qualitative data segments should also get citations; SQL queries as well. Replication data should be considered like any other type of data in terms of citation. Terminologies or dictionaries apply to datasets, and it is useful to cite them - a link in the metadata for the dataset could point to the terminologies or dictionaries. Replication datasets - is there an object that binds the data and the processing information? How do you generalize citation to the medical instruments that Jay describes? Should we distinguish between citation and annotation?

We have a minimal set and a DOI - how do you get to the supporting metadata? DataCite provides persistent identifiers for objects and the infrastructure to resolve them. In DataCite the goal was to give persistent IDs, and there is a Metadata Store to hold the supporting metadata. DataCite provides stable infrastructure to resolve the DOI, but the resource holder takes responsibility for keeping the IDs up to date. Are we duplicating metadata stored in DataCite? It is the responsibility of the data owner to synchronize.
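A sketch of the minimal set above as a hypothetical DDI 4-style citation fragment - the element names and the role/resource-type vocabularies are illustrative, not settled, and the metadata URL is made up:

    <Citation>
      <Title>Coffee Consumption at Schloss Dagstuhl</Title>
      <ID type="DOI">10.3886/11223344</ID>
      <!-- Creator and publisher modeled as roles of Contributor -->
      <Contributor role="Author">Scientist, Sally</Contributor>
      <Contributor role="Publisher">Leibniz Institute for Informatics</Contributor>
      <PublicationDate>2014</PublicationDate>
      <Version number="1.0" date="2014-10-21" responsibility="Scientist, Sally"/>
      <ResourceType>Dataset</ResourceType>
      <Locator>http://doi.org/10.3886/11223344</Locator>
      <!-- Recommended but not required: pointer to the richer metadata -->
      <MetadataLocator>http://example.org/ddi/11223344.xml</MetadataLocator>
    </Citation>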
We need a clear mapping to DDI (the DataCite metadata document - the mapping was complicated). DDI already had duplication: the DDI Title and dc:title. If we get rid of Dublin Core, we still need a clear mapping to it. We need human-readable reference information to support the machine-readable ID. Mandatory elements are stored together with the larger metadata set. Try to align with schemas that others use. DataCite could have role as a contributor type. Which roles are more important than others? Would you name all of the people involved in a survey? Not all in the serialized record, but maybe all in the record. (8K authors might be too few.)

What is the minimum set of information objects needed to generate a DataCite DOI reference? Should we be thinking of how to generate a DataCite record from DDI? Have a core and extensions that you would use programmatically to generate a citation? Each data center can decide which objects to assign citations to. Minting a DataCite DOI is one form of serializing a data citation. Should we try to make them identical? Would this be a big win? Five elements are mandatory for DataCite - we add version and resource type. Should we make these optional and push them "down" into the metadata record? +1

A DataCite record is like a catalog record, with no richer metadata, but the schema has a related identifier with a type of HasMetadata. Wolfgang created a DOI for the codebook of the World Values study; the DOI now points to the rich metadata (XML) and avoids the landing page. A content negotiation mechanism? But the landing page has different content than the XML. Could we put something on the landing page that a machine could read (like a meta tag?) to access the richer metadata? These are two different systems, though. (A sketch of the HasMetadata link follows.)

There is a more general problem. We are registering DOIs for landing pages. What if we cite a variable? We don't have a landing page for that. Can a citation to a variable yield a valid DOI? From a modeling point of view it is easy to articulate the requirement that all objects should be citable, but citing a variable may not make sense, as a DOI may not be appropriate for a variable. Practice is different from modeling. For some things you need human-readable information, and for others not. A citation with the option of name-value pairs? All nodes have a type? Attributes depend on the type?

Park: A metadata repository is expected to keep metadata up to date, but errors are always there. Is this sustainable in the data world?
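A sketch of the HasMetadata link in a DataCite record, pointing a catalog-style record at the richer DDI XML (kernel-3-era layout from memory; the URL and scheme label are made up):

    <relatedIdentifiers xmlns="http://datacite.org/schema/kernel-3">
      <!-- Machine-actionable pointer from the DataCite record
           to the full DDI metadata for the same resource -->
      <relatedIdentifier relatedIdentifierType="URL"
                         relationType="HasMetadata"
                         relatedMetadataScheme="DDI-L"
                         schemeURI="http://www.ddialliance.org/Specification/DDI-Lifecycle/3.2/">
        http://example.org/ddi/worldvalues-codebook.xml
      </relatedIdentifier>
    </relatedIdentifiers>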
Interview with Micah Altman

Presentation: authorship in scholarly publication. The average number of authors per publication is increasing, and commonly the serialization only lists some. "Contributed equally"? Order has no agreed-upon meaning and varies among fields. Improved analytics? Connections among authors, etc. Acknowledgement practices vary. Authorship statements are collected by journals (limiting misconduct). Who is recognized as an author varies across venues, even within a field. Objective: define the sorts of contribution typically recognized (whether or not as authorship). In economics, author order is alphabetical - early listing has an advantage. The IWCSA workshop in 2012 produced a contribution taxonomy. Another dimension is what you made a contribution to - the paper, the dataset, a section of the work. The taxonomy is intended to apply to natural, medical, physical, health, and social science research, and to be usable in a publication environment. One could ask questions such as: who was the software developer? The publication workflow focuses on the corresponding author. Surveys and analysis of acknowledgement statements were used. Some categories are controversial as authorship (e.g., funding acquisition). A pilot study with contributing authors found that how much they had contributed was missing - level of contribution (lead, equal, supporting). December 2014 meeting; CASRAI dictionary. There are scale issues - 3 authors vs. 100 authors. Contributorship relates only to a particular subset of the work. Citation of versions? With each article, a packet of metadata.

Questions (and answers):
- Clarify the distinction between categories of contributors versus roles. The way they formulate this, roles are essentially incomplete subcategories of the taxonomy: what each person who had a relation to the work did in relation to the higher-level category.
- Is there an effort to make this information in journals machine-actionable? The contributor statement is qualitative; they would like to see it structured, at least in terms of taxonomic categories. Ideally every contributor has an ORCID.
- Will you publish the taxonomy in a way we can reference it? Yes - they are working with CASRAI and, to a lesser extent, NISO, both standards-based organizations. There is a large dictionary that is increasingly comprehensive; the taxonomy will be integrated into the official CASRAI standard dictionary and will be schematized.
- If we have ideas for new terms or definitions, would you be open to them? Yes - send comments to Amy Brand and Micah Altman.
- Broader lifecycle, e.g., what about reviewers (for funding proposals or articles)? The taxonomy doesn't specifically call out peer review; look to funders to give more acknowledgment to reviewers.
- Have you thought about degree of contribution? Yes, see above - lead, equal, supporting.
- Was the population studied primarily publishing data or articles? Almost certainly traditional publications; it would be interesting to see the taxonomy applied to data. Any data publishers in the sample? Already-published authors - very few tiny data publications.
- Are you open to adding new terms? Yes - sooner is better.
- Has the planned workshop been held yet? What are the future plans? Do you see the DDI lifecycle roles as an appropriate application of your taxonomy?

Micah: The main problem is the ways that authorship has changed in scholarly communications. The number of authors has increased, and the order of authorship has no generally agreed-upon meaning. Improved analytics: new measures become feasible; we can measure new things; we can reduce errors in standard analytics like impact factors. Acknowledgments are not authorship. Citations are ambiguous. Journals collect qualitative author statements, not available for analytics, for limiting misconduct. Who is recognized as an author differs from journal to journal. The objective is to define the sorts of contributions typically recognized, not to define authorship. Author statements are not typically in structured form. Research shows name-order effects are important: it has an effect on careers if your name comes earlier in the alphabet.

IWCSA Workshop 2012 - goals for the prototype taxonomy:
- Describe types of contributions
- Cover a broad domain of research: natural, medical, physical, health, and social sciences
- Theoretically justified
- Aligned with researcher behavior
- Usable in a publication environment

It is not a goal to describe all the ways to make a contribution, though there is hope that the taxonomy can be applied in this way. There are a number of steps from an abstract taxonomy to something you can use in a real publication environment. How to elicit information about people and integrate it with publisher workflows is difficult. Most of the workflow centers on the corresponding author, which is somewhat limiting when looking at contributorship. Keep this in mind when thinking about what interventions are possible in this space.
To develop the prototype, they drew on existing specifications, especially in the medical field, plus empirical analysis of practice. They looked at what terms researchers used to describe their contributions qualitatively and which terms mapped to existing terms in frameworks, checking this against acknowledgement statements. They ran a pilot test with select publishers and generated the taxonomy. Some categories are controversial as authorship in particular fields; the claim is not that these contributors should be authors, but that the contributions are evaluatable and can be usefully tracked, and some stakeholders are interested in them. They designed a survey to test whether authors' contributions can be assigned to a series of specified roles, and reflected on the process. They got a few hundred responses from people who found it reasonably easy to do this; the survey focused on articles with 7 or fewer authors. Comments about formalizing the process were generally positive. It was clear that at least one thing was missing: people wanted to state how much different contributors had done. There is another dimension of level of contribution in each category; something on this is needed for people to consider the taxonomy adequately expressive. They are working with CASRAI and NISO to formalize this. There have been minor changes in wording, with the exception of adding level of contribution as another taxonomic dimension. There was a community meeting at CNI. The taxonomy will be incorporated into the CASRAI dictionary; PLOS will be one of the early adopters. An accredited standard may come later. Next steps: test with the larger community; develop guidance for use of the taxonomy - examples of questions you can use to elicit information about contributorship, and how to implement it online; socialize the taxonomy in the community.

Afternoon Session

Discussion with Barry Radler re: MIDUS. The question "Where did specific measures and questions come from?" can arise. A formal approach to developing an audit trail for these objects could be based on a citation element attached to each object as it is versioned. Another question is "Who brought this question into the survey?" This could be tracked with a custom attribute in a citation element. Jay noted that it is important to agree on a place for this information. It is also important to balance the collection of the information against being burdensome to researchers. Copyright information, and perhaps access controls, are important for questions in some contexts; some questions have restrictions barring the publication of their wording or even the instructions around them. (A sketch of question-level provenance follows.)
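A sketch of how a question object might carry this kind of audit trail through its citation element - the custom attribute mechanism and all names, identifiers, and wording here are illustrative assumptions:

    <Question id="urn:ddi:us.example:q-coffee-freq:2">
      <QuestionText>How many cups of coffee do you drink per day?</QuestionText>
      <Citation>
        <Title>Coffee Frequency Question</Title>
        <!-- Provenance of the measure: pointer to its seminal source -->
        <SourceReference type="DOI">10.1000/example-seminal-paper</SourceReference>
        <!-- Who brought the question into the survey, via custom attributes -->
        <CustomAttribute key="IntroducedBy">Radler, Barry</CustomAttribute>
        <CustomAttribute key="IntroducedInVersion">2</CustomAttribute>
        <!-- Copyright/access restrictions on wording, where they apply -->
        <Copyright>Wording may not be published without permission</Copyright>
      </Citation>
    </Question>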