Codebook for SAS Dataset: TOPICS2

Dataset

Dataset Label
A SAS Dataset generated from minutes from NSF1448107 group at Dagstuhl event 14432
Date Created
2014-11-09T11:43:30.8
Date Last Modified
2014-11-09T11:43:31.0
Number of Observations
25
Number of Variables
8
Encoding
wlatin1 Western (Windows)
Engine
V9

__________________extended attributes___________________

Abstract
This dataset was created from the October 20-22 minutes of the NSF1448107 sponsored group attending Dagstuhl event 14432. Topics were generated using default settings of SAS Text Miner on the concatenated raw minutes files, one record per paragraph, for the first three days of the meeting.
AccessRights
Freely available, with attribution
Contributor
Mary Vardigan (conceptualization, equal), Sam Hume (conceptualization, equal), Sanda Ionescu (conceptualization, equal), Jay Greenfield (conceptualization, equal), Jeremy Iverson (conceptualization, equal), John Kunze (conceptualization, equal), Barry Radler (conceptualization, equal), Stuart Weibel (conceptualization, equal), Michael C. Witt (conceptualization, equal)
Creator
Larry Hoyle
Description
This dataset is intended as an example for attaching source information to a dataset and a variable.
FundingInformation
This dataset was created during Dagstuhl event 14432 by a group funded by NSF grant number 1448107.
Language
en-US
License
Freely available, with attribution
ParentDatasets
Cite.Minutes, Cite.Topics
Permanence
Permanent: Unchanging Content
PublicationDate
2014-11-17
Publisher
University of Kansas
RelatedResourceAuthor
Jay Greenfield, Larry Hoyle, Sam Hume, Sanda Ionescu, Jeremy Iverson, John Kunze, Barry Radler, Mary Vardigan, Stuart Weibel, Michael C. Witt
RelatedResourcePublicationDate
2014-11-17
RelatedResourcePublisher
University of Kansas
RelatedResourceRelationship
isDerivedFrom
RelatedResourceTitle
Minutes for Oct 20-22 2014 from NSF1448107 group at Dagstuhl event 14432
ResourceType
dataset
SpatialCoverage
Schloss Dagstuhl, Wadern, Germany
Study_AnalysisUnit
paragraphs from raw minutes files
Study_CollectionMethodology
Minutes were generated as Google Docs, one for each day at the Dagstuhl workshop. All participants could simultaneously edit the daily minutes file. Minutes for 2014-10-20, 2014-10-21, and 2014-10-22 were copied from downloaded Microsoft Word files and concatenated into a single text file. This file was read into SAS and then used as input for SAS Text Miner with all default options chosen. The Topics results table was exported as this SAS dataset.
Study_FundingInformation
Participant travel and accommodations funded by NSF grant 1448107
Study_KindOfData
SAS Text Miner Topics results table with derived topic descriptions. The dataset includes metadata in SAS extended attributes.
Study_ProcessingDescription
/* data read from concatenated minutes */
filename mins "C:\DDRIVE\projects\various\DDI\NSFDearColleague\MineMinutes\Oct20_22Minutes.txt";
libname minlib "C:\DDRIVE\projects\various\DDI\NSFDearColleague\MineMinutes";
data minlib.minutes;
  infile mins lrecl=520 pad;
  input para $520.;
run;
/* Dataset CITE.TOPICS generated by SAS Text Miner from minlib.minutes; CITE.TOPICS2 then generated by */
data CITE.TOPICS2;
  set CITE.TOPICS;
  Length TopicDescription $ 1000;
  TopicDescription = catx(" ","Topic ",_topicID," has ",_numDocs," documents and",_name," as Terms:");
run;
Study_Purpose
a sample dataset for enhanced data citation
TemporalCoverage
2014-10-20 to 2014-10-22
Title
Topics generated from minutes from NSF1448107 group at Dagstuhl event 14432
TopicalCoverage
Enhanced citation
Version
1.0
VersionDate
2014-11-17
VersionResponsibility
Larry Hoyle

Variables

1 Variable: _displayCat

Label
Category
Type: Character - Length
16
Transcode
yes
SortedBy
0

2 Variable: _topicid

Label
Topic ID
Type: Numeric, internal bytes
8
Transcode
yes
SortedBy
0

Statistics: _topicid
Statistic Value
Max 25
Mean 13
Min 1
P25 7
P50 13
P75 19
Range 24
StdDev 7.35980072193987
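
The statistics above are consistent with _topicid holding each integer from 1 to 25 exactly once. That is an inference from N = 25, Min 1, Max 25, and Mean 13, not something the codebook states. A minimal Python sketch under that assumption recomputes the table. The quartile method used here (`inclusive`) is also an assumption and is not SAS's own percentile definition, although for these values it gives the same quartiles as the table.

```python
import statistics

# Assumption: _topicid takes each integer from 1 to 25 exactly once.
topic_ids = list(range(1, 26))

# Quartiles via Python's "inclusive" method (not SAS's percentile definition).
quartiles = statistics.quantiles(topic_ids, n=4, method="inclusive")

summary = {
    "Max": max(topic_ids),
    "Mean": statistics.mean(topic_ids),
    "Min": min(topic_ids),
    "P25": quartiles[0],
    "P50": quartiles[1],
    "P75": quartiles[2],
    "Range": max(topic_ids) - min(topic_ids),
    # Sample standard deviation (divides by n - 1).
    "StdDev": statistics.stdev(topic_ids),
}

for name, value in summary.items():
    print(name, value)
```

If the assumption holds, the topic ID and the row sequence number of an unsorted Topics table coincide.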

3 Variable: _docCutoff

Label
Document Cutoff
SASFormat
5.3
Type: Numeric, internal bytes
8
Transcode
yes
SortedBy
0

Statistics: _docCutoff
Statistic Value
Max 0.402
Mean 0.26564
Min 0.182
P25 0.22
P50 0.252
P75 0.309
Range 0.22
StdDev 0.05938355552395

4 Variable: _termCutoff

Label
Term Cutoff
SASFormat
5.3
Type: Numeric, internal bytes
8
Transcode
yes
SortedBy
0

Statistics: _termCutoff
Statistic Value
Max 0.287
Mean 0.24752
Min 0.189
P25 0.237
P50 0.248
P75 0.265
Range 0.098
StdDev 0.0223273375036

5 Variable: _name

Label
Topic
Type: Character - Length
100
Transcode
yes
SortedBy
0

6 Variable: _numterms

Label
Number of Terms
Type: Numeric, internal bytes
8
Transcode
yes
SortedBy
0

Statistics: _numterms
Statistic Value
Max 21
Mean 12.36
Min 1
P25 9
P50 12
P75 15
Range 20
StdDev 5.21120587452335

7 Variable: _numdocs

Label
# Docs
Type: Numeric, internal bytes
8
Transcode
yes
SortedBy
0

Statistics: _numdocs
Statistic Value
Max 95
Mean 55.04
Min 12
P25 38
P50 56
P75 71
Range 83
StdDev 22.1443747559811

8 Variable: TopicDescription

Type: Character - Length
1000
Transcode
yes
SortedBy
0

_____________extended attributes_________

AccessRights
Freely available, with attribution
AnalysisUnit
paragraphs
Concept
A label for a topic generated by SAS Text Miner combining the topic number, the number of documents relating to the topic and the key descriptive terms for the document.
Contributor
Mary Vardigan (writing – review & editing, lead)
Creator
Larry Hoyle
Description
A variable to be used with a Topics results dataset produced by SAS Enterprise Miner Text Miner. One string includes topic number, number of related documents, and key terms.
GenerationInstruction
TopicDescription = catx(" ","Topic ",_n_," has ",_numDocs," documents and",_name," as Terms:");
Language
en-US
LevelOfMeasurement
Nominal
Permanence
Permanent: Unchanging Content
ProcessingDescription
Computed with the following SAS assignment statement: TopicDescription = catx(" ","Topic ",_topicID," has ",_numDocs," documents and",_name," as Terms:"); _topicID, _numDocs, and _name are standard variable names from an unsorted Topics dataset saved from Enterprise Miner.
PublicationDate
2014-11-14
Publisher
University of Kansas
ResourceType
Variable
Role
Potentially useful for topic labeling
Title
Topic Descriptor Combining Sequence Number, Number of Related Documents, and Terms List From A SAS Text Miner Text Topics Node Result Table
VariableIdentifier
TopicDescription
Version
1.0
VersionDate
2014-10-23
VersionResponsibility
Larry Hoyle
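
The catx assignment recorded in ProcessingDescription strips leading and trailing blanks from each argument before joining the pieces with the separator, so the padded literals such as "Topic " and " has " do not produce doubled spaces. The Python sketch below is a rough stand-in for that behavior, not SAS itself. The function `catx` defined here is an illustrative re-implementation, and the row values are invented for the example rather than taken from TOPICS2.

```python
def catx(sep, *items):
    """Rough Python stand-in for SAS CATX: strip leading and trailing blanks
    from each item, skip items that are empty afterwards, join with sep."""
    parts = []
    for item in items:
        # SAS converts numeric arguments with a numeric format; str() is
        # only a reasonable match for simple integers like the ones here.
        text = str(item).strip()
        if text:
            parts.append(text)
    return sep.join(parts)


# Hypothetical row: these values are invented for illustration.
row = {"_topicID": 3, "_numDocs": 56, "_name": "+data,+citation,metadata"}

topic_description = catx(
    " ", "Topic ", row["_topicID"], " has ", row["_numDocs"],
    " documents and", row["_name"], " as Terms:",
)
print(topic_description)
# prints: Topic 3 has 56 documents and +data,+citation,metadata as Terms:
```

In the dataset itself the result is stored in a character variable of length 1000, so a longer string would be truncated to that length.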

Codelists (Formats, Value Labels)

There were 0 formats defined in the SAS session which generated this documentation. Note that not all of these formats were necessarily in use by a variable.

__________________SAS INFORMATS___________________

SAS variables may also have an associated 'informat' which describes how the variable is to be read from a text representation. There were 0 informats defined in the SAS session which generated this documentation. Note that not all of these informats were necessarily in use by a variable.

codebook generated at 11/9/2014 11:48:43 AM (11/9/2014 05:48:43 PM UTC)