COM 633: Content Analysis
CATA
Kimberly A. Neuendorf, Ph.D.
Cleveland State University
Fall 2010
COM 633 Fall 2010: CATA Presentations
• Kate & Julie: LIWC & PCAD
• Jen & Diane: LIWC & MCCALite?
• Fran & Dongwoo: CATPAC & WordStat
• Jon & Elizabeth: Yoshikoder & General Inquirer
• Joe: Diction
CATA: Computer Aided Text Analysis
• Why might you want to use CATA rather than traditional human-coding techniques?
• CATA programs have typically been written by researchers with a specific need; thus, their utility is often limited.
• Online search and acquisition opportunities (e.g., Nexis) have made CATA easier and more attractive.
Purposes of CATA
1. Descriptive (e.g., word counts)
  • Modell project, using VBPro (M. Mark Miller, 1980s software)
2. Coding of Open-ended Survey Responses
  • WordStat, SimStat adjunct program (Provalis Research; Normand Peladeau)
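A minimal sketch of the descriptive purpose (word counts), written in Python rather than in any of the programs named in these slides. The function name `word_counts` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def word_counts(text):
    """Return a Counter of lowercase word frequencies in text."""
    # Treat runs of letters (with optional internal apostrophes) as words.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)

# Invented sample text, not taken from any real project.
sample = "The movie was good. The acting was very good, and the music was fine."
counts = word_counts(sample)

# Print the three most frequent words with their counts.
for word, n in counts.most_common(3):
    print(word, n)
```

The same counts could be computed per survey response to get a first descriptive look at open-ended answers before any dictionary is applied.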
Purposes of CATA
Standard Dictionaries: Most of the following applications use internal “standard” dictionaries.
3. Linguistic and Sociolinguistic Measures
  • General Inquirer, Philip Stone, 1966: Harvard IV Dictionary
  • MCCALite, Don McTavish & Ellen Pirro: 116 “idea categories” are applied to multiple characters in a script
  • CATPAC, Joseph Woelfel: semantic “neural” networks; no actual dictionary
Purposes of CATA
4. Psychometric Measures (or “Thematic Content Analysis,” per Smith)
  • General Inquirer, e.g., Lasswell Values Dictionary
5. Clinical Psychological/Psychiatric Diagnoses
  • PCAD, Louis Gottschalk & Robert Bechtel: computer version of Gottschalk’s earlier human-coded schemes, devised to provide alternative diagnostic techniques
Purposes of CATA
6. Verbal Style or Communicator Style
  • LIWC, Pennebaker, Booth, & Francis: e.g., positive emotions, cognitive processes. Also includes many linguistic measures and some that might be used as psychometrics
  • Diction, Rod Hart: computer application of Hart’s earlier human-coded schemes aimed at measuring characteristics of political speech (e.g., aggression, cooperation, ambivalence)
Purposes of CATA
7. Authorship Attribution
  • Most use simple counts of letters or words to attribute authorship (e.g., the Federalist papers; Raymond Chandler; Shakespeare)
  • Basic computer/word processing programming is sufficient
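A minimal sketch of attribution by simple word counts, assuming a small hand-picked list of marker words and a sum-of-absolute-differences distance. The marker list, the function names, and the sample texts are all hypothetical. Real attribution studies choose markers empirically and use far longer texts.

```python
import re
from collections import Counter

# Hypothetical marker words, chosen only for illustration.
MARKERS = ["upon", "while", "whilst", "on", "by"]

def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def profile(text, markers=MARKERS):
    """Rate per 1,000 words of each marker word."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [1000 * counts[m] / total for m in markers]

def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def attribute(disputed, known):
    """Return the author in `known` whose profile is closest to the disputed text."""
    target = profile(disputed)
    return min(known, key=lambda author: distance(target, profile(known[author])))

# Invented texts standing in for writing samples of two known authors.
known = {
    "A": "upon the hill he stood upon the rock while it rained",
    "B": "on the hill he stood on the rock whilst it rained",
}
disputed = "she sat upon the wall while the sun set"
print(attribute(disputed, known))
```

Rates per 1,000 words are used instead of raw counts so that texts of different lengths can be compared.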
Measurement in CATA
Three choices:
• Custom Dictionaries
  • Complicated, time-consuming
• Standard Dictionaries
  • A task of matching one’s conceptualization to someone else’s operationalization; sometimes a scavenger hunt
  • Similar to the challenge of finding an appropriate scale for a survey
• “Emergent” Coding
  • Outcome based on language patterns that emerge (e.g., CATPAC)
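A minimal sketch of dictionary-based measurement, assuming each dictionary is simply a set of exact words. The two dictionaries here are hypothetical custom lists, not drawn from any standard dictionary; real programs typically also handle word stems and multi-word phrases.

```python
import re
from collections import Counter

# Hypothetical custom dictionaries: category name -> set of words.
DICTIONARIES = {
    "positive": {"good", "great", "happy", "love"},
    "negative": {"bad", "sad", "angry", "hate"},
}

def code_text(text, dictionaries=DICTIONARIES):
    """Count how many words in text fall into each dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = Counter()
    for word in words:
        for category, entries in dictionaries.items():
            if word in entries:
                scores[category] += 1
    # Report every category, including those with zero matches.
    return {category: scores[category] for category in dictionaries}

print(code_text("I love this show, but the ending was bad. Really bad."))
```

Most of the effort in a custom dictionary goes into building and checking the word lists, which is why the slide calls the option complicated and time-consuming.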
Quantitative CATA Programs

Program          | Author                       | Original Purpose
VBPro            | M. Mark Miller               | Newspaper articles
Yoshikoder       | Will Lowe                    | Political documents
WordStat         | Normand Peladeau             | Part of SimStat, a statistical analysis package
General Inquirer | Philip Stone                 | General mainframe computer application (1960s)
Profiler Plus    | Michael Young                | Communications of world leaders
LIWC 2007        | Pennebaker, Booth, & Francis | Linguistic characteristics & psychometrics
Diction 5.0      | Rod Hart                     | Political speech
PCAD 2000        | Gottschalk & Bechtel         | Psychiatric diagnoses
WORDLINK         | James Danowski               | Network analysis/communication
CATPAC           | Joseph Woelfel               | Consumer behavior/marketing
Quantitative CATA Programs

Program          | Type
VBPro            | Word count/researcher-created dictionaries only
Yoshikoder       | Word count/researcher-created dictionaries only
WordStat         | Word count/researcher-created dictionaries only
General Inquirer | Word count with pre-set dictionaries
Profiler Plus    | Word count with pre-set dictionaries
LIWC 2007        | Word count with pre-set dictionaries (researcher-created dictionaries may be added)
Diction 5.0      | Word count with pre-set dictionaries
PCAD 2000        | Word count with pre-set dictionaries (researcher-created dictionaries may be added)
WORDLINK         | Word co-occurrence
CATPAC           | Word co-occurrence
Validity and CATA
• Validation as part of the development of a CATA system (e.g., Lin et al., 2009: genres of online discussion threads)
• Validation of thematic CA (psychometrics) against self-report is rare and uncertain (e.g., McClelland et al., 1992)
• A comprehensive model for assessing content, external, and predictive validity when using CATA: Short, Broberg, Cogliser, & Brigham (2010), as applied to “entrepreneurial orientation”:
  • Content validity: an inductive/deductive combo
  • External validity: use multiple sampling frames
  • Predictive validity: measure non-CATA variables that should relate
Validity of Standard Dictionaries
Trusting the Standard Dictionary: an issue of face validity
• Few CATA programs reveal the full dictionary lists (e.g., Diction, General Inquirer)
• None reveal the full algorithm, including disambiguation (e.g., well, pot, leaves)
• None account for negation
Construct and Criterion Validity
• Rod Hart’s Diction: “normed” rather than validated
• Gottschalk and Bechtel’s PCAD: validated against standard psychiatric diagnoses
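The disambiguation and negation concerns on this slide can be illustrated with a small sketch, assuming a naive exact-match word count. The `POSITIVE` word list and both sentences are invented; no actual program's algorithm is reproduced here.

```python
import re

# Hypothetical dictionary entries for a "positive" category.
POSITIVE = {"happy", "good", "well"}

def positive_count(text):
    """Count words in text that appear in the POSITIVE dictionary."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for w in words if w in POSITIVE)

# Two invented sentences with very different meanings.
plain = positive_count("I am happy and doing well.")
tricky = positive_count("I am not happy, and the well ran dry.")

# A simple count cannot see that "not" reverses "happy",
# or that "well" in the second sentence is a noun.
print(plain, tricky)
```

Both sentences receive the same score, which is the kind of result a researcher would want to check by reading KWIC output before trusting a dictionary.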
Quantitative CATA Programs

Program          | Type                                                                                  | Validation
VBPro            | Word count/researcher-created dictionaries only                                       | N/A (all custom dictionaries)
Yoshikoder       | Word count/researcher-created dictionaries only                                       | N/A (all custom dictionaries)
WordStat         | Word count/researcher-created dictionaries only                                       | N/A (all custom dictionaries)
General Inquirer | Word count with pre-set dictionaries                                                  | No. Dictionaries adapted from Harvard IV, Lasswell values, other standard linguistic and socio-psychological scales
Profiler Plus    | Word count with pre-set dictionaries                                                  | Proprietary
LIWC 2007        | Word count with pre-set dictionaries (researcher-created dictionaries may be added)   | Some dimensions have been validated against assessments by human judges
Diction 5.0      | Word count with pre-set dictionaries                                                  | No. Based on R. Hart’s substantive work
PCAD 2000        | Word count with pre-set dictionaries (researcher-created dictionaries may be added)   | Long history of development of a human-coded scheme; both human & CATA heavily validated against clinical diagnoses
WORDLINK         | Word co-occurrence                                                                    | N/A (emergent dimensions)
CATPAC           | Word co-occurrence                                                                    | N/A (emergent dimensions)
Yoshikoder
About Yoshikoder
Created by Will Lowe at Harvard’s Department of Government
Can be downloaded free at www.yoshikoder.org
A cross-platform, multi-lingual CATA program
Must run one case at a time
Assumes the researcher will create dictionaries
Can import external dictionaries
Exports results into Excel
Yoshikoder: KWIC and Concordance
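A minimal sketch of a KWIC (key word in context) listing, written in plain Python and not based on Yoshikoder's own code. The function name `kwic`, the `window` parameter, and the sample text are assumptions made for illustration.

```python
import re

def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context on each side."""
    words = re.findall(r"\w+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            # Mark the keyword with brackets so it stands out in the listing.
            lines.append(f"{left} [{word}] {right}".strip())
    return lines

# Invented sample text.
text = "The senator said the budget was sound. Critics called the budget reckless and vague."
for line in kwic(text, "budget", window=2):
    print(line)
```

Reading each hit in context is a quick way to spot uses of a dictionary word that do not fit the intended category.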
Yoshikoder: Dictionary Report
WordStat
About WordStat
• Created by Normand Peladeau, as part of the SimStat suite for quantitative data analysis (a counterpart to SPSS)
• Must be run as part of SimStat
• Particularly suited to analyzing open-ended responses, in that the data set typically includes both numeric and textual variables, which can immediately be crosstabulated
• The “standard” dictionaries that are included are incomplete and should be avoided
• Also includes KWIC
The WordStat Interface (within SimStat)
Selection of Independent & Dependent Variables—Including Textual Variable
Standard WordStat “Dictionaries”
Breakdown of very limited WordStat “Dictionary”
WordStat Output: Word counts
WordStat Output: Dendrogram
WordStat Output: Crosstab with bar graph
WordStat Output: Crosstab and 3D representation
WordStat Output: KWIC
General Inquirer (PC/MAC version)
About General Inquirer
• Created by Philip Stone in the Department of Social Relations at Harvard in the 1960s; on mainframe for many years
• The current version combines the "Harvard IV-4" dictionary content-analysis categories, the "Lasswell" dictionary content-analysis categories, and five categories based on the social cognition work of Semin and Fiedler, making for 182 categories (dictionaries) in all
The General Inquirer (PC) Interface
• Input and output files must be named
• Two choices: Tags (application of dictionaries) & Words
General Inquirer Output: Tags (data file that may easily be exported to Excel & SPSS)
First row of each set is the ‘r’ (raw count) form of the output. This corresponds to frequencies.
Second row of each set is the ‘s’ (scaled count) form of the output. This corresponds to percentages (of total).
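A minimal sketch of the difference between raw and scaled counts, assuming that "total" means the total number of words in the text. The tag names and word lists are hypothetical stand-ins, not actual General Inquirer categories or entries.

```python
import re

# Hypothetical tag dictionaries: tag name -> set of words.
TAGS = {
    "positive": {"good", "help"},
    "negative": {"bad", "harm"},
}

def tag_counts(text, tags=TAGS):
    """Return (raw, scaled) dictionaries of tag scores for one text."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    # Raw form: frequency of words matching each tag.
    raw = {tag: sum(1 for w in words if w in entries) for tag, entries in tags.items()}
    # Scaled form: raw count as a percentage of all words (assumed denominator).
    scaled = {tag: (100 * n / total if total else 0.0) for tag, n in raw.items()}
    return raw, scaled

raw, scaled = tag_counts("Good people help; bad habits harm. Good!")
print(raw)
print({tag: round(pct, 1) for tag, pct in scaled.items()})
```

Scaled scores are the safer choice when texts differ in length, since a longer text will tend to have higher raw counts on every tag.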
General Inquirer Output: Words
PCAD
About PCAD
• Developed by Gottschalk & Bechtel, using scales developed by Gottschalk & Gleser for human coding in the 1960s
• Diagnostic: assesses one text at a time
• Intended for naturally occurring speech or writing, minimum 80 words
• Measures states of neuropsychiatric interest such as:
  • Anxiety
  • Hostility
  • Cognitive impairment
  • Depression
  • Schizophrenia
  • Achievement Strivings
  • Hope
The PCAD Interface
PCAD Interface-2
PCAD Output: 4 Types (Clauses, Summaries, Analyses, Diagnoses)
PCAD Output: Analyses
PCAD Output: Diagnoses
LIWC
About LIWC
James W. Pennebaker & Martha E. Francis
•Created by Pennebaker, Booth, & Francis
•“Looks at how people write & their state of mind”
•Intended to measure both affective and cognitive constructs
•84 Output Variables (standard dictionaries):
•17 Standard linguistic dimensions (e.g., number of pronouns)
•25 Word categories (e.g., “psychological constructs – affect, cognition”)
•10 Time categories (e.g., “space, motion”)
•19 Personal concerns (e.g., “home”)
LIWC Dictionaries (dimensions) with sample words
http://www.liwc.net/descriptiontable1.php
The LIWC Interface
LIWC Output: Data Matrix (Each row is a case/text, each column a dictionary)
Diction
About Diction
• Created by Roderick P. Hart, University of Texas, originally for the purpose of analyzing political discourse
• To measure “semantic features,” uses a series of 31 standard dictionaries and five “Master Variables” (scales constituted of combinations of the 31):
  • Activity
  • Optimism
  • Certainty
  • Realism
  • Commonality
• Users can create custom dictionaries in addition to standard dictionaries.
• The program can accept individual or multiple passages.
The Diction Interface
Diction Output: Calculated & Master Variables
Diction Output: Dictionary Totals with Normative Values
Diction Output: Interactively Changing Normative Values
Diction: Custom Dictionaries as Simple .txt Files
Diction Output: Data file may be exported to SPSS
SPSS Syntax Editor
CATPAC
About CATPAC
• Created by Joseph Woelfel, communication scientist at the University at Buffalo
• Part of the GALILEO suite of software that analyzes and displays various types of networks
• CATPAC uses a neural network approach, identifying the most frequent words and determining patterns of connection based on co-occurrence
• A scanning window is used to measure the association/co-occurrence
• Uses cluster analysis to present results of this co-occurrence procedure
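A minimal sketch of counting co-occurrence with a scanning window, assuming a window that slides one word at a time over the text. This is a simplified illustration only: it is not CATPAC's neural network procedure, and the function name, parameters, and sample text are invented.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence(text, top_n=5, window=4):
    """Count how often pairs of the top_n most frequent words share a window."""
    words = re.findall(r"[a-z]+", text.lower())
    # Keep only the most frequent words, as the slide describes.
    top = {w for w, _ in Counter(words).most_common(top_n)}
    pairs = Counter()
    # Slide the window one word at a time; a short text yields a single window.
    for start in range(max(1, len(words) - window + 1)):
        present = sorted(top.intersection(words[start:start + window]))
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

# Invented sample text.
pairs = cooccurrence("price quality price service quality price", top_n=2, window=3)
print(pairs)
```

A pair-count table like this one could then be passed to a clustering routine to group words that tend to appear together.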
The CATPAC Interface
Text input will appear in CATPAC main screen
CATPAC Output: Descending Frequency List, Alphabetically Sorted List
CATPAC Output: Dendrogram
CATPAC Output: 3D Plot (using ThoughtView, another part of the Galileo suite)
VBPro
About VBPro
• Created by M. Mark Miller at the University of Tennessee
• For use with MS-DOS (!!)
• Entirely do-it-yourself: no standard dictionaries
• Quantitative: frequencies & coding texts in numeric format for analysis in statistical software
• Qualitative: can provide KWIC (key word in context)
VBPro: Preparing the text
• Multiple cases within one file are prefixed with an identification tag and saved as a .txt file (NOT .asc, the old standard)
VBPro: Preparing Dictionaries
Each search dictionary is headed with >>#<<
The VBPro Interface
VBPro Output: Data matrix (each row is a case/text, each column a dictionary)
VBPro Output: Alphabetization
VBPro Output: Word Frequency
MCCALite
About MCCALite
• Created by Donald G. McTavish & Ellen B. Pirro, sociologists at the University of Minnesota, 1990
• Full name: Minnesota Contextual Content Analysis
• Measures the frequency of words in 116 “idea categories” (dictionaries) and compares these frequencies to the norms of general usage statistics for the English language
• There are standard dictionaries (categories) and KWIC, DIMAP
• Two types of dictionary scores are reported: E-Scores (emphasis) and C-Scores (context)
• Ideal content for MCCALite is multiple-person transcripts (plays, hearings, interviews, TV)
The MCCALite Interface & Output
MCCALite: One more example (of many possible)