Text Mining and Open-ended Questions
in Sample Surveys
Ludovic Lebart Centre National de la Recherche Scientifique
Telecom-ParisTech, Paris, France
AMAI - 2009 - September 8th, 2009 (Mexico D.F.)
Text Mining and Open-ended Questions in Sample Surveys
Summary / Outline
1) Principles of Data Mining and Text mining: A reminder
2) Open-ended Questions: Why? How?
3) From texts to numerical data
4) Basic statistical tools: Visualization, Characteristic words.
5) Applications: Open questions, sample surveys, texts
6) About textual data in general
7) Conclusions
The « Data Mining » approach...
✔ Older techniques are easier to use
✔ Older techniques are improved
✔ New techniques are conceived
✔ New fields of application
✔ New products: software
✔ Need for a selection of methods and for a simple, clear data-processing strategy
1- Principles of Data Mining and Text mining: A reminder
Reminder (Fayyad et al.):
Data Mining (KDD) is the non-trivial process of identifying patterns in huge data sets; these patterns are expected to be valid, novel, potentially useful and, ultimately, understandable.
Survey data processing and data mining
These huge data sets may be unstructured and non-representative.
The main goal is to extract automatically from the ore (raw data) the genuine diamond of truth… (Benzécri 1973)
Survey data: homogeneity of content, of coding… different from the usual inputs of Data Mining programs.
Despite the fact that we may deal with several observational levels (households, individuals, trajectories or biographical data, areas or regions…), there is a consistency and a unity of content in a survey data set - together with general hypotheses formulated beforehand - that are not present in the usual data mining input data.
In this context, a lot of meta-information is generally available (demographic, economic, sociological, epidemiological, etc.) that provides a framework for the interpretation phase.
A survey (whatever its complexity) is a costly set of measurements that follows a specific decision.
✔ Initial paradigm:
  - Extracting statistical units from texts
  - Complementing lexicometry with a multivariate approach
  - Applying visualization tools to lexical tables
✔ Evolution and diversification of techniques and approaches
“Text Mining” and multivariate exploratory statistical analysis of texts
The fields of Text Mining
Web
Press
Scientific papers, abstracts
Information retrieval
Open-ended questions, free responses
Qualitative interviews, Discourses, Reports
Complaints
Open questions: Why?
◆ To shorten interview time: Open-ended questions are less costly in terms of interview time, and generate less fatigue and tension (compared with voluminous lists of items).
◆ To gather spontaneous information: Marketing surveys contain many questions of this type, for example "What do you recall (or: what do you like) about this ad?"
◆ To probe the response to a closed-end question: This is the additional follow-up question "Why?". Explanations concerning a response already given have to be provided in a spontaneous fashion.
◆ To get information relating to non-comparable variables. Example: environmental activism, dietary habits…
2- Open-ended Questions: Why? How?
Open questions: Why?
DRAWBACKS: Cost, Complexity, Specificity
ADVANTAGES: Speed, Freedom, Specificity
A classical experiment, quoted by Schuman and Presser (1981), stresses the difficulty of comparing the two types of questioning.
When asked "What is the most important problem facing this country [USA] at present?", 16% of Americans mention crime and violence (grouped free responses), whereas the same item offered in a closed question obtains 35% of the responses.
The explanation given by the authors is the following: lack of security is often considered a local, not a national, problem, so that the item "crime and violence" is not often mentioned spontaneously.
Closing the question indicates that this response is a relevant or possible response, resulting in a higher response percentage.
Comparison between open and closed questions
In some particular contexts, the absence of a response item can play a positive role. It can establish a climate of confidence and communication, and lead to better results when certain subjects are brought up.
This is what is indicated by the work of Sudman and Bradburn (1974) concerning questions having to do with "threats", and of Bradburn et al. (1979) concerning questions about alcohol and sexuality.
In international studies, it is important to know whether people interviewed in different countries understand the closed questions in the same way (case of the follow-up “Why?”).
As a matter of fact, it is also legitimate to raise this same issue with respect to regional and generational differences.
Heuristic value of open-ended questions
In some other particular contexts, the cultural gap between those who have conceived the questionnaire and the interviewees is hidden by the purely numerical coding of the closed questions.
In a national survey about the attitudes of economically disadvantaged people towards the minimum wage system in France, a classical open question was asked at the end of the interview:
“Would you like to add something about topics that may be missing from this questionnaire, concerning the minimum wage system?”
One answer (among many others in the same vein) was: “We eat potatoes and eggs, despite my diabetes and my cholesterol, because they are cheap.”
Another: “Thank you for coming. It proves that you are thinking of me”.
Some respondents are far removed from the intended issue, “attitude towards an institution”.
Heuristic value of open-ended questions (continued)
Empirical Post-Coding of free responses
(Drawbacks of this type of processing)
Coder bias: Coder bias is added to interviewer bias, since the coder makes decisions and formulates interpretations, introducing a « personal touch ».
Alteration of form: Information is destroyed in its form and often weakened in its content: quality of expression, level of vocabulary, and general interview tonality are lost.
Weakening of content: (case of responses that are composed, complex, vague and diversified).
Infrequent responses are eliminated a priori.
Example 1: Open Question « Life » (international sample surveys)
The following open-ended question was asked :
"What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?".
This question was included in a multinational survey conducted in seven countries (Japan, France, Germany, Italy, the Netherlands, United Kingdom, USA) in the late 1980s (Hayashi et al., 1992).
Our illustrative example is limited to the British sample (Sample size: 1043).
| Gender | Educ. | Age | Response |
|---|---|---|---|
| 1 | 1 | 4 | happiness in people around me, contented family, would make me happy |
| 1 | 2 | 2 | my own time, not dictated by other people |
| 1 | 2 | 2 | freedom of choice as to what I do in my leisure time |
| 1 | 3 | 2 | I suppose work |
| 1 | 2 | 1 | firm, my work, which is my dad's firm |
| 2 | 1 | 6 | just the memory of my last husband |
| 2 | 2 | 6 | well-being of my handicapped son |
| 1 | 1 | 5 | my wife, she gave me courage to carry on even in the bad times |
| 2 | 2 | 3 | my sons, my kids are very important to me, being on my own, I am responsible for their education |
| 1 | 3 | 3 | job, being a teacher I love my job, for the well-being of the children |
Example 1: Open Question « Life »: Examples of responses
Following the viewing of a television commercial for breakfast cereals (copy-test), several open questions were asked. One of them, which we shall use as our example, is:
What was the main idea of this commercial?
In addition a number of closed questions were also asked (socio-demographic characteristics of respondents, purchase intent toward product seen). Purchase intent being an important issue, this question plays a major role in the discussions that follow.
Two examples of responses to that open question.
1 - That it has complex carbohydrates in it, it has energy releaser and it tastes good... It showed people eating grape nuts.
2 - It gives you energy in the morning, nothing else.
Example 2: Open Questions / Copy-Test
A survey in three cities (Tokyo, New York, Paris) about dietary habits.
The common open-ended questions were:
"What dishes do you like and eat often?" (with a probe: "Any other dishes you like and eat often?")
"What would be an ideal meal?"
Akuto H.(Ed.) (1992). International Comparison of Dietary Cultures, Nihon Keizai Shimbun, Tokyo.
Akuto H., Lebart L. (1992). Le Repas Idéal. Analyse de Réponses Libres en Anglais, Français, Japonais. Les Cahiers de l'Analyse des Données, vol XVII, n°3, Dunod, Paris
Example 3: An international survey (Tokyo Gas Company)
Four responses (New York) to "What dishes do you like and eat often?" and "What would be an ideal meal?"
---- 1 SPAGHETTI, CHINESE ++++ CAESAR SALAD, LOBSTER TAILS, BAKED POTATO, CHOCOLATE MOUSSE
---- 2 SEAFOOD, GREEN SALAD, CHINESE FOOD ++++ CHAMPAGNE, CAVIAR, GREEN SALAD, GRILLED SEAFOOD
---- 3 CHINESE FOOD ++++ CHINESE FOOD, FRENCH FOOD, VEAL, BREAD
---- 4 PASTA ++++ BEARNAISE BEEF, CHINESE FOOD, ITALIAN FOOD, PASTA
Example 3: An international survey (continuation)
5 appellations (denominaciones): Bierzo, Cigales, Ribera del Duero, Rueda, Toro
Example 4: Evaluating wines through scores and comments
Guía de Catas de Castilla y León (2005): 522 wines from Castilla y León, belonging to 207 wineries
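---- Score = 80, Valdelosfriales 2003: Joven típico, con notas de tempranillo y balsámicos; en boca amable y frutoso. [Typical young wine, with notes of tempranillo and balsamic; pleasant and fruity in the mouth.]
---- Score = 91, Tares P3 2001 premium: Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. Concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final. [A great deal of terroir can be detected in the bouquet of this great red; gunpowder, flint, slate, warm gravel contrasting with damp earth and plenty of ripe stone fruit. Concentrated, fatty feel on the palate; impressive viscosity on the tongue, again impressions of damp earth and gunpowder in the long finish.]
Example 4: Evaluating wines through scores and comments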
Example of two texts
Statistical units derived from texts
Characters
Words, lemmas, n-grams
Segments or quasi-segments
Sentences or responses
Texts
CORPUS
3- From texts to numerical data
Ambiguity of frequencies: statistical frequency versus « linguistic frequency »
Closed questions
Texts
Open questions
(statistical frequency)
( linguistic frequency)
Sample surveys
In this example we focus on a partitioning of the sample into nine categories, obtained by cross-tabulating age (three categories) with educational level (three categories).
The counts for the first phase of numeric coding are as follows: Out of 1043 responses, there are 13 669 occurrences, with 1 413 distinct words. When the words appearing at least 16 times are selected, there remain 10 357 occurrences of these words, with 135 distinct words (types).
Example 1: Question « Life » - continuation
The same questionnaire also had a number of closed-end questions (among them, the socio-demographic characteristics of the respondents, which play a major role).
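The counts quoted above (occurrences, distinct words, and what remains after a frequency threshold) can be illustrated with a minimal sketch. The tokenization rule and the function names `tokenize` and `corpus_counts` are assumptions made for this illustration, not the procedure used in the original study; the four responses are taken from the sample of answers shown earlier.

```python
import re
from collections import Counter

def tokenize(response):
    """Lowercase a response and split it into word tokens.
    Assumed rule: runs of letters, optionally followed by an apostrophe part."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", response.lower())

def corpus_counts(responses, min_freq):
    """Return (occurrences, distinct words, kept occurrences, kept words)."""
    freq = Counter(tok for r in responses for tok in tokenize(r))
    # Keep only the words that appear at least min_freq times in the corpus.
    kept = {w: n for w, n in freq.items() if n >= min_freq}
    return sum(freq.values()), len(freq), sum(kept.values()), kept

responses = [
    "my own time, not dictated by other people",
    "freedom of choice as to what I do in my leisure time",
    "I suppose work",
    "firm, my work, which is my dad's firm",
]
total, distinct, kept_total, kept = corpus_counts(responses, min_freq=2)
print(total, distinct, kept_total, sorted(kept))
```

With the full corpus, the same threshold logic (here `min_freq=2`, in the survey 16) is what reduces the vocabulary to the words retained for analysis.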
Words Appearing at Least Sixteen Times (Alphabetic Order) in the 1043 responses to the open question
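| Word | Frequency | Word | Frequency | Word | Frequency |
|---|---|---|---|---|---|
| I | 248 | go | 19 | of | 312 |
| I'm | 22 | going | 26 | on | 59 |
| a | 298 | good | 303 | other | 33 |
| able | 55 | grandchildren | 30 | others | 17 |
| about | 31 | happiness | 227 | our | 29 |
| after | 26 | happy | 137 | out | 34 |
| all | 86 | have | 99 | own | 16 |
| and | 504 | having | 70 | peace | 77 |
| anything | 19 | health | 609 | people | 61 |
| are | 65 | healthy | 45 | really | 28 |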
Example 1: Selected statistical units
| Gender | Educ. | Age | Tagged response |
|---|---|---|---|
| 1 | 1 | 4 | happiness/NN in/IN people/NNS around/IN me/PRP ,/, contented/VBN family/NN ,/, would/MD make/VB me/PRP happy/JJ |
| 1 | 2 | 2 | my/PRP$ own/JJ time/NN ,/, not/RB dictated/VBN by/IN other/JJ people/NNS |
| 1 | 2 | 2 | freedom/NN of/IN choice/NN as/IN to/TO what/WP I/PRP do/VB in/IN my/PRP$ leisure/NN time/NN |
| 1 | 3 | 2 | I/PRP suppose/VBP work/NN |
| 1 | 2 | 1 | firm/NN ,/, my/PRP$ work/NN ,/, which/WDT is/VBZ my/PRP$ dad's/NNS firm/NN |
| 2 | 1 | 6 | just/RB the/DT memory/NN of/IN my/PRP$ last/JJ husband/NN |
| 2 | 2 | 6 | wellbeing/NN of/IN my/PRP$ handicapped/JJ son/NN |
| 1 | 1 | 5 | my/PRP$ wife/NN ,/, she/PRP gave/VBD me/PRP courage/NN to/TO carry/VB on/IN even/RB in/IN the/DT bad/JJ times/NNS |
Example of morpho-syntactic information
- First partition: three age categories
  - less than 30 years [noted -30]
  - between 30 and 55 years [-55]
  - over 55 years [+55]
- Second partition: three educational levels
  - no degree or low [noted L]
  - medium [M]
  - high level [H]
Example of a lexical contingency table
Partial listing of lexical table cross-tabulating 135 words of frequency greater than or equal to 16 with 9 age-education categories
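| Word | L-30 | L-55 | L+55 | M-30 | M-55 | M+55 | H-30 | H-55 | H+55 |
|---|---|---|---|---|---|---|---|---|---|
| I | 2 | 46 | 92 | 30 | 25 | 19 | 11 | 21 | 2 |
| I'm | 2 | 5 | 9 | 3 | 2 | 1 | 0 | 0 | 0 |
| a | 10 | 56 | 66 | 54 | 44 | 19 | 20 | 22 | 7 |
| able | 1 | 9 | 16 | 9 | 7 | 4 | 4 | 5 | 0 |
| about | 0 | 3 | 13 | 7 | 1 | 2 | 4 | 1 | 0 |
| after | 1 | 8 | 11 | 3 | 1 | 2 | 0 | 0 | 0 |
| all | 1 | 24 | 19 | 8 | 18 | 6 | 3 | 5 | 2 |
| and | 8 | 89 | 148 | 86 | 73 | 30 | 25 | 32 | 13 |
| anything | 0 | 4 | 9 | 1 | 3 | 0 | 1 | 1 | 0 |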
Example 1: A lexical contingency table
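A lexical contingency table of this kind can be built by pooling the responses of each category and counting the retained words. The sketch below is only an illustration: the function name `lexical_table` is hypothetical, and the category labels attached to the four example responses are invented for the demonstration (the slides do not give the mapping from the coded Educ. and Age values to the labels L-30 … H+55).

```python
from collections import Counter

def lexical_table(records, vocabulary):
    """Cross-tabulate words (rows) with respondent categories (columns).

    records: list of (category, list_of_tokens) pairs, one per response.
    vocabulary: the words retained after the frequency threshold.
    Returns (categories, table) with table[word] = counts per category.
    """
    categories = sorted({cat for cat, _ in records})
    counts = {cat: Counter() for cat in categories}
    for cat, tokens in records:
        counts[cat].update(tokens)  # pool all responses of the same category
    table = {w: [counts[c][w] for c in categories] for w in vocabulary}
    return categories, table

# Hypothetical category assignments, for illustration only.
records = [
    ("M-30", "my own time not dictated by other people".split()),
    ("M-30", "freedom of choice as to what i do in my leisure time".split()),
    ("H-30", "i suppose work".split()),
    ("L+55", "just the memory of my last husband".split()),
]
categories, table = lexical_table(records, vocabulary=["i", "my", "of", "time"])
print(categories)
for word, row in table.items():
    print(word, row)
```

Each row of the result is the profile of a word across categories, which is the input expected by the multivariate methods discussed later.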
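Example 2: "What is the main idea in this commercial"
Words appearing more than 9 times (100 responses)

| Number | Word | Frequency | Number | Word | Frequency |
|---|---|---|---|---|---|
| 1 | I | 14 | 25 | in | 27 |
| 2 | a | 59 | 26 | is | 37 |
| 3 | about | 15 | 27 | it | 133 |
| 4 | all | 21 | 28 | it's | 28 |
| 5 | and | 42 | 29 | long | 14 |
| 6 | are | 25 | 30 | morning | 9 |
| 7 | been | 12 | 31 | nothing | 25 |
| 8 | carbohydrate | 14 | 32 | nutritional | 9 |
| 9 | carbohydrates | 33 | 33 | nutritious | 12 |
| 10 | cereal | 34 | 34 | nuts | 25 |
| 11 | complex | 25 | 35 | of | 25 |
| 12 | crunchy | 9 | 36 | people | 28 |
| 13 | eaten | 10 | 37 | showed | 11 |
| 14 | eating | 19 | 38 | taste | 11 |
| 15 | energy | 33 | 39 | that | 80 |
| 16 | for | 57 | 40 | that's | 13 |
| 17 | give | 9 | 41 | the | 82 |
| 18 | gives | 11 | 42 | they | 50 |
| 19 | good | 52 | 43 | to | 32 |
| 20 | grape | 25 | 44 | was | 19 |
| 21 | has | 30 | 45 | with | 11 |
| 22 | have | 27 | 46 | years | 11 |
| 23 | healthy | 23 | 47 | you | 81 |
| 24 | how | 9 | | | |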
| First word | Segment | Freq. | Length | Text of segment |
|---|---|---|---|---|
| a | 1 | 8 | 3 | a long time |
| are | 2 | 6 | 4 | are good for you |
| carbohydrates | 3 | 5 | 3 | carbohydrates in it |
| complex | 4 | 15 | 2 | complex carbohydrates |
| for | 5 | 37 | 2 | for you |
| give | 6 | 7 | 3 | give you energy |
| gives | 7 | 11 | 2 | gives you |
| | 8 | 9 | 3 | gives you energy |
| good | 9 | 24 | 2 | good for |
| | 10 | 22 | 3 | good for you |
| grape | 11 | 25 | 2 | grape nuts |
| have | 12 | 6 | 3 | have been eating |
| healthy | 13 | 6 | 3 | healthy for you |
| is | 14 | 9 | 4 | is good for you |
| it | 15 | 26 | 2 | it has |
| | 16 | 19 | 2 | it is |
| | 17 | 14 | 2 | it was |
| | 18 | 8 | 3 | it gives you |
| | 19 | 8 | 3 | it has a |
| | 20 | 6 | 3 | it has complex |
| | 21 | 5 | 3 | it is good |
| | 22 | 6 | 4 | it gives you energy |
| people | … | | | |
Example 2: "What is the main idea in this commercial" Examples of repeated segments
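Repeated segments such as "gives you energy" can be approximated by counting every run of consecutive words that occurs more than once in the corpus. The sketch below is a plain n-gram count written for illustration; the function name `repeated_segments`, the maximum length and the threshold are assumptions, and the software used for the slides may apply further rules (punctuation boundaries, for instance). The three short responses are made up in the style of the copy-test answers.

```python
from collections import Counter

def repeated_segments(token_lists, max_len=4, min_freq=2):
    """Count every sequence of 2..max_len consecutive words inside a response
    and keep those occurring at least min_freq times in the whole corpus."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(seg): c for seg, c in counts.items() if c >= min_freq}

responses = [
    "it gives you energy in the morning",
    "it gives you energy",
    "it is good for you",
]
segments = repeated_segments([r.split() for r in responses])
for text, freq in sorted(segments.items()):
    print(freq, len(text.split()), text)
```

Segments are never allowed to span two responses, since each response is processed separately.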
Example 3: An international survey (Tokyo Gas Company)
The common open-ended question: "What dishes do you like and eat often?" (with a probe: "Any other dishes you like and eat often?").
- Sub-sample 1 (city of Tokyo): 1008 individuals. The global corpus of open responses contains 6219 occurrences of 832 distinct words. 139 words appear at least 7 times, leading to 4975 occurrences.
- Sub-sample 2 (city of New York): 634 individuals. The global corpus contains 6511 occurrences of 638 distinct words. The processing takes into account the 83 words appearing at least 12 times.
- Sub-sample 3 (city of Paris): 1000 individuals. The global corpus contains 11108 occurrences of 1229 distinct words. The processing takes into account the 112 words appearing at least 18 times, leading to 7806 occurrences.
- The three sets of respondents are broken down into six categories (three age categories combined with gender).
Example 3: An international survey (Tokyo Gas Company)
Words (frequency order)

| Num. | Used word | Freq. |
|---|---|---|
| 12 | CHICKEN | 254 |
| 73 | STEAK | 101 |
| 49 | PASTA | 95 |
| 22 | FISH | 87 |
| 60 | SALAD | 85 |
| 1 | AND | 85 |
| 23 | FOOD | 82 |
| 52 | PIZZA | 62 |
| 79 | VEGETABLES | 57 |
| 4 | BEEF | 56 |
| 71 | SPAGHETTI | 55 |
| 13 | CHINESE | 54 |
| 80 | WITH | 48 |
| 59 | ROAST | 47 |
| 58 | RICE | 45 |
| 67 | SHRIMP | 45 |
| 43 | MACARONI | 42 |
| 56 | POTATOES | 39 |
| 35 | HAMBURGERS | 36 |
| 75 | TUNA | 35 |
| 26 | FRIED | 33 |
| 77 | VEAL | 33 |
| 38 | ITALIAN | 31 |
| 2 | BAKED | 29 |
| 48 | PARMESAN | 29 |
| 55 | POTATO | 27 |
| 46 | MEATBALLS | 25 |
| 3 | BEANS | 24 |
| 45 | MEAT | 24 |
| 76 | TURKEY | 24 |
| 14 | CHOPS | 23 |
| 34 | HAMBURGER | 22 |
City of New York
The common open-ended question: "What dishes do you like and eat often?" (with a probe: "Any other dishes you like and eat often?").
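| Pos. | Word | Freq. |
|---|---|---|
| 1 | de | 891 |
| 2 | y | 806 |
| 3 | en | 694 |
| 4 | boca | 433 |
| 5 | con | 356 |
| 6 | fruta | 334 |
| 7 | un | 308 |
| 8 | nariz | 246 |
| 9 | muy | 237 |
| 10 | la | 211 |
| 10 | notas | 211 |
| 12 | que | 168 |
| 13 | taninos | 167 |
| 14 | el | 159 |
| 15 | una | 152 |
| 16 | madera | 140 |
| 17 | bien | 116 |
| 18 | toque | 108 |

Lemmatization:
- Singular and plural forms
- Masculine and feminine forms
- Verb forms to the infinitive
- …
Articles, prepositions, … removed.
Words used at least 8 times are kept.
Remaining:
- 250 words
- 443 wines

| | P1 | P2 | ... | P250 |
|---|---|---|---|---|
| Wine 1 | 0 | 1 | ... | 2 |
| Wine 2 | 1 | 0 | ... | 1 |
| Wine 3 | 0 | 0 | ... | 1 |
| ... | ... | ... | ... | ... |
| Wine 443 | 1 | 2 | ... | 0 |

Example 4: Evaluating wines through scores and comments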
Example 4: Evaluating wines through scores and comments (continued)
[Bar chart of word frequencies; axis: Frequency, from 0 to 400. Words shown: acidez (acidity), potente (powerful), suave (mild), ligero (light), ser (to be), cereza (cherry), algo (some/something), fino (fine), medio (medium), jugoso (juicy), agradable (pleasant), elegante (elegant), todavía (still), vino (wine), balsámico (balsamic), maduro (ripened), final (end), bien (well), toque (hint), negro (black), rojo (red), buen (good), madera (wood), taninos (tannins), nota (note), nariz (nose), muy (very), fruta (fruit), boca (mouth).]
✔ Applying visualization tools to lexical tables
  ● Principal axes analyses of lexical tables
  ● Classification (clustering) of words and texts
✔ Selecting characteristic units and responses (or: sentences)
  ● Characteristic units (words, segments, lemmas)
  ● Selecting « Modal responses »
4) Basic statistical tools: Visualization, Characteristic words.
Briefly, one can summarize the principles of methods for performing these data reductions:
Principal axes methods, largely based upon linear algebra, produce graphical representations in which the geometric proximities among row-points and among column-points reflect statistical associations among rows and among columns. Correspondence analysis belongs to this family of methods.
Clustering or classification methods create groupings of rows or of columns into clusters (or into families of hierarchical clusters); they include SOM (Self-Organizing Maps, or Kohonen maps).
These two families of methods can be used on the same data matrix and they complement one another very effectively.
Selection of characteristic units and responses (or: sentences):
● Characteristic units (words, segments, lemmas)
● Selecting « Modal responses »
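To make the principal axes idea concrete, here is a minimal sketch of correspondence analysis computed through a singular value decomposition of the standardized residuals, using NumPy. It is one standard textbook formulation, not necessarily the implementation behind the slides, and the function name `correspondence_analysis` is hypothetical. The data are the nine rows of the partial lexical table listed earlier (words × age-education categories); with only 9 of the 135 words, the resulting axes are purely illustrative.

```python
import numpy as np

def correspondence_analysis(N):
    """Return (eigenvalues, row coordinates, column coordinates) of a
    contingency table N, via the SVD of its standardized residuals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]     # principal coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return sv ** 2, row_coords, col_coords

words = ["I", "I'm", "a", "able", "about", "after", "all", "and", "anything"]
categories = ["L-30", "L-55", "L+55", "M-30", "M-55", "M+55", "H-30", "H-55", "H+55"]
table = [
    [2, 46, 92, 30, 25, 19, 11, 21, 2],
    [2, 5, 9, 3, 2, 1, 0, 0, 0],
    [10, 56, 66, 54, 44, 19, 20, 22, 7],
    [1, 9, 16, 9, 7, 4, 4, 5, 0],
    [0, 3, 13, 7, 1, 2, 4, 1, 0],
    [1, 8, 11, 3, 1, 2, 0, 0, 0],
    [1, 24, 19, 8, 18, 6, 3, 5, 2],
    [8, 89, 148, 86, 73, 30, 25, 32, 13],
    [0, 4, 9, 1, 3, 0, 1, 1, 0],
]
eigenvalues, rows, cols = correspondence_analysis(table)
percent = 100 * eigenvalues / eigenvalues.sum()
print("share of inertia per axis (%):", np.round(percent, 1))
for w, xy in zip(words, rows[:, :2]):
    print(f"{w:10s} axis1={xy[0]: .3f} axis2={xy[1]: .3f}")
```

Plotting the first two columns of `rows` and `cols` on the same plane gives the kind of map on which proximities between words and between categories are read.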
Visualization through principal coordinates, « a breakthrough in 1904 ».
Charles Spearman (1904) – “General intelligence, objectively determined and measured”. Amer. Journal of Psychology, 15, p 201-293.
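$$x_{ij} = a_j f_i + \varepsilon_{ij}$$
- $x_{ij}$: value of variable j for individual i
- $a_j$: coefficient of variable j
- $f_i$: general factor for individual i
- $\varepsilon_{ij}$: residual (hopefully small)
Known (left-hand side) = Unknown (right-hand side)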
$$x_{ij} = a_j f_i + b_j g_i + \dots + \varepsilon_{ij}$$
Garnett J.-C. (1919) - General ability, cleverness and purpose. British J. of Psych., 9, p 345-366.
Thurstone L. L. (1947) - Multiple Factor Analysis. The University of Chicago Press, Chicago.
$$X = \lambda_1 v_1 u'_1 + \dots + \lambda_\alpha v_\alpha u'_\alpha + \dots + \lambda_p v_p u'_p$$
Eckart C., Young G. (1936) - The approximation of one matrix by another of lower rank. Psychometrika, 1, p 211-218.
Eckart C., Young G. (1939) - A principal axis transformation for non-hermitian matrices. Bull. Amer. Math. Soc., 45, p 118-121.
Singular Value Decomposition is a theorem, not a model
A precursor: Pearson K. (1901) - On lines and planes of closest fit to systems of points in space. Phil. Mag. 2, n°11, p 559-572.
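The approximation result of Eckart and Young can be checked numerically: keeping the first k terms of the decomposition gives a rank-k matrix whose distance to X is determined by the discarded singular values. The sketch below uses NumPy on a small random matrix generated only for illustration (it is not the image data of the slides), and the function name `rank_k_approximation` is hypothetical.

```python
import numpy as np

def rank_k_approximation(X, k):
    """Sum of the first k terms of the singular value decomposition of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# Hypothetical 6 x 8 table of grey levels (0..255), for illustration only.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(6, 8)).astype(float)

singular_values = np.linalg.svd(X, compute_uv=False)
for k in (1, 2, 4, 6):
    error = np.linalg.norm(X - rank_k_approximation(X, k))  # Frobenius norm
    bound = np.sqrt((singular_values[k:] ** 2).sum())
    print(f"k={k}: reconstruction error {error:.3f}, discarded part {bound:.3f}")
```

The two printed columns coincide, and the error falls to zero once all terms are kept; the same mechanism underlies a step-by-step reconstruction of a table from a growing number of axes.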
95 88 88 87 95 88 95 95 95 106 95 78 65 71 78 77 77 etc. 143 144 151 151 153 170 183 181 162 140 116 128 133 144 159 166 170 153 151 162 166 162 151 126 117 128 143 147 175 181 170 166 132 116 143 144 133 130 143 153 159 175 192 201 188 162 135 116 101 106 118 123 112 116 130 143 147 162 183 166 135 123 120 116 116 129 140 159 133 151 162 166 170 188 166 128 116 132 140 126 143 151 144 155 176 160 168 166 159 135 101 93 98 120 128 126 147 154 158 176 181 181 154 155 153 144 126 106 118 133 136 153 159 153 162 162 154 143 128 159 153 147 159 150 154 155 153 158 170 159 147 130 136 140 150 150 151 144 147 176 188 170 166 183 170 166 153 130 132 154 162 120 135 155 181 183 162 144 147 147 144 126 120 123 129 130 112 101 135 150 166 147 129 123 133 144 133 117 109 118 132 112 109 120 136 120 136 136 130 136 147 147 140 136 144 140 132 129 151 153 140 128 153 147 130 133 140 124 136 152 166 147 144 151 159 140 123 130 123 109 112 126 120 143 145 162 153 155 175 154 144 136 130 120 112 123 123 144 144 159 155 155 162 166 158 147 140 147 126 123 132 135 136 144 147 136 143 162 175 136 110 112 135 120 118 126 151 150 130 129 133 147 133 151 143 106 85 93 128 136 140 140 144 143 126 117 116 129 124 ……………………………..etc.
Image “Cheetah” (Data Compression, Mark Nelson) and table (200 x 320) containing levels of grey.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 44: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/44.jpg)
Trace before diagonalization: 0.15930
Trace after diagonalization: 0.15930
No.: eigenvalue, percentage, cumulative percentage, histogram
1: 0.045 28.549 28.549 **************************************************
2: 0.028 17.695 46.243 ******************************
3: 0.019 12.205 58.448 *********************
4: 0.012 7.306 65.754 ************
5: 0.007 4.674 70.428 ********
6: 0.006 3.516 73.944 ******
7: 0.005 2.944 76.888 *****
8: 0.003 2.179 79.067 ***
9: 0.003 1.869 80.936 ***
10: 0.002 1.531 82.467 **
11: 0.002 1.371 83.838 **
12: 0.002 1.106 84.944 *
13: 0.002 1.066 86.010 *
14: 0.002 0.956 86.965 *
15: 0.001 0.791 87.756 *
16: 0.001 0.758 88.514 *
17: 0.001 0.690 89.204 *
18: 0.001 0.567 89.771
19: 0.001 0.554 90.325
20: 0.001 0.477 90.801
21: 0.001 0.422 91.223
22: 0.001 0.406 91.629
23: 0.001 0.384 92.013
24: 0.001 0.339 92.352
Eigenvalues of the Correspondence Analysis of the previous table
4) Basic statistical tools: Visualization, Characteristic words.
![Page 45: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/45.jpg)
Reconstitution of the Cheetah with 2, 4, 6, 8, 10, 12, 20, 30, 40 principal axes
4) Basic statistical tools: Visualization, Characteristic words.
![Page 46: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/46.jpg)
A pedagogical example: Description of « Textual Graphs »
4) Basic statistical tools: Visualization, Characteristic words.
![Page 47: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/47.jpg)
**** Ain Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie
**** Aisne Aisne Ardennes Marne Nord Oise Seine_Marne Somme
**** Allier Allier Cher Creuse Loire Nievre Puy_de_Dome Hte_Saone
**** Alpes_Prov Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse
**** Alpes_Hautes Alpes_Hautes Alpes_Prov Drome Isere Savoie
**** Alpes_Marit Alpes_Marit Alpes_Prov Var
**** Ardeche Ardeche Drome Gard Loire Hte_Loire Lozere
**** Ardennes Ardennes Aisne Marne Meuse ……………………….
Each area “answers” the fictitious open question: “Which are your neighbouring areas?”
4) Basic statistical tools: Visualization, Characteristic words.
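To make the "textual graph" idea concrete, here is a minimal sketch of how such responses can be turned into a lexical table (one row per respondent, one column per word). It uses only the first records shown on the slide; the variable names are illustrative, and the later steps (analysis and mapping) are not covered.

```python
# Responses copied from the slide (a few records only). Each area's "answer"
# lists itself and its neighbouring areas.
responses = {
    "Ain": "Ain Isere Jura Rhone Hte_Saone Savoie Hte_Savoie",
    "Aisne": "Aisne Ardennes Marne Nord Oise Seine_Marne Somme",
    "Alpes_Prov": "Alpes_Prov Alpes_Hautes Alpes_Marit Drome Var Vaucluse",
    "Alpes_Hautes": "Alpes_Hautes Alpes_Prov Drome Isere Savoie",
    "Alpes_Marit": "Alpes_Marit Alpes_Prov Var",
}

# Vocabulary: every distinct "word" (area name), in sorted order.
vocabulary = sorted({w for text in responses.values() for w in text.split()})
column = {w: j for j, w in enumerate(vocabulary)}

# Lexical table: one row per respondent, one column per word, cell = count.
table = []
for area, text in responses.items():
    row = [0] * len(vocabulary)
    for w in text.split():
        row[column[w]] += 1
    table.append(row)

for area, row in zip(responses, table):
    print(area, row)
```

The resulting table is sparse (mostly zeros), and two areas have similar rows exactly when they share neighbours, which is the pattern a subsequent analysis can pick up.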
![Page 48: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/48.jpg)
The idea: When a pattern exists within a text, some techniques may detect it and exhibit it.
This map is blindly produced from the previous texts.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 49: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/49.jpg)
Correspondence analysis can be presented in several different ways.
• It is difficult to trace the method's history accurately (see, e.g., Hill, 1974; Benzécri, 1976; Nishisato, 1980; Gifi, 1990).
• The underlying theory probably dates back to Fisher (1936), Guttman (1941), and Hayashi (1956).
Example: CORRESPONDENCE ANALYSIS of a simple lexical table
4) Basic statistical tools: Visualization, Characteristic words.
![Page 50: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/50.jpg)
• Correspondence analysis and principal components analysis are used under different circumstances:
• Principal components analysis is used for tables consisting of continuous measurements.
• Correspondence analysis is best applied to contingency tables (cross-tabulations) frequently encountered when analyzing textual data.
• By extension, it also provides a satisfactory description of data tables with binary coding.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 51: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/51.jpg)
• Cross-tabulations or contingency tables are among the most common data structures used for analyzing qualitative data.
• By looking at two partitions of a population or sample simultaneously, a cross-tabulation lets us examine how the data vary across response categories, a necessary step for interpreting results.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 52: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/52.jpg)
• We will use a small contingency table as a running example.
• However, this kind of exploratory method is chiefly useful when dealing with very large data tables (a pedagogical paradox).
• In the following table, the 14 rows are words used in responses to an open-ended question put to 2000 respondents.
• The 5 columns are the educational levels of the respondents.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 53: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/53.jpg)
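| Words | No degree | Elem. Sch. | Trade Sch. | High Sch. | College | Total |
|---|---|---|---|---|---|---|
| Money | 51 | 64 | 32 | 29 | 17 | 193 |
| Future | 53 | 90 | 78 | 75 | 22 | 318 |
| Unemployment | 71 | 111 | 50 | 40 | 11 | 283 |
| Decision | 1 | 7 | 5 | 5 | 4 | 22 |
| Difficult | 7 | 11 | 4 | 3 | 2 | 27 |
| Economic | 7 | 13 | 12 | 11 | 11 | 54 |
| Selfishness | 21 | 37 | 14 | 26 | 9 | 107 |
| Occupation | 12 | 35 | 19 | 6 | 7 | 79 |
| Finances | 10 | 7 | 7 | 3 | 1 | 28 |
| War | 4 | 7 | 7 | 6 | 2 | 26 |
| Housing | 8 | 22 | 7 | 10 | 5 | 52 |
| Fear | 25 | 45 | 38 | 38 | 13 | 159 |
| Health | 18 | 27 | 20 | 19 | 9 | 93 |
| Work | 35 | 61 | 29 | 14 | 12 | 151 |
| Total | 323 | 537 | 322 | 285 | 125 | 1592 |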
A contingency table crossing words and education level
4) Basic statistical tools: Visualization, Characteristic words.
![Page 54: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/54.jpg)
| Words | No degree | Elem. Sch. | Trade Sch. | High Sch. | College | Total |
|---|---|---|---|---|---|---|
| Money | 26.4 | 33.2 | 16.6 | 15.0 | 8.8 | 100.0 |
| Future | 16.7 | 28.3 | 24.5 | 23.6 | 6.9 | 100.0 |
| Unemployment | 25.1 | 39.2 | 17.7 | 14.1 | 3.9 | 100.0 |
| Decision | 4.5 | 31.8 | 22.7 | 22.7 | 18.2 | 100.0 |
| Difficult | 25.9 | 40.7 | 14.8 | 11.1 | 7.4 | 100.0 |
| Economic | 13.0 | 24.1 | 22.2 | 20.4 | 20.4 | 100.0 |
| Selfishness | 19.6 | 34.6 | 13.1 | 24.3 | 8.4 | 100.0 |
| Occupation | 15.2 | 44.3 | 24.1 | 7.6 | 8.9 | 100.0 |
| Finances | 35.7 | 25.0 | 25.0 | 10.7 | 3.6 | 100.0 |
| War | 15.4 | 26.9 | 26.9 | 23.1 | 7.7 | 100.0 |
| Housing | 15.4 | 42.3 | 13.5 | 19.2 | 9.6 | 100.0 |
| Fear | 15.7 | 28.3 | 23.9 | 23.9 | 8.2 | 100.0 |
| Health | 19.4 | 29.0 | 21.5 | 20.4 | 9.7 | 100.0 |
| Work | 23.2 | 40.4 | 19.2 | 9.3 | 7.9 | 100.0 |
| Total | 20.3 | 33.7 | 20.2 | 17.9 | 7.9 | 100.0 |
4) Basic statistical tools: Visualization, Characteristic words.
Row-profiles of the same table
![Page 55: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/55.jpg)
| Words | No degree | Elem. Sch. | Trade Sch. | High Sch. | College | Total |
|---|---|---|---|---|---|---|
| Money | 15.8 | 11.9 | 9.9 | 10.2 | 13.6 | 12.1 |
| Future | 16.4 | 16.8 | 24.2 | 26.3 | 17.6 | 20.0 |
| Unemployment | 22.0 | 20.7 | 15.5 | 14.0 | 8.8 | 17.8 |
| Decision | 0.3 | 1.3 | 1.6 | 1.8 | 3.2 | 1.4 |
| Difficult | 2.2 | 2.0 | 1.2 | 1.1 | 1.6 | 1.7 |
| Economic | 2.2 | 2.4 | 3.7 | 3.9 | 8.8 | 3.4 |
| Selfishness | 6.5 | 6.9 | 4.3 | 9.1 | 7.2 | 6.7 |
| Occupation | 3.7 | 6.5 | 5.9 | 2.1 | 5.6 | 5.0 |
| Finances | 3.1 | 1.3 | 2.2 | 1.1 | 0.8 | 1.8 |
| War | 1.2 | 1.3 | 2.2 | 2.1 | 1.6 | 1.6 |
| Housing | 2.5 | 4.1 | 2.2 | 3.5 | 4.0 | 3.3 |
| Fear | 7.7 | 8.4 | 11.8 | 13.3 | 10.4 | 10.0 |
| Health | 5.6 | 5.0 | 6.2 | 6.7 | 7.2 | 5.8 |
| Work | 10.8 | 11.4 | 9.0 | 4.9 | 9.6 | 9.5 |
| Total | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
4) Basic statistical tools: Visualization, Characteristic words.
Column-profiles of the same table
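The row-profiles and column-profiles are simple percentages of the counts in the contingency table. A minimal sketch in plain Python, using the counts from the words by education table (variable names are illustrative):

```python
words = ["Money", "Future", "Unemployment", "Decision", "Difficult",
         "Economic", "Selfishness", "Occupation", "Finances", "War",
         "Housing", "Fear", "Health", "Work"]
levels = ["No degree", "Elem. Sch.", "Trade Sch.", "High Sch.", "College"]
counts = [
    [51, 64, 32, 29, 17], [53, 90, 78, 75, 22], [71, 111, 50, 40, 11],
    [1, 7, 5, 5, 4], [7, 11, 4, 3, 2], [7, 13, 12, 11, 11],
    [21, 37, 14, 26, 9], [12, 35, 19, 6, 7], [10, 7, 7, 3, 1],
    [4, 7, 7, 6, 2], [8, 22, 7, 10, 5], [25, 45, 38, 38, 13],
    [18, 27, 20, 19, 9], [35, 61, 29, 14, 12],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Row-profile: each count as a percentage of its row total.
row_profiles = [[100 * k / rt for k in row]
                for row, rt in zip(counts, row_totals)]
# Column-profile: each count as a percentage of its column total.
col_profiles = [[100 * k / ct for k, ct in zip(row, col_totals)]
                for row in counts]

for word, profile in zip(words, row_profiles):
    print(f"{word:<13}", " ".join(f"{x:5.1f}" for x in profile))
```

A row-profile answers "who uses this word?", while a column-profile answers "which words does this group use?"; correspondence analysis treats the two views symmetrically.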
![Page 56: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/56.jpg)
Symmetry of the two spaces: rows and columns
[Figure: the contingency table F = (n, p), with general term f_ij, gives n row-profile points in R^p and p column-profile points in R^n.]
4) Basic statistical tools: Visualization, Characteristic words.
![Page 57: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/57.jpg)
[Figure: first principal plane of the correspondence analysis of the words × education-level table. Axis 1 (57%), Axis 2 (21%). Points: the five education levels (No degree, Elem., Trade, High, College) and the 14 words.]
4) Basic statistical tools: Visualization, Characteristic words.
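One common way to compute such a map is through the singular value decomposition of the standardized residuals of the table. The sketch below assumes NumPy and uses the counts of the words by education table; it is not necessarily the exact procedure used to draw the slide, and the variable names are illustrative. The slide reports 57% and 21% of inertia for the first two axes, against which the printed percentages can be compared.

```python
import numpy as np

# Words x education-level counts (rows: Money ... Work).
N = np.array([
    [51, 64, 32, 29, 17], [53, 90, 78, 75, 22], [71, 111, 50, 40, 11],
    [1, 7, 5, 5, 4], [7, 11, 4, 3, 2], [7, 13, 12, 11, 11],
    [21, 37, 14, 26, 9], [12, 35, 19, 6, 7], [10, 7, 7, 3, 1],
    [4, 7, 7, 6, 2], [8, 22, 7, 10, 5], [25, 45, 38, 38, 13],
    [18, 27, 20, 19, 9], [35, 61, 29, 14, 12],
], dtype=float)

P = N / N.sum()          # relative frequencies f_ij
r = P.sum(axis=1)        # row masses
c = P.sum(axis=0)        # column masses

# Standardized residuals from the independence model r c'.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)

n_axes = min(N.shape) - 1            # at most 4 non-trivial axes here
eigenvalues = (s ** 2)[:n_axes]
percent_inertia = 100 * eigenvalues / eigenvalues.sum()

# Principal coordinates of words (rows) and education levels (columns).
row_coords = (U[:, :n_axes] * s[:n_axes]) / np.sqrt(r)[:, None]
col_coords = (Vt.T[:, :n_axes] * s[:n_axes]) / np.sqrt(c)[:, None]

print(np.round(percent_inertia, 1))
```

The sum of the eigenvalues equals the chi-square statistic of the table divided by the grand total, so the percentages describe how the departure from independence is shared among the axes.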
![Page 58: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/58.jpg)
Characteristic elements (words, lemmas, segments)
The corpus contains several parts (categories of respondents).
Notations:
• k_ij: sub-frequency of word i in part j of the corpus;
• k_i.: frequency of word i in the whole corpus;
• k_.j: frequency (size) of part j;
• k..: size of the corpus (or, simply, k).
We are interested in the statistical significance of the sub-frequency k_ij.
Is word i abnormally frequent in part j? Is it abnormally rare?
The comparison between the relative frequency of word i in part j and the relative frequency of word i in the entire corpus leads to a classical statistical test using either the hypergeometric distribution or its normal approximation.
4) Basic statistical tools: Visualization, Characteristic words.
![Page 59: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/59.jpg)
The 4 parameters for computing characteristic elements
[Diagram: a table of words (rows) by text parts (columns), with the four quantities below marked.]
• k_ij: frequency of word in text part
• k_i.: frequency of word in corpus
• k_.j: size of text part
• k..: size of corpus
4) Basic statistical tools: Visualization, Characteristic words.
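The test built on these four parameters can be sketched with the standard library alone. The example below is an illustration under stated assumptions: it treats the small words by education table as the whole corpus (so k.. = 1592 is the total of the 14 selected words, not a real corpus size) and asks whether "Unemployment" is abnormally frequent among respondents with no degree. The function names are hypothetical.

```python
from math import comb, sqrt

def upper_tail(k_ij, k_i, k_j, k):
    """Exact P(X >= k_ij) for a hypergeometric X: k_j tokens drawn without
    replacement from k tokens, k_i of which are the word of interest."""
    top = min(k_i, k_j)
    favourable = sum(comb(k_i, x) * comb(k - k_i, k_j - x)
                     for x in range(k_ij, top + 1))
    return favourable / comb(k, k_j)

def normal_z(k_ij, k_i, k_j, k):
    """Standardized deviation under the normal approximation."""
    mean = k_j * k_i / k
    var = k_j * (k_i / k) * (1 - k_i / k) * (k - k_j) / (k - 1)
    return (k_ij - mean) / sqrt(var)

# "Unemployment" among "No degree" respondents, from the contingency table:
# k_ij = 71, k_i. = 283, k_.j = 323, k.. = 1592.
p_value = upper_tail(71, 283, 323, 1592)
z = normal_z(71, 283, 323, 1592)
print(f"P(X >= 71) = {p_value:.4f}, z = {z:.2f}")
```

A small upper-tail probability (or a large positive z) flags the word as characteristic of the part; the lower tail, computed the same way, flags abnormally rare words.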
![Page 60: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/60.jpg)
![Page 61: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/61.jpg)
The next two slides show the principal plane produced by a correspondence analysis of the previous lexical contingency table (section 3).
Proximity between 2 category-points (columns) means similarity of lexical profiles of the 2 categories.
Proximity between 2 word-points (rows) means similarity of lexical profiles of these words.
5) Applications: Open questions, sample surveys, texts
Example 1: Open Question « Life » (International sample surveys)
The following open-ended question was asked:
"What is the single most important thing in life for you?" It was followed by the probe: "What other things are very important to you?".
![Page 62: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/62.jpg)
[Figure: correspondence analysis plane showing the words of the responses (e.g. security, peace, family, happiness, job, money, friends) together with the nine categories E1-AGE1 to E3-AGE3.]
Correspondence Words - Categories
Example 1 (« Life » question)
![Page 63: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/63.jpg)
[Figure: location of segments (peace of mind; a good standard of living; welfare of my family; happiness, good health; law and order; a good job; friends and family; having enough money to live; can't think of anything else; peace in the world; a nice home; I don't know) relative to the nine categories E1-AGE1 to E3-AGE3.]
Location of Segments
Example 1 (« Life » question)
![Page 64: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/64.jpg)
Example 1 (« Life » question): Characteristic words

H-30 = -30 * high (Young, High education)

| Rank | Word | %W | %glob | Fr.W | Fr.glob | TestValue |
|---|---|---|---|---|---|---|
| 1 | friends | 2.87 | 1.11 | 17 | 116 | 3.44 |
| 2 | do | 1.35 | 0.45 | 8 | 47 | 2.60 |
| 3 | want | 1.01 | 0.30 | 6 | 31 | 2.44 |
| 4 | being | 2.19 | 1.11 | 13 | 116 | 2.18 |
| 5 | job | 2.53 | 1.36 | 15 | 142 | 2.16 |
| 6 | having | 1.52 | 0.67 | 9 | 70 | 2.11 |
| 7 | things | 0.84 | 0.27 | 5 | 28 | 2.06 |
| 2 | wife | 0.00 | 0.65 | 0 | 68 | -2.10 |
| 1 | health | 2.70 | 5.85 | 16 | 609 | -3.59 |

H+55 = +55 * high (Older, High education)

| Rank | Word | %W | %glob | Fr.W | Fr.glob | TestValue |
|---|---|---|---|---|---|---|
| 1 | mind | 2.55 | 0.45 | 5 | 47 | 2.91 |
| 2 | welfare | 1.53 | 0.21 | 3 | 22 | 2.42 |
| 3 | peace | 2.55 | 0.74 | 5 | 77 | 2.17 |
5) Applications: Open questions, sample surveys, texts
![Page 65: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/65.jpg)
Category 7: Less than 30 years, high level of education
1.33 - 1: friends, friends, my home life
1.12 - 2: being content having enough money to do what you want to do, within reason, having good friends, having a fulfilling job to do, having some idea of what you want to do and the freedom to choose, protection of the environment
1.05 - 3: to have good friends around having a good job, living in a good area, having lots of freedom to do the things you want to do
.93 - 4: good living education, good job, money
Category 9: Over 55 years, high level of education
.97 - 1: togetherness, peace of mind, good health, religion, no
.64 - 2: not to die, peace of mind, don't like people living envious of each other
.63 - 3: peace of mind good health, happiness, enough money to keep a standard of living
.38 - 4: welfare of my family work, satisfaction, good health, travel
Example 1 (« Life » question) : Modal Responses
5) Applications: Open questions, sample surveys, texts
![Page 66: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/66.jpg)
Example 1: Similar survey in Japan (Same open question, same categories of respondents)
5) Applications: Open questions, sample surveys, texts
![Page 67: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/67.jpg)
Example 1: Similar survey in Japan: visualization of the characteristic words for 2 categories
5) Applications: Open questions, sample surveys, texts
![Page 68: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/68.jpg)
Example 2: Open Questions / Copy-Test
5) Applications: Open questions, sample surveys, texts
![Page 69: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/69.jpg)
Examples of 2 characteristic responses for 4 categories
TEXT 1: PWNB = Prob.w.n.buy
-- 1 to tell you about how long people have eaten them.
-- 1 the complex carbohydrate that are in this cereal.
-- 1 the people who eat this cereal and the product. that's all.
-- 2 it's supposed to be healthy, it has good carbohydrates in it.
TEXT 2: Hesi = Hesitates
-- 1 it gives you energy in the morning. nothing else.
-- 2 grape nuts cereal gives you energy
-- 2 it has complex carbohydrates. they showed the man eating it with strawberries and bananas.
TEXT 3: PWB = Probably would buy
-- 1 it's nutritious for you. nothing else.
-- 2 that,is good for you, that,s all it said to me
TEXT 4: DWB = Definitely would buy
-- 1 they are bigger nuggets. low in carbohydrates, that's all.
-- 2 it has nutty flavor, it is nutritious
Example 2: Open Questions / Copy-Test
5) Applications: Open questions, sample surveys, texts
![Page 70: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/70.jpg)
Example 3: International survey (Tokyo Gas Company). A survey in three cities (Tokyo, New York, Paris) about dietary habits. Open question: "What dishes do you like and eat often?"
New York: First principal plane. Table crossing words and age x gender categories
5) Applications: Open questions, sample surveys, texts
![Page 71: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/71.jpg)
New York: First principal plane. Example of confidence areas for categories (Bootstrap)
Example 3: International survey (continuation). Question: "What dishes do you like and eat often?"
5) Applications: Open questions, sample surveys, texts
![Page 72: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/72.jpg)
New York: First principal plane. Example of confidence areas for words (Bootstrap)
Example 3: International survey (continuation). Question: "What dishes do you like and eat often?"
5) Applications: Open questions, sample surveys, texts
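The slides do not say which bootstrap variant produced these confidence areas. The sketch below assumes one common approach: replicate the contingency table by redrawing its occurrences, then place the replicated column profiles on the axes of the original analysis as supplementary points. The counts are hypothetical, as are the names `project_columns`, `replicates` and `spread`; NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical words x categories counts, for illustration only.
N = np.array([[30, 10, 5],
              [12, 25, 9],
              [8, 14, 28],
              [20, 18, 16]], dtype=float)
n = int(N.sum())
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)

# Reference correspondence analysis of the observed table.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, s, Vt = np.linalg.svd(S, full_matrices=False)
axes = 2
row_std = U[:, :axes] / np.sqrt(r)[:, None]   # standard coordinates of words

def project_columns(table):
    """Place the column profiles of `table` on the reference axes
    (transition formula: weighted average of the word coordinates)."""
    profiles = table / table.sum(axis=0)
    return profiles.T @ row_std

observed = project_columns(N)

# Replicated tables: redraw the n occurrences from the observed proportions,
# then project each replicate onto the fixed reference axes.
replicates = np.array([
    project_columns(
        rng.multinomial(n, P.ravel()).reshape(N.shape).astype(float))
    for _ in range(200)
])
spread = replicates.std(axis=0)   # one value per category and per axis

print(np.round(observed, 3))
print(np.round(spread, 3))
```

The cloud of replicated points around each category can then be summarized by an ellipse or a convex hull; categories whose areas overlap cannot be confidently distinguished on the plane.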
![Page 73: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/73.jpg)
New York: First principal plane. Example of Kohonen Map (Self Organizing map).
Example 3: International survey (continuation). Question: "What dishes do you like and eat often?"
5) Applications: Open questions, sample surveys, texts
![Page 74: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/74.jpg)
[Figure: First Principal Plane, WINES & MARKS (marks and comments active). Axis 1: 3.52%, Axis 2: 1.75%. Points: marks from 78 to 97, wine categories (Tinto joven, Tinto roble, Tinto crianza, Tinto reserva, Gran Reserva) and individual wines such as Vega Sicilia 'Único' (94), Termanthia (02) and Torondos (02); a quality axis is drawn.]
Example 4: Comments about 522 Spanish wines.
5) Applications: Open questions, sample surveys, texts
![Page 75: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/75.jpg)
[Figure: Spanish words from the tasting comments (e.g. herbáceo, rústico, frutal, complejo, potente, magnífico) positioned along the mark scale, from lowest marks to highest marks. Average mark: 85.16.]
Example 4: Comments about 522 Spanish wines (continuation)
5) Applications: Open questions, sample surveys, texts
![Page 76: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/76.jpg)
[Figure: supplementary variables on the principal plane (Axis 1, Axis 2): price ranges from 0-4,9€ to 100-300€, shown with the wine categories (Tinto joven to Gran Reserva), marks (78 to 97) and individual wines.]
Example 4: Comments about 522 Spanish wines (continuation)
5) Applications: Open questions, sample surveys, texts
![Page 77: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/77.jpg)
---- Wine 212 (mark=85) Legaris-2001
Tuestes, gominolas y buenos balsámicos marcan la intensidad media frutal de este crianza. En boca aparece muy lineal, con consistencia media; el retrogusto frutal todavía tapado por una madera algo rústica.
---- Wine 30 (mark=91) Tares P3-2001 premium
Mucho terruño se detecta en el bouquet de este gran tinto; pólvora, sílex, pizarra, cascajo caliente con el contraste de tierra húmeda y mucha fruta madura de hueso. concentrado, tacto graso sobre el paladar; impresionante viscosidad en la lengua, otra vez impresiones de tierra húmeda y pólvora en el largo final.
---- Wine 314 (mark=97) Vega Sicilia 'Único'-1994
Hay que realizar un ejercicio de disciplina gustativa de primer rango para describir este gran vino. el bouquet es fresco, bien armado de fruta roja que se ve potenciada por tintes de chocolates, tabacos, notas de sotobosque y una madera que se manifiesta pero que resulta difícil de localizar y menos de concretar. Tenemos el caso raro de un tinto que sale ileso del paso del tiempo sin lucir su armadura, que es la barrica. En boca joven, aunque ya tiene su cuerpo vigoroso y enérgico bastante ensamblado, con la excepción de algunos taninos saltamontes que quedan para domesticar. Largo y vibrante final que mezcla madurez con una notable finura fresca.
Example 4: Comments about 522 Spanish wines (continuation)
5) Applications: Open questions, sample surveys, texts
![Page 78: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/78.jpg)
![Page 79: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/79.jpg)
Processing Strategy
A priori Grouping (Lexical contingency table)
Juxtaposition of Lexical contingency tables
Instrumental Partition
Direct Analysis of the sparse Lexical table
6) About textual data in general
![Page 80: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/80.jpg)
Importance of Meta-data
Textual data
Grammar / Syntax
Meta-data
linguistics
Semantic networks
External corpora
Other a priori structures sociolinguistics,
chronology, etc.
6) About textual data in general
![Page 81: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/81.jpg)
The four phases of a linguistic analysis
Morphology
Syntax
Semantics
Pragmatics
A big flower
A bag flower
A bug flower
A bog flower
(A bxg flower)
The spoon speaks (The speaks)
A man thinks (A stone thinks)
A challenge for A.I.
6) About textual data in general
![Page 82: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/82.jpg)
Homography, Polysemy, Synonymy
Homographs: BORE
• To bear
• A tedious person
• To bore
Polysemous words: DUTY, DRUG
• DUTY: task, tax
• DRUG: medicine, addictive product
6) About textual data in general
![Page 83: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/83.jpg)
Semantic content of a lexical profile
Distributional linguistics (Z. Harris)
• X is sometimes purring
• X mews
• X has whiskers
• X likes milk
• X likes chasing mice
In the end, the point « X » will coincide with the point « CAT ».
6) About textual data in general
![Page 84: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/84.jpg)
Semantic similarity is not a transitive relationship
Example of semantic chains:
(1) calm–wisdom–discretion–wariness–fear–panic,
(2) fact–feature–aspect–appearance–illusion.
6) About textual data in general
![Page 85: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/85.jpg)
New additional variables
Nouns (proper, common)
Verbs (auxiliary, modal, usual…)
Adjective
Pronoun
Determiner
Adverb
Preposition
Conjunction
6) About textual data in general
![Page 86: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/86.jpg)
new variables, new metrics
6) About textual data in general
![Page 87: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/87.jpg)
![Page 88: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/88.jpg)
7) Conclusions
For each open-ended question,
and for each partition of the sample of respondents, we obtain, without any preliminary coding or other intervention:
• A visualization of proximities between words and categories.
• Characteristic elements or words for each category.
• Modal responses for each category (a kind of automatic summary).
[Remember also that the open question “Why” following a closed question provides an indispensable assessment of the real understanding of the question].
As a conclusion...
![Page 89: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/89.jpg)
7) Conclusions
All this processing is carried out under the supervision of robust assessment procedures:
- Non-parametric statistical tests,
- Bootstrap validation.
We are not dealing here with novel, sophisticated modelling.
It is rather a painstaking effort to stick to the real concerns of the respondent, i.e. the customer, the user, the client.
With the rapid development of online surveys and the spread of e-mail and blogs, the presented set of tools is expected to be a noteworthy component in a new methodology of customer knowledge.
As a conclusion... (continuation)
![Page 90: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/90.jpg)
- Akuto H. (1992). International Comparison of Dietary Culture. Nihon Keizai Simbun, Tokyo.
- Bécue M., Lebart L. (1996). Clustering of texts using semantic graphs. Application to open-ended questions in surveys. Proceedings of the IFCS 96 Symposium, Kobe, Springer Verlag, Tokyo (in press).
- Bécue-Bertaut M., Pagès J., Alvarez-Esteban R., Vásquez Burguete J.L. (2006). Détermination d’une note globale, synthèse d’une évaluation numérique et d’appréciations libres. Application aux études de marché (in French). Actes des JADT-2006. http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2006/tocJADT2006.htm
- Bécue-Bertaut M., Álvarez-Esteban R., Pagès J. (2008). Rating of products through scores and free-text assertions. Comparing and combining both. Food Quality and Preference, 19/1, 122-134.
- Belson W.A., Duncan J.A. (1962). A Comparison of the check-list and the open response questioning system. Applied Statistics, 2, 120-132.
- Benzécri J.-P. (1992). Correspondence Analysis Handbook. Marcel Dekker, New York.
- Biber D. (1995). Dimensions of register variation. Cambridge Univ. Press, Cambridge.
- Bradburn N., Sudman S., and associates (1979). Improving Interview Method and Questionnaire Design. Jossey Bass, San Francisco.
- Deerwester S., Dumais S.T., Furnas G.W., Landauer T.K., Harshman R. (1990). Indexing by latent semantic analysis. J. of the Amer. Soc. for Information Science, 41 (6), 391-407.
- Habert B., Nazarenko A., Salem A. (1997). Les linguistiques de corpus. Armand Colin, Paris.
- Hayashi C., Suzuki T., Sasaki M. (1992). Data Analysis for Social Comparative Research: International Perspective. North-Holland, Amsterdam.
- Lebart L. (1982). Exploratory analysis of large sparse matrices, with application to textual data. COMPSTAT, Physica Verlag, 67-76.
- Lebart L., Salem A., Bécue M. (2000). Análisis estadístico de textos. Editorial Milenio, Lleida.
- Lebart L., Salem A., Berry E. (1998). Exploring Textual Data. Kluwer, Dordrecht.
- Lebart L., Morineau A., Warwick K. (1984). Multivariate Descriptive Statistical Analysis. John Wiley, N.Y.
- Ritter H., Kohonen T. (1989). Self Organizing Semantic Maps. Biol. Cybern. 61, 241-254.
- Salem A. (1984). La typologie des segments répétés dans un corpus, fondée sur l'analyse d'un tableau croisant mots et textes. Cahiers de l'Analyse des Données, 489-500.
- Sasaki M., Suzuki T. (1989). New directions in the study of general social attitudes: trends and cross-national perspectives. Behaviormetrika, 26, 9-30.
- Schuman H., Presser S. (1981). Questions and Answers in Attitude Surveys. Academic Press, New York.
- Sudman S., Bradburn N. (1974). Response Effects in Surveys. Aldine, Chicago.
7) Conclusions – Short Bibliography
![Page 91: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/91.jpg)
Surveys data and software (DtmVic)
can be downloaded from
www.dtm-vic.com
![Page 92: 4 text mining and open ended questions in sample surveys ludovic lebart cnrs](https://reader033.vdocuments.us/reader033/viewer/2022051610/54820272b07959600c8b46a7/html5/thumbnails/92.jpg)
Merci
Thank You
Gracias
Grazie
Obrigado