JADT 1998 Table of Contents

 

 

Quantitative Tools for Qualitative Analysis:

An Indicator for Segment Typicality

 

Victor Armony

Department of Political Science

University of British Columbia

1866 Main Mall, Vancouver, BC

V6T 1Z1 Canada

Jules Duchastel, Gilles Bourque

Département de sociologie

Université du Québec à Montréal

C.P. 8888, succ. centre-ville, Montréal, QC

H3C 3P8 Canada

 

Abstract

This paper presents a simple technique, based on the calculation of the Chi-square distance, for assessing the lexical representativeness of the textual segments retrieved by a given search pattern. The aim of this approach is to show how certain statistical tools can be articulated with a properly "qualitative" analysis. In this perspective, drawing on techniques for the selection of "modal responses", we propose a "typicality indicator" which stands halfway between the algorithmics of textual data and the interpretive aims of qualitative research.

1. Introduction

This paper provides an overview of the heuristic potential of quite simple techniques of quantitative textual analysis when they are articulated within a qualitative data analysis approach. Specifically, we lay out a word-based, formalized procedure that can sharpen our insight into political discourse in the preliminary phases of research. Our goal is to show that a mechanical task involving some form of quantification can be very useful as an auxiliary tool in qualitative studies. An important aspect of any computer-aided analysis of textual data is the retrieval of chunks that satisfy a given query. We can, for example, instruct the program to display all the instances of a word or character string in their context, all segments carrying a given content code or containing a particular syntactic structure, and so on. As hits for the same query, all retrieved chunks then have something in common: a word, a code, or a pattern. Browsing through the search results, we try to make sense of regularities that become manifest in this specific subset of our data.

But, as in any collection of things, some elements can be said to be more typical than others with respect to a specific feature. How can we single out the best samples of the set? Can we tell which segments are good examples of the list of items retrieved by the computer? With this question in mind, we became interested in the possibility of measuring typicality in the choice of words made by the speaking subject in his or her utterances. This means characterizing each retrieved chunk by assessing how far its vocabulary stands from that of the whole subset created by the search request. To this end, we adopted a distance indicator that tells us to what extent each particular hit's vocabulary represents the entire set of hits.

A corpus of speeches by U.S. President Bill Clinton will serve as an example to illustrate the use of our distance indicator for textual segments. The database is composed of 48 radio addresses delivered by President Clinton from October 1993 to May 1994. In order to manipulate the database and apply our distance indicator, we use SATO, a software workbench which enables the user to articulate different operations into a flexible and interactive research design. Besides the standard features of a textbase manager and a code-and-retrieve program, this software system has the particular capacity of supporting the annotation of the text in its two dimensions, syntagmatic and paradigmatic, through a system of type-level and token-level, numeric or symbolic codes. This means that every lexical item, either in an index (as a type) or in the text itself (as a token) can be categorized on-line or with specialized, user-defined algorithms. The researcher can follow a specific trail (looking for certain terms, frequency ranges or syntactic constructions), and attach, modify or remove categories from words. Because of its lexical approach and categorization capabilities, SATO is an ideal environment for procedures such as the one we wish to present here.

2. Analyzing political discourse

Analyzing political discourse using word-based computerized techniques may seem rather far removed from what is generally called qualitative data analysis, for three apparent reasons: (1) the term discourse seems to refer to an approach focused on linguistic forms rather than content; (2) official speeches by professional politicians do not seem to display the "richness" and "puzzling and challenging nature" that Anselm Strauss (1987: 27) finds in data collected for qualitative analysis; (3) choosing words as units of analysis seems more suited to describing information transfers than to understanding situated communication events. We will briefly respond to each of these impressions.

First of all, we consider the expression "qualitative data" to be equivalent to "non-numerical data". We emphasize the fact that what we analyze are objects of language. This means that we share with all other approaches oriented to the study of textual materials the idea that such materials have to be seen as complex symbolic systems, with both form and content dimensions. If we employ the notion of discourse, it is to point out that the messages, whether written or spoken, are tied to a particular way of generating units of meaning.

Secondly, there are many traditions in discourse analysis. The Anglo-Saxon tradition favors everyday discourse and verbal interactions within localized situations, whereas the French school of discourse analysis studies mostly standardized public discourses (political, media, scientific, etc.). While the Anglo-Saxon tradition has generally offered research conclusions at the level of phenomenological and ethnographic considerations, the French tradition has tended to favor global interpretations of discursive and ideological formations. We situate our approach within this latter perspective, in which we find, for instance, the works of Michel Pêcheux and Michel Foucault.

Thirdly, we believe that the use of frequency counts and simple calculations does not necessarily amount to a quantification of the data. Recurrence in discourse can carry meaning, as can absence or variation. It is relevant to know, for example, whether a specific item (a word, an expression, a subject, a turn of phrase, etc.) is consistently used or avoided by a subject, or used and avoided by the same subject depending on the context. This can be observed through basic numeric operations performed by the computer, operations which are better seen as summarizing devices than as statistical analyses. We should bear in mind that formalization is not to be posed as a question of quantity against quality. Unless we adopt a hard constructionist position, both qualitative and quantitative approaches can entail, at least partially, formalized representations and procedures.

3. A distance indicator for textual segments

As we have argued, some quantitative procedures can be articulated with qualitative approaches without compromising the methodological integrity of the analytical process. It is now widely accepted that resorting to certain computerized algorithms more often than not enhances the researcher's ability to gain insight into the text. This applies to algorithms which perform repetitive tasks, such as the automatic retrieval of text chunks, but also to some summarizing operations. In this perspective we can think of tools which, by means of elementary mathematical calculations, give us useful information about the text we are reading, coding and interpreting. Word-frequency counts, for instance, may help us detect the central notions in a given corpus. The text itself is not replaced by a numeric matrix; numbers are rather seen as indicators of phenomena worth looking at. The questions to ask will be of the following kind: Why is this word so consistently used by this speaking subject? We could have noted this recurrence by simply going through the data, but the frequency count gives us a sharp image of relative proportions and directs our attention to contrasts and regularities we could otherwise have missed.
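A word-frequency count of this kind reduces to a few lines of code. The sketch below is purely illustrative: the sample tokens and the abbreviated stop-word list are invented, not drawn from the Clinton corpus.

```python
from collections import Counter

# Abbreviated stop-word list, for illustration only.
STOP = {"the", "a", "of", "and", "to", "in", "we", "will"}

def top_notions(tokens, n=5):
    """Most frequent non-stop-word tokens: a first pointer to central notions."""
    return Counter(t for t in tokens if t not in STOP).most_common(n)

tokens = "we will renew america and america will reward work".split()
top = top_notions(tokens, 2)
```

A count like this does not interpret anything by itself; it merely flags candidate notions ("america" here) whose contexts deserve a close reading.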

The procedure that we present here has been conceived as an auxiliary tool for the analysis of textual segments (sentences, paragraphs, answers to open-ended questions, etc.). Our goal is to illustrate one of the many possibilities of what we could call a hybrid approach, rather than an exclusively qualitative or quantitative one. Very often, in the exploratory phases as well as during the different steps of the interpretation process, the analyst carries out searches on the corpus in order to locate textual segments which satisfy a particular query. A query can focus on the presence (or absence) of certain words or codes, and it can have different levels of logical complexity. Let's take the simplest case: the concordance. We can define a concordance as the exhaustive list of all contexts in a corpus which satisfy a given query. For example, we can instruct the computer to retrieve all sentences in President Clinton's radio addresses from 1993 to 1994 in which the term America is used. We will thus obtain 114 segments. Usually, we will then read all of them, trying to extract the general meaning (or the variations in meaning), and select those which illustrate most clearly the semantic or thematic aspect we wish to emphasize. But what if we had a much bigger corpus and wanted to gain rapid access to its key notions?
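In our work this retrieval is performed within SATO; purely as an illustration, the concordance step can be sketched in a few lines of Python. The miniature two-address corpus and the naive sentence splitter below are hypothetical stand-ins, not the actual data or tokenizer.

```python
import re

# Hypothetical corpus: one string per radio address.
addresses = [
    "Good morning. America can do better.",
    "We worked for the people of America. America is strong.",
]

def concordance(texts, term):
    """Return every sentence, across all texts, containing `term`."""
    found = []
    for text in texts:
        # Naive splitter: break after sentence-final punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if re.search(rf"\b{re.escape(term)}\b", sentence):
                found.append(sentence)
    return found

hits = concordance(addresses, "America")
```

On the real corpus, the same query over the 48 radio addresses is what yields the 114 segments mentioned above.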

This problem has already been addressed by Internet search engines. The question there was: How can we go through an enormous mass of data and not only detect the documents matching a key-word array, but also offer the user some indication of the relevance of each hit? This struck us as an interesting concept to apply to textual data analysis. In the example above, it would be very useful to have a typicality rating for each sentence of the concordance. The basic idea is to find a way to automatically compare, in Bill Clinton's utterances, each sentence containing the word America to all other sentences containing the same word. This concept can easily be generalized: a typicality indicator can be used to return the list of retrieved items ranked by how well their vocabulary matches the request. We could then identify the modal sentences, that is to say, in our example, those which represent the best samples of the President's discourse on America.

To this end, we have adopted a technique based on the Chi-square distance (Lebart & Salem, 1994). The Chi-square distance measures the distance between the lexical profiles of two modalities i and i'. It is calculated with the following formula:

d(i, i') = sqrt( SUM_j [ (p_j(i) - p_j(i'))^2 / p_j(i') ] )

where p_j(i) is the relative frequency of word j in modality i. In our application, i is a retrieved segment and i' the whole set of retrieved segments: each word's contribution to the sum appears in column d of Figure 2, and the distance indicator is the square root of their total.
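SATO computes the indicator internally; the following Python sketch reproduces the calculation as laid out in Figure 2, applied to a toy set of hits (the three token lists are invented for illustration, not Clinton's actual sentences).

```python
import math
from collections import Counter

def chi2_distance(segment_tokens, subset_counts, subset_total):
    """Distance between one segment's word profile and the profile of the
    whole set of retrieved segments, as in Figure 2: each word contributes
    (p_seg - p_set)^2 / p_set (in percent), and the indicator is the
    square root of the sum of those contributions."""
    seg_counts = Counter(segment_tokens)
    seg_total = len(segment_tokens)
    total = 0.0
    for word, f in seg_counts.items():
        p_set = 100.0 * subset_counts[word] / subset_total  # word's % in the whole subset
        p_seg = 100.0 * f / seg_total                       # word's % in this segment
        total += (p_seg - p_set) ** 2 / p_set
    return math.sqrt(total)

# Illustrative data only: rank hits from most to least typical.
hits = [["america", "works", "hard"],
        ["america", "is", "strong"],
        ["crime", "rose", "in", "america"]]
subset_counts = Counter(w for h in hits for w in h)
subset_total = sum(subset_counts.values())
ranked = sorted(hits, key=lambda h: chi2_distance(h, subset_counts, subset_total))
```

Because every segment is drawn from the subset itself, every word it contains has a nonzero frequency in the subset, so the division is always defined.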

Figure 1 shows five selected sentences from the concordance of America, displayed in order from lowest to highest calculated distance. We must point out that we have found experimentally that unusually long segments tend to get exceptionally low distances and, conversely, unusually short segments tend to get exceptionally high ones. This bias is inherent in the Chi-square formula, and we are exploring other tests to bypass the problem.

Let's examine the first and last sentences of our example. The first one (# 9; Figure 2) can be said to be a high quality match, in terms of typicality, because the distance is relatively small. That means that most of the words in the sentence are rather usual words in Clinton's sentences which include the word America: our, we, work, better, children, etc.

The other sentence (# 107; Figure 2) will be considered a low quality match. Most of its words are not usual collocates of the word America in our corpus. Obviously, this is just a quick indication of general trends in the President's vocabulary. We do not expect qualitative analysts to rely on this sole indicator to assert the relevance of a given sentence. But we believe that this information might help in its identification as a more or less typical expression of the speaker's discourse.

We would like to underline an interesting phenomenon worthy of further scrutiny. In our example, it becomes quite clear that the sentences with a lower distance indicator contain more elaborate references to the notion conveyed by the word America. It would seem that detecting a recurring lexical context for a given term allows us to identify structured representations. Notice that, in effect, high quality matches show the word America as the semantic core of the utterance, while lower quality matches show it as an almost superfluous reference.

4. Conclusion

The Chi-square distance indicator that we propose is just one example of the various tools which, while mathematical in nature, may function as aids to qualitative readings. This particular indicator could be programmed into almost any existing qualitative software. The numeric value could even be translated into a qualitative scale such as fair match, poor match, etc. The basic idea is to build more "intelligent" search interfaces capable of providing supplementary information on hits. If we see a sentence which seems important to our interpretation, it is worth knowing that, for instance, its vocabulary is extremely atypical compared to other sentences of the same kind (containing a given concept, coded within the same node, etc.). We will not let the numeric indicator take over the analysis and determine what is relevant to our research, but we will be able to qualify the examples that we use to communicate our findings. In political discourse analysis, we often see demonstrations based on a selection of quotations which are supposed to illustrate the use of a given notion by a given speaker. Tools like the one we have put forward could contribute to the validity of descriptions, if not of interpretations.
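Such a translation into a qualitative scale could be as simple as thresholding the numeric value. The cutoffs below are arbitrary placeholders, not part of the original method; in practice they would be set from the distribution of distances in the concordance.

```python
def match_quality(distance, fair_cutoff=50.0, poor_cutoff=75.0):
    """Map the distance indicator to a coarse qualitative label.
    The cutoff values are illustrative only."""
    if distance <= fair_cutoff:
        return "fair match"
    if distance <= poor_cutoff:
        return "average match"
    return "poor match"

# Sentences # 9 and # 107 from Figure 1:
labels = [match_quality(d) for d in (38.7, 83.6)]
```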

 

Figure 1

Selected segments from the concordance of America
in President Clinton's Radio Addresses (1993-1994)

Ranking: 9 / Distance: 38.7 / Reference: 09/04/94
We'll think of the faith of our parents that was instilled in us here in America -- the idea that if you work hard and play by the rules, you'll be rewarded with a good life for yourself and a better chance for your children.

Ranking: 36 / Distance: 49.2 / Reference: 12/19/93
For America, that means we can find new customers in Mexico, and that in turn means more jobs here at home.

Ranking: 57 / Distance: 56.3 / Reference: 08/07/93
We've also changed the environmental policies of this administration so that once again America is a leader, not a follower in the effort to preserve the global environment and our environmental issues here at home.

Ranking: 79 / Distance: 62.9 / Reference: 07/31/93
Under this plan, more than 90 percent of the small businesses in America will actually be eligible for a reduction in their taxes.

Ranking: 107 / Distance: 83.6 / Reference: 08/14/93
There were 90,000 murders in America in the last four years, and a startling upsurge in gang activity, drive-by shootings and bloody carjackings.

Figure 2

Distance indicator calculation: an example of a high and a low quality match
(Sentences # 9 and # 107)

Columns: F = frequency in the whole concordance (3741 tokens); f = frequency in the sentence; %F and %f = the corresponding percentages; d = the word's contribution to the distance. The Total line gives the distance indicator, i.e. the square root of the sum of the d column.

Sentence # 9 (high quality match)

Words           F    f     %F     %f        d
and           149    2   3.98   4.08     0.00
our            54    1   1.44   2.04     0.25
.             114    1   3.05   2.04     0.33
America       115    1   3.07   2.04     0.35
the           171    3   4.57   6.12     0.53
in            103    2   2.75   4.08     0.64
of            102    2   2.73   4.08     0.67
,             189    1   5.05   2.04     1.79
we'            29    1   0.78   2.04     2.07
by             26    1   0.70   2.04     2.61
with           23    1   0.61   2.04     3.31
a              58    2   1.55   4.08     4.13
--             17    1   0.45   2.04     5.54
work           16    1   0.43   2.04     6.08
that           48    2   1.28   4.08     6.10
for            47    2   1.26   4.08     6.35
be             15    1   0.40   2.04     6.71
if             12    1   0.32   2.04     9.22
better         11    1   0.29   2.04    10.38
here           10    1   0.27   2.04    11.77
hard            9    1   0.24   2.04    13.47
children        8    1   0.21   2.04    15.61
us              7    1   0.19   2.04    18.36
you            22    2   0.59   4.08    20.75
faith           4    1   0.11   2.04    34.98
life            4    1   0.11   2.04    34.98
your            4    1   0.11   2.04    34.98
rewarded        3    1   0.08   2.04    47.94
rules           3    1   0.08   2.04    47.94
chance          2    1   0.05   2.04    73.88
parents         2    1   0.05   2.04    73.88
think           2    1   0.05   2.04    73.88
was             2    1   0.05   2.04    73.88
ll              6    2   0.16   4.08    95.87
good            1    1   0.03   2.04   151.76
idea            1    1   0.03   2.04   151.76
instilled       1    1   0.03   2.04   151.76
play            1    1   0.03   2.04   151.76
yourself        1    1   0.03   2.04   151.76
Total        3741   49                  38.70

Sentence # 107 (low quality match)

Words           F    f     %F      %f        d
the           171    1   4.57    3.85     0.11
America       115    1   3.07    3.85     0.19
.             114    1   3.05    3.85     0.21
,             189    2   5.05    7.69     1.38
a              58    1   1.55    3.85     3.40
and           149    2   3.98    7.69     3.45
in            103    3   2.75   11.54    28.03
years           5    1   0.13    3.85   103.12
there           4    1   0.11    3.85   130.77
last            3    1   0.08    3.85   176.86
activity        1    1   0.03    3.85   545.74
bloody          1    1   0.03    3.85   545.74
carjackings     1    1   0.03    3.85   545.74
drive-by        1    1   0.03    3.85   545.74
four            1    1   0.03    3.85   545.74
gang            1    1   0.03    3.85   545.74
murders         1    1   0.03    3.85   545.74
shootings       1    1   0.03    3.85   545.74
startling       1    1   0.03    3.85   545.74
upsurge         1    1   0.03    3.85   545.74
were            1    1   0.03    3.85   545.74
90,000          1    1   0.03    3.85   545.74
Total        3741   26                   83.64

 

 

References

Armony, V., Duchastel, J., Bourque, G. (1996). Analyzing Political Discourse Using Both Qualitative and Quantitative Procedures. 4th International ISA Conference on Social Science Methodology, Colchester.

Duchastel, J., Armony, V. (1997). Computerized Strategies for Textual Analysis in Social Sciences. First International Workshop on Computational Semiotics, Paris.

Duchastel, J., Armony, V. (1996). Textual analysis in Canada: An interdisciplinary approach to qualitative data. Current Sociology, 44, 3, 259-278.

Lebart, L., Salem, A. (1994). Statistique textuelle. Paris: Dunod.

Strauss, A. (1987). Qualitative Analysis for Social Science. Cambridge: Cambridge University Press.
