Coding of textual answers

Coding is generally needed when the survey questionnaire contains textual variables that refer to official classifications, which allow national and/or international data comparability. Examples of such variables are Economic Activity (NACE), Occupation, Education, and Places (of birth, of residence, etc.).

Coding means assigning a unique code to a textual answer according to a classification scheme. The level of detail of the assigned code depends on the survey aims and/or the dissemination needs. Coding can be performed manually or through automated systems. Manual coding can be performed only at the end of the data collection phase, whereas automated systems can be run during or after data collection: in the first case the process is called assisted coding (on-line coding), in the second case automated coding (batch coding).

With reference to GSBPM, coding belongs to sub-process 5.2 “Classify and code” of Phase 5 “Process”, which includes the activities needed to make data ready for analysis (Phase 6 “Analyse”). With assisted coding, some of the activities of sub-process 5.2 can start before Phase 4 “Collect” ends, improving the timeliness of data delivery.

Coding is, in general, one of the most demanding activities of the survey process. When performed manually it is also difficult to standardise, because the results depend heavily on the coders. Even when coders are well trained in the criteria and principles of each official classification, coding is influenced by each coder's cognitive process, which may lead to different (subjective) interpretations and, therefore, to different codes for the same textual answer.

The use of specialised coding software can save considerable time and resources and also supports a higher level of standardisation of the coding process, increasing the expected quality of the coding results.

A first distinction between methods/tools can be made based on the mode and phase of the process in which coding should take place:

Assisted coding: the method/tool provides interactive support to the coder/respondent, facilitating navigation within the reference classification. In this case, the coding process can already take place during the data collection phase;
Automated coding: a file containing the full set of collected textual responses is analysed in batch, after data collection.
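
As a rough illustration of the interactive support an assisted coding tool might offer, the sketch below suggests classification entries while a coder or respondent is typing. The codes and labels in CLASSIFICATION are placeholders invented for the example, not entries of any official classification, and the function name suggest is likewise hypothetical.

```python
# Placeholder classification: codes and labels are invented for illustration only.
CLASSIFICATION = {
    "C001": "Manufacture of bread",
    "C002": "Primary education",
    "C003": "Taxi operation",
}


def suggest(partial, classification=CLASSIFICATION, limit=5):
    """Return (code, label) pairs matching the text typed so far.

    An entry matches when every typed token is the prefix of some word
    in its label, so word order does not matter.
    """
    tokens = partial.lower().split()
    hits = []
    for code, label in classification.items():
        words = label.lower().split()
        if tokens and all(any(w.startswith(t) for w in words) for t in tokens):
            hits.append((code, label))
    return hits[:limit]


if __name__ == "__main__":
    print(suggest("bre"))
    print(suggest("edu prim"))
```

A real tool would typically search a much larger dictionary and rank the candidates; here the matches are simply returned in dictionary order.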

The main methods currently used for coding textual descriptions are:

Rule- or dictionary-based methods;
Supervised and semi-supervised machine learning;
Deep learning and advanced language models.

Rule- or dictionary-based methods use predefined lists of keywords or conceptual categories to classify texts. They are relatively simple to implement but less flexible than more advanced methods. The use of dictionaries requires a non-negligible effort to build the information base, namely the electronic version of the dictionary related to the official manual of the reference classification.
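
A minimal sketch of dictionary-based coding is given below, assuming a small hand-built dictionary. The entries and the codes (C001, C002, C003) are placeholders, not taken from any official manual, and the names normalise and code_answer are hypothetical. The answer is first normalised, then the longest dictionary entry found in it determines the code.

```python
import re
import unicodedata

# Hypothetical dictionary: normalised description -> placeholder code.
DICTIONARY = {
    "bakery": "C001",
    "bread production": "C001",
    "primary school teacher": "C002",
    "taxi driver": "C003",
}


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def code_answer(answer, dictionary=DICTIONARY):
    """Return the code of the longest dictionary entry found in the answer.

    Entries are matched on whole words; None means the answer could not
    be coded and would need manual treatment.
    """
    padded = f" {normalise(answer)} "
    matches = [entry for entry in dictionary if f" {entry} " in padded]
    if not matches:
        return None
    return dictionary[max(matches, key=len)]


if __name__ == "__main__":
    print(code_answer("Taxi-driver, night shift"))
    print(code_answer("Astronaut"))
```

Answers that return None illustrate the limited flexibility of this approach: any wording not foreseen in the dictionary remains uncoded until the dictionary is extended.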

Supervised machine learning methods involve training algorithms on a set of texts that have already been correctly coded. The model learns the linguistic patterns associated with each category and can then automatically classify new texts. Semi-supervised methods, on the other hand, combine a small amount of already coded textual responses with large volumes of uncoded texts.
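
The supervised approach can be sketched with a small multinomial naive Bayes classifier, chosen here purely for illustration; the text does not prescribe a specific algorithm. The training answers and the codes C001 and C002 are invented placeholders, and the class name NaiveBayesCoder is hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split an answer into lower-case word tokens."""
    return text.lower().split()


class NaiveBayesCoder:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, codes):
        self.word_counts = defaultdict(Counter)  # code -> word frequencies
        self.code_counts = Counter(codes)        # code -> number of answers
        self.vocab = set()
        for text, code in zip(texts, codes):
            tokens = tokenize(text)
            self.word_counts[code].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total = sum(self.code_counts.values())
        best_code, best_score = None, -math.inf
        for code, n in self.code_counts.items():
            # log prior + sum of smoothed log likelihoods of each token
            score = math.log(n / total)
            denom = sum(self.word_counts[code].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[code][tok] + 1) / denom)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Invented training set of already coded answers (placeholder codes).
TRAIN_TEXTS = [
    "bakes bread and cakes",
    "sells bread in a bakery",
    "teaches children at school",
    "school teacher of mathematics",
]
TRAIN_CODES = ["C001", "C001", "C002", "C002"]

if __name__ == "__main__":
    coder = NaiveBayesCoder().fit(TRAIN_TEXTS, TRAIN_CODES)
    print(coder.predict("bread bakery"))
    print(coder.predict("mathematics teacher"))
```

In practice the training set would contain many coded answers per category. A semi-supervised variant could, for instance, add the model's own confident predictions on uncoded answers to the training set and refit, although that step is not shown here.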

Deep learning techniques and modern language models make it possible to represent text as numerical vectors that capture the contextual meaning of words. These models are capable of identifying complex semantic nuances and offer high performance in classification tasks, especially when large amounts of data are available.
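
Once texts are represented as vectors, one simple way to use them for coding is to assign each answer to the category whose vector is closest by cosine similarity. The sketch below shows only that last step. The three-dimensional vectors are hand-written stand-ins for what a language model would produce (real embeddings have hundreds of dimensions), and the codes and the name nearest_code are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hand-written stand-ins for the embeddings of two category descriptions.
CATEGORY_VECTORS = {
    "C001": [0.9, 0.1, 0.0],
    "C002": [0.1, 0.9, 0.2],
}


def nearest_code(answer_vector, category_vectors=CATEGORY_VECTORS):
    """Return the code whose category vector is most similar to the answer."""
    return max(
        category_vectors,
        key=lambda code: cosine(answer_vector, category_vectors[code]),
    )


if __name__ == "__main__":
    # Assumed embedding of a textual answer, also invented for the example.
    print(nearest_code([0.8, 0.2, 0.1]))
```

In a real system the vectors would come from a trained model, and the similarity score could also serve to flag low-confidence answers for manual review.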
