SV-Ident 2022:

Survey Variable Identification in Social Science Publications

3rd Workshop on Scholarly Document Processing at COLING 2022

Overview

Social science literature often uses and references survey datasets. Typically, a survey dataset has hundreds of items or questions, called survey variables. Studies may focus on and reference only a specific subset of these variables. While survey datasets that are used in a publication are typically referenced explicitly in-text using a bibliographic citation, individual survey variables are often only referenced ambiguously (four examples are shown in Table 1; see the GitHub repository for more examples). This lack of explicit linking between individual survey variables and publications limits aggregate analyses and makes comparing different studies non-trivial. Automatic methods for detecting and linking survey variables can help address this problem. Initial baselines show promising results, but challenges remain.

| Reference Type | In-text Reference | Variable |
| --- | --- | --- |
| Self-containing reference | To test this, we analyzed data on the strength of individuals’ identification with their home town and its inhabitants from the German ALLBUS surveys. | Variable label: IDENTIFICATION WITH OWN COMMUNITY |
| Quotation of the variable text | There is only one item measuring happiness which directly asks the respondents: ‘If you were to consider your life in general these days, how happy or unhappy would you say you are, on the whole... | Variable question: If you were to consider your life in general these days, how happy or unhappy would you say you are, on the whole... |
| Paraphrase of the variable text | The second and the third questions come from the ISSP research, where respondents were asked about the influence of religious leaders on people’s votes and the government. | Variable question: How much do you agree or disagree with each of the following: Religious leaders should not try to influence how people vote in elections. |
| Negative polarity item | Victimization and fear of crime are dichotomous, with “1” indicating positive responses to either of the two following questions: “Have you been a victim of theft in the past 3 years?” and “Is there any place in the immediate vicinity in which you fear walking alone at night?” | Variable question: Is there any area in the immediate vicinity - I mean within a kilometer or so - where you would prefer not to walk alone at night? |

Table 1: Example survey variable mentions


Figure 1: Example of linked survey variable mentions in scientific publications

In this shared task, participants build a system that identifies all relevant variables in English and German social science publications, given a set of variables and sentences. The shared task is split into two sub-tasks:

Task 1 - Variable Detection: identifying whether a sentence contains a variable mention or not.

Task 2 - Variable Disambiguation: identifying which variable from a given vocabulary is mentioned in a sentence.
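To make the inputs and outputs of the two sub-tasks concrete, the following is a minimal sketch and not the organizers' baseline. It assumes a hypothetical vocabulary that maps variable IDs to their text, ranks variables by Jaccard token overlap (an illustrative stand-in for a real model), and derives the detection label from an assumed threshold of 0.1.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def rank_variables(sentence, vocabulary):
    """Task 2 sketch: rank variable IDs by Jaccard overlap, best match first."""
    sent = tokens(sentence)
    scored = []
    for var_id, var_text in vocabulary.items():
        var = tokens(var_text)
        union = sent | var
        score = len(sent & var) / len(union) if union else 0.0
        scored.append((var_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def detect(sentence, vocabulary, threshold=0.1):
    """Task 1 sketch: 1 if the best-matching variable passes the threshold."""
    ranking = rank_variables(sentence, vocabulary)
    return int(bool(ranking) and ranking[0][1] >= threshold)


# Hypothetical two-variable vocabulary (short IDs used for illustration).
vocabulary = {
    "v25": "Government responsibility: Provide a job for everyone who wants one",
    "v27": "Government responsibility: Provide health care for the sick",
}
mention = "Respondents favoured more government spending on health care."
no_mention = "Our results challenge much conventional wisdom."

ranking = rank_variables(mention, vocabulary)
print(ranking[0][0], detect(mention, vocabulary), detect(no_mention, vocabulary))
```

Lexical overlap handles quotations reasonably well but is likely to miss paraphrases, which is one reason the task remains challenging.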

We released the training dataset, which contains 4,248 instances for English and German survey variable mentions. We also released a trial dataset containing 1,227 instances.

Announcements

July 18, 2022: We have released the test data and are evaluating submissions via CodaLab

July 5, 2022: We have extended the registration deadline **until July 14th!** Register here to participate!

June 23, 2022: Registration is now open here

June 8, 2022: Training data is released

March 16, 2022: Trial data is released

November 22, 2021: Workshop is accepted

Contact

Please join our group svident2022@googlegroups.com to receive announcements and participate in discussions. For any other question, please contact Tornike Tsereteli or Yavuz Selim Kartal.

Important Dates

| Event | Date |
| --- | --- |
| Trial Set Release | March 16, 2022 |
| Training Set Release | June 6, 2022 |
| Deadline for Registration | July 4, 2022 |
| Test Set Release (Blind) | July 18, 2022 |
| System Runs Due | July 25, 2022 |
| Workshop Papers Due | August 15, 2022 |
| Camera-Ready Papers Due | September 5, 2022 |
| Workshop at COLING 2022 | October 16/17, 2022 |

Data

The dataset is made up of sentences with and without survey variable mentions (with their respective variable labels) and a vocabulary of survey variables. An example of three consecutive sentences is shown below.

(1) A sense of vulnerability and insecurity could create a perception of unmet societal needs, and lead to a desire for increased welfare state interventions. (2) In fact, our analyses show that net migration is significantly positively associated with a preference for greater welfare spending on health, pensions and unemployment. (3) Our results challenge much conventional wisdom and many scholars and commentators.

In the example, sentences (1) and (3) do not contain variable mentions, while sentence (2) contains multiple variable mentions (exploredata-ZA4700_Varv25, exploredata-ZA4700_Varv27, exploredata-ZA4700_Varv28). The data for these variables is provided in Table 2.

| id | label | question | item | answers | topics |
| --- | --- | --- | --- | --- | --- |
| exploredata-ZA4700_Varv25 | v25 - Q7a: Gov. responsibility: Provide job for everyone | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7a Government responsibility: Provide a job for everyone who wants one | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |
| exploredata-ZA4700_Varv27 | v27 - Q7c: Gov. responsibility: Provide health care for sick | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7c Government responsibility: Provide health care for the sick | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |
| exploredata-ZA4700_Varv28 | v28 - Q7d: Gov. responsibility: Provide living standard for the old | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7d Government responsibility: Provide a decent standard of living for the old | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |

Table 2: Example survey variables
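The relationship between the two kinds of labels can be sketched as follows. This is an illustration only: the field names `id`, `text` and `variables` are assumptions and may not match the released files, and the sentence texts are shortened. A sentence counts as a positive instance for Task 1 exactly when its list of linked variables is non-empty.

```python
# Hypothetical in-memory form of the three example sentences.
# Field names are assumptions for illustration; texts are abbreviated.
sentences = [
    {"id": "1", "text": "A sense of vulnerability and insecurity could create",
     "variables": []},
    {"id": "2", "text": "In fact, our analyses show that net migration is",
     "variables": ["exploredata-ZA4700_Varv25",
                   "exploredata-ZA4700_Varv27",
                   "exploredata-ZA4700_Varv28"]},
    {"id": "3", "text": "Our results challenge much conventional wisdom",
     "variables": []},
]


def detection_label(instance):
    """Return 1 if the sentence mentions at least one variable, else 0."""
    return int(len(instance["variables"]) > 0)


labels = {inst["id"]: detection_label(inst) for inst in sentences}
print(labels)
```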

For the full trial data please visit the SV-Ident GitHub repository.

Evaluation

Task 1: The variable detection task is evaluated using the macro-averaged F1 score. You can use the evaluation script evaluate_task1.py to check the performance of your model.
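For readers unfamiliar with macro averaging, here is a minimal pure-Python sketch of the metric. It is not the official evaluate_task1.py script, whose details may differ. The function name `macro_f1` and the toy labels are made up for illustration. F1 is computed separately for each class and the per-class scores are then averaged with equal weight.

```python
def macro_f1(gold, pred):
    """Average the per-class F1 scores over all classes seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives or false negatives scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: class 1 has F1 = 1/2, class 0 has F1 = 2/3.
gold = [1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because both classes are weighted equally, a system that always predicts the majority class scores poorly even when its accuracy is high.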

Task 2: The variable disambiguation task will be evaluated using (Mean) Average Precision with a cutoff of 10 (MAP@10), which is a measure commonly used in information retrieval and multi-label text classification. You can use the evaluation script evaluate_task2.py (which uses the ranx tool) to check the performance of your model.
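The metric can be sketched in plain Python as below. This is an illustration under stated assumptions, not the official evaluate_task2.py script and not the ranx implementation. It assumes each ranking is already ordered from best to worst match without duplicates, and that average precision is normalized by the total number of relevant variables for the sentence. The function names and toy data are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k ranked variable IDs for one sentence."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, var_id in enumerate(ranked_ids[:k], start=1):
        if var_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this cutoff
    return precision_sum / len(relevant_ids)


def map_at_k(rankings, gold, k=10):
    """Mean of the per-sentence average precision over all gold sentences."""
    total = sum(
        average_precision_at_k(rankings.get(doc_id, []), relevant, k)
        for doc_id, relevant in gold.items()
    )
    return total / len(gold)


# Toy example with made-up document and variable IDs.
gold = {"17": {"v25", "v27"}, "238": {"v637"}}
rankings = {"17": ["v25", "v206", "v27"], "238": ["v418", "v637"]}
score = map_at_k(rankings, gold)
print(round(score, 4))
```

In the toy example, document 17 scores (1/1 + 2/3) / 2 = 5/6 and document 238 scores 1/2, giving a mean of 2/3.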

Submission

Rules:

  1. Please use the most recent version of the dataset (v0.3-train-val-full). If you are using HuggingFace Datasets, make sure that you download the data after July 5th. If you downloaded the data before then, either delete the cache and redownload the data, or overwrite the saved data (e.g., load_dataset("vadis/sv-ident", download_mode="force_redownload")).
  2. For each task, submit the file `submission.json` as a ZIP archive, in the format described below.
  3. You are allowed to use external data, but must either make the trained model available or provide a script to train the model (including any external data).

Format:

The submissions should have the following JSON formats:
Task 1

For Task 1, the key is the document ID and the value indicates whether the sentence with that document ID contains a variable mention (1) or not (0).

Schema:

    {
      "DOC_ID": "LABEL"
    }

Example:

    {
      "17": "1",
      "238": "0",
      ...
    }
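A minimal sketch of producing this structure from model output follows. The `predictions` dictionary and the helper name `build_task1_submission` are hypothetical. The sketch writes labels as strings because the example above shows them quoted; whether integer labels are also accepted is not stated here.

```python
import json


def build_task1_submission(predictions):
    """Map each document ID to "1" (variable mention) or "0" (no mention)."""
    return {
        str(doc_id): str(int(bool(label)))
        for doc_id, label in predictions.items()
    }


# Hypothetical model predictions keyed by document ID.
predictions = {17: 1, 238: 0}
submission = build_task1_submission(predictions)

# json.dumps emits double-quoted keys and values, as JSON requires.
payload = json.dumps(submission, indent=2)
print(payload)
```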
        
      
Task 2

For Task 2, the outer key is the document ID, the inner key is the variable ID, and the value is the similarity score between the variable and the document (lower is more similar).

Schema:

    {
      "DOC_ID": {
        "VAR_ID": "SCORE"
      }
    }

Example:

    {
      "17": {
        "v25": 0.0908927470445633,
        "v637": 0.10519161820411682,
        "v206": 0.08874139934778214,
        ...
      },
      "238": {
        "v637": 0.0477452278137207,
        "v418": 0.08932048827409744,
        "v419": 0.05219722166657448,
        ...
      },
      ...
    }
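The nested structure and the ZIP packaging required by the rules can be sketched as follows. The helper names and the `scores` input are hypothetical, and the scores are passed through unchanged, so the model is responsible for the score direction described above. The archive is written to an in-memory buffer here so that the example creates no files.

```python
import io
import json
import zipfile


def build_task2_submission(scores):
    """Convert {doc_id: {var_id: score}} into string keys with float scores."""
    return {
        str(doc_id): {str(var_id): float(s) for var_id, s in var_scores.items()}
        for doc_id, var_scores in scores.items()
    }


def zip_submission(submission, target):
    """Write the submission as submission.json into a ZIP archive.

    `target` may be a file path or a writable binary file-like object.
    """
    with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("submission.json", json.dumps(submission, indent=2))


# Hypothetical scores for two documents (values taken from the example above).
scores = {
    17: {"v25": 0.0908927470445633, "v637": 0.10519161820411682},
    238: {"v637": 0.0477452278137207},
}
submission = build_task2_submission(scores)

buffer = io.BytesIO()
zip_submission(submission, buffer)

# Read the archive back to confirm its contents.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    restored = json.loads(archive.read("submission.json"))
print(names)
```

Passing a path such as "task2.zip" (an arbitrary name) instead of the buffer writes the archive to disk. The same `zip_submission` helper works for a Task 1 dictionary.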
              

Organizers

This task is organized by members of the VAriable Detection, Interlinking and Summarization (VADIS) project.