3rd Workshop on Scholarly Document Processing at COLING 2022
Social science literature often uses and references survey datasets. A survey dataset typically contains hundreds of items or questions, called survey variables, and a study may focus on and reference only a specific subset of them. While the survey datasets used in a publication are typically referenced explicitly in the text with a bibliographic citation, individual survey variables are often referenced only ambiguously (four examples are shown in Table 1; see the GitHub repository for more examples). This lack of explicit links between individual survey variables and publications limits aggregate analyses and makes comparing studies non-trivial. Automatic methods for detecting and linking survey variables can help address this problem. Initial baselines show promising results, but challenges remain.
Reference Type | In-text Reference | Variable |
---|---|---|
Self-containing reference | To test this, we analyzed data on the strength of individuals’ identification with their home town and its inhabitants from the German ALLBUS surveys. | Variable label: IDENTIFICATION WITH OWN COMMUNITY |
Quotation of the variable text | There is only one item measuring happiness which directly asks the respondents: ‘If you were to consider your life in general these days, how happy or unhappy would you say you are, on the whole... | Variable question: If you were to consider your life in general these days, how happy or unhappy would you say you are, on the whole... |
Paraphrase of the variable text | The second and the third questions come from the ISSP research, where respondents were asked about the influence of religious leaders on people’s votes and the government. | Variable question: How much do you agree or disagree with each of the following: Religious leaders should not try to influence how people vote in elections. |
Negative polarity item | Victimization and fear of crime are dichotomous, with “1” indicating positive responses to either of the two following questions: “Have you been a victim of theft in the past 3 years?” and “Is there any place in the immediate vicinity in which you fear walking alone at night? | Variable question: Is there any area in the immediate vicinity - I mean within a kilometer or so - where you would prefer not to walk alone at night? |
Table 1: Example survey variable mentions
Figure 1: Example of linked survey variable mentions in scientific publications
In this shared task, participants build a system that identifies all relevant variables mentioned in English and German social science publications, given a set of sentences and a vocabulary of variables. The shared task is split into two sub-tasks:
Task 1 - Variable Detection: identifying whether a sentence contains a variable mention or not.
Task 2 - Variable Disambiguation: identifying which variable from a given vocabulary is mentioned in a sentence.
We released the training dataset, which contains 4,248 instances for English and German survey variable mentions. We also released a trial dataset containing 1,227 instances.
July 18, 2022: We have released the test data and are evaluating submissions via CodaLab
July 5, 2022: We have extended the registration deadline **until July 14th!** Register here to participate!
June 23, 2022: Registration is now open here
June 8, 2022: Training data is released
March 16, 2022: Trial data is released
November 22, 2021: Workshop is accepted
Please join our group svident2022@googlegroups.com to receive announcements and participate in discussions. For any other question, please contact Tornike Tsereteli or Yavuz Selim Kartal.
Event | Date |
---|---|
Trial Set Release | March 16, 2022 |
Training Set Release | June 6, 2022 |
Deadline for Registration | July 4, 2022 |
Test Set Release (Blind) | July 18, 2022 |
System Runs Due | July 25, 2022 |
Workshop Papers Due | August 15, 2022 |
Camera-Ready Papers Due | September 5, 2022 |
Workshop at COLING 2022 | October 16/17, 2022 |
The dataset is made up of sentences with and without survey variable mentions (with their respective variable labels) and a vocabulary of survey variables. An example of three consecutive sentences is shown below.
(1) A sense of vulnerability and insecurity could create a perception of unmet societal needs, and lead to a desire for increased welfare state interventions. (2) In fact, our analyses show that net migration is significantly positively associated with a preference for greater welfare spending on health, pensions and unemployment. (3) Our results challenge much conventional wisdom and many scholars and commentators.
In the example, sentences (1) and (3) do not contain variable mentions, while sentence (2) contains multiple variable mentions (exploredata-ZA4700_Varv25, exploredata-ZA4700_Varv27, exploredata-ZA4700_Varv28). The corresponding variable data is shown in Table 2.
id | label | question | item | answers | topics |
---|---|---|---|---|---|
exploredata-ZA4700_Varv25 | v25 - Q7a: Gov. responsibility: Provide job for everyone | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7a Government responsibility: Provide a job for everyone who wants one | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |
exploredata-ZA4700_Varv27 | v27 - Q7c: Gov. responsibility: Provide health care for sick | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7c Government responsibility: Provide health care for the sick | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |
exploredata-ZA4700_Varv28 | v28 - Q7d: Gov. responsibility: Provide living standard for the old | Q.7 On the whole, do you think it should or should not be the government's responsibility to ... | Q.7d Government responsibility: Provide a decent standard of living for the old | Definitely should be;Probably should be;Probably should not be;Definitely should not be;Can't choose;No Answer;Don't know, no answer | Mass political behaviour, attitudes/opinion;Government, political systems and organisation;Social stratification and groupings;Economic policy |
Table 2: Example survey variables
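One way to picture this structure in code is sketched below in Python. The class names `Variable` and `Sentence` and their fields are illustrative assumptions, not the schema of the released files; only the sentence texts, variable IDs, and item texts are taken from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Variable:
    """One entry of the variable vocabulary (hypothetical field names)."""
    id: str
    item: str


@dataclass
class Sentence:
    """One sentence with the IDs of the variables it mentions (may be empty)."""
    text: str
    variable_ids: List[str] = field(default_factory=list)

    @property
    def has_mention(self) -> bool:
        return bool(self.variable_ids)


# Vocabulary built from the three variables shown in Table 2.
vocabulary: Dict[str, Variable] = {
    v.id: v
    for v in [
        Variable("exploredata-ZA4700_Varv25",
                 "Q.7a Government responsibility: Provide a job for everyone who wants one"),
        Variable("exploredata-ZA4700_Varv27",
                 "Q.7c Government responsibility: Provide health care for the sick"),
        Variable("exploredata-ZA4700_Varv28",
                 "Q.7d Government responsibility: Provide a decent standard of living for the old"),
    ]
}

# The three consecutive example sentences; only sentence (2) mentions variables.
sentences = [
    Sentence("A sense of vulnerability and insecurity could create a perception of "
             "unmet societal needs, and lead to a desire for increased welfare state "
             "interventions."),
    Sentence("In fact, our analyses show that net migration is significantly positively "
             "associated with a preference for greater welfare spending on health, "
             "pensions and unemployment.",
             ["exploredata-ZA4700_Varv25", "exploredata-ZA4700_Varv27",
              "exploredata-ZA4700_Varv28"]),
    Sentence("Our results challenge much conventional wisdom and many scholars and "
             "commentators."),
]

# Task 1 view: one binary label per sentence.
task1_labels = [int(s.has_mention) for s in sentences]

# Task 2 view: the vocabulary entries mentioned in sentence (2).
task2_items = [vocabulary[var_id].item for var_id in sentences[1].variable_ids]
```

In this representation, Task 1 only needs `has_mention`, while Task 2 needs the full list of variable IDs for each sentence that contains a mention.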
For the full trial data please visit the SV-Ident GitHub repository.
Task 1: The variable detection task is evaluated using the macro-averaged F1 score. You can use the evaluation script evaluate_task1.py to check the performance of your model.
Task 2: The variable disambiguation task will be evaluated using (Mean) Average Precision with a cutoff of 10 (MAP@10), which is a measure commonly used in information retrieval and multi-label text classification. You can use the evaluation script evaluate_task2.py (which uses the ranx tool) to check the performance of your model.
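To make the Task 2 metric concrete, the following is a minimal pure-Python sketch of MAP@10. It is not the official evaluate_task2.py script and does not use ranx. It assumes that each sentence comes with a list of variable IDs already ordered from best to worst match without duplicates, and it normalizes each average precision by min(k, number of relevant variables); the official script may normalize differently. The function names and the toy IDs are hypothetical.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the first k ranked IDs (best match first)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, var_id in enumerate(ranked_ids[:k], start=1):
        if var_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this cutoff
    # Assumption: normalize by the best achievable number of hits within k.
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(rankings, gold, k=10):
    """Mean of AP@k over all sentences in `gold` ({doc_id: relevant IDs})."""
    scores = [
        average_precision_at_k(rankings.get(doc_id, []), relevant, k)
        for doc_id, relevant in gold.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up rankings and gold labels.
rankings = {"17": ["v25", "v637", "v27"], "238": ["v1", "v2"]}
gold = {"17": {"v25", "v27"}, "238": {"v2"}}
map_at_10 = mean_average_precision_at_k(rankings, gold, k=10)
```

For sentence "17", the relevant variables appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2; for sentence "238", the single relevant variable appears at rank 2, giving 1/2. The mean of the two is the MAP@10 of this toy run.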
For Task 1, the key is the document ID and the value indicates whether the sentence with that document ID contains a variable mention (1) or not (0).
{
    'DOC_ID': 'LABEL',
}
{
    '17': '1',
    '238': '0',
    ...
}
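Predictions in this shape can be scored with macro-averaged F1, the Task 1 metric. The sketch below is a pure-Python illustration, not the official evaluate_task1.py script; the function name `macro_f1` and the document IDs '301' and '412' are hypothetical additions for the example.

```python
def macro_f1(gold, pred, labels=("0", "1")):
    """Unweighted mean of the per-class F1 scores.

    `gold` and `pred` map document IDs to label strings; a document missing
    from `pred` counts as a wrong prediction for its gold class.
    """
    f1_scores = []
    for label in labels:
        tp = sum(1 for d in gold if gold[d] == label and pred.get(d) == label)
        fp = sum(1 for d in gold if gold[d] != label and pred.get(d) == label)
        fn = sum(1 for d in gold if gold[d] == label and pred.get(d) != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy gold labels and predictions in the Task 1 format.
gold = {"17": "1", "238": "0", "301": "1", "412": "0"}
pred = {"17": "1", "238": "0", "301": "0", "412": "0"}
score = macro_f1(gold, pred)
```

Here class "0" has two true positives and one false positive (F1 = 0.8), and class "1" has one true positive and one false negative (F1 = 2/3), so the macro average weights both classes equally regardless of how often each occurs.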
For Task 2, the first key is the document ID, the second key the variable ID, and the value the similarity score (lower is more similar) for the variable to the document.
{
    'DOC_ID': {
        'VAR_ID': 'SCORE',
    },
}
{
    '17': {
        'v25': 0.0908927470445633,
        'v637': 0.10519161820411682,
        'v206': 0.08874139934778214,
        ...
    },
    '238': {
        'v637': 0.0477452278137207,
        'v418': 0.08932048827409744,
        'v419': 0.05219722166657448,
        ...
    },
    ...
}
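A nested mapping of this kind could be assembled as in the sketch below. Several points are assumptions rather than stated requirements: that only the ten best variables per document need to be kept (because the metric uses a cutoff of 10), that the predictions are serialized as JSON, and that lower scores rank first, as the description above says. The helper `top_k_predictions` and its `lower_is_better` flag are hypothetical; check the evaluation script for the ordering it actually expects.

```python
import json


def top_k_predictions(scores, k=10, lower_is_better=True):
    """Keep the k best-scoring variables per document.

    `scores` maps a document ID to {variable ID: score}. Document IDs are
    converted to strings to match the format shown above.
    """
    predictions = {}
    for doc_id, var_scores in scores.items():
        ranked = sorted(
            var_scores.items(),
            key=lambda pair: pair[1],
            reverse=not lower_is_better,
        )
        predictions[str(doc_id)] = {var_id: float(s) for var_id, s in ranked[:k]}
    return predictions


# Scores copied from the example above; k=2 is used only to keep the demo small.
raw_scores = {
    17: {"v25": 0.0908927470445633, "v637": 0.10519161820411682,
         "v206": 0.08874139934778214},
    238: {"v637": 0.0477452278137207, "v418": 0.08932048827409744,
          "v419": 0.05219722166657448},
}
predictions = top_k_predictions(raw_scores, k=2)
serialized = json.dumps(predictions, indent=2)
```

Sorting before truncating means the retained entries are always the best-ranked ones under the chosen score direction, whatever order the model produced them in.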
This task is organized by members of the VAriable Detection, Interlinking and Summarization (VADIS) project.