Machine Learning & NLP for the Study of Political Behavior

Zach Dickson

Postdoctoral Research Fellow
London School of Economics

Overview

1. What is machine learning?

2. How does machine learning differ from traditional statistical methods?

3. What are some applications of machine learning in the study of political behavior?

4. What is NLP?

5. What are some applications of NLP in the study of political behavior?

6. What are some ethical considerations when using machine learning and NLP in your research?

7. Where can you learn more about machine learning and NLP?

What is machine learning?

  • Machine learning is a fancy term for a set of statistical methods that are used to make predictions based on patterns in data

  • Unlike traditional statistical methods, machine learning methods are (usually) designed to make predictions rather than test hypotheses or estimate causal effects

  • Simple example: predicting an outcome from a predictor variable – e.g., predicting whether a person will turn out to vote based on their income (\(P(Y=\text{turnout} \mid \text{income})\))
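
A minimal sketch of this example in Python (scikit-learn, with synthetic data for illustration):

# predict turnout from income with a logistic regression classifier
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
income = rng.normal(50, 15, size=(1000, 1))                           # predictor
turnout = (income[:, 0] + rng.normal(0, 10, 1000) > 55).astype(int)   # outcome
model = LogisticRegression().fit(income, turnout)
print(model.predict_proba([[60.0]]))   # estimated P(Y = turnout | income = 60)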

How does machine learning differ from traditional statistical methods?

  • In ML, prediction is usually the goal
    • Big data: ML methods are designed to handle large datasets with many variables and observations
  • Less emphasis on statistical significance and p-values
    • Machine learning methods are often used to make predictions about rare events, which traditional statistical methods are not well-suited to do
      • e.g. fraud detection, disease diagnosis, etc.
  • More emphasis on model selection and hyperparameter tuning
    • Hyperparameter tuning: choosing the best values for the hyperparameters of a given model
    • Model selection: choosing the best model from a set of candidate models
    • …and all kinds of other ‘hacky’ methods that would make a traditional statistician cringe
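
A minimal sketch of model selection and hyperparameter tuning via cross-validated grid search (scikit-learn; the model and parameter grid are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,   # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # best hyperparameters and CV accuracy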

Types of machine learning methods

  • Supervised learning: learning a function that maps an input to an output based on example input-output pairs
    • Classification: predicting a categorical outcome
    • Regression: predicting a continuous outcome
  • Unsupervised learning: learning patterns or structure in inputs without example input-output pairs
    • Clustering: grouping observations into clusters based on their similarity
    • Dimensionality reduction: reducing the number of variables in a dataset while retaining as much information as possible

Supervised learning: classification

  • Classification: Identifying which category an observation belongs to based on its characteristics
  • A function that maps an input to a categorical output is called a classifier
[Figure: classification on the Iris dataset (scikit-learn). The Iris dataset contains 3 kinds of iris flowers (Setosa, Versicolour, and Virginica), each with 4 attributes: sepal length, sepal width, petal length, and petal width.]
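
A minimal sketch of training a classifier on the Iris dataset (scikit-learn; the support vector classifier is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)   # the classifier
print(clf.score(X_test, y_test))    # accuracy on held-out flowers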

Supervised learning: regression

  • Regression: predicting a continuous outcome based on its characteristics
  • A function that maps an input to a continuous output is called a regressor
    • Linear regression: predicting a continuous outcome using a linear function (e.g., predicting income based on turnout)
    • Nonlinear regression: predicting a continuous outcome using a nonlinear function (e.g. below)
[Figure: nonlinear regression example (scikit-learn).]
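
A minimal sketch of nonlinear regression (scikit-learn; synthetic data and a decision tree regressor as illustrative choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)       # nonlinear outcome

reg = DecisionTreeRegressor(max_depth=4).fit(X, y)  # the regressor
print(reg.predict([[1.5]]))   # predicted continuous outcome at x = 1.5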

Unsupervised learning: clustering

  • Clustering: grouping observations into clusters based on their (dis-)similarity
  • A function that maps an input to an undefined (or a semi-undefined) output is called a clusterer
    • K-means clustering: grouping observations into clusters based on their distance from the cluster centroids
    • Hierarchical clustering: grouping observations into clusters based on their distance from each other
[Figure: clustering on the Digits dataset (scikit-learn). The Digits dataset contains handwritten digits from 0 to 9.]
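
A minimal sketch of k-means clustering on the Digits dataset (scikit-learn):

from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)   # labels are ignored: unsupervised
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])            # cluster assignment for the first 10 images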

Unsupervised learning: dimensionality reduction

  • Dimensionality reduction: reducing the number of variables in a dataset while retaining as much information as possible
[Figure: Principal Components Analysis and Linear Discriminant Analysis applied to the Iris dataset.]

Note: The Iris dataset contains 3 kinds of iris flowers (Setosa, Versicolour, and Virginica) with 4 attributes: sepal length, sepal width, petal length, and petal width.
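
A minimal sketch of both methods on the Iris dataset (scikit-learn):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)     # unsupervised: ignores labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses labels
print(X_pca.shape, X_lda.shape)   # 4 attributes reduced to 2 components each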

ML in causal inference

  • Causal inference
    • Generalized synthetic control methods
    • Matching and weighting methods
    • Heterogeneous treatment effects



Generalized difference-in-differences

  • Observational difference-in-differences settings
  • Given a set of control units, find a weighted combination of control units that best matches the treated unit(s) on the pre-treatment outcome
[Figure: fitted synthetic control. Source: Facure (2023); Abadie, Diamond, and Hainmueller (2010).]
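
A minimal, unconstrained sketch of the weighting logic (synthetic data; the canonical estimator of Abadie, Diamond, and Hainmueller (2010) additionally constrains the weights to be non-negative and sum to one):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
Y_controls = rng.normal(size=(20, 30))   # 20 pre-treatment periods x 30 control units
y_treated = Y_controls @ rng.dirichlet(np.ones(30)) + rng.normal(0, 0.1, 20)

# regress the treated unit's pre-treatment outcomes on the controls' to get weights
sc = LinearRegression(fit_intercept=False).fit(Y_controls, y_treated)
y_synthetic = Y_controls @ sc.coef_      # fitted synthetic control trajectory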

Matching and weighting methods

  • Propensity score matching & weighting (Rosenbaum and Rubin 1983; Horvitz and Thompson 1952)

  • Traditional matching methods: nearest neighbor matching, radius matching, kernel matching, Mahalanobis distance etc.

    • Logic: Given a set of control units, find a subset of control units that best matches the treated unit(s) on the pre-treatment outcome
[Figure: matching with time-series cross-sectional data. Source: Imai, Kim, and Wang (2023).]
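
A minimal sketch of propensity score matching (synthetic data): estimate each unit's probability of treatment, then pair each treated unit with its nearest control on that score:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                         # pre-treatment covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X

ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]   # propensity scores
nn = NearestNeighbors(n_neighbors=1).fit(ps[treat == 0].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treat == 1].reshape(-1, 1))   # matched control per treated unit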

Trajectory weighting methods

  • Trajectory weighting methods (Zubizarreta 2015; Hazlett and Xu 2018)
    • Logic: Given a set of control units, find a weighted combination of control units whose pre-treatment outcome trajectories (and possibly other covariates) best match those of the treated unit(s)
[Figures: pre-treatment covariate balance and treatment effect estimation (Hazlett and Xu 2018).]

Heterogeneous treatment effects

[Figure: heterogeneous treatment effect analysis. Source: Gong et al. (2021).]

ML in Natural Language Processing

  • Text-as-data vs. NLP
    • NLP is an umbrella term for a set of methods that are used to analyze text data
    • Text-as-data is a subset of NLP methods that are used to analyze text data in a quantitative way (e.g., word counts)
  • Treating text as data is not a new idea

Word embeddings



You shall know a word by the company it keeps
(Firth 1957)


  • Word embeddings are vector representations of words in a high-dimensional space
  • Word embeddings are used to represent words in a way that captures their meaning and context
  • Each word is assigned a unique vector, capturing its semantic meaning
  • Words that are similar in meaning are close to each other in the vector space

Word embeddings example

  • We can model the relationship between words using word embeddings

    • King = [0.1, 0.2, 0.3, 0.4, 0.5]
    • Man = [0.2, 0.3, 0.4, 0.5, 0.6]
    • Woman = [0.3, 0.5, 0.7, 0.9, 1.1]


  • King - Man + Woman = [0.2, 0.4, 0.6, 0.8, 1.0] = Queen
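
The toy analogy can be checked directly with NumPy (real embeddings come from models such as word2vec; Mikolov et al. 2013):

import numpy as np

king  = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
man   = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
woman = np.array([0.3, 0.5, 0.7, 0.9, 1.1])
print(king - man + woman)   # [0.2 0.4 0.6 0.8 1.0] -- the vector closest to "queen"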

Word embeddings

We can model a word’s meaning over another dimension (e.g. documents, time, party, gender, etc.)

\[\mathbf{Y} = \mathbf{X} \beta + \mathbf{E} \]

Where \(\mathbf{Y}\) is an \(n \times D\) matrix of word embeddings (one \(D\)-dimensional embedding per row), \(\mathbf{X}\) is an \(n \times k\) matrix of covariates, \(\beta\) is a \(k \times D\) matrix of coefficients, and \(\mathbf{E}\) is an \(n \times D\) matrix of errors.

[Figures: Categorical: gender and party differences in word usage in the US Congress; Dynamic: US/UK differences in the use of “Empire” (Rodriguez, Spirling, and Stewart 2023).]

Topic modeling

  • Clustering method that groups words into topics based on their co-occurrence in documents

  • Logic: Given a set of documents, find a set of topics that best describes the documents

    • Example: Given a set of political speeches, find a set of topics that best describes them


Text -> Embeddings -> Clustering -> Word Representations -> Topics (Summarised)
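
The pipeline above is the embedding-based variant; a minimal sketch of classic bag-of-words topic modeling (latent Dirichlet allocation; Blei, Ng, and Jordan 2003) in scikit-learn, with the 20 Newsgroups corpus as a stand-in for political texts:

from sklearn.datasets import fetch_20newsgroups   # stand-in corpus (downloads on first use)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
vectorizer = CountVectorizer(max_features=2000, stop_words="english")
counts = vectorizer.fit_transform(docs)           # document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
words = vectorizer.get_feature_names_out()
print(words[lda.components_[0].argsort()[-8:]])   # top words for topic 0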

Language models

  • Language models combine embeddings with neural networks to predict the next word in a sequence of words

  • Neural networks are trained on large datasets of text to learn patterns in language

  • Language models are used to generate text, answer questions, and perform other tasks

  • Example: ChatGPT
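
A minimal sketch of next-word generation with an openly available model (GPT-2 via the transformers library; the prompt is illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The main driver of voter turnout is", max_new_tokens=20))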


Types of language models

  • Language models can be used to generate text, answer questions, and perform other tasks

  • Pre-trained models are trained on large datasets of text and can be fine-tuned for specific tasks

    • Common Example: BERT
      • Trained on BookCorpus, a dataset consisting of 11,038 unpublished books and the entirety of English Wikipedia (excluding lists, tables and headers).

Coding Example with BERT:

# input:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The man worked as a [MASK].")
# output 1:
#   'score': 0.09747550636529922, 'token_str': 'carpenter'
# output 2:
#   'score': 0.04962705448269844, 'token_str': 'barber'
Additional applications of language models

  • Task-specific models are trained on specific datasets and are designed to perform specific tasks

Moving beyond text

  • Language models (e.g., the transformer architecture) can be used to analyze other types of data
    • Images (Wu et al. 2020)
      • Classification: predicting the content of an image
      • Generation: generating images from text
      • Segmentation: identifying objects in an image
      • Detection: identifying objects in an image and their location
    • Audio: Whisper (Radford et al. 2022)
      • Text-to-speech: generating speech from text
      • Text-to-audio: generating audio from text
      • Speech/speaker recognition: transcribing speech to text

Ethics in machine learning

  • Machine learning and NLP have the potential to transform the study of political behavior, but they also raise a number of ethical concerns

    • Contemporary Ethical Considerations
    • AI Governance and Accountability
    • Responsible AI/ML research
[Figure. Source: MLPrograms.]

Contemporary ethical considerations

  • Bias and Fairness:
    • Addressing biases in training data that can perpetuate stereotypes and discrimination
      • Systematically inaccurate models can also be harmful
      • garbage in, garbage out
  • Misinformation and Disinformation:
    • Acknowledgment of the role language models can play in spreading misinformation.
    • Developing tools and policies to counter misinformation.
  • Privacy Concerns:
    • Heightened awareness of privacy implications
    • Stricter regulations on data usage and user consent.

ML governance and accountability

  • Transparency:
    • Calls for greater transparency in how ML systems are developed and used.
    • Development of tools to explain model behavior.
  • Accountability:
    • Who is responsible for the actions of ML systems?
    • Growing emphasis on holding developers and organizations accountable
    • Development of ethical guidelines and frameworks
  • Social Impact:
    • Recognition of the broader societal impact of ML applications.
    • Calls for responsible innovation and socially beneficial solutions.

Ethics in research practice when using ML

  • Some practical considerations for researchers using ML in their research:

    • What data are you using? Where did it come from? How was it collected?
    • Are there any biases in the data? How might these biases affect your results?
    • Are there any privacy concerns? How will you protect the privacy of your subjects?
      • Are you giving language models access to sensitive information?
    • What model are you using? How does it work? What assumptions does it make?
    • How will you evaluate the performance of your model? What metrics will you use?

Resources for further learning

  • Machine learning in the social sciences: Grimmer, Roberts, and Stewart (2022); Huber (2023)
  • Machine learning in practice: James et al. (2013, 2023)
    • Online courses: Coursera, Kaggle, Udemy, edX
    • YouTube: StatQuest, 3Blue1Brown, Two Minute Papers
  • Natural language processing (in “real” ML): Tunstall, Von Werra, and Wolf (2022)

Some websites you should know about


Pre-trained language models

  • Hugging Face
    • Hugging Face is the largest platform for sharing and using open source pre-trained language models



Kaggle

  • Kaggle is a platform for data science competitions and machine learning projects
  • Kaggle hosts a number of datasets and provides tools for data exploration and analysis

Google Colab

  • Google Colab is a free cloud-based platform for data science and machine learning projects

Discussion

  • What are some applications of machine learning you could incorporate in your research?

  • What types of data might you use?

  • What are some ethical considerations you should keep in mind when using machine learning in your research?

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505.
Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2021. “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116 (536): 1716–30.
Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805.
Facure, Matheus. 2023. Causal Inference in Python. O’Reilly Media.
Firth, John. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis, 10–32.
Gong, Xiajing, Meng Hu, Mahashweta Basu, and Liang Zhao. 2021. “Heterogeneous Treatment Effect Analysis Based on Machine-Learning Methodology.” CPT: Pharmacometrics & Systems Pharmacology 10 (11): 1433–43.
Grimmer, Justin, Margaret E Roberts, and Brandon M Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Harris, Zellig S. 1954. “Distributional Structure.” Word 10 (2-3): 146–62.
Hazlett, Chad, and Yiqing Xu. 2018. “Trajectory Balancing: A General Reweighting Approach to Causal Inference with Time-Series Cross-Sectional Data.” Available at SSRN 3214231.
Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.
Huber, Martin. 2023. Causal Analysis: Impact Evaluation and Causal Machine Learning with Applications in r. MIT Press.
Imai, Kosuke, In Song Kim, and Erik H Wang. 2023. “Matching Methods for Causal Inference with Time-Series Cross-Sectional Data.” American Journal of Political Science 67 (3): 587–605.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2212.04356.
Rodriguez, Pedro L, Arthur Spirling, and Brandon M Stewart. 2023. “Embedding Regression: Models for Context-Specific Description and Inference.” American Political Science Review, 1–20.
Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Salton, Gerard, and Christopher Buckley. 1988. “Term-Weighting Approaches in Automatic Text Retrieval.” Information Processing & Management 24 (5): 513–23.
Tunstall, Lewis, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.
Wu, Bichen, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. “Visual Transformers: Token-Based Image Representation and Processing for Computer Vision.” https://arxiv.org/abs/2006.03677.
Zubizarreta, José R. 2015. “Stable Weights That Balance Covariates for Estimation with Incomplete Outcome Data.” Journal of the American Statistical Association 110 (511): 910–22.