Machine Learning & NLP for the Study of Political Behavior

Zach Dickson

Postdoctoral Research Fellow
London School of Economics

Overview

1. What is machine learning?

2. How does machine learning differ from traditional statistical methods?

3. What are some applications of machine learning in the study of political behavior?

4. What is NLP?

5. What are some applications of NLP in the study of political behavior?

6. What are some ethical considerations when using machine learning and NLP in your research?

7. Where can you learn more about machine learning and NLP?

What is machine learning?

  • Machine learning is a fancy term for a set of statistical methods that are used to make predictions based on patterns in data

  • Unlike traditional statistical methods, machine learning methods are (usually) designed to make predictions rather than test hypotheses or estimate causal effects

  • Simple example: predicting an outcome from a predictor variable – e.g., predicting whether a person will turn out to vote based on their income (\(P(Y=\text{turnout} \mid \text{income})\))
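
A minimal sketch of this example in Python (scikit-learn, with synthetic data for illustration):

# predict turnout from income with a logistic regression classifier
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
income = rng.normal(50, 15, size=(1000, 1))                           # predictor
turnout = (income[:, 0] + rng.normal(0, 10, 1000) > 55).astype(int)   # outcome
model = LogisticRegression().fit(income, turnout)
print(model.predict_proba([[60.0]]))   # estimated P(Y = turnout | income = 60)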

How does machine learning differ from traditional statistical methods?

  • In ML, prediction is usually the goal
    • Big data: ML methods are designed to handle large datasets with many variables and observations
  • Less emphasis on statistical significance and p-values
    • Machine learning methods are often used to make predictions about rare events, which traditional statistical methods are not well-suited to do
      • e.g. fraud detection, disease diagnosis, etc.
  • More emphasis on model selection and hyperparameter tuning
    • Hyperparameter tuning: choosing the best values for the hyperparameters of a given model
    • Model selection: choosing the best model from a set of candidate models
    • …and all kinds of other ‘hacky’ methods that would make a traditional statistician cringe
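
A minimal sketch of model selection and hyperparameter tuning via cross-validated grid search (scikit-learn; the model and parameter grid are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 4, None]},
    cv=5,   # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # best hyperparameters and CV accuracy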

Types of machine learning methods

  • Supervised learning: learning a function that maps an input to an output based on example input-output pairs
    • Classification: predicting a categorical outcome
    • Regression: predicting a continuous outcome
  • Unsupervised learning: learning patterns or structure in inputs without example input-output pairs
    • Clustering: grouping observations into clusters based on their similarity
    • Dimensionality reduction: reducing the number of variables in a dataset while retaining as much information as possible

Supervised learning: classification

  • Classification: Identifying which category an observation belongs to based on its characteristics
  • A function that maps an input to a categorical output is called a classifier
[Figure: classification on the Iris dataset (scikit-learn). The Iris dataset contains 3 kinds of iris flowers (Setosa, Versicolour, and Virginica), each with 4 attributes: sepal length, sepal width, petal length, and petal width.]
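
A minimal sketch of training a classifier on the Iris dataset (scikit-learn; the support vector classifier is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC().fit(X_train, y_train)   # the classifier
print(clf.score(X_test, y_test))    # accuracy on held-out flowers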

Supervised learning: regression

  • Regression: predicting a continuous outcome based on its characteristics
  • A function that maps an input to a continuous output is called a regressor
    • Linear regression: predicting a continuous outcome using a linear function (e.g., predicting income based on turnout)
    • Nonlinear regression: predicting a continuous outcome using a nonlinear function (e.g. below)
[Figure: nonlinear regression example (scikit-learn).]
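
A minimal sketch of nonlinear regression (scikit-learn; synthetic data and a decision tree regressor as illustrative choices):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)       # nonlinear outcome

reg = DecisionTreeRegressor(max_depth=4).fit(X, y)  # the regressor
print(reg.predict([[1.5]]))   # predicted continuous outcome at x = 1.5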

Unsupervised learning: clustering

  • Clustering: grouping observations into clusters based on their (dis-)similarity
  • A function that maps an input to an undefined (or a semi-undefined) output is called a clusterer
    • K-means clustering: grouping observations into clusters based on their distance from the cluster centroids
    • Hierarchical clustering: grouping observations into clusters based on their distance from each other
[Figure: clustering on the Digits dataset (scikit-learn). The Digits dataset contains handwritten digits from 0 to 9.]
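
A minimal sketch of k-means clustering on the Digits dataset (scikit-learn):

from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

X, _ = load_digits(return_X_y=True)   # labels are ignored: unsupervised
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])            # cluster assignment for the first 10 images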

Unsupervised learning: dimensionality reduction

  • Dimensionality reduction: reducing the number of variables in a dataset while retaining as much information as possible
[Figure: Principal Components Analysis and Linear Discriminant Analysis applied to the Iris dataset.]

Note: The Iris dataset contains 3 kinds of iris flowers (Setosa, Versicolour, and Virginica) with 4 attributes: sepal length, sepal width, petal length, and petal width.
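
A minimal sketch of both methods on the Iris dataset (scikit-learn):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)     # unsupervised: ignores labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses labels
print(X_pca.shape, X_lda.shape)   # 4 attributes reduced to 2 components each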

ML in causal inference

  • Causal inference
    • Generalized synthetic control methods
    • Matching and weighting methods
    • Heterogeneous treatment effects



Generalized difference-in-differences

  • Observational difference-in-differences settings
  • Given a set of control units, find a weighted combination of control units that best matches the treated unit(s) on the pre-treatment outcome
[Figure: fitted synthetic control. Source: Facure (2023); Abadie, Diamond, and Hainmueller (2010).]
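
A minimal, unconstrained sketch of the weighting logic (synthetic data; the canonical estimator of Abadie, Diamond, and Hainmueller (2010) additionally constrains the weights to be non-negative and sum to one):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
Y_controls = rng.normal(size=(20, 30))   # 20 pre-treatment periods x 30 control units
y_treated = Y_controls @ rng.dirichlet(np.ones(30)) + rng.normal(0, 0.1, 20)

# regress the treated unit's pre-treatment outcomes on the controls' to get weights
sc = LinearRegression(fit_intercept=False).fit(Y_controls, y_treated)
y_synthetic = Y_controls @ sc.coef_      # fitted synthetic control trajectory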

Matching and weighting methods

  • Propensity score matching & weighting (Rosenbaum and Rubin 1983; Horvitz and Thompson 1952)

  • Traditional matching methods: nearest neighbor matching, radius matching, kernel matching, Mahalanobis distance etc.

    • Logic: Given a set of control units, find a subset of control units that best matches the treated unit(s) on the pre-treatment outcome
[Figure: matching with time-series cross-sectional data. Source: Imai, Kim, and Wang (2023).]
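
A minimal sketch of propensity score matching (synthetic data): estimate each unit's probability of treatment, then pair each treated unit with its nearest control on that score:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                         # pre-treatment covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment depends on X

ps = LogisticRegression().fit(X, treat).predict_proba(X)[:, 1]   # propensity scores
nn = NearestNeighbors(n_neighbors=1).fit(ps[treat == 0].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treat == 1].reshape(-1, 1))   # matched control per treated unit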

Trajectory weighting methods

  • Trajectory weighting methods (Zubizarreta 2015; Hazlett and Xu 2018)
    • Logic: Given a set of control units, find a weighted combination of control units whose pre-treatment outcome trajectories (and possibly other covariates) best match those of the treated unit(s)
[Figures: pre-treatment covariate balance and treatment effect estimation (Hazlett and Xu 2018).]

Heterogeneous treatment effects

[Figure: heterogeneous treatment effect analysis. Source: Gong et al. (2021).]

ML in Natural Language Processing

  • Text-as-data vs. NLP
    • NLP is an umbrella term for a set of methods that are used to analyze text data
    • Text-as-data is a subset of NLP methods that are used to analyze text data in a quantitative way (e.g., word counts)
  • Treating text as data is not a new idea

Word embeddings



You shall know a word by the company it keeps
(Firth 1957)


  • Word embeddings are vector representations of words in a high-dimensional space
  • Word embeddings are used to represent words in a way that captures their meaning and context
  • Each word is assigned a unique vector, capturing its semantic meaning
  • Words that are similar in meaning are close to each other in the vector space

Word embeddings example

  • We can model the relationship between words using word embeddings

    • King = [0.1, 0.2, 0.3, 0.4, 0.5]
    • Man = [0.2, 0.3, 0.4, 0.5, 0.6]
    • Woman = [0.3, 0.5, 0.7, 0.9, 1.1]


  • King - Man + Woman = [0.2, 0.4, 0.6, 0.8, 1.0] = Queen
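
The toy analogy can be checked directly with NumPy (real embeddings come from models such as word2vec; Mikolov et al. 2013):

import numpy as np

king  = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
man   = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
woman = np.array([0.3, 0.5, 0.7, 0.9, 1.1])
print(king - man + woman)   # [0.2 0.4 0.6 0.8 1.0] -- the vector closest to "queen"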

Word embeddings

We can model a word’s meaning over another dimension (e.g. documents, time, party, gender, etc.)

\[\mathbf{Y} = \mathbf{X} \beta + \mathbf{E} \]

Where \(\mathbf{Y}\) is an \(n \times D\) matrix of word embeddings (one \(D\)-dimensional embedding per row), \(\mathbf{X}\) is an \(n \times k\) matrix of covariates, \(\beta\) is a \(k \times D\) matrix of coefficients, and \(\mathbf{E}\) is an \(n \times D\) matrix of errors.

[Figures: Categorical: gender and party differences in word usage in the US Congress; Dynamic: US/UK differences in the use of “Empire” (Rodriguez, Spirling, and Stewart 2023).]

Topic modeling

  • Clustering method that groups words into topics based on their co-occurrence in documents

  • Logic: Given a set of documents, find a set of topics that best describes the documents

    • Example: Given a set of political speeches, find a set of topics that best describes them


Text -> Embeddings -> Clustering -> Word Representations -> Topics (Summarised)
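
The pipeline above is the embedding-based variant; a minimal sketch of classic bag-of-words topic modeling (latent Dirichlet allocation; Blei, Ng, and Jordan 2003) in scikit-learn, with the 20 Newsgroups corpus as a stand-in for political texts:

from sklearn.datasets import fetch_20newsgroups   # stand-in corpus (downloads on first use)
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:500]
vectorizer = CountVectorizer(max_features=2000, stop_words="english")
counts = vectorizer.fit_transform(docs)           # document-term matrix
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
words = vectorizer.get_feature_names_out()
print(words[lda.components_[0].argsort()[-8:]])   # top words for topic 0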

Language models

  • Language models combine embeddings with neural networks to predict the next word in a sequence of words

  • Neural networks are trained on large datasets of text to learn patterns in language

  • Language models are used to generate text, answer questions, and perform other tasks

  • Example: ChatGPT
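
A minimal sketch of next-word generation with an openly available model (GPT-2 via the transformers library; the prompt is illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The main driver of voter turnout is", max_new_tokens=20))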


Types of language models

  • Language models can be used to generate text, answer questions, and perform other tasks

  • Pre-trained models are trained on large datasets of text and can be fine-tuned for specific tasks

    • Common Example: BERT
      • Trained on BookCorpus, a dataset consisting of 11,038 unpublished books and the entirety of English Wikipedia (excluding lists, tables and headers).

Coding Example with BERT:

# input:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("The man worked as a [MASK].")
# output 1:
#   'score': 0.09747550636529922, 'token_str': 'carpenter'
# output 2:
#   'score': 0.04962705448269844, 'token_str': 'barber'
Additional applications of language models

  • Task-specific models are trained on specific datasets and are designed to perform specific tasks

Moving beyond text

  • Language models (e.g., the transformer architecture) can be used to analyze other types of data
    • Images (Wu et al. 2020)
      • Classification: predicting the content of an image
      • Generation: generating images from text
      • Segmentation: identifying objects in an image
      • Detection: identifying objects in an image and their location
    • Audio: Whisper (Radford et al. 2022)
      • Text-to-speech: generating speech from text
      • Text-to-audio: generating audio from text
      • Speech/speaker recognition: transcribing speech to text

Ethics in machine learning

  • Machine learning and NLP have the potential to transform the study of political behavior, but they also raise a number of ethical concerns

    • Contemporary Ethical Considerations
    • AI Governance and Accountability
    • Responsible AI/ML research
[Figure. Source: MLPrograms.]

Contemporary ethical considerations

  • Bias and Fairness:
    • Addressing biases in training data that can perpetuate stereotypes and discrimination
      • Systematically inaccurate models can also be harmful
      • garbage in, garbage out
  • Misinformation and Disinformation:
    • Acknowledgment of the role language models can play in spreading misinformation.
    • Developing tools and policies to counter misinformation.
  • Privacy Concerns:
    • Heightened awareness of privacy implications
    • Stricter regulations on data usage and user consent.

ML governance and accountability

  • Transparency:
    • Calls for greater transparency in how ML systems are developed and used.
    • Development of tools to explain model behavior.
  • Accountability:
    • Who is responsible for the actions of ML systems?
    • Growing emphasis on holding developers and organizations accountable
    • Development of ethical guidelines and frameworks
  • Social Impact:
    • Recognition of the broader societal impact of ML applications.
    • Calls for responsible innovation and socially beneficial solutions.

Ethics in research practice when using ML

  • Some practical considerations for researchers using ML in their research:

    • What data are you using? Where did it come from? How was it collected?
    • Are there any biases in the data? How might these biases affect your results?
    • Are there any privacy concerns? How will you protect the privacy of your subjects?
      • Are you giving language models access to sensitive information?
    • What model are you using? How does it work? What assumptions does it make?
    • How will you evaluate the performance of your model? What metrics will you use?

Resources for further learning

  • Machine learning in the social sciences: Grimmer, Roberts, and Stewart (2022); Huber (2023)
  • Machine learning in practice: James et al. (2013, 2023)
    • Online courses: Coursera, Kaggle, Udemy, edX
    • YouTube: StatQuest, 3Blue1Brown, Two Minute Papers
  • Natural language processing (in “real” ML): Tunstall, Von Werra, and Wolf (2022)

Some websites you should know about


Pre-trained language models

  • Hugging Face
    • Hugging Face is the largest platform for sharing and using open source pre-trained language models



Kaggle

  • Kaggle is a platform for data science competitions and machine learning projects
  • Kaggle hosts a number of datasets and provides tools for data exploration and analysis

Google Colab

  • Google Colab is a free cloud-based platform for data science and machine learning projects

Discussion

  • What are some applications of machine learning you could incorporate in your research?

  • What types of data might you use?

  • What are some ethical considerations you should keep in mind when using machine learning in your research?

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505.
Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2021. “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116 (536): 1716–30.
Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805.
Facure, Matheus. 2023. Causal Inference in Python. O’Reilly Media.
Firth, John. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis, 10–32.
Gong, Xiajing, Meng Hu, Mahashweta Basu, and Liang Zhao. 2021. “Heterogeneous Treatment Effect Analysis Based on Machine-Learning Methodology.” CPT: Pharmacometrics & Systems Pharmacology 10 (11): 1433–43.
Grimmer, Justin, Margaret E Roberts, and Brandon M Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Harris, Zellig S. 1954. “Distributional Structure.” Word 10 (2-3): 146–62.
Hazlett, Chad, and Yiqing Xu. 2018. “Trajectory Balancing: A General Reweighting Approach to Causal Inference with Time-Series Cross-Sectional Data.” Available at SSRN 3214231.
Horvitz, Daniel G, and Donovan J Thompson. 1952. “A Generalization of Sampling Without Replacement from a Finite Universe.” Journal of the American Statistical Association 47 (260): 663–85.
Huber, Martin. 2023. Causal Analysis: Impact Evaluation and Causal Machine Learning with Applications in r. MIT Press.
Imai, Kosuke, In Song Kim, and Erik H Wang. 2023. “Matching Methods for Causal Inference with Time-Series Cross-Sectional Data.” American Journal of Political Science 67 (3): 587–605.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor. 2023. An Introduction to Statistical Learning: With Applications in Python. Springer.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.
Radford, Alec, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. “Robust Speech Recognition via Large-Scale Weak Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2212.04356.
Rodriguez, Pedro L, Arthur Spirling, and Brandon M Stewart. 2023. “Embedding Regression: Models for Context-Specific Description and Inference.” American Political Science Review, 1–20.
Rosenbaum, Paul R, and Donald B Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Salton, Gerard, and Christopher Buckley. 1988. “Term-Weighting Approaches in Automatic Text Retrieval.” Information Processing & Management 24 (5): 513–23.
Tunstall, Lewis, Leandro Von Werra, and Thomas Wolf. 2022. Natural Language Processing with Transformers. O’Reilly Media.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30.
Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.
Wu, Bichen, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. 2020. “Visual Transformers: Token-Based Image Representation and Processing for Computer Vision.” https://arxiv.org/abs/2006.03677.
Zubizarreta, José R. 2015. “Stable Weights That Balance Covariates for Estimation with Incomplete Outcome Data.” Journal of the American Statistical Association 110 (511): 910–22.