πŸ§ πŸš€ Scaling Expertise

Feb 25, 2021 πŸ“– < 9 min read

Strapping domain expertise to a rocket?

A potential employee (i.e., candidate) applies for a job. They take assessments, some of which involve responding in their own words. How do we evaluate those responses in a reliable, valid, and scalable manner?

Best practices for evaluating open-ended text responses βš–οΈ have been around since 1975. More recently, these best practices have been combined with the automated essay scoring paradigm: using natural language processing πŸ€– to predict the ground truth gold labels of SMEs. Even with this exciting advancement, having experts go row by row evaluating each and every response on a numeric scale is a bottleneck with many limitations πŸ₯΄

Let's use Snorkel (snorkel.ai), a Python package for weak/distant supervision that combines expert precision with state-of-the-art scalability πŸš€ (i.e., coverage), to label/rate our text responses in terms of a psychological construct; specifically, the Big 5 personality trait of extraversion. Even better, let's create Python labeling functions (LFs) 🐍 out of πŸ€— zero-shot learning (ZSL) as a high-coverage heuristic.


Summary

TLDR: Snorkel is a fitting framework that promotes SMEs' ability to impart their wisdom at scale.

Specifically, we πŸ‘‡

  • Programmed functions in Python that mapped onto our SME ground truth gold labels
    • Zero-shot predictions for the 35 factors/facets of the Big 5 personality taxonomy
    • TextBlob sentiment
    • Pattern-based heuristics (i.e., keywords)
  • Created a generative model based on accuracies and correlations of our labeling functions
    • Programmatically labeled all of our unlabeled responses
  • Trained a machine learning model on all (previously unlabeled) data
  • Strategy works with guidelines and ethical considerations for assessment center operations

Background on Snorkel

There are quite a few resources on Snorkel, such as this one and this one. Essentially, the tried-and-true high-precision, low-coverage approach of having SMEs read through and evaluate each response is a bottleneck because:

  1. Experts are expensive and can only label so many responses (i.e., low coverage) πŸ’²πŸ€“πŸ·
    • State-of-the-art models require Big Data (i.e., high coverage) πŸ’²πŸ’²πŸ’²
  2. Class definitions/granularity change requiring re-labeling πŸ’²πŸ₯΄πŸ·
  3. Changes in tech such as APIs lose the text/label relationships πŸ’²πŸ™ƒπŸ·
  4. Test security involves multiple/parallel measures that require more labels πŸ’²πŸ”πŸ·

πŸ“Έ Instead of capturing a fleeting snapshot of expertise, the idea behind Snorkel is that we can bring all sources of signal to bear, including SMEs, to programmatically label limitless text.

βš–οΈ The hook is that we incorporate a small subset of expertly rated ground truth gold labels to promote the legal defensibility of our approach.

πŸͺ™ In this way, we label the population of responses (i.e., coverage) and demonstrate precision with respect to our sample of SME gold.


Dataset

Let's use the SIOP 2019 ML competition data, focusing on the scenario-based prompt that was written to promote variability in terms of extraversion.

Extraversion Prompt

"You and a colleague have had a long day at work and you just find out you have been invited to a networking meeting with one of your largest clients. Your colleague is leaning towards not going and if they don't go you won’t know anyone there. What would you do and why?"


Examples of Gold Label Ground Truth Extraverted Responses πŸͺ™

I would go and enjoy myself and network with the client. I have no issue with meeting new people or being in unfamiliar environments. My personality is naturally open and engaging.

I would go to the meeting. The purpose is to meet new people and I am up for the task. I consider myself social and would have no problem adjusting.

Examples of Gold Label Ground Truth Not Extraverted Responses πŸͺ™

I would not go because I am very introverted. It would be awkward and not fun if I did not know anyone else there. Even though it could be beneficial to my career, I would be too anxious to go.

I would go home after work. As an introvert, it takes a lot of energy for me to be social and engaged in networking settings, so I would very likely feel uncomfortable and awkward. I would much rather relax at home with my wife and puppy, eat a nice dinner with them, and spend the night watching TV.

Setting Up The Dataset

Our end goal is to train a machine learning text classifier that can evaluate responses as extraverted or not extraverted (1 for extraverted; 0 for not extraverted). We have a total of 1688 responses. A subject matter expert (me) provided gold label ground truth for 350 responses (about 20%); 120 went into our training split, 115 into our development set, and 115 into our test set, for a roughly 1:1:1 ratio.

We don't typically put gold labels in the training split, but I set it up this way to potentially build baseline models and for diagnostic purposes.

Please note in our training split we have the remaining 1338 unlabeled responses.
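For reference, here is a minimal sketch of one way to set up that split, assuming a dataframe df holding all 1688 responses with a gold column that is 1/0 for SME-labeled responses and missing otherwise (the column names are my assumptions, not the original code):

Code Sketch: Splitting the Data (illustrative)
import pandas as pd

# Shuffle the 350 SME-labeled responses, then carve out dev/test/train-gold
gold = df[df.gold.notna()].sample(frac=1, random_state=123)
df_dev = gold.iloc[:115]       # development set
df_test = gold.iloc[115:230]   # test set

# Training split: 120 gold-labeled responses plus the 1338 unlabeled ones
df_train = pd.concat([gold.iloc[230:], df[df.gold.isna()]])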

Warning

I'm using a very small, open-source dataset to show a minimally viable product (MVP) walk-through. The power of this approach shows when we have lots of unlabeled data. Still, I'm not labeling 1688 responses!


Snorkel Flow

Machine Teaching

🐍 The first step (1) is writing labeling functions (LFs) in Python that express expertise in terms of evaluating extraversion.

🎯 Next (2) Snorkel automatically learns a generative model based on the accuracies and correlations of the LFs. Using this generative model, programmatic/soft labels are created for our population of responses.

Machine Learning

πŸ€– Finally (3) we predict these programmatic/soft labels using a supervised ML model.


(1) Signals Used To Program Expertise

External models out-of-the-box: Let's use πŸ€— zero-shot learning to build heuristics; that is, classify responses in terms of not just extraversion but all 35 factors/facets of the Big 5 personality taxonomy. ZSL is perfect for providing weak/distant supervision (i.e., coverage). It is a bit noisy, because we didn't explicitly train a model to learn the 35 factors/facets; nevertheless, these weak classifiers do quite well in predicting our expertly labeled πŸͺ™ extraversion. They're not perfect, but they provide tremendous coverage of all our responses.

I did an entire post on zero-shot that's worth checking out.

Journal Article

I used the factors/facets found in the APPENDIX of this journal article.

Zillig, L. M. P., Hemenover, S. H., & Dienstbier, R. A. (2002). What do we assess when we assess a Big 5 trait? A content analysis of the affective, behavioral, and cognitive processes represented in Big 5 personality inventories. Personality and Social Psychology Bulletin, 28(6), 847-858.

To build LFs from zero-shot predictions, I adapted the Snorkel tutorial on crowdsourcing.

More Details on Zero-Shot Learning

Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before.


The approach, proposed by Yin et al. (2019), uses a pre-trained MNLI sequence-pair classifier as an out-of-the-box zero-shot text classifier that actually works pretty well. The idea is to take the sequence we're interested in labeling as the "premise" and to turn each candidate label into a "hypothesis." If the NLI model predicts that the premise "entails" the hypothesis, we take the label to be true.

Code Example Zero-Shot
!pip install git+https://github.com/huggingface/transformers.git

from transformers import pipeline

# Out-of-the-box zero-shot classifier (MNLI-based sequence-pair model)
classifier = pipeline("zero-shot-classification")

sequence = ('I would go! I would be excited about going. Networking could advance my career. '
            'Networking could bring our company more work.')

candidate_labels = ['agreeableness', 'conscientiousness', 'extraversion', 'neuroticism', 'openness']

hypothesis_template = 'This response is characterized by {}.'

# multi_class=True scores each candidate label independently (no softmax across labels)
classifier(sequence, candidate_labels, multi_class=True, hypothesis_template=hypothesis_template)
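The LF-building block below expects a long-format dataframe zs_labels with one row per (response, facet) pair. The original post does not show how it was constructed, but a hedged sketch might look like the following: run the classifier over every response with the facet labels and binarize the scores (the 0.5 cutoff and the df/response_id/text column names are assumptions):

Code Sketch: Long-Format Zero-Shot Votes (illustrative)
import pandas as pd

# facet_labels would hold all 35 Big 5 factor/facet names from Zillig et al. (2002);
# only a few are listed here for brevity.
facet_labels = ['extraversion', 'warmth', 'gregariousness', 'assertiveness']

rows = []
for _, row in df.iterrows():  # df holds all responses (train + dev + test)
    out = classifier(row.text, facet_labels, multi_class=True,
                     hypothesis_template=hypothesis_template)
    for label, score in zip(out['labels'], out['scores']):
        rows.append({'response_id': row.response_id,
                     'candidate_label_id': label,
                     # assumed rule: facet entailed (score >= 0.5) -> vote 1, else 0
                     'label': int(score >= 0.5)})

zs_labels = pd.DataFrame(rows).set_index('response_id')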
Code To Build LFs from Zero-Shot
# To create LFs from zero-shot I used the example in the snorkel-tutorial on crowdsourcing
# https://github.com/snorkel-team/snorkel-tutorials/tree/master/crowdsourcing

from snorkel.labeling import LabelingFunction

# Build one {response_id: vote} lookup per zero-shot candidate label (facet)
labels_by_zs = zs_labels.groupby("candidate_label_id")
zs_dicts = {}
for zs_id in labels_by_zs.groups:
    zs_df = labels_by_zs.get_group(zs_id)[["label"]]
    zs_dicts[zs_id] = dict(zip(zs_df.index, zs_df.label))

ABSTAIN = -1

def zs_lf(x, zs_dict):
    # Vote with the zero-shot label for this response; abstain if it was never scored
    return zs_dict.get(x.response_id, ABSTAIN)

def make_zs_lf(zs_id):
    zs_dict = zs_dicts[zs_id]
    name = f"lf_{zs_id}"
    return LabelingFunction(name, f=zs_lf, resources={"zs_dict": zs_dict})

zs_lfs = [make_zs_lf(zs_id) for zs_id in zs_dicts]

More external models: Positive emotion is a facet of extraversion. Let's use an out-of-the-box model, TextBlob, to provide signal in terms of positive sentiment πŸ˜„

I used the example code from the Snorkel documentation.

Code TextBlob Labeling Functions
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor
from textblob import TextBlob

ABSTAIN = -1

@preprocessor(memoize=True)
def textblob_sentiment(x):
    # Attach TextBlob polarity/subjectivity scores to each response
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x

@labeling_function(pre=[textblob_sentiment])
def polarity_positive(x):
    return 1 if x.polarity > 0.3 else ABSTAIN

@labeling_function(pre=[textblob_sentiment])
def polarity_negative(x):
    return 0 if x.polarity < -0.25 else ABSTAIN

@labeling_function(pre=[textblob_sentiment])
def polarity_negative_2(x):
    return 0 if x.polarity <= 0.3 else ABSTAIN

@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return 1 if x.subjectivity >= 0.5 else ABSTAIN

Pattern-based heuristics: I picked up on keywords that reflect extraverted/introverted behaviors:

  • Extraverted: extraverted, adventure, exciting, !, extroverted, butterfly, outgoing, upbeat
  • Introverted: introverted, awkward, anxious, uncomfortable, quiet, loner, intimidating, disorder, stressful, miserable, shy, dread, beg

I used the example code from the Snorkel documentation.

Code Keyword Labeling Functions
def keyword_lookup(x, keywords, label):
    # Vote `label` if any keyword appears in the lowercased response; otherwise abstain
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN

def make_positive_keyword_lf(keywords, label=1):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

def make_negative_keyword_lf(keywords, label=0):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )

keyword_positive = make_positive_keyword_lf(
    keywords=[
        "extraverted", "adventure", "exciting", "!", "extroverted", "butterfly", "outgoing", "upbeat"
    ]
)

keyword_negative = make_negative_keyword_lf(
    keywords=[
        "introverted",
        "awkward",
        "anxious",
        "uncomfortable",
        "quiet",
        "loner",
        "intimidating",
        "disorder",
        "stressful",
        "miserable",
        "shy",
        "dread",
        "beg",
    ]
)

There are additional ways to express expertise (i.e., build LFs); check out this link and the list below.

  • Distant supervision: According to Ratner et al., "Distant supervision generates training labels by heuristically aligning data points with an external knowledge base, and is one of the most popular forms of weak supervision."
  • Labeling-function generators: Again from Ratner et al., we can build labeling functions from crowdsourced labelers, such as Mechanical Turk workers. Note these are not used as gold labels, but rather as high-coverage signal.
  • spaCy syntactics: More advanced NLP, such as part-of-speech (POS) tags and named entity recognition (NER), can be used to capture nuanced patterns of behavior that represent the psychological constructs of interest (see the sketch below).
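As a purely hypothetical illustration of that last idea (not code from the original walk-through), a spaCy-based LF could vote extraverted only when the writer is the subject of a non-negated approach verb, something a plain keyword match would get wrong on responses like "I would not go":

Code Sketch: spaCy Labeling Function (illustrative)
from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy_pre = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

APPROACH_VERBS = {"go", "attend", "network", "socialize", "meet"}  # assumed word list

@labeling_function(pre=[spacy_pre])
def lf_first_person_approach(x):
    for token in x.doc:
        if token.lemma_.lower() in APPROACH_VERBS:
            first_person_subject = any(c.dep_ == "nsubj" and c.lower_ == "i"
                                       for c in token.children)
            negated = any(c.dep_ == "neg" for c in token.children)
            if first_person_subject and not negated:
                return 1  # extraverted
    return ABSTAIN        # otherwise abstain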

Metrics

The power of Snorkel is that our LFs will conflict for the same response. For example, the zero-shot prediction of warmth could predict a 0 (not extraverted) while gregariousness predicts a 1 (extraverted). That's OK; we can denoise the conflicts 🀝

Here are the metrics from analyzing how our LFs did on the development set (N = 115).

  • Emp. Accuracy: Accuracy of LF predictions. For example, our keywords that represented an introverted response (e.g., shy or awkward) were 83% accurate.
  • Coverage: % of responses with at least one LF vote, extraverted (1) or not (0). We want coverage. Our keywords that represented an introverted response only had 21% coverage (24 of 115 responses), whereas our zero-shot prediction of extraversion has 100% coverage.
  • Polarity: Values the LF returns (1 = extraverted; 0 = not extraverted/introverted).
  • Overlaps & Conflicts: Metrics the generative model uses to estimate the accuracy of each LF.

If we look at the far-right column, Emp. Acc., our individual LFs are pretty accurate.

| Labeling Function (LF) | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TextBlob Sentiment Polarity Positive | [1] | 0.17 | 0.17 | 0.16 | 16 | 3 | 0.84 |
| Keywords Introverted | [0] | 0.21 | 0.21 | 0.20 | 20 | 4 | 0.83 |
| Zero-Shot Extraversion | [0, 1] | 1.00 | 1.00 | 0.93 | 89 | 26 | 0.77 |
| Zero-Shot Positive Emotions | [0, 1] | 1.00 | 1.00 | 0.93 | 89 | 26 | 0.77 |
| Zero-Shot Openness | [0, 1] | 1.00 | 1.00 | 0.93 | 89 | 26 | 0.77 |
| Zero-Shot Gregariousness | [0, 1] | 1.00 | 1.00 | 0.93 | 89 | 26 | 0.77 |
| Zero-Shot Assertiveness | [0, 1] | 1.00 | 1.00 | 0.93 | 87 | 28 | 0.76 |
| Zero-Shot Warmth | [0, 1] | 1.00 | 1.00 | 0.93 | 87 | 28 | 0.76 |
| Zero-Shot Excitement Seeking | [0, 1] | 1.00 | 1.00 | 0.93 | 87 | 28 | 0.76 |
| Zero-Shot Activity | [0, 1] | 1.00 | 1.00 | 0.93 | 87 | 28 | 0.76 |
| TextBlob Sentiment Polarity Negative | [0] | 0.10 | 0.10 | 0.10 | 9 | 3 | 0.75 |
| Zero-Shot Achievement | [0, 1] | 1.00 | 1.00 | 0.93 | 84 | 31 | 0.73 |
| Zero-Shot Competence | [0, 1] | 1.00 | 1.00 | 0.93 | 79 | 36 | 0.69 |
| Zero-Shot Altruism | [0, 1] | 1.00 | 1.00 | 0.93 | 78 | 37 | 0.68 |
| Zero-Shot Trust | [0, 1] | 1.00 | 1.00 | 0.93 | 77 | 38 | 0.67 |
| Zero-Shot Agreeableness | [0, 1] | 1.00 | 1.00 | 0.93 | 77 | 38 | 0.67 |
| Zero-Shot Actions | [0, 1] | 1.00 | 1.00 | 0.93 | 75 | 40 | 0.65 |
| Zero-Shot Aesthetics | [0, 1] | 1.00 | 1.00 | 0.93 | 74 | 41 | 0.64 |
| Zero-Shot Ideas | [0, 1] | 1.00 | 1.00 | 0.93 | 74 | 41 | 0.64 |
| Zero-Shot Tender Mindedness | [0, 1] | 1.00 | 1.00 | 0.93 | 73 | 42 | 0.63 |
| Zero-Shot Impulsiveness | [0, 1] | 1.00 | 1.00 | 0.93 | 72 | 43 | 0.63 |
| Keywords Extraverted | [1] | 0.07 | 0.07 | 0.07 | 5 | 3 | 0.62 |
| Zero-Shot Straightforwardness | [0, 1] | 1.00 | 1.00 | 0.93 | 70 | 45 | 0.61 |
| TextBlob Polarity Negative_2 | [0] | 0.83 | 0.83 | 0.77 | 55 | 41 | 0.57 |
| Zero-Shot Values | [0, 1] | 1.00 | 1.00 | 0.93 | 63 | 52 | 0.55 |
| Zero-Shot Fantasy | [0, 1] | 1.00 | 1.00 | 0.93 | 62 | 53 | 0.54 |
| Zero-Shot Conscientiousness | [0, 1] | 1.00 | 1.00 | 0.93 | 61 | 54 | 0.53 |
| Zero-Shot Dutifulness | [0, 1] | 1.00 | 1.00 | 0.93 | 60 | 55 | 0.52 |
| TextBlob Subjectivity | [1] | 0.50 | 0.50 | 0.50 | 30 | 28 | 0.52 |
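For completeness, a table like the one above can be produced with Snorkel's LFAnalysis utility; a minimal sketch, assuming L_dev and all_lfs are built as in the code block in the next section:

Code Sketch: LF Summary Metrics (illustrative)
from snorkel.labeling import LFAnalysis

# Summarize polarity, coverage, overlaps, conflicts, and empirical accuracy
# of each LF against the dev-set gold labels
lf_summary = LFAnalysis(L=L_dev, lfs=all_lfs).lf_summary(Y=df_dev.gold.values)
print(lf_summary)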

(2) Generative Model

Next, we take our noisy and conflicting labeling functions and use the Snorkel LabelModel to denoise and combine them using their accuracies 🎯 and correlations πŸ•Έ. More accurate LFs are weighted accordingly.

To check the quality of our generative model, we score it using our development set. We end up getting accuracy = .77, which is pretty solid considering I plugged and played example code from the Snorkel documentation in addition to the zero-shot predictions.

πŸͺ™ We labeled the population of responses and provided evidence it was precise with respect to our sample of SME ground truth gold labels.

The result of this step is that the generative model creates labels for all 1338 of our unlabeled responses πŸ‘Œ

Code PandasLFApplier LabelModel
from snorkel.analysis import metric_score
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

# For this step, I followed the code in the Snorkel tutorial on crowdsourcing
# https://github.com/snorkel-team/snorkel-tutorials/tree/master/crowdsourcing

# Combine all LFs defined above (zero-shot, TextBlob, and keyword LFs)
all_lfs = zs_lfs + [polarity_positive, polarity_negative, polarity_negative_2,
                    textblob_subjectivity, keyword_positive, keyword_negative]

# Apply LFs to dev and train
applier = PandasLFApplier(all_lfs)
L_train = applier.apply(df_train)
L_dev = applier.apply(df_dev)

# Train LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=100, seed=123, log_freq=20, l2=0.1, lr=0.01)

# Score LabelModel on our dev set
preds_dev = label_model.predict(L_dev)
acc = metric_score(df_dev.gold.values, preds_dev, probs=None, metric="accuracy")
print(f"LabelModel Accuracy: {acc:.3f}")

# Output
# LabelModel Accuracy: 0.765

# Generate labels for our 1338 unlabeled responses
preds_train = label_model.predict(L_train)

(3) Predict Labels Using ML

Now that we have our 1338 programmatic/soft labels, we disregard the LFs and build a machine learning model using just two things: the text responses πŸ“œ and a column of ones and zeros, extraverted response or not, respectively.

Consistent with the Snorkel tutorial, I used BERT, a pre-trained language model, and trained a logistic regression model on our 1338 labels using the BERT features. For a stronger approach that can harness Big Data, check out this article.

The accuracy of the trained model was .72, a little bit of shrinkage (~6.5%), but keep in mind we have very small data. Nevertheless, this level of accuracy is in keeping with top assessment journals.

Code Predict Soft Labels using Supervised ML
from snorkel.analysis import metric_score
from snorkel.labeling import PandasLFApplier
from snorkel.labeling.model import LabelModel

# For this step, I followed the code in the Snorkel tutorial on crowdsourcing
# https://github.com/snorkel-team/snorkel-tutorials/tree/master/crowdsourcing

import numpy as np
import torch
from pytorch_transformers import BertModel, BertTokenizer

model = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


def encode_text(text):
    input_ids = torch.tensor([tokenizer.encode(text)])
    return model(input_ids)[0].mean(1)[0].detach().numpy()


# Encode each response (the `text` column) with mean-pooled BERT features
X_train = np.array(list(df_train.text.apply(encode_text).values))
X_test = np.array(list(df_test.text.apply(encode_text).values))

from sklearn.linear_model import LogisticRegression

# Fit on the programmatic labels; evaluate against the SME gold labels in the test set
sklearn_model = LogisticRegression(solver="liblinear")
sklearn_model.fit(X_train, preds_train)

Y_test = df_test.gold.values
print(f"Accuracy of trained model: {sklearn_model.score(X_test, Y_test)}")

# Output
# Accuracy of trained model: 0.715

Conclusion

In 1975, Bill Gates and Paul Allen started Microsoft, the Suez Canal reopened, and best practices for evaluating open-ended text responses in the talent space were assembled 🍾

The spirit of this post is to channel those seemingly fleeting snapshots of subject matter expertise (i.e., precision) into Python labeling functions (LFs) that can be applied to all of our data (i.e., coverage) and on demand (think: AWS) to evaluate fresh, yet-to-be-labeled data.

The framework of weak/distant supervision (e.g., Snorkel) is especially fitting because it incorporates expertly rated ground truth gold labels to help buoy the legal defensibility of this strategy.

I spent 90 minutes providing gold labels for 350 responses in terms of extraversion and ended up with an end model trained on 1338 responses with an accuracy of .72. That's a solid start using small, open-source data.

Derek L Mracek, PhD

