\documentclass[11pt,xcolor=svgnames]{beamer}
\usepackage{dsfont,natbib,setspace,changepage,multirow}
\mode<presentation>

% replaces beamer foot with simple page number
\setbeamertemplate{navigation symbols}{}
%\setbeamerfont{frametitle}{series=\bfseries}
\setbeamercolor{frametitle}{fg=Black}

\setbeamertemplate{footline}{
   \raisebox{5pt}{\makebox[\paperwidth]{\hfill\makebox[20pt]{\color{gray}\scriptsize\insertframenumber}}}}

\graphicspath{{/Users/mtaddy/Dropbox/inputs/}}
\usepackage{algorithm}
\usepackage{algorithmic}

% colors
\newcommand{\theme}{\color{Maroon}}
\newcommand{\bk}{\color{black}}
\newcommand{\rd}{\color{DarkRed}}
\newcommand{\fg}{\color{ForestGreen}}
\newcommand{\bl}{\color{blue}}
\newcommand{\gr}{\color{black!67}}
\newcommand{\sg}{\color{DarkSlateGray}}
\newcommand{\nv}{\color{Navy}}
\setbeamercolor{itemize item}{fg=gray}

% common math markups
\newcommand{\bs}[1]{\boldsymbol{#1}}
\newcommand{\mc}[1]{\mathcal{#1}}
\newcommand{\mr}[1]{\mathrm{#1}}
\newcommand{\bm}[1]{\mathbf{#1}}
\newcommand{\ds}[1]{\mathds{#1}}
\newcommand{\indep}{\perp\!\!\!\perp}
\def\plus{\texttt{+}}
\def\minus{\texttt{-}}

% spacing and style shorthand
\setstretch{1.1}

\begin{document}

\setcounter{page}{0}
{ \usebackgroundtemplate{\includegraphics[height=\paperheight]{phoenix}}
\begin{frame}[plain]
\begin{center}


{\bf \Large Document Classification by Inversion of \\Distributed Language Representations}

\vskip 1cm\large
 Matt Taddy,  Chicago Booth

\end{center}
\end{frame} }

\begin{frame}


{\bf Distributed language representation}

\vskip .5cm
$\theme  \boldsymbol{\mathcal{V}}$ contains
an embedding  in $\mathds{R}^K$ for every vocabulary  word.

\vskip .25cm
In a contextual language model, $\mathcal{V}$ is trained to maximize the likelihoods for each single word and its neighbors.

\vskip .25cm
e.g., the {\theme skip-gram} objective for word $t$ in sentence $s$ is
\begin{equation*}\label{eq:skipgram}
\mathrm{max}\sum_{j\neq t,~j=t-b}^{t+b} \log\mathrm{p}_{\mathcal{V}}(w_{sj}\mid w_{st})
\end{equation*}
where $b$ is the skip-gram window (truncate at ends of sentences).

\end{frame}

\begin{frame}

{\bf Neural network language models}

\vskip .5cm
Local context probabilities are functions of the word embeddings.

\vskip .25cm
e.g., in {\theme Word2Vec} {\gr (Mikolov et al. 2013)}
\begin{equation*} \label{eq:neuralnet}
\mathrm{p}_{\mathcal{V}}(w | w_t) =
 \prod_{j=1}^{L(w)-1}\sigma\!\left( \mathrm{ch}\left[\eta(w,j+1)\right] \mathbf{u}_{\eta(w,j)}^\top \mathbf{v}_{w_t} \right)
\end{equation*}
where $\sigma(x) = 1/(1+e^{-x})$, $\eta(w,i)$ is the $i$th node in the length-$L(w)$ Huffman tree path for $w$, and $\mathrm{ch}(\eta)
\in \{-1,+1\}$ indicates whether $\eta$ is a left or right child.

\vskip .25cm
\gr `Input' embedding $\mathbf{v}_{w_t}$ is usually the main object of interest.
\end{frame}
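
\begin{frame}

{\bf Worked example: a length-3 Huffman path}

\vskip .5cm
{\gr A sketch under an assumed path length; the value $L(w)=3$ is illustrative only.}

\vskip .25cm
If $L(w)=3$, the path for $w$ is the root $\eta(w,1)$, an internal node $\eta(w,2)$, and the leaf $\eta(w,3)$, so the product has two factors:
\begin{align*}
\mathrm{p}_{\mathcal{V}}(w | w_t) =~
 &\sigma\!\left( \mathrm{ch}\left[\eta(w,2)\right] \mathbf{u}_{\eta(w,1)}^\top \mathbf{v}_{w_t} \right) \\
 \times~&\sigma\!\left( \mathrm{ch}\left[\eta(w,3)\right] \mathbf{u}_{\eta(w,2)}^\top \mathbf{v}_{w_t} \right)
\end{align*}

\vskip .25cm
With logistic $\sigma$ we have $\sigma(x)+\sigma(-x)=1$, so the two children of each node split its probability and the leaf probabilities sum to one without a normalizing sum over the vocabulary.
\end{frame}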


\begin{frame}

{\bf Example Huffman encoding of a 4 word vocabulary}

\begin{center}
\includegraphics[width=.9\textwidth]{graphs/bht}
\end{center}

Moving from left to right, the two nodes with the lowest counts are
combined into a parent at each step.  Encodings are read off the splits
 from right to left.

\end{frame}

\begin{frame}

{\bf From word embeddings to document modeling}

\vskip .5cm
Distributed representations have proven very useful for NLP tasks: {\gr next word prediction, word analogy, named entity recognition, ...}

\vskip .25cm
There is interest in porting this success to document modeling:
{\gr author classification, sentiment prediction, attribute imputation, ...}

\vskip .35cm
Strategies include directly modeling the semantic composition of contexts {\gr (Socher et al. 2011)} or adding latent document-location effects into the context model {\gr (as in Le \texttt{+} Mikolov's Doc2Vec)}.

\vskip .5cm
{\theme This paper:} composite likelihoods and Bayes rule provide a very simple way to turn local language models into document classifiers.
\end{frame}


\begin{frame}

{ \bf Composite likelihood}
\vskip .5cm

The local-context objectives don't correspond to a full document model, but they can be combined to form a composite likelihood.

\vskip .25cm
e.g., skip-gram's pairwise-conditional composition for sentence $\bm{w}$
\begin{equation*}\label{eq:sentencelhd} \log\mathrm{p}_{ \mathcal{V}}(\mathbf{w}) =
\sum_{j=1}^T\sum_{k=1}^T \mathds{1}_{\left[1\leq |k-j| \leq b\right]} \log\mathrm{p}_{ \mathcal{V}}(w_{k}|
w_{j} ). \end{equation*}

{\theme Composite LHDs approximate a full joint LHD}. They have been common in statistics since Besag's pseudolikelihood, $\mathrm{p}(\mathbf{w}) \approx \prod_j \mathrm{p}(w_j |\mathbf{w}_{-j})$.

\vskip .25cm
{\gr Another e.g.: Jernite et al. (2015) show that CBOW Word2Vec corresponds to the pseudolikelihood for a Markov random field.  }

\end{frame}
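
\begin{frame}

{\bf Worked example: a three-word sentence}

\vskip .5cm
{\gr A sketch with assumed values $T=3$ and $b=1$, chosen only to keep the sum short.}

\vskip .25cm
The indicator keeps the pairs $(j,k)$ with $|k-j|=1$, i.e., $(1,2)$, $(2,1)$, $(2,3)$ and $(3,2)$:
\begin{align*}
\log\mathrm{p}_{\mathcal{V}}(\mathbf{w}) =~& \log\mathrm{p}_{\mathcal{V}}(w_2 | w_1) + \log\mathrm{p}_{\mathcal{V}}(w_1 | w_2) \\
 +~& \log\mathrm{p}_{\mathcal{V}}(w_3 | w_2) + \log\mathrm{p}_{\mathcal{V}}(w_2 | w_3)
\end{align*}

\vskip .25cm
The middle word conditions two terms and each end word only one, which is the truncation at sentence ends.
\end{frame}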

\begin{frame}

{ \bf Bayesian inversion}
\vskip .5cm

Given sentence LHDs, document $d =
\{\mathbf{w}_1, \ldots, \mathbf{w}_S\}$ has  log LHD
\begin{equation*}\label{eq:fulllhd} \log\mathrm{p}_{ \mathcal{V}}(d) =
\sum_{s}  \log\mathrm{p}_{ \mathcal{V}}(\mathbf{w}_s). \end{equation*}

\vskip .2cm
{Suppose your documents are grouped by class label, $y \in
\{1, \dots, C\}$.}

\vskip .1cm
{\theme Then you train separate $\mathcal{V}_c$ on each sub-corpus $D_c = \{ d_i : y_i =c \}$.}

\vskip .1cm
$\Rightarrow$ doc $d$ has probability
$\mathrm{p}_{ \mathcal{V}_c}(d)$ if it came from class $c$, and
\begin{equation*}\label{eq:bayesrule}
\mathrm{p}( y | d) = \frac{\mathrm{p}_{ \mathcal{V}_y}(d)\pi_y }
{\sum_c \mathrm{p}_{ \mathcal{V}_c}(d)\pi_c }
\end{equation*}
where $\pi_c$ is our prior probability on class label $c$ {\gr (say $\pi_c = 1/C$)}.


\end{frame}
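
\begin{frame}

{\bf Worked example: inversion on the log scale}

\vskip .5cm
{\gr A sketch; the shorthand $\ell_c$ and $m$ is introduced here for illustration.}

\vskip .25cm
Document likelihoods are tiny, so it is safer to work with $\ell_c = \log\mathrm{p}_{\mathcal{V}_c}(d) + \log\pi_c$ and $m = \max_c \ell_c$:
\begin{equation*}
\mathrm{p}(y | d) = \frac{\exp(\ell_y - m)}{\sum_c \exp(\ell_c - m)}
\end{equation*}
This equals the Bayes rule expression, since the factor $e^{-m}$ cancels between numerator and denominator.

\vskip .25cm
With two classes $y\in\{0,1\}$ and $\pi_0=\pi_1$, it reduces to
\begin{equation*}
\mathrm{p}(y=1 | d) = \frac{1}{1+\exp\left\{-\left[\log\mathrm{p}_{\mathcal{V}_1}(d) - \log\mathrm{p}_{\mathcal{V}_0}(d)\right]\right\}}
\end{equation*}
\end{frame}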


\begin{frame}

{\bf Yelp reviews example}

\vskip .25cm
200k reviews, 2mil sentences, separate W2V for each of 1-5 stars.

\vskip .25cm
Given W2V representations $\mathcal{V}_1 \ldots \mathcal{V}_5$, calculate sentiment probs as, e.g.,
$\mathrm{p}(\star \geq 3|d) = \left[\mathrm{p}_{ \mathcal{V}_3}(d) +
\mathrm{p}_{ \mathcal{V}_4}(d) +
\mathrm{p}_{ \mathcal{V}_5}(d)
\right]/
{\sum_{c=1}^5 \mathrm{p}_{ \mathcal{V}_c}(d) }$
\begin{center}
\includegraphics[width=0.6\textwidth]{graphs/posneg}
\end{center}

% {\footnotesize
% $\star \geq 3$: very yummy food, friendly staff, quick service, and good coffee.

% $\star < 3$: went for lunch. the place smelled bad, like cig smoke. we ordered but  were told that they didnt have food.}

Everything is implemented in the \texttt{gensim} library for Python, which now includes the \texttt{score} method to obtain $\log \mathrm{p}_{ \mathcal{V}}(d)$ for fitted $\mathcal{V}$.

\end{frame}

\begin{frame}


{\bf OOS classification performance}

\begin{center}
{\footnotesize
 \begin{tabular}{r|c|c|c}
{\it misclass rate} & $<,\geq 3 ~\star$ & $<,=,>3$ $\star$ &  $1 \dots 5 ~\star$
\\ \cline{1-4}\rule{0pt}{3ex}
W2V inversion & .099 & \textbf{.189} & .435 \\
Phrase regression & \textbf{.084} & .200 & \textbf{.410} \\
%D2V DBOW &  .144 &.282 & .496 \\
%D2V DM & .179 & .306 & .549 \\
D2V combined & .148 & .284 & .500 \\
MNIR & .095 & .254 & .480 \\
W2V aggregation & .118 & .248 & .461
\end{tabular}}
\end{center}

\begin{center}
\includegraphics[width=1\textwidth]{graphs/bystarshort}
\end{center}

Best or close to it, with prob(\textit{positive}) nicely ordered by true star.
\end{frame}


\begin{frame}
{Messaging and Negotiation on eBay.com}

eBay has a `{\theme best offer}' button; a buyer can use this to circumvent the auction or fixed price sale, and make a direct offer to the seller.

\vskip .5cm
We track the communication.

\vskip .5cm
After controlling for parameters of the transaction {\gr (item, buyer offer, seller/buyer info, ...)} {\theme what language leads to a higher probability of a seller responding with a deal or counteroffer?}

\vskip .5cm
We can use the results to educate buyers, give templates/examples, or generate hypotheses on bargaining behavior.

\end{frame}


\begin{frame}

First, fit separate word2vec representations to those messages with $y_i=1$ (seller response) and with $y_i=0$ (no seller response).

\vskip .2cm
Sentences most likely in the $y=1$ representation all showed evidence of previous deals and {\nv bundling}, while those most likely in the $y=0$ representation are  driving a {\theme hard bargain}:

\vskip .5cm
{\footnotesize \color{black!75} \it \setstretch{1}
know this low but its based on what individual [product] selling

\vskip .2cm
give you two cash you ship it free its very common low end [product] has broken pin good pin youd be lucky get this real its worth melt price dont believe me its really not worth listing fees that you paid

\vskip .2cm
hi my offer basically what price guide lists individual [product] it may seem too low you but it never hurts place offer regards [name]

\vskip .2cm
do you know that clubhouse sigs fakes worthless ridiculous fake sigs thats why other ball sold they were real sigs real sigs lot better its like gehrigs wife signing his name not worth anything

\vskip -.5cm
}

\end{frame}


\begin{frame}

To isolate the targeted effect, we first remove what was predictable from the 0/1 response $y$: `did the seller respond to the buyer?'

\vskip .5cm
Fit $p_i = \mr{p}(y_i=1 | \bm{x}_i)$ as an {\it Empirical Bayesian Forest} ($\approx$ an RF).

{\gr $\bm{x}$: buyer, seller, item attributes (even includes sale title topics). }

\vskip .5cm
\includegraphics[width=\textwidth]{graphs/botrunk}

\vskip .5cm
We can split the messages {\theme on the residuals} $r_i = y_i-\hat p_i$ -- high $r_i$ versus low $r_i$ -- and fit a word2vec representation to each.

\end{frame}
\begin{frame}

Now, with residual $r_i$ as the variable separating our samples...

\vskip .2cm
The high $r_i$ representation still places high probability on language indicating previous deals and bundling.

\vskip .2cm
But the low $r_i$ representation now puts high probability on {\theme pleading}:

\vskip .5cm
{\footnotesize \color{black!75} \it \setstretch{1}
know that my offer not close what you asking cant afford pay much more than my offer but really want this pin please let me know you can make counteroffer this offer not acceptabl

\vskip .2cm
you dont like offer feel free tell me what lowest youll go on these [product] like you ive been collect since was im now hope we can reach deal wich fair both us by way they some awsome [product] !!

\vskip .2cm
hello my friend hope my offer good enough really want  [product] ... cgc going be there they grading comics also do you know how much they charge grade comic thkas greg

\vskip .2cm
last one sold on dec 26th ,, can match that price im also interested [product] was wondering how much pair ?? can have money ur account tonight we can reach deal thanks your time ,,

\vskip -.5cm
}

\end{frame}

\begin{frame}

{\bf W2V Inversion is simple, scalable, and it works}

\vskip .5cm
Not claiming it's a world beater, but it is an easy way to go from local context representation algorithms to document classification.

\vskip .5cm
If you think carefully about what attributes (or even estimated quantities, or residuals) are of interest, W2V inversion is a great tool for understanding language discrepancies.

\vskip 1cm
\hfill{\bf \LARGE \theme THANKS!}

\end{frame}

\end{document}