index.xml · 5748 lines (5305 loc) · 577 KB
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<title>Kate Lyons</title>
<link>/index.xml</link>
<description>Recent content on Kate Lyons</description>
<generator>Hugo -- gohugo.io</generator>
<language>en-us</language>
<copyright>&copy;2017 Kate Lyons</copyright>
<lastBuildDate>Sun, 12 Mar 2017 21:13:14 -0500</lastBuildDate>
<atom:link href="/index.xml" rel="self" type="application/rss+xml" />
<item>
<title>Emoji Dictionary</title>
<link>/portfolio/2017-10-04-emoji-dictionary/</link>
<pubDate>Sun, 12 Mar 2017 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-10-04-emoji-dictionary/</guid>
<description><p>If you are working with social media data, it is very likely you’ll run into emojis. Because of their encoding, however, they can be tricky to deal with. Fortunately, <a href="http://opiateforthemass.es/articles/emoticons-in-R/">Jessica Peterka-Bonetta’s work</a> introduced the idea of an emoji dictionary which has the prose name of an emoji matched up to its R encoding and unicode codepoint. This list, however, does not include the newest batch of emojis, Unicode Version 9.0, nor the different skin color options for human-based emojis. Good news though – I made <a href="https://github.com/lyons7/emojidictionary">my own emoji dictionary</a> that has all 2,204 of them! I also have included the number of each emoji as listed in the <a href="http://unicode.org/emoji/charts/emoji-list.html">Unicode Emoji List v. 5.0</a>.</p>
<p><strong>If you use this emoji dictionary for your own research, please make sure to acknowledge both myself and Jessica.</strong></p>
<p>This dictionary is available as a CSV file on my <a href="https://github.com/lyons7/emojidictionary">github page</a>. The prose emoji name in the CSV file conveniently has spaces on each side (e.g. &quot; FACEWITHTEARSOFJOY &quot;), so if emojis sit right next to other words they won’t be smushed together. Emoji names themselves contain no internal spaces, even when the name is longer than one word. I did this to make text analyses such as sentiment analysis and topic modeling possible without endangering the integrity of the emoji classification. (We don’t want stop words that are part of emoji names to be deleted!)</p>
<p>Here is how to use this dictionary for emoji identification in R. There are a few formatting steps and a tricky find-and-replace procedure that requires another R package, but once you have the dictionary loaded and the text in the right format you will be ready to go!</p>
<div id="processing-the-data" class="section level2">
<h2>Processing the data</h2>
<p>Load in the CSV files (your corpus and the emoji dictionary). You want to make sure they are located in the correct working directory so R can find them when you tell it to read them in.</p>
<pre class="r"><code>tweets &lt;- read.csv(&quot;Col_Sep_INSTACORPUS.csv&quot;, header = T)
emoticons &lt;- read.csv(&quot;Decoded Emojis Col Sep.csv&quot;, header = T)</code></pre>
<p>To transform the emojis, you first need to transform your tweet data into ASCII:</p>
<pre class="r"><code>tweets$text &lt;- iconv(tweets$text, from = &quot;latin1&quot;, to = &quot;ascii&quot;,
sub = &quot;byte&quot;)</code></pre>
</div>
<div id="processing-the-data-1" class="section level2">
<p>To ‘count’ the emojis you do a find and replace using the CSV file of ‘Decoded Emojis’ as a reference. Here I am using the <a href="http://www.inside-r.org/packages/cran/DataCombine/docs/FindReplace">DataCombine package</a>, which identifies emojis in posts and then replaces them with a prose version. I used whatever description pops up when hovering one’s cursor over an emoji on an Apple emoji keyboard. Even if this name is not identical across platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post. You can also cross-check the name listed in the dictionary against the number of the emoji entry in the <a href="http://unicode.org/emoji/charts/full-emoji-list.html#1f918">Unicode Emoji List</a>.</p>
<pre class="r"><code>library(DataCombine)
tweets &lt;- FindReplace(data = tweets, Var = &quot;text&quot;,
replaceData = emoticons,
from = &quot;R_Encoding&quot;, to = &quot;Name&quot;,
exact = FALSE)</code></pre>
<p>You now have a data frame with emojis in prose form. You can do fun things like <a href="https://lyons7.github.io/portfolio/2017-03-10-emoji-maps/">make maps with emojis</a> (if you have geotag information) or note which are the most frequent emojis and <a href="https://github.com/dill/emoGG">plot them</a> etc. etc. – there are possibilities galore! Have fun 😄</p>
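<p>For instance, tallying the most frequent decoded emojis can be sketched like this (assuming <code>emoji_names</code> is a hypothetical character vector holding the prose names from the dictionary):</p>
<pre class="r"><code># Count how many times each decoded emoji name appears in the corpus
counts &lt;- sapply(emoji_names, function(e)
  sum(grepl(e, tweets$text, fixed = TRUE)))
# The most common emojis
head(sort(counts, decreasing = TRUE))</code></pre>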
</div>
</description>
</item>
<item>
<title>Collating Spatial Data </title>
<link>/portfolio/2017-03-14-collate/</link>
<pubDate>Sun, 12 Mar 2017 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-14-collate/</guid>
<description><!-- BLOGDOWN-HEAD -->
<!-- /BLOGDOWN-HEAD -->
<!-- BLOGDOWN-BODY-BEFORE -->
<!-- /BLOGDOWN-BODY-BEFORE -->
<div id="the-problem" class="section level2">
<h2>The Problem</h2>
<p>I have one data set of around 1,000 observations of public signs with latitude and longitude coordinates and another data set of around 15,000 tweets that have gone through topic modeling and have topics assigned to them. What I would like to do is link up tweets that have happened close by to the signs I’ve recorded and see the most common topic that is present ‘around’ that sign. This has two issues: 1) the coordinates of public signs and tweets are <em>not</em> going to be exactly the same and 2) there are way more tweets than public signs in the area I’m looking at, so I can’t just merge these two together. I have to figure out a way to look at all the tweets that have occurred near a sign and <em>then</em> find the topic that has most frequently occurred in those tweets associated with that location.</p>
</div>
<div id="the-solution" class="section level2">
<h2>The Solution</h2>
<p>Luckily, there is a way to do this with packages like <a href="https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf">fuzzyjoin</a> and <a href="https://cran.r-project.org/web/packages/dplyr/dplyr.pdf">dplyr</a>!</p>
<pre class="r"><code>packs = c(&quot;stringr&quot;,&quot;ggplot2&quot;,&quot;devtools&quot;,&quot;DataCombine&quot;,&quot;ggmap&quot;,
&quot;topicmodels&quot;,&quot;slam&quot;,&quot;Rmpfr&quot;,&quot;tm&quot;,&quot;stringr&quot;,&quot;wordcloud&quot;,&quot;plyr&quot;,
&quot;tidytext&quot;,&quot;dplyr&quot;,&quot;tidyr&quot;,&quot;xlsx&quot;)
lapply(packs, library, character.only=T)</code></pre>
</div>
<div id="matching-tweets-with-physical-sign-data" class="section level2">
<h2>Matching tweets with physical sign data</h2>
<p>What we are trying to do is to match up locations recorded in the physical realm with the digital. Because we do not have <em>exact</em> matches, we will use the awesome <a href="https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf">fuzzyjoin package</a>.</p>
<pre class="r"><code>library(fuzzyjoin)
library(dplyr)
pairsdf &lt;- ll %&gt;%
geo_inner_join(tweets, unit=&#39;km&#39;,distance_col=&quot;distance&quot;) %&gt;%
filter(distance &lt;= 0.018288)</code></pre>
<pre><code>## Joining by: c(&quot;latitude&quot;, &quot;longitude&quot;)</code></pre>
<pre class="r"><code># I have to use filter here because &#39;max_distance&#39; is not geared to be less than 1 km or 1 mi
# If you are a weirdo like me looking at things much smaller than a mile or kilometer, you have
# to filter afterwards...</code></pre>
<p>Voila! I have a data frame with a row of each time a post has occurred in a 60 foot vicinity of an LL object. This might be a little big, but this ensures we get more tweets associated with signs. If you’d like to look at smaller radii, just put in whatever fraction of a kilometer in the ‘distance’ parameter that you are interested in.</p>
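<p>For reference, the 0.018288 in the filter above is just 60 feet expressed in kilometers:</p>
<pre class="r"><code># Convert feet to kilometers (1 ft = 0.0003048 km)
feet_to_km &lt;- function(ft) ft * 0.0003048
feet_to_km(60)  # 0.018288</code></pre>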
<p>Now what I would like to do is figure out the most common topic that is associated with a particular sign. We’ll use the idea of ‘mode’ here with our topics and the <strong>group_by()</strong> function from dplyr as suggested <a href="http://stackoverflow.com/questions/25198442/how-to-calculate-mean-median-per-group-in-a-dataframe-in-r">here</a>.</p>
<p>As R does not have a built in function for mode, we build one. Code for this available <a href="https://www.tutorialspoint.com/r/r_mean_median_mode.htm">here</a>.</p>
<pre class="r"><code># To get the mode
getmode &lt;- function(v) {
uniqv &lt;- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
# Tell R your topic categories are a number so it can deal with them
pairsdf$V1&lt;- as.numeric(pairsdf$V1)
# Now calculate things about the topics per sign
topicmode &lt;- pairsdf%&gt;%
group_by(SIGN_ID)%&gt;%
summarise(Mode = getmode(V1))</code></pre>
<p>Let’s now combine this with our other data, but just include those instances that have a topic assigned (not all signs got a corresponding tweet).</p>
<pre class="r"><code>topicsigns &lt;- inner_join(ll, topicmode, by = &quot;SIGN_ID&quot;)</code></pre>
<p>There you have it! You now have a data frame which includes a record of public signs that have tweets that have occurred in their vicinity and the most common topic associated with those tweets.</p>
</div>
</description>
</item>
<item>
<title>Maps! Maps! Maps!</title>
<link>/portfolio/2017-03-14-maps/</link>
<pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-14-maps/</guid>
<description><p>The <a href="https://cran.r-project.org/web/packages/ggmap/ggmap.pdf">ggmap</a> package is awesome. It enables you to get a map from Google maps (in various forms too! Just check out the package documentation) and <em>then</em> you can plot stuff on top of the maps, which is really useful particularly if you are dealing with spatial data.</p>
<pre class="r"><code># Load in the package
library(ggmap)
# Get a map!
# For fun, I&#39;ll do my home town. ggmap does surprisingly well with
# limited search terms, so you do not have to worry about being super
# specific or explicit.
map &lt;- get_map(location = &#39;Soda Bay,
California&#39;, zoom = 11)
# The &#39;base&#39; of this is ggmap(map). This will just &#39;print&#39; your map.
ggmap(map)</code></pre>
<p><img src="/portfolio/2017-03-14-maps_files/figure-html/unnamed-chunk-1-1.png" width="672" /></p>
<p>Now you can use <a href="https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf">ggplot2</a> to plot stuff <em>on top</em> of the map you’ve just generated. This is useful for all kinds of spatial data, like geotagged social media posts or census data. Here’s an example with census data. (If you want to see how I got this data with R, check out this post <a href="https://lyons7.github.io/portfolio/2017-03-12-census-data/">here</a> or this tutorial <a href="http://mazamascience.com/WorkingWithData/?p=1494">here</a> or <a href="http://zevross.com/blog/2015/10/14/manipulating-and-mapping-us-census-data-in-r-using-the-acs-tigris-and-leaflet-packages-3/#census-data-the-easyer-way">here</a> or <a href="http://dlab.berkeley.edu/blog/season-sharing-data-working-newly-released-census-2010-2014-acs-5-year-data-r">here</a>!)</p>
<pre class="r"><code>ggmap(map) +
geom_polygon(data = homevalue_points, aes(x = long, y = lat, group = group, fill = Median_Value), alpha=0.75) +
scale_fill_distiller(palette = &quot;Blues&quot;) +
guides(fill = guide_legend(reverse = TRUE)) +
theme_nothing(legend=TRUE) +
coord_map() +
labs(title = &quot;2013 Median Value for Owner-Occupied Housing Units&quot;,
fill = &quot;Value (Dollars)&quot;)</code></pre>
<p><img src="/portfolio/2017-03-14-maps_files/figure-html/unnamed-chunk-3-1.png" width="672" /></p>
<p>In essence, think of the ggmap as a base which you can build on. As long as you have coordinates, you are able to plot things on top of a ggmap, even <a href="https://lyons7.github.io/portfolio/2017-03-10-emoji-maps/">series of data</a>!</p>
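<p>As a minimal sketch of that pattern (the coordinates here are made up purely for illustration):</p>
<pre class="r"><code># Any data frame with longitude/latitude columns can be layered on the base map
pts &lt;- data.frame(lon = c(-122.78, -122.75), lat = c(38.99, 39.01))
ggmap(map) +
  geom_point(data = pts, aes(x = lon, y = lat), colour = &quot;red&quot;, size = 3)</code></pre>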
</description>
</item>
<item>
<title>Having Fun with Tidy Text!</title>
<link>/portfolio/2017-03-09-fun-w-tidy-text/</link>
<pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-09-fun-w-tidy-text/</guid>
<description><p>Julia Silge and David Robinson have a wonderful new book called “Text Mining with R” which has a <a href="http://tidytextmining.com/">companion website</a> with great explanations and examples. Here are some additional applications of those examples on a corpus of geotagged Instagram posts from the Mission District neighborhood in San Francisco.</p>
<pre class="r"><code># Make sure you have the right packages!
packs = c(&quot;twitteR&quot;,&quot;RCurl&quot;,&quot;RJSONIO&quot;,&quot;stringr&quot;,&quot;ggplot2&quot;,&quot;devtools&quot;,&quot;DataCombine&quot;,&quot;ggmap&quot;,
&quot;topicmodels&quot;,&quot;slam&quot;,&quot;Rmpfr&quot;,&quot;tm&quot;,&quot;stringr&quot;,&quot;wordcloud&quot;,&quot;plyr&quot;,
&quot;tidytext&quot;,&quot;dplyr&quot;,&quot;tidyr&quot;,&quot;xlsx&quot;,&quot;scales&quot;,&quot;ggrepel&quot;,&quot;lubridate&quot;,&quot;purrr&quot;,&quot;broom&quot;)
lapply(packs, library, character.only=T)</code></pre>
<p>The data need to be processed a bit more in order to analyze them. Let’s try from the start with <a href="http://tidytextmining.com/">Silge and Robinson</a>.</p>
<pre class="r"><code># Get rid of stuff particular to the data (here encodings of links and such)
# Most of these are characters I don&#39;t have encodings for (other scripts, etc.)
tweets$text = gsub(&quot;Just posted a photo&quot;,&quot;&quot;, tweets$text)
tweets$text = gsub( &quot;&lt;.*?&gt;&quot;, &quot;&quot;, tweets$text)
# Get rid of super frequent spam posters
tweets &lt;- tweets[! tweets$screenName %in% c(&quot;4AMSOUNDS&quot;,
&quot;BruciusTattoo&quot;,&quot;LionsHeartSF&quot;,&quot;hermesalchemist&quot;,&quot;Mrsourmash&quot;),]
# Now for Silge and Robinson&#39;s code. What this is doing is getting rid of URLs, re-tweets (RT) and ampersands. This also gets rid of stop words without having to get rid of hashtags and @ signs by using str_detect and filter!
reg &lt;- &quot;([^A-Za-z_\\d#@&#39;]|&#39;(?![A-Za-z_\\d#@]))&quot;
tidy_tweets &lt;- tweets %&gt;%
filter(!str_detect(text, &quot;^RT&quot;)) %&gt;%
mutate(text = str_replace_all(text, &quot;https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;amp;|&amp;lt;|&amp;gt;|RT|https&quot;, &quot;&quot;)) %&gt;%
unnest_tokens(word, text, token = &quot;regex&quot;, pattern = reg) %&gt;%
filter(!word %in% stop_words$word,
str_detect(word, &quot;[a-z]&quot;))</code></pre>
<p>Awesome! Now our posts are cleaned with the hashtags and @ mentions still intact. What we can try now is to plot the frequency of some of these terms according to WHERE they occur. Silge and Robinson have an example with persons, let’s try with coordinates.</p>
<pre class="r"><code>freq &lt;- tidy_tweets %&gt;%
group_by(latitude,longitude) %&gt;%
count(word, sort = TRUE) %&gt;%
left_join(tidy_tweets %&gt;%
group_by(latitude,longitude) %&gt;%
summarise(total = n())) %&gt;%
mutate(freq = n/total)
freq</code></pre>
<pre><code>## Source: local data frame [61,008 x 6]
## Groups: latitude, longitude [1,411]
##
## latitude longitude word n total freq
## &lt;dbl&gt; &lt;dbl&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;dbl&gt;
## 1 37.76000 -122.4200 mission 1844 21372 0.08628112
## 2 37.76000 -122.4200 san 1592 21372 0.07448999
## 3 37.76000 -122.4200 district 1576 21372 0.07374134
## 4 37.76000 -122.4200 francisco 1464 21372 0.06850084
## 5 37.75833 -122.4275 park 293 2745 0.10673953
## 6 37.75833 -122.4275 mission 285 2745 0.10382514
## 7 37.75833 -122.4275 dolores 278 2745 0.10127505
## 8 37.76000 -122.4200 #sanfrancisco 275 21372 0.01286730
## 9 37.76300 -122.4209 alley 245 2273 0.10778707
## 10 37.76300 -122.4209 clarion 242 2273 0.10646722
## # ... with 60,998 more rows</code></pre>
<p>The n here is the total number of times this term has shown up, and the total is how many terms there are present in a particular coordinate.</p>
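<p>If you just want the single most frequent word at each coordinate, one way (a sketch using the same dplyr verbs as above) is:</p>
<pre class="r"><code># freq is already grouped by latitude and longitude from count()
top_word &lt;- freq %&gt;%
  group_by(latitude, longitude) %&gt;%
  top_n(1, freq)</code></pre>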
<p>Cool! Now we have a representation of terms, their frequency and their position. Now I might want to plot this somehow… one way would be to try to plot the most frequent terms (n &gt; 50) (Some help on how to do this was taken from <a href="http://blog.revolutionanalytics.com/2016/01/avoid-overlapping-labels-in-ggplot2-charts.html">here</a> and <a href="http://stackoverflow.com/questions/14288001/geom-text-not-working-when-ggmap-and-geom-point-used">here</a>).</p>
<pre class="r"><code>freq2 &lt;- subset(freq, n &gt; 50)
map &lt;- get_map(location = &#39;Valencia St. and 20th, San Francisco,
California&#39;, zoom = 15)
freq2$longitude&lt;-as.numeric(freq2$longitude)
freq2$latitude&lt;-as.numeric(freq2$latitude)
lon &lt;- freq2$longitude
lat &lt;- freq2$latitude
mapPoints &lt;- ggmap(map) +
  geom_jitter(data = freq2, aes(x = lon, y = lat), alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_label_repel(data = freq2, aes(x = lon, y = lat, label = word), size = 2)</code></pre>
<p><img src="/portfolio/2017-03-09-fun-w-tidy-text_files/figure-html/unnamed-chunk-6-1.png" width="672" /></p>
<p>Let’s zoom into that main central area to see what’s going on!</p>
<pre class="r"><code>map2 &lt;- get_map(location = &#39;Lexington St. and 19th, San Francisco,
California&#39;, zoom = 16)
mapPoints2 &lt;- ggmap(map2) +
  geom_jitter(data = freq2, aes(x = lon, y = lat), alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
  geom_label_repel(data = freq2, aes(x = lon, y = lat, label = word), size = 2)</code></pre>
<p><img src="/portfolio/2017-03-09-fun-w-tidy-text_files/figure-html/unnamed-chunk-8-1.png" width="672" /></p>
<p>This can be manipulated in many different ways – either by playing with what frequency of terms you want to look at (maybe I want to see terms that occur 100 times, between 20 and 50 times, less than 20 times, etc.) OR by playing around with the map. At the moment though, this is pretty illuminating in the sense that it shows us that the most frequent terms are focused around certain ‘hotspots’ in the area, which in itself is just kind of cool to see.</p>
<p>Now let’s try out word frequency changes over time: what words were used more or less over the time of data collection? (Help from <a href="http://tidytextmining.com/twitter.html">here</a>) (Also used the <a href="https://cran.r-project.org/web/packages/lubridate/lubridate.pdf">lubridate package</a> to help with time.)</p>
<pre class="r"><code># Might have to do this first
tidy_tweets$created2 &lt;- as.POSIXct(tidy_tweets$created, format=&quot;%m/%d/%Y %H:%M&quot;)
words_by_time &lt;- tidy_tweets %&gt;%
mutate(time_floor = floor_date(created2, unit = &quot;1 week&quot;)) %&gt;%
count(time_floor, word) %&gt;%
ungroup() %&gt;%
group_by(time_floor) %&gt;%
mutate(time_total = sum(n)) %&gt;%
group_by(word) %&gt;%
mutate(word_total = sum(n)) %&gt;%
ungroup() %&gt;%
rename(count = n) %&gt;%
filter(word_total &gt; 100)
words_by_time</code></pre>
<pre><code>## # A tibble: 1,979 × 5
## time_floor word count time_total word_total
## &lt;dttm&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt; &lt;int&gt;
## 1 0016-07-31 #art 7 2729 120
## 2 0016-07-31 #california 9 2729 149
## 3 0016-07-31 #dolorespark 6 2729 168
## 4 0016-07-31 #mission 5 2729 222
## 5 0016-07-31 #missiondistrict 1 2729 158
## 6 0016-07-31 #sanfrancisco 38 2729 1034
## 7 0016-07-31 #sf 23 2729 603
## 8 0016-07-31 #streetart 12 2729 229
## 9 0016-07-31 24th 10 2729 109
## 10 0016-07-31 alamo 3 2729 111
## # ... with 1,969 more rows</code></pre>
<p>Alright, now we want to figure out those words that have changed the most in their frequency over time so as to isolate ones of interest to plot. This involves a few steps though.</p>
<pre class="r"><code>nested_data &lt;- words_by_time %&gt;%
nest(-word)
nested_data</code></pre>
<pre><code>## # A tibble: 71 × 2
## word data
## &lt;chr&gt; &lt;list&gt;
## 1 #art &lt;tibble [26 × 4]&gt;
## 2 #california &lt;tibble [28 × 4]&gt;
## 3 #dolorespark &lt;tibble [27 × 4]&gt;
## 4 #mission &lt;tibble [28 × 4]&gt;
## 5 #missiondistrict &lt;tibble [29 × 4]&gt;
## 6 #sanfrancisco &lt;tibble [29 × 4]&gt;
## 7 #sf &lt;tibble [29 × 4]&gt;
## 8 #streetart &lt;tibble [28 × 4]&gt;
## 9 24th &lt;tibble [28 × 4]&gt;
## 10 alamo &lt;tibble [27 × 4]&gt;
## # ... with 61 more rows</code></pre>
<p>Process as described by Silge and Robinson: “This data frame has one row for each person-word combination; the data column is a list column that contains data frames, one for each combination of person and word. Let’s use map() from the purrr library to apply our modeling procedure to each of those little data frames inside our big data frame. This is count data so let’s use glm() with family = &quot;binomial&quot; for modeling. We can think about this modeling procedure answering a question like, ‘Was a given word mentioned in a given time bin? Yes or no? How does the count of word mentions depend on time?’”</p>
<pre class="r"><code>nested_models &lt;- nested_data %&gt;%
mutate(models = map(data, ~ glm(cbind(count, time_total) ~ time_floor, .,
family = &quot;binomial&quot;)))
nested_models</code></pre>
<pre><code>## # A tibble: 71 × 3
## word data models
## &lt;chr&gt; &lt;list&gt; &lt;list&gt;
## 1 #art &lt;tibble [26 × 4]&gt; &lt;S3: glm&gt;
## 2 #california &lt;tibble [28 × 4]&gt; &lt;S3: glm&gt;
## 3 #dolorespark &lt;tibble [27 × 4]&gt; &lt;S3: glm&gt;
## 4 #mission &lt;tibble [28 × 4]&gt; &lt;S3: glm&gt;
## 5 #missiondistrict &lt;tibble [29 × 4]&gt; &lt;S3: glm&gt;
## 6 #sanfrancisco &lt;tibble [29 × 4]&gt; &lt;S3: glm&gt;
## 7 #sf &lt;tibble [29 × 4]&gt; &lt;S3: glm&gt;
## 8 #streetart &lt;tibble [28 × 4]&gt; &lt;S3: glm&gt;
## 9 24th &lt;tibble [28 × 4]&gt; &lt;S3: glm&gt;
## 10 alamo &lt;tibble [27 × 4]&gt; &lt;S3: glm&gt;
## # ... with 61 more rows</code></pre>
<p>Silge and Robinson: “Now notice that we have a new column for the modeling results; it is another list column and contains glm objects. The next step is to use map() and tidy() from the broom package to pull out the slopes for each of these models and find the important ones. We are comparing many slopes here and some of them are not statistically significant, so let’s apply an adjustment to the p-values for multiple comparisons.”</p>
<pre class="r"><code>slopes &lt;- nested_models %&gt;%
unnest(map(models, tidy)) %&gt;%
filter(term == &quot;time_floor&quot;) %&gt;%
mutate(adjusted.p.value = p.adjust(p.value))</code></pre>
<p>“Now let’s find the most important slopes. Which words have changed in frequency at a moderately significant level in our tweets?”</p>
<pre class="r"><code>top_slopes &lt;- slopes %&gt;%
filter(adjusted.p.value &lt; 0.1)
top_slopes</code></pre>
<pre><code>## # A tibble: 16 × 7
## word term estimate std.error statistic
## &lt;chr&gt; &lt;chr&gt; &lt;dbl&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 #dolorespark time_floor -1.239208e-07 1.877375e-08 -6.600749
## 2 #sanfrancisco time_floor -3.211771e-08 6.679608e-09 -4.808322
## 3 #sf time_floor -3.483812e-08 8.745735e-09 -3.983441
## 4 alley time_floor -7.976653e-08 1.277341e-08 -6.244732
## 5 clarion time_floor -9.680816e-08 1.449039e-08 -6.680855
## 6 district time_floor 6.701575e-08 5.346673e-09 12.534102
## 7 dolores time_floor -1.132466e-07 7.718189e-09 -14.672698
## 8 francisco time_floor 4.515439e-08 4.642235e-09 9.726865
## 9 manufactory time_floor -7.442008e-08 1.672651e-08 -4.449228
## 10 mission time_floor 3.625027e-08 3.999962e-09 9.062656
## 11 park time_floor -1.134007e-07 7.699185e-09 -14.728916
## 12 san time_floor 4.009249e-08 4.369834e-09 9.174832
## 13 sf time_floor -9.883069e-08 7.359724e-09 -13.428586
## 14 street time_floor -5.396273e-08 1.121572e-08 -4.811349
## 15 tartine time_floor -6.296002e-08 1.209275e-08 -5.206427
## 16 valencia time_floor -8.125194e-08 2.093658e-08 -3.880859
## # ... with 2 more variables: p.value &lt;dbl&gt;, adjusted.p.value &lt;dbl&gt;</code></pre>
<p>Let’s plot them!</p>
<pre class="r"><code>words_by_time %&gt;%
inner_join(top_slopes, by = c(&quot;word&quot;)) %&gt;%
ggplot(aes(time_floor, count/time_total, color = word)) +
geom_line(size = 1.3) +
labs(x = NULL, y = &quot;Word frequency&quot;)</code></pre>
<p><img src="/portfolio/2017-03-09-fun-w-tidy-text_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>After looking at some of these features of our data set, it’s time to explore TOPIC MODELING, or (paraphrasing from David Blei) finding structure in more-or-less unstructured documents. To do this we need a document-term matrix. At the moment, the tweets are a little problematic in that they are broken up by words… whereas we actually would like the text of the tweet back as that is what we are treating as our ‘document’. The question at the moment is… do we want to keep the hashtags / can we?</p>
<pre class="r"><code># Let&#39;s try by taking our tweets that have been tidied already. First we need to count each word though, and create some kind of column that has
# This first one is helpful for seeing encodings that need to be removed
tidy_tweets %&gt;%
count(document, word, sort=TRUE)</code></pre>
<pre><code>## Source: local data frame [102,888 x 3]
## Groups: document [14,958]
##
## document word n
## &lt;int&gt; &lt;chr&gt; &lt;int&gt;
## 1 5932 facewithtearsofjoy 17
## 2 8849 balloon 8
## 3 12204 blackquestionmarkornament 8
## 4 7697 nov 7
## 5 7697 wed 7
## 6 12204 whitequestionmarkornament 7
## 7 2110 sliceofpizza 6
## 8 2452 nailpolish 6
## 9 2741 facewithnogoodgesture 6
## 10 4014 earofrice 6
## # ... with 102,878 more rows</code></pre>
<pre class="r"><code># Counting words so we can make a dtm with our preserved corpus with hashtags and such
tweet_words &lt;- tidy_tweets %&gt;%
count(document, word) %&gt;%
ungroup()
total_words &lt;- tweet_words %&gt;%
group_by(document) %&gt;%
summarize(total = sum(n))
post_words &lt;- left_join(tweet_words, total_words)
post_words</code></pre>
<pre><code>## # A tibble: 102,888 × 4
## document word n total
## &lt;int&gt; &lt;chr&gt; &lt;int&gt; &lt;int&gt;
## 1 1 #nofilter 1 7
## 2 1 #sanfrancisco 1 7
## 3 1 afternoon 1 7
## 4 1 dolores 1 7
## 5 1 park 2 7
## 6 1 sf 1 7
## 7 2 @publicworkssf 1 5
## 8 2 #dustyrhino 1 5
## 9 2 close 1 5
## 10 2 grin 1 5
## # ... with 102,878 more rows</code></pre>
<pre class="r"><code>new_dtm &lt;- post_words %&gt;%
cast_dtm(document, word, n)</code></pre>
<p>This seems to have worked :O :O :O Let’s see how topic modeling works here now…</p>
<p>Visualization in TIDY form also from <a href="http://tidytextmining.com/topicmodeling.html">Silge and Robinson</a>!</p>
<pre class="r"><code>#Set parameters for Gibbs sampling (parameters those used in
#Grun and Hornik 2011)
burnin &lt;- 4000
iter &lt;- 2000
thin &lt;- 500
seed &lt;-list(2003,5,63,100001,765)
nstart &lt;- 5
best &lt;- TRUE
k &lt;- 12
test_lda2 &lt;-LDA(new_dtm,k, method=&quot;Gibbs&quot;,
control=list(nstart=nstart, seed = seed, best=best,
burnin = burnin, iter = iter, thin=thin))
# Make that TIDY!!!
test_lda_td2 &lt;- tidy(test_lda2)
lda_top_terms2 &lt;- test_lda_td2 %&gt;%
group_by(topic) %&gt;%
top_n(10, beta) %&gt;%
ungroup() %&gt;%
arrange(topic, -beta)
lda_top_terms2 %&gt;%
mutate(term = reorder(term, beta)) %&gt;%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_bar(stat = &quot;identity&quot;, show.legend = FALSE) +
facet_wrap(~ topic, scales = &quot;free&quot;) +
coord_flip()</code></pre>
<p><img src="/portfolio/2017-03-09-fun-w-tidy-text_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
</description>
</item>
<item>
<title>Building A Website</title>
<link>/portfolio/2017-03-12-build-website/</link>
<pubDate>Sun, 12 Mar 2017 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-12-build-website/</guid>
<description><!-- BLOGDOWN-HEAD -->
<!-- /BLOGDOWN-HEAD -->
<!-- BLOGDOWN-BODY-BEFORE -->
<!-- /BLOGDOWN-BODY-BEFORE -->
<p>This website was constructed using GitHub, the <a href="https://github.com/rstudio/blogdown">blogdown package</a> for R and Hugo. I got started with <a href="https://proquestionasker.github.io/blog/Making_Site/">Amber Thomas’s amazing tutorial</a> and got some further help from <a href="http://robertmyles.github.io/2017/02/01/how-to-make-a-github-pages-blog-with-rstudio-and-hugo/">Robert Myles McDonnell</a> and more help from <a href="http://whipperstacker.com/2015/11/27/deploying-a-stand-alone-hugo-site-to-github-pages-mapped-to-a-custom-domain/">here</a>. If you are interested in building a site with this configuration, I would start either with Amber’s or Robert’s tutorials as they are very detailed and will give an understanding of how this whole thing works. If, however, the steps detailed in either of these things don’t work for you, try out the process that eventually worked for me.</p>
<div id="kates-approach" class="section level4">
<h4>Kate’s Approach</h4>
<p>Let’s start from the very beginning. I was running into a lot of issues (see comments on Robert’s blog and on Amber’s tutorial) but this specific series of steps managed to work for me.</p>
<p>First, start fresh. Create a USERNAME repo and a special HUGO repo on the ONLINE GitHub.</p>
<p>Go back to your computer. (I’m working with a Mac).</p>
<p>Create a new project in RStudio (I think you can probably do this within a folder, but I went the project route). Create a new directory on your computer and name it whatever you want. Then, in R, use blogdown to create a site and, if you want, install a theme. Then <strong>serve_site()</strong> and <strong>STOP</strong> R with the little red stop sign in the upper right corner of the Console window. (I have to do this anyway because whenever I <strong>serve_site()</strong> or <strong>new_site()</strong> in RStudio the “&gt;” doesn’t show up in my console afterwards.) Update: I just figured out that this happens because the site is being continuously “served” once you start it – i.e. you can make some changes and they’ll automatically update in the little Viewer window. It probably doesn’t matter too much, but I’d stop it just in case.</p>
<p>Now, in that directory you created with the project will be ALL the website files. Don’t move anything around. I then followed the steps from <a href="http://whipperstacker.com/2015/11/27/deploying-a-stand-alone-hugo-site-to-github-pages-mapped-to-a-custom-domain/">this link</a> mentioned above and did the following in terminal:</p>
<p>Change directories to the directory where the website files are (let’s call this ‘Local Hugo’). Now remove the public folder (this command deletes the folder and all of its contents):</p>
<pre><code>rm -r public/</code></pre>
<p>And NOW you are going to link up to your online GitHub.</p>
<pre><code>$ git init
$ git remote add origin git@github.com:USERNAME/HUGO.git
$ git submodule add git@github.com:USERNAME/USERNAME.github.io.git public
</code></pre>
<p>(I get asked here for the password I set up for my .ssh/id_rsa key)</p>
<p>Then:</p>
<pre><code>$ git commit -m &#39;initial commit&#39;
$ git push origin master</code></pre>
<p>(Get asked again for password)</p>
<p>Now I go back to RStudio where my project is up and running. I did the customary new_post(‘Hello world’) post to test things out and then made some changes to the config file (put in my name, etc.). Then I <strong>serve_site()</strong> in R, <strong>STOP</strong> it again and return to terminal. I then followed the commands listed in Robert’s tutorial.</p>
<p>Change directories in terminal to your local Hugo folder, then change directories again so you are in the ‘public’ folder inside it. Then:</p>
<pre><code>$ git add -A
$ git commit -m &#39;lovely new site&#39;
$ git push origin master</code></pre>
<p>(Again I get asked for my password)</p>
<p>Then go to your username.github.io site. It <em>should</em> be there! Now when you make changes it’s the same process: do your thing in RStudio, <strong>serve_site()</strong>, then <strong>STOP</strong>, then go to terminal and run the same commands (navigate to the local Hugo “public” folder, add -A, commit with a message and then push).</p>
</div>
</description>
</item>
<item>
<title>Adventures in Text Mining</title>
<link>/portfolio/2017-03-15-text-mining/</link>
<pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-15-text-mining/</guid>
<description><p>There are many wonderful tutorials on how to <a href="https://blogs.sap.com/2014/03/16/setting-up-twitter-api-to-work-with-r/">work with Twitter REST APIs</a> (even a video walk-through <a href="https://www.youtube.com/watch?v=lT4Kosc_ers">here</a>) so I won’t describe that process. Instead, I will show some examples of using the <a href="https://cran.r-project.org/web/packages/twitteR/twitteR.pdf">twitteR</a> and related packages to look at geotagged posts occurring within a specific neighborhood (i.e. how to use the <strong>searchTwitter()</strong> function in twitteR to search by <em>location</em>, not specific search term). I will also be using methods described in Julia Silge and David Robinson’s <a href="http://tidytextmining.com/">new book</a>.</p>
<pre class="r"><code>packs = c(&quot;twitteR&quot;,&quot;RCurl&quot;,&quot;RJSONIO&quot;,&quot;stringr&quot;,&quot;ggplot2&quot;,&quot;devtools&quot;,&quot;DataCombine&quot;,&quot;ggmap&quot;,
&quot;topicmodels&quot;,&quot;slam&quot;,&quot;Rmpfr&quot;,&quot;tm&quot;,&quot;stringr&quot;,&quot;wordcloud&quot;,&quot;plyr&quot;,
&quot;tidytext&quot;,&quot;dplyr&quot;,&quot;tidyr&quot;,&quot;xlsx&quot;)
lapply(packs, library, character.only=T)</code></pre>
<pre class="r"><code># key = &quot;YOUR KEY HERE&quot;
# secret = &quot;YOUR SECRET HERE&quot;
# tok = &quot;YOUR TOK HERE&quot;
# tok_sec = &quot;YOUR TOK_SEC HERE&quot;
twitter_oauth &lt;- setup_twitter_oauth(key, secret, tok, tok_sec)</code></pre>
<p>I’m interested in the Mission District neighborhood in San Francisco, California. I obtain a set of coordinates using Google Maps and plug that into the ‘geocode’ parameter and then set a radius of 1 kilometer. I know from experience that I only get around 1,000 - 2,000 posts each time I do this, so I set the number of tweets (n) I would like to get from Twitter at 7,000. If you are looking at a more ‘active’ area, or a larger area (more about this later), you can always adjust this number. The API will give you a combination of the most <a href="https://dev.twitter.com/rest/public/search">“recent or popular” tweets</a>, which usually extend back about 5 days or so. If you are looking at a smaller area, this means that to get any kind of decent tweet corpus you’ll have to spend some time collecting data week after week. Also, if you want to look at an area larger than a 3-4 kilometer radius, you’ll often get a bunch of spam-like posts that don’t have latitude and longitude coordinates associated with them. A workaround I thought of (for another project looking at posts in an entire city) is to figure out a series of spots to collect tweets from (trying to avoid overlap as much as possible), stitching those data frames together and getting rid of any duplicate posts you picked up where your radii overlapped.</p>
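<p>That stitching workaround boils down to binding the per-point data frames together and dropping rows whose tweet id has already been seen. A minimal sketch with made-up data (the column names mirror what <strong>twListToDF()</strong> produces, but the values are toy examples):</p>

```r
# Toy sketch of the multi-point "stitching" workaround: one data frame
# per search point, combined, with duplicate tweet ids removed.
point1 = data.frame(text = c("tacos", "mural"), id = c("1", "2"))
point2 = data.frame(text = c("mural", "park"), id = c("2", "3"))  # id "2" overlaps
combined = rbind(point1, point2)
# keep only the first occurrence of each tweet id
combined = combined[!duplicated(combined$id), ]
nrow(combined)  # 3 unique posts
```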
<p>Luckily for the Mission District, however, we are interested in a smaller area and don’t have to worry about multiple sampling points and rbind’ing data frames together, and just run the searchTwitter function once:</p>
<pre class="r"><code>geo &lt;- searchTwitter(&#39;&#39;,n=7000, geocode=&#39;37.76,-122.42,1km&#39;,
retryOnRateLimit=1)</code></pre>
<div id="processing-the-data" class="section level4">
<h4>Processing the data</h4>
<p>Now you have a list of tweets. A list is awkward to manipulate for this kind of analysis in R, so you convert it into a data frame:</p>
<pre class="r"><code>geoDF&lt;-twListToDF(geo)</code></pre>
<p>Chances are there will be emojis in your Twitter data. You can ‘transform’ these emojis into prose using this code as well as a <a href="https://github.com/lyons7/emojidictionary">CSV file</a> I’ve put together of what all of the emojis look like in R. (The idea for this comes from <a href="http://opiateforthemass.es/articles/emoticons-in-R/">Jessica Peterka-Bonetta’s work</a> – she has a list of emojis as well, but it does not include the newest batch of emojis, Unicode Version 9.0, nor the different skin color options for human-based emojis). If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.</p>
<p>Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.</p>
<pre class="r"><code>emoticons &lt;- read.csv(&quot;Decoded Emojis Col Sep.csv&quot;, header = T)</code></pre>
<p>To transform the emojis, you first need to transform the tweet data into ASCII:</p>
<pre class="r"><code>geoDF$text &lt;- iconv(geoDF$text, from = &quot;latin1&quot;, to = &quot;ascii&quot;,
sub = &quot;byte&quot;)</code></pre>
<p>To ‘count’ the emojis you do a find and replace using the CSV file of ‘Decoded Emojis’ as a reference. Here I am using the <a href="http://www.inside-r.org/packages/cran/DataCombine/docs/FindReplace">DataCombine package</a>. This identifies emojis in the tweets and then replaces them with a prose version. I used whatever description pops up when hovering one’s cursor over an emoji on an Apple emoji keyboard. Even if a description is not exactly the same as on other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.</p>
<pre class="r"><code>data &lt;- FindReplace(data = geoDF, Var = &quot;text&quot;,
replaceData = emoticons,
from = &quot;R_Encoding&quot;, to = &quot;Name&quot;,
exact = FALSE)</code></pre>
<p>Now might be a good time to save this file, perhaps in CSV format with the date of when the data was collected:</p>
<pre class="r"><code># write.csv(data,file=paste(&quot;ALL&quot;,Sys.Date(),&quot;.csv&quot;))</code></pre>
</div>
<div id="visualizing-the-data" class="section level4">
<h4>Visualizing the data</h4>
<p>Now let’s play around with visualizing the data. I want to superimpose different aspects of the tweets I collected on a map. First I have to get a map, which I do using the <a href="https://cran.r-project.org/web/packages/ggmap/ggmap.pdf">ggmap package</a> which interacts with Google Map’s API. When you use this package, be sure to cite it, as it requests you to when you first load the package into your library. (Well, really you should cite every R package you use, right?)</p>
<p>I request a map of the Mission District, and then check to make sure the map is what I want (in terms of zoom, area covered, etc.)</p>
<pre class="r"><code>map &lt;- get_map(location = &#39;Capp St. and 20th, San Francisco,
California&#39;, zoom = 15)
# To check out the map
ggmap(map)</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-10-1.png" width="672" /> Looks good to me! Now let’s start to visualize our Twitter data. We can start by seeing where our posts are on a map.</p>
<pre class="r"><code># Tell R what we want to map
# Need to specify that lat/lon should be treated like numbers
data$longitude&lt;-as.numeric(data$longitude)
data$latitude&lt;-as.numeric(data$latitude)</code></pre>
<p>For now I just want to look at latitude and longitude, but it is possible to map other aspects as well - it just depends on what you’d like to look at.</p>
<pre class="r"><code>Mission_tweets &lt;- ggmap(map) + geom_point(aes(x=longitude, y=latitude),
data=data, alpha=0.5)
Mission_tweets</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>We can also look at WHEN the posts were generated. We can make a graph of post frequency over time. Graphs constructed with help from <a href="http://www.cyclismo.org/tutorial/R/time.html">here</a>, <a href="https://gist.github.com/stephenturner/3132596">here</a>, <a href="http://stackoverflow.com/questions/27626915/r-graph-frequency-of-observations-over-time-with-small-value-range">here</a>, <a href="http://michaelbommarito.com/2011/03/12/a-quick-look-at-march11-saudi-tweets/">here</a>, <a href="http://stackoverflow.com/questions/31796744/plot-count-frequency-of-tweets-for-word-by-month">here</a>, <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/as.POSIXlt.html">here</a>, <a href="http://sape.inf.usi.ch/quick-reference/ggplot2/geom">here</a> and <a href="http://stackoverflow.com/questions/3541713/how-to-plot-two-histograms-together-in-r">here</a>.</p>
<pre class="r"><code># Create a data frame with number of tweets per time
d &lt;- as.data.frame(table(data$created))
d &lt;- d[order(d$Freq, decreasing=T), ]
names(d) &lt;- c(&quot;created&quot;,&quot;freq&quot;)
# Combine this with existing data frame
newdata1 &lt;- merge(data,d,by=&quot;created&quot;)
# Tell R that &#39;created&#39; is not an integer or factor but a time.
data$created &lt;- as.POSIXct(data$created)</code></pre>
<p>Now plot the number of tweets over time in one-hour bins (for date-time data geom_freqpoly’s binwidth is in seconds, so 60*minutes here is 3,600 seconds):</p>
<pre class="r"><code>minutes &lt;- 60
plot1&lt;-ggplot(data, aes(created)) + geom_freqpoly(binwidth=60*minutes)
plot1</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-14-1.png" width="672" /> This might be more informative if you want to look at specific time periods. We can look at the frequency of posts over the course of a specific day if we want.</p>
<pre class="r"><code>data2 &lt;- data[data$created &lt;= &quot;2017-03-11 00:31:00&quot;, ]
minutes &lt;- 60
plot2&lt;-ggplot(data2, aes(created)) + geom_freqpoly(binwidth=60*minutes)
plot2</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-15-1.png" width="672" /></p>
<p>Let’s look at other ways to visualize Twitter data. I will be using a larger corpus of posts I’ve been building up for about a year (as mentioned above, I only get about 1,000 posts per searchTwitter per week so it took some time to get a good corpus going).</p>
<p>Some more processing needs to be completed before looking at things like most frequent terms or what kinds of sentiments seem to be expressed in our corpus. All of the following steps to ‘clean’ the data of URLs, odd characters and ‘stop words’ (a.k.a. words like ‘the’ or ‘and’ that aren’t very informative re. what the post is actually discussing) are taken from <a href="http://tidytextmining.com/sentiment.html#the-sentiments-dataset">Silge and Robinson</a>.</p>
<pre class="r"><code>tweets=read.csv(&quot;Col_Sep_INSTACORPUS.csv&quot;, header=T)
# Get rid of stuff particular to the data (here encodings of links
# and such)
# Most of these are characters I don&#39;t have encodings for (other scripts, etc.)
tweets$text = gsub(&quot;Just posted a photo&quot;,&quot;&quot;, tweets$text)
tweets$text = gsub( &quot;&lt;.*?&gt;&quot;, &quot;&quot;, tweets$text)
# Get rid of super frequent spam-y posters
tweets &lt;- tweets[! tweets$screenName %in% c(&quot;4AMSOUNDS&quot;,
&quot;BruciusTattoo&quot;,&quot;LionsHeartSF&quot;,&quot;hermesalchemist&quot;,&quot;Mrsourmash&quot;),]
# Now for Silge and Robinson&#39;s code. What this is doing is getting rid of
# URLs, re-tweets (RT) and ampersands. This also gets rid of stop words
# without having to get rid of hashtags and @ signs by using
# str_detect and filter!
reg &lt;- &quot;([^A-Za-z_\\d#@&#39;]|&#39;(?![A-Za-z_\\d#@]))&quot;
tidy_tweets &lt;- tweets %&gt;%
filter(!str_detect(text, &quot;^RT&quot;)) %&gt;%
mutate(text = str_replace_all(text, &quot;https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;amp;|&amp;lt;|&amp;gt;|RT|https&quot;, &quot;&quot;)) %&gt;%
unnest_tokens(word, text, token = &quot;regex&quot;, pattern = reg) %&gt;%
filter(!word %in% stop_words$word,
str_detect(word, &quot;[a-z]&quot;))
# Get rid of stop words by doing an &#39;anti-join&#39; (amazing run-down of what
# all the joins do is available here:
# http://stat545.com/bit001_dplyr-cheatsheet.html#anti_joinsuperheroes-publishers)
data(stop_words)
tidy_tweets &lt;- tidy_tweets %&gt;%
anti_join(stop_words)</code></pre>
<p>Now we can look at things like most frequent words and sentiments expressed in our corpus.</p>
<pre class="r"><code># Find most common words in corpus
tidy_tweets %&gt;%
count(word, sort = TRUE) </code></pre>
<pre><code>## # A tibble: 27,441 × 2
## word n
## &lt;chr&gt; &lt;int&gt;
## 1 mission 3327
## 2 san 2751
## 3 francisco 2447
## 4 district 1871
## 5 #sanfrancisco 1175
## 6 park 1060
## 7 dolores 1045
## 8 sf 1031
## 9 #sf 696
## 10 day 597
## # ... with 27,431 more rows</code></pre>
<p>Plot most common words:</p>
<pre class="r"><code>tidy_tweets %&gt;%
count(word, sort = TRUE) %&gt;%
filter(n &gt; 150) %&gt;%
mutate(word = reorder(word, n)) %&gt;%
ggplot(aes(word, n)) +
geom_bar(stat = &quot;identity&quot;) +
xlab(NULL) +
coord_flip()</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-18-1.png" width="672" /></p>
<p>What about different sentiments?</p>
<pre class="r"><code># What are the most common &#39;joy&#39; words?
nrcjoy &lt;- sentiments %&gt;%
filter(lexicon == &quot;nrc&quot;) %&gt;%
filter(sentiment == &quot;joy&quot;)
tidy_tweets %&gt;%
semi_join(nrcjoy) %&gt;%
count(word, sort = TRUE)</code></pre>
<pre><code>## # A tibble: 355 × 2
## word n
## &lt;chr&gt; &lt;int&gt;
## 1 love 466
## 2 happy 402
## 3 art 266
## 4 birthday 205
## 5 favorite 194
## 6 beautiful 179
## 7 fun 173
## 8 food 139
## 9 music 126
## 10 chocolate 105
## # ... with 345 more rows</code></pre>
<pre class="r"><code># What are the most common &#39;disgust&#39; words?
nrcdisgust &lt;- sentiments %&gt;%
filter(lexicon == &quot;nrc&quot;) %&gt;%
filter(sentiment == &quot;disgust&quot;)
tidy_tweets %&gt;%
semi_join(nrcdisgust) %&gt;%
count(word, sort = TRUE)</code></pre>
<pre><code>## # A tibble: 217 × 2
## word n
## &lt;chr&gt; &lt;int&gt;
## 1 finally 100
## 2 feeling 55
## 3 gray 50
## 4 tree 50
## 5 hanging 44
## 6 bad 43
## 7 boy 40
## 8 shit 36
## 9 lemon 29
## 10 treat 28
## # ... with 207 more rows</code></pre>
<pre class="r"><code># We can also look at counts of negative and positive words
bingsenti &lt;- sentiments %&gt;%
filter(lexicon ==&quot;bing&quot;)
bing_word_counts &lt;- tidy_tweets %&gt;%
inner_join(bingsenti) %&gt;%
count(word, sentiment, sort = TRUE) %&gt;%
ungroup()
# And graph them!
bing_word_counts %&gt;%
filter(n &gt; 25) %&gt;%
mutate(n = ifelse(sentiment == &quot;negative&quot;, -n, n)) %&gt;%
mutate(word = reorder(word, n)) %&gt;%
ggplot(aes(word, n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = &quot;identity&quot;) +
labs(y = &quot;Contribution to sentiment&quot;,
x = NULL) +
coord_flip()</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-19-1.png" width="672" /></p>
<p>We can also make pretty word clouds! Code for this taken from <a href="https://rstudio-pubs-static.s3.amazonaws.com/132792_864e3813b0ec47cb95c7e1e2e2ad83e7.html">here</a>.</p>
<pre class="r"><code># First have to make a document term matrix, which involves a few steps
tidy_tweets %&gt;%
count(document, word, sort=TRUE)
tweet_words &lt;- tidy_tweets %&gt;%
count(document, word) %&gt;%
ungroup()
total_words &lt;- tweet_words %&gt;%
group_by(document) %&gt;%
summarize(total = sum(n))
post_words &lt;- left_join(tweet_words, total_words)
dtm &lt;- post_words %&gt;%
cast_dtm(document, word, n)
# Need freq count for word cloud to work
freq = data.frame(sort(colSums(as.matrix(dtm)), decreasing=TRUE))</code></pre>
<pre class="r"><code>wordcloud(rownames(freq), freq[,1], max.words=70,
colors=brewer.pal(1, &quot;Dark2&quot;))</code></pre>
<p><img src="/portfolio/2017-03-15-text-mining_files/figure-html/unnamed-chunk-21-1.png" width="672" /></p>
<p>If you are interested in delving even deeper, you can try techniques like topic modeling, a process I describe and demonstrate <a href="https://lyons7.github.io/portfolio/2017-03-09-fun-w-tidy-text/">here</a>!</p>
</div>
</description>
</item>
<item>
<title>Identifying and Visualizing Emojis</title>
<link>/portfolio/2017-03-10-emoji-maps/</link>
<pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-03-10-emoji-maps/</guid>
<description><p>Dealing with emojis in mined social media data can be tricky for a number of reasons. First, you have to decode them and then… well I guess that is it. After you decode them there is a number of cool things you can look at though!</p>
<div id="processing-the-data" class="section level2">
<h2>Processing the data</h2>
<p>As mentioned, if you are working with social media data, chances are there will be emojis in that data. You can ‘transform’ these emojis into prose using this code as well as a <a href="https://github.com/lyons7/emojidictionary">CSV file</a> I’ve put together of what all of the emojis look like in R. (The idea for this comes from <a href="http://opiateforthemass.es/articles/emoticons-in-R/">Jessica Peterka-Bonetta’s work</a> – she has a list of emojis as well, but it does not include the newest batch of emojis, Unicode Version 9.0, nor the different skin color options for human-based emojis). If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.</p>
</div>
<div id="processing-the-data-1" class="section level2">
<p>Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.</p>
<pre class="r"><code>tweets=read.csv(&quot;Col_Sep_INSTACORPUS.csv&quot;, header=T)
emoticons &lt;- read.csv(&quot;Decoded Emojis Col Sep.csv&quot;, header = T)</code></pre>
<p>To transform the emojis, you first need to transform your tweet data into ASCII:</p>
<pre class="r"><code>tweets$text &lt;- iconv(tweets$text, from = &quot;latin1&quot;, to = &quot;ascii&quot;,
sub = &quot;byte&quot;)</code></pre>
</div>
<div id="processing-the-data-2" class="section level2">
<p>To ‘count’ the emojis you do a find and replace using the CSV file of ‘Decoded Emojis’ as a reference. Here I am using the <a href="http://www.inside-r.org/packages/cran/DataCombine/docs/FindReplace">DataCombine package</a>. This identifies emojis in the tweeted Instagram posts and then replaces them with a prose version. I used whatever description pops up when hovering one’s cursor over an emoji on an Apple emoji keyboard. Even if a description is not exactly the same as on other platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.</p>
<pre class="r"><code>library(DataCombine)
tweets &lt;- FindReplace(data = tweets, Var = &quot;text&quot;,
replaceData = emoticons,
from = &quot;R_Encoding&quot;, to = &quot;Name&quot;,
exact = FALSE)</code></pre>
<p>Now I’m going to subset the data to just look at those posts that have emojis in them. I got help in doing this from <a href="http://stackoverflow.com/questions/26319567/use-grepl-to-search-either-of-multiple-substrings-in-a-text-in-r">here</a>. Again I use my emoji dictionary available <a href="https://github.com/lyons7/emojidictionary">here</a>.</p>
<pre class="r"><code>emoticons &lt;- read.csv(&quot;Decoded Emojis Col Sep.csv&quot;, header = T)
emogrepl &lt;- grepl(paste(emoticons$Name, collapse = &quot;|&quot;), tweets$text)
emogreplDF&lt;-as.data.frame(emogrepl)
tweets$ID7 &lt;- 1:nrow(tweets)
emogreplDF$ID7 &lt;- 1:nrow(emogreplDF)
tweets &lt;- merge(tweets,emogreplDF,by=&quot;ID7&quot;)
emosub &lt;- tweets[tweets$emogrepl == &quot;TRUE&quot;, ]</code></pre>
<p>Now that you have a subset of emojis you can compare posts with emojis vs. posts without etc. etc.!</p>
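<p>For example (a toy sketch, assuming a data frame with the logical <strong>emogrepl</strong> column built above), you could compare how common emoji posts are, or whether they tend to run longer or shorter than posts without emojis:</p>

```r
# Toy comparison of posts with vs. without emojis, using a made-up
# data frame shaped like the one built above (emogrepl = TRUE/FALSE).
tweets = data.frame(
  text = c("brunch HEAVYBLACKHEART", "great tacos", "SPARKLES yes", "ok"),
  emogrepl = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
# share of posts containing at least one emoji
mean(tweets$emogrepl)  # 0.5
# average post length (in characters) for each group
mean(nchar(tweets$text[tweets$emogrepl]))
mean(nchar(tweets$text[!tweets$emogrepl]))
```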
<p>How about subsetting BY emoji? Let’s look just at posts that have certain emojis in them, like the red heart emoji or the face with tears of joy.</p>
<p>First we do pattern matching and replacement. The first command looks through the text of the emosub data frame, finds all instances in which the string ‘HEAVYBLACKHEART’ is present and then generates a list of T/F values.</p>
<pre class="r"><code>heartgrepl &lt;- grepl(paste(&quot; HEAVYBLACKHEART &quot;), emosub$text)
# Turn that list of T/F values into a data frame so we can link it back to the original posts
heartgreplDF&lt;-as.data.frame(heartgrepl)
# Make a new ID column so we can merge them together (the T/F designation and your data frame of posts)
emosub$ID7 &lt;- 1:nrow(emosub)
heartgreplDF$ID7 &lt;- 1:nrow(heartgreplDF)
emosub &lt;- merge(emosub,heartgreplDF,by=&quot;ID7&quot;)
redheart &lt;- emosub[emosub$heartgrepl == &quot;TRUE&quot;, ]</code></pre>
<p>Let’s do the same with FACEWITHTEARSOFJOY</p>
<pre class="r"><code>lolfacegrepl &lt;- grepl(paste(&quot; FACEWITHTEARSOFJOY &quot;), emosub$text)
lolfacegreplDF&lt;-as.data.frame(lolfacegrepl)
emosub$ID7 &lt;- 1:nrow(emosub)
lolfacegreplDF$ID7 &lt;- 1:nrow(lolfacegreplDF)
emosub &lt;- merge(emosub,lolfacegreplDF,by=&quot;ID7&quot;)
lolface &lt;- emosub[emosub$lolfacegrepl == &quot;TRUE&quot;, ]</code></pre>
<p>Now SMILINGFACEWITHHEARTSHAPEDEYES</p>
<pre class="r"><code>hearteyesgrepl &lt;- grepl(paste(&quot; SMILINGFACEWITHHEARTSHAPEDEYES &quot;), emosub$text)
hearteyesgreplDF&lt;-as.data.frame(hearteyesgrepl)
emosub$ID7 &lt;- 1:nrow(emosub)
hearteyesgreplDF$ID7 &lt;- 1:nrow(hearteyesgreplDF)
emosub &lt;- merge(emosub,hearteyesgreplDF,by=&quot;ID7&quot;)
hearteyes &lt;- emosub[emosub$hearteyesgrepl == &quot;TRUE&quot;, ]</code></pre>
<p>Sparkles!!!!</p>
<pre class="r"><code>sparklesgrepl &lt;- grepl(paste(&quot; SPARKLES &quot;), emosub$text)
sparklesgreplDF&lt;-as.data.frame(sparklesgrepl)
emosub$ID7 &lt;- 1:nrow(emosub)
sparklesgreplDF$ID7 &lt;- 1:nrow(sparklesgreplDF)
emosub &lt;- merge(emosub,sparklesgreplDF,by=&quot;ID7&quot;)
sparkles &lt;- emosub[emosub$sparklesgrepl == &quot;TRUE&quot;, ]</code></pre>
<p>Face savouring delicious food!!!!!!!!!!!!!!!</p>
<pre class="r"><code>savourfoodgrepl &lt;- grepl(paste(&quot; FACESAVOURINGDELICIOUSFOOD &quot;), emosub$text)
savourfoodgreplDF&lt;-as.data.frame(savourfoodgrepl)
emosub$ID7 &lt;- 1:nrow(emosub)
savourfoodgreplDF$ID7 &lt;- 1:nrow(savourfoodgreplDF)
emosub &lt;- merge(emosub,savourfoodgreplDF,by=&quot;ID7&quot;)
savourfood &lt;- emosub[emosub$savourfoodgrepl == &quot;TRUE&quot;, ]</code></pre>
<h2>Mapping the data</h2>
<p>Let’s have a little fun and try to map where some of these emojis occur. I am using the <a href="https://github.com/dill/emoGG">emoGG</a> package.</p>
<pre class="r"><code># devtools::install_github(&quot;dill/emoGG&quot;)
library(emoGG)
# Find the emojis we want to use for a graph (might take a few times to get your search query right)
emoji_search(&quot;heart face&quot;)</code></pre>
<pre><code>## emoji code keyword
## 1 grinning 1f600 face
## 5 grin 1f601 face
## 9 joy 1f602 face
## 15 smiley 1f603 face
## 19 smile 1f604 face
## 26 sweat_smile 1f605 face
## 35 laughing 1f606 face
## 37 innocent 1f607 face
## 46 wink 1f609 face
## 50 blush 1f60a face
## 58 relaxed 263a face
## 66 yum 1f60b face
## 69 relieved 1f60c face
## 74 heart_eyes 1f60d face
## 81 sunglasses 1f60e face
## 86 smirk 1f60f face
## 94 expressionless 1f611 face
## 102 sweat 1f613 face
## 107 pensive 1f614 face
## 112 confused 1f615 face
## 117 confounded 1f616 face
## 124 kissing 1f617 face
## 128 kissing_heart 1f618 face
## 134 kissing_smiling_eyes 1f619 face
## 138 kissing_closed_eyes 1f61a face
## 144 stuck_out_tongue 1f61b face
## 150 stuck_out_tongue_winking_eye 1f61c face
## 156 stuck_out_tongue_closed_eyes 1f61d face
## 161 disappointed 1f61e face
## 165 worried 1f61f face
## 169 angry 1f620 face
## 176 cry 1f622 face
## 181 persevere 1f623 face
## 186 triumph 1f624 face
## 191 disappointed_relieved 1f625 face
## 195 frowning 1f626 face
## 198 anguished 1f627 face
## 201 fearful 1f628 face
## 207 weary 1f629 face
## 213 sleepy 1f62a face
## 221 grimacing 1f62c face
## 224 sob 1f62d face
## 230 open_mouth 1f62e face
## 234 hushed 1f62f face
## 237 cold_sweat 1f630 face
## 239 scream 1f631 face
## 243 astonished 1f632 face
## 247 flushed 1f633 face
## 251 sleeping 1f634 face
## 259 no_mouth 1f636 face
## 261 mask 1f637 face
## 513 ear 1f442 face
## 514 ear 1f442 hear
## 515 ear 1f442 sound
## 516 ear 1f442 listen
## 526 kiss 1f48b face
## 1467 heart 2764 love
## 1468 heart 2764 like
## 1469 heart 2764 valentines
## 1502 cupid 1f498 heart
## 1667 art 1f3a8 design
## 1668 art 1f3a8 paint
## 1669 art 1f3a8 draw
## 2733 a 1f170 red-square
## 2734 a 1f170 alphabet
## 2735 a 1f170 letter
## 3358 bowtie face
## 3366 neckbeard face</code></pre>
<pre class="r"><code># We find the code &quot;1f60d&quot; for the smiling face with heart shaped eyes. Let&#39;s try to graph this on a map!
# Using the ggmap package here
map &lt;- get_map(location = &#39;Capp St. and 20th, San Francisco,
California&#39;, zoom = 15)
lat &lt;- hearteyes$latitude
lon &lt;- hearteyes$longitude
# Without the background
# mapPointshearteyes &lt;- ggplot(hearteyes, aes(lon,lat)) + geom_emoji(emoji=&quot;1f60d&quot;)
mapPointshearteyes &lt;- ggmap(map) + geom_emoji(aes(x = lon, y = lat),
data=hearteyes, emoji=&quot;1f60d&quot;)</code></pre>
<pre class="r"><code>mapPointshearteyes</code></pre>
<p><img src="/portfolio/2017-03-10-emoji-maps_files/figure-html/unnamed-chunk-12-1.png" width="672" /></p>
<p>Now let’s try multiple emojis at once (help from <a href="http://blog.revolutionanalytics.com/2015/11/emojis-in-ggplot-graphics.html">here</a>).</p>
<pre class="r"><code># Can we do this with plain old layering?
# emoji_search(&quot;sparkles&quot;)
# sparkles = &quot;2728&quot;
# red heart = &quot;2764&quot;
mapPointsmulti &lt;- ggmap(map) + geom_emoji(aes(x = lon, y = lat),
data=hearteyes, emoji=&quot;1f60d&quot;) +
geom_emoji(aes(x=sparkles$longitude, y=sparkles$latitude),
data=sparkles, emoji=&quot;2728&quot;) +
geom_emoji(aes(x=redheart$longitude, y=redheart$latitude),
data=redheart, emoji=&quot;2764&quot;)
mapPointsmulti</code></pre>
<p><img src="/portfolio/2017-03-10-emoji-maps_files/figure-html/unnamed-chunk-13-1.png" width="672" /></p>
<p>How about emojis that are associated with food?</p>
<pre class="r"><code># apparently called the &#39;yum&#39; emoji: 1f60b
mapPointssavourface &lt;- ggmap(map) + geom_emoji(aes(x=savourfood$longitude,y=savourfood$latitude),
data=savourfood, emoji=&quot;1f60b&quot;)
mapPointssavourface</code></pre>
<p><img src="/portfolio/2017-03-10-emoji-maps_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
</div>
</description>
</item>
<item>
<title>Thesis Time!</title>
<link>/portfolio/2017-05-06-thesis-time/</link>
<pubDate>Thu, 23 Jul 2015 21:13:14 -0500</pubDate>
<guid>/portfolio/2017-05-06-thesis-time/</guid>
<description><div id="introduction" class="section level2">
<h2>Introduction</h2>
<p>The following is all of the code used to run analyses used in my dissertation.</p>
<pre class="r"><code>packs = c(&quot;twitteR&quot;,&quot;RCurl&quot;,&quot;RJSONIO&quot;,&quot;stringr&quot;,&quot;ggplot2&quot;,&quot;devtools&quot;,&quot;DataCombine&quot;,&quot;ggmap&quot;,&quot;topicmodels&quot;,&quot;slam&quot;,&quot;Rmpfr&quot;,&quot;tm&quot;,&quot;stringr&quot;,&quot;wordcloud&quot;,&quot;plyr&quot;,&quot;tidytext&quot;,&quot;dplyr&quot;,&quot;tidyr&quot;,&quot;xlsx&quot;,&quot;ggrepel&quot;,&quot;lubridate&quot;,&quot;purrr&quot;,&quot;broom&quot;, &quot;wordcloud&quot;,&quot;emoGG&quot;,&quot;ldatuning&quot;)
lapply(packs, library, character.only=T)</code></pre>
</div>
<div id="getting-the-data" class="section level2">
<h2>Getting the Data</h2>
<p>To collect data, I used the <a href="https://cran.r-project.org/web/packages/twitteR/twitteR.pdf">twitteR package</a>. I’m interested in the Mission District neighborhood in San Francisco, California. I obtain a set of coordinates using Google Maps and plug that into the ‘geocode’ parameter and then set a radius of 1 kilometer. I know from experience that I only get around 1,000 - 2,000 posts each time I do this, so I set the number of tweets (n) I would like to get from Twitter at 7,000.</p>
<pre class="r"><code># key = &quot;YOUR KEY HERE&quot;
# secret = &quot;YOUR SECRET HERE&quot;
# tok = &quot;YOUR TOK HERE&quot;
# tok_sec = &quot;YOUR TOK_SEC HERE&quot;
twitter_oauth &lt;- setup_twitter_oauth(key, secret, tok, tok_sec)
# To collect tweets
geo &lt;- searchTwitter(&#39;&#39;,n=7000, geocode=&#39;37.76,-122.42,1km&#39;,
retryOnRateLimit=1)</code></pre>
<p>Now I want to identify emojis and separate out just those posts that came from Instagram. I then save those to a CSV file and build up a corpus over time by combining the saved files by hand.</p>
<pre class="r"><code># Now you have a list of tweets. Lists are very difficult to deal with in R, so you convert this into a data frame:
geoDF&lt;-twListToDF(geo)</code></pre>
<p>Chances are there will be emojis in your Twitter data. You can ‘transform’ these emojis into prose using this code as well as a <a href="https://github.com/lyons7/emojidictionary">CSV file</a> I’ve put together of what all of the emojis look like in R. (The idea for this comes from <a href="http://opiateforthemass.es/articles/emoticons-in-R/">Jessica Peterka-Bonetta’s work</a> – she has a list of emojis as well, but it does not include the newest batch of emojis, Unicode Version 9.0, nor the different skin color options for human-based emojis). If you use this emoji list for your own research, please make sure to acknowledge both myself and Jessica.</p>
<p>Load in the CSV file. You want to make sure it is located in the correct working directory so R can find it when you tell it to read it in.</p>
<pre class="r"><code>emoticons &lt;- read.csv(&quot;Decoded Emojis Col Sep.csv&quot;, header = T)</code></pre>
<p>To transform the emojis, you first need to transform the tweet data into ASCII:</p>
<pre class="r"><code>geoDF$text &lt;- iconv(geoDF$text, from = &quot;latin1&quot;, to = &quot;ascii&quot;,
sub = &quot;byte&quot;)</code></pre>
<p>To ‘count’ the emojis you do a find and replace, using the <a href="https://github.com/lyons7/emojidictionary">CSV file of ‘Decoded Emojis’</a> as a reference. Here I am using the <a href="http://www.inside-r.org/packages/cran/DataCombine/docs/FindReplace">DataCombine package</a>. This identifies emojis in the tweets and replaces them with a prose version. I used whatever description pops up when hovering the cursor over an emoji on an Apple emoji keyboard. Even if that description is not identical across platforms, it provides enough information to find the emoji in question if you are not sure which one was used in the post.</p>
<pre class="r"><code>data &lt;- FindReplace(data = geoDF, Var = &quot;text&quot;,
replaceData = emoticons,
from = &quot;R_Encoding&quot;, to = &quot;Name&quot;,
exact = FALSE)</code></pre>
<p>Now might be a good time to save this file, perhaps in CSV format with the date of when the data was collected:</p>
<pre class="r"><code>write.csv(data,file=paste(&quot;ALL&quot;,Sys.Date(),&quot;.csv&quot;))</code></pre>
<p>Now you have a data frame which you can manipulate in various ways. For my research, I’m just interested in posts that occurred on Instagram, so I subset to just those. (Why not access them via Instagram’s API, you ask? Long story short: they are very <em>very</em> conservative about providing access for academic research.) The work-around I’ve found is to filter mined tweets by those that have Instagram as their source:</p>
<pre class="r"><code>data &lt;- data[data$statusSource == &quot;&lt;a href=\&quot;http://instagram.com\&quot; rel=\&quot;nofollow\&quot;&gt;Instagram&lt;/a&gt;&quot;, ]
#Save this file
write.csv(data,file=paste(&quot;INSTA&quot;,Sys.Date(),&quot;.csv&quot;))</code></pre>
<p>Having done this for eight months, we have a nice corpus! Let’s load that in.</p>
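<p>(A minimal sketch of how the monthly files might be read back in and combined, assuming they were saved with <code>write.csv</code> as above; the folder name and file pattern here are hypothetical:)</p>
<pre class="r"><code># List the saved Instagram CSVs and row-bind them into one corpus
files &lt;- list.files(&quot;data&quot;, pattern = &quot;INSTA&quot;, full.names = TRUE)
tweets &lt;- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
tweets$X &lt;- NULL # drop the row-name column that write.csv adds</code></pre>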
</div>
<div id="analyzing-the-data" class="section level2">
<h2>Analyzing the Data</h2>
<div id="preparing-data-for-topic-modeling-and-sentiment-analysis" class="section level4">
<h4>Preparing data for Topic Modeling and Sentiment Analysis</h4>
<p>The data need a bit more processing before we can analyze them. Let’s work through it from the start, following <a href="http://tidytextmining.com/">Silge and Robinson</a>.</p>
<pre class="r"><code># Get rid of stuff particular to the data (here encodings of links and such)
# Most of these are characters I don&#39;t have encodings for (other scripts, etc.)
tweets$text = gsub(&quot;Just posted a photo&quot;,&quot;&quot;, tweets$text)
tweets$text = gsub( &quot;&lt;.*?&gt;&quot;, &quot;&quot;, tweets$text)
# Get rid of super frequent spam posters
tweets &lt;- tweets[! tweets$screenName %in% c(&quot;4AMSOUNDS&quot;,&quot;BruciusTattoo&quot;,&quot;LionsHeartSF&quot;,&quot;hermesalchemist&quot;,&quot;Mrsourmash&quot;,&quot;AaronTheEra&quot;,&quot;AmnesiaBar&quot;,&quot;audreymose2&quot;,&quot;audreymosez&quot;,&quot;Bernalcutlery&quot;,&quot;blncdbrkfst&quot;,&quot;BrunosSF&quot;,&quot;chiddythekidd&quot;,&quot;ChurchChills&quot;,&quot;deeXiepoo&quot;,&quot;fabricoutletsf&quot;,&quot;gever&quot;,&quot;miramirasf&quot;,&quot;papalote415&quot;,&quot;HappyHoundsMasg&quot;,&quot;faern_me&quot;),]
# If you want to combine colors, run this at least 3 times over to make sure it &#39;sticks&#39;
# coltweets &lt;- tweets
# coltweets$text &lt;- gsub(&quot; COLONE &quot;, &quot;COLONE&quot;, coltweets$text)
# coltweets$text &lt;- gsub(&quot; COLTWO &quot;, &quot;COLTWO&quot;, coltweets$text)
# coltweets$text &lt;- gsub(&quot; COLTHREE &quot;, &quot;COLTHREE&quot;, coltweets$text)
# coltweets$text &lt;- gsub(&quot; COLFOUR &quot;, &quot;COLFOUR&quot;, coltweets$text)
# coltweets$text &lt;- gsub(&quot; COLFIVE &quot;, &quot;COLFIVE&quot;, coltweets$text)
# Let&#39;s just use this for now. Maybe good to keep these things together
# tweets &lt;- coltweets
# Build a larger stop word list combining the tm and tidytext lists
# (English and Spanish) -- the tm list is small, but just in case
data(stop_words)
mystopwords &lt;- c(stopwords(&#39;english&#39;),stop_words$word, stopwords(&#39;spanish&#39;))
# Now for Silge and Robinson&#39;s code. What this is doing is getting rid of
# URLs, re-tweets (RT) and ampersands. This also gets rid of stop words
# without having to get rid of hashtags and @ signs by using
# str_detect and filter!
reg &lt;- &quot;([^A-Za-z_\\d#@&#39;]|&#39;(?![A-Za-z_\\d#@]))&quot;
tidy_tweets &lt;- tweets %&gt;%
filter(!str_detect(text, &quot;^RT&quot;)) %&gt;%
mutate(text = str_replace_all(text, &quot;https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&amp;amp;|&amp;lt;|&amp;gt;|RT|https&quot;, &quot;&quot;)) %&gt;%
unnest_tokens(word, text, token = &quot;regex&quot;, pattern = reg) %&gt;%
filter(!word %in% mystopwords,
str_detect(word, &quot;[a-z]&quot;))</code></pre>
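<p>(A quick sanity check on the tokenized data never hurts; this is just a sketch using the columns created above:)</p>
<pre class="r"><code># Peek at the most common tokens left after cleaning
tidy_tweets %&gt;%
  count(word, sort = TRUE) %&gt;%
  head(10)</code></pre>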
</div>
</div>
<div id="frequency-analysis-and-sentiment-analysis" class="section level2">
<h2>Frequency analysis and Sentiment analysis</h2>
<pre class="r"><code>freq &lt;- tidy_tweets %&gt;%
group_by(latitude,longitude) %&gt;%
count(word, sort = TRUE) %&gt;%
left_join(tidy_tweets %&gt;%
group_by(latitude,longitude) %&gt;%
summarise(total = n())) %&gt;%
mutate(freq = n/total)</code></pre>
<p>Here n is the number of times a term has shown up at a particular coordinate, and total is how many terms are present at that coordinate overall, so freq gives each term’s relative frequency there. Now we have a representation of terms, their frequency, and their position. One way to plot this would be to map just the most frequent terms (n &gt; 50). (Some help on how to do this was taken from <a href="http://blog.revolutionanalytics.com/2016/01/avoid-overlapping-labels-in-ggplot2-charts.html">here</a> and <a href="http://stackoverflow.com/questions/14288001/geom-text-not-working-when-ggmap-and-geom-point-used">here</a>.)</p>
<pre class="r"><code>freq2 &lt;- subset(freq, n &gt; 50)
map &lt;- get_map(location = &#39;Valencia St. and 20th, San Francisco,
California&#39;, zoom = 15)
freq2$longitude&lt;-as.numeric(freq2$longitude)
freq2$latitude&lt;-as.numeric(freq2$latitude)
mapPoints &lt;- ggmap(map) + geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_label_repel(data = freq2, aes(x = longitude, y = latitude, label = word),size = 3) </code></pre>
<p><img src="/portfolio/2017-05-06-thesis-time_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
<p>Let’s zoom into that main central area to see what’s going on!</p>
<pre class="r"><code>map2 &lt;- get_map(location = &#39;Valencia St. and 19th, San Francisco,
California&#39;, zoom = 16)
mapPoints2 &lt;- ggmap(map2) + geom_jitter(alpha = 0.1, size = 2.5, width = 0.25, height = 0.25) +
geom_label_repel(data = freq2, aes(x = longitude, y = latitude, label = word),size = 3) </code></pre>
<p><img src="/portfolio/2017-05-06-thesis-time_files/figure-html/unnamed-chunk-16-1.png" width="672" /></p>
<p>What about 24th?</p>
<pre class="r"><code># Have to lower the threshold a bit to get more terms
freq3 &lt;- subset(freq, n &gt; 15)
map3 &lt;- get_map(location = &#39;Folsom St. and 24th, San Francisco,
California&#39;, zoom = 16)