<!doctype html>
<html>
	<head>
		<meta charset="utf-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

		<title>Extracting Text and Data from Files Using OCR</title>

		<link rel="stylesheet" href="css/reveal.css">
		<link rel="stylesheet" href="css/theme/simple.css">

		<!-- Theme used for syntax highlighting of code -->
		<link rel="stylesheet" href="lib/css/zenburn.css">

		<!-- Printing and PDF exports -->

		<script>
			var link = document.createElement( 'link' );
			link.rel = 'stylesheet';
			link.type = 'text/css';
			link.href = window.location.search.match( /print-pdf/gi ) ? 'css/print/pdf.css' : 'css/print/paper.css';
			document.getElementsByTagName( 'head' )[0].appendChild( link );
		</script>

	</head>
	<body>
		<div class="reveal">
			<div class="slides">

				<!--SLIDE 1-->
				<section>
					<h2>Using OCR to Build a Digital Humanities Corpus</h2><br/>

					<div class="title-page"><p>Nick Wolf | February 7, 2017</p><p>Get this presentation: <a href="https://tinyurl.com/dhweek2017-ocr">https://tinyurl.com/dhweek2017-ocr</a></p></div>

					<img src="imgs/dataservices.png" width="30%" height="30%" align="middle"><br/>

					<div class='footer'>
						<hr/>
						<p>Nick's ORCID: <a href="http://orcid.org/0000-0001-5512-6151">0000-0001-5512-6151</a><br/>
						This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
					</div>
				</section>


				<!--SLIDE 2-->
				<section>
					<h3>What is OCR and What Can We Do with It?</h3>
					<ul>
					<li>OCR = Optical Character Recognition</li>
					<li>A system that analyzes an image of writing glyph by glyph and turns it into a document of machine-readable characters</li>
					<li>High-performing OCR depends on machine learning: you supervise your computer in recognizing images of characters, including unusual fonts, non-English-language texts, etc.</li>
					<li>Many options, but for highest accuracy (especially on older documents) use ABBYY or Tesseract</li>
					</ul>
				</section>

				<!--SLIDE 3-->
				<section>
					<h3>What is OCR and What Can We Do with It?</h3>
					<ul>
					<li>Aim is to produce a machine-readable file, whether plain text or html/xml (if possible, with bounding boxes)</li>
					<li>Preserve metadata elements important to DH inquiry (e.g. document structural elements, like page number)</li>
					<li>Preserve textual ornamentation (e.g. bold, italics, indents, capitalization) that may assist with identifying word tokens</li>
					</ul>
				</section>
				<!-- SLIDE 4-->
				<section>
					<h3>Input Formats</h3>
					<ul>
					<li>PDF or TIFF. For Tesseract, TIFF only.</li>
					<li>Input images at least 300dpi with good contrast.</li>
					</ul>
				</section>
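				<section data-markdown>
					<script type="text/template">
### Checking Scan Resolution

A rough way to check the 300 dpi guideline is to divide a scan's pixel dimensions by the physical page size in inches. This is only a sketch: the function names and the sample page size are illustrative assumptions, not part of ABBYY or Tesseract.

```python
def effective_dpi(width_px, height_px, width_in, height_in):
    """Return the lower of the horizontal and vertical dots per inch."""
    return min(width_px / width_in, height_px / height_in)


def meets_ocr_minimum(width_px, height_px, width_in, height_in, minimum=300):
    """True if the scan reaches the minimum resolution suggested for OCR."""
    return effective_dpi(width_px, height_px, width_in, height_in) >= minimum


# A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 pixels is 300 dpi.
print(effective_dpi(2550, 3300, 8.5, 11))
# The same page at 1275 x 1650 pixels is only 150 dpi, so rescan it.
print(meets_ocr_minimum(1275, 1650, 8.5, 11))
```

If the check fails, rescanning at a higher setting is usually better than upsampling the existing image.
					</script>
				</section>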

				<!--SLIDE 5-->
				<section>
				<img src="imgs/abbyy.jpg" width="15%" height="15%" style="float:right; margin-left:0.5em;">
					<h2>Option One: ABBYY FineReader</h2>
					<p>Basic specs:</p>
					<ul>
					<li>Mac and Windows versions. Currently, Windows interface much better.</li>
					<li>30-day Trial License Available (great for one-off projects)</li>
					<li><a href="https://www.abbyy.com/en-us/finereader/professional/for-education/">Discount license</a> ($118.99 currently) for teachers/students/researchers</li>
					<li>Two Mac versions on scanner computer stations in Digital Studio at Bobst; licensed version here in 617, trial version in 5th-Floor Research Commons labs.</li>
					</ul><br/><br/>
					<br/>
					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 6-->
				<section>
					<h2>ABBYY Tutorial Materials</h2>
					<ol>
					<li>Log in to your NYU E-mail/Home/Drive Account</li>
					<li>Download <a href="https://drive.google.com/a/nyu.edu/file/d/0Bz9LVQAqjIjLTVhkZnZLNHg2NTA/view?usp=sharing">this document</a> and save it somewhere on the system where you can find it.</li>
					<li>FYI: Best resolution for OCR is 300 dpi</li>
					</ol>
					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 7-->
				<section>
						<h2>Step One: Load Document</h2>
						<p>Generally, select "Image or PDF File to Other Formats"</p>
						<img src="imgs/OCR-Tutorial/tutorial-img-1.PNG" width="80%">

						<div class='footer'>
					</div>
				</section>
				<!-- SLIDE 8-->
				<section>
						<h2>Step Two: Initial Pass by ABBYY</h2>
						<p>Hands off! Let ABBYY try to recognize every page first</p>
						<img src="imgs/OCR-Tutorial/tutorial-img-2.PNG" width="80%">

						<div class='footer'>
					</div>
				</section>
				<!--SLIDE 9-->
				<section>
					<h2>Step Three: The Windows FineReader Dashboard</h2>

					<p>When asked if you want to save, select no...we need to refine the output considerably.</p>

					<img src="imgs/OCR-Tutorial/tutorial-img-3-1.PNG" width="80%">

					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 10-->
				<section>
					<h2>Text Area Types</h2>

					<ul>
					<li>Text Area (light green)</li>
					<li>Picture Area (red)</li>
					<li>Table Area (blue)</li>
					</ul>
					<div class='footer'>
					</div>
				</section>

				<!--SLIDE 11-->
				<section>
					<h2>Step Four: Best Workflow Order of Operations</h2>
					<ol>
					<li>Work page by page to first <strong>DELETE</strong> all unwanted text/picture/table areas.</li>
					<li><strong>ADJUST</strong> size of current areas to capture anything on the page, preserving the original text/picture/table box.
					<p>Why? Order matters! The output text will follow the order of text/picture/table boxes in the first window. Any boxes you add get tacked on to the end of the page. So preserve the naturally created order of boxes first, deleting and editing what is already there.</p></li>
					<li><strong>ADD</strong> text areas as needed, adjust order</li>
					</ol>
					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 12-->
				<section>
					<h2>Step Four: Best Workflow Order of Operations</h2>

					<p>4. Click "Read" on the top menu from time to time to generate a new output text. This allows ABBYY to continue to use its embedded tools to guess at words.</p>
					<p>5. Use the bottom detail window to adjust the location of rows and columns in tables. Select your table area using the selection pointer, then click on a table row/column line to delete, add, or move separators.</p>
					<p>6. Implement the pattern trainer by following the "Creating and Training a User Pattern" tutorial <a href="http://help.abbyy.com/FineReader/FineReader12/English/ImproveResults/OrnateType.htm">here</a>.</p>

					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 13-->
				<section>
					<h2>Step Four: Best Workflow Order of Operations</h2>

					<p>7. <strong>FINALLY</strong> you may want to walk through the right-hand text editor window and correct mistakes in blue. However, note that ABBYY marks in blue both things that are wrong and things that might be wrong. <strong>Don't waste time eliminating blue markup.</strong></p>

					<div class='footer'>
					</div>
				</section>
				<!--SLIDE 14-->
				<section>
					<h2>Step Five: Save and Export</h2>
					<ol>
					<li>Save the ABBYY project bundle. Go to FILE &gt;&gt; Save FineReader Document and save the project from time to time.</li>
					<li>When ready to export, hit the "Save" icon in the top menu bar and select an output format. Note also the dropdown options under the "Document Layout" section. Opt in or out to keep things like pictures, page numbers, headers, footers, line breaks, hyphens.</li>
					<li>Best options: RTF (if you want the exact layout on the page, especially line breaks) or TXT (if you just want text and line breaks).</li>

					</ol>
					<div class='footer'>
					</div>
				</section>
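				<section data-markdown>
					<script type="text/template">
### After Export: Rejoining Hyphenated Words

If you export TXT with line breaks and hyphens kept, words split across lines stay split in your corpus. A minimal sketch of one way to rejoin them afterwards; the function name and sample text are illustrative, and the rule is naive, so it will also merge genuine compounds that happen to break at a hyphen.

```python
import re


def join_hyphenated_lines(text):
    """Rejoin words split by a hyphen at the end of a line."""
    # A word character, a hyphen, a newline, then another word character.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


page = "The manu-\nscript was digi-\ntized in 2017."
print(join_hyphenated_lines(page))
```

Spot-check the result against the page images before running this over a whole corpus.
					</script>
				</section>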
				<section>
					<h3>Option Two (the free option): Tesseract</h3>
					<ul>
					<li>Find documentation <a href="https://github.com/tesseract-ocr/tesseract">here</a></li>
					<li>Install by first loading <a href="http://brew.sh">Homebrew</a> on Mac (if you don't have it already). From terminal, type: <code>/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"</code></li>
					<li>Then type: <code>brew install tesseract</code></li>
					<li>Works on .tiff (uncompressed) files. Use tools like <a href="https://www.imagemagick.org/script/index.php">ImageMagick</a> and <a href="https://www.ghostscript.com/">Ghostscript</a> to make conversions.</li>
					</ul><br/><br/>
					<img src="imgs/ocr.png" width="80%"/>
				</section>
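				<section data-markdown>
					<script type="text/template">
### Sketch: PDF to Uncompressed TIFF

One possible way to prepare Tesseract input is to have ImageMagick (which relies on Ghostscript for PDFs) rasterize each page at 300 dpi. This is a sketch, not a tested recipe for your files: `scan.pdf` and `page-%03d.tif` are hypothetical names, and on ImageMagick 7 the program is called `magick` rather than `convert`.

```python
import os
import shutil
import subprocess


def build_convert_command(pdf_path, tiff_pattern, dpi=300):
    """Build an ImageMagick command that rasterizes a PDF into uncompressed TIFFs."""
    return [
        "convert",
        "-density", str(dpi),   # set before the input so the PDF is rasterized at this resolution
        pdf_path,
        "-background", "white",
        "-alpha", "remove",     # flatten any transparency onto white
        "-alpha", "off",
        "-compress", "none",    # write uncompressed TIFF data
        tiff_pattern,           # a %d pattern writes one numbered file per page
    ]


command = build_convert_command("scan.pdf", "page-%03d.tif")
print(" ".join(command))

# Run only if ImageMagick is on the PATH and the hypothetical input file exists.
if shutil.which("convert") and os.path.exists("scan.pdf"):
    subprocess.run(command, check=True)
```
					</script>
				</section>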
				<section>
					<h3>Running Tesseract</h3>
					<ul>
					<li>Runs in the command line, but don't be intimidated...basic command is: <code>tesseract input-image-location output-text-location</code></li>
    					<li>To output to an html file with bounding boxes, use <code>tesseract input-image-location output-text-location hocr</code></li>
					<li>Batch OCR: <code>for item in *.tif; do tesseract "$item" "output_folder_name/$item"; done</code></li>
					<li>Training must also be done through command line, and that is a little harder. See tutorial at <a href="https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract">https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract</a>.</li>
					</ul>
				</section>

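				<section data-markdown>
					<script type="text/template">
### Sketch: Reading Bounding Boxes from hOCR

The `hocr` option writes HTML in which each recognized word is a `span` whose `title` attribute holds its bounding box. A minimal sketch for pulling words and boxes out with Python's standard library; the class name and the sample markup are illustrative, and real output may differ in detail between Tesseract versions.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            # The title looks like "bbox 36 92 96 116; x_wconf 95".
            for prop in (attrs.get("title") or "").split(";"):
                parts = prop.split()
                if parts and parts[0] == "bbox":
                    self._bbox = tuple(int(n) for n in parts[1:5])
                    self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


sample = (
    "<span class='ocr_line' title='bbox 36 92 500 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Happy</span> "
    "<span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>OCRing</span>"
    "</span>"
)
parser = HocrWords()
parser.feed(sample)
print(parser.words)
```
					</script>
				</section>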
				<!-- CONCLUSION SLIDE -->
				<section>
					<h2>Happy OCRing. Questions?</h2><br/>

					<p>Email us: <a href="mailto:vicky.steeves@nyu.edu">vicky.steeves@nyu.edu</a> &amp; <a href="mailto:nicholas.wolf@nyu.edu">nicholas.wolf@nyu.edu</a></p>
					<p>Learn more about RDM: <a href="http://guides.nyu.edu/data_management">guides.nyu.edu/data_management</a></p>

					<p>Get this presentation: <a href="http://guides.nyu.edu/data_management/resources">guides.nyu.edu/data_management/resources</a></p>
					<p>Make an appointment: <a href="http://guides.nyu.edu/appointment">guides.nyu.edu/appointment</a></p>

					<div class='footer'>
						<hr/>
						<p>Vicky's ORCID: <a href="http://orcid.org/0000-0003-4298-168X">0000-0003-4298-168X</a> | Nick's ORCID: <a href="http://orcid.org/0000-0001-5512-6151">0000-0001-5512-6151</a><br/>
						This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc/4.0/">Creative Commons Attribution-NonCommercial 4.0 International License</a>.</p>
					</div>

				</section>
			</div>
		</div>

		<script src="lib/js/head.min.js"></script>
		<script src="js/reveal.js"></script>

		<script>

			// More info https://github.com/hakimel/reveal.js#configuration
			Reveal.initialize({
				history: true,
							// More info https://github.com/hakimel/reveal.js#dependencies
				dependencies: [
					{ src: 'plugin/markdown/marked.js' },
					{ src: 'plugin/markdown/markdown.js' },
					{ src: 'plugin/notes/notes.js', async: true },
					{ src: 'plugin/highlight/highlight.js', async: true, callback: function() { hljs.initHighlightingOnLoad(); } }
				]
			});
		</script>
	</body>
</html>