<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>FineFoundry Documentation</title>
<meta name="description" content="Complete documentation for FineFoundry - installation, usage guides, training, inference, CLI tools, and deployment.">
<!-- Favicon -->
<link rel="icon" type="image/x-icon" href="favicon.ico">
<link rel="apple-touch-icon" sizes="180x180" href="img/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="192x192" href="img/icon-192.png">
<!-- Open Graph / Social Media -->
<meta property="og:type" content="website">
<meta property="og:url" content="https://finefoundry.dev/docs.html">
<meta property="og:title" content="FineFoundry Documentation">
<meta property="og:description" content="Complete documentation for FineFoundry - installation, usage guides, training, inference, CLI tools, and deployment.">
<meta property="og:image" content="https://finefoundry.dev/img/FineFoundry-logo.png">
<!-- Twitter Card -->
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:site" content="@SourceBoxLLC">
<meta name="twitter:creator" content="@SourceBoxLLC">
<meta name="twitter:title" content="FineFoundry Documentation">
<meta name="twitter:description" content="Complete documentation for FineFoundry - installation, usage guides, training, inference, CLI tools, and deployment.">
<meta name="twitter:image" content="https://finefoundry.dev/img/FineFoundry-logo.png">
<!-- Bootstrap CSS -->
<link href="https://cdn.jsdelivr.net/npm/bootstrap@5.3.2/dist/css/bootstrap.min.css" rel="stylesheet">
<!-- Bootstrap Icons -->
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap-icons@1.11.1/font/bootstrap-icons.css">
<!-- Custom CSS -->
<link rel="stylesheet" href="css/style.css?v=20251209-1">
<style>
.docs-hero {
padding: 100px 0 40px 0;
background: radial-gradient(circle at top left, rgba(255,193,7,0.25), transparent 55%),
radial-gradient(circle at bottom right, rgba(211,47,47,0.35), transparent 55%),
#000;
color: #fff;
}
.docs-hero h1 {
font-size: 2.5rem;
}
.docs-content {
padding-top: 30px;
padding-bottom: 60px;
}
.docs-sidebar-wrapper {
position: sticky;
top: 70px;
max-height: calc(100vh - 90px);
overflow-y: auto;
}
.docs-sidebar .nav-link {
color: var(--dark-text-muted);
font-size: 0.875rem;
padding: 0.5rem 0.75rem;
border-left: 3px solid transparent;
margin-bottom: 2px;
border-radius: 0;
}
.docs-sidebar .nav-link:hover {
color: var(--primary-color);
background-color: rgba(var(--bs-primary-rgb), 0.1);
border-left-color: rgba(var(--bs-primary-rgb), 0.5);
}
.docs-sidebar .nav-link.active {
color: var(--primary-color);
background-color: rgba(var(--bs-primary-rgb), 0.15);
border-left-color: var(--primary-color);
font-weight: 600;
}
.docs-sidebar .nav-link i {
width: 20px;
text-align: center;
}
.docs-sidebar .nav-section {
font-size: 0.7rem;
font-weight: 700;
text-transform: uppercase;
letter-spacing: 0.05em;
color: var(--dark-text-muted);
padding: 1rem 0.75rem 0.5rem;
margin-top: 0.5rem;
}
.docs-sidebar .nav-section:first-child {
margin-top: 0;
padding-top: 0;
}
.doc-section {
scroll-margin-top: 90px;
padding-bottom: 2rem;
border-bottom: 1px solid var(--dark-border);
margin-bottom: 2rem;
}
.doc-section:last-child {
border-bottom: none;
}
.doc-section h2 {
font-size: 1.75rem;
margin-bottom: 1rem;
}
.doc-section h3 {
font-size: 1.25rem;
margin-top: 1.5rem;
margin-bottom: 0.75rem;
color: var(--primary-color);
}
.doc-section h4 {
font-size: 1rem;
margin-top: 1.25rem;
margin-bottom: 0.5rem;
}
.doc-section pre {
margin: 1rem 0;
}
.doc-section ul, .doc-section ol {
margin-bottom: 1rem;
}
.doc-section li {
margin-bottom: 0.25rem;
}
.feature-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 1rem;
margin: 1rem 0;
}
.feature-box {
background: var(--dark-surface-alt);
border: 1px solid var(--dark-border);
border-radius: 8px;
padding: 1rem;
}
.feature-box strong {
display: block;
margin-bottom: 0.25rem;
}
.tip-box {
background: rgba(var(--bs-primary-rgb), 0.1);
border-left: 4px solid var(--primary-color);
padding: 1rem;
border-radius: 0 8px 8px 0;
margin: 1rem 0;
}
.tip-box strong {
color: var(--primary-color);
}
</style>
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-expand-lg navbar-dark bg-dark fixed-top">
<div class="container">
<a class="navbar-brand" href="index.html">
<i class="bi bi-gear-fill"></i> FineFoundry
</a>
<button class="navbar-toggler" type="button" data-bs-toggle="collapse" data-bs-target="#navbarNav">
<span class="navbar-toggler-icon"></span>
</button>
<div class="collapse navbar-collapse" id="navbarNav">
<ul class="navbar-nav ms-auto">
<li class="nav-item">
<a class="nav-link" href="index.html#home">Home</a>
</li>
<li class="nav-item">
<a class="nav-link" href="index.html#features">Features</a>
</li>
<li class="nav-item">
<a class="nav-link active" href="#">Documentation</a>
</li>
<li class="nav-item">
<a class="nav-link" href="https://github.com/SourceBox-LLC/FineFoundry" target="_blank">
<i class="bi bi-github me-1"></i>GitHub
</a>
</li>
</ul>
</div>
</div>
</nav>
<!-- Docs Hero -->
<section class="docs-hero">
<div class="container">
<div class="row align-items-center">
<div class="col-lg-8">
<span class="badge bg-warning text-dark mb-3"><i class="bi bi-journal-text me-1"></i> Official Documentation</span>
<h1 class="fw-bold mb-3">FineFoundry Documentation</h1>
<p class="lead mb-0">
Complete guides for installation, usage, training, inference, and deployment.
</p>
</div>
<div class="col-lg-4 text-lg-end d-none d-lg-block">
<i class="bi bi-journal-code display-1 text-warning opacity-50"></i>
</div>
</div>
</div>
</section>
<!-- Docs Content -->
<section class="docs-content">
<div class="container">
<div class="row">
<!-- Sidebar -->
<div class="col-lg-3 mb-4">
<div class="docs-sidebar-wrapper">
<div class="card border-0 shadow-sm">
<div class="card-body p-2">
<nav class="docs-sidebar nav flex-column">
<div class="nav-section">Getting Started</div>
<a class="nav-link active" href="#quick-start">
<i class="bi bi-rocket-takeoff-fill me-2"></i>Quick Start
</a>
<a class="nav-link" href="#installation">
<i class="bi bi-download me-2"></i>Installation
</a>
<div class="nav-section">User Guides</div>
<a class="nav-link" href="#scrape-tab">
<i class="bi bi-search me-2"></i>Data Sources Tab
</a>
<a class="nav-link" href="#synthetic-data">
<i class="bi bi-stars me-2"></i>Synthetic Data
</a>
<a class="nav-link" href="#build-publish">
<i class="bi bi-hammer me-2"></i>Publish
</a>
<a class="nav-link" href="#training-tab">
<i class="bi bi-gpu-card me-2"></i>Training Tab
</a>
<a class="nav-link" href="#inference-tab">
<i class="bi bi-chat-dots me-2"></i>Inference Tab
</a>
<a class="nav-link" href="#merge-tab">
<i class="bi bi-shuffle me-2"></i>Merge Datasets
</a>
<a class="nav-link" href="#analysis-tab">
<i class="bi bi-pie-chart me-2"></i>Dataset Analysis
</a>
<a class="nav-link" href="#settings-tab">
<i class="bi bi-gear me-2"></i>Settings
</a>
<div class="nav-section">CLI &amp; API</div>
<a class="nav-link" href="#cli-usage">
<i class="bi bi-terminal me-2"></i>CLI Tools
</a>
<a class="nav-link" href="#python-api">
<i class="bi bi-code-slash me-2"></i>Python API
</a>
<div class="nav-section">Deployment</div>
<a class="nav-link" href="#docker-deployment">
<i class="bi bi-pc-display me-2"></i>Local Training
</a>
<a class="nav-link" href="#runpod-setup">
<i class="bi bi-cloud me-2"></i>RunPod (Cloud)
</a>
<div class="nav-section">Resources</div>
<a class="nav-link" href="#troubleshooting">
<i class="bi bi-question-circle me-2"></i>Troubleshooting
</a>
<a class="nav-link" href="#upgrade-notes">
<i class="bi bi-arrow-up-circle me-2"></i>Upgrade Notes
</a>
</nav>
<hr class="my-2">
<div class="p-2">
<a href="https://github.com/SourceBox-LLC/FineFoundry/tree/master/docs" target="_blank" class="btn btn-outline-secondary btn-sm w-100">
<i class="bi bi-github me-2"></i>GitHub Docs
</a>
</div>
</div>
</div>
</div>
</div>
<!-- Main Content -->
<div class="col-lg-9">
<!-- Quick Start -->
<div id="quick-start" class="doc-section">
<h2><i class="bi bi-rocket-takeoff-fill text-primary me-2"></i>Quick Start Guide</h2>
<p>Get FineFoundry running in about 5 minutes. No complicated setup required!</p>
<h3>What You Need</h3>
<ul>
<li><strong>Python 3.10 or newer</strong> — <a href="https://www.python.org/downloads/" target="_blank">Download here</a> if you don't have it</li>
<li><strong>Windows, macOS, or Linux</strong> — All platforms supported</li>
<li><strong>Optional:</strong> NVIDIA GPU for training (8GB+ VRAM recommended)</li>
</ul>
<h3>The Easy Way (uv)</h3>
<pre class="bg-dark text-light p-3 rounded"><code># Clone the repository
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core
# Install uv if needed
pip install uv
# One-time (macOS/Linux): allow executing the launcher script
chmod +x run_finefoundry.sh
# Run the application (uv handles dependencies automatically)
./run_finefoundry.sh
# Alternative (without the launcher script)
uv run src/main.py</code></pre>
<h3>The Traditional Way (pip)</h3>
<pre class="bg-dark text-light p-3 rounded"><code># Clone and set up a virtual environment
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core
python -m venv venv
# Activate (macOS/Linux)
source venv/bin/activate
# Activate (Windows PowerShell)
./venv/Scripts/Activate.ps1
# Install and run
pip install -e .
python src/main.py</code></pre>
<h3>What You'll See</h3>
<p>FineFoundry opens as a desktop app with tabs for each step of your workflow:</p>
<ul>
<li><strong>Data Sources</strong> — Collect training data from websites or your own documents</li>
<li><strong>Publish</strong> — Organize and optionally share your datasets</li>
<li><strong>Training</strong> — Teach AI models using your data</li>
<li><strong>Inference</strong> — Chat with your trained models to test them</li>
<li><strong>Merge</strong> — Combine multiple datasets together</li>
<li><strong>Analysis</strong> — Check your data quality before training</li>
<li><strong>Settings</strong> — Set up accounts and preferences</li>
</ul>
<h3>Your First 5 Minutes</h3>
<ol>
<li>Go to <strong>Data Sources</strong> tab</li>
<li>Select a source (try Reddit or Stack Exchange)</li>
<li>Click <strong>Start</strong> and wait for data to collect</li>
<li>Click <strong>Preview</strong> to see what you got!</li>
</ol>
<p>That's it! You've just collected your first training dataset.</p>
</div>
<!-- Installation -->
<div id="installation" class="doc-section">
<h2><i class="bi bi-download text-primary me-2"></i>Installation</h2>
<p>Everything you need to get FineFoundry running on your machine.</p>
<h3>What You Need</h3>
<p>FineFoundry runs on <strong>Windows 10+</strong>, <strong>macOS 11+</strong>, and <strong>Linux</strong> (Ubuntu 20.04+). You'll need <strong>Python 3.10 or higher</strong> and at least <strong>8GB RAM</strong> (16GB+ recommended for training). For local GPU training, you'll want an NVIDIA card with CUDA support.</p>
<h3>Getting uv</h3>
<p>We recommend uv for managing dependencies—it's faster than pip and handles everything automatically:</p>
<pre class="bg-dark text-light p-3 rounded"><code>pip install uv
uv --version # verify it works</code></pre>
<h3>Running FineFoundry</h3>
<pre class="bg-dark text-light p-3 rounded"><code># Make sure Python 3.10+ is installed
python --version
# Run with uv (recommended)
uv run src/main.py</code></pre>
<div class="tip-box">
<strong>💡 Tip:</strong> If dependencies seem broken, delete the <code>.venv</code> folder and run <code>uv run src/main.py</code> again. uv will recreate everything fresh.
</div>
</div>
<!-- Data Sources Tab -->
<div id="scrape-tab" class="doc-section">
<h2><i class="bi bi-search text-primary me-2"></i>Data Sources Tab</h2>
<p>This is where you collect data to train your AI. You can grab conversations from websites like Reddit and 4chan, or create training data from your own documents.</p>
<img src="img/ff_scrape_tab.png" alt="Data Sources Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>Where You Can Get Data</h3>
<div class="feature-grid">
<div class="feature-box">
<strong>4chan</strong>
<p class="small text-muted mb-0">Collect conversations from boards. Good for casual, unfiltered dialogue.</p>
</div>
<div class="feature-box">
<strong>Reddit</strong>
<p class="small text-muted mb-0">Collect from subreddits or posts. Great for topic-specific data.</p>
</div>
<div class="feature-box">
<strong>Stack Exchange</strong>
<p class="small text-muted mb-0">Q&amp;A pairs from sites like Stack Overflow. Perfect for technical training.</p>
</div>
<div class="feature-box">
<strong><i class="bi bi-stars text-warning me-1"></i>Synthetic</strong>
<p class="small text-muted mb-0">Create training data from your own PDFs and documents.</p>
</div>
</div>
<h3>Key Settings</h3>
<ul>
<li><strong>Max Threads/Posts</strong> — How many pages to collect from</li>
<li><strong>Max Pairs</strong> — Stop after collecting this many conversation pairs</li>
<li><strong>Delay</strong> — Pause between requests (be polite to websites!)</li>
<li><strong>Min Length</strong> — Skip very short pairs (filters out junk)</li>
</ul>
<p><strong>Tip:</strong> Start with small numbers (50 threads, 500 pairs) to test, then scale up.</p>
<h3>Pairing Modes</h3>
<h4>Normal Mode</h4>
<p>Creates pairs from adjacent posts. Simple and fast, but loses conversational context.</p>
<h4>Contextual Mode</h4>
<p>Builds context from the conversation thread:</p>
<ul>
<li><strong>quote_chain</strong> – Follows reply chains via quote references</li>
<li><strong>cumulative</strong> – Accumulates all previous posts as context</li>
<li><strong>last_k</strong> – Uses the last K posts as context</li>
</ul>
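<p>The pairing modes above can be pictured with a small sketch. It is an illustration under stated assumptions, not FineFoundry's actual implementation: <code>build_pairs</code> and its arguments are hypothetical names, posts are plain strings in thread order, and <code>quote_chain</code> is left out because it depends on site-specific quote references.</p>
<pre class="bg-dark text-light p-3 rounded"><code>def build_pairs(posts, mode="normal", k=2):
    """Turn an ordered list of posts into input/output pairs.

    Hypothetical helper: each post after the first becomes an output,
    and the mode decides how much earlier text forms the input.
    """
    pairs = []
    for i in range(1, len(posts)):
        if mode == "normal":
            context = posts[i - 1:i]          # only the adjacent post
        elif mode == "cumulative":
            context = posts[:i]               # every previous post
        elif mode == "last_k":
            context = posts[max(0, i - k):i]  # at most the last k posts
        else:
            raise ValueError(f"unknown mode: {mode}")
        pairs.append({"input": "\n".join(context), "output": posts[i]})
    return pairs


thread = ["A", "B", "C", "D"]
print(build_pairs(thread, mode="last_k", k=2)[-1])</code></pre>
<p>In this sketch, normal mode keeps each input short, while the contextual modes trade longer inputs for more conversational history.</p>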
<h3>Output Format</h3>
<pre class="bg-dark text-light p-3 rounded"><code>[
{"input": "What do you think about...", "output": "I believe that..."},
{"input": "Can you explain...", "output": "Sure, here's how..."}
]</code></pre>
<div class="tip-box">
<strong>💡 Best Practice:</strong> Start with smaller runs (50 threads, 500 pairs) to validate your configuration before scaling up.
</div>
<div class="alert alert-warning">
<strong>Offline Mode:</strong> Only the <strong>Synthetic</strong> data source can be used. Network sources (4chan/Reddit/Stack Exchange) are disabled.
</div>
<h3 id="synthetic-data">Synthetic Data Generation</h3>
<p>Generate training data from your own documents using local LLMs powered by Unsloth's SyntheticDataKit.</p>
<h4>Supported Input Formats</h4>
<ul>
<li>PDF documents</li>
<li>DOCX (Word documents)</li>
<li>PPTX (PowerPoint)</li>
<li>HTML/HTM web pages</li>
<li>TXT plain text</li>
<li>URLs (fetched and parsed)</li>
</ul>
<h4>Generation Types</h4>
<ul>
<li><strong>qa</strong> – Question-answer pairs from document content</li>
<li><strong>cot</strong> – Chain-of-thought reasoning examples</li>
<li><strong>summary</strong> – Document summaries</li>
</ul>
<h4>Synthetic Parameters</h4>
<ul>
<li><strong>Model</strong> – Local LLM to use (default: <code>unsloth/Llama-3.2-3B-Instruct</code>)</li>
<li><strong>Generation Type</strong> – qa, cot, or summary</li>
<li><strong>Num Pairs</strong> – Target examples per chunk</li>
<li><strong>Max Chunks</strong> – Maximum document chunks to process</li>
<li><strong>Curate</strong> – Enable quality filtering with threshold</li>
</ul>
<div class="tip-box">
<strong>💡 Note:</strong> First run takes 30-60 seconds for model loading. A snackbar notification appears immediately when you click Start. Subsequent runs are faster.
</div>
</div>
<!-- Publish -->
<div id="build-publish" class="doc-section">
<h2><i class="bi bi-hammer text-primary me-2"></i>Publish Tab</h2>
<p>This is where you prepare your data for training and optionally share it on Hugging Face (a popular AI community site).</p>
<img src="img/ff_buld_publish.png" alt="Publish Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>Building Your Dataset</h3>
<ol>
<li>Select your collected data from the dropdown</li>
<li>Set the split (how much for training vs. validation)</li>
<li>Click <strong>Build Dataset</strong></li>
</ol>
<p>That's it! Your data is now ready for training.</p>
<h3>Sharing (Optional)</h3>
<p>Want others to use your dataset or trained model? You can upload to Hugging Face:</p>
<ol>
<li>Enable <strong>Push to Hub</strong></li>
<li>Enter your repo name (like <code>username/my-dataset</code>)</li>
<li>Click <strong>Push + Upload README</strong></li>
</ol>
<p>FineFoundry automatically creates a nice description page for your work!</p>
<div class="alert alert-warning">
<strong>Offline Mode:</strong> Hub push is disabled when offline, but you can still build datasets locally.
</div>
</div>
<!-- Training Tab -->
<div id="training-tab" class="doc-section">
<h2><i class="bi bi-gpu-card text-primary me-2"></i>Training Tab</h2>
<p>This is where you teach AI models using your collected data. Pick a dataset, choose your settings, and train—either on cloud GPUs via RunPod or directly on your own computer.</p>
<img src="img/ff_training.png" alt="Training Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>Where to Train</h3>
<div class="feature-grid">
<div class="feature-box">
<strong><i class="bi bi-pc-display me-2"></i>Your Computer</strong>
<p class="small text-muted mb-0">Train on your own GPU—free, private, no internet needed</p>
</div>
<div class="feature-box">
<strong><i class="bi bi-cloud me-2"></i>RunPod (Cloud)</strong>
<p class="small text-muted mb-0">Rent powerful GPUs for faster training—pay only for what you use</p>
</div>
</div>
<h3>How Training Works</h3>
<p>FineFoundry fine-tunes with LoRA, a technique that trains a small set of adapter weights on top of a frozen base model, so you can customize large language models without expensive hardware. Local training runs natively on your machine, while RunPod training runs in cloud containers on rented GPUs.</p>
<h3>Skill Levels</h3>
<h4>Beginner Mode</h4>
<p>Simplifies choices with safe presets:</p>
<ul>
<li><strong>Fastest (RunPod)</strong> – Higher throughput on stronger GPUs</li>
<li><strong>Cheapest (RunPod)</strong> – Conservative params for smaller GPUs</li>
<li><strong>Quick local test</strong> – Short run for sanity checks</li>
<li><strong>Auto Set (local)</strong> – Detects GPU VRAM and pushes throughput as high as it can while still aiming to avoid out-of-memory (OOM) errors</li>
<li><strong>Simple custom</strong> – Guided controls for duration, memory/stability, and speed vs quality</li>
</ul>
<img src="img/ff_auto_set.png" alt="Auto Set (local) preset" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h4>Expert Mode</h4>
<p>Full control over all hyperparameters for experienced users.</p>
<h3>Hyperparameters</h3>
<ul>
<li><strong>Base model</strong> – Default: <code>unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit</code></li>
<li><strong>Epochs</strong> – Number of training epochs</li>
<li><strong>Learning rate</strong> – Step size for optimization</li>
<li><strong>Batch size</strong> – Samples per device per step</li>
<li><strong>Gradient accumulation</strong> – Number of batches whose gradients are accumulated before each weight update</li>
<li><strong>Max steps</strong> – Upper bound on training steps</li>
<li><strong>Packing</strong> – Pack multiple short examples for throughput</li>
<li><strong>Auto-resume</strong> – Continue from latest checkpoint</li>
</ul>
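<p>Batch size, gradient accumulation, and max steps interact, and a little arithmetic makes the relationship concrete. The sketch below uses hypothetical helper names and assumes the usual convention that the effective batch is batch size times accumulation steps times device count. It ignores packing, which changes how many examples fit into a step.</p>
<pre class="bg-dark text-light p-3 rounded"><code>import math


def effective_batch_size(batch_size, grad_accum, num_devices=1):
    """Samples contributing to a single weight update."""
    return batch_size * grad_accum * num_devices


def planned_steps(num_examples, epochs, batch_size, grad_accum, max_steps=None):
    """Optimizer steps for a run, capped by max_steps when it is set."""
    per_epoch = math.ceil(num_examples / effective_batch_size(batch_size, grad_accum))
    total = per_epoch * epochs
    return total if max_steps is None else min(total, max_steps)


# 1,000 examples, batch size 2, accumulation 4: 8 samples per update
print(effective_batch_size(2, 4))
print(planned_steps(1000, epochs=3, batch_size=2, grad_accum=4, max_steps=200))</code></pre>
<p>Raising gradient accumulation is a common way to keep the effective batch large when VRAM only allows a small per-device batch.</p>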
<h3>Quick Local Inference</h3>
<p>After a successful local run, the Quick Local Inference panel appears:</p>
<ul>
<li><strong>Sample prompts</strong> – 5 random prompts from your training dataset are auto-loaded for quick testing</li>
<li>Enter a prompt manually or select a sample from the dropdown</li>
<li>Choose presets: Deterministic, Balanced, or Creative</li>
<li>Adjust temperature and max tokens with sliders</li>
<li>View prompt/response history</li>
<li><strong>Export chats</strong> – Save your prompt/response history to a text file</li>
</ul>
<p class="text-muted">The sample prompts feature lets you quickly verify your model learned from the training data without manually copying prompts.</p>
<h3>Saving Configurations</h3>
<p>Training configs are stored in the SQLite database:</p>
<ul>
<li>Click <strong>Save current setup</strong> to snapshot your configuration</li>
<li>Use the dropdown to load saved configs</li>
<li>The last used config auto-loads on startup</li>
<li>All configs persist across sessions in <code>finefoundry.db</code></li>
</ul>
<div class="alert alert-warning">
<strong>Offline Mode:</strong> RunPod training and Hugging Face Hub actions are disabled; local-only workflows are enforced.
</div>
</div>
<!-- Inference Tab -->
<div id="inference-tab" class="doc-section">
<h2><i class="bi bi-chat-dots text-primary me-2"></i>Inference Tab</h2>
<p>This is where you chat with your trained AI model! Think of it as a testing playground where you can ask questions and see how your model responds.</p>
<img src="img/ff_inferance.png" alt="Inference Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>How to Test Your Model</h3>
<ol>
<li>Select a <strong>completed training run</strong> from the dropdown</li>
<li>Wait for the green "ready" status</li>
<li>Type a message and click <strong>Generate</strong></li>
<li>See how your AI responds!</li>
</ol>
<p>Try different questions to see how well your model learned. For extended conversations, click <strong>Full Chat View</strong> to open a dedicated chat window.</p>
<h3>Sample Prompts</h3>
<p>The Inference tab lets you select <strong>any saved dataset</strong> to sample prompts from:</p>
<ol>
<li>Select a dataset from the <strong>Dataset for sample prompts</strong> dropdown</li>
<li>5 random prompts are loaded into the <strong>Sample prompts</strong> dropdown</li>
<li>Click the refresh button to get new random samples</li>
<li>Select a sample to automatically fill the prompt text area</li>
</ol>
<p class="text-muted">Unlike Quick Local Inference (which only uses the training dataset), the Inference tab can test against any dataset in your database.</p>
<h3>Adapter Validation</h3>
<p>When you select a training run, FineFoundry:</p>
<ol>
<li>Shows a loading spinner while checking the folder</li>
<li>Verifies the directory contains LoRA artifacts (<code>adapter_config.json</code>, weight files)</li>
<li>If valid: unlocks the Prompt &amp; Responses section</li>
<li>If invalid: shows an error and locks the controls</li>
</ol>
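<p>The check described above can be approximated with a few lines of standard-library Python. This is a sketch, not FineFoundry's own validator: <code>looks_like_lora_adapter</code> is a hypothetical name, and the weight filenames are the ones PEFT conventionally writes (<code>adapter_model.safetensors</code> or <code>adapter_model.bin</code>). Which files FineFoundry actually looks for is an assumption here.</p>
<pre class="bg-dark text-light p-3 rounded"><code>from pathlib import Path

# Filenames PEFT conventionally writes when saving a LoRA adapter.
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")


def looks_like_lora_adapter(folder):
    """Return True if the folder has an adapter config and a weight file."""
    path = Path(folder)
    if not path.is_dir():
        return False
    has_config = (path / "adapter_config.json").is_file()
    has_weights = any((path / name).is_file() for name in WEIGHT_FILES)
    return has_config and has_weights</code></pre>
<p>Running a check like this on a training output folder can help explain why the Inference tab reports a run as invalid.</p>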
<h3>Generation Controls</h3>
<ul>
<li><strong>Preset dropdown</strong> – Quick settings for different use cases</li>
<li><strong>Temperature slider</strong> – Controls randomness (0.0 = deterministic)</li>
<li><strong>Max new tokens slider</strong> – Upper bound on generated tokens</li>
</ul>
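<p>To see why temperature controls randomness, it helps to look at how it rescales token scores before sampling. The sketch below is a generic illustration with a hypothetical function name, not FineFoundry code. Treating 0.0 as "always pick the top token" is an assumption about how the slider's deterministic end behaves.</p>
<pre class="bg-dark text-light p-3 rounded"><code>import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature == 0:
        # Greedy decoding: all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.0]
for t in (0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(scores, t)])</code></pre>
<p>Lower temperatures concentrate probability on the top-scoring token, while higher temperatures spread it across alternatives, which is why higher values tend to give more varied output.</p>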
<h3>Full Chat View</h3>
<p>Click <strong>Full Chat View</strong> to open a focused chat dialog:</p>
<img src="img/ff_inferance_full_chat_view.png" alt="Full Chat View Screenshot" class="img-fluid rounded shadow-sm my-3" style="max-width: 500px;">
<ul>
<li>Large chat area with user/assistant bubbles</li>
<li>Multiline message composer</li>
<li>Shared conversation history with main view</li>
<li>Proper chat templates for multi-turn conversations</li>
<li>Clear history and close buttons</li>
</ul>
<h3>Under the Hood</h3>
<p>Powered by the same stack as training:</p>
<ul>
<li><strong>Transformers</strong> – <code>AutoModelForCausalLM</code>, <code>AutoTokenizer</code></li>
<li><strong>PEFT</strong> – <code>PeftModel</code> for adapter loading</li>
<li><strong>bitsandbytes</strong> – 4-bit quantization on CUDA</li>
<li><strong>Chat Templates</strong> – Proper formatting for instruct models (Llama-3.1, etc.)</li>
<li><strong>Repetition Penalty</strong> – Prevents degenerate/looping outputs</li>
<li><strong>100% local</strong> – No external API calls</li>
</ul>
</div>
<!-- Merge Tab -->
<div id="merge-tab" class="doc-section">
<h2><i class="bi bi-shuffle text-primary me-2"></i>Merge Datasets Tab</h2>
<p>This is where you combine data from different sources into one big dataset. Great for when you've collected from multiple places and want to train on everything together.</p>
<img src="img/ff_merge.png" alt="Merge Datasets Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>Use Cases</h3>
<ul>
<li>Combining data from multiple scraping sessions</li>
<li>Merging database sessions with Hugging Face datasets (when online)</li>
<li>Creating larger, more diverse training datasets</li>
</ul>
<h3>Operations</h3>
<ul>
<li><strong>Concatenate</strong> – Stack all datasets sequentially</li>
<li><strong>Interleave</strong> – Alternate records for better distribution</li>
</ul>
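<p>The difference between the two operations is easiest to see on tiny lists. The following sketch uses hypothetical function names and is not FineFoundry's implementation. In particular, continuing until every dataset is exhausted is an assumption, since the text above does not say how datasets of unequal length are handled.</p>
<pre class="bg-dark text-light p-3 rounded"><code>from itertools import chain, zip_longest

_MISSING = object()  # sentinel marking exhausted datasets


def concatenate(*datasets):
    """Stack datasets one after another."""
    return list(chain.from_iterable(datasets))


def interleave(*datasets):
    """Take one record from each dataset in turn until all are exhausted."""
    merged = []
    for group in zip_longest(*datasets, fillvalue=_MISSING):
        merged.extend(item for item in group if item is not _MISSING)
    return merged


reddit = ["r1", "r2", "r3"]
stack = ["s1"]
print(concatenate(reddit, stack))
print(interleave(reddit, stack))</code></pre>
<p>Interleaving mixes sources from the first record onward, which matters if a training run stops before it has seen the whole dataset.</p>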
<h3>Supported Sources</h3>
<ul>
<li><strong>Database Session</strong> – Load from your scrape history</li>
<li><strong>Hugging Face</strong> – Load from Hub with repo, split, and config (when online)</li>
</ul>
<div class="alert alert-warning">
<strong>Offline Mode:</strong> Hugging Face dataset sources are disabled; database sessions remain available.
</div>
<h3>Column Mapping</h3>
<p>FineFoundry automatically handles column mapping:</p>
<ul>
<li>Auto-detects common patterns: <code>input/output</code>, <code>prompt/response</code>, <code>question/answer</code></li>
<li>Normalizes all datasets to <code>input/output</code> format</li>
<li>Filters rows with empty input or output</li>
</ul>
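<p>As a rough picture of the normalization step, the sketch below maps the three column patterns listed above onto <code>input/output</code> and drops rows with an empty side. The function name, the order in which patterns are tried, and the choice to skip rows that match no pattern are all assumptions made for illustration.</p>
<pre class="bg-dark text-light p-3 rounded"><code># Column-name pairs listed above, in the order they are tried.
KNOWN_PAIRS = [("input", "output"), ("prompt", "response"), ("question", "answer")]


def normalize_rows(rows):
    """Map rows to input/output keys and drop rows with an empty side."""
    normalized = []
    for row in rows:
        for src_in, src_out in KNOWN_PAIRS:
            if src_in in row and src_out in row:
                text_in = str(row[src_in] or "").strip()
                text_out = str(row[src_out] or "").strip()
                if text_in and text_out:
                    normalized.append({"input": text_in, "output": text_out})
                break  # first matching pattern wins
    return normalized


rows = [
    {"prompt": "Hi", "response": "Hello"},
    {"question": "Why?", "answer": ""},  # dropped: empty output
]
print(normalize_rows(rows))</code></pre>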
<h3>Output</h3>
<ul>
<li><strong>Database</strong> – Merged data saved to a new database session</li>
<li><strong>Database + Export JSON</strong> – Also export to JSON for external tools</li>
</ul>
<h3>Download Merged Dataset</h3>
<p>If you enabled JSON export, click <strong>Download Merged Dataset</strong> to copy the result to another location.</p>
</div>
<!-- Analysis Tab -->
<div id="analysis-tab" class="doc-section">
<h2><i class="bi bi-pie-chart text-primary me-2"></i>Dataset Analysis Tab</h2>
<p>This is where you check your data quality before training. Think of it as a health checkup for your dataset—it can spot problems before they waste hours of training time.</p>
<img src="img/ff_dataset_analysis.png" alt="Dataset Analysis Tab Screenshot" class="img-fluid rounded shadow-sm my-4" style="max-width: 100%;">
<h3>Analysis Modules</h3>
<div class="row g-3">
<div class="col-md-6">
<ul>
<li><strong>Basic Stats</strong> – Record counts, mean lengths</li>
<li><strong>Duplicates &amp; Similarity</strong> – Approximate duplicate rate</li>
<li><strong>Sentiment</strong> – Polarity distribution</li>
<li><strong>Class Balance</strong> – Short/medium/long buckets</li>
</ul>
</div>
<div class="col-md-6">
<ul>
<li><strong>Data Leakage</strong> – Train/test overlap detection</li>
<li><strong>Toxicity</strong> – Harmful content detection</li>
<li><strong>Readability</strong> – Text complexity metrics</li>
<li><strong>Topics</strong> – Topic distribution analysis</li>
</ul>
</div>
</div>
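<p>For a feel of what the simplest modules report, here is a minimal sketch of basic stats and a duplicate rate over <code>input/output</code> pairs. The helper names are hypothetical, and the duplicate check is a simpler stand-in: it only matches records that are identical after lowercasing and whitespace normalization, whereas the tab describes an approximate similarity measure.</p>
<pre class="bg-dark text-light p-3 rounded"><code>from collections import Counter


def basic_stats(pairs):
    """Record count and mean character lengths of inputs and outputs."""
    n = len(pairs)
    if n == 0:
        return {"records": 0, "mean_input_len": 0.0, "mean_output_len": 0.0}
    return {
        "records": n,
        "mean_input_len": sum(len(p["input"]) for p in pairs) / n,
        "mean_output_len": sum(len(p["output"]) for p in pairs) / n,
    }


def duplicate_rate(pairs):
    """Share of records whose normalized text repeats an earlier record."""
    if not pairs:
        return 0.0
    keys = [
        (" ".join(p["input"].lower().split()), " ".join(p["output"].lower().split()))
        for p in pairs
    ]
    counts = Counter(keys)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(pairs)</code></pre>
<p>A high duplicate rate usually means the same threads were collected more than once or merged datasets overlap.</p>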
<h3>Workflow</h3>
<ol>
<li>Select dataset source (Database Session, or Hugging Face when online)</li>
<li>Enable the analysis modules you need</li>
<li>Click <strong>Analyze Dataset</strong></li>
<li>Review summary stats and visualizations</li>
</ol>
<div class="alert alert-warning">
<strong>Offline Mode:</strong> Hugging Face dataset source and Hugging Face inference backend options are disabled.
</div>
<div class="tip-box">
<strong>💡 Best Practice:</strong> Run analysis before committing to long training runs. Use Duplicates &amp; Similarity to spot unintentional dataset duplication.
</div>
</div>
<!-- Settings Tab -->
<div id="settings-tab" class="doc-section">
<h2><i class="bi bi-gear text-primary me-2"></i>Settings Tab</h2>
<p>This is where you set up your accounts and preferences. You only need to do this once—your settings are saved automatically.</p>
<h3>What You Can Set Up</h3>
<ul>
<li><strong>Hugging Face</strong> — For sharing datasets and models online</li>
<li><strong>RunPod</strong> — For training on cloud GPUs</li>
<li><strong>Proxy</strong> — For routing traffic through Tor or VPN (optional)</li>
<li><strong>Ollama</strong> — For auto-generating descriptions (optional)</li>
</ul>
<h3>Quick Setup</h3>
<ol>
<li>Get your token/API key from the respective service</li>
<li>Paste it in the appropriate field</li>
<li>Click <strong>Test</strong> to verify it works</li>
<li>Click <strong>Save</strong></li>
</ol>
<h3>System Check</h3>
<p>At the bottom, there's a <strong>System Check</strong> panel that runs tests to make sure everything is working. Use it after installation or if something seems broken.</p>
<div class="alert alert-info">
<i class="bi bi-info-circle-fill me-2"></i>
All settings are stored locally in <code>finefoundry.db</code> and never sent to external servers.
</div>
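<p>If you want to see what is stored, you can inspect the file from Python. This is a minimal sketch that assumes <code>finefoundry.db</code> is a SQLite database (the format is not stated here) and that it sits in the current directory. <code>list_tables</code> is a hypothetical helper, not part of FineFoundry.</p>

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the table names in a SQLite file, opened read-only."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Assumed location: adjust the path to wherever your finefoundry.db lives.
if Path("finefoundry.db").exists():
    print(list_tables("finefoundry.db"))
```

<p>Opening the file with <code>mode=ro</code> means the sketch cannot modify your settings by accident.</p>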
</div>
<!-- CLI Usage -->
<div id="cli-usage" class="doc-section">
<h2><i class="bi bi-terminal text-primary me-2"></i>CLI Tools</h2>
<p>Everything in the GUI also works from the command line. Use CLI tools when you want to automate workflows, run scheduled jobs, or integrate with CI pipelines.</p>
<h3>Building Datasets</h3>
<p>The <code>src/save_dataset.py</code> script turns JSON pairs into a proper Hugging Face dataset:</p>
<pre class="bg-dark text-light p-3 rounded"><code># Configure constants in the file header, then run:
uv run src/save_dataset.py</code></pre>
<p>Configuration options in the file:</p>
<pre class="bg-dark text-light p-3 rounded"><code>DATA_FILE = "scraped_training_data.json"
SAVE_DIR = "hf_dataset"
SEED = 42
SHUFFLE = True
VAL_SIZE = 0.01
TEST_SIZE = 0.0
MIN_LEN = 1
PUSH_TO_HUB = True
REPO_ID = "username/my-dataset"
PRIVATE = True
HF_TOKEN = None # uses env HF_TOKEN if None</code></pre>
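<p>To show how <code>SEED</code>, <code>SHUFFLE</code>, <code>VAL_SIZE</code>, and <code>TEST_SIZE</code> interact, here is a plain-Python sketch of a fractional split. <code>split_records</code> is a hypothetical helper written for this example. The real script builds a Hugging Face dataset and may round split sizes differently.</p>

```python
import random


def split_records(records, val_size=0.01, test_size=0.0, seed=42, shuffle=True):
    """Split records into train/validation/test lists by fraction."""
    items = list(records)
    if shuffle:
        # A seeded generator makes the shuffle reproducible across runs.
        random.Random(seed).shuffle(items)
    n_val = round(len(items) * val_size)
    n_test = round(len(items) * test_size)
    n_train = len(items) - n_val - n_test
    return {
        "train": items[:n_train],
        "validation": items[n_train:n_train + n_val],
        "test": items[n_train + n_val:],
    }


splits = split_records(range(1000))
sizes = {name: len(rows) for name, rows in splits.items()}
print(sizes)
```

<p>With the defaults above, 1,000 records yield 990 training and 10 validation records, and reusing the same seed reproduces the same split.</p>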
<h3>Reddit Scraper CLI</h3>
<pre class="bg-dark text-light p-3 rounded"><code>uv run src/scrapers/reddit_scraper.py \
  --url https://www.reddit.com/r/AskReddit/ \
  --max-posts 50 \
  --mode contextual \
  --k 4 \
  --max-input-chars 2000 \
  --pairs-path reddit_pairs.json \
  --cleanup</code></pre>
<h4>Important Options</h4>
<ul>
<li><code>--url</code> – Subreddit or post URL to crawl</li>
<li><code>--max-posts</code> – Maximum posts to process</li>
<li><code>--mode</code> – <code>parent_child</code> or <code>contextual</code></li>
<li><code>--k</code> – Context depth for contextual mode</li>
<li><code>--pairs-path</code> – Output path for pairs JSON</li>
<li><code>--cleanup</code> – Delete dump folder after copying pairs</li>
</ul>
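<p>After a scrape, it can help to sanity-check the file written to <code>--pairs-path</code> before building a dataset. The sketch below assumes the file holds a JSON list of <code>{"input": ..., "output": ...}</code> objects, the same shape the Python API returns. <code>load_pairs</code> is a hypothetical helper, not part of the scraper.</p>

```python
import json


def load_pairs(path: str) -> list[dict]:
    """Load a pairs file and keep only records with non-empty string fields."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of pair objects")
    valid = []
    for record in data:
        if not isinstance(record, dict):
            continue  # skip anything that is not an object
        source, target = record.get("input"), record.get("output")
        if (
            isinstance(source, str)
            and isinstance(target, str)
            and source.strip()
            and target.strip()
        ):
            valid.append({"input": source, "output": target})
    return valid
```

<p>For example, <code>load_pairs("reddit_pairs.json")</code> returns only the usable records, so comparing its length with the raw record count shows how many entries were malformed or empty.</p>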
<h3>CLI vs GUI</h3>
<p>Use the GUI when you want to explore interactively and see visual feedback. Use CLI for automation—cron jobs, CI pipelines, batch processing, or reproducing exact configurations across machines.</p>
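<p>For automation, one option is to drive the CLI from a small Python wrapper. This is a sketch under stated assumptions: it uses only the flags documented above, expects <code>uv</code> on your <code>PATH</code>, and must be run from the repository root. <code>build_scraper_command</code> and <code>run_scraper</code> are hypothetical names.</p>

```python
import subprocess


def build_scraper_command(url, max_posts=50, mode="contextual", k=4,
                          pairs_path="reddit_pairs.json", cleanup=True):
    """Build the argument list for the Reddit scraper CLI."""
    command = [
        "uv", "run", "src/scrapers/reddit_scraper.py",
        "--url", url,
        "--max-posts", str(max_posts),
        "--mode", mode,
        "--k", str(k),
        "--pairs-path", pairs_path,
    ]
    if cleanup:
        command.append("--cleanup")
    return command


def run_scraper(url, **options):
    """Run the scraper and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_scraper_command(url, **options), check=True)


command = build_scraper_command("https://www.reddit.com/r/AskReddit/", max_posts=10)
print(" ".join(command))
```

<p>Passing the arguments as a list avoids shell quoting problems, and <code>check=True</code> makes a failed scrape stop a cron job or CI step instead of passing silently.</p>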
</div>
<!-- Python API -->
<div id="python-api" class="doc-section">
<h2><i class="bi bi-code-slash text-primary me-2"></i>Python API</h2>
<p>Import FineFoundry's modules directly into your own Python scripts for custom workflows.</p>
<h3>4chan Scraper</h3>
<pre class="bg-dark text-light p-3 rounded"><code>import sys

# Make the FineFoundry source tree importable (run from the repository root)
sys.path.append("src")

from scrapers.fourchan_scraper import scrape

pairs = scrape(
    board="pol",
    max_threads=150,
    max_pairs=5000,
    mode="contextual",
    strategy="cumulative",
)
# pairs is a list of {"input": ..., "output": ...} dicts</code></pre>
<h3>Dataset Builder</h3>
<pre class="bg-dark text-light p-3 rounded"><code>import sys
sys.path.append("src")
from db.scraped_data import get_pairs_for_session
from save_dataset import build_dataset_dict, normalize_records
# Load pairs from a database scrape session
pairs = get_pairs_for_session(session_id=1)
examples = normalize_records(pairs, min_len=1)
# Build a DatasetDict with train/validation/test splits
dd = build_dataset_dict(examples, val_size=0.05, test_size=0.0)</code></pre>
<h3>Local Inference</h3>
<pre class="bg-dark text-light p-3 rounded"><code>import sys
sys.path.append("src")
from helpers.local_inference import generate_text

response = generate_text(
    base_model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    adapter_path="/path/to/adapter",
    prompt="What is machine learning?",
    temperature=0.7,
    max_new_tokens=256,
)</code></pre>
</div>
<!-- Local Training -->
<div id="docker-deployment" class="doc-section">
<h2><i class="bi bi-pc-display text-primary me-2"></i>Local Training</h2>
<p>Train AI models directly on your own computer—no cloud services, no extra costs, complete privacy.</p>
<h3>What You Need</h3>
<ul>
<li><strong>NVIDIA GPU</strong> with at least 8 GB VRAM (12 GB or more recommended)</li>
<li><strong>CUDA drivers</strong> installed on your system</li>
<li><strong>Python packages</strong> are installed automatically by FineFoundry</li>
</ul>
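<p>If you are unsure how much VRAM your card has, a short script can ask <code>nvidia-smi</code>, which ships with the NVIDIA driver. This is a sketch, not something FineFoundry runs for you. The helper names are hypothetical, and the threshold simply restates the 8 GB minimum above in MiB.</p>

```python
import subprocess

MIN_VRAM_MIB = 8 * 1024  # the 8 GB minimum listed above, expressed in MiB


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (in MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for each GPU's total memory; requires the NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram_mib(result.stdout)


def meets_minimum(vram_mib: list[int], min_mib: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU has the minimum VRAM."""
    return any(mib >= min_mib for mib in vram_mib)
```

<p>Call <code>meets_minimum(query_vram_mib())</code> on the training machine. Some cards marketed as 8 GB may report slightly less than 8192 MiB, so pass a lower <code>min_mib</code> if needed.</p>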
<h3>How It Works</h3>
<p>When you start local training, FineFoundry runs the Unsloth trainer as a native Python process on your machine. No Docker or containers needed—everything runs directly on your GPU.</p>
<h3>Getting Started</h3>
<ol>
<li>Go to the <strong>Training</strong> tab</li>
<li>Select <strong>Local</strong> as your training target</li>
<li>Choose a preset (try "Quick local test" first)</li>
<li>Click <strong>Start Training</strong></li>
</ol>
<h3>Tips for Success</h3>
<ul>
<li><strong>Start small</strong> — Use "Quick local test" preset first to verify everything works</li>
<li><strong>Watch memory</strong> — If you get "out of memory" errors, reduce batch size</li>
<li><strong>Close other apps</strong> — Games and browsers use GPU memory too</li>
<li><strong>Check progress</strong> — Training logs show real-time updates</li>
</ul>
<div class="tip-box">
<strong>💡 No GPU?</strong> You can still use FineFoundry for data collection and analysis. For training, use RunPod cloud GPUs instead.
</div>
</div>
<!-- RunPod Setup -->
<div id="runpod-setup" class="doc-section">
<h2><i class="bi bi-cloud text-primary me-2"></i>RunPod Setup</h2>
<p>Run training jobs on remote GPUs using RunPod.</p>
<h3>How It Works</h3>
<p>When you select <strong>RunPod – Pod</strong> as the training target:</p>
<ol>
<li>FineFoundry connects using your RunPod API key</li>
<li>Ensures a Network Volume exists (mounted at <code>/data</code>)</li>
<li>Ensures a Pod Template exists for your hardware</li>
<li>Launches pods to run training jobs</li>
<li>Writes outputs to <code>/data/outputs/...</code> on the network volume</li>
</ol>
<h3>Prerequisites</h3>
<ul>
<li>RunPod account with billing/credits</li>
<li>RunPod API key (configure in Settings tab)</li>
<li>Available GPU type in your desired region</li>
</ul>
<h3>Step 1: Configure API Key</h3>
<ol>
<li>Open the <strong>Settings</strong> tab</li>
<li>Paste your API key in <strong>RunPod Settings</strong></li>
<li>Click <strong>Test</strong> to verify, then <strong>Save</strong></li>
</ol>
<h3>Step 2: Create Network Volume</h3>
<p>In the RunPod console:</p>
<ol>
<li>Create a Network Volume (size depends on your needs)</li>
<li>Note the volume identifier</li>
<li>In FineFoundry, use <strong>Ensure Infrastructure</strong> to confirm that the volume is detected</li>
</ol>
<h3>Step 3: Create Pod Template</h3>
<p>Create a template that:</p>
<ul>
<li>Uses <code>docker.io/sbussiso/unsloth-trainer:latest</code></li>
<li>Mounts the Network Volume at <code>/data</code></li>
<li>Has your desired GPU/CPU/RAM resources</li>
</ul>
<h3>Step 4: Launch Training</h3>
<ol>
<li>Set Training target to <strong>RunPod – Pod</strong></li>
<li>Configure dataset and hyperparameters</li>
<li>Set Output dir under <code>/data/outputs/...</code></li>
<li>Start the training job</li>
</ol>
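<p>Outputs written outside the network volume are lost when a pod shuts down, so it is worth checking the output directory before launching. The sketch below is a hypothetical pre-flight check, not a FineFoundry function. It only tests that a path sits under <code>/data/outputs</code>.</p>

```python
from pathlib import PurePosixPath

OUTPUT_ROOT = PurePosixPath("/data/outputs")


def is_on_network_volume(output_dir: str) -> bool:
    """True if output_dir is /data/outputs or a directory below it."""
    path = PurePosixPath(output_dir)
    if ".." in path.parts:
        return False  # reject paths that could climb out of the volume
    return path.is_relative_to(OUTPUT_ROOT)


for candidate in ("/data/outputs/run-1", "/workspace/out"):
    print(candidate, is_on_network_volume(candidate))
```

<p><code>PurePosixPath</code> is used because pods run Linux, so the check behaves the same even when the script runs on Windows.</p>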
</div>
<!-- Troubleshooting -->
<div id="troubleshooting" class="doc-section">
<h2><i class="bi bi-question-circle text-primary me-2"></i>Troubleshooting</h2>
<p>Something not working? Find your problem below and follow the fix.</p>
<h3>App Won't Start</h3>
<ul>
<li><strong>"Python not found"</strong> — Make sure Python 3.10+ is installed and available on your <code>PATH</code></li>
<li><strong>"Module not found"</strong> — Run <code>pip install -e . --upgrade</code></li>
<li><strong>Dependencies broken</strong> — Delete <code>.venv</code> folder and run <code>uv run src/main.py</code> again</li>
</ul>
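<p>When startup fails, a quick environment report often narrows the cause. The sketch below checks the two things the fixes above depend on: the Python version and whether <code>uv</code> is on your <code>PATH</code>. The function names are hypothetical, and this is not FineFoundry's System Check.</p>

```python
import shutil
import sys

MIN_PYTHON = (3, 10)


def python_ok(version=sys.version_info) -> bool:
    """True if the interpreter version is at least MIN_PYTHON."""
    return tuple(version[:2]) >= MIN_PYTHON


def environment_report() -> dict:
    """Collect the facts most startup errors depend on."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        "python_ok": python_ok(),
        "uv_on_path": shutil.which("uv") is not None,
    }


print(environment_report())
```

<p>Run it with the same interpreter you use to launch the app. A different Python on your <code>PATH</code> is a common reason for version mismatches.</p>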
<h3>Training Problems</h3>
<h4>"Out of memory" or "CUDA OOM"</h4>
<p>Your graphics card ran out of memory (VRAM). Try these:</p>
<ol>
<li>Use <strong>"Quick local test"</strong> preset (uses less memory)</li>
<li>Reduce <strong>batch size</strong> to 1 or 2</li>
<li>Close other programs using your GPU</li>
<li>Try <strong>RunPod</strong> cloud GPUs instead</li>
</ol>
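<p>Reducing the batch size changes how many examples contribute to each weight update. A common general technique is to compensate with gradient accumulation, so the effective batch stays the same while peak memory drops. Whether your preset exposes an accumulation setting is an assumption here, and the helper names below are hypothetical. The arithmetic is:</p>

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Examples that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def accumulation_for(target_effective: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch * num_gpus
    return -(-target_effective // per_step)  # ceiling division


# Shrinking the per-device batch from 8 to 2 needs 4 accumulation steps
# to keep an effective batch of 8.
steps = accumulation_for(target_effective=8, per_device_batch=2)
print(steps, effective_batch_size(2, steps))
```

<p>The trade-off is speed: each optimizer step now takes several smaller forward and backward passes instead of one large one.</p>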
<h4>Training seems stuck</h4>
<ul>
<li>Check the logs — it might just be slow</li>
<li>Large models can take hours</li>
<li>Look for red error messages</li>
</ul>
<h3>Sharing Problems</h3>
<ul>
<li><strong>"401" or "403" error</strong> — Your token is wrong or expired. Re-enter it in Settings.</li>
<li><strong>Upload fails with a valid token</strong> — Make sure the token has "write" permission. Read-only tokens can't upload.</li>
</ul>
<h3>Testing Problems (Inference)</h3>
<ul>
<li><strong>"Validating" never finishes</strong> — Training might not have completed successfully</li>
<li><strong>First response slow</strong> — Normal! The model needs to load first (30-60 seconds)</li>
<li><strong>Responses don't make sense</strong> — Try training for more steps or check data quality</li>
</ul>
<h3>Quick Fixes to Try First</h3>
<ol>
<li><strong>Restart the app</strong> — Fixes many temporary issues</li>
<li><strong>Check the logs</strong> — Look for red error messages</li>
<li><strong>Try smaller settings</strong> — Fewer threads, smaller batch size</li>