davidbp
diff --git a/‎python_libraries/PyMuPDF/kmeans_datastreams.pdf‎
607 KB b/‎python_libraries/PyMuPDF/kmeans_datastreams.pdf‎
607 KB
diff --git a/‎python_libraries/PyMuPDF/read_pdf_text.ipynb‎
Lines changed: 163 additions & 0 deletions b/‎python_libraries/PyMuPDF/read_pdf_text.ipynb‎
Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "toc": true
+   },
+   "source": [
+    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
+    "<div class=\"toc\"><ul class=\"toc-item\"></ul></div>"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 85,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2022-04-21T16:07:40.927727Z",
+     "start_time": "2022-04-21T16:07:40.923706Z"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "#pip install pymupdf\n",
+    "from docarray import Document, DocumentArray\n",
+    "import fitz\n",
+    "\n",
+    "pdf = fitz.open('kmeans_datastreams.pdf')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 93,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2022-04-21T16:08:13.443125Z",
+     "start_time": "2022-04-21T16:08:13.414026Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "'K-means for Evolving Data Streams\\n1st Arkaitz Bidaurrazaga\\nBasque Center for Applied Mathematics\\nBilbao, Spain\\nabidaurrazaga@bcamath.org\\n2nd Aritz P´erez\\nBasque Center for Applied Mathematics\\nBilbao, Spain\\naperez@bcamath.org\\n3rd Marco Cap´o\\nBasque Center for Applied Mathematics\\nBilbao, Spain\\nmcapo@bcamath.org\\nAbstract—Nowadays, streaming data analysis has become a\\nrelevant area of research in machine learning. Most of the data\\nstreams available are unlabeled, and thus it is necessary to\\ndevelop speciﬁc clustering techniques that take into account the\\nparticularities of the streaming data. In streaming data scenarios,\\nthe data is composed of an increasing sequence of batches of\\nsamples where the concept drift phenomenon may occur. In\\nthis work, we formally deﬁne the streaming K-means (SKM)\\nproblem, which implies a restart of the error function when a\\nconcept drift occurs. An approximated error function that does\\nnot rely on concept drift detection is proposed. We prove that\\nsuch a surrogate is a good approximation of the SKM error.\\nThen, we introduce an algorithm to deal with SKM problem\\nby minimizing the surrogate error function each time a new\\nbatch arrives. Alternative initialization criteria are presented and\\ntheoretically analyzed for streaming data scenarios. Among them,\\nwe develop and analyze theoretically two initialization methods\\nthat search for the best trade-off between the importance that\\nis given to the past and the current batches. The experiments\\nshow that the proposed algorithm with, the proposed initialization\\ncriteria, obtain the best results when dealing with the SKM\\nproblem without requiring to detect when concept drift takes\\nplace.\\nI. INTRODUCTION\\nOne of the most relevant unsupervised data analysis prob-\\nlems is clustering [1], which consists of partitioning the data\\ninto a number of disjoint subsets called clusters. Among a wide\\nvariety of clustering methods, K-means algorithm is one of the\\nmost popular [2]. In fact, it has been identiﬁed as one of the\\ntop-10 most important algorithms in data mining [3]. Before\\nexplaining the K-means algorithm, we shall brieﬂy introduce\\nthe K-means problem.\\nA. K-means Problem\\nGiven a data set of d-dimensional points of size n, X =\\n{xi}n\\ni=1 ⊂ Rd, the K-means problem is deﬁned as ﬁnding a\\nset of K centroids C = {ck}K\\nk=1 ⊂ Rd, which minimizes the\\nK-means error function:\\nE(X, C) =\\n1\\n|X| ·\\n�\\nx∈X\\n∥x − cx∥2 ; cx = arg min\\nc∈C\\n∥x − c∥, (1)\\nwhere ∥ · ∥ denotes the Euclidean distance or L2 norm.\\nOne of the open questions in K-means is how to determine a\\nproper value K for each dataset. In the literature, one can ﬁnd\\ndifferent techniques to approach this issue. In particular, [4]\\npropose a novel technique that determines a number of clusters\\nby performing recursive bisections of a given set of clusters as\\nlong as the obtained Bayesian Information Criterion (BIC) is\\nimproved. On the other hand, [5] propose a method to learn K\\nbased on a statistical test for determining whether instances are\\na random sample from a Gaussian distribution with arbitrary\\ndimension and covariance matrix. Other approaches include\\nthe use of cluster validity measures, such as the Silhouette\\nindex, Davies-Bouldin and Calinski-Harabasz measures to\\nautomate the selection of this parameter [6], [7]. In any case, in\\nthis work, as it is common in most of the K-means problem-\\nrelated literature [8]–[11], the number of clusters is assumed\\nto be predetermined.\\na) K-means Algorithm: The K-means problem is known\\nto be NP-hard for K > 1 and d > 1 [12]. The most popular\\nheuristic approach to this problem is Lloyd’s algorithm [13].\\nGiven a set of initial centroids, Lloyd’s algorithm iterates two\\nsteps until convergence: 1) assignation step and 2) update step.\\nIn the assignation step, given a set of centroids, C = {ck}K\\nk=1,\\nthe set of points is partitioned into K clusters, P = {Pk}K\\nk=1,\\nby assigning each point to the closest centroid. Next, the new\\nset of centroids is obtained by computing the center of mass\\nof the points in each partition. This set of centroids minimizes\\nthe K-means error with respect to the given partition of the set\\nof points. These two steps are repeated until a ﬁxed point is\\nreached, meaning, when the assignation step does not change\\nthe partition. This process has a O(n · K · d) time complexity.\\nThe combination of an initialization method plus Lloyd’s\\nalgorithm is called the K-means algorithm. Many alternative\\ninitialization methods exist and, in the upcoming section, we\\nelaborate on them.\\nb) K-means Initialization:\\nThe solution obtained by\\nK-means algorithm strongly depends on the initial set of\\ncentroids [14]–[16]. Consequently, in the literature different\\ninitializations have been proposed. One of the most simple,\\nyet effective, methods is Forgy’s approach [17]. Forgy’s ini-\\ntialization consists of choosing K data points at random as\\ninitial centroids. The main drawback of this approach is that\\nit tends to choose data points located at dense regions of\\nthe space, thus these regions tend to be over-represented.\\nIn order to address this problem, probabilistic methods with\\nstrong theoretical guarantees have been proposed. K-means++\\n(KM++) [10] initialization iteratively selects points from X at\\nrandom, where the probability of selection is proportional to\\nthe distance of the closest centroid previously selected. This\\nstrategy has become one of the most popular due its strong\\ntheoretical guarantees. Among other popular alternatives, we\\nhave variations of KM++, such as the K-means—— [11] and\\narXiv:2012.03628v2  [stat.ML]  17 Sep 2021\\n'"
+      ]
+     },
+     "execution_count": 93,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "pages = [page.get_text() for page in pdf]\n",
+    "pages[0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once can use this to build a DocumentArray containing the text for each line"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 84,
+   "metadata": {
+    "ExecuteTime": {
+     "end_time": "2022-04-21T16:07:22.885440Z",
+     "start_time": "2022-04-21T16:07:22.855857Z"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "('Algorithm 2 Forgetful SKM algorithm (FSKM)',\n",
+       " '1: Predetermined: Maximum number of batches saved')"
+      ]
+     },
+     "execution_count": 84,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "da = DocumentArray([Document(text=page.get_text()) for page in pdf])\n",
+    "data = [content for content in page.split('\\n')]\n",
+    "data[0], data[1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.5"
+  },
+  "toc": {
+   "base_numbering": 1,
+   "nav_menu": {},
+   "number_sections": true,
+   "sideBar": true,
+   "skip_h1_title": false,
+   "title_cell": "Table of Contents",
+   "title_sidebar": "Contents",
+   "toc_cell": true,
+   "toc_position": {},
+   "toc_section_display": true,
+   "toc_window_display": true
+  },
+  "varInspector": {
+   "cols": {
+    "lenName": 16,
+    "lenType": 16,
+    "lenVar": 40
+   },
+   "kernels_config": {
+    "python": {
+     "delete_cmd_postfix": "",
+     "delete_cmd_prefix": "del ",
+     "library": "var_list.py",
+     "varRefreshCmd": "print(var_dic_list())"
+    },
+    "r": {
+     "delete_cmd_postfix": ") ",
+     "delete_cmd_prefix": "rm(",
+     "library": "var_list.r",
+     "varRefreshCmd": "cat(var_dic_list()) "
+    }
+   },
+   "types_to_exclude": [
+    "module",
+    "function",
+    "builtin_function_or_method",
+    "instance",
+    "_Feature"
+   ],
+   "window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}