-
Problem Type : Supervised Text Classification Problem.
-
Similar Problem Type with smaller dataset: The 20 Newsgroup Dataset. This comes as part of scikit-learn package/library.
-
Install Github's git-lfs for managing the versioning of large files. The train.zip file is of size 1.3 GB.
-
The repository limit for files in LFS, is strictly 1GB, either in Github or Bitbucket. So this option of using git-lfs doesn't come to rescue in this use-case. Thus, copied the train.zip to OneDrive for future reference.
-
train.csvhas columnsWebpage_id, Domain, Url, Taghtml_data.csvhas columnsWebpage_id, Html-
Should we merge these files into one?? The
html_data.csvfile size is ~7 GB. -
For each id in html_data.csv, we should read the Html feature, classify it to one of the labels 9 predefined labels and add it to a new column titled
Tag -
But wait, even before that, we need to split the train-dataset in train/test datasets per the problem statement.
-
Loading large CSV file (in the order of GBs) into memory is undoable business. So we load them in chunks in pandas and write it to RDBMS, for doing data munging later on it. Refer Working with large CSV files in Python for more. Alternatively, you can also refer to Dask – A better way to work with large CSV files in Python.
import pandas as pd from sqlalchemy import create_engine # To understand the structure of the csv file and make sure the data is formatted in a way that makes sense for your work. file = '/path/to/csv/file' print pd.read_csv(file, nrows=5) # to read in only 5 rows # Before we can actually work with the data, we need to do something with it so we can begin to filter it to work with subsets of the data. # But with large data files, we need to store the data somewhere else. In this case, we’ll set up a local sqllite database, read the csv file in chunks and then write those chunks to sqllite. csv_database = create_engine('sqlite:///csv_database.db') # Iterate through the CSV file in chunks and store the data into sqllite chunksize = 100000 i = 0 j = 1 for df in pd.read_csv(file, chunksize=chunksize, iterator=True): df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) df.index += j i+=1 df.to_sql('table', csv_database, if_exists='append') j = df.index[-1] + 1 # To access the data now, you can run commands like the following: # df = pd.read_sql_query('SELECT * FROM table', csv_database) # Of course, using ‘select *…’ will load all data into memory, which is the problem we are trying to get away from so you should throw from filters into your select statements to filter the data. For example: df = pd.read_sql_query('SELECT COl1, COL2 FROM table where COL1 = SOMEVALUE', csv_database)
-
-
Bad assumptions and poor skills in Python led me to assume that the the Webpage_Id(s) from the test dataset doesn't exist in the html dataset. And because of this I ended up attempting to web-scrape all the URLs from test-dataset. This is what happens when you work too hard and don't take a step back, every now and then. Shame on me!
-
My attempt to crawling Internet to read and persist the HTML page:
# OBJECTIVE: Crawl Internet to get web-page for persistence import requests from requests.adapters import HTTPAdapter from requests.packages.urllib3.util.retry import Retry from requests.packages.urllib3.util import make_headers import sys, traceback import pycurl from io import BytesIO import urllib3 urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning) #When session.verify = False def curl_page(url): retry = Retry(connect=3, backoff_factor=0.5, total=10, redirect=5) adapter = HTTPAdapter(max_retries=retry) session = requests.Session() session.mount('http://', adapter) session.mount('https://', adapter) session.verify = False # No more "SSL Error because of Bad Handshake on account of failure in certificate verification" try: page = session.get(url) return page.content.strip() except Exception as err: print('Error accessing the URL:',url) # traceback.print_exc() return curl_page2(url) # This method is called by curl_page() when exception raises def curl_page2(url): try: buffer = BytesIO() c = pycurl.Curl() c.setopt(c.URL, url) c.setopt(c.WRITEDATA, buffer) c.perform() c.close() body = buffer.getvalue() buffer.close() return body.strip() except Exception as ex: print('Error again, accessing the URL:',url) print(str(ex)) return None # url = 'https://www.zacks.com/stock/news/253595/walmart-launches-new-venture-to-boost-ecommerce-business?cid=CS-NASDAQ-FT-253595' # curl_page2(url) # For every row in the dataframe, pick the URL, crawlthe web and persist it in DB for idx,row in testdf.iterrows(): if(idx < 2590): continue # Last HTTTP Maxout error at 928 # if(idx >1000): break #For Testing purpose conn = db_engine.connect() # Auto-commit = True, implicitly print('Crawling web-page for id:',idx) page = curl_page(row.Url) try: if(page is not None): conn.execute("UPDATE testdf SET Html=? WHERE Webpage_id=?", page,idx) except Exception as err: print('Error updating record with id=',idx, "\nMore details:\t",err) finally: conn.close() print('Done!')
- Issues with crawling URLs to get the web-page using the above method.
- Not all are well-formed.
- There exists invalid URLs.
- The host or web-site could be down/moved, making the URL void. Example URL.
- The connection could be https. Now that means verification of certification validity is required ideally. In my case this was causing issues with some URLs, so had to configure to skip this certification verification and suppress API warnings asking me to do verify certificate.
- Connection time-outs
- Issues with crawling URLs to get the web-page using the above method.
-
When working in a laptop, there are constraints of Memory (you can't load ~7GB of CSV data as pandas DataFrame - Jupyter Notebook crashes!). Cloud is the answer.
-
Extracting domains from the URL is such hard stuff. This can however be made simple with the use of external library for this purpose like below:
import tldextract def extract_domain(url): return tldextract.extract(url).domain df.Domain = df.Domain.apply(extract_domain) df.Domain.value_counts().sort_index() # show in alphabetical order
-
Extracting title from web-page like below:
def extract_title(page): if (page == None): return None soup = BeautifulSoup(page, 'html.parser') title_tag = soup.find('title') if (title_tag == None): title = None else: title = title_tag.text.strip() return title #Test method definition print(extract_title("<html></html>")) print(extract_title(webpage))
-
Extracting all text within text-container HTML tags like below:
import re from bs4 import BeautifulSoup from bs4.element import Comment def is_visible_content(element): if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']: return False if isinstance(element, Comment): return False return True def remove_extra_spaces(str): return u" ".join(str.split()) def extract_text(page): if (page == None): return None soup = BeautifulSoup(page, 'html.parser') #, from_encoding="utf-8" texts = soup.findAll(text=True) # Extracts text from all HTML Markups, incl nested ones visible_texts = filter(is_visible_content, texts) # The u-prefix u" ".join() indicates Unicode and has been in python since v2.0 # Ref. Read: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/ text = u" ".join(remove_extra_spaces(t.strip()) for t in visible_texts) text = text.replace(',','') text = text.replace('|','') text = re.sub(r'\s\s+',' ',text).strip() return text.encode('utf-8',errors='ignore').decode('utf-8').strip() text = extract_text(hdf.head(1)['Html'].values[0]) text
-
Merge columns of data-frames like below:
# OBJECTIVE : Merge the Title column of pdf dataframe into the df dataframe # A little code play for trying it out below: raw_data = { 'subject_id': [11, 12, 13, 14, 15], 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches'] } df_a = pd.DataFrame(raw_data, columns = ['subject_id','first_name', 'last_name']) df_a.set_index('subject_id', inplace=True) df_a raw_data = { 'subject_id': [11, 12, 13, 14, 15, 17, 18, 19, 20, 21], 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16] } df_b = pd.DataFrame(raw_data, columns = ['subject_id','test_id']) df_b = df_b.set_index('subject_id') df_b pd.merge(df_a,df_b,on='subject_id') # df_a.merge(df_b, on='subject_id') # df_a.merge(df_b,how='inner', left_index=True, right_index=True) # df_a.merge(df_b,how='left', left_index=True, right_index=True) # df_a.merge(df_b,how='right', left_index=True, right_index=True) # df_a.merge(df_b,how='outer', left_index=True, right_index=True) # pd.concat([df_a,df_b], axis=1)
-
- When non-english characters are extracted from HTML web-pages and later attempting to persist it to DB, the driver throws an error saying like:
- The solution to this is to use the
uas prefix to string operation likeu''.join(..)so that the characters are read as Unicode and is persisted the same way.
-
Good reads Atlassian Docs - Tutorials Git LFS and Git LFS with Bitbucket.
-
Features
- Versioning : Version large files—even those as large as a couple GB in size—with Git.
- More repo space: Host more in your Git repositories. External file storage makes it easy to keep your repository at a manageable size.
- Faster cloning and fetching
- Same git flow
- Same access controls and permissions
-
Getting started
-
To create a new Git LFS aware repository, you'll need to run
git lfs installafter you create the repository. You only have to setup Git LFS only once. This installs a specialpre-pushGit hook in your repository that will transfer Git LFS files to the server when yougit push. Note: Git LFS is automatically enabled for all Bitbucket Cloudrepositories. -
Select the file types you'd like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at anytime with
git lfs track "*.zip".Note: Make sure .gitattributes is tracked with
git add .gitattributes. -
Just commit and push to GitHub as you normally would. Yay!
-