Crawl

A module to help you crawl policing websites and ingest them for later use with your LLM

This notebook is intended to follow on from police_risk_open_ai.core.llm, and mostly replicates the opening stages of the OpenAI embedding tutorial, though we make some code changes as we go.

We start by replicating their code that scrapes pages using BeautifulSoup.


source

crawl

 crawl (url)

source

HyperlinkParser

 HyperlinkParser ()

Find tags and other markup and call handler functions.

Usage: p = HTMLParser() p.feed(data) … p.close()

Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). If convert_charrefs is True the character references are converted automatically to the corresponding Unicode character (and self.handle_data() is no longer split in chunks), otherwise they are passed by calling self.handle_entityref() or self.handle_charref() with the string containing respectively the named or numeric reference as the argument.

Using the above function “out of the box” on the College of Policing APP website, though, doesn’t work as intended.

domain = "college.police.uk/app" # <- put your domain to be crawled
full_url = "https://www.college.police.uk/app" # <- put your domain to be crawled with https or http


crawl(full_url)
https://www.college.police.uk/app
HTTP Error 403: Forbidden

This seems to be a Cloudflare response designed to block crawlers, but we can get around it by modifying our User-Agent header. Rather than requesting the URL directly, we’ll inject a header that makes the request look like it comes from the Firefox browser.

That said, when you do scrape websites, make sure you do it ethically: consider how much you’re pulling in, what the audience is, and whether you might be impacting service for other users. Given the APP is public and in high use, I feel that’s okay here.


    # Build the request with a browser-style User-Agent so the site serves us the page
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

    # Try to open the URL and read the HTML
    try:
        with urllib.request.urlopen(request) as response:
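
For reference, a minimal self-contained version of this header-spoofing fetch looks something like the sketch below (the fetch_html name is mine, not part of the module):

import urllib.request

def fetch_html(url):
    # Pretend to be a regular browser so Cloudflare-fronted sites return the page
    request = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', errors='ignore')

html = fetch_html("https://www.college.police.uk/app")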

I also add a maximum URL length, because the College site has some very odd URLs that were breaking my code.


        # Save text from the url to a <url>.txt file
        if len(url) < 500:
            with open('text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w", encoding="UTF-8") as f:
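
To make the filename mangling concrete, here is what that line does to a typical (hypothetical) APP URL:

url = "https://www.college.police.uk/app/armed-policing"
filename = url[8:].replace("/", "_") + ".txt"
# filename == "www.college.police.uk_app_armed-policing.txt"
# Anything 500 characters or longer is skipped entirely by the check above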

That code does successfully manage to run through the entirety of the APP domain, so we bundle it up below.


source

crawl

 crawl (url)

source

HyperlinkParser

 HyperlinkParser ()

Find tags and other markup and call handler functions.

Usage: p = HTMLParser() p.feed(data) … p.close()

Start tags are handled by calling self.handle_starttag() or self.handle_startendtag(); end tags by self.handle_endtag(). The data between tags is passed from the parser to the derived class by calling self.handle_data() with the data as argument (the data may be split up in arbitrary chunks). If convert_charrefs is True the character references are converted automatically to the corresponding Unicode character (and self.handle_data() is no longer split in chunks), otherwise they are passed by calling self.handle_entityref() or self.handle_charref() with the string containing respectively the named or numeric reference as the argument.

Let’s test it out on my website. As you can see, it loops through each hyperlink and crawls it in turn, saving each page as a local file.

# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^http[s]*://.+'

domain = "andreasthinks.me" # <- put your domain to be crawled
full_url = "https://andreasthinks.me/" # <- put your domain to be crawled with https or http


crawl(full_url)
https://andreasthinks.me/
https://andreasthinks.me/./recent_work.html
https://andreasthinks.me/./index.xml
/home/andreasthinksmint/python_env/cop-bot/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
https://andreasthinks.me/./posts/lockdown_effect/index.html
https://andreasthinks.me/../../index.xml
HTTP Error 400: Bad Request
https://andreasthinks.me/../../recent_work.html
HTTP Error 400: Bad Request
https://andreasthinks.me/../../about.html
HTTP Error 400: Bad Request
https://andreasthinks.me/../../index.html
HTTP Error 400: Bad Request
https://andreasthinks.me/./posts/migrated_to_quarto/index.html
https://andreasthinks.me/./posts/burglary_attendance/index.html
KeyboardInterrupt: 

With the APP now scraped, we can move on to cleaning out irrelevant data and outputting something we can work with.


source

clean_scrapped_data

 clean_scrapped_data (scrape_directory,
                      output_file='processed/scraped.csv')

Takes a folder containing all the files from your scraped data, cleans them, saves the result as a CSV and returns the dataframe

If output_file is None, the dataframe is returned but not saved


source

remove_newlines

 remove_newlines (serie)
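
remove_newlines has no rendered docstring. Judging by the warning in the output below and the OpenAI tutorial this notebook replicates, it does something along these lines (a sketch, not the exported source):

def remove_newlines(serie):
    # Collapse literal and escaped newlines, then squeeze out double spaces
    serie = serie.str.replace('\n', ' ')
    serie = serie.str.replace('\\n', ' ')
    serie = serie.str.replace('  ', ' ')
    serie = serie.str.replace('  ', ' ')
    return serie
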
cleaned_df = clean_scrapped_data("text/www.college.police.uk")
cleaned_df
/tmp/ipykernel_197341/3678221184.py:5: FutureWarning: The default value of regex will change from True to False in a future version.
  serie = serie.str.replace('\\n', ' ')
fname text
0 .police.uk app .police.uk app. APP (authorised prof...
1 .police.uk .police.uk. Working together | Coll...
2 .police.uk about .police.uk about. About us | College...
3 .police.uk about concordats .police.uk about concordats. Concord...
4 .police.uk about publication scheme .police.uk about publication scheme. ...
... ... ...
4441 .police.uk cdn cgi l email protection#cb8fedaa... .police.uk cdn cgi l email protection#cb8fedaa...
4442 .police.uk cdn cgi l email protection#d3b7f5b2... .police.uk cdn cgi l email protection#d3b7f5b2...
4443 .police.uk cdn cgi l email protection#206f6446... .police.uk cdn cgi l email protection#206f6446...
4444 .police.uk cdn cgi l email protection#97f4f8f9... .police.uk cdn cgi l email protection#97f4f8f9...
4445 .police.uk cdn cgi l email protection#02706771... .police.uk cdn cgi l email protection#02706771...

4446 rows × 2 columns

So there you have it! An entire website, scraped and cleaned, ready to be ingested into our AI model.

Before we can begin analysis, we need to split our text into tokens, the chunks of text-data our model actually works with.

df = pd.read_csv("processed/scraped.csv",index_col=0)
df.columns = ['title', 'text']

df

# Tokenize the text and save the number of tokens to a new column
df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

# Visualize the distribution of the number of tokens per row using a histogram
df.n_tokens.hist()
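
The tokenizer object is created earlier in the notebook; assuming it follows the OpenAI embedding tutorial, it is a tiktoken encoding along these lines (an assumption, so check your own setup):

import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
tokenizer = tiktoken.get_encoding("cl100k_base")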

The embedding model can’t handle that many tokens in a single request, so we build a function to break long texts into chunks.


source

split_into_many

 split_into_many (text, max_tokens=500)
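
split_into_many has no rendered docstring. A sketch of a token-based splitter with this signature, following the OpenAI tutorial this notebook replicates (it assumes the tokenizer object described above; the exported source may differ):

def split_into_many(text, max_tokens=500):
    # Split the text into sentences and count the tokens in each
    sentences = text.split('. ')
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]

    chunks, chunk, tokens_so_far = [], [], 0
    for sentence, n_token in zip(sentences, n_tokens):
        # If adding this sentence would push the chunk over the limit, close it off
        if tokens_so_far + n_token > max_tokens:
            chunks.append(". ".join(chunk) + ".")
            chunk, tokens_so_far = [], 0
        # Drop any single sentence that is itself longer than max_tokens
        if n_token > max_tokens:
            continue
        chunk.append(sentence)
        tokens_so_far += n_token + 1

    # Don't lose the final partial chunk
    if chunk:
        chunks.append(". ".join(chunk) + ".")
    return chunks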

source

produce_df_embeddings

 produce_df_embeddings (df, chunk_size=100)

produces embeddings from the OpenAI API in chunks
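
As a rough illustration of what “in chunks” means here, the sketch below batches the text column and calls the embeddings endpoint one batch at a time (it assumes the pre-1.0 openai Python client and the text-embedding-ada-002 model used in the tutorial this notebook follows; the exported function may differ):

import openai

def produce_df_embeddings_sketch(df, chunk_size=100):
    # Send the text column to the embeddings endpoint in batches of chunk_size rows
    embeddings = []
    for start in range(0, len(df), chunk_size):
        batch = df['text'].iloc[start:start + chunk_size].tolist()
        response = openai.Embedding.create(input=batch, engine='text-embedding-ada-002')
        embeddings.extend(record['embedding'] for record in response['data'])
    df['embeddings'] = embeddings
    return df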

Bringing it Together

Let’s now make a master function that scrapes a website, cleans it, tokenises it, converts it to embeddings, and saves the result.

def crawl(url, export_dir='scrape_export'):
    # Parse the URL and get the domain

    if not os.path.exists(export_dir + "/"):
            os.mkdir(export_dir + "/")
    
    export_directory_loc = export_dir + "/"
    
    local_domain = urlparse(url).netloc

    # Create a queue to store the URLs to crawl
    queue = deque([url])

    # Create a set to store the URLs that have already been seen (no duplicates)
    seen = set([url])

    # Create a directory to store the text files
    if not os.path.exists(export_directory_loc + "text/"):
            os.mkdir(export_directory_loc + "text/")

    if not os.path.exists(export_directory_loc + "text/"+local_domain+"/"):
            os.mkdir(export_directory_loc + "text/" + local_domain + "/")

    # Create a directory to store the csv files
    if not os.path.exists(export_directory_loc + "processed"):
            os.mkdir(export_directory_loc + "processed")

    # While the queue is not empty, continue crawling
    while queue:

        # Get the next URL from the queue
        url = queue.pop()
        print(url) # for debugging and to see the progress

        # Save text from the url to a <url>.txt file
        with open(export_directory_loc + 'text/'+local_domain+'/'+url[8:].replace("/", "_") + ".txt", "w", encoding="UTF-8") as f:

            # Get the text from the URL using BeautifulSoup
            soup = BeautifulSoup(requests.get(url).text, "html.parser")

            # Get the text but remove the tags
            text = soup.get_text()

            # Pages that require JavaScript can't be parsed properly, so flag them
            # (note: the text is still written out below)
            if ("You need to enable JavaScript to run this app." in text):
                print("Unable to parse page " + url + " due to JavaScript being required")

            # Write the text to the file in the text directory
            f.write(text)

        # Get the hyperlinks from the URL and add them to the queue
        for link in get_domain_hyperlinks(local_domain, url):
            if link not in seen:
                queue.append(link)
                seen.add(link)
domain = "andreasthinks.me" # <- put your domain to be crawled
full_url = "https://andreasthinks.me/" # <- put your domain to be crawled with https or http


crawl(full_url)
https://andreasthinks.me/
https://andreasthinks.me/./recent_work.html
https://andreasthinks.me/./index.xml
/home/andreasthinksmint/python_env/cop-bot/lib/python3.10/site-packages/bs4/builder/__init__.py:545: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  warnings.warn(
https://andreasthinks.me/./posts/lockdown_effect/index.html
https://andreasthinks.me/../../index.xml
KeyboardInterrupt: 

source

scrape_website

 scrape_website (url, export_dir='scrape_export')

Takes a URL, scrapes it and saves the text files to the export folder
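
As a final sketch of how the pieces fit together (the directory paths here are assumptions based on the crawl code above):

# Scrape the site into scrape_export/text/<domain>/...
scrape_website("https://andreasthinks.me/", export_dir="scrape_export")

# ...then clean the scraped text and produce embeddings
cleaned_df = clean_scrapped_data("scrape_export/text/andreasthinks.me")
embedded_df = produce_df_embeddings(cleaned_df)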