Cheetah Web-Crawler

Last Update: 3/22/2020

Cheetah is a valuable source of tax documentation, and Python can make it far more useful. This article explains how Cheetah works and how to automate it with Python.

Understand how Cheetah works

To begin with, we need to understand how Cheetah works. After clicking "Tax - State & Local", you can find the tax documentation for each state and year by selecting them from the drop-down menus.

Picture 1. Category selection page

This interactive function is driven by JavaScript, which makes the website less stable for a web crawler. We will come back to this later. After selecting the exact file we want, we arrive at a page like this.

Picture 2. Sample page for one file

An interesting fact about Cheetah is that if a file is too large, the site emails it to you rather than letting you download it directly. If the file is relatively small, you can download it instantly by clicking the download button. The screenshot below shows the email dialog.

Picture 3. Pop-up window for sending the file by email

There are also some other things to take into consideration when designing the program.

  • First, if you are looking for a specific type of tax code, a particular state might not have that tax in a particular year, and you cannot know this in advance.
  • Second, the website might stop working for a little while if you request downloads continuously.
  • Third, the webpage is sometimes slow, so an element may not appear in time.
  • Fourth, you might need to go to your email inbox to download some of the documents.
  • Fifth, the downloaded files are not automatically named by year or state.
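
The second and third points (temporary lock-outs and slow pages) are commonly handled by retrying with growing pauses. The following is a minimal sketch of that idea, not the code used for this article; retry_with_backoff and its parameters are hypothetical names introduced for illustration.

```python
import time


def retry_with_backoff(action, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call `action()` until it succeeds, pausing longer after each failure.

    `action` is any zero-argument callable, for example a function that
    clicks the download button. The pause doubles every time:
    base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:  # a real crawler should catch narrower errors
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

The sleep argument is injectable so the helper can be tested without real waiting. A combination that simply does not exist (the first point) should probably be skipped rather than retried, since waiting will not make it appear.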

How to use it in Python

Unlike SEC EDGAR or Google Trends data, there are barely any open-source Python libraries available for Cheetah, so the only option is to develop our own tool.

Cheetah has its own download tool, which is a paid feature. We express no opinion on it, as we have not used it.

Set up Selenium

As mentioned above, Cheetah relies on JavaScript, so we need the help of Selenium, a library commonly used for crawling JavaScript-driven websites. To use Selenium, another tool is needed: ChromeDriver.

The path to ChromeDriver must be set. On a Mac, you can find the path by dragging and dropping the ChromeDriver file into the Terminal window. If you are a first-time user, the test code for the Selenium setup is:

Picture 4. Test code for Selenium

This code automatically opens a Google page and searches for ChromeDriver. If it runs successfully, the basic setup for ChromeDriver is finished.
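
Since the test code appears only as a picture, here is a rough text sketch of such a check. It is not the original code: it assumes Selenium 4 (where the driver path is passed through a Service object), assumes Google's search box is still named "q", and uses two hypothetical helper names, make_chrome_driver and search_google.

```python
def make_chrome_driver(chromedriver_path):
    """Start Chrome through ChromeDriver (Selenium 4 style)."""
    # Imported lazily so this file can be loaded before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    return webdriver.Chrome(service=Service(executable_path=chromedriver_path))


def search_google(driver, query):
    """Open Google, type `query` into the search box, and submit the form."""
    driver.get("https://www.google.com")
    # "name" is the string value of Selenium's By.NAME locator strategy.
    box = driver.find_element("name", "q")
    box.send_keys(query)
    box.submit()
    return driver.title


# Typical use (the path below is a placeholder, not a real location):
#
#     driver = make_chrome_driver("/path/to/chromedriver")
#     try:
#         search_google(driver, "ChromeDriver")
#     finally:
#         driver.quit()
```

If a consent page or a changed layout hides the search box, find_element raises an error. That points to a page difference rather than a broken ChromeDriver installation.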

Find the pattern and use Selenium

A critical part of the work is to read the page source and find the pattern. When dealing with a JavaScript-based website, focus on the code of small objects, such as a button or a line in a drop-down menu, rather than larger ones.

Once you have identified the pattern, you can get the XPath in Chrome by inspecting the element and clicking Copy XPath. We now provide a short example of an XPath and how we use it in Python.

Picture 5. XPath usage sample (page 5 in developer-diary)
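
As a text companion to the picture, the sketch below shows one way an XPath pattern might be built and used. It is an illustration under assumptions: it supposes the menus are ordinary select elements, the element id "state" in the usage line is invented, and option_xpath and click_when_ready are hypothetical helpers. The real XPaths have to be copied from the Cheetah page itself.

```python
def option_xpath(select_id, visible_text):
    """Build an XPath for one <option> inside a <select> drop-down.

    Only the id and the visible text change between menu entries, which is
    the kind of pattern worth looking for. Text containing a single quote
    would need extra escaping that this sketch does not handle.
    """
    return f"//select[@id='{select_id}']/option[normalize-space()='{visible_text}']"


def click_when_ready(driver, xpath, timeout=20):
    """Wait until the element at `xpath` is clickable, then click it."""
    # Imported lazily so the XPath helper above works without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )
    element.click()
    return element


# Example with an invented id: click_when_ready(driver, option_xpath("state", "California"))
```

Waiting for the element to become clickable, instead of clicking immediately, is one way to cope with elements that appear late on a slow page.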

For more details, please download the document below.


Once the XPath approach is clear, you can begin designing the selection loops. There are three layers of loops: state, year, and type of tax code. All patterns can be found in the document above. Part B of the document also explains how to create a program that goes to your email, downloads your file automatically, reads the RTF file, and renames it.

Sample code is also provided for this program.
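
The three loop layers and the renaming step can be sketched roughly as follows. This is an assumption-laden illustration rather than the code from the document: the state codes, tax type labels, and the naming scheme are invented, and iter_jobs, target_name, and rename_download are hypothetical helpers.

```python
from itertools import product
from pathlib import Path


def iter_jobs(states, years, tax_types):
    """Yield every (state, year, tax_type) combination, state changing slowest."""
    yield from product(states, years, tax_types)


def target_name(state, year, tax_type, suffix=".rtf"):
    """Build a predictable file name from the three loop variables."""
    slug = tax_type.strip().lower().replace(" ", "-")
    return f"{state}_{year}_{slug}{suffix}"


def rename_download(downloaded, state, year, tax_type):
    """Rename a freshly downloaded file in place and return the new path."""
    downloaded = Path(downloaded)
    renamed = downloaded.with_name(
        target_name(state, year, tax_type, downloaded.suffix)
    )
    downloaded.rename(renamed)
    return renamed


# Sketch of the main loop; the download step itself is site-specific:
#
#     for state, year, tax_type in iter_jobs(["CA", "NY"], [2018, 2019], ["Sales Tax"]):
#         ...  # select the three menu entries, download, then rename_download(...)
```

Using itertools.product keeps the three layers in one flat loop, which makes it easier to skip a combination that does not exist or to resume after an interruption.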

https://github.com/kaizhen-li/Research/blob/master/auto-tax.py

We have not updated the code since February 2019, and we cannot guarantee that it is compatible with the current Cheetah website. Feel free to reach out to kaizhenl@outlook.com about updates.
