Sunday, September 14, 2014

Pesky Perpetual Problems in Parsing PACER Case Information

In the last year, I worked with a legacy application that automated PACER interactions for users and wrapped them in a nicer user interface. This post is about a few of the data issues, both in the data model and in scraping, that I noted as I worked through dozens of problems -- major and minor -- with the engineers and contractors I brought on board to help out. The issues mentioned below aren't specific to that code base, so I hope that some general documentation and advice helps people working to truly open up public court data for deep analysis.

The biggest challenge with PACER is not that each court upgrades its instance of the PACER software distribution separately, but rather that each court can bizarrely customize different aspects of the PACER data itself. This is a bit of a nightmare, because it means your testing needs remarkable coverage across all the PACER sites, and then across their cases, to tease out issues before your data set gets corrupted. All the PACER sites had pretty consistent error messaging on their login screens: wrong password, your bill is overdue, your client code is incorrect, etc. The problem arises when an individual case, docket entry, or document has been sealed or otherwise had its access restricted. I saw literally dozens of different error messages, appearing at different points in the PACER interaction, as local admins sometimes used the proper configuration features of PACER and sometimes just wrote one-off custom errors themselves. The reason this becomes so critical is that the end users of your PACER-wrapping application need to understand that the problem is with PACER -- and not with your application. After a year of doing test-driven development as we fixed customer- and tester-reported problems with specific PACER cases, we finally reached a point where things were mostly fine. Until another PACER admin puts errors into something other than a <b> or <h4> tag.
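Because the error text can land in any tag a local admin chose, matching on tag structure alone is fragile. A minimal sketch of a more defensive approach -- the phrase list and the helper name here are illustrative assumptions, not taken from any real PACER deployment:

```ruby
# Hypothetical phrases a court might use for a sealed or restricted item.
# Real deployments vary widely; this list would grow via test-driven fixes.
PACER_ERROR_PHRASES = [
  /document is sealed/i,
  /case under seal/i,
  /not available electronically/i,
  /you do not have permission/i,
].freeze

# Returns the matching phrase (truthy) if the page looks like a PACER
# access-restriction error, or nil otherwise.
def pacer_restriction_error(html)
  # Strip tags first, so the match works whether the admin wrapped the
  # message in <b>, <h4>, or a bare text node.
  text = html.gsub(/<[^>]+>/, ' ').squeeze(' ')
  PACER_ERROR_PHRASES.find { |re| text =~ re }
end
```

Matching on normalized text rather than markup is what lets the same check survive the next admin's one-off HTML.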

The second biggest challenge is when cases change venue. The prefix in the case number might change, but it is still hypothetically the same case and should carry the same accrued metadata along with it. Due to the way ECF data entry works, the link between the case's original venue and the new one is often obvious, but just as often it is not. The text in the docket entry appears to get hand-edited, or the ECF software doesn't close an HTML tag, and sometimes the documents themselves have their header and footer text modified. The latter issue means that basic text recognition and comparison won't work to find the "same" documents in other cases. So, if I'm following a case and it moves (I may not know, if I'm not involved), I'll now be monitoring an ostensibly dead case. Dealing with this in the data model is challenging, but so are the UI aspects -- how do you reasonably notify a research user following dozens of cases, both in-UI and via email alerts?
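One way to cope with header/footer edits is to fingerprint only the body of a document's text before comparing. This is a sketch under stated assumptions -- the function name and the number of lines trimmed from the top and bottom are my own guesses, not a documented PACER technique:

```ruby
require 'digest'

# Fingerprint a document's body so the "same" document can be matched
# across venues even after a court modifies the header or footer text.
# skip_top/skip_bottom are assumptions about how many lines the header
# and footer occupy; real values would need per-court tuning.
def body_fingerprint(text, skip_top: 3, skip_bottom: 3)
  lines = text.lines.map(&:strip).reject(&:empty?)
  body  = lines[skip_top...(lines.length - skip_bottom)] || []
  # Normalize case and whitespace so trivial re-rendering differences
  # don't break the match.
  Digest::SHA256.hexdigest(body.join("\n").downcase.gsub(/\s+/, ' '))
end
```

Two copies of a transferred order with different court captions would then hash identically, while any edit to the body itself changes the fingerprint.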

The third challenge is that the recent upgrades to the PACER system are making more and more use of JavaScript. For companies using static page scraping technologies, like Nokogiri or Hpricot, the introduction of PACER pages with even a minor amount of dynamic DOM necessitates a dramatic change-up in underlying technologies. Since I worked on web application security products in the early 2000s, I had already participated in the cat-and-mouse battle between web sites that don't want to be crawled and the folks writing the crawling software. My best advice for anyone doing (semi-)automated PACER traversal is to start using PhantomJS, perhaps even with Capybara, to automate interactions with PACER. If you're already doing acceptance test-driven development, you're probably already using those technologies for your own test suites, and perhaps even know how to keep Capybara/Selenium test suites fast and maintainable in the medium/long-term. For some PACER-related product teams, this imminent problem may be the final nail in their coffin.
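For the curious, wiring PhantomJS into Capybara looks roughly like the configuration sketch below (using the poltergeist gem, the usual PhantomJS driver for Capybara in this era). The driver name, timeout, and URL are my own illustrative choices, not anything PACER-specific:

```ruby
require 'capybara'
require 'capybara/poltergeist'  # PhantomJS driver for Capybara

Capybara.register_driver :pacer_phantomjs do |app|
  Capybara::Poltergeist::Driver.new(app,
    js_errors: false,  # don't abort on a court page's benign JS errors
    timeout: 60)       # courts can be slow; assumed generous timeout
end
Capybara.default_driver = :pacer_phantomjs

session = Capybara::Session.new(:pacer_phantomjs)
session.visit('https://ecf.example.uscourts.gov/cgi-bin/login.pl') # hypothetical URL
```

The payoff is that `session` executes the page's JavaScript before you read the DOM, so the same Capybara finders you use in acceptance tests work against PACER's dynamic pages.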

With the recent loss of important public court records on PACER*, keeping an open and secure archive of the data in the existing system is critical -- especially before the "NextGen" upgrade, when the individual courts' custom handling of various data could lose information forever.

If enough people like this article, I'll write some more in-depth details on specific courts/cases that were non-trivial to parse and model in the database.

* The Latest Really Big Screw Up With PACER