Part-time: Data crawler team (d/w/m)

We're looking for part-time scraping and crawling engineers.

OpenSanctions is expanding its coverage of sanctioned entities, people in political office and other positions of influence, and entities linked to other sources of risk. That means we need to source data from hundreds of web pages (and, whenever we’re lucky, structured data files).

We are building a team of freelancers to help us crawl web pages, Excel files, and many other kinds of data sources, and to transform the data into our standardised format.

You need the following skills, traits and experience:

  • Experience of web scraping using Python
  • ETL/data processing using Python
  • Running programs in a command line environment
  • git, GitHub, pull request workflow
  • A keen eye for detail and data correctness
  • Taking an interest in the meaning of data
  • A passion for documenting a dataset so that those with an interest in that data will see its significance, and its limitations

We build crawlers using tooling in our custom ETL framework, which is carefully designed for the nature of our data pipeline and our scaling needs. This lets each crawler focus on what is unique about the data source it integrates into our database.

We will expect you to use this tooling and write crawlers consistent with the existing style. This makes maintenance much easier for the entire crawler team.

What to expect

This is what you can expect as a member of the crawler team:

Tasks

We maintain a backlog of tasks on a kanban board. Each task is either writing a new crawler for a data source we don’t track yet, or updating an existing crawler.

We expect you to pick up the top-priority card when you are ready to work on it, and to take on tasks at the pace that suits you. However, we find it works best to complete a task within 2-3 days so that you and whoever reviews it can maintain context easily.

Time-boxes

Each task will have a time-box indicating roughly the maximum number of hours we expect it to take. Sometimes you will see ways to do things faster than we thought and finish in less time. Sometimes surprises will show up and the task might take longer to complete than the time-box allows.

The point of the time-box is to prompt you to talk to us before exceeding it, so that we can take a look with you. Ideally you tell us as soon as you realise the task is more complex than expected, and at the latest once you have used 80% of the time-box (for example, 8 hours into a 10-hour time-box). Sometimes we can give guidance and help you cut scope in a way that still yields the level of data quality we need. At other times we will agree to increase the time-box because the task is simply more complex than anticipated.

Deadlines

Sometimes we need to commit to a customer to have data from a given source in our database by a certain date. In these cases, we will indicate a deadline date on a task. We may also add a bonus to a task for completing it before the deadline.

If, after starting a task, you realise you won’t be able to finish by the deadline, it’s important to let us know so we can decide whether someone can help out, or whether the task should be completed by someone else.

Your first few tasks

Initially, we will agree a fixed cost for each task and assign you specific tasks that are well suited to on-boarding.

After that

Once it’s clear that you understand what we need and that we work well together, we’ll transition you to an agreed hourly rate with agreed minimum and maximum work levels per month.

Interested?

Contact jobs@opensanctions.org