This post was featured as a guest blog on the Open Data Institute’s blog.
HTML and other related Web technologies were initially developed to allow for the publication of natural language documents on the Internet, for reading by humans.
The key phrase in the previous sentence is “for reading by humans”.
Humans are very good at reading natural language documents and interpreting their meaning. Computers, on the other hand, are good at serving these documents but very bad at reading and interpreting them. Computers require precise instructions: if you want a computer to do something for you, it is better to have data rather than documents.
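To make the contrast concrete, here is a minimal sketch in Python that reads the same fact (a price) first from a document and then from data. The product page, the field names, and the assumption that the price sits inside a `<b>` tag are all hypothetical, invented purely for illustration.

```python
import json
from html.parser import HTMLParser

# Hypothetical product page: the price is buried in markup written for human readers.
HTML_DOCUMENT = (
    '<div class="product"><h1>Kettle</h1>'
    "<p>Only <b>$24.99</b> this week!</p></div>"
)

# The same fact published as data (hypothetical field names).
JSON_DATA = '{"name": "Kettle", "price": 24.99, "currency": "USD"}'


class BoldTextCollector(HTMLParser):
    """Collects text found inside <b> tags, guessing that the price lives there."""

    def __init__(self):
        super().__init__()
        self.in_bold = False
        self.bold_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.bold_text.append(data)


def price_from_document(html):
    """Fragile: relies on the page layout and the currency symbol staying the same."""
    parser = BoldTextCollector()
    parser.feed(html)
    return float(parser.bold_text[0].lstrip("$"))


def price_from_data(text):
    """Robust: the field is named, so no guessing about layout is needed."""
    return json.loads(text)["price"]


print(price_from_document(HTML_DOCUMENT))
print(price_from_data(JSON_DATA))
```

Both functions return the same number here, but the document version depends on a guess about presentation that the page author can invalidate at any time, while the data version depends only on a named field.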
The standards for serving Web documents have been widely adopted and they power the Web that we know and love today. By contrast, while there are proposed standards for serving Web data, these standards have not been widely adopted: less than 1% of websites use RDF.
Continue reading The Web was designed for documents, it was not designed for data
The web is a wonderful place for information. I can open a browser and have the answer to any question within minutes. But the web is not so great when it comes to data. Getting data from the web is difficult. And the only solution is a bit of a dirty secret for our industry, a dirty secret that we don’t like to talk about… “web scraping”.
The reality is that if you are a data owner with a data source on a website, then that data source is almost certainly being scraped, today. You have no insight into this. You have no control over it. It is just a cost for you. This is not good.
Web scraping is also not good for data users. It is high cost, as it requires expensive developer time. The rights that you have to use the data are uncertain: Do you need to hide the fact that you are scraping? Do you need to combine the data with other data before you can use it? If you need multiple data sources, then you create a data integration problem for yourself: you have to normalise and integrate the results of multiple web scrapes. Even if you tried to pay the data owner for access to their data source, they probably wouldn’t be able to take the money off you, as they are not in the business of selling data.
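The integration problem can be sketched in a few lines of Python. The two scrape results below, their field names, and their price formats are hypothetical; the point is only that each source needs its own hand-written normaliser before the results can be combined.

```python
# Hypothetical results of two separate scrapes describing the same kind of item.
SCRAPE_A = [{"title": "Kettle", "price": "$24.99"}]
SCRAPE_B = [{"product_name": "kettle ", "cost_cents": 2450}]


def normalise_a(row):
    """Source A: price is a string with a currency symbol."""
    return {
        "name": row["title"].strip().lower(),
        "price": float(row["price"].lstrip("$")),
        "source": "a",
    }


def normalise_b(row):
    """Source B: different field names, price is an integer number of cents."""
    return {
        "name": row["product_name"].strip().lower(),
        "price": row["cost_cents"] / 100,
        "source": "b",
    }


def integrate(scrape_a, scrape_b):
    """Map both scrapes onto one schema and sort by name, then price."""
    rows = [normalise_a(r) for r in scrape_a] + [normalise_b(r) for r in scrape_b]
    rows.sort(key=lambda r: (r["name"], r["price"]))
    return rows


for row in integrate(SCRAPE_A, SCRAPE_B):
    print(row)
```

Every additional source means another normaliser to write and maintain, and each one breaks whenever the underlying page changes.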
In summary, getting data is a problem and web scraping is neither a good technical solution nor a good economic solution.
Import•io is a place where data users (people who want data) and data owners (people who have data) can better interact. It is a platform upon which connectors to data sources can be built along with a suite of tools to make it easy to build connectors to either APIs or web data sources.
Continue reading import•io launches!