Calling @crtweet for clarification, please!

May 7, 2009

Michael Wurzer

There’s been quite a dustup over the decision reportedly made by the Indianapolis Metropolitan Board of REALTORS® (MIBOR) that their MLS IDX rules against “scraping” also prohibit Google from indexing an agent’s site showing IDX listings.

For a bit of background, indexing is what Google does: it crawls the web and builds indexes of as much of it as it can, so that when people search on Google it can return relevant results quickly. Here’s what Wikipedia has to say about scraping (emphasis added):

Web scraping (or Web harvesting, Web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding certain full-fledged Web browsers, such as the Internet Explorer (IE) and the Mozilla Web browser. Web scraping is closely related to Web indexing, which indexes Web content using a bot and is a universal technique adopted by most search engines. In contrast, Web scraping focuses more on the transformation of unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software.

The sentence I highlighted, the one calling scraping “closely related to Web indexing,” is where the confusion begins on this issue. Scraping and indexing are closely related. That they are different, however, is emphasized by the important words “in contrast” that open the very next sentence. Put together, indexing is “closely related” to scraping but stands “in contrast” to it in what I think are important ways, namely in how the resulting data is used. I’ll expound on this more below, but, for now, back to the controversy at hand.

In responding to the post on Agent Genius, Hilary Marsh from NAR said:

. . . questions have arisen about the scope of the requirement that IDX site operators protect the listings of other participants displayed on their IDX sites from “scraping”. Specifically, whether the policy distinguishes between “malicious” scraping and what might be considered “good” or “benign” scraping. Also, whether “indexing” is a type of scraping. The Center for REALTOR® Technology (“CRT”) advised that while the intent of “scrapers” may be malicious, and the intent of “indexers” good, the two practices from the Web server’s view appear to be the same. Consequently, NAR staff responded to questioners that the requirement to prevent scraping includes indexing.

So, the rub of the issue is that MIBOR punted the ball back to NAR, which asked CRT, and CRT (as a technical body) said, technically, there’s no difference between scraping and indexing. Of course, as is clear from the above Wikipedia definition, CRT is right — there really is no distinction from the perspective of the computer activity between scraping and indexing. Both processes read the web site and do stuff with the data.
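As a rough illustration of that point, here is a minimal Python sketch. It is not anything CRT, NAR, or Google actually runs; the page URL, the HTML markup, and every function name in it are invented for the example, and `fetch` is a stand-in for a real HTTP request. Both the "indexer" and the "scraper" read the page in exactly the same way, so the web server could not tell them apart. They differ only in what they do with the data afterward.

```python
from html.parser import HTMLParser
import re

# Hypothetical IDX page: the URL and markup are assumptions made up for this sketch.
PAGE_URL = "https://agent.example.com/listings"
PAGE_HTML = """
<html><body>
<div class="listing"><span class="address">12 Oak St</span><span class="price">$250,000</span></div>
<div class="listing"><span class="address">98 Elm Ave</span><span class="price">$310,000</span></div>
</body></html>
"""


def fetch(url):
    """Stand-in for an HTTP GET. An indexer and a scraper would send the same request."""
    return PAGE_HTML


class SpanCollector(HTMLParser):
    """Collects (class, text) pairs for every <span class="..."> element."""

    def __init__(self):
        super().__init__()
        self.current = None
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.current = dict(attrs).get("class")

    def handle_data(self, data):
        if self.current and data.strip():
            self.spans.append((self.current, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def read_spans(url):
    """The shared step: fetch the page and parse it."""
    parser = SpanCollector()
    parser.feed(fetch(url))
    return parser.spans


def index_page(url):
    """Indexing: map each word to the URL it came from, so a search points back to the source."""
    index = {}
    for _, text in read_spans(url):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(url)
    return index


def scrape_page(url):
    """Scraping: turn the same HTML into structured records, detached from the source page."""
    records, row = [], {}
    for cls, text in read_spans(url):
        row[cls] = text
        if "address" in row and "price" in row:
            records.append(row)
            row = {}
    return records


index = index_page(PAGE_URL)
records = scrape_page(PAGE_URL)
```

In this sketch, `index` only knows that a word such as "oak" appears at the agent's URL, which is what lets a search engine send a visitor back to the agent's site. `records` is a standalone copy of the listing data that could be republished anywhere with no link back.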

However, focusing on the technical process here is wrong. Instead, the important distinction is between the results of the activity. Here is an analogy that I think shows how the two differ. When you visit a web site, your web browser reads the site and displays the information back to you. In fact, most web browsers store a copy of that site on your computer so that they can display it faster if you look at it again later. From a technical perspective, your visit to the web site and your browser caching the content locally on your computer are not very different from what a scraper does.

However, nobody is going to argue that web visitors are scrapers. Why? Because of their intent and what they are doing with the data. A consumer looking at content is a good thing. So, too, I would suggest, is Google’s indexing of the web and of real estate content. Google is not (at least today) taking the content and presenting it as its own creation. Instead, it links back to the source of the data, which provides a critically important service to the web site being indexed. This is what the web is all about, and so interpreting indexing and scraping as the same thing results in the leap backward that the commenters on the Agent Genius post decry. It’s an undoing of the web for IDX sites, which have become critically important to agents and brokers today.

Before concluding this post, however, I also want to point out that not everyone agrees that Google’s indexes are positive or even benign. In Belgium, a court has ruled that Google’s News service violates certain newspapers’ copyrights. In hailing the opinion, the winner of the case is quoted by the New York Times as saying:

“Today we celebrate a victory for content producers,” said Margaret Boribon, secretary-general of Copiepresse. “We showed that Google cannot make profit for free from the credibility of our newspaper brands, hard work of our journalists and skill of our photographers.”

Could a similar argument be made by MLSs or listing agents about Google indexing listing data? Possibly. However, I think getting a similar ruling from a US court is unlikely. (Any lawyers out there who know the law on this, please comment to clarify, because I’m definitely no expert here.)

More importantly, our industry has accepted the web as its friend and Google is accepted as a critical part of the web. To many, in fact, Google is the web. What’s wrong with the MIBOR decision, and with the narrow, technical interpretation from CRT that led to it, is that it goes against the many decisions that have already been made that the web is the real estate industry’s friend. That decision cannot be unmade. It’s done. Rule interpretations like the one provided by CRT, however, do leave NAR members unable to compete. As many on Agent Genius have commented, Trulia, Zillow and Realtor.com are not hamstrung by this same interpretation of the IDX policy, which hinders and restricts only NAR’s members. That’s wrong.

Fortunately, we live in a web world and, for many, that means we know each other personally. Most of those commenting over at Agent Genius have met, know and greatly respect Chris McKeever (@crtweet on Twitter), who now heads up CRT. My hope is that Chris can join the conversation and clarify CRT’s interpretation or let us know why the current interpretation is best. I’m asking for this conversation with the greatest respect for Chris and everyone at CRT. MIBOR put them on the hot seat but perhaps there’s a possibility the conversation can result in greater understanding for everyone, and hopefully a quick clarification on this critically important matter for MLS organizations that haven’t yet interpreted the policy on this issue.