A technology company dedicated to halting and preventing the scraping of online data has set its sights on the real estate industry.
With strategic help from real estate consulting firm Clareity Consulting, Arlington, Va.-based Distil Networks has developed what it hopes will be an “industrywide intelligence network” to identify and thwart those who scrape real estate listing data without permission.
“You (multiple listing services) put out your most valuable asset on your front lawn with a sign that says, ‘Please don’t steal me.’ That’s the only protection you have. Until now,” said Distil CEO Rami Essaid.
Scrapers, which are essentially programs or “bots,” look just like human beings when they access a website, he said.
“But if you put Distil between your website and your end users, we can identify the bots and only let through legitimate traffic,” he said.
Thus far, 12 MLSs have deployed the service on their public-facing websites and 23 others are in the pipeline. Eventually, the goal is to get websites all across the industry — from those of agents and brokers to those of franchisors and big portals — to sign up for Distil’s network, said Matt Cohen, Clareity’s chief technologist.
“We really need coverage for this to work. Once (a bot goes) to one site, we know the bot is bad and when they go to another site they are pre-identified (and blocked). It’s like an antivirus,” he said.
“If they find an unprotected site, they’ll feast all they want.”
Because agents, brokers and MLSs tend to provide the same listing data through their public websites, every site is a “point of vulnerability,” Essaid said.
“Our ability to protect the data is only as strong as the weakest link,” he said.
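The shared-network idea Cohen describes can be illustrated with a minimal sketch. It is not Distil’s implementation: the class names, the fingerprint strings and the `looks_like_scraper` flag are all hypothetical stand-ins, and real detection logic is assumed to live elsewhere. The sketch only shows why coverage matters, since a client flagged on one member site is blocked on every other member site before it can scrape anything.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBotRegistry:
    """Hypothetical shared store of client fingerprints flagged as scrapers."""
    flagged: set = field(default_factory=set)

    def report(self, fingerprint: str) -> None:
        self.flagged.add(fingerprint)

    def is_flagged(self, fingerprint: str) -> bool:
        return fingerprint in self.flagged


@dataclass
class ProtectedSite:
    """Hypothetical member site that consults the shared registry."""
    name: str
    registry: SharedBotRegistry

    def handle_request(self, fingerprint: str, looks_like_scraper: bool) -> str:
        # Block at once if any member site has already flagged this client.
        if self.registry.is_flagged(fingerprint):
            return "blocked"
        # Otherwise fall back to this site's own detection result (a stand-in
        # boolean here) and share a positive result with the whole network.
        if looks_like_scraper:
            self.registry.report(fingerprint)
            return "blocked"
        return "allowed"


registry = SharedBotRegistry()
mls_site = ProtectedSite("mls.example", registry)
agent_site = ProtectedSite("agent.example", registry)

# The bot is caught on the MLS site, then arrives at an agent site that has
# no local evidence against it; the shared registry blocks it anyway.
first = mls_site.handle_request("fp-123", looks_like_scraper=True)
second = agent_site.handle_request("fp-123", looks_like_scraper=False)
print(first, second)
```

A site outside the registry would have no such record to consult, which is the “weakest link” both men describe.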
Distil’s technology differentiates between “good” bots, like search engines, and “bad” bots that scrape data, and blocks activity only from bad bots, Cohen said.
So far, every site that has implemented Distil has seen malicious bot activity, indicating that many MLSs — and other industry participants — may not be aware they have a scraping problem, Essaid said.
To his shock, Essaid found that even low-profile, run-of-the-mill agent websites are at risk.
“It’s not just your big realtor.com or your Prudentials. Smaller sites also get targeted. My dad’s a Realtor. His site I would imagine no one would even get to (but) sites small and large have this problem,” he said.
Rosemary Scardina, CEO of East Bay Regional Data Inc., agreed to have EBRD become the first MLS to try Distil in beta, beginning in February. EBRD was moving to a new site and she was interested to see how vulnerable the old one was.
“I think that anyone that doesn’t feel they have a scraping problem — unless they have the best technology people on staff — is living in a fool’s paradise,” she said.
“It wasn’t a matter of if I had a scraping problem, it was a matter of how bad the scraping problem was.”
So far, Distil has found that data scrapers account for up to 60 percent of Web traffic on public-facing MLS websites. That much activity puts significant stress on a website’s bandwidth, Essaid said. By blocking scrapers from the EBRD site, Distil made the site load twice as fast, improving user experience and reducing infrastructure costs, he said.
Scardina, whose MLS has 4,000 subscribers, wonders why some people don’t think scraping is wrong.
“It’s behind a firewall. It’s locked down. How is that different than going into a jewelry store and stealing something?” she asked.
Scardina said Distil has given her peace of mind.
“I’m in a medium-size MLS and I don’t have the staff and technology resources” to stop scrapers, she said.
Before Distil, “it was very hard to find (provable) instances of scraping. You can guess and wonder how people got your information, but going back all the way to the origin … and being able to (put) a stop to it … (is) the true solution to scraping,” Scardina said.
One surprising finding for Scardina was that the majority of her site’s scrapers came from China.
“I didn’t realize how far they come from. I’ve been told that labor is cheap” there, she said.
“Data is international. Whoever wants the information could come from anywhere, and they could use anyone from any part of the world to get it for them,” she added.
Realtor.com operator Move Inc. was a pioneer in industry anti-scraping efforts. In an earnings call today, the company said it has invested millions over the years in protecting the broker and agent data it receives from MLSs and that it prevents more than 1.5 million scraping attempts daily.
Distil’s Essaid said his company’s multilayered process combines identity verification, behavioral modeling and unique fingerprinting that lets it track users across IP addresses. Together, he said, those layers give Distil a success rate of 99 percent.
“The bots are getting very sophisticated. We have to change tests every couple of minutes,” Essaid said.
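To see why tracking users across IP addresses matters, consider a minimal sketch. It assumes a fingerprint is simply a hash of a few browser attributes and that a page-count threshold stands in for behavioral modeling; the attribute names, the sample log and `MAX_PAGES` are all hypothetical, and Distil’s actual tests are not described in this story.

```python
import hashlib
from collections import defaultdict


def fingerprint(attrs: dict) -> str:
    """Hash client attributes, deliberately leaving out the IP address."""
    canonical = "|".join(f"{key}={attrs[key]}" for key in sorted(attrs))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical client attributes and request log: (ip, attributes, pages requested).
bot = {"user_agent": "Mozilla/5.0 (X11)", "language": "en-US", "screen": "800x600"}
human = {"user_agent": "Mozilla/5.0 (Macintosh)", "language": "en-US", "screen": "1440x900"}
request_log = [
    ("203.0.113.5", bot, 60),
    ("198.51.100.7", bot, 60),
    ("192.0.2.44", bot, 60),
    ("192.0.2.10", human, 12),
]

MAX_PAGES = 100  # assumed threshold for this sketch

pages_by_fp = defaultdict(int)
ips_by_fp = defaultdict(set)
for ip, attrs, pages in request_log:
    fp = fingerprint(attrs)
    pages_by_fp[fp] += pages
    ips_by_fp[fp].add(ip)

# No single IP exceeds the threshold, but the bot's combined activity does.
suspects = {fp for fp, total in pages_by_fp.items() if total > MAX_PAGES}
for fp in suspects:
    print(fp, pages_by_fp[fp], "pages from", len(ips_by_fp[fp]), "IP addresses")
```

Counting per IP address alone would let this client through, since each address stays under the limit; keying on the fingerprint is what exposes the rotation.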
Clareity, which has focused on industry data security issues since the late 1990s, realized not every MLS or vendor could have one or two people on staff exclusively to combat scrapers, Cohen said. So he set out to find a company that was willing to customize a data protection product for the real estate industry that could scale down to the broker and agent level and would be available at an “appropriate” price.
After several months of scouting, “Distil stood alone,” Cohen said.
Pricing starts at $36 per year for individual agents and varies by brokerage size. Pricing for MLSs also varies, but in general, the fees for real estate plans are significantly discounted from Distil’s regular pricing.
Essaid said focusing on the real estate industry was “just good business.”
“I want to make it sound great and philanthropic, but really, Matt showed me the money,” he said.
“When you break it down to how big the industry is (with) 700,000 Realtor sites … it’s one of those situations where you can get 80-90 percent penetration in this one vertical.”
“It’s an arms race. The people that we protect are always going to need our protection,” he added.
While the MLSs using Distil now are using it for their public-facing websites, he hopes that starting later this summer or fall some early adopters may start to promote it to their subscribers.
Scardina said deploying Distil to individual subscriber websites “may be something down the road,” but EBRD “hasn’t moved in that direction as of yet.”
Essaid and Cohen declined to name any of the data scrapers Distil has identified or any of the MLSs other than EBRD that have implemented Distil.
“Frankly, the information on the bad guys is stuff that we are going to save and use in court,” Cohen said.
“We’re not telling the bad guys they are being monitored, and one of the reasons we’re not telling who the customers are is because the bad guys will know who to avoid. We have to be cagey.”
Cohen anticipates Distil will work hand in hand with another initiative spearheaded by Clareity: REDPLAN Inc., a new nonprofit organization created to protect the intellectual property rights of MLSs and brokerages.
“REDPLAN needs to know who to go after. How do they know? We need an industrywide intelligence network that will know and pass it on,” Cohen said.
The “most abstract” reason to catch scrapers is to protect the copyrights of MLSs and brokers, Cohen said.
But it is also important because scraping undermines the reputations of brokers and MLSs and MLSs’ ability to generate legitimate revenue, he said.
While many may associate scraping with the improper display of real estate listings, “many scrapers use the data out of the light of day. They sell it out the back door … aggregate it and sell statistical products back into the industry and to financial institutions,” Cohen said.
Selling “gray market stolen data” decreases the actual value of licensing revenue from legitimate organizations such as MLSs or data companies like CoreLogic and LPS, he said.
Some scrapers also use listing data to power illicit marketing efforts, Cohen said.
“Such that a person puts a listing into the MLS, it’s scraped from an IDX (Internet Data Exchange) website, (and scrapers) might put it up against a reverse phone directory,” he said.
“Within a day of it being put in the MLS, clients are getting calls from landscaping companies and who knows who else. It’s embarrassing for our industry.”
Most MLSs understand the value of the data they are entrusted with, and that will hopefully be enough to get them to take reasonable steps to protect it, Cohen said. But he stressed that convincing MLSs to participate will be a communications process.
“The starting point was getting a few MLSs and brokers to dip their toe in the water first and have their sites protected by Distil — which we have already done — in order to start collecting evidence that it’s an issue, which we’ve also now done. Now we just need to get the word out,” he said.
Editor’s note: This story has been updated to include comments from Move’s second-quarter earnings call and remove an old statistic regarding the effectiveness of realtor.com’s anti-scraping efforts.