Real estate websites have seen a 300 percent jump in traffic from “bad bots.” A study from Distil Networks says this may be due to data scraping from real estate startups.
- A study from Distil Networks found that 46 percent of overall website traffic is not from humans.
- Real estate websites have seen a 300 percent jump in traffic from "bad bots." Distil says this may be due to data scraping from real estate startups.
- According to Distil, an average of 48 percent of traffic for large real estate websites is made up of bad bots.
Running a website often takes years of time, effort and money. Real estate companies spend thousands, if not millions, in their quest to attract ever more eyeballs.
But what if almost half of the “people” visiting a real estate site weren’t people at all? And worse yet, what if a big share of those non-people were there to steal?
That’s what a new report from Web security company Distil Networks suggests. The company’s third annual “Bad Bot Landscape Report” found that 54 percent of overall website traffic in 2015 was human, an increase from 2014.
The rest is from computer programs called “bots.”
Some bots, such as search engines that crawl websites, are considered “good.” Those made up 27 percent of site traffic last year, according to the report.
But “bad” bots account for nearly a fifth of site traffic. Such bots can take up a big share of a site’s bandwidth with illegitimate traffic, driving up infrastructure costs and making the site load more slowly. Or worse.
“Bad bots are used by competitors, hackers and fraudsters and are the key culprits behind web scraping, brute force attacks, competitive data mining, online fraud, account hijacking, data theft, unauthorized vulnerability scans, spam, man-in-the-middle attacks, digital ad fraud, and downtime,” the report said.
“Bad bots create vast economic and productivity loss,” the report added.
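The traffic shares quoted above add up as follows. This is only a small arithmetic sketch using the percentages reported in the article; the variable names are illustrative, and the bad-bot share is derived here as the remainder rather than taken from the report directly.

```python
# Shares of overall 2015 website traffic quoted in the article (percent).
human = 54       # human visitors
good_bots = 27   # "good" bots such as search engine crawlers

# Assumption: whatever is neither human nor a good bot is a bad bot.
bad_bots = 100 - human - good_bots
non_human = good_bots + bad_bots

print(f"bad bots: {bad_bots}%, all bots: {non_human}%")
# prints "bad bots: 19%, all bots: 46%"
```

The 19 percent remainder matches the "nearly a fifth" figure, and the 46 percent total matches the share of traffic that is not from humans.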
To be sure, Distil’s business relies on companies, including many in real estate, that hire Distil to protect their data. In mid-2013, Distil launched what it hoped would be an “industrywide intelligence network” to identify and thwart those who scrape real estate listing data without permission.
Nonetheless, the study results could be illuminating. Distil’s team analyzed 74 billion bot requests, anonymized data from several hundred customers, and Web traffic from its 17 data centers to obtain the report’s results.
Startups to blame?
And it turns out real estate websites could be especially vulnerable. Overall bad bot traffic was down to 19 percent last year from 23 percent in 2014. But real estate websites saw a more than 300-percent increase in bad bot activity between 2014 and 2015, according to Distil.
“This is likely due to the recent explosion of real estate startups, which may be taking a page out of the travel metasite playbook by scraping and aggregating data to get their businesses off the ground,” the report said.
“Why license the data when you can scrape it for free, until your business model proves itself?”
This theory is further supported by the fact that among the real estate industry sites Distil surveyed, larger sites had a significantly higher rate of bad bots than small sites: 48 percent vs. 19 percent, respectively, Orion Cassetto, director of product marketing at Distil Networks, told Inman via email.
Web analytics company Alexa ranks the top sites on the Internet based on their estimated amount of traffic. Distil defined “large” sites as those with a global Alexa traffic ranking of between 1 and 10,000. “Small” sites had an Alexa ranking of 50,001 to 150,000.
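Distil’s size buckets can be written down as a simple rule. The sketch below assumes only the two rank ranges given above; the function name `size_bucket` is hypothetical, and ranks outside both ranges return `None` because the article does not say how Distil labeled them.

```python
from typing import Optional


def size_bucket(alexa_rank: int) -> Optional[str]:
    """Map a global Alexa rank to the size label described in the article."""
    if 1 <= alexa_rank <= 10_000:
        return "large"
    if 50_001 <= alexa_rank <= 150_000:
        return "small"
    # Ranks 10,001-50,000 and anything past 150,000 are not labeled in the article.
    return None


for rank in (500, 20_000, 75_000):
    print(rank, size_bucket(rank))
```

Note the gap between the two buckets: a site ranked 20,000 would count as neither “large” nor “small” under this definition.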
“Should it be the case that our hypothesis is correct, this trend could be explained by the smaller sites being the aggregators or scrapers, while the larger sites were the scrapees,” Cassetto said.
“Companies most likely to resort to scraping tactics are those which involve aggregation of existing data and may not yet be profitable enough to license access to the data over an API (application programming interface),” he added.
“Smaller meta-search sites, aggregation sites, list and search sites all have need for huge amounts of data to support their business models and may resort to scraping to get the ball rolling.”
Any kind of “high value” data, not just listing data, could be at risk of scraping, according to Distil.
“This data can be aggregated, used for competitive intelligence, processed and turned into analytics, etc.” Cassetto said.
“The type of data really depends on the ingenuity of the scraper, and the availability of high value data within the industry and websites.”
What real estate sites did Distil’s study include?
Distil’s study encompassed hundreds of domains from 33 companies that list or process real estate data, including MLSs, associations, brokerages, property portals and other real estate marketplaces, Cassetto said.
“Several” of those websites are “very popular” and appear in the Alexa top 10,000 websites worldwide, Cassetto said.
Though most of the real estate sites studied were in the U.S., none of those “large” real estate sites were, so the findings on large real estate sites could have limited applicability here.
The company declined to provide a list of the 33 companies for security reasons. Some MLSs have voluntarily shared that they are Distil customers.
In 2013, Rosemary Scardina, CEO of East Bay Regional Data Inc. (EBRD), agreed to make EBRD the first MLS to try Distil. EBRD was moving to a new site, and she was interested to see how vulnerable the old one was.
“I think that anyone that doesn’t feel they have a scraping problem — unless they have the best technology people on staff — is living in a fool’s paradise,” she said at the time.
“It wasn’t a matter of if I had a scraping problem; it was a matter of how bad the scraping problem was.”
At the time, Distil said it had found that up to 60 percent of Web traffic on public-facing MLS websites comes from data scrapers. That much activity puts significant stress on a website’s bandwidth, said Rami Essaid, co-founder and CEO of Distil Networks.
By blocking scrapers from the EBRD site, Distil made the site load twice as fast, improving user experience and reducing infrastructure costs, he said at the time.
Real estate giants’ traffic in doubt?
Does this indicate that a big chunk of the traffic going to sites such as Zillow, Trulia and realtor.com could be bad bots?
Distil said yes. However, its study did not include Zillow, Trulia, realtor.com, homes.com, redfin.com, coldwellbankerhomes.com or remax.com, all U.S.-based real estate websites with global Alexa traffic rankings between 1 and 10,000. In fact, none of the “large” websites that Distil studied were U.S.-based sites.
“According to our data,” Cassetto said, “although the average real estate site is likely to have 30.66 percent bad bot traffic, larger sites are more likely to see higher rates, with the average reaching 48 percent for sites ranked between 1 and 10,000 on the [global] Alexa ranking scale.”
In response to Cassetto’s assertion, Zillow Group spokeswoman Amanda Woolley said in an emailed statement, “As the largest real estate media company, Zillow Group’s sites have much more sophisticated software to block bots from reaching our sites than do many smaller sites. Virtually all of the traffic that makes it to Zillow Group’s sites, and the traffic that is reported by Google Analytics or comScore, is human traffic.” Zillow Group also told Inman that its sites use CAPTCHA technology to identify and reroute bots.
Move spokeswoman Lexie Puckett Holbert said in an emailed statement, “Realtor.com takes fraudulent traffic very seriously and has extensive precautions in place.”
Move did not respond to requests to elaborate on how bad bot activity figures into realtor.com’s traffic numbers.
Back in an August 2013 earnings call, Move said it had invested millions over the years protecting the broker and agent data it receives from MLSs and, at the time, prevented more than 1.5 million scraping attempts daily.
‘Bad bots’ becoming more sophisticated
An overwhelming and rising share of bad bots — 88 percent — are what Distil calls “Advanced Persistent Bots.” Such bots mimic human behavior, disguise their identities, spoof IP addresses, and rotate or distribute their attacks over multiple IP addresses, among other tactics.
“This shows that bot architects have already taken note of traditional bot detection techniques and are finding new sophisticated ways to invade websites and APIs, in an effort to take advantage of critical assets and impact a business’s bottom line,” Essaid said in a statement.
Advanced Persistent Bots are much harder to identify and block than simple bots and fly under the radar of many existing security solutions, according to Distil.
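One way to see why spreading requests over many IP addresses defeats simple defenses is to look at a naive per-IP request limit. The sketch below is purely illustrative and is not Distil’s detection method: the `flag_ips` helper, the limit of 20 requests, and the sample traffic (using reserved documentation IP ranges) are all assumptions made for the example.

```python
from collections import Counter


def flag_ips(requests, limit):
    """Return the set of IP addresses that made more than `limit` requests."""
    counts = Counter(ip for ip, _path in requests)
    return {ip for ip, n in counts.items() if n > limit}


# Hypothetical simple bot: 100 listing requests from a single address.
simple_bot = [("203.0.113.5", f"/listing/{i}") for i in range(100)]

# Hypothetical distributed bot: the same 100 requests spread over 50 addresses.
distributed_bot = [(f"198.51.100.{i % 50}", f"/listing/{i}") for i in range(100)]

print(flag_ips(simple_bot, limit=20))       # the single address is flagged
print(flag_ips(distributed_bot, limit=20))  # empty set: only 2 requests per address
```

Both bots fetch the same 100 pages, but only the single-address bot crosses the per-IP threshold, which is the kind of blind spot the report attributes to traditional bot detection.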
The report noted that cheap or free cloud computing resources let “anyone with basic computer skills download open source software and get into the bot game.”
“Meanwhile, IT infrastructure teams are under increasing pressure to accurately forecast and provision web infrastructure to meet the speed and availability demands of legitimate users,” the report said.
“IT security teams must ensure that nefarious actors can’t harvest their data or breach their defenses. And marketing teams seek accurate data on website and conversion metrics. Yet most companies still have no visibility or control over malicious website traffic.”