The government's release of the 1940 census will give researchers access to 130 million records. Ancestry.com has been preparing for the expected spike in traffic at its website, while applying artificial intelligence to help people find ancestors in its giant databases.
For Ancestry.com, big data is about to get even bigger.
The subscription-based website for finding long-lost relatives already has 6.7 billion historical records and 4.8 billion people named in family trees on its website. But now it's adding the 1940 United States Federal Census, which the federal government will release on Monday.
The National Archives has turned the 1940 census paperwork into more than 3.8 million digital images. The online archive, being released after a 72-year waiting period, will be a gold mine for people just beginning to compile their family history, though it will become easier to use once the images are indexed.
When Ancestry.com's database and index are complete, users will be able to search more than 130 million census records using fields such as name, street address, county and state.
Scott Sorensen, Ancestry.com's senior vice president of engineering and its top IT executive, says that his staff has been busily preparing systems for the expected deluge of search requests.
The company learned its lesson two years ago from a huge spike in website traffic during the TV show "Who Do You Think You Are," in which celebrities such as Sarah Jessica Parker discover clues about their ancestors. At the first commercial break, many inspired viewers apparently dashed to their computers to try their hand at family research.
Ancestry.com had prepared for a 300 percent spike in traffic from TV viewers, but the website was slammed by traffic that was (in some cases) 21 times the usual pattern, which "brought us to our knees," Sorensen says.
Since then, the company has added servers and beefed up its network and infrastructure to support bigger surges in traffic, he says.
The company has nearly 5,000 servers at its data center and uses a variety of tools to handle its big data work, including Hadoop, the distributed data-processing framework; traditional relational database software; the statistical software R; algorithms that employ machine learning, a form of artificial intelligence; and MongoDB, database software that creates linkages among the public family trees posted on the site.
The Provo, Utah-based company had about $400 million in sales last year and has about 1,000 employees, according to Hoovers.com. It currently has 1.7 million subscribers.
The key business goal at Ancestry.com is to broaden its customer base to include people who are curious about their ancestors but aren't experienced researchers. Sorensen's job is to use technology to make the discovery of ancestors as easy as possible, so that first-time searchers don't go away disappointed.
Consequently, his technology group works to improve customer metrics such as "time to first discovery" and (for long-time subscribers) "number of discoveries in a week." The company continues to enhance the "power-user tools" for sophisticated researchers, too, Sorensen says.
Three years ago, most ancestor discoveries were made through the company's custom search engine, but now more discoveries are made through "hinting," whereby Ancestry.com's artificial intelligence technology suggests likely connections or records.
"We take the massive amounts of data we have, and the billions of records that people have attached to the family trees, to do record linking and record matching," Sorensen says. "So you start with 40 million Smith names, and then 4 million John Smiths, but what you want are the four records about your great-great-grandfather John Smith. Our record-linking technology will try to surface those four records and give you a hint. We try to make those discoveries more automatic."
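To make the idea of record linking concrete, here is a minimal Python sketch of how a few likely matches might be surfaced from a large pool of same-name records. It is an illustration only, not Ancestry.com's actual method: the `Record` fields, the `match_score` weights, the five-year birth window and the 0.85 threshold are all assumptions chosen for the example, and a production system would presumably learn such weights from the records users have already attached to their trees.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """A hypothetical, simplified historical record."""
    name: str
    birth_year: int
    county: str
    state: str


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(person: Record, candidate: Record) -> float:
    """Combine name, birth year and place into one score (weights are arbitrary)."""
    score = 0.5 * name_similarity(person.name, candidate.name)
    # Birth years more than five years apart contribute nothing.
    year_gap = abs(person.birth_year - candidate.birth_year)
    score += 0.3 * max(0.0, 1.0 - year_gap / 5.0)
    # Same state earns a small bonus; same county within that state earns more.
    if person.state.lower() == candidate.state.lower():
        score += 0.1
        if person.county.lower() == candidate.county.lower():
            score += 0.1
    return score


def hints(person: Record, candidates: list[Record],
          threshold: float = 0.85, limit: int = 4) -> list[Record]:
    """Return up to `limit` candidates scoring at or above `threshold`, best first."""
    scored = [(match_score(person, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept[:limit]]


# Made-up sample data: one ancestor from a family tree and four candidate records.
ancestor = Record("John Smith", 1872, "Utah", "Utah")
candidates = [
    Record("John Smith", 1872, "Utah", "Utah"),      # same person, exact match
    Record("Jon Smith", 1873, "Utah", "Utah"),       # likely a spelling variant
    Record("John Smith", 1901, "Cook", "Illinois"),  # same name, wrong time and place
    Record("Mary Smith", 1872, "Utah", "Utah"),      # same place, different person
]

for hint in hints(ancestor, candidates):
    print(hint)
```

The point of the sketch is that the name alone is a weak signal when millions of records share it; combining it with date and place is what narrows the pool to a handful of plausible hints.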
What does the future hold? Sorensen says he envisions a time when the company adds socio-economic data to the classic genealogical data to provide more colorful information and context about ancestors. He offered this example: "I can see [from the 1930 census] that my great-great-grandfather had a radio, and was the only person on the block to have a radio. Then [with socio-economic data] here's the additional color that shows what percentage of people had a radio in that time and place."
Mitch Betts is CIO magazine's executive editor. Follow him on Twitter: @mitchbetts.