Update: Sina responded to Sixth Tone’s request for comment on April 24, clarifying that Weibo users’ personal information, social networks, and self-deleted posts would not be included in the NLC’s digital archive. For posts deleted for other reasons, Sina said it would “comply with national laws and regulations” in determining whether to include them in the database.
The National Library of China will archive over 200 billion Weibo posts as part of a project launched Friday that aims to build a comprehensive record of internet data and preserve China’s digital footprint.
“The project is established for the long-term development of national information security and the informatization of society,” Rao Quan, director of the NLC, said in a speech Friday. The project’s initiator and first partner, Chinese technology giant Sina, will contribute 210 million news articles and 200 billion Weibo microblog posts to the archive “for further research,” according to Rao.
Established in 1998 as a news portal, Sina launched Weibo in 2009. The microblog site had 462 million monthly active users at the end of last year — compared with Twitter’s 321 million — and continues to be one of the most important social platforms in China.
Wang Gaofei, Weibo’s CEO, said at Friday’s launch event that he hopes the archive of news articles and social media posts will serve as a reference for “public policymaking and research on social issues, as well as academic theories.” Sun Yigang, the library’s deputy director, echoed Wang’s view that the database would be used for scientific research and “national strategy,” adding that he expects other online platforms to join the project in the near future.
Archiving online data has become an increasingly popular means of recording the cultural and intellectual legacy of the modern digital world. Countries like the U.K. and the Netherlands have established their own institutes dedicated to digital preservation. In France and Germany, national libraries play the role of preserver.
In 2003, an organization called the International Internet Preservation Consortium, or IIPC, was established to oversee the collection, preservation, and sharing of “knowledge from the global web.” Today, its members hail from 45 countries.
The National Library of China, or NLC, joined the IIPC in 2007. It has archived over 20,000 government websites, as well as information about events such as the SARS epidemic in 2002, the Great Wenchuan Earthquake of 2008, and the Beijing Olympics the same year, according to Sun.
The NLC told Sixth Tone on Tuesday that it has complied with the “technological standards” recommended by the IIPC and said it would share the methodology and data collection models used for its web archive project with fellow IIPC members.
Processing such massive amounts of data, however, presents inherent problems. In 2010, the U.S. Library of Congress began cooperating with Twitter to archive all tweeted text, only to discontinue the project in 2018 after the platform had become much larger and more visual as a medium than the archivists had anticipated.
Though China is not alone in wanting to preserve online culture, some netizens have voiced concerns over user privacy and whether Weibo has the right to turn posts on its platform over to a third party like the NLC for research purposes.
Zhu Wei, an associate professor at China University of Political Science and Law in Beijing and an expert on copyright infringement and data privacy, notes that under Chinese law, Weibo’s users own the copyright to their posts and must give consent or agree to compensation before those posts may be used by any entity other than Weibo.
“In the past, it was common in China for collective works like this to be used without consent or compensation, as in the case of CNKI, China’s largest academic database,” Zhu told Sixth Tone. “Now, however, the public is more aware of the copyrights they possess.” In August 2017, a Chinese copyright society sued CNKI for intellectual property violations, claiming thousands of dollars in damages — a lawsuit CNKI lost.
In response to Sixth Tone’s questions about intellectual property rights, the NLC said it would comply with the relevant laws and regulations in preserving the data and would only preserve public Weibo posts in the early stages of the project.
Zhu asserts that it’s much less complicated to protect the copyrights of digital media than those of physical media and recommends that institutions devote themselves to the legal and rational use of such data. “Even if the NLC is using the data for the public good, it should not trample on users’ private interests,” he said.
Correction: A previous version of this article stated that Weibo had 462 million daily active users at the end of last year. This figure is for monthly active users.
Editor: David Paulk.