The Affair of the Vanishing Content

"Digitized information, especially on the Internet, has such rapid turnover these days that total loss is the norm. Civilization is developing severe amnesia as a result; indeed it may have become too amnesiac already to notice the problem properly." (Stewart Brand, President, The Long Now Foundation)

Thousands of articles and essays posted by hundreds of authors were lost forever when the site that hosted them unexpectedly shut its virtual gates. A sizable portion of the 1960 census, recorded on UNIVAC II-A tapes, is now inaccessible. Web hosts crash daily, erasing valuable content in the process. Access to web sites is often suspended - or blocked altogether - because of a real (or imagined) violation by the webmaster of the host's Terms of Service (TOS). Millions of other web sites - the results of collective, multi-annual, transcontinental efforts - contain unique stores of information in the form of databases, articles, discussion threads, and links to other web sites. Consider "Central Europe Review". Its archives comprise more than 2500 articles and essays about every conceivable aspect of Central and Eastern Europe and the Balkans. It is one of countless such collections.

Similar and much larger treasures have perished since the dawn of the digital age in the 1920s. Very few early radio and TV programs have survived, for instance. The current "digital dark age" can be compared only to the one which followed the torching of the Library of Alexandria. The more accessible and abundant the information available to us - the more devalued and common it becomes and the less institutional and cultural memory we seem to possess. In the battle between paper and screen, the former has won formidably. Newspaper archives dating back to the 1700s are now being digitized - testifying to the endurance, resilience, and longevity of paper.

Enter the "Internet Libraries", or Digital Archival Repositories (DAR). These are libraries that provide free access to digital materials replicated across multiple servers ("safety in redundancy"). They contain Web pages, television programming, films, e-books, archives of discussion lists, etc. Such materials can help linguists trace the development of language, journalists conduct research, scholars compare notes, students learn, and teachers teach. The Internet's evolution closely mirrors the social and cultural history of North America at the end of the 20th century. If it is not preserved, our understanding of who we are and where we are going will be severely hampered. The clues to our future lie ensconced in our past. Preserving it is the only guarantee against repeating the mistakes of our predecessors. Long gone Web pages cached by the likes of Google and Alexa constitute the first tier of such an archival undertaking.

The Stanford Archival Vault (SAV) at Stanford University assigns a numerical handle to every digital "object" (record) in a repository. The handle is the clever numerical result of a mathematical formula applied to the information bits in the original object being deposited. This makes it possible to track and uniquely identify records across multiple repositories. It also prevents tampering. SAV also offers application layers. These allow programmers to develop digital archive software and permit users to change the "view" (the interface) of an archive and thus to mine data. Its "reliability layer" verifies the completeness and accuracy of digital repositories.

The Internet Archive, a leading digital depository, in its own words: "...is working to prevent the Internet - a new medium with major historical significance - and other "born-digital" materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to permanently preserve a record of public material."

Data storage is the first phase. It is not as simple as it sounds. The proliferation of formats of digital content has made it necessary to develop a standard for archiving Internet objects. The size of the digitized collections must pose a serious challenge as far as timely retrieval is concerned. Interoperability issues (numerous formats and readers) probably require software and hardware plug-ins to render a smooth and transparent user interface.

Moreover, as time passes, digital data stored on magnetic media tend to deteriorate. They must be copied to newer media every 10 years or so ("migration"). Advances in hardware and software applications render many of the digital records indecipherable (try reading your word processing files from 1981, stored on 5.25" floppies!). Special emulators of older hardware and software must be used to decode ancient data files. And, to ameliorate the impact of inevitable natural disasters, accidents, bankruptcies of publishers, and politically motivated destruction of data - multiple copies and redundant systems and archives must be maintained. As time passes, data formatting "dictionaries" will be needed. Data preservation is hardly useful if the data cannot be searched, retrieved, extracted, and researched. And, as "The Economist" put it ("The Economist Technology Quarterly", September 22nd, 2001), without a "Rosetta Stone" of data formats, future deciphering of the stored data might prove to be an insurmountable obstacle.

Last, but by no means least, Internet libraries are Internet based. They themselves are as ephemeral as the historical record they aim to preserve. This tenuous cyber existence goes a long way towards explaining why our paperless offices consume much more paper than ever before.
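
To make the idea of a numerical handle more tangible, here is a minimal sketch of how a repository might derive a handle from the bits of a deposited object and later use it to detect tampering or decay. It is an illustration only, not SAV's actual formula or software: the choice of SHA-256, the function name make_handle, and the in-memory Repository class are all assumptions made for this example.

```python
import hashlib
from typing import Dict, List


def make_handle(data: bytes) -> str:
    """Derive a handle from an object's bits.

    SHA-256 is an assumption for this sketch; the essay does not
    say which formula the Stanford Archival Vault uses.
    """
    return hashlib.sha256(data).hexdigest()


class Repository:
    """Hypothetical in-memory repository keyed by content-derived handles."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def deposit(self, data: bytes) -> str:
        # The same bits always yield the same handle, so a record can be
        # identified across independent repositories.
        handle = make_handle(data)
        self._objects[handle] = data
        return handle

    def fetch(self, handle: str) -> bytes:
        return self._objects[handle]

    def verify(self) -> List[str]:
        # Recompute every handle; a mismatch means the stored bits
        # were altered or have deteriorated since deposit.
        return [h for h, d in self._objects.items() if make_handle(d) != h]


if __name__ == "__main__":
    stanford, mirror = Repository(), Repository()
    h1 = stanford.deposit(b"Central Europe Review, article 1")
    h2 = mirror.deposit(b"Central Europe Review, article 1")
    print(h1 == h2)           # identical content, identical handle
    print(stanford.verify())  # empty list while nothing has changed
```

Because the handle depends only on the content, two repositories that hold the same record agree on its identifier without coordinating, and a periodic verify() pass is one simple way a "reliability layer" could flag a damaged copy so that it can be restored from a redundant one.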