September 2024
Advisory Committee on Historical Diplomatic Documentation, September 9–10, 2024
Minutes
Committee Members
- James Goldgeier, Chair
- Kristin Hoganson
- Sharon Leon
- Nancy McGovern
- Timothy Naftali
- Deborah Pearlstein
- Elizabeth Saunders
- Kori Schake
- Sarah Snyder
Office of the Historian
- Kristin Ahlberg
- Carl Ashley
- Margaret Ball
- Forrest Barnum
- Sara Berndt
- Josh Botts
- Tiffany Cabrera
- Mandy Chalou
- Elizabeth Charles
- Kathryn David
- Cynthia Doell
- Thomas Faith
- Stephanie Freeman
- David Geyer
- Renée Goings
- Ben Greene
- Michelle Guzman
- Charles Hawley
- Kerry Hite
- Adam Howard
- Richard Hulver
- Alina Khachtourian
- Virginia Kinniburgh
- Laura Kolar
- Aaron Marrs
- Michael McCoyer
- Brad Morith
- Christopher Morrison
- David Nickles
- Nicole Orphanides
- Paul Pitman
- Alexander Poster
- John Powers
- Kathleen Rasmussen
- Matthew Regan
- Amanda Ross
- Seth Rotramel
- Daniel Rubin
- Ashley Schofield
- Nathaniel Smith
- Douglas Sun
- Claudia Swain
- Brooks Swett
- Melissa Jane Taylor
- Chris Tudda
- Dean Weatherhead
- Joseph Wicentowski
- Alex Wieland
- Tristan Williams
- James Wilson
- Louise Woodroofe
Bureau of Administration
- Jeff Charlston
- Corynne Gerow
- Timothy Kootz
- Mallory Rogoff
Department of Defense
- J.D. Smith
National Archives and Records Administration
- William Bosanko
- William Fischer
- David Langbart
- Don McIlwain
- Mark Sgambettera
Public
- Over 50 members of the public
Open Session, September 9
Presentation on the Office of the Historian’s Experiments Using Today’s Artificial Intelligence Tools for Historical Inquiry (see video and transcript below)
James Goldgeier opened the session by introducing himself and by welcoming all attendees, in person and online. He then noted that Adriane Lentz-Smith had rotated off the Committee after the June meeting and introduced Elizabeth Saunders of Columbia University as her replacement. He then turned the floor over to Adam Howard.
Howard introduced Joe Wicentowski’s presentation on AI. He noted that Wicentowski’s talk followed previous presentations on artificial intelligence (AI) by the Department of State and other interagency partners. Howard said he would moderate the Q&A following the presentation.
Wicentowski began his presentation on the Office of the Historian’s experiments using today’s artificial intelligence tools for historical inquiry, including the use of large language models and retrieval augmented generation interfaces to published documents from the Foreign Relations of the United States (FRUS) series and the use of multimodal models to transcribe handwritten historical documents.
The presentation began with a preface celebrating the 15th anniversary of OH’s public website, history.state.gov. Through a partnership with the University of Wisconsin-Madison, the Office digitized and converted publications to TEI (Text Encoding Initiative) format, making all printed volumes and legacy publications available online. The history.state.gov website receives over 10 million visitors annually, with significant international traffic, and plans are underway to make FRUS digital editions accessible via the Libby e-book lending app.
Wicentowski started his introduction to AI, which focused on generative AI tools and LLMs, not the broader field of machine learning. Wicentowski said he was initially skeptical about the reliability and provenance of AI-generated information, as these tools are trained on vast, undisclosed internet data. However, practical applications have since emerged: tools like ChatGPT and Claude can be used to interrogate documents by asking questions in plain English. Despite limitations such as context window limits, which restrict the amount of information the tool can process at once, these tools showed potential for asking questions about specific documents.
Wicentowski went on to note that an AI tool developed by Amazon was tested for reviewing annotations against a complex style guide, showing promise in identifying mistakes and inconsistencies. The tool correctly identified some mistakes and flagged issues worth checking, revealing some inconsistencies in the style guide. This led to the exploration of using AI to improve the style guide itself, making it more reliable for humans.
Wicentowski then shared that the most impressive development was the use of multimodal AI models, particularly Google’s Gemini 1.5 Pro, to transcribe challenging handwritten documents. Traditional Optical Character Recognition (OCR) struggles with handwritten documents, but multimodal AI models can transcribe handwritten text from images. The Gemini 1.5 Pro model successfully transcribed a challenging set of handwritten index cards, significantly improving accessibility and searchability. The tool processed each card in 15–20 seconds and completed all 8,600 scanned images in 48 hours, producing usable draft transcriptions that will need to be reviewed.
Wicentowski concluded that the results of these experiments show that generative AI tools have great potential for some portions of the work of transcribing, annotating, and querying historical documents and sources. The tools exhibit clear shortcomings, making them inappropriate for some tasks, but for other tasks, the office was able to mitigate these issues and derive utility through persistent and careful experimentation and close review.
Howard then opened the floor for questions and noted that James Wilson would moderate any online questions and comments.
David Nickles noted that users from India made up the second-largest share of visitors to the OH website and that China was not even in the top 10. If governments will not allow their citizens to access the OH site, how do we know who cannot access the website?
In response, Wicentowski shared that it is very difficult to answer, since our analytics tools only tell us how many visits we do receive.
Goldgeier said it was great that Wicentowski and OH are doing this kind of work on the possibilities of using AI. He noted that the Consular Records cards reminded him of his research in the Anthony Lake Papers at the Library of Congress. He noted how many handwritten notes were on index cards and that he was stunned by how many references there were to Haiti and Somalia, rather than Russia and/or China, and wondered if AI could analyze those cards and determine that number. He then asked how receptive Google and Amazon personnel have been to Wicentowski’s efforts to use AI, and whether they give feedback on their tools.
Wicentowski replied that Google and Amazon engineers have not only been receptive to his feedback, questions, and suggestions, but also encouraged it and are eager for feedback.
Bill McAllister asked online whether a researcher could ask an AI tool to find all the important documents on a given subject and therefore eliminate the need for FRUS compilers.
Wicentowski acknowledged the difficulty of the question and invited Kathleen Rasmussen to offer her thoughts. Goldgeier agreed and said he would like to hear Rasmussen’s response to the question and shared that he believes that AI can help, not replace, compilers. Can a compiler ask an AI tool to assess the importance of a set of documents being considered for inclusion in a FRUS volume?
Rasmussen said it is a great question: how can AI help rather than overtake or replace historians? Can AI help us research in giant documentary databases such as State’s e-records? Some historians have indeed experimented with this, but the results have been unclear so far.
Sharon Leon asked Wicentowski whether enriching published FRUS volumes with technologies such as Named Entity Recognition (NER) and topic modeling might offer concrete benefits to readers of FRUS. Wicentowski agreed that these technologies are promising and worth investigating. Until the OH website is able to offer such capabilities, he encouraged any interested parties to access the FRUS source files on GitHub and apply their own tools to these materials. FRUS is in the public domain and can be used by anyone for data mining, and OH would welcome such experimentation and sharing of results. Leon noted that running topic modeling software live on public-facing websites can require extensive resources and investments; therefore, rather than adding live, dynamic topic modeling capabilities, OH could pre-process FRUS volumes and offer static views of the data instead.
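As an illustration of the kind of outside experimentation described above, a minimal sketch follows. It assumes a local plain-text copy of a FRUS document and the open-source spaCy library; the file name is a placeholder, and nothing here reflects tooling OH has actually built.

```python
# Minimal sketch: named entity recognition over a FRUS document.
# Assumes spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
# "frus_document.txt" is a hypothetical local plain-text export of one document.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

with open("frus_document.txt", encoding="utf-8") as f:
    text = f.read()

doc = nlp(text)

# Count the people, places, and organizations spaCy recognizes.
entities = Counter(
    (ent.text, ent.label_)
    for ent in doc.ents
    if ent.label_ in {"PERSON", "GPE", "ORG"}
)

for (name, label), count in entities.most_common(20):
    print(f"{label:7} {count:4}  {name}")
```

The counts produced this way could be pre-computed and published as static views of the data, in the spirit of Leon’s suggestion.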
Saunders said she had learned more about AI from this presentation than any other she had experienced. She noted her first writing project relied on trying to interpret President Kennedy’s handwriting, so she understands how difficult it is for any tool, including AI, to decipher handwritten documents. Her question was whether AI will get stupider over time. In other words, how can an AI tool overcome or interpret the biases and egos and the like that are reflected in documents that are, after all, created by human beings? She then mentioned an upcoming journal article in which the author examines Presidential Daily Briefs and identifies the use of racial tropes in the material submitted to the Presidents. Can the FRUS corpus of documents also be analyzed by an AI tool to uncover such biased language? How do human biases get reflected in FRUS documentation?
Wicentowski replied that he recognizes that there could be a problem with the language used in original documents, and since the text of all FRUS volumes is online, it has almost certainly been included in the set of materials used to train most large language models. However, the entire text of published FRUS is so small compared to the total size of the corpus needed to train large language models that it alone would not likely contribute meaningfully to the bias encoded in a large language model.
Saunders asked if the AI tool accepts the language as “how people will talk.” Wicentowski replied that his hunch is that researchers using AI have to ask the right research questions so that the tool can analyze the documents the way the researcher wants. For example, you could ask the tool to “show me an example of racist imagery,” but we don’t know whether the tool can distinguish or recognize such language. Thus, the more specifically the question can be posed to the AI tool, the more likely it is that the tool will return an accurate answer. The fine-tuning process used to train the AI tool to respond in a “pleasing” way to the questioner can lead it to skew its answers to meet the user’s preconceptions, so we need to both craft our questions explicitly and judge its responses critically.
Snyder asked about the Consular Records cards that had been scanned and when/if those will be made available on OH’s website.
Wicentowski said that OH doesn’t have a timeline for making them available but would like to once they can be reviewed.
Snyder noted that there is a lot of manual work involved in augmenting AI work.
The session ended at 10:58 a.m.
Opening of the Meeting
Goldgeier opened the session by asking for and receiving approval of the minutes from the June meeting. He then welcomed new FSI Deputy Director Maria Brewer and invited her to offer comments to the Committee and meeting attendees.
Remarks from FSI Deputy Director Maria Brewer
Brewer noted she was pleased to attend and was looking forward to getting to know the Committee members better. She said FSI and the Department are committed to continue making positive progress on the issues under the Committee’s purview. Brewer went on to provide information on her own background and her current responsibilities at FSI.
Brewer welcomed Elizabeth Saunders to the HAC and also welcomed new FSI/OH historians hired since the last HAC meeting. She noted that FSI/OH had made 15 hires in the last year and two more were expected before the December HAC meeting, while progress was also being made on filling the recently vacated Assistant to the General Editor position. She also called attention to the recent release of the latest volume, Foreign Relations of the United States, 1981–1988, Volume XXXVIII, International Economic Development; International Debt; Foreign Assistance.
Brewer explained to the Committee that FSI was in the process of reviewing and working through how to act on the latest HAC report’s recommendations. She closed with a brief discussion of FSI’s recent creation of and hiring for a new Provost position and their supporting team.
Goldgeier thanked Brewer for her remarks and invited Howard to make comments. Howard welcomed the new historians that FSI/OH had recently hired and called attention to the release of the latest FRUS volume. He explained that this volume was one of the first produced under the FRUS modernization initiative and that Renée Goings deserved great credit for her leadership in making FRUS modernization happen.
Remarks from the General Editor
Rasmussen discussed the recently released FRUS volume in greater detail. She highlighted the significance of the topics it covers and described some of the volume’s structure and specific content. Rasmussen also congratulated the expansive team that contributed to the research, compilation, review, declassification, editing, and publication of the volume.
Remarks from the Director of Declassification, Publishing, and Digital Initiatives
Powers welcomed two new members to the Declassification, Publishing, and Digital Initiatives (DPD) team that he leads. He also highlighted how many staff members contribute to getting each volume published.
Report from the Office of Information Programs and Services
Mallory Rogoff, Agency Records Officer and Division Chief of the Records and Archives Management Division in the Office of Information Programs and Services (IPS), described progress and developments in the Department’s records management program. She explained that IPS had established an Overseas Records Branch six months ago in recognition of the unique records management challenges at overseas posts. Consequently, the Department now has a team specifically dedicated to better engaging and serving overseas posts, and this initiative will further enhance the Department’s transparency efforts.
Rogoff explained that there are over 200 overseas missions that have a wide range of challenges. Some have incredibly high turnover, while others are “micro-missions.” The new Overseas Records Branch is learning each mission’s operating environment to better assist them with records management. Rogoff noted that in the 1980s and 1990s records management staff engaged in some overseas travel to assist posts, but that ended with cutbacks in the 1990s. Now, overseas records management travel is returning and they also are making use of virtual platforms to assist posts. Rogoff noted that posts greatly appreciate this new level of support. She concluded by providing some examples of the issues they address, including disposition of very old paper records, emergency preparedness regarding records, and the transition to electronic records management.
Kristin Hoganson asked Rogoff how they were addressing overseas posts’ practices in relation to the Office of Management and Budget (OMB) electronic records mandate. Rogoff responded that the Department had submitted an “exception request” to NARA to keep some permanent paper records in place but noted that records management staff often finds that posts’ paper records are temporary and thus eligible for destruction. Rogoff said that NARA had not yet responded to the request.
Hoganson also asked Rogoff for an update on the status of the transfer of the 1982 Central Foreign Policy Files to NARA. Rogoff responded that Agency Records Officer Tim Kootz would address this in depth during the Committee’s closed session.
Closed Session
Report from Information Programs and Services (IPS)
Deputy Assistant Secretary Timothy Kootz stated that the number one problem the Department faces is not the process of transferring the records to NARA but digitizing the content. He wondered whether a public-private partnership would be feasible, since the Department does not have enough funding in the near term to digitize these records. The United Nations and NATO had recently accomplished this, and a public-private partnership had created State’s own Diplomacy Center.
David Langbart commented that a major roadblock is that State had not completed review of the P reel index and had no capability to create withdrawal slips as they had done in the past. Also, no meeting had been scheduled between the Department and NARA.
Timothy Naftali asked if State had a foundation, or another way for outside organizations to fund the digitization. The project could be presented as a public-private transparency initiative and generate positive publicity for contributors. Kootz replied that this was a great idea.
Hoganson noted that these ventures would be expensive and that historical organizations lack the resources to help. A concern is that contributors to these projects might place digitized public records behind private paywalls. Kootz agreed that this would be a dealbreaker and observed that DOD was currently working with the University of Maryland. He restated that State would fully coordinate the records transfer with NARA. For instance, State was working with NARA to solve an open-ended problem with the P reel index.
Kootz reported that the FOIA staff in Charleston, South Carolina, was finally in place and in full operation. The staff closed 5,000 more FOIA cases than the previous year, but also is receiving more FOIA cases than ever before. IPS policy is to tackle the new requests first, then deal with the backlog. Bot requests are skewing the volume of FOIA requests and this is happening across the federal government. State is working with the DOJ and other agencies to resolve this challenge.
Snyder asked whether FOIA cases are closed through resolution or non-response, and whether there is any indication of the bot creators’ agenda, if IPS believes the bot activity is deliberate.
Kootz confirmed it was deliberate. The requests are overly broad and use the FOIA requester page, which currently has no CAPTCHA filter. IPS issues “still interested” letters, and often older requests have been overtaken by events. With limited resources and a large FOIA backlog, where should IPS devote its efforts? One initiative IPS is using is more personal contact with requesters. This way it can often narrow the scope of a request and close the case faster. It helps that there are many new and very motivated FOIA employees. IPS is also going through a reorganization, and the plan is to have this completed by January 2025.
Naftali concurred that there should not be a perpetual paywall, but an arrangement might be made for immediate closed digital access to those records at the presidential libraries, NARA, and the Bunche Library, with free online access after five years behind a paywall. Kootz agreed there was merit to these ideas.
Langbart added that this was essentially what NARA did with its digital partner initiatives 15–20 years ago. Ancestry digitized genealogy records and put them behind a paywall, but the records were free at any NARA facility. After 5–7 years, Ancestry gave the digitized records (sans metadata) to NARA.
Leon stated that when Ancestry partnered with individual states, they also did not supply them with the metadata. With only images, it is more difficult to navigate the records. It is important to get everything you need out of these partnerships.
McGovern noted the cultural clash between the library and archival communities and the difference between library skills and archival skills. Kootz acknowledged that federal records staffers are better at collecting records than providing access. State’s goal with their FOIA reading room is to deliver better access to the public.
Goldgeier asked about the Department’s progress reviewing documents using machine learning and AI considering reports that other countries are developing similar capabilities. Kootz agreed that there was a problem because it was possible to compare redactions across thousands of documents and use their inconsistencies to reveal excised information in other documents.
Kootz disagreed that the solution was to release less or create unclassified summaries of the affected documents.
Kootz then introduced J.D. Smith of the Department of Defense (DOD).
Smith described the DOD approach to using AI for declassification. Smith opened by noting that AI could contribute to exceptionally grave threats to national security. He pointed out that the interagency discussion on the use of AI for declassification has been led by Kootz at State and Smith at DOD and stressed the need for a federal approach.
Goldgeier asked whether the revision of the Executive Order addressed AI. Smith stated that the draft EO contained language on AI but that there was no money for it. Kootz added that the draft EO included language about what to do but not how to do it. Powers added that the EO will represent an interim compromise. The federal government needs to know about its vulnerabilities before adversaries do, and while some see AI as a threat, Powers suggested that AI represents an opportunity. Kootz noted that the government needs to provide the public with tools to counter deep fakes, and that pulling things back may send people to false sources.
Report from the Department of Defense
Smith began his official report, and Goldgeier noted the HAC’s appreciation of the Records and Declassification Division’s (RDD) successful effort to overcome the backlog in FRUS declassification reviews. Smith stated that it “took a village” to take on the backlog. Within RDD, Scott Beaton worked out a procedure for processing the 1,800 backlog cases; RDD also developed an electronic system to track cases and generate response letters. Smith also praised OH declassification staffers who helped RDD understand the FRUS mission and provided invaluable support. DOD has 26 separate declassification offices. RDD can handle equities from the Office of the Secretary of Defense (OSD) and the Joint Staff (JS) directly, but it must refer out documents that contain other equities. RDD’s FRUS reviews release about 70% of documents in full and 30% in part; only about 1% are denied in full. Direct informal exchanges with compilers may make it possible to release additional historically significant material. RDD’s goal is to produce reviews that generate no appeals, making it possible to get FRUS volumes out sooner. Overall, Smith and the RDD team seek to work as partners, rather than impediments.
Naftali asked how DOD could apply the results of FRUS reviews to other declassification reviews. Smith said that at present the results are provided to human reviewers.
Howard asked how RDD could get more waivers from other DOD components. Smith said that it was a matter of building relations and earning trust.
At the end of the discussion, Smith underlined that RDD’s success was due in large part to input from OH for which he remains grateful.
September 10
Closed Session
Meeting with the Archivist of the United States
The Archivist of the United States, Dr. Colleen J. Shogan, made a brief presentation to the committee discussing various initiatives that NARA is undertaking. After her presentation, the members of the committee and Dr. Shogan had a question-and-answer period covering various aspects of NARA’s new initiatives, records issues, staffing, and more.
Below is a video edition and transcript of the lecture presented by Dr. Joseph Wicentowski during the public session.
Experiments using artificial intelligence tools for historical inquiry (Video and transcript)
WICENTOWSKI: Thank you, Adam. Good morning, everyone. I always appreciate the opportunity to speak at this forum about the Office of the Historian’s digital initiatives.
First, I’d like to briefly note an anniversary before moving on to the main topic. Next slide, please. This year, the office’s public website, history.state.gov, turned 15 years old. I recall presenting the website at a HAC meeting shortly after its launch in 2009. As I said then in introducing the website, our goal in creating history.state.gov was to uphold the best traditions of diplomatic documentary editing while leveraging the flexibility and power of the internet to give readers new tools for researching with FRUS and our other publications and datasets that were impossible or impractical with print. Next slide, please.
I just have a couple of slides here showing the website. Here is a landing page of a FRUS volume with the ability to search within the volume and download PDF and eBook versions of the volume. Next slide. Here is a document view showing the text of the document, links to the original page images, and a virtual table of contents helping you know where you are in the volume however you get to it. Next slide. This is continuing in the document view, showing the persons sidebar, which is a dynamic list of the people mentioned in the document, with their descriptions drawn from the persons list in the volume’s front matter.
The key to achieving that vision was adopting a new electronic format for our volumes that could capture the content, structure, and semantics of our volumes. The new format couldn’t be too rigid or it wouldn’t have been able to accommodate the natural variations that marked FRUS over its 150 years, now 163 years, in print. Among the various choices for digital formats, we adopted the Text Encoding Initiative, or TEI, the de facto standard for digital text projects in the humanities. And here is an example. We’ll stay on this slide for the next paragraph or so, showing the underlying TEI behind the document that we just viewed.
Having selected TEI as our format, we turned to digitizing and releasing our volumes. Thanks to a partnership with the University of Wisconsin-Madison, which had already scanned 100 years’ worth of our publications, 1861 through 1960, we were able to create TEI editions of our volumes without needing to re-scan these books. Next slide. Month by month, year by year, our digitization vendors gradually converted all of our printed publications to the new format, which we, namely Virginia Kinniburgh here, reviewed for accuracy. Next slide. Our website and GitHub repositories, shown here, now offer all printed volumes in the FRUS series, as well as all of our legacy electronic-only publications and five of 13 legacy microfiche supplements, the remaining eight of which are in the pipeline for release as resources allow. Each volume encoded in TEI can be browsed like a book or searched like a database. Next slide, please. Readers can search across the entire corpus, now over 310,000 documents, using keywords and dates, as shown here.
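To make concrete what “searched like a database” can mean, here is a minimal sketch that lists document headings from a TEI volume file using only Python’s standard library. The file name is a placeholder, and the div type="document" convention is an assumption about the encoding; the authoritative schema lives in the GitHub repository.

```python
# Minimal sketch: list document headings from a TEI-encoded FRUS volume.
# "frus1969-76v01.xml" is a hypothetical local copy of a volume's TEI file;
# the div/@type="document" convention is an assumption about the encoding.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # the TEI namespace

tree = ET.parse("frus1969-76v01.xml")
root = tree.getroot()

for div in root.iter(f"{TEI}div"):
    if div.get("type") == "document":
        head = div.find(f"{TEI}head")
        title = "".join(head.itertext()).strip() if head is not None else "(no heading)"
        print(div.get("n", "?"), title)
```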
Each year—next slide, please—each year, the website receives over 10 million visitors and is among the Department of State’s top five public engagement websites. It’s notable that nearly half of our visitors come from outside the United States. In the coming months, we are excited to make the FRUS digital edition even more accessible by bringing FRUS to Libby, the e-book lending app used by many libraries. More to come on that soon.
Today, I am presenting on an area of great promise and great uncertainty, artificial intelligence, or AI. To be precise, I am discussing the so-called generative AI tools and large language models, LLMs, made popular by OpenAI’s ChatGPT, not the larger field of machine learning, or ML. I will describe our experiments with these tools, which began toward the end of last year. First, a caveat. For the rest of my talk, my mentions of ChatGPT and the other tools in this field don’t constitute an endorsement of these products.
ChatGPT was released in November 2022, and it was quickly joined by a robust group of competitors. Next slide, please. But for a full year, possibly because I suffer from a lack of imagination, I couldn’t see a direct application of this technology to our work. When you ask one of these tools a question, they present a confident answer. But where did this answer come from? How were these tools trained? As Kathy Rasmussen, the General Editor of the FRUS series, put the question to a team of engineers we were meeting with: “Where did this AI go to school?” The answer is that ChatGPT and similar generative AI tools are trained on a vast swath of the internet at the cost of tens of millions of dollars for each generation of the tool. But the companies that produce these tools don’t list the precise sources that they consult. When you ask the tool to provide citations for an answer, it will. But as with any factual questions you may put to these tools, the answers may well be made up or hallucinated. So obtaining reliable information about the provenance of training inputs and generated outputs is a major question for generative AI tools.
It’s easy to fall under the illusion that the tool producing words on the screen is a thinking sentient being containing the sum of all human knowledge. But despite their impressive abilities, these generative AI tools don’t actually think at all. They merely use a statistical model to generate an answer, one word at a time, each next word being what the tool judges to have the highest statistical likelihood of being used in the context of your conversation with it and its vast training set and data from the internet. This is why the tools are called large language models or LLMs. They are remarkable, but they are models of linguistic probability produced by training on large amounts of text.
To address the training and provenance problems, we could theoretically train our own model on our own documents, but the costs of training models make this option prohibitive, not to mention that truly vast amounts of data are required to train a model from scratch, far larger than say the corpus of FRUS.
So short of training our own model, what could we use these generative AI tools for? Many reports in the media praise ChatGPT’s ability to write in the style of Shakespeare or in the voice of a pirate. Should we instruct ChatGPT to compose a sonnet in the style of Henry Kissinger? This was hardly a compelling idea, and I haven’t tried. These are the reasons why I initially dismissed tools like ChatGPT as an impressive technical feat, but ultimately a parlor trick without direct applications for historical inquiry. What could you do with a tool whose output, you have to assume, is a hallucination that might some of the time be accurate?
My thinking began to change one year later, this past November, when four developments came to my attention in rapid succession. These discoveries led me and several colleagues to dedicate time since then to investigate the potential uses of AI for historical inquiry in general and in documentary editing specifically.
What capabilities grabbed our attention? Next slide, please. Tools like ChatGPT and Claude began to allow users to upload a file containing, say, a document or an article, and ask the tool questions about the file. What is this article about? What does the author argue? Finally, we could focus the tool on our own data, the data of interest to us. Instead of relying on the tool’s undisclosed training materials for answers of dubious origin, we could do something never possible before: interrogate our own documents using plain English, thanks to the LLM’s ability to process natural language. This was starting to get interesting. If you haven’t done it before, I would encourage you to go to ChatGPT, Claude, or Gemini, upload an article, and ask the tool to summarize the article, or ask a question about some of the arguments in the article.
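For those who prefer to script this rather than use a chat interface, a minimal sketch follows, using Anthropic’s Python SDK purely as one example; the model name and file path are illustrative assumptions, and the other vendors’ APIs work similarly.

```python
# Minimal sketch: ask a question about a single document via an LLM API.
# Assumes the Anthropic Python SDK is installed and ANTHROPIC_API_KEY is set;
# the model name and file path are illustrative assumptions.
import anthropic

with open("article.txt", encoding="utf-8") as f:
    article = f.read()

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": (
                "Here is an article:\n\n" + article
                + "\n\nIn two or three sentences, what does the author argue?"
            ),
        }
    ],
)

print(response.content[0].text)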
Still, this technique has some limits. Next slide, please. These tools can only keep a certain amount of information in their head during a conversation. Specifically, they have a fixed limit on the number of words they can keep in their short-term memory. This limit is called the context window limit. The longer you talk to the tool, the more likely it is that the sum of the words in your conversation will exceed this limit, and the tool will literally forget the beginning of your conversation. As of my last survey a few months ago, ChatGPT’s limit was approximately 25,000 words, or 80 pages. The practical effect of this context window limit is that you might upload an article or a document that is so long that the tool can’t keep the whole thing in its memory, or it quickly exhausts the limit after a few questions. As a result, it won’t be able to provide a comprehensive answer. Competing tools offer higher limits than ChatGPT: Anthropic’s Claude Opus model boasts a context window six times higher, 150,000 words, or 500 pages, and Google’s Gemini 1.5 Pro offers a limit three times higher still, 750,000 words, or 2,500 pages. That is finally enough to encompass even the largest FRUS volume and is probably enough room for two average-sized FRUS volumes.
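One crude workaround for the context window limit is to split a long document into pieces that each fit under it and question each piece separately. The sketch below shows only that splitting step; the limits are really counted in tokens rather than words, so the word-count threshold here is a rough assumption, and the file name is a placeholder.

```python
# Minimal sketch: split a long document into word-count chunks that fit
# under an assumed context window limit, so each chunk can be sent to a
# model separately.
def split_into_chunks(text: str, max_words: int = 25_000) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

with open("long_memoir.txt", encoding="utf-8") as f:
    chunks = split_into_chunks(f.read())

print(f"{len(chunks)} chunks, largest is {max(len(c.split()) for c in chunks)} words")
```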
Besides limits on the amount of data that the tools can accept as input, the tools also limit the number of words they can produce as output. ChatGPT is capped at 1,500 words, or about five pages. Claude doubles that, 3,000 words, or about 10 pages. Gemini doubles that again, 6,000 words, or about 20 pages. But no matter which you use, the length of the answers to your questions is finite. Nonetheless, the possibilities are intriguing.
Can you think of any uses for asking questions about articles or documents? One political scientist who uploaded all of his published papers to Google’s Gemini wrote that he was impressed with the quality of the tool’s answers and confirmed that they reflected the conclusions he had reached in his work. But for the Office of the Historian, with our collection of over 550 FRUS volumes, even the largest models lack a sufficiently large context window to be able to answer arbitrary questions about our corpus, or to ask questions about even larger collections of data, such as the archives we do our research in. So for now, the context window limit prevents us from directly interrogating large corpora. On the other hand, we are confident that since historians are hardly the only profession that would benefit from the ability to ask questions of a large corpus of data, the companies that produce these tools have a massive commercial incentive to serve these markets. So we expect these context window limits to be eased more or less gradually.
In the meantime, engineers have developed a technique to partially address the context window limit, which enables an expanded range of historical inquiry. Next slide, please. The technique is called Retrieval Augmented Generation, or RAG. This was the second development I learned about that caused me to get excited about the possibilities of generative AI tools.
The idea behind RAG is to pair a large language model with a database containing your corpus of articles or documents. When you ask a question, the tool first searches the database for portions of the documents that were semantically most relevant to your question. Then it presents these excerpts of the matching documents to the AI tool, and the LLM uses these excerpts to compose an answer. This technique sidesteps the context window limit by feeding excerpts to the model instead of complete documents. The idea has the potential to be able to select the right information to answer your question—potential.
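A minimal sketch of that retrieval step follows. It uses the open-source sentence-transformers library as a stand-in for whatever embedding model a production system would use; the passages, question, and top-2 cutoff are placeholders, not a description of Google’s implementation.

```python
# Minimal sketch of Retrieval Augmented Generation's retrieval step:
# embed passages, find the ones most relevant to a question, and build a
# prompt that asks the model to answer only from those excerpts.
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer

passages = [
    "Excerpt one of a FRUS document ...",   # placeholders for real passages
    "Excerpt two of a FRUS document ...",
    "Excerpt three of a FRUS document ...",
]
question = "What did the Embassy report about the negotiations?"

model = SentenceTransformer("all-MiniLM-L6-v2")
passage_vecs = model.encode(passages, normalize_embeddings=True)
question_vec = model.encode(question, normalize_embeddings=True)

# Cosine similarity (vectors are normalized, so a dot product suffices).
scores = passage_vecs @ question_vec
top = scores.argsort()[::-1][:2]          # keep the top 2 excerpts

prompt = (
    "Answer the question using only the excerpts below, and cite them.\n\n"
    + "\n\n".join(f"[{i}] {passages[i]}" for i in top)
    + f"\n\nQuestion: {question}"
)
print(prompt)  # this prompt would then be sent to the LLM
```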
After experimenting with many tools that couldn’t handle a corpus the size of FRUS, I was introduced by a Google engineer to one of their tools that was able to index all of FRUS. It’s called Google Vertex AI Search Agent. That’s a mouthful. I set it up, pointed it at history.state.gov, and after several hours during which the tool indexed every FRUS document, I was able to begin submitting questions. Next slide, please. It limits its answers to the top 10 most relevant document segments and composes answers four to five paragraphs long with citations to each document it used for its claims. Here you see a question that I asked at the very top and the AI tool’s multi-paragraph answer. There’s another paragraph not shown here. At certain points along the way, you’ll see a circle with a link icon. When you click on that, it reveals what’s shown at the end of that first pink arrow, which is a card summarizing or displaying the source FRUS document that it drew that information from. If you follow that link to the original document (I’ve taken a screenshot of the relevant text here), you can verify that the tool drew its answer from this document rather than from its general internet training set. The limit of 10 segments means that questions requiring more than 10 document fragments will be unavoidably incomplete. So this answer shows three little citation blocks. Each one of those might have one or more links, but you’re only going to see 10 citations.
So this RAG approach, in Google’s implementation, has a limit of 10 links, or 10 excerpts, that it can draw from to formulate an answer. So to use it effectively, you really have to think: could the question I’m asking be answered with 10 fragments or fewer, or am I asking a question that is much broader and would require consulting a larger base of documents? Besides its multi-paragraph answers and citations, it also presents a list of all the documents that it thinks are relevant. So if there are more than 10, thousands maybe, it will show you the list of all the relevant documents, and you can go look for yourself for more information about the question you asked, but its summary answer only draws on 10 fragments.
In my preliminary testing, it has been quite impressive, with better results than competing tools I had tried. I wouldn’t advise copying and pasting any AI answer directly into an email or paper, but I think that with training in the tool’s limitations, my colleagues here could already use this tool to help them with their own research. Extensive testing and refinement would be needed though before we could offer such a tool on our public website.
The third development emerged from a fortuitous meeting with an Amazon engineer. I explained that we have a complex style guide, a lengthy manual for annotating FRUS documents. My colleague James Wilson championed the idea of an AI tool that could review our draft annotations for FRUS volumes against the rules in the style guide. The engineer happened to have created a proof of concept for just such a tool for a different style guide and offered to show it to us and adapt it to our style guide. We provided him with an excerpt of our guide and samples of FRUS annotations containing intentional mistakes we had introduced to test the tool’s ability to catch those mistakes. Next slide, please.
I don’t have a screenshot of the resulting tool, but this slide just shows you a sample annotation sheet and the deletion that James inserted where he removed some words in the document heading. In the lower portion of the screenshot, I have a cutout of the style guide which indicates how the headings should be formulated for meeting minutes. The tool did successfully tell us that the document heading was lacking the “Minutes of a…” prefix. It was able to apply that rule based on the examples listed in the style guide. The tool correctly identified some of the mistakes, such as missing required components of archival citations. In other cases, it flagged some issues that were not mistakes but that our editors deemed worth checking. The tool also revealed some inconsistencies in the style guide itself that inevitably slipped in over the years and that we had not detected. As a result, we are exploring using AI not only to check annotations against our style guide but also to improve the style guide itself, so it’s more reliable for humans, too. Still, we’re quite some distance from being able to use such a tool in our internal systems.
The fourth and final development that piqued my interest was the appearance of tools that could transcribe handwritten historical documents. For decades, we’ve enjoyed the use of digital scanning technologies and optical character recognition, or OCR, technology for recognizing the letters and words in typed and printed documents and allowing us to search and mine them. As good as OCR technology is, it still produces output that can be riddled with errors, and such levels of error may be tolerable for certain use cases, but for others, we have to laboriously proofread OCR output to achieve a sufficient level of quality. Next slide, please.
In this slide, I have a scanned image of a document from the Reagan Library. It looks very clean. You would think this would be amenable to OCR, but one of the leading OCR tools produced the result in the bottom. That is the raw OCR output. It’s not hard to see problems. If traditional OCR still has room for improvement on typed or printed documents, the situation is much worse for handwritten texts. Traditional OCR struggles or completely fails to recognize and extract handwritten text. Although many documents produced over the last century have been typed, we still have enormous quantities of handwritten documents and continue to produce them. If text can’t be extracted, we can’t search its contents and readers with visual disabilities are hindered from accessing the information. Being able to effectively digitize paper records and make them accessible and ready for research is a challenge not just for historians, but for any organization seeking to take advantage of the power of digital tools for searching and analyzing text.
There are two ways in which generative AI tools are improving on traditional OCR. First, generative AI tools are able to take error-laden OCR text, such as that shown here, and try to fix the errors in the text. Next slide, please. So in the top portion, we see the original OCR output riddled with errors, and the bottom text shows the result of a prompt asking ChatGPT to correct the obvious errors.
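Scripting that cleanup step might look like the sketch below, which uses OpenAI’s Python SDK as one example; the model name and file path are illustrative assumptions, and the corrected text would still need human proofreading.

```python
# Minimal sketch: ask an LLM to correct obvious errors in raw OCR output.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the model name and file path are illustrative assumptions.
from openai import OpenAI

with open("raw_ocr.txt", encoding="utf-8") as f:
    raw_ocr = f.read()

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": (
                "Correct only the obvious OCR errors in the text below. "
                "Do not rephrase, summarize, or add anything.\n\n" + raw_ocr
            ),
        }
    ],
)

print(response.choices[0].message.content)
```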
Still, I wished for a tool that could perform its own OCR that wouldn’t be limited to cleaning up bad OCR. In the last few months, ChatGPT and its peers began releasing models with this capability. These models are called “multimodal” meaning that they can take as their input not only text, but also images, audio, and video. To trigger the tool’s document vision capabilities, you can’t upload a PDF. You have to upload an image of a document, a JPEG or a PNG. Once you upload one of these types of images, you can ask the tool for a transcription.
But how good are the results? Next slide, please. As a test, I chose a particularly challenging set of documents, a set of 6,500 handwritten index cards that we had scanned a decade ago, but could not effectively exploit because OCR tools could not decipher the handwriting. The index cards contain listings of consular officials at U.S. diplomatic and consular posts from 1789 to 1960. Next slide, please. I have a slightly zoomed-in version so you can see more detail. The cards are nearly all handwritten in ornate cursive. Each card contains a mix of tables and marginal notes and headings.
After experimenting with ChatGPT and Claude with mixed results, I finally derived very impressive results from Google’s Gemini 1.5 model. Next slide, please. It transcribed the text of many cards perfectly and was able to capture the cards’ mixture of tabular and non-tabular comments and marginalia, a feat that no other model matched. So here we see the results of Gemini’s transcription of this card. If you look closely, you will see a few mistakes or variations, like the middle initial of the person. But it did a very respectable job with the text, and it captured the structure of the card: the heading of the card, the column headings, the cell boundaries. It did quite a good job. Next slide, please.
And unlike other tools, it handled the bottom portion of the card, where there is a new heading and several comments that are not part of the table but are sort of inserted over it; this is the transcription of that portion, and it captured those non-tabular remarks perfectly. It wasn’t flawless. For some cards, it merged the contents of adjoining cells or omitted certain columns. For about 5 percent of the cards, it produced scrambled results for reasons we have not yet had the chance to investigate. Next slide, please. But in most cases, it correctly or nearly correctly transcribed the names, birthplaces, and dates and places of service of the officials listed on the cards. The tool took around 15 to 20 seconds per card and completed all 8,600 scanned images in 48 hours, overnight as I slept. It took just 48 hours; I was able to sleep longer because of this tool.
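A batch run along these lines can be sketched roughly as follows, here using Google’s google-generativeai Python SDK; the prompt, file paths, and model name are illustrative assumptions, and real use would need error handling, rate limiting, and the close review described below.

```python
# Minimal sketch: batch-transcribe scanned index cards with a multimodal model.
# Assumes: pip install google-generativeai pillow, and GOOGLE_API_KEY set in
# the environment; the prompt, paths, and model name are illustrative.
import os
from pathlib import Path

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

PROMPT = (
    "Transcribe this handwritten index card. Preserve the heading, the table "
    "structure (as a Markdown table), and any marginal notes."
)

Path("transcriptions").mkdir(exist_ok=True)

for card in sorted(Path("scanned_cards").glob("*.png")):
    # Each call sends the prompt plus one card image and saves the draft text.
    response = model.generate_content([PROMPT, Image.open(card)])
    Path("transcriptions", card.stem + ".md").write_text(
        response.text, encoding="utf-8"
    )
```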
We were astounded that after such a short time, we could search the cards in ways that we had been unable to for a full decade, or even before in their original paper form. The cards will need to be reviewed, but this review will start from a respectable first draft. And here in this image, you’ll see the transcription of one row, George H. Jackson, and we noticed (this experiment wasn’t the first to notice this phenomenon) that on some cards, appended to an individual’s name in red pen was the designation “colored.” Having a searchable form of the cards, we were able to search for all instances of that word. The 25 cards that carried that designation, sometimes in ditto marks underneath, revealed the names of employees we did not previously know about for the lists we have been compiling of Black Americans employed by the State Department. Next slide, please.
In conclusion, the results of these experiments show that generative AI tools have great potential for some portions of the work of transcribing, annotating, and querying our historical documents and sources. We can use natural language to query individual documents or articles. We can perform semantic search across large corpuses of data and obtain draft answers to questions based on short excerpts of relevant texts. We can receive automated feedback on our annotations, and we can produce usable draft transcriptions of complex historical documents. The tools exhibit clear shortcomings, making them inappropriate for some tasks. But for other tasks, we were able to mitigate these issues and derive utility through persistent and careful experimentation and close review.
In addition, we noticed these tools improving during the course of our experiments. So if you begin experimenting and hit some disappointing results, you might put your experiments down and wait a few weeks or months. By the time you try again, a new model may have emerged that addressed the flaws of the previous generation. Given the massive and growing scale of commercial investment that I mentioned before, we can anticipate that many of the limitations that we see today will dissipate, and new paradigms will quickly replace today’s offerings. New capabilities are sure to come to these tools, and we should be ready to evaluate them. In the meantime, we are finding valid use cases for these tools in certain limited scenarios, when paired with a healthy dose of caution and skepticism.
Thank you.