Ghost writers on the web

Feb. 1, 2007
Software programs that generate seemingly relevant text on a given subject are improving, and detection software is getting better at identifying existing content, but a casual reading shows that most generated text has no meaning.
By Jeffrey R. Harrow, Principal Technologist, The Harrow Group

Admit it: technology is fun. Whether your specialty is electronics, fluidics, hydraulics, software or other areas, it’s likely that you’re most intrigued with the technological aspects of your job.

If only our jobs were that simple, and we only had to deal with technology issues. The reality is that, as in school, we have other less interesting tasks that are important parts of the job. For example, some of us cringe at having to write long, detailed reports, especially when we suspect that they’ll just be shelved and rarely—if ever—read.

Considering all the technology around us—There Must Be A Better Way.

Yesterday
Perhaps there is… Many (pre-PC) decades ago, a committee asked me to prepare a paper to justify some obscure point. It was clear that this was just a CYA document that would never actually be read.

I had recently come across a software program that would generate text relating to a given subject. When skimmed, pages of these paragraphs seemed to be relevant to the subject at hand and flowed together as you’d expect. But even a casual reading quickly made it clear that the text was semantically null—it sounded good, but it had absolutely no meaning!

So who could resist such an opportunity? I had the program print a 120-page “paper,” complete with chapter headings and the like. I added a title page and fancy binding and turned it in.

I was praised for having done a good job!

(Of course, I had also written a real paper appropriate to the assignment that I turned in at the next meeting—after I brought my shenanigans to light.)

More than a few of the committee members, who had ostensibly read the report, had embarrassed red faces, but they all got the point—that one day computers might indeed be able to generate valid—rather than null—content, and that would change a lot of rules. (My cheap shot at the foibles of “committees” had nothing to do with this. Really.)

Today
Moving forward to the present, consider this excerpt from the introduction to a computer science paper that might seem at home in Google Labs (the full report also contains charts, graphs and an extensive set of reference citations):

“Many cyberinformaticians would agree that, had it not been for signed theory, the study of RPCs might never have occurred. The notion that analysts agree with reliable modalities is largely outdated. The notion that cyberneticists cooperate with optimal algorithms is generally satisfactory. The improvement of evolutionary programming would tremendously amplify concurrent epistemologies.

“We propose an analysis of SCSI disks (Sen), which we use to argue that A* search can be made optimal, interactive and trainable. While conventional wisdom states that this challenge is largely solved by the simulation of B-trees, we believe that a different approach is necessary. Sen enables compact information. Existing linear-time and introspective systems use the evaluation of kernels to investigate systems. Though conventional wisdom states that this quagmire is entirely overcome by the analysis of access points, we believe that a different approach is necessary.”

It seems to make sense (if only I understood the subject a little better), yet it is quite meaningless. (Or perhaps it’s very insightful and I just can’t “get it…”)

But no human brain cells were harmed in the production of this paper—it was “written” by SCIgen, an automatic computer science paper generator created by MIT graduate students.

And these folks were far more audacious than I was with my committee report. SCIgen created a full paper that was submitted for presentation at the WMSCI 2005 ISAS conference.

IT WAS ACCEPTED!

The “author” commented: “Using SCIgen to generate submissions for conferences like this gives us pleasure to no end.”

(The conference organizers did eventually realize what they had accepted and rescinded their invitation.)

Clearly the authors were having fun with this, but it does presage some very interesting challenges that lie ahead as our computers actually do develop linguistic skills.

Give this a try—create your own paper at http://pdos.csail.mit.edu/scigen/ (just hope your manager doesn’t read this article).
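
For the curious, here is a rough idea of how a generator of this kind can work. SCIgen's authors describe it as built on a hand-written context-free grammar. The sketch below is not their code: the grammar, the function names and the seed are all invented for illustration, and a real generator would use far more rules. It expands a start symbol by picking random productions until only literal words remain.

```python
import random

# Hypothetical toy grammar, invented for illustration; not SCIgen's actual rules.
# Keys are symbols to expand; any string that is not a key is literal text.
GRAMMAR = {
    "SENTENCE": [
        ["Many", "PEOPLE", "would agree that", "TOPIC", "is", "ADJ", "."],
        ["We propose an analysis of", "TOPIC", ", which we use to argue that",
         "TOPIC", "can be made", "ADJ", "."],
    ],
    "PEOPLE": [["cyberinformaticians"], ["analysts"], ["cyberneticists"]],
    "TOPIC": [["A* search"], ["SCSI disks"], ["B-trees"],
              ["evolutionary programming"]],
    "ADJ": [["optimal"], ["interactive"], ["trainable"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of literal text fragments."""
    if symbol not in GRAMMAR:
        return [symbol]
    production = rng.choice(GRAMMAR[symbol])
    fragments = []
    for part in production:
        fragments.extend(expand(part, rng))
    return fragments


def generate_sentence(rng):
    """Build one sentence and tidy the spacing before punctuation."""
    text = " ".join(expand("SENTENCE", rng))
    return text.replace(" ,", ",").replace(" .", ".")


def generate_paragraph(n_sentences, seed=None):
    """Join several generated sentences; a fixed seed makes output repeatable."""
    rng = random.Random(seed)
    return " ".join(generate_sentence(rng) for _ in range(n_sentences))


if __name__ == "__main__":
    print(generate_paragraph(3, seed=2007))
```

Every sentence this produces is grammatical because the rules guarantee it, and every sentence is meaningless because nothing in the rules knows what the words mean.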

Literary “Battle of the Bots”
Unscrupulous students are already using technology to skirt some assignments by turning in papers purchased or purloined from Internet sites. Educators, not to be outflanked, have stepped up to the challenge by using technology such as TurnItIn to analyze their students’ papers against “billions of pages from both current and archived instances of the Internet, millions of student papers previously submitted to TurnItIn, and commercial databases of journal articles and periodicals,” says the website. TurnItIn is offered by iParadigms LLC.
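
How might a program compare a paper against a collection of existing documents? Commercial services do not publish their methods, so the sketch below is only an assumption about the general idea, not TurnItIn's algorithm. It breaks each text into overlapping three-word sequences (often called shingles) and reports what fraction of a submission's sequences also appear in a known source. The function names and the tiny corpus are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of n-word sequences in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_score(submission, source, n=3):
    """Fraction of the submission's shingles that also occur in source."""
    sub = shingles(submission, n)
    if not sub:
        return 0.0
    return len(sub & shingles(source, n)) / len(sub)


def best_match(submission, corpus, n=3):
    """Return (doc_id, score) for the corpus document with the highest overlap."""
    scored = ((doc_id, overlap_score(submission, text, n))
              for doc_id, text in corpus.items())
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


if __name__ == "__main__":
    # Hypothetical two-document corpus, for illustration only.
    corpus = {
        "essay-a": "The quick brown fox jumps over the lazy dog.",
        "essay-b": "Computers may one day generate valid content.",
    }
    print(best_match("The quick brown fox sleeps all day.", corpus))
```

A scheme like this catches copied passages but is easily fooled by paraphrasing, since changing one word in every three breaks every match.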

Another arrow in the educator’s quiver is professor Ed Brent’s “SAGrader.” It analyzes students’ draft papers, and highlights areas where they don’t sufficiently address the assignment. After the student gets the opportunity for rework, the professor can even use SAGrader to grade the final papers! (And we thought it was unfair to have graduate students, rather than the professor, grade our papers.)

Similar software, “e-Rater,” has been grading 400,000 essays annually as part of ETS’ Graduate Management Admission Test. A literary computer now helps decide who gets into business school! (The service is also available to teachers as “Criterion.”)

These techniques are hardly isolated to education. A typical company generates an immense amount of intellectual property that is shared through business partnering and other legitimate activities, or sometimes through carelessness or industrial espionage. Programs such as iThenticate have appeared to help companies “check the originality of documents and manuscripts instantly, or to find out if any of their current intellectual property is being misappropriated somewhere on the Internet.”

Tomorrow?
I see this as just the beginning of the “Battle of Literary Software,” a technological escalation in which the tools used by teachers, students and companies will continually evolve to fleetingly grasp the upper hand. Text generators will likely become more sophisticated at paraphrasing, while detection software will get ever better at identifying existing content. Either way, I suspect people will be increasingly careful about schoolwork and work that carries their names.

About the Author
Jeffrey R. Harrow is principal technologist for The Harrow Group. He can be reached at [email protected].