networks: August 2005 Archives



I asked Lotaria if she has already read some books of mine that I lent her. She said no, because here she doesn't have a computer at her disposal.

She explained to me that a suitably programmed computer can read a novel in a few minutes and record the list of all the words contained in the text, in order of frequency. "That way I can have an already completed reading at hand," Lotaria says, "with an incalculable saving of time. What is the reading of a text, in fact, except the recording of certain thematic recurrences, certain insistences of forms and meanings..."

Italo Calvino, If on a winter's night a traveler

Derek and I have been laying the full-court press on all things related to CCC Online this week, in the hopes of rolling the site out publicly with the release of the first issue of this year's volume (57.1), and we're getting pretty close. Upon its release, the site will have archived the past four years of essays, and while that doesn't sound like a lot, believe me when I say that it has been. There's been a great deal of information to compile, and we've also had to design a workflow in the process, one that will enable us to continue working backwards in time.

Anyhow, one of the major features of the site is ready to roll. In addition to publishing the metadata on each article, we've been generating some additional material in the form of keywords. Beginning with this year, CCC authors will supply a set of keywords for their articles, but as you can imagine, trying to track down the authors of 50-odd years of articles and get keywords didn't strike us as a winning proposition.

And so, like Lotaria above, we looked for technological assistance. Inspired in part by the work of people like Cameron Marlow and Anjo Anjewierden, we needed a way to "read" these essays, reducing them to a set of 10-15 keywords, without a prohibitive cost in time or labor. After much searching, despairing, tweaking, and yes, whining, we ended up with a couple of Perl scripts that seem to be doing the trick. You could almost hear the relief, I imagine.

The results of our text parsing are valuable in and of themselves, I think, and they show up for the individual entries on CCC Online, under the heading of Tags. While they can't fully account for a given article's complexity or nuance, we're operating according to a principle a lot like the power law--our attitude is that the majority of an essay's message is concentrated in the handful of words that appear just below the threshold of articles, pronouns, prepositions, etc. They represent that "thematic recurrence" or the "insistences of meaning."

For instance, in Diana George's essay, which I mentioned a few days ago, we isolated close to 1600 nouns and noun phrases, appearing a total of 3500 times. (These are really rough numbers, for reasons that I could explain if anyone's really interested.) Now, the top 1% of those nouns and phrases, or about 16 of them, account for around 500 appearances (approx. 15%). Expand the selection to 5% (80 nouns), and the appearances jump to 1200 (about a third). 10% of the words (160) gives us a little less than half at 1600 (about 45%). And 20% (320), a magic percentage for power laws, yields 2100 instances, or around 60%. This may not be interesting to anyone but me, but while it doesn't quite match up with the power law, it's close enough to be suggestive. And the roughness of my numbers is rough in the right direction for the claim I could make.
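For anyone who wants to check the arithmetic, here's a minimal Python sketch of that concentration calculation. It isn't our Perl; the function name and the toy counts are made up for illustration, and only the rough figures at the bottom come from the paragraph above.

```python
from collections import Counter

def coverage_of_top_share(counts, share):
    """Fraction of all appearances accounted for by the top `share` of terms."""
    ranked = sorted(counts.values(), reverse=True)
    top_n = max(1, round(len(ranked) * share))  # always keep at least one term
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical toy counts (10 terms, 100 appearances), not data from any essay.
toy_counts = Counter({"design": 40, "students": 30, "visual": 10, "text": 5,
                      "page": 5, "class": 4, "image": 3, "essay": 1,
                      "poster": 1, "map": 1})

for share in (0.1, 0.2, 0.5):
    covered = coverage_of_top_share(toy_counts, share)
    print(f"top {share:.0%} of terms -> {covered:.0%} of appearances")

# The rough figures reported above: (share of terms, appearances) out of 3500.
reported = [(0.01, 500), (0.05, 1200), (0.10, 1600), (0.20, 2100)]
for share, appearances in reported:
    print(f"top {share:.0%} of terms -> {appearances / 3500:.0%} of appearances")
```

The same function would work on a real essay's counts once the nouns and phrases have been tallied.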

Here's where it gets really cool, though. We've generated lists of keywords for all of the articles published in CCC over the past four years, and placed those keywords on the individual pages themselves. Because we're using MT to publish these entries, though, we've made them available for services like CiteULike and del.icio.us. And so, we've established a CCC Online account at del.icio.us (http://del.icio.us/ccco/), where we've first bookmarked all of the articles from the last four years, and then used our keywords as tags for the articles themselves. And the keywords on each entry at CCCO are links to our del.icio.us page for that tag.

For those unfamiliar with del.icio.us, I recommend scrolling down and finding Options at the bottom of the right-hand column, and starting with "View as Cloud," "Sort by Alpha," and "Show Bundles." The option that appears in black is the one that's active. The cloud uses color and size to indicate which tags are most frequent, and we've separated out the issues themselves into a separate bundle. We also added tags for CCCC Chair's Addresses and the Braddock Award winners. Spaces aren't permitted, and so you'll notice that we're doing the WikiWord thing for phrases.
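The WikiWord thing is simple enough to sketch in a few lines of Python. This is an illustration of the convention only, with a hypothetical function name, not the script we actually use.

```python
def to_wiki_word(phrase):
    """Join a multi-word phrase into a single space-free tag, e.g. for del.icio.us."""
    # Capitalize the first letter of each word and leave the rest untouched.
    return "".join(word[:1].upper() + word[1:] for word in phrase.split())

print(to_wiki_word("visual literacy"))
print(to_wiki_word("students"))
```

Single-word tags pass through with just a leading capital, so the whole keyword list can go through the same function.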

Tags at the top and bottom of the frequency list are less than optimal, of course. "Students" appears as one of the top 10 in 69 of the 84 essays we've tagged thus far, for example, which isn't particularly useful except as an example of the kinds of concerns most likely to appear in CCC. And at the bottom are a mix of tags, some of which will probably rise as we expand the range and others which will end up being something like Amazon's statistically improbable phrases (SIPs). The range in the middle, though, we hope will help researchers in our field by seeding their bibliographic work (it is only a single journal, after all).
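One way to find that middle range is to count how many essays each tag appears in, then drop the tags that show up nearly everywhere or only once. Here's a hedged Python sketch; the essay names, tags, and cutoff values are all hypothetical, and we haven't settled on any actual thresholds.

```python
from collections import Counter

def tag_document_frequency(tags_by_essay):
    """Count the number of essays in which each tag appears."""
    frequency = Counter()
    for tags in tags_by_essay.values():
        frequency.update(set(tags))  # count each tag once per essay
    return frequency

def middle_range(frequency, total_essays, min_essays=2, max_share=0.5):
    """Keep tags that are neither one-offs nor nearly universal."""
    return sorted(tag for tag, n in frequency.items()
                  if n >= min_essays and n / total_essays <= max_share)

# Hypothetical toy data, not the real tag set.
tags_by_essay = {
    "essay-a": ["Students", "VisualLiteracy", "Design"],
    "essay-b": ["Students", "Assessment", "Portfolios"],
    "essay-c": ["Students", "Design", "Authorship"],
    "essay-d": ["Students", "Assessment", "Plagiarism"],
}

frequency = tag_document_frequency(tags_by_essay)
print(middle_range(frequency, total_essays=len(tags_by_essay)))
```

In the toy data, "Students" is dropped for appearing in every essay, and the one-off tags are dropped at the other end.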

More importantly, though, I think that del.icio.us provides us with the beginnings of a map of the journal--whether it's extrapolable to the field as a whole I'm reserving judgment about, but I'm excited about the possibility. It's an eminently searchable map, as well as one that permits the kind of exploration that isn't nearly as convenient otherwise. There's plenty more to say about it, I'm sure, but right now, I kind of want to just sit back and feel a little pride.

So, yeah, that's part of what we've been up to.

Communificationalized


Or something like that.

One of two things happened today: either I was better prepared somehow than I thought I was, or I've somehow learned to be a little more comfortable with my intrinsic lack of organization. If I had to lay money, I'd probably bet on the latter. I ended up doing what I could to sort of shrink the scale of my talk a bit, from the Discipline to the textual networks that make up both the discipline(s) and our experiences of it. And this segued nicely into a quick show and tell about CCC Online. Most of the work that we've done on the site has been this summer, and so most others in my department haven't seen what we've accomplished, much less heard me make the case for why it's important.

I'm still working the kinks out in terms of my ability to articulate the needs that the site fills and the way that it goes about filling them, but I figure that'll come with practice. So last week-ish, I talked about doing the keyword parsing of each article and each issue. Today, one of the things I showed off was our attempt to provide what we're calling reversible bibliographies, or "works citing" as opposed to "works cited."

Since we're managing the site through Movable Type, one of the things we're doing is to place links in the works cited to other CCC articles. Makes sense, right? Well, MT allows us not only to place links to those cited works, but to make them trackbacks as well. So articles that have been cited will themselves contain links to the essays that cite them. We're only four years deep so far, so we don't have lots of examples of this, but you can see what I mean by looking at the entry for Diana George's From Analysis to Design, which is, in the pages of CCC itself, the most frequently cited CCC article from the past four years. And it's only been cited twice.

(And that's something that I didn't talk about today, but could have. It's interesting to look at what I'd call insular citation patterns (CCC articles cited in CCC articles). The two most citation-heavy articles from the last three years have both been CCCC Chair's Addresses, for example. There are roughly 61 CCC articles cited in the most recent Volume (4 issues), but half of those are accounted for by Kathi Yancey's Chair's Address (21) and Richard Fulkerson's article (10), with 12 of 20 essays citing either zero or one. That's up from the year before--in Vol. 55, there are only 37 CCC articles cited in 20 essays. These are rough estimates, though, because there are some instances of Cross-Talk citations that I haven't traced out to their origins.)

As we get deeper into the archives, though, the idea of including a Works Citing list as well as Works Cited will become more relevant, I think. It would also be improved if/when the scope of this project expanded beyond CCC. But that's a concern for another year...
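Stripped of the trackback machinery, a reversible bibliography is just the works-cited mapping turned inside out. Here's a small Python sketch of that inversion; on the site MT's trackbacks do this work, and the function name and article ids below are placeholders for illustration.

```python
from collections import defaultdict

def works_citing(works_cited):
    """Invert {article: [articles it cites]} into {article: [articles citing it]}."""
    citing = defaultdict(list)
    for article, cited in works_cited.items():
        for target in cited:
            citing[target].append(article)
    return {target: sorted(sources) for target, sources in citing.items()}

# Hypothetical ids standing in for CCC articles.
works_cited = {
    "essay-1": ["essay-3"],
    "essay-2": ["essay-3", "essay-4"],
    "essay-3": [],
}

for target, sources in works_citing(works_cited).items():
    print(f"{target} is cited by {', '.join(sources)}")
```

An article that nobody has cited yet simply doesn't appear in the result, which matches what an empty "works citing" list would look like on an entry.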



We're not quite ready to roll out CCC Online officially yet, but this summer, Derek, Madeline, and I have been laying the groundwork for the site, and experimenting with the kinds of data (and metadata) that the site will provide.

A few weeks ago, I was complaining about having to work from scratch with Perl, but it looks like we're now in business, thanks to D. One of our challenges has been figuring out how we're going to tag articles in the absence of abstracts (which are a relatively recent phenomenon in the journal) and without having to do close readings of 50 years' worth of journal articles. Compared to that latter task, Perl sounds like a walk in the park, yes?

Anyhow, we're getting closer to having a workable system for parsing an article, isolating nouns and noun phrases, and applying a relatively systematic set of rules for generating keywords on the fly. It still involves a fair amount of intensive labor, and I may end up having to learn how to write Excel macros to get it working a little better, but I'm feeling comfortable with the results. They won't be absolutely accurate--there will always be problems with synonyms and diverging meanings (for example, "of course" and "writing course" are treated as 2 equal instances of the word "course")--but I'm hoping that as tentative snapshots, some of this data that we're generating will be of use to the field.

For example, here's a keyword list from the latest issue of CCC. I've tried to collapse when possible (e.g., "students" combines both "students" and "student"), and the number in parentheses is the number of appearances. We've stripped out articles, pronouns, prepositions, etc., and tried to stick to "significant" terminology.

Students (604), Writing (265), Courses (245), Papers (208), Composition (163), Portfolios (139), Summer (107), Textbooks (97), Work (78), Teachers (74), Assessment (69), Approaches (67), Authorship (67), Process (65), Studies (65), Essays (64), Class (63), Part (59), College (58), Teaching (57), Classes (55), Time (54), Author (52), Study (50), Writers (50)

These are the top 25 terms out of more than 7000 total, with a word count (remember, minus a lot of other words) of around 18,000. And there are some obvious spots where specific articles are skewing the issue's rankings (White's article on assessment accounts for 137 of the 139 appearances of the word portfolio(s), and Ritter's responsible for all but one appearance of the word authorship). In the five articles, "students" is the top noun in 3 of them, and no lower than the seventh most frequent in the other two.
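To give a sense of the general shape of this kind of counting, here's a rough Python sketch. It is not our Perl scripts: it swaps in a tiny hypothetical stop-word list in place of real noun and noun-phrase isolation, and it only folds a singular into its plural when both forms actually appear in the text.

```python
import re
from collections import Counter

# A tiny hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or", "for",
              "with", "is", "are", "it", "that", "this", "their", "they"}

def keyword_counts(text, stop_words=STOP_WORDS):
    """Count words, drop stop words, and fold singulars into observed plurals."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    merged = Counter()
    for word, n in counts.items():
        plural = word + "s"
        # "student" is folded into "students" only if "students" also occurs.
        merged[plural if plural in counts else word] += n
    return merged

sample = ("Of course the students in a writing course read a student "
          "portfolio. Students write.")
for term, n in keyword_counts(sample).most_common(2):
    print(f"{term.capitalize()} ({n})")
```

Note that the sample reproduces the problem mentioned earlier: "of course" and "writing course" both land in the same count for "course," because nothing here knows about phrases or meanings.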

Of course, where this information will become more useful, I think, is when we've gathered data for a broad range of issues. Patterns, I'm hoping, will emerge over time (as the field itself has), perhaps shifting from one editor to the next, with certain terms waxing and waning in the journal as their relative fortunes in the field have. And so on. Over the next couple of months, I'll probably be posting about CCCO more often here, and sharing some of these speculations.

If anyone is interested, we'll most likely be doing a small-scale usability study on the site as well, probably in the first half of the fall semester. Leave me a comment or drop me an email if you'd be interested in participating...

The velvet rope?


IHE makes note of a study published this month in Academe by Stephen Wu, called "Where do Faculty Receive Their PhD's?" It's notable for me mainly because it's an initial stab at a network analysis of the largely invisible prestige economy that operates in the academy.

The results aren't especially surprising. Top schools hire from top schools:

This study shows that graduates from the top-rated PhD programs continue to hold an overwhelming share of faculty positions at leading colleges and universities. Still, there is a fair amount of variation by field as well by institution type. The reasons for these disparities are unclear, but they merit further investigation.

One of the reasons for disparity is the blunt instrument that Wu uses for his data, namely the USNWR rankings. English is one of the "fields" that he looks at, and English departments include all sorts of areas that aren't articulated by those rankings. The presence of comprhet in most departments guarantees, for example, that the percentage of faculty from the Ivies will decrease. The absence of comprhet from USNWR rankings means that our program at SU is basically invisible.

I have more to say, but I wanted to jot down the observation that what "merits further investigation" is the degree to which the USNWR rankings (among other, similar, "objective" tools) function as a velvet rope to keep subdisciplines and interdisciplines marginalized and unrecognized. I like the basic direction of Wu's piece, but there are really fundamental problems with treating English as monolithically as he does, and even though it's the initial fault of USNWR, he perpetuates it here by not inquiring into the powerful ways that those rankings help to produce his results.
