Coming soon to a screen near you...


We're not quite ready to roll out CCC Online officially yet, but this summer, Derek, Madeline, and I have been laying the groundwork for the site, and experimenting with the kinds of data (and metadata) that the site will provide.

A few weeks ago, I was complaining about having to work from scratch with Perl, but it looks like we're now in business, thanks to D. One of our challenges has been figuring out how we're going to tag articles in the absence of abstracts (which are a relatively recent phenomenon in the journal) and without having to do close readings of 50 years' worth of journal articles. Compared to that latter task, Perl sounds like a walk in the park, yes?

Anyhow, we're getting closer to having a workable system for parsing an article, isolating nouns and noun phrases, and applying a relatively systematic set of rules for generating keywords on the fly. It still involves a fair amount of intensive labor, and I may end up having to learn how to write Excel macros to get it working a little better, but I'm feeling comfortable with the results. They won't be absolutely accurate--there will always be problems with synonyms and diverging meanings (for example, "of course" and "writing course" are treated as 2 equal instances of the word "course")--but I'm hoping that as tentative snapshots, some of this data that we're generating will be of use to the field.
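As a rough illustration of this kind of counting, here is a minimal Python sketch. It is not the Perl scripts described in this post: the tiny stopword list, the naive `collapse` rule, and the function names are all assumptions made for the example, and it counts every remaining word rather than isolating nouns.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be far longer.
STOPWORDS = {"a", "an", "the", "of", "in", "and", "is", "to", "for",
             "this", "that", "it", "on"}

def collapse(word):
    """Naively fold plurals into singulars so 'students' counts as 'student'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # studies -> study
    if word.endswith("sses"):
        return word[:-2]            # classes -> class
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # students -> student
    return word

def keyword_counts(text):
    """Lowercase the text, drop stopwords, collapse plurals, and count the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(collapse(w) for w in words if w not in STOPWORDS)

sample = ("Of course the students in this writing course revise. "
          "A student writes for the course.")
counts = keyword_counts(sample)
print(counts.most_common(3))
```

Note that the sketch reproduces the problem mentioned above: once "of" is stripped as a stopword, the "course" in "of course" is counted alongside the two genuine uses of "course".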

For example, here's a keyword list from the latest issue of CCC. I've tried to collapse when possible (e.g., "students" combines both "students" and "student"), and the number in parentheses is the number of appearances. We've stripped out articles, pronouns, prepositions, etc., and tried to stick to "significant" terminology.

Students (604), Writing (265), Courses (245), Papers (208), Composition (163), Portfolios (139), Summer (107), Textbooks (97), Work (78), Teachers (74), Assessment (69), Approaches (67), Authorship (67), Process (65), Studies (65), Essays (64), Class (63), Part (59), College (58), Teaching (57), Classes (55), Time (54), Author (52), Study (50), Writers (50)

These are the top 25 terms out of more than 7,000 total, drawn from a word count (remember, minus a lot of other words) of around 18,000. And there are some obvious spots where specific articles are skewing the issue's rankings: White's article on assessment accounts for 137 of the 139 appearances of the word "portfolio(s)," and Ritter's is responsible for all but one appearance of the word "authorship." Across the five articles, "students" is the top noun in three, and no lower than the seventh most frequent in the other two.
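One way to flag that kind of skew is to ask what share of a term's appearances comes from a single article. The Python sketch below assumes per-article counts are already available; the `top_contributor` helper and the data layout are hypothetical, and only the 137-of-139 figure comes from the numbers above (the remaining two appearances are lumped together).

```python
from collections import Counter

def top_contributor(term, per_article_counts):
    """Return (article, share) for the article contributing most uses of term."""
    total = sum(c[term] for c in per_article_counts.values())
    if total == 0:
        return None, 0.0
    article, counts = max(per_article_counts.items(),
                          key=lambda kv: kv[1][term])
    return article, counts[term] / total

# Illustrative layout: one Counter of term frequencies per article.
per_article = {
    "White": Counter({"portfolio": 137}),
    "Other articles": Counter({"portfolio": 2}),
}

article, share = top_contributor("portfolio", per_article)
print(f"{article} accounts for {share:.1%} of 'portfolio'")
```

A term whose top share is close to 100% probably says more about one article than about the issue as a whole.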

Of course, where this information will become more useful, I think, is when we've gathered data for a broad range of issues. Patterns, I'm hoping, will emerge over time (as the field itself has), perhaps shifting from one editor to the next, with certain terms waxing and waning in the journal as their relative fortunes in the field have. And so on. Over the next couple of months, I'll probably be posting about CCCO more often here, and sharing some of these speculations.

If anyone is interested, we'll most likely be doing a small-scale usability study on the site as well, probably in the first half of the fall semester. Leave me a comment or drop me an email if you'd be interested in participating...


Sounds promising. I hope you'll share this code. It would be very useful for enhancing something like CompPile, or even the directory full of PDFs on my hard drive. And as an earnest Perl hacker, I'd publish any changes, of course.

I'd be happy to help with the usability stuff, as well.

Perhaps there is some potential for overcoming the challenge presented by the absent abstracts by asking people who are preparing for their comps to forward their abstract-like notes on CCC's articles. I'm not sure if it would be useful to you or how you would organize it, but at the very least it'd be a cool collaborative act for isolated, exam-preparing folks...
I've noticed revisionspiral has some annotated notes posted. Though I don't post 'em, I'm doing the same thing...


About this Entry

This page contains a single entry by cgbrooke published on August 13, 2005 5:05 PM.
