
[NMLUG] Python tight loop causing massive CPU barfage



The following Python code searched a hierarchy of 28,517 
directories and 222,766 files.  For each "regular" file 
found, it opened the file and searched the first lines 
(whole lines totalling approximately 1024 bytes, since 
readlines(1024) takes a size hint in bytes, not a line 
count) for any line beginning with 'Subject'.

It ran in 7 minutes 13 seconds on a fast Linux system.

import os, os.path
import stat
import re
import sys

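# Compile the pattern once, outside the per-file loop.
myre = re.compile('^Subject')

def foo(total, dirname, filelist):
    # Callback for os.path.walk (Python 2): called once per directory.
    total['dirs'] += 1

    for mf in filelist:
        fn = os.path.join(dirname, mf)
        try:
            mode = os.stat(fn)[stat.ST_MODE]
        except OSError:
            # Broken symlink, or the file vanished mid-walk: skip it.
            continue

        if stat.S_ISREG(mode):
            total['files'] += 1

            mf_fd = os.open(fn, os.O_RDONLY)
            the_file = os.fdopen(mf_fd, "r", 1024)
            # The argument is an approximate size hint in bytes.
            the_lines = the_file.readlines(1024)
            for line in the_lines:
                if myre.match(line):
                    total['matches'] += 1
            the_file.close()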

total = {}
total['dirs'] = 0
total['files'] = 0
total['matches'] = 0

os.path.walk('.', foo, total)

print total

Output:

{'dirs': 28517, 'files': 222766, 'matches': 402}

real    7m13.328s
user    0m26.100s
sys     0m10.020s
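
For anyone on Python 3, where os.path.walk no longer exists, 
here is a rough sketch of the same scan built on os.walk.  It 
is not the code that produced the timings above.  The names 
count_subjects and SUBJECT_RE are made up for illustration, 
and it reads at most the first 1024 bytes of each file rather 
than calling readlines(1024), so its match count may differ 
slightly on files with long headers.

```python
import os
import re

# Hypothetical name; compiled once and reused for every file.
# MULTILINE makes '^' match at the start of each line in the chunk.
SUBJECT_RE = re.compile(rb'^Subject', re.MULTILINE)

def count_subjects(root, nbytes=1024):
    """Count directories, regular files, and lines starting with
    'Subject' within the first nbytes of each file under root."""
    total = {'dirs': 0, 'files': 0, 'matches': 0}
    for dirpath, dirnames, filenames in os.walk(root):
        total['dirs'] += 1
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip FIFOs, sockets, broken symlinks and the like.
            if not os.path.isfile(path):
                continue
            total['files'] += 1
            try:
                with open(path, 'rb') as fh:
                    head = fh.read(nbytes)
            except OSError:
                # Unreadable or vanished: counted but not searched.
                continue
            total['matches'] += len(SUBJECT_RE.findall(head))
    return total
```

Calling count_subjects('.') returns a dictionary with the 
same three keys as the Output shown above.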

> On Tue, 08 Feb 2005 20:44:17 -0700
> Paul Tietjens <paul.tietjens@moriarty.k12.nm.us> wrote:
>> I have a python script that essentially opens a few
>> thousand (between 70,000 and 230,000 or so) files, reads
>> the first 1024 bytes and looks for a string match.
>>
>> The goal is to search an entire partition full of
>> Maildirs for specific emails.
>>
>> I want the process to happen as fast as possible.  So
>> far, it takes around 21 minutes - but there's a snag.
>> While this script is running, every other process on the
>> machine becomes sluggish to the point of
>> nonresponsiveness.
>>
>> No amount of playing with nice and priority levels seems
>> to help.
>>
>> What has helped, is a small sleep() in the loop - but
>> that raises the amount of time taken to complete the
>> tasks fairly rapidly (from 21 minutes to over an hour).
>>
>> In the end, I set up a goofy sort of throttling that
>> alters the amount of time sleep()ing by the average load.
>>
>> Is there a better way to do this?  I'm not much of a
>> coder, and I know there are a couple on this list - so
>> any tips offered, no matter how nebulous, would be great.
>>
>> Thanks in advance!
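
The load-based throttling described in the quoted message 
might look something like the sketch below.  This is a guess 
at the idea, not Paul's actual script.  The names 
throttle_delay and throttled, and the default numbers, are 
invented for illustration.  It also sleeps only once every 
so many files instead of on every pass through the loop, 
which may keep the total slowdown smaller.  os.getloadavg() 
is only available on Unix-like systems.  Note that the 
timings earlier in this message (about 36 seconds of user 
plus sys time out of more than 7 minutes of wall-clock time) 
suggest that this kind of scan spends most of its time 
waiting on the disk, which may be why nice, which only 
adjusts CPU scheduling priority, did not help much.

```python
import os
import time

def throttle_delay(load, target=1.0, step=0.05, max_delay=2.0):
    """Seconds to sleep: 0 at or below the target load, growing
    linearly with the excess load, capped at max_delay.
    All defaults are illustrative assumptions."""
    if load <= target:
        return 0.0
    return min(max_delay, (load - target) * step)

def throttled(items, every=100, **kwargs):
    """Yield items unchanged, pausing after every `every` items
    for a time based on the 1-minute load average."""
    for i, item in enumerate(items, 1):
        yield item
        if i % every == 0:
            delay = throttle_delay(os.getloadavg()[0], **kwargs)
            if delay:
                time.sleep(delay)
```

To use it, wrap whatever sequence of file names the scan 
loops over, for example "for fn in throttled(filenames):".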


