









[NMLUG] Python tight loop causing massive CPU barfage
On a slower (800 MHz Pentium III) Linux system with a
flatter directory hierarchy, the previous Python code
searched 4,000 directories and 175,000 files in 9 minutes:
{'dirs': 4000, 'files': 735000, 'matches': 1878000}
real 8m59.230s
user 8m14.010s
sys 0m38.450s
You can tell from the large match count that it's a disk
holding nothing but mail files.
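
For anyone trying this on Python 3, where os.path.walk no longer exists, here is a minimal sketch of the same search built on os.walk. It is an adaptation, not the quoted script: the names scan_tree and head_bytes are hypothetical, and it reads the first 1024 bytes of each file (as in Paul's description) rather than calling readlines(1024), so a line cut off at the 1024-byte boundary is matched only on whatever part of it was read.

```python
import os
import re

# Compiled once; a bytes pattern because files are read in binary mode.
SUBJECT_RE = re.compile(rb'^Subject')


def scan_tree(root, head_bytes=1024):
    """Count directories, regular files, and 'Subject' lines under root.

    Only the first head_bytes bytes of each regular file are examined.
    """
    total = {'dirs': 0, 'files': 0, 'matches': 0}
    for dirpath, dirnames, filenames in os.walk(root):
        total['dirs'] += 1
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip FIFOs, sockets, broken symlinks, and anything else
            # that is not a regular file.
            if not os.path.isfile(path):
                continue
            total['files'] += 1
            try:
                with open(path, 'rb') as fh:
                    head = fh.read(head_bytes)
            except OSError:
                # Unreadable file: counted as a file, but not searched.
                continue
            for line in head.splitlines():
                if SUBJECT_RE.match(line):
                    total['matches'] += 1
    return total
```

Calling scan_tree('.') returns a dict with the same three keys the quoted script prints. Reading a fixed number of bytes keeps the per-file work bounded no matter how long the lines are.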
On Wed, 09 Feb 2005 17:29:48 -0500
Peter Espen <peter@espen.net> wrote:
>
> The following Python code searched a hierarchy of 28,517
>directories and 222,766 files. For each "regular" file
>found, it opened the file and searched roughly the first
>1024 bytes' worth of lines for any line beginning with
>'Subject'.
>
> It ran in 7 minutes 13 seconds on a fast Linux system.
>
> import os, os.path
> import stat
> import re
> import sys
>
> # Compile the pattern once rather than once per file.
> myre = re.compile('^Subject')
>
> def foo(total, dirname, filelist):
>     # os.path.walk calls this once for each directory it visits.
>     total['dirs'] += 1
>
>     for mf in filelist:
>         fn = os.path.join(dirname, mf)
>         try:
>             mode = os.stat(fn)[stat.ST_MODE]
>         except OSError:
>             continue
>
>         # Skip directories, FIFOs, sockets, and so on.
>         if stat.S_ISREG(mode):
>             total['files'] += 1
>
>             mf_fd = os.open(fn, os.O_RDONLY)
>             the_file = os.fdopen(mf_fd, "r", 1024)
>             # readlines(1024) takes a size hint in bytes, not a line count.
>             the_lines = the_file.readlines(1024)
>             for line in the_lines:
>                 if myre.match(line):
>                     total['matches'] += 1
>             the_file.close()
>
> total = {'dirs': 0, 'files': 0, 'matches': 0}
>
> os.path.walk('.', foo, total)
>
> print total
>
> Output:
>
> {'dirs': 28517, 'files': 222766, 'matches': 402}
>
> real 7m13.328s
> user 0m26.100s
> sys 0m10.020s
>
>> On Tue, 08 Feb 2005 20:44:17 -0700
>> Paul Tietjens <paul.tietjens@moriarty.k12.nm.us> wrote:
>>> I have a Python script that essentially opens a large
>>>number of files (between 70,000 and 230,000 or so), reads
>>>the first 1024 bytes and looks for a string match.
>>>
>>> The goal is to search an entire partition full of
>>>Maildirs for specific emails.
>>>
>>> I want the process to happen as fast as possible. So
>>>far, it takes around 21 minutes - but there's a snag.
>>> While this script is running, every other process on the
>>>machine becomes sluggish to the point of
>>>nonresponsiveness.
>>>
>>> No amount of playing with nice and priority levels seems
>>>to help.
>>>
>>> What has helped is a small sleep() in the loop - but
>>>that quickly raises the time taken to complete the
>>>task (from 21 minutes to over an hour).
>>>
>>> In the end, I set up a goofy sort of throttling that
>>>adjusts the sleep() time according to the load average.
>>>
>>> Is there a better way to do this? I'm not much of a
>>>coder, and I know there are a couple on this list - so
>>>any tips offered, no matter how nebulous, would be great.
>>>
>>> Thanks in advance!
> _______________________________________________
> NMLUG mailing list
> NMLUG@nmlug.org
> http://www.nmlug.org/mailman/listinfo/nmlug
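
Paul's script was not posted, so the following is only a guess at what load-based throttling of the kind he describes might look like. It is a minimal sketch: the names throttle_delay and throttled, and the target_load, step, max_delay and check_every values, are all hypothetical and would need tuning on a real machine. os.getloadavg() is available on Unix only.

```python
import os
import time


def throttle_delay(load, target_load=1.0, step=0.1, max_delay=0.5):
    """Return how many seconds to sleep for a given load average.

    Zero at or below target_load, then `step` seconds per unit of
    excess load, capped at max_delay.
    """
    excess = load - target_load
    if excess <= 0:
        return 0.0
    return min(max_delay, step * excess)


def throttled(items, check_every=100, **delay_args):
    """Yield items unchanged, pausing after every check_every items
    if the 1-minute load average is above the target."""
    for count, item in enumerate(items, 1):
        yield item
        if count % check_every == 0:
            load1, _, _ = os.getloadavg()  # Unix only
            delay = throttle_delay(load1, **delay_args)
            if delay:
                time.sleep(delay)
```

A file-scanning loop could then be written as `for path in throttled(paths): ...`. Checking the load only every check_every files avoids paying for a sleep on each iteration, which is what pushed the run from 21 minutes to over an hour.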