Writing converter have a few Q's

Started by wing, July 21, 2005, 08:09:32 AM

wing

Hi, so I've started writing a converter for Discus, as mentioned here:
http://www.simplemachines.org/community/index.php?topic=41113.0

I will repeat the relevant info from that other thread:

There is no "database". It stores everything as HTML.

It creates a directory for each topic area, e.g. directory "1" = "General Talk" and directory "2" = "Specific Talk".

Inside each directory there is a file whose name matches the directory name, so "1.html" and "2.html". Each of those files contains links to every thread within that topic area.

Also within these directories are more files; each thread is an HTML file. "3.html", for instance, would be the first thread in General Talk and "4.html" the second. The numbering is completely sequential across the whole board, so "5.html" would be the 3rd thread ever created, but if it was created under "Specific Talk" it would be in the "2" directory.

This is all stored in the dicuss/data directory. The only other directories are more like search caches: they store portions of the messages in separate files for quicker searching.
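The layout described above can be walked with a few lines of PHP. This is only a minimal sketch, assuming the numbered-directory scheme is exactly as described; the function name `scanDiscusData` and the returned array shape are made up for illustration and are not part of Discus or SMF.

```php
<?php
// Hypothetical helper: map every numbered board directory to its index file
// and to the thread files it contains, following the layout described above.
function scanDiscusData(string $dataDir): array
{
    $boards = [];
    foreach (glob($dataDir . '/*', GLOB_ONLYDIR) ?: [] as $dir) {
        $boardId = basename($dir);
        if (!ctype_digit($boardId)) {
            continue; // not a numbered board directory (e.g. a search cache)
        }
        $threads = [];
        foreach (glob($dir . '/*.html') ?: [] as $file) {
            $name = basename($file, '.html');
            // "N.html" inside directory "N" is the board index, not a thread.
            if ($name !== $boardId && ctype_digit($name)) {
                $threads[(int) $name] = $file;
            }
        }
        ksort($threads); // thread numbers are sequential, so this is creation order
        $boards[(int) $boardId] = [
            'index'   => $dir . '/' . $boardId . '.html',
            'threads' => $threads,
        ];
    }
    ksort($boards);
    return $boards;
}
```

Calling `scanDiscusData()` on the data directory would give one entry per board, each with its thread files sorted by number.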

Each HTML page is fairly easy to parse: there is a <!--POST> at the start and a <!--/POST> at the end of every post. Between those markers is all the data, e.g. text, user, post #, date, etc.
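A rough sketch of pulling the posts out of one thread page, assuming the markers are written exactly as quoted above (real Discus files may carry extra attributes inside the marker, in which case the pattern would need adjusting). The function name `extractPosts` is hypothetical.

```php
<?php
// Hypothetical helper: return the raw contents of every post in one thread page.
function extractPosts(string $html): array
{
    // Non-greedy capture between the markers; the "s" modifier lets "." span newlines.
    preg_match_all('~<!--POST>(.*?)<!--/POST>~s', $html, $matches);
    return array_map('trim', $matches[1]);
}
```

Each returned string still holds the user, date and body mixed together, so a second pass would be needed to split those fields out.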


From what I gather, SMF stores its information in the MySQL table smf_messages; I believe there is also another table, smf_topics or something (I don't have the DB to look at right now).

My plan is to parse the post data out of the Discus threads and create new database entries for each section of the forum, numbered 1, 2, 3, 4, 5, etc.

From what I gather each message has 3 keys: the thread number, the section # and the topic #, or something to that effect.
Can I create these numbers sequentially? I won't have numbers to go by from the Discus conversion, so I'll just use a counter and populate them myself.
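The counter idea could look something like the sketch below. It assumes the parsed data is already in a nested array (board name, then threads, then post bodies); the keys `board`, `topic` and `msg` are placeholders, not the real SMF column names, which should be checked against the actual schema.

```php
<?php
// Hypothetical helper: hand out sequential IDs while flattening boards/threads/posts.
function assignIds(array $boards): array
{
    $boardId = 0;
    $nextTopic = 1;
    $nextMsg = 1;
    $rows = [];
    foreach ($boards as $boardName => $threads) {
        $boardId++;
        foreach ($threads as $posts) {
            $topicId = $nextTopic++; // one new topic ID per thread
            foreach ($posts as $body) {
                $rows[] = [
                    'board' => $boardId,
                    'topic' => $topicId,
                    'msg'   => $nextMsg++, // message IDs are unique across the whole forum
                    'body'  => $body,
                ];
            }
        }
    }
    return $rows;
}
```

Because the IDs are generated in one pass, nothing from the original Discus numbering is needed.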

Some other information, e.g. IP address and such, may not be retrievable, but is it required?

Are there any other tables I have to fill with data before this will work? Or does SMF just search for all posts in thread ID "1" and then stick them into the page?

Anything obvious I might be missing here?

Oh, the one thing I did notice in Discus is that each post has some # assigned to it. This number seems to be attached to the search cache I mentioned and to the user info of the poster (potentially to find out who posted what), but I don't think I really care about that. My main concern is just getting the post data out and sticking it into the same topics in the SMF database.

wing

OK, so I double-checked and yes, it is all stored as HTML.

I also figured out what the index numbers are for. As far as I can tell, all they are used for is to display on the main page the last poster in each topic; all it keeps track of is the post # and the username. There is another file called data.txt. All it contains is the number of threads: I created 2 threads and the number was 3, I added 1 more and it was 4, and when I added a post to a thread it didn't change. This fits my theory of directories and files in sequential order.

Wow, pretty ******ty system ;)

wing

Another question.

I was going to do all the database inserts with plain mysql_query calls, but looking quickly at a converter file I see I might be able to use a built-in SMF function if I include the file.

Any hints on which functions would be useful for doing this?

Grudge

I really would advise that you use the yabb_to_smf.php converter as a base. It already contains the database query functions, timeout protection etc - and makes it obvious what you need to convert. All you do is change the way it reads the data - the converter is attached to the first post in the topic called "Converters for 1.1".
I'm only a half geek really...

wing

OK, I'll try that. Like you said, getting the data will be different, but I guess the rest should be the same. I was going to store everything in arrays when getting the data. I'll look at that converter.

OK, took a quick look. Do I have to worry about convert.php, or just the yabb file?

What I'm going to do for now is concentrate on extracting all the data I need from my "database" of flat files. Once I've figured out how to do that and built it into some kind of useful data structure that can be accessed easily, I'll look at these converters again.

All the info I really want is "Username", "Post data", "Date" (not really required) and "Title of thread". Everything else is basic HTML that can be pasted right into the body field :)

Grudge

wing,

The "flatfile" converters (Basically just yabb and e-blah) don't use convert.php - they are self contained scripts. Basically, use them and edit the bits where it gets the data for your needs.

Personally, I think the way discus has stored it is going to make getting the data horrific (I've got a discus install as I was gonna write a converter a while back but didn't have the drive)

Doing the users is quite easy I think. You open the passwd.txt file and extract the data. The user data in that file is one user per line, with fields colon-separated (easy).
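Something along these lines would read that file. It is only a sketch: the thread does not say which field is which, so the code just splits each line into a plain list of fields and leaves the naming to whoever checks a real passwd.txt. The function name `parsePasswdLines` is made up.

```php
<?php
// Hypothetical helper: split "one user per line, colon-separated" into field lists.
function parsePasswdLines(array $lines): array
{
    $users = [];
    foreach ($lines as $line) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue; // ignore blank lines
        }
        // Field order is not documented here, so keep the fields positional.
        $users[] = explode(':', $line);
    }
    return $users;
}

// Possible usage, assuming the file is in the current directory:
// $users = parsePasswdLines(file('passwd.txt'));
```

Note that a plain `explode` would mis-split any field that itself contains a colon, so that is worth checking against real data.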

Everything else is harder. I *think* (having looked harder) the HTML isn't what you want to look at! In the msg_index directory there are lots of files beginning with a number, then a name, then .txt. The number represents the board.

I think #-tree.txt is the board tree. With the first line being the actual board name, and the other lines maybe listing the threads on that board?

I think #-search.txt is all the messages for that board?

I haven't looked into it much (Just five minutes now) but I'm sure this is the easiest way to get the data...

wing

Ok thanks. 

Grudge, you have explained pretty much what I had figured out about Discus, except for #-search.txt.

All that file basically holds is keywords to make searching faster. #-search.txt does not contain the posts, only portions of them. The HTML files contain the posts; there is no place else they are stored.

I looked and looked at all those text files and nothing in there contains the entire thread body. It seems to log the first 10 words or so; if you type a long post it only has the first 10 words. Only the HTML has the entire post.

I'll do some more checking. Using the HTML seems to be the easiest way anyway, and the HTML is in a directory called "messages", so that seems rather intuitive ;)

Parsing the HTML is a breeze, and I don't think the rest of the data files are a big deal.

The thing is, I don't really care to use their numbering or ID system, nor do I care to figure it out. I would rather just make up my own numbers and recursively go through the HTML pages grabbing posts.

I'll check again though.

Grudge

That seems to make sense. If you can get the boards, topics and members through the data files - then grabbing the post contents should be a breeze...

wing

Grudge, you had me a little bit worried, so I did a check by deleting one HTML file.

Sure enough, with the HTML file gone the topic no longer works. That confirms the posts live in the HTML files.

Phew!   On with the coding!

wing

Ok, I wrote the parser, and I'm on my way to changing the yabb converter.

It would have been nice if there were comments in there though :( In some places I don't know what the heck is being fetched: there's a regexp, but without the data to go with it I haven't a clue what it was trying to split out :(

Oh well, chugging along I go!

wing

Anyone know what's in the YaBB Settings.pl file that the yabb converter is trying to pull out with the regexp into "match"? It seems like it's only getting 1 value, maybe a directory or database entry or something...

Grudge

I believe this:

preg_match('~\$([^ =]+?)\s*=\s*[q]?([\^"\']?)(.+?)\\2;~', $line, $match);


Is basically looking for variable names like:
$boarddir = '/public_html/yabb';

So that foreach loop finds every variable declared in YaBB and puts it in the $yabb array.
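To see what the pattern captures, here is a small sketch that runs it over a few made-up Settings.pl-style lines. The sample lines and the function name `parseSettingsLines` are invented for illustration, and storing `$match[1] => $match[3]` is an assumption about how the converter fills its array, not a copy of its code.

```php
<?php
// Hypothetical helper: collect every "$name = value;" assignment from Perl-style lines.
function parseSettingsLines(array $lines): array
{
    $yabb = [];
    foreach ($lines as $line) {
        // Group 1 = variable name, group 2 = optional quote or q^ delimiter, group 3 = value.
        if (preg_match('~\$([^ =]+?)\s*=\s*[q]?([\^"\']?)(.+?)\\2;~', $line, $match)) {
            $yabb[$match[1]] = $match[3];
        }
    }
    return $yabb;
}

$settings = parseSettingsLines([
    "\$boarddir = '/public_html/yabb';", // single-quoted value
    "\$mbname = q^My Forum^;",           // Perl q^...^ quoting
    "\$maxmessages = 20;",               // bare value, no quotes
    "# just a comment",                  // no assignment, so it is skipped
]);
```

So the `\\2` back-reference simply makes the closing delimiter match whatever opened the value, whether that is a quote, a caret, or nothing at all.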

wing

Ok, thanks, that's what I figured, not really important for my application.
