PCPartPicker

I might need a bit of help programming something

Topic

tragiktimes101 6 months ago

So, I'm not a great programmer. To say I'm a novice would be far overstating my abilities. More like, I understand it... a bit. But where my knowledge falls off, it falls off fast. I am working on a task at work that involves taking a large text file with multiple entries, each separated by a carriage return and line feed (CRLF). Each entry contains text I need to extract. As of now we are doing it manually via copy and paste, which takes an exorbitantly long time (doing this 4-5 thousand times). I would really like to be able to parse each entry into two strings: one that gets deleted and another that gets saved to a new file, or preserved in the same file without the first part. Generally, the split point is a dash followed by a space, like this: "- ". That marks where the opening text ends and the "features" I need to extract begin. Here is an example of what I am working with:


Berkley Fusion19 hooks are targeted to everyone, from the novice to the avid angler. The Heavy Cover hook is an extremely strong hook used for flipping into the heaviest cover. The Heavy Cover hook features a stainless steel bait keeper designed to stay rigged cast after cast. Each front of every package provides soft-bait recommendations.Features:- The Heavy Cover flipping hook sizes include 6/0 to 3/0- Needle point with SlickSet Coating for easier penetration- Stainless steel wire bait keeper- Closed eyelet for line securitySpecifications:- Hook Size: 4/0- Color: Smoke Satin- Quantity: Per 4


What I need to do is extract everything between "features" and "specifications." But, it can't stop based on specifications because not all entries specifically denote that. Some may not have any specifications. Same with features. It can't use "features" to delimit it because sometimes the entry doesn't include that word. Sometimes specifications need to be used instead. But, one thing that is consistent is features always come before specifications. So, I figured it might be possible to parse out the text up until the first "- " and then stop the parsing at the next CRLF unless the word "specification" is seen at which point it will stop the parsing there.
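The approach described above could be sketched roughly like this in Python. This is only a minimal sketch, not a finished tool: the function name `extract_features` and the pattern `SPEC_LABEL` are made up for illustration, and it assumes the first "- " in an entry really is the start of the bullets and that the word "specification" never appears inside a feature bullet.

```python
import re

# Matches "Specification" or "Specifications" in any letter case.
# Using this word as the stop marker is an assumption taken from the post.
SPEC_LABEL = re.compile(r"specifications?", re.IGNORECASE)


def extract_features(entry: str) -> str:
    """Return the text from the first "- " up to a later "Specification(s)" label.

    If no such label follows the first "- ", everything to the end of the
    entry is returned. If the entry has no "- " at all, the result is "".
    """
    start = entry.find("- ")
    if start == -1:
        return ""  # no bullet marker in this entry
    bullets = entry[start:]
    # Only look for the label AFTER the first bullet, so an entry that has
    # specifications but no features keeps its specification bullets.
    stop = SPEC_LABEL.search(bullets)
    if stop:
        bullets = bullets[:stop.start()]
    return bullets.strip()


# Shortened version of the example entry from the post.
entry = (
    "Each front of every package provides soft-bait recommendations."
    "Features:- Stainless steel wire bait keeper- Closed eyelet for line security"
    "Specifications:- Hook Size: 4/0- Color: Smoke Satin"
)
print(extract_features(entry))
# prints: - Stainless steel wire bait keeper- Closed eyelet for line security
```

Because the label is only searched for after the first "- ", an entry such as "Strong hook.Specifications:- Hook Size: 4/0" keeps its specification bullets, which matches the case where specifications stand in for features.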

This is somewhat confusing to explain, so if I can't get any help I understand. But, if someone could help, it would be awesome and literally save me a month of monotonous work that could be spent doing other tasks.

Comments

vagabond139 5 Builds 2 points 6 months ago

If it lacks features and/or specifications then where do you begin and end?

tragiktimes101 submitter 1 Build 1 point 6 months ago

It will always have the "- " but sometimes it won't specifically denote the specification / features. It all depends on how the info was extracted from the source but some manufacturers don't denote specification / features. They just use the "- " to start them. This is why dealing with some manufacturers is a pain in the ***. There's not much that is universal.

As for the ending, it always uses a CRLF to go to the next entry. And, it always ends with either specifications or features. Sometimes they don't specify features and instead I use specification in the features category. That's why I would like to avoid using those words to delimit it.
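Given those two rules (every entry contains "- " and every entry ends with a CRLF), handling the whole file might look something like the sketch below. All names here (`split_entries`, `bullets_only`, `process_file`) are hypothetical, the UTF-8 encoding is a guess about the file, and `bullets_only` is just a stand-in that keeps everything from the first "- " onward.

```python
def split_entries(raw: str) -> list[str]:
    """Split the raw file text into entries on CRLF, as described in the thread."""
    entries = raw.split("\r\n")
    # A file that ends with CRLF produces one empty string at the end; drop it.
    if entries and entries[-1] == "":
        entries.pop()
    return entries


def bullets_only(entry: str) -> str:
    """Keep the text from the first "- " onward, or "" if there is none."""
    start = entry.find("- ")
    return entry[start:] if start != -1 else ""


def process_file(src: str, dst: str, extract=bullets_only) -> None:
    """Read entries from src and write one extracted line per entry to dst."""
    # newline="" keeps Python from translating the CRLF line endings.
    with open(src, encoding="utf-8", newline="") as f:  # encoding is an assumption
        raw = f.read()
    results = [extract(entry) for entry in split_entries(raw)]
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write("\r\n".join(results) + "\r\n")
```

Entries without a "- " come out as empty lines rather than being skipped, so line N of the output still corresponds to entry N of the input. A call would look like `process_file("entries.txt", "features.txt")`, where both file names are placeholders.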

I know, it's a pain.

MichelWeber 2 points 6 months ago

There are a few industry standards for data being moved around. One is CSV - Comma Separated Values, another is JSON (JavaScript Object Notation). Your file doesn't match either directly. However, the '-' looks like it could be used as the comma, but in the example I see 'soft-bait' which would break it.

Where does this file come from? Whoever produces it needs to follow a standard.

Oh wait - I see 'soft-bait' isn't part of what you want to extract.

You could possibly use a spreadsheet tool to import and use that to specify the delimiter being the - and it will separate that into columns.
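The same delimiter idea can be tried outside a spreadsheet. Below is a small sketch, assuming the two-character delimiter "- " (dash plus space) is used instead of a bare dash, so that 'soft-bait' stays in one piece. The helper name `to_columns` and the shortened sample entry are made up for illustration.

```python
import csv
import io


def to_columns(entry: str, delimiter: str = "- ") -> list[str]:
    """Split one entry into fields on the delimiter and trim whitespace."""
    return [field.strip() for field in entry.split(delimiter)]


# Shortened, made-up entry in the same shape as the example in the thread.
sample = (
    "Package provides soft-bait recommendations.Features:"
    "- Needle point- Closed eyeletSpecifications:- Hook Size: 4/0"
)

columns = to_columns(sample)        # "- " leaves 'soft-bait' intact: 4 fields
broken = to_columns(sample, "-")    # a bare dash also splits 'soft-bait': 5 fields

# Write the fields as one CSV row that a spreadsheet tool can open.
buffer = io.StringIO()
csv.writer(buffer).writerow(columns)
print(buffer.getvalue())
```

Note that the "Features:" and "Specifications:" labels end up glued to the end of the preceding field, so they would still need to be trimmed afterwards.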