Regular Expression question

Started by Krischan, October 10, 2017, 21:10:36

Krischan

Hi,

I'm working on a difficult regular expression for a given System->Star->Planet->Moon hierarchy but can't get it running. Perhaps you can give me a hint.

The input string syntax looks like this: [Systemname] [Star] [Planet] [Moon] (with a space as the divider between the four levels), so a given string could look like:

1) Plio Eurl GC-D d12-17 ABC 3 d
2) S171 9 BCD 1 i
3) HIP 17729 A
4) Poulva
and so on...

General Rules I've found out so far:
- [Systemname] can contain any characters like A-Z, a-z, 0-9 and some special characters like -*'
- [Systemname] can consist of more than one word
- [Star] can contain A-Z only, but can be A, AB, ABC, ABCD or B, BC, DEF and so on (but always in a row, for example DBC is NOT possible)
- [Planet] can be 1-99 only
- [Moon] can be [a-z] only
- the hierarchy is always: Systemname > Star > Planet > Moon
- Systems can have more than one star, Planets orbit these stars and moons orbit these planets
- [Systemname] always exists; [Star], [Planet] and [Moon] are optional and can be appended to the [Systemname], but only while keeping the hierarchy as a Star>Planet or Planet>Moon pair
- so we can have [Systemname] [Star], [Systemname] [Planet] or [Systemname] [Planet] [Moon], but not [Systemname] [Moon], because the [Moon] would then appear as a [Planet] instead
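
As a rough illustration of the three token rules above (Star, Planet, Moon), here is a minimal sketch. It is written in Python rather than BlitzMax so it can be checked quickly, and the function names `is_star`, `is_planet` and `is_moon` are made up for this example. It only checks single tokens and does not split a full body name.

```python
import re
import string

def is_star(token):
    # A star designation is a run of consecutive uppercase letters
    # (A, AB, BC, DEF ...), i.e. a non-empty substring of "ABC...XYZ".
    return bool(token) and token in string.ascii_uppercase

def is_planet(token):
    # Planets are numbered 1-99 (no leading zero assumed).
    return re.fullmatch(r"[1-9][0-9]?", token) is not None

def is_moon(token):
    # Moons are a single lowercase letter a-z.
    return len(token) == 1 and token in string.ascii_lowercase
```

With these checks, "ABC" and "BC" count as stars while "DBC" does not, which is the "always in a row" rule that a plain [A-Z]{1,3} character class cannot express.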

I hope I didn't forget anything. So what would a RegEx look like that separates these four classes from a given string? I'm using Brucey's RegEx module and am currently working on a new tool for the game Elite Dangerous, for which I need to separate these values. I think a regular expression is the only proper way to solve this.

I've added a text file with about 6,000 system name examples to pick from (all objects I've scanned this year) and an example system map. Here is a small piece of code to play with:

SuperStrict

Framework brl.basic
Import brl.retro
Import bah.RegEx

Local bodyname:String = "Plio Eurl GC-D d12-17 ABC 3"
Local pattern:String = "^(.*?)[ |]A(BC?)?"

Print "#" + Trim(RegFind(bodyname, pattern, 0)) + "#"

' Returns sub-expression p of the first match of pattern in s,
' or an empty string if there is no match or the pattern is invalid
Function RegFind:String(s:String, pattern:String, p:Int = 0)
    Local regex:TRegEx = TRegEx.Create(pattern)
    Try
        Local match:TRegExMatch = regex.Find(s)
        If match Then Return match.SubExp(p)
    Catch e:TRegExException
        Return Null
    End Try
    Return ""
End Function

Kind regards
Krischan

Windows 10 Pro | i7 9700K@ 3.6GHz | RTX 2080 8GB]
Metaverse | Blitzbasic Archive | My Github projects

Goodlookinguy

#1
Would this work?: \A([A-Za-z0-9 \-]+?)(?:\ ([A-Z]{1,3}))?(?:\ ([0-9]{0,2}))?(?:\ ([a-z]{0,1}))?\z

I used to work with regex a lot in the past.
I'm insane and not in a funny or good way! nrgs.org

Krischan

#2
WOW. I didn't expect a quick answer here (if any) and it's working! Great! I think I'll need weeks to understand this pattern; it's still rocket science to me.

SuperStrict

Framework brl.basic
Import brl.retro
Import bah.RegEx

Local bodyname:String = "Plio Eurl GC-D d12-17 ABC 3 d"
Local pattern:String = "\A([A-Za-z0-9 \-]+?)(?:\ ([A-Z]{1,3}))?(?:\ ([0-9]{0,2}))?(?:\ ([a-z]{0,1}))?\z"

' Sub-expression 0 is the whole match, 1-4 are system, star, planet and moon
For Local i:Int = 0 To 4
    Print i + ": " + Trim(RegFind(bodyname, pattern, i))
Next

' Returns sub-expression p of the first match of pattern in s,
' or an empty string if there is no match or the pattern is invalid
Function RegFind:String(s:String, pattern:String, p:Int = 0)
    Local regex:TRegEx = TRegEx.Create(pattern)
    Try
        Local match:TRegExMatch = regex.Find(s)
        If match Then Return match.SubExp(p)
    Catch e:TRegExException
        Return Null
    End Try
    Return ""
End Function


Result:
0: Plio Eurl GC-D d12-17 ABC 3 d
1: Plio Eurl GC-D d12-17
2: ABC
3: 3
4: d

Thanks a lot!

Kind regards
Krischan


Krischan

A note: it works in many cases, but there are still some system names which confuse the RegEx, for example:

NGC 3590 MV 6 B (1 should be "NGC 3590 MV 6" and 2 "B")
0: NGC 3590 MV 6 B
1: NGC 3590
2: MV
3: 6
4: B

NGC 3590 36 A (1 should be "NGC 3590 36" and 2 "A")
0: NGC 3590 36 A
1: NGC 3590
2:
3: 36
4: A

S171 33 B (1 should be "S171 33" and 2 "B")
0: S171 33 B
1: S171
2:
3: 33
4: B

* tet02 Orionis C A (0 should be the complete name, 1 should be "* tet02 Orionis C" and 2 "A")
0:
1:
2:
3:
4:

Any idea how to fix this?

I've changed the pattern to support ' * and + as some star names contain these special characters (works so far, I hope it is correct this way):
\A([A-Za-z0-9 \-\+\'\*]+?)(?:\ ([A-Z]{1,3}))?(?:\ ([0-9]{0,2}))?(?:\ ([a-z]{0,1}))?\z

I've attached a complete CSV dump of the results for checking: I split the input names with the RegEx and saved the results to a SQL table in my database.
Kind regards
Krischan


Goodlookinguy

#4
Sorry I didn't get back sooner. I actually forgot about writing this earlier right before I had to leave. I seriously only wrote that regex in like 3 minutes, so I apologize about the issues I overlooked.

This one will fail if the pattern does not match. It also supports trailing spaces at the end just in case.
\A([A-Za-z0-9\-\+\*' ]+?)(?:\ ([A-Z]{1,3})(?:\ ([0-9]{1,2})(?:\ ([a-z]{1})?)?)?)?\s*?\z

I tested it here to make sure it worked: https://regex101.com/

Edit: Slightly alternative version for any whitespace...
\A([A-Za-z0-9\-\+\*'\s]+?)(?:\s([A-Z]{1,3})(?:\s([0-9]{1,2})(?:\s([a-z]{1})?)?)?)?\s*?\z

Edit 2: I fixed a little issue. I think it follows the system...maybe. I need to re-examine the star->planet, planet->moon thing.

Edit 3: To get the star->planet, planet->moon pairing you want, you'll have to add an OR alternative. That adds capture groups 5 & 6, which stand in for groups 3 & 4 (planet and moon) when they are set. If they're not set, just ignore them. If you'd like to do that, this works...

\A([A-Za-z0-9\-\+\*'\s]+?)(?:(?:\s([A-Z]{1,3})\s([0-9]{1,2})(?:\s([a-z]{1})?)?)|(?:\s([0-9]{1,2})\s([a-z]{1}))?)?\s*?\z
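
To make the "groups 5 & 6 are really 3 & 4" idea concrete, here is a minimal sketch of how the six groups could be folded back into four values. It is in Python rather than BlitzMax so it can be tested easily, the helper name `split_body` is hypothetical, and the only change to the pattern is that `\z` is written as `\Z`, which is how Python's `re` module spells "end of string".

```python
import re

# Same pattern as the Edit 3 variant above, with \z written as Python's \Z.
PATTERN = re.compile(
    r"\A([A-Za-z0-9\-\+\*'\s]+?)"
    r"(?:(?:\s([A-Z]{1,3})\s([0-9]{1,2})(?:\s([a-z]{1})?)?)"
    r"|(?:\s([0-9]{1,2})\s([a-z]{1}))?)?\s*?\Z"
)

def split_body(name):
    """Return system/star/planet/moon, or None if the pattern does not match."""
    m = PATTERN.match(name)
    if m is None:
        return None
    system, star, planet, moon, planet_alt, moon_alt = m.groups()
    return {
        "system": system,
        "star": star,
        # Groups 5 and 6 are only set when the planet->moon branch matched,
        # so they simply stand in for groups 3 and 4.
        "planet": planet or planet_alt,
        "moon": moon or moon_alt,
    }
```

For example, `split_body("Poulva 3 d")` fills planet and moon from groups 5 and 6, while `split_body("S171 9 BCD 1 i")` fills them from groups 3 and 4.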

Hardcoal

This website is surprisingly more efficient than I thought. I just wish I could see more of the Blitz users hanging around here.
It's a bit less convenient in a way than the original Blitz website, and I still hope a dedicated website for Blitz will be reopened.


Krischan

#6
I've tested variants 2 and 3 and found that variant 2 fits better, even though variant 3 is more sophisticated. Variant 3 produces more incorrect results, but I don't know why. It looks like we can't catch all systems with a RegEx, though: I've reviewed all of them and still found some names that won't fit. An example is a system name like "S171 33", where a separate number at the end is part of the system name (without any additional stars, planets or moons). You can't predict whether 33 means the 33rd planet or is just part of the system name. So I'm currently using variant 2 to catch as many systems as possible and perform some additional logical checks on the records that are not absolutely clear.

I also saw in my results that even a moon can have a moon: rare, but it happens. And I forgot that there are asteroid belts, which can have a syntax like "IC 1396 Sector PJ-N c8-0 A Belt Cluster 1". So these are two additional rules to consider.

Besides that, I've found out that I can retrieve a second dataset from the same logfiles, in which all system hyperjumps are recorded with the system name only! As a rule of thumb you must first jump into a system before you can start scanning a body (makes sense ;-) ), so this list should be complete. And because I must FIRST jump into a system BEFORE I scan a body, I can perhaps perform this check in a single loop or in a nested loop. A log entry timeline looks like this (JSON format):

{ "timestamp":"2017-10-05T21:12:22Z", "event":"FSDJump", "StarSystem":"S171 33", "StarPos":[-2426.750,295.656,-1323.656], "SystemAllegiance":"", "SystemEconomy":"$economy_None;", "SystemEconomy_Localised":"None", "SystemGovernment":"$government_None;", "SystemGovernment_Localised":"None", "SystemSecurity":"$GAlAXY_MAP_INFO_state_anarchy;", "SystemSecurity_Localised":"Anarchy", "Population":0, "JumpDist":0.633, "FuelUsed":0.000273, "FuelLevel":31.169727 }

[...other entries...]

{ "timestamp":"2017-10-05T21:12:38Z", "event":"Scan", "BodyName":"S171 33 A", "DistanceFromArrivalLS":0.000000, "StarType":"O", "StellarMass":100.531250, "Radius":12368407552.000000, "AbsoluteMagnitude":-13.955948, "Age_MY":60, "SurfaceTemperature":103599.000000, "Luminosity":"V", "SemiMajorAxis":66382835712.000000, "Eccentricity":0.076873, "OrbitalInclination":-11.680526, "Periapsis":54.274986, "OrbitalPeriod":9795825.000000, "RotationPeriod":682069.375000, "AxialTilt":0.000000 }

[...other entries...]

{ "timestamp":"2017-10-05T21:16:03Z", "event":"Scan", "BodyName":"S171 33 B", "DistanceFromArrivalLS":1096.244019, "StarType":"H", "StellarMass":24.156250, "Radius":71260.468750, "AbsoluteMagnitude":20.000000, "Age_MY":60, "SurfaceTemperature":0.000000, "Luminosity":"VII", "SemiMajorAxis":276265926656.000000, "Eccentricity":0.076873, "OrbitalInclination":-11.680526, "Periapsis":234.274963, "OrbitalPeriod":9795825.000000, "RotationPeriod":0.076690, "AxialTilt":0.000000 }
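
A minimal sketch of reading such a timeline, assuming one JSON object per line as in the entries above. It is in Python rather than BlitzMax, and the function name `collect_names` is made up for this example. Only the `event`, `StarSystem` and `BodyName` fields shown in the log entries are used.

```python
import json

def collect_names(lines):
    """Collect system names from FSDJump events and body names from Scan events."""
    systems, bodies = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        if entry.get("event") == "FSDJump":
            systems.append(entry["StarSystem"])
        elif entry.get("event") == "Scan":
            bodies.append(entry["BodyName"])
    return systems, bodies
```

Because the entries are processed in log order, the jump into a system is always seen before the scans inside it.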


For example, take a (fictional) body name "S171 33 A 7 b". I can compare the jump system names (stored in an Array, TList or TMap, for example) with the body names using a simple RegEx or even the simpler Instr command. That should let me separate the system name part ("S171 33") from the body part ("A 7 b") very easily, by just replacing the system name plus a space within the body name with "". Then I can use another RegEx (the rear part of your expression) on just the remaining body part (like "A 7 b") to extract the data I want.
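
A minimal sketch of this two-step idea, again in Python rather than BlitzMax. The function name `split_with_known_systems` and the `BODY_PART` pattern are assumptions made for this example. The pattern is a simplified rear part that keeps the leading space and allows an optional star followed by an optional planet with an optional moon, so it is not identical to any of the variants posted above.

```python
import re

# Rear part only: optional star, then optional planet with optional moon.
# Each token keeps its leading space, so the remainder is matched as-is.
BODY_PART = re.compile(r"(?:\s([A-Z]{1,3}))?(?:\s([0-9]{1,2})(?:\s([a-z]))?)?")

def split_with_known_systems(body_name, systems):
    """Return (system, (star, planet, moon)), (system, None) or None."""
    # Try the longest names first so "S171 33" wins over "S171".
    for system in sorted(systems, key=len, reverse=True):
        if body_name == system or body_name.startswith(system + " "):
            rest = body_name[len(system):]
            m = BODY_PART.fullmatch(rest)
            if m is None:
                # Known system, but the body part has another form
                # (for example a belt cluster or a moon of a moon).
                return system, None
            return system, m.groups()
    return None  # no known system name is a prefix of this body name
```

With the known systems {"S171", "S171 33"}, the body name "S171 33 A 7 b" splits into system "S171 33" and the parts ("A", "7", "b"), so the ambiguous trailing number no longer has to be guessed.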

I think this is the best way to get exact results? And thanks again for your efforts.
Kind regards
Krischan
